\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{}

\maketitle
%
%
%
\section{Intro}

\QP is a tool for aligning spliced reads produced by ``Next
Generation'' sequencing platforms such as the \emph{Illumina Genome Analyzer}, \emph{454} or
\emph{SOLiD}. We use the term \QPP for the whole pipeline and \QP for the
alignment tool itself.

\QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for the genomic sequences.
\end{itemize}

The project has two central configuration files, \texttt{PipelineConf.py} and
\texttt{QPalmaConf.py}: the former stores the pipeline settings and the latter the
alignment-specific options.

The project results directory (\emph{result\_dir}) then contains the following subdirectories:
\begin{itemize}
\item \emph{mapping}, with subdirectories \emph{main} and \emph{spliced},
\item \emph{alignment}, with subdirectories for the different parameters and \emph{heuristic},
\item \emph{remapping}.
\end{itemize}
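
The directory layout listed above can be sketched in Python as follows. This is
only an illustration and not taken from the \QP code base: the function name
\texttt{create\_result\_dirs} is hypothetical, \emph{heuristic} is assumed to be
a subdirectory of \emph{alignment}, and the parameter-specific subdirectories
are left out because their names depend on the run.

\begin{verbatim}
import os
import tempfile

# Layout taken from the list above. The per-parameter subdirectories of
# "alignment" are omitted; placing "heuristic" below "alignment" is an
# assumption.
SUBDIRS = [
    os.path.join("mapping", "main"),
    os.path.join("mapping", "spliced"),
    os.path.join("alignment", "heuristic"),
    "remapping",
]


def create_result_dirs(result_dir):
    """Create the result directory tree and return the created paths."""
    created = []
    for sub in SUBDIRS:
        path = os.path.join(result_dir, sub)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate on a throw-away directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_result_dirs(tmp):
            print(os.path.relpath(path, tmp))
\end{verbatim}

Calling the function twice on the same directory is harmless, since existing
directories are left untouched.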

%
%
%
\section{Pipeline}

The full pipeline consists of five steps:

\begin{enumerate}
\item Find alignment seeds for all given reads using a fast suffix array method
(vmatch). This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds: convert them to the \QP format and apply some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads have been identified, use the \QPA to align
them to their seed regions.
\item Optionally, choose among several post-processing steps that refine
the quality of the alignments via filtering.
\end{enumerate}

%
%
%
\section{Usage}

A typical run consists of two steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration files.
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}

\section{Options}

First, create the datasets using the run-specific preprocessing tools. After
dataset creation, check the sets for consistency with the
\texttt{check\_dataset\_consistency.py} script.

\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checks on the configuration file(s) and initializes the required
directories.
%
%
%
\section{Training}

QPalma needs training examples. For the training you need:

\begin{itemize}
\item training examples, i.e.\ correct alignments (these can also be generated artificially),
\item splice site predictions,
\item flat files of the genomic sequences you want to align to.
\end{itemize}

A training example is defined as a $4$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple,
\item the alignment information.
\end{enumerate}

A prediction example is defined as a $3$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple.
\end{enumerate}

The sequence information tuple itself consists of:
\begin{enumerate}
\item the read id,
\item the chromosome/contig id,
\item the strand,
\item the start of the genomic region we want to align to,
\item the stop of the genomic region.
\end{enumerate}
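
The three tuples described above can be written down in Python as named
tuples. This is a sketch under assumptions: the type and field names are
hypothetical, the example values are made up, and the contents of the quality
tuple and of the alignment information are not specified here, so the latter is
left as a placeholder.

\begin{verbatim}
from collections import namedtuple

# Field order follows the enumerations above; the names are hypothetical.
SeqInfo = namedtuple(
    "SeqInfo", ["read_id", "chromosome", "strand", "start", "stop"])
PredictionExample = namedtuple(
    "PredictionExample", ["seq_info", "read", "quality"])
TrainingExample = namedtuple(
    "TrainingExample", ["seq_info", "read", "quality", "alignment"])

# Made-up values, for illustration only.
seq_info = SeqInfo(read_id=1, chromosome=1, strand="+",
                   start=1000, stop=1100)
prediction = PredictionExample(
    seq_info, "acgtacgt", (40, 40, 38, 35, 40, 39, 40, 37))

# A training example is a prediction example plus alignment information.
# The structure of that information is not described here, hence None.
training = TrainingExample(*prediction, alignment=None)

print(len(seq_info), len(prediction), len(training))
\end{verbatim}

Named tuples remain ordinary tuples, so code that unpacks them by position
continues to work.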



currentSeqInfo

%
%
%
\section{Data Standards}


\subsection{Read format and internal representation}

The internal representation has to settle two questions:
\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly do the exon/intron boundaries point to, and are they 0-based or 1-based?
\end{itemize}

We address the first question as follows:
\begin{verbatim}
----gt-------ag------
    *         *
\end{verbatim}
The score sits at the \texttt{g} of each splice site, i.e.\ at the first nucleotide of the donor \texttt{gt} and at the last nucleotide of the acceptor \texttt{ag}.
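
The convention can be checked on the example sequence with a few lines of
Python. The function name \texttt{score\_positions} is hypothetical, and the
0-based indexing is an assumption made only for this sketch, since the indexing
question is left open above.

\begin{verbatim}
def score_positions(sequence):
    """Return the indices (0-based, an assumption) of the score-bearing
    g of the first gt and of the first ag that follows it."""
    donor = sequence.index("gt")                 # g is the first letter of gt
    acceptor = sequence.index("ag", donor + 2)   # search behind the donor
    return donor, acceptor + 1                   # g is the last letter of ag


example = "----gt-------ag------"
print(score_positions(example))  # prints (4, 14)
\end{verbatim}

\texttt{str.index} raises a \texttt{ValueError} if either dinucleotide is
missing, which makes malformed input visible instead of returning a misleading
position.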


%
%
%
\section{Format Specifications}

The file containing the mapped short reads has one line per short read. Each
line has six tab-separated entries:
\begin{enumerate}
\item unique read id,
\item chromosome/contig id,
\item position of the match in the chromosome/contig,
\item strand,
\item read sequence,
\item read quality.
\end{enumerate}
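
A line in this format can be parsed as sketched below. The names
\texttt{MappedRead} and \texttt{parse\_mapped\_read} are hypothetical, and the
sample line is made up. Converting the position to an integer is an assumption.
The strand and quality entries are kept as plain strings because their encoding
is not specified here.

\begin{verbatim}
from collections import namedtuple

# Field order follows the six entries listed above; names are hypothetical.
MappedRead = namedtuple(
    "MappedRead",
    ["read_id", "chromosome", "position", "strand", "sequence", "quality"])


def parse_mapped_read(line):
    """Split one tab-separated line into a MappedRead record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(
            "expected 6 tab-separated entries, got %d" % len(fields))
    read_id, chromosome, position, strand, sequence, quality = fields
    # Assumption: the match position is an integer.
    return MappedRead(read_id, chromosome, int(position),
                      strand, sequence, quality)


# Made-up sample line, for illustration only.
sample = "read_0001\tchr1\t12345\t+\tacgtacgt\thhhhhhhh\n"
print(parse_mapped_read(sample))
\end{verbatim}

Rejecting lines with the wrong number of entries keeps truncated or shifted
records from propagating into later pipeline steps.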


\subsection{Splice Scores}

%
%
%
\end{document}