\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{}

\maketitle
%
%
%
\section{Intro}

\QP is an alignment tool for aligning spliced reads produced by ``Next
Generation'' sequencing platforms such as the \emph{Illumina Genome Analyzer},
\emph{454} or \emph{SOLiD}. We use \QP and \QPP interchangeably to refer to the
whole pipeline.

\QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for the genomic sequences.
\end{itemize}

The project has two central configuration files, \texttt{PipelineConf.py} and
\texttt{QPalmaConf.py}. The former stores the pipeline settings and the latter
the alignment-specific options.

The project results directory (\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping} with the subdirectories main and spliced,
\item \emph{alignment} with subdirectories for the different parameters and \emph{heuristic},
\item \emph{remapping}.
\end{itemize}


Via the variable \emph{result\_dir} you can specify where all of QPalma's data should reside.
This directory contains the following subdirectories:
\begin{itemize}
\item preprocessing,
\item approximation,
\item prediction,
\item postprocessing, and
\item training.
\end{itemize}
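
As an illustration, the directory layout listed above could be created with a
few lines of Python. This is only a sketch under stated assumptions: the
function name \texttt{init\_result\_dirs} is hypothetical and not part of
QPalma, and the subdirectory names are simply copied from the list above.

```python
import os
import tempfile

# Subdirectory names copied from the list above.
RESULT_SUBDIRS = ("preprocessing", "approximation", "prediction",
                  "postprocessing", "training")


def init_result_dirs(result_dir):
    """Create result_dir and its subdirectories; return the created paths.

    Hypothetical helper for illustration, not a QPalma function.
    """
    paths = []
    for name in RESULT_SUBDIRS:
        path = os.path.join(result_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths


# Usage: build the layout inside a throwaway directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    for path in init_result_dirs(os.path.join(tmp, "results")):
        print(os.path.basename(path))
```

Because \texttt{exist\_ok=True} is used, calling the helper repeatedly on the
same directory is harmless.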

%
%
%
\section{Installation}

QPalma has the following requirements:
\begin{itemize}
\item NumPy
\item In order to use QPalma on a cluster you need the pythongrid package, which
can be found under the following URL:
\item For training you need one of the following optimization toolkits:
\begin{itemize}
\item CPLEX
\item CVXOPT
\item MOSEK
\end{itemize}
\end{itemize}
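
Before running the pipeline it may be convenient to check which of these
packages are importable. The following sketch assumes the Python import names
\texttt{numpy}, \texttt{cplex}, \texttt{cvxopt} and \texttt{mosek}; these names
are assumptions and may differ on your installation. The helper functions are
hypothetical and not part of QPalma.

```python
import importlib.util

# Assumed import names of the optimization toolkits listed above.
SOLVER_MODULES = ("cplex", "cvxopt", "mosek")


def is_installed(module_name):
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


def check_requirements(required=("numpy",), solvers=SOLVER_MODULES):
    """Return (missing required modules, available solver modules)."""
    missing = [name for name in required if not is_installed(name)]
    available = [name for name in solvers if is_installed(name)]
    return missing, available


missing, available = check_requirements()
print("missing required modules:", missing)
print("available solvers:", available)
```

Training needs at least one entry in the list of available solvers; a cluster
setup would additionally need the pythongrid package, which is not checked here.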


%
%
%
\section{Pipeline}

The full pipeline consists of the following steps:

\begin{enumerate}
\item Find alignment seeds for all given reads using a fast suffix array method
(vmatch). This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds: convert them to the QPalma format and perform some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads have been identified, use the \QPA to align
them to their seed regions.
\item Choose between several post-processing steps in order to refine
the quality of the alignments via filtering.
\end{enumerate}

%
%
%
\section{Usage}

A typical run consists of the following steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration files.
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}

\section{Options}

First, create the datasets using the run-specific preprocessing tools. After
dataset creation, check the sets for consistency using the
\texttt{check\_dataset\_consistency.py} script.

\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checks on the configuration file(s) and initializes the
required directories.
%
%
%
\section{Training}

QPalma needs some training examples.
For the training you need:

\begin{itemize}
\item training examples, i.e.\ correct alignments (you can generate those artificially, see \ref{}),
\item splice site predictions,
\item flat files of the genomic sequences you want to align to.
\end{itemize}

A training example is defined as a $4$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple,
\item the alignment information.
\end{enumerate}

A prediction example is defined as a $3$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple.
\end{enumerate}

The sequence information tuple itself consists of
\begin{enumerate}
\item the read id,
\item the chromosome/contig id,
\item the strand,
\item the start of the genomic region we want to align to,
\item the stop of the genomic region.
\end{enumerate}
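
The three tuples above can be written down as Python named tuples. This is a
minimal sketch for illustration only: the type and field names are
hypothetical, the example values are made up, and the contents of the quality
tuple and of the alignment information are not specified here, so placeholders
are used for them.

```python
from collections import namedtuple

# Field order follows the enumerations above; all names are illustrative.
SeqInfo = namedtuple(
    "SeqInfo", ["read_id", "chromosome", "strand", "start", "stop"])
TrainingExample = namedtuple(
    "TrainingExample", ["seq_info", "read", "quality", "alignment"])
PredictionExample = namedtuple(
    "PredictionExample", ["seq_info", "read", "quality"])


def to_prediction_example(example):
    """Drop the alignment information from a training example."""
    return PredictionExample(example.seq_info, example.read, example.quality)


# Made-up values, only to show the shape of the data.
seq_info = SeqInfo(read_id=1, chromosome=1, strand="+", start=1000, stop=2000)
training = TrainingExample(
    seq_info=seq_info,
    read="acgt",
    quality=(40, 40, 38, 35),  # placeholder quality values
    alignment=None,            # placeholder, format not specified here
)
prediction = to_prediction_example(training)
print(len(training), len(prediction), len(seq_info))
```

Named tuples remain ordinary tuples, so code that indexes the elements by
position keeps working.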



currentSeqInfo

%
%
%
\section{Data Standards}


\subsection{Read format and internal representation}

The internal representation has to settle two questions:
\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly do the exon/intron boundaries point to (and are they 0-based or 1-based)?
\end{itemize}

We address the first question as follows:
\begin{verbatim}
----gt-------ag------
    *         *
\end{verbatim}
The score sits at the g's of the splice sites.
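
To make this convention concrete, the following sketch returns the positions
of the g in every gt and ag dinucleotide of a sequence. It is an illustration
only: the function name is hypothetical, the indices are 0-based because that
is Python's convention (the text above leaves the question of 0/1-based
coordinates open), and the function marks every candidate dinucleotide rather
than actual splice site predictions.

```python
def splice_score_positions(sequence):
    """Return 0-based indices of the 'g' in each 'gt' and in each 'ag'.

    Hypothetical helper: the first list holds the g of every 'gt'
    (donor candidates), the second the g of every 'ag' (acceptor candidates).
    """
    seq = sequence.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]
    return donors, acceptors


# The schematic sequence from the text above.
print(splice_score_positions("----gt-------ag------"))  # ([4], [14])
```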


%
%
%
\section{Format Specifications}

The format of the file containing the mapped short reads is as follows. Each
line corresponds to one short read and has six tab-separated entries,
namely:
\begin{enumerate}
\item unique read id
\item chromosome/contig id
\item position of match in chromosome/contig
\item strand
\item read sequence
\item read quality
\end{enumerate}
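
A line in this format can be parsed as sketched below. The names
\texttt{MappedRead} and \texttt{parse\_mapped\_read} are hypothetical, and the
sketch assumes that the position is an integer; all other fields are kept as
strings because their encoding is not specified here. The sample line is made
up.

```python
from collections import namedtuple

# Field order follows the enumeration above; the names are illustrative.
MappedRead = namedtuple(
    "MappedRead",
    ["read_id", "chromosome", "position", "strand", "sequence", "quality"])


def parse_mapped_read(line):
    """Parse one tab-separated line into a MappedRead."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(
            "expected 6 tab-separated entries, got %d" % len(fields))
    read_id, chromosome, position, strand, sequence, quality = fields
    # Assumption: the match position is an integer.
    return MappedRead(read_id, chromosome, int(position), strand,
                      sequence, quality)


# Made-up sample line, only to show the call.
read = parse_mapped_read("read_1\tchr1\t1234\t+\tacgtacgt\thhhhhhhh\n")
print(read.read_id, read.position, read.strand)
```

Rejecting lines with the wrong number of entries early makes malformed input
files easier to spot than failing later during alignment.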


\subsection{Splice Scores}

%
%
%
\end{document}