+ update makefiles to fetch automatically valid Python includes and libs
[qpalma.git] / doc / qpalma-manual.tex
1 \documentclass{article}
2 \usepackage{a4}
3
4 \begin{document}
5
6 \newcommand{\QP}{{\sl QPALMA }}
7 \newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
8 \newcommand{\QPH}{{\sl QPALMA approximation }}
9 \newcommand{\QPP}{{\sl QPALMA pipeline }}
10 \newcommand{\qparam}[1]{{\bf #1}}
11
12 \title{QPalma Documentation}
13 \author{Fabio De Bona}
14 \date{October 2008}
15
16 \maketitle
17 %
18 %
19 %
20 \section{Intro}
21
22 \QP is an alignment tool targeted to align spliced reads produced by ``Next
23 Generation'' sequencing platforms such as \emph{Illumina Genome Analyzer} or
24 \emph{454}. The basic idea is to use an extended Smith-Waterman algorithm for
25 local alignments that uses the base quality information of the reads directly
26 in the alignment step. Optimal alignment parameters i.e. scoring matrices are
27 inferred using a machine learning technique similar to \emph{Support Vector
28 Machines}. For further details on \QP itself consult the paper
29 \cite{DeBona08}. For details about the learning method see
30 \cite{Tsochantaridis04}.
31
32 %
33 %
34 %
35 \section{Installation}
36
37 The following installation notes assume that you are working in a
38 \emph{Linux}/*NIX environment. \emph{Windows} or \emph{Mac OS} versions do not
39 exist and are not planned. You should have a working \emph{DRMAA}
40 (http://www.drmaa.org/) installation to submit jobs to a cluster. \QP is
41 written in C++ and Python and tested with the \emph{gcc} compiler suite and
42 python2.5.
43
44 \subsection{Dependencies}
45
46 \QP was designed to be as self-contained as possible. However it still has some
47 dependencies. In order to install \QP completely you will need:
48 \begin{itemize}
49 \item $SWIG$, the simple wrapper interface generator,
50 \item[$\rightarrow$] http://www.swig.org
51 \item $numpy$, a python package for numeric computations,
52 \item[$\rightarrow$] http://numpy.scipy.org/
53 \item Pythongrid a package for cluster computation, and
54 \item Genefinding tools which offers some data formats.
55 \end{itemize}
56 The latter two packages can be obtained from the \QP website. \\ \noindent
57 For training \QP you need one of the following optimization toolkits:
58 \begin{itemize}
59 \item CVXOPT (http://abel.ee.ucla.edu/cvxopt, Free)
60 \item MOSEK (http://www.mosek.com, Commercial)
61 \item CPLEX (http://www.ilog.com/products/cplex, Commercial)
62 \end{itemize}
63
64 \subsection{Step by step installation guide}
65
66 \begin{enumerate}
67 \item Install the Pythongrid and the Genefinding tool packages
68 \item Update your PYTHONPATH variable to point to the above packages
69 \item Unpack the QPalma tarball via
70 \item[$\rightarrow$] tar -xzvf QPalma-0.9.tar.gz
71 \item Enter the QPalma-0.9 directory and type:
72 \item[$\rightarrow$] python setup.py build
73 \end{enumerate}
74 \noindent
75 In order to check your installation you can run the script
76 test\_qpalma\_installation.py which can be found in the directory tests/ (Enter
77 the directory first as the script works with relative paths). This script
78 either reports a successful installation or generates an error log file. You
79 can send this error log file to qpalma@tuebingen.mpg.de.
80 \\ \noindent
81 In order to make a full test run with sample data you can run the script
82 test\_complete\_pipeline.py. this scrips executes the full pipeline with data
83 from a small artificial data set.
84
85 %
86 %
87 %
88 \section{Working with \QP}
89
90 We assume now that you have a successful \QP installation. When working with \QP
91 you usually need two commands:
92
93 \begin{itemize}
94 \item python qpalma\_pipeline.py train example.conf
95 \item python qpalma\_pipeline.py predict example.conf
96 \end{itemize}
97 \noindent
98 \QP has two modes \emph{predict} and \emph{train}. All settings are supplied in
99 a configuration file here example.conf. \\ \noindent
100
101 Basically \QP assumes that you have the following data:
102 \begin{itemize}
103 \item The reads you want to align together with their seed positions,
104 \item genomic sequences,
105 \item splice site scores predicted for the genomic sequences.
106 \end{itemize}
107
108 A typical run consists of
109 \begin{enumerate}
110 \item Find the seeds for your reads using another method (NOT VIA \QP). For example vmatch.
111 \item Set your parameters (read size, number of mismatches, etc.) in the
112 configuration files.
113 \item Train a parameter vector once for your organism.
114 \item Predict read alignments using the trained parameters.
115 \end{enumerate}
116 \noindent
117 Note: To obtain splice site scores we use the predictions of another tool of
118 our group \cite{SonnenburgSchweikertPhilipsBehrRaetsch07}. However this tool is
119 not available yet but you can use any splice site prediction tool you prefer as
120 long as you stick to the data format described below.
121 \\ \\ \noindent
122 \QP uses one central configuration file. The parameters you can set in this file are
123 described below.
124
125 \subsection{The configuration file}
126
127 In order to run \QP successfully you will need to create a configuration file.
128 All valid parameters for \QP are summarized below. You can also look at the
129 example\_configuration.conf file which is included in the distribution.
130 \\ \noindent
131 The parameters do not have to be in a particular order. In order to simplify
132 explanations we have grouped them according to their particular semantics.
133 \\ \noindent
134 Via the variable ``result\_dir'' you can specify where all of QPalma's data
135 should reside. During initialization of \QP the following sub directories of
136 result\_dir will be automatically created:
137 \begin{itemize}
138 \item preprocessing
139 \item approximation
140 \item dataset
141 \item prediction
142 \item alignment
143 \item postprocessing, and
144 \item training.
145 \end{itemize}
146 \noindent
147 Each of these directories will hold all data created in a particular step of the
148 \QP pipeline. The final alignments can be found in the alignment directory.
149 \\ \noindent
150 The first group consists of general settings for \QP:
151 \begin{itemize}
152 \item \qparam{perform\_checks}- Enables some checks and debugging output.
153 \item \qparam{platform} - IGA or $454$ for Illumina Genome Analyzer or Roche's $454$.
154 \item \qparam{result\_dir} - This is the toplevel result directory.
155 \item \qparam{spliced\_reads\_fn} - A file consisting of putative spliced reads (obtained for example via vmatch)
156 \item \qparam{unspliced\_reads\_fn} - A file consisting of putative unspliced reads (obtained for example via vmatch)
157 \item \qparam{num\_splits} - number of computing nodes
158 \item \qparam{prediction\_param\_fn} - For prediction needed. This is the location of the trained parameter file.
159 \end{itemize}
160 \noindent
161 The second group includes parameters needed for accessing the sequence data and
162 the splice site predictions:
163 \begin{itemize}
164 \item \qparam{genome\_dir} - The location of the genomic sequence files.
165 \item \qparam{acceptor\_scores\_loc} - The location of the acceptor scores files.
166 \item \qparam{donor\_scores\_loc} - The location of the acceptor scores files.
167 \item \qparam{genome\_file\_fmt} - A format string describing how valid file names are generated. See description below.
168 \item \qparam{splice\_score\_file\_fmt} - A format string describing how valid file names for the splice site scores are generated.
169 \item \qparam{allowed\_fragments} - A list of the form ``[1,2,4]'' describing the valid file name numbers. See description below.
170 \item \qparam{half\_window\_size} - Given an alignment seed position for a
171 given read we cut out the area [1500-seed\_pos,seed\_pos+1500].
172 \item \qparam{output\_format} - The output format can be either blat-like, ShoRe or
173 mGene.
174 \item \qparam{prb\_offset} - We expect the quality values to be saved as ascii
175 strings so the quality is calulated via ord(qchar)-prb\_offset ($ord(\cdot)$
176 returns the numeric ascii value). For example an 'h' would correspond to
177 quality $40$.
178 \end{itemize}
179 \noindent
180 The genomic sequence and the splice site scores file are accessed using the
181 format strings given above. This works as follows: Suppose we have two
182 chromosomes 1 \& 4 we want to align to.
183 The sequences are in flat files:
184 \begin{center}
185 /home/fabio/genomic\_sequence/chromosome\_1.flat \\ \noindent
186 /home/fabio/genomic\_sequence/chromosome\_4.flat
187 \end{center}
188 \noindent
189 Then we set genome\_dir to \emph{/home/fabio/genomic\_sequence},
190 genome\_file\_fmt to ``chromosome\_\%d.flat'' and allowed\_fragments to [1,4].
191 \\ \noindent
192 Parameters needed for training are:
193 \begin{itemize}
194 \item C - this is a parameter trading off between loss term and regularization (similar to C in SVM training).
195 \item enable\_quality\_scores - You can enable or disable the use of quality
196 scores via settings this parameter to True or False
197 \item enable\_splice\_scores - You can enable or disable the use of quality
198 scores via settings this parameter to True or False
199 \item enable\_intron\_length - You can enable or disable the use of quality
200 scores via settings this parameter to True or False
201 \item optimizer - Either one of CVXOPT, MOSEK or CPLEX.
202 \item numConstraintsPerRound - How many constraints per optimizer call should be generated (if unsure set to $50$).
203 \end{itemize}
204 \noindent
205 For training and prediction purposes you can set the number of support points
206 for the piecewise linear functions of the scoring matrix. This effectively
207 changes the total number of parameters of the model. That means that you can
208 only predict with a parameter vector which was trained with the same settings.
209 (We refer to piecewise linear functions as \emph{plifs}.)
210 \begin{itemize}
211 \item numAccSuppPoints - The number of support points for the plif scoring the acceptor scores.
212 \item numDonSuppPoints - The number of support points for the plif scoring the acceptor scores.
213 \item numLengthSuppPoints - The number of support points for the plif scoring the acceptor scores.
214 \item min\_intron\_len,max\_intron\_len - Set this to the range of minimum and maximum intron lengths.
215 \item min\_qual and max\_qual - Set this to the range of possible quality values. (Illumina GA usually [-5,40]).
216 \item min\_svm\_score and max\_svm\_score - As we work with normalized (in the
217 range [0.0,1.0]) splice site scores this is usually $0.0$ resp. $1.0$.
218 \end{itemize}
219 \noindent
220 The following last group of parameters includes ``low-level'' parameters. Just
221 keep the default values of these parameters. Possible extensions of \QP may use
222 them differently.
223 \begin{itemize}
224 \item remove\_duplicate\_scores
225 \item print\_matrix
226 \item matchmatrixCols, matchmatrixRows
227 \item numQualPlifs
228 \item numQualSuppPoints
229 \item totalQualSuppPoints
230 \item anzpath
231 \item iter\_steps
232 \end{itemize}
233
234 %
235 %
236 %
237 \section{File Formats / Specifications}
238
239 This section introduces all formats and conventions that are assumed to be met
240 by the users in order to make \QP work.
241
242 \subsection{Format of the configuration file}
243
244 The configuration file includes all settings \QP needs to perform an analysis.
245 This includes paths to files where the raw data exists as well as settings which
246 sequencing platform is being used,the number of cluster nodes to employ etc. .
247
248 Its values are in the form
249 \begin{center}
250 key = value
251 \end{center}
252 and ``\#'' for lines containing comments.
253
254 \subsection{Read format and internal representation}
255
256 The read input files for \QP contain the read sequences with their quality as
257 well as some information from the first mapping / seed region finding. The
258 format of the file containing the mapped short reads is as follows. Each line
259 corresponds to one short read. Each line has six tab-separated entries, namely:
260
261 \begin{enumerate}
262 \item unique read id
263 \item chromosome/contig id
264 \item position of match in chromosome/contig (0-based, relative to positive strand)
265 \item strand [D/P or +/-]
266 \item read sequence (in strand specific direction)
267 \item read quality (in strand specific direction)
268 \end{enumerate}
269
270 Strand specific direction means that \QP assumes that the reads are already in
271 their true orientation and the qualities as well.
272
273
274 \subsection{Training file format}
275
276 The read input files for \QP contain the read sequences with their quality as
277 well as some information from the first mapping / seed region finding. The
278 format of the file containing the mapped short reads is as follows. Each line
279 corresponds to one short read. Each line has six tab-separated entries, namely:
280
281 \begin{enumerate}
282 \item unique read id
283 \item chromosome/contig id
284 \item strand [D/P or +/-]
285 \item beginning of sequence fragment
286 \item end of sequence fragment
287 \item read sequence (in strand specific direction) with alignment information (see below)
288 \item read quality (in strand specific direction)
289 \item beginning of $1^{st}$ exon
290 \item end of $1^{st}$ exon
291 \item beginning of $2^{nd}$ exon
292 \item end of $2^{nd}$ exon
293 \end{enumerate}
294
295 Strand specific direction means that \QP assumes that the reads are already in
296 their true orientation and the qualities as well.
297 \\ \noindent
298 Alignment information means that an alignment of a read to a genomic sequence A
299 mismatch is encoded as $[AG]$ if $A$ is on the sequence and $G$ on the read
300 side. A gap is denoted by $[-X]$ resp. $[X-]$ denotinge a gap on the sequence
301 resp. read side with $X \in {A,C,G,T,N}$.
302
303
304 \subsection{Splice Scores}
305
306 As mentioned before the splice site scores where generated using a tool
307 described in \cite{SonnenburgSchweikertPhilipsBehrRaetsch07}. If you would like
308 to use your own splice site predictions you can create files according to the
309 format \QP uses. This splice site scores format is described below:
310 \\ \noindent
311 For each canonical acceptor ($AG$) and donor site ($GT$/$GC$) \QP expects a
312 score. The data is organized in files for each signal (acc/don) for each strand
313 (+/-). The information on positions of canonical splice sites and the
314 corresponding scores lies in separate files. Every chromosome or contig leads
315 then to $8$ files (acc/don, +/- and pos/score).
316 The position and score files are raw binary files containing the orderd positions
317 and the scores. The positions are stored as unsigned values and the scores as
318 floats. Note that you have to be careful when working in an inhomogeneous
319 cluster environment (endianness, data type size ). The positions are 1-based
320 and the assignment of positions and their scores is as follows: The acceptor
321 score positions are the positions right after the $AG$ and the donor score
322 positions are the positions right on the $G$ of the $GT$. For example:
323 \begin{center}
324 \begin{tabular}{ccccccccccc}
325 ... & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & ... \\
326 ... & w & g & t & x & y & z & a & g & v & ... \\
327 ... & & 0.2& & & & & & & 0.3 & ...
328 \end{tabular}
329 \end{center}
330 We supply a script ``spliceScoresConverter.py'' for conversion of ascii to
331 binary files. You can use this script as a template to make your own scoring
332 information files.
333
334
335 \subsection{Alignment result file}
336
337 The result file for the blat-like output format consists of lines with the following columns:
338
339 \begin{itemize}
340 \item id - the unique read id of the input file
341 \item chromosome/contig number
342 \item strand
343 \item start position of alignment in target
344 \item qStart
345 \item qEnd
346 \item tStart
347 \item tEnd
348 \item number of exons
349 \item qExonSizes
350 \item qStarts
351 \item qEnds
352 \item tExonSizes
353 \item tStarts
354 \item tEnds
355 \item number of matches in alignment
356 \item number of gaps in alignment
357 \end{itemize}
358
359 \section{Remarks}
360
361 The \QP project is licensed under the GPL. \\ \noindent
362 The official \QP project email address is:
363 \begin{center}
364 qpalma@tuebingen.mpg.de
365 \end{center}
366
367 %
368 % Bibliography
369 %
370
371 \begin{thebibliography}{1}
372
373 \bibitem[1]{DeBona08}
374 De~Bona~F.~and~Ossowski~S.~and~Schneeberger~K.~and~G.~R{\"a}tsch
375 \newblock Optimal Spliced Alignment of Short Sequence Reads
376 \newblock {\em ECCB 2008}
377
378 \bibitem[2]{Tsochantaridis04}
379 Ioannis~Tsochantaridis~and~Thomas~Hofmann~and~Thorsten~Joachims~and~Yasemin~Altun
380 \newblock Support Vector Machine Learning for Interdependent and Sturcutured Output Spaces
381 \newblock {\em Proceedings of the 16th International Conference on Machine Learning}, 2004
382
383 \bibitem[3]{SonnenburgSchweikertPhilipsBehrRaetsch07}
384 S{\"o}ren~Sonnenburg~and~Gabriele~Schweikert~and~Petra~Philips~and~Jonas~Behr~and~Gunnar~R{\"a}tsch
385 \newblock Accurate splice site prediction using support vector machines
386 \newblock {\em BMC Bioinformatics 2007, 8(Suppl 10):S7}, December 2007
387
388 \end{thebibliography}
389 %
390 %
391 %
392 \end{document}