89c6e32cbbef35d1ab3be4cf5f0100d9652daabd
[qpalma.git] / doc / qpalma-manual.tex
1 \documentclass{article}
2 \usepackage{a4}
3
4 \begin{document}
5
6 \newcommand{\QP}{{\sl QPALMA }}
7 \newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
8 \newcommand{\QPH}{{\sl QPALMA approximation }}
9 \newcommand{\QPP}{{\sl QPALMA pipeline }}
10 \newcommand{\qparam}[1]{{\bf #1}}
11
12 \title{QPalma Documentation}
13 \author{Fabio De Bona}
14 \date{October 2008}
15
16 \maketitle
17 %
18 %
19 %
20 \section{Intro}
21
22 \QP is an alignment tool targeted to align spliced reads produced by ``Next
23 Generation'' sequencing platforms such as \emph{Illumina Genome Analyzer} or
24 \emph{454}. The basic idea is to use an extended Smith-Waterman algorithm for
25 local alignments that uses the base quality information of the reads directly
26 in the alignment step. Optimal alignment parameters i.e. scoring matrices are
27 inferred using a machine learning technique similar to \emph{Support Vector
28 Machines}. For further details on \QP itself consult the paper
29 \cite{DeBona08}. For details about the learning method see
30 \cite{Tsochantaridis04}.
31
32 %
33 %
34 %
35 \section{Installation}
36
37 The following installation notes assume that you are working in a
38 \emph{Linux}/*NIX environment. \emph{Windows} or \emph{Mac OS} versions do not
39 exist and are not planned. You should have a working \emph{DRMAA}
40 (http://www.drmaa.org/) installation to submit jobs to a cluster. \QP is
41 written in C++ and Python and tested with the \emph{gcc} compiler suite and
42 python2.5.
43
44 \subsection{Dependencies}
45
46 \QP was designed to be as self-contained as possible. However it still has some
47 dependencies. In order to install \QP completely you will need:
48 \begin{itemize}
49 \item $SWIG$, the simple wrapper interface generator,
50 \item[$\rightarrow$] http://www.swig.org
51 \item $numpy$, a python package for numeric computations,
52 \item[$\rightarrow$] http://numpy.scipy.org/
53 \item Pythongrid a package for cluster computation, and
54 \item Genefinding tools which offers some data formats.
55 \end{itemize}
56 The latter two packages can be obtained from the \QP website. \\ \noindent
57 For training \QP you need one of the following optimization toolkits:
58 \begin{itemize}
59 \item CVXOPT (http://abel.ee.ucla.edu/cvxopt, Free)
60 \item MOSEK (http://www.mosek.com, Commercial)
61 \item CPLEX (http://www.ilog.com/products/cplex, Commercial)
62 \end{itemize}
63
64 \subsection{Step by step installation guide}
65
66 \begin{enumerate}
67 \item Install the Pythongrid and the Genefinding tool packages
68 \item Update your PYTHONPATH variable to point to the above packages
69 \item Unpack the QPalma tarball via
70 \item[$\rightarrow$] tar -xzvf QPalma-1.0.tar.gz
71 \item Enter the QPalma-1.0 directory and type:
72 \item[$\rightarrow$] python setup.py build
73 \end{enumerate}
74 \noindent
75 In order to check your installation you can run the script
76 test\_qpalma\_installation.py which can be found in the directory tests/ (Enter
77 the directory first as the script works with relative paths). This script
78 either reports a successful installation or generates an error log file. You
79 can send this error log file to qpalma@tuebingen.mpg.de.
80 \\ \noindent
81 In order to make a full test run with sample data you can run the script
82 test\_complete\_pipeline.py. this scrips executes the full pipeline with data
83 from a small artificial data set.
84
85 %
86 %
87 %
88 \section{Working with \QP}
89
90 We assume now that you have a successful \QP installation. When working with \QP
91 you usually need two commands:
92
93 \begin{itemize}
94 \item python qpalma\_pipeline.py train example.conf
95 \item python qpalma\_pipeline.py predict example.conf
96 \end{itemize}
97 \noindent
98 \QP has two modes \emph{predict} and \emph{train}. All settings are supplied in
99 a configuration file here example.conf. \\ \noindent
100
101 Basically \QP assumes that you have the following data:
102 \begin{itemize}
103 \item The reads you want to align together with their seed positions,
104 \item genomic sequences,
105 \item splice site scores predicted for the genomic sequences.
106 \end{itemize}
107
108 A typical run consists of
109 \begin{enumerate}
110 \item Find the seeds for your reads using another method (NOT VIA \QP). For example vmatch.
111 \item Set your parameters (read size, number of mismatches, etc.) in the
112 configuration files.
113 \item Train a parameter vector once for your organism.
114 \item Predict read alignments using the trained parameters.
115 \end{enumerate}
116 \noindent
117 Note: To obtain splice site scores we use the predictions of another tool of
118 our group \cite{SonnenburgSchweikertPhilipsBehrRaetsch07}. However this tool is
119 not available yet but you can use any splice site prediction tool you prefer as
120 long as you stick to the data format described below.
121 \\ \\ \noindent
122 \QP uses one central configuration file. The parameters you can set in this file are
123 described below.
124
125 \subsection{The configuration file}
126
127 In order to run \QP successfully you will need to create a configuration file.
128 All valid parameters for \QP are summarized below. You can also look at the
129 example\_configuration.conf file which is included in the distribution.
130 \\ \noindent
131 The parameters do not have to be in a particular order. In order to simplify
132 explanations we have grouped them according to their particular semantics.
133 \\ \noindent
134 Via the variable ``result\_dir'' you can specify where all of QPalma's data
135 should reside. During initialization of \QP the following sub directories of
136 result\_dir will be automatically created:
137 \begin{itemize}
138 \item preprocessing
139 \item approximation
140 \item dataset
141 \item prediction
142 \item alignment
143 \item postprocessing, and
144 \item training.
145 \end{itemize}
146 \noindent
147 Each of these directories will hold all data created in a particular step of the
148 \QP pipeline. The final alignments can be found in the alignment directory.
149 \\ \noindent
150 The first group consists of general settings for \QP:
151 \begin{itemize}
152 \item \qparam{perform\_checks}- Enables some checks and debugging output.
153 \item \qparam{platform} - IGA or $454$ for Illumina Genome Analyzer or Roche's $454$.
154 \item \qparam{result\_dir} - This is the toplevel result directory.
155 \item \qparam{spliced\_reads\_fn} - A file consisting of putative spliced reads (obtained for example via vmatch)
156 \item \qparam{unspliced\_reads\_fn} - A file consisting of putative unspliced reads (obtained for example via vmatch)
157 \item \qparam{num\_splits} - number of computing nodes
158 \item \qparam{prediction\_param\_fn} - For prediction needed. This is the location of the trained parameter file.
159 \end{itemize}
160 \noindent
161 The second group includes parameters needed for accessing the sequence data and
162 the splice site predictions:
163 \begin{itemize}
164 \item \qparam{genome\_dir} - The location of the genomic sequence files.
165 \item \qparam{acceptor\_scores\_loc} - The location of the acceptor scores files.
166 \item \qparam{donor\_scores\_loc} - The location of the acceptor scores files.
167 \item \qparam{genome\_file\_fmt} - A format string describing how valid file names are generated. See description below.
168 \item \qparam{splice\_score\_file\_fmt} - A format string describing how valid file names for the splice site scores are generated.
169 \item \qparam{allowed\_fragments} - A list of the form ``[1,2,4]'' describing the valid file name numbers. See description below.
170 \item \qparam{half\_window\_size} - Given an alignment seed position for a
171 given read we cut out the area [1500-seed\_pos,seed\_pos+1500].
172 \item \qparam{output\_format} - The output format can be either blat, ShoRe or
173 mGene.
174 \item \qparam{prb\_offset} - We expect the quality values to be saved as ascii
175 strings so the quality is calulated via ord(qchar)-prb\_offset ($ord(\cdot)$
176 returns the numeric ascii value). For example an 'h' would correspond to
177 quality $40$.
178 \end{itemize}
179 \noindent
180 The genomic sequence and the splice site scores file are accessed using the
181 format strings given above. This works as follows: Suppose we have two
182 chromosomes 1 \& 4 we want to align to.
183 The sequences are in flat files:
184 \begin{center}
185 /home/fabio/genomic\_sequence/chromosome\_1.flat \\ \noindent
186 /home/fabio/genomic\_sequence/chromosome\_4.flat
187 \end{center}
188 \noindent
189 Then we set genome\_dir to \emph{/home/fabio/genomic\_sequence},
190 genome\_file\_fmt to ``chromosome\_\%d.flat'' and allowed\_fragments to [1,4].
191 \\ \noindent
192 Parameters needed for training are:
193 \begin{itemize}
194 \item C - this is a parameter trading off between loss term and regularization (similar to C in SVM training).
195 \item enable\_quality\_scores - You can enable or disable the use of quality
196 scores via settings this parameter to True or False
197 \item enable\_splice\_scores - You can enable or disable the use of quality
198 scores via settings this parameter to True or False
199 \item enable\_intron\_length - You can enable or disable the use of quality
200 scores via settings this parameter to True or False
201 \item optimizer - Either one of CVXOPT, MOSEK or CPLEX.
202 \item numConstraintsPerRound - How many constraints per optimizer call should be generated (if unsure set to $50$).
203 \end{itemize}
204 \noindent
205 For training and prediction purposes you can set the number of support points
206 for the piecewise linear functions of the scoring matrix. This effectively
207 changes the total number of parameters of the model. That means that you can
208 only predict with a parameter vector which was trained with the same settings.
209 (We refer to piecewise linear functions as \emph{plifs}.)
210 \begin{itemize}
211 \item numAccSuppPoints - The number of support points for the plif scoring the acceptor scores.
212 \item numDonSuppPoints - The number of support points for the plif scoring the acceptor scores.
213 \item numLengthSuppPoints - The number of support points for the plif scoring the acceptor scores.
214 \item min\_intron\_len,max\_intron\_len - Set this to the range of minimum and maximum intron lengths.
215 \item min\_qual and max\_qual - Set this to the range of possible quality values. (Illumina GA usually [-5,40]).
216 \item min\_svm\_score and max\_svm\_score - As we work with normalized (in the
217 range [0.0,1.0]) splice site scores this is usually $0.0$ resp. $1.0$.
218 \end{itemize}
219 \noindent
220 The following last group of parameters includes ``low-level'' parameters. Just
221 keep the default values of these parameters. Possible extensions of \QP may use
222 them differently.
223 \begin{itemize}
224 \item remove\_duplicate\_scores
225 \item print\_matrix
226 \item matchmatrixCols, matchmatrixRows
227 \item numQualPlifs
228 \item numQualSuppPoints
229 \item totalQualSuppPoints
230 \item anzpath
231 \item iter\_steps
232 \end{itemize}
233
234 %
235 %
236 %
237 \section{File Formats / Specifications}
238
239 This section introduces all formats and conventions that are assumed to be met
240 by the users in order to make \QP work.
241
242 \subsection{Format of the configuration file}
243
244 The configuration file includes all settings \QP needs to perform an analysis.
245 This includes paths to files where the raw data exists as well as settings which
246 sequencing platform is being used,the number of cluster nodes to employ etc. .
247
248 Its values are in the form
249 \begin{center}
250 key = value
251 \end{center}
252 and ``\#'' for lines containing comments.
253
254
255 \subsection{Read format and internal representation}
256
257 The read input files for \QP contain the read sequences with their quality as
258 well as some information from the first mapping / seed region finding. The
259 format of the file containing the mapped short reads is as follows. Each line
260 corresponds to one short read. Each line has six tab-separated entries, namely:
261
262 \begin{enumerate}
263 \item unique read id
264 \item chromosome/contig id
265 \item position of match in chromosome/contig (0-based, relative to positive strand)
266 \item strand
267 \item read sequence (in strand specific direction)
268 \item read quality (in strand specific direction)
269 \end{enumerate}
270
271 Strand specific direction means that \QP assumes that the reads are already in
272 their true orientation and the qualities as well.
273
274 \subsection{Splice Scores}
275
276 As mentioned before the splice site scores where generated using a tool
277 described in \cite{SonnenburgSchweikertPhilipsBehrRaetsch07}. If you would like
278 to use your own splice site predictions you can create files according to the
279 format \QP uses. This splice site scores format is described below:
280 \\ \noindent
281 For each canonical acceptor ($AG$) and donor site ($GT$/$GC$) \QP expects a
282 score. The data is organized in files for each signal (acc/don) for each strand
283 (+/-). The information on positions of canonical splice sites and the
284 corresponding scores lies in separate files. Every chromosome or contig leads
285 then to $8$ files (acc/don, +/- and pos/score).
286 The position and score files are raw binary files containing the orderd positions
287 and the scores. The positions are stored as unsigned values and the scores as
288 floats. Note that you have to be careful when working in an inhomogeneous
289 cluster environment (endianness, data type size ). The positions are 1-based
290 and the assignment of positions and their scores is as follows: The acceptor
291 score positions are the positions right after the $AG$ and the donor score
292 positions are the positions right on the $G$ of the $GT$. For example:
293 \begin{center}
294 \begin{tabular}{ccccccccccc}
295 ... & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & ... \\
296 ... & w & g & t & x & y & z & a & g & v & ... \\
297 ... & & 0.2& & & & & & & 0.3 & ...
298 \end{tabular}
299 \end{center}
300 We supply a script ``spliceScoresConverter.py'' for conversion of ascii to
301 binary files. You can use this script as a template to make your own scoring
302 information files.
303
304 \section{Remarks}
305
306 The \QP project is licensed under the GPL. \\ \noindent
307 The official \QP project email address is:
308 \begin{center}
309 qpalma@tuebingen.mpg.de
310 \end{center}
311
312 %
313 % Bibliography
314 %
315
316 \begin{thebibliography}{1}
317
318 \bibitem[1]{DeBona08}
319 De~Bona~F.~and~Ossowski~S.~and~Schneeberger~K.~and~G.~R{\"a}tsch
320 \newblock Optimal Spliced Alignment of Short Sequence Reads
321 \newblock {\em ECCB 2008}
322
323 \bibitem[2]{Tsochantaridis04}
324 Ioannis~Tsochantaridis~and~Thomas~Hofmann~and~Thorsten~Joachims~and~Yasemin~Altun
325 \newblock Support Vector Machine Learning for Interdependent and Sturcutured Output Spaces
326 \newblock {\em Proceedings of the 16th International Conference on Machine Learning}, 2004
327
328 \bibitem[3]{SonnenburgSchweikertPhilipsBehrRaetsch07}
329 S{\"o}ren~Sonnenburg~and~Gabriele~Schweikert~and~Petra~Philips~and~Jonas~Behr~and~Gunnar~R{\"a}tsch
330 \newblock Accurate splice site prediction using support vector machines
331 \newblock {\em BMC Bioinformatics 2007, 8(Suppl 10):S7}, December 2007
332
333 \end{thebibliography}
334 %
335 %
336 %
337 \end{document}