\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{October 2008}

\maketitle
%
%
%
\section{Intro}

\QP is an alignment tool designed to align spliced reads produced by ``Next
Generation'' sequencing platforms such as the \emph{Illumina Genome Analyzer} or \emph{454}.
The basic idea is to use an extended Smith-Waterman algorithm for local
alignments that uses the base quality information of the reads directly in the
alignment step. Optimal alignment parameters, i.e.\ scoring matrices, are inferred
using a machine learning technique similar to \emph{Support Vector Machines}.
For further details on \QP itself, consult the paper \cite{DeBona08}. For
details about the learning method, see \cite{Tsochantaridis04}.

%
%
%
\section{Installation}

The following installation notes assume that you are working in a \emph{Linux}/*NIX
environment. \emph{Windows} or \emph{MacOS} versions do not exist and are not planned. You
should have a working \emph{DRMAA} (http://www.drmaa.org/) installation to
submit jobs to a cluster. \QP is written in C++ and Python and has been tested with the
\emph{gcc} compiler suite and Python 2.5.

\subsection{Dependencies}

\QP was designed to be as self-contained as possible. However, it still has some dependencies.
In order to install \QP completely you will need:
\begin{itemize}
\item \emph{SWIG}, the Simplified Wrapper and Interface Generator,
\item[$\rightarrow$] http://www.swig.org
\item \emph{numpy}, a Python package for numeric computations,
\item[$\rightarrow$] http://numpy.scipy.org/
\item Pythongrid, a package for cluster computation, and
\item Genefinding tools, which provides some data formats.
\end{itemize}
The latter two packages can be obtained from the \QP website. \\ \noindent
For training \QP you need one of the following optimization toolkits:
\begin{itemize}
\item CVXOPT (http://abel.ee.ucla.edu/cvxopt, Free)
\item MOSEK (http://www.mosek.com, Commercial)
\item CPLEX (http://www.ilog.com/products/cplex, Commercial)
\end{itemize}

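Before building, it can be useful to probe whether the Python dependencies
listed above are importable. The following is a minimal sketch and not part of
the \QP distribution. It assumes a current Python 3 interpreter (newer than the
Python 2.5 mentioned above), and the helper name missing\_modules is
hypothetical. Only the import names numpy and cvxopt are used in the usage
comment, because the import names of Pythongrid and the Genefinding tools are
not stated in this manual.

\begin{verbatim}
```python
import importlib.util


def missing_modules(names):
    """Return those top-level module names that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example use (module names are assumptions, see the text above):
#   for name in missing_modules(["numpy", "cvxopt"]):
#       print("missing:", name)
```
\end{verbatim}

An empty result only shows that the modules can be found on the current
PYTHONPATH. It does not replace the installation test script shipped with \QP.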
\subsection{Step by step installation guide}

\begin{enumerate}
\item Install the Pythongrid and the Genefinding tool packages
\item Update your PYTHONPATH variable to point to the above packages
\item Unpack the QPalma tarball via
\item[$\rightarrow$] tar -xzvf QPalma-1.0.tar.gz
\item Enter the QPalma-1.0 directory and type:
\item[$\rightarrow$] python setup.py build
\end{enumerate}
\noindent
In order to check your installation you can run the script
test\_qpalma\_installation.py, which can be found in the directory tests/. This
script either reports a successful installation or generates an error log file.
You can send this error log file to qpalma@tuebingen.mpg.de.
\\ \noindent
In order to make a full test run with sample data you can run the script
test\_complete\_pipeline.py. This script works on a small artificial dataset and performs
some checks.

%
%
%
\section{Working with \QP}

I assume now that you have a successful \QP installation. When working with \QP
you usually only deal with two commands:

\begin{itemize}
\item python qpalma\_pipeline.py train example.conf
\item python qpalma\_pipeline.py predict example.conf
\end{itemize}
\noindent
\QP has two modes, \emph{train} and \emph{predict}. All settings are supplied in
a configuration file, here example.conf.

A typical run consists of the following steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration file.
\end{enumerate}

Basically, \QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for the genomic sequences.
\end{itemize}

\QP has one central configuration file:

Suppose you have a

The project results directory (\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping} with subdirs main and spliced
\item \emph{alignment} with subdirs for the different parameters and \emph{heuristic}
\item \emph{remapping}
\end{itemize}

\subsection{The configuration file}

Via the variable ``result\_dir'' you can specify where all of QPalma's data should reside.
This directory contains the following subdirectories:
\begin{itemize}
\item preprocessing
\item approximation
\item prediction
\item postprocessing, and
\item training
\end{itemize}

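The directory layout listed in this subsection can be checked with a few lines
of Python. This is only an illustrative sketch: the function name
missing\_subdirs is hypothetical, the subdirectory names are copied from the
list in this subsection, and whether \QP creates these directories itself or
expects them to exist is not stated here.

\begin{verbatim}
```python
import os

# Subdirectory names as listed in this subsection.
SUBDIRS = ["preprocessing", "approximation", "prediction",
           "postprocessing", "training"]


def missing_subdirs(result_dir, subdirs=SUBDIRS):
    """Return the expected subdirectories that are absent under result_dir."""
    return [name for name in subdirs
            if not os.path.isdir(os.path.join(result_dir, name))]
```
\end{verbatim}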
%
%
%
\section{Pipeline}

The full pipeline consists of five steps:

\begin{enumerate}
\item Find alignment seeds using a fast suffix array method (vmatch) for
all given reads. This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds. Convert them to a qpalma format and perform some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads have been identified, use the \QPA to align
those to their seed regions.
\item One can choose between several post-processing steps in order to refine
the quality of the alignments via filtering.
\end{enumerate}

%
%
%
\section{File Formats / Specifications}

This section introduces all formats and conventions that users are expected to
follow in order to make \QP work.

\subsection{Format of the configuration file}

The configuration file includes all settings \QP needs to perform an analysis.
This includes paths to the files containing the raw data as well as settings such as which
sequencing platform is being used, the number of cluster nodes to employ, etc.

Its values are given in the form
\begin{center}
key = value
\end{center}
and ``\#'' marks lines containing comments.

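To make the ``key = value'' convention concrete, the following sketch parses
such a file into a dictionary. It is not the parser used by \QP. It assumes
that a comment line starts with ``\#'' (trailing comments on a value line are
not handled), that blank lines are ignored, and that all values are kept as
strings. The function name parse\_config and the key num\_nodes in the usage
comment are hypothetical.

\begin{verbatim}
```python
def parse_config(text):
    """Parse 'key = value' lines into a dict; skip blank and '#' lines."""
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(
                "line %d is not of the form 'key = value': %r" % (lineno, raw))
        settings[key.strip()] = value.strip()
    return settings


# Example use:
#   parse_config("# comment\nresult_dir = /tmp/run\nnum_nodes = 4\n")
```
\end{verbatim}

Keeping every value as a string leaves type conversion (for example of the
number of cluster nodes) to the caller, since this manual does not specify the
type of each setting.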
\subsection{Read format and internal representation}

The format of the file containing the mapped short reads is as follows. Each
line corresponds to one short read. Each line has six tab-separated entries,
namely:

\begin{enumerate}
\item unique read id
\item chromosome/contig id
\item position of match in chromosome/contig
\item strand
\item read sequence (in strand specific direction)
\item read quality (in strand specific direction)
\end{enumerate}

Strand specific direction means that \QP assumes that the reads and their
qualities are already in their true orientation. Internally, no
reverse complementing takes place.

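The six-column layout above can be read with a short helper such as the one
sketched here. The field names and the function name parse\_read\_line are
illustrative and not taken from the \QP sources. The sketch assumes that the
match position is an integer and leaves the strand and quality entries as
plain strings, because their encoding is not specified in this section.

\begin{verbatim}
```python
from collections import namedtuple

# Field order follows the enumerated list above; the names are assumptions.
MappedRead = namedtuple(
    "MappedRead", "read_id chrom position strand sequence quality")


def parse_read_line(line):
    """Split one tab-separated line into its six entries."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(
            "expected 6 tab-separated entries, got %d" % len(fields))
    read_id, chrom, position, strand, sequence, quality = fields
    return MappedRead(read_id, chrom, int(position), strand, sequence, quality)
```
\end{verbatim}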
\subsection{Splice Scores}

The splice site scores were generated by the software... . If you would like
to use your own splice site predictions you can create files according to the
format \QP uses. This splice site score format is described in the following. For each
canonical acceptor ($AG$) and donor site ($GT$/$GC$) \QP expects a score. For every
chromosome or contig there are four files: for each strand there is a binary
file containing the positions and a binary file containing the scores. The
positions are stored as unsigned values and the scores as floats. The
positions are 1-based and the assignment of positions and their scores is as
follows: the acceptor score positions are the positions right after the $AG$ and the
donor score positions are the positions right on the $G$ of the $GT$. For example:
\begin{center}
\begin{tabular}{ccccccccccc}
... & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & ... \\
... & w & g & t & x & y & z & a & g & v & ... \\
... & & 0.2& & & & & & & 0.3 & ...
\end{tabular}
\end{center}
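The position convention in the example above can be reproduced with the
following sketch, which lists the 1-based positions at which scores are
expected on the forward strand. The function name candidate\_sites is
hypothetical. The sketch assumes that a $GC$ donor is anchored on its $G$ in
the same way as a $GT$ donor, and it does not cover the reverse strand, for
which the convention is not spelled out here.

\begin{verbatim}
```python
def candidate_sites(sequence):
    """Return (donor_positions, acceptor_positions), both 1-based."""
    seq = sequence.upper()
    donors, acceptors = [], []
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide in ("GT", "GC"):
            # Donor: the position of the G itself.
            donors.append(i + 1)
        elif dinucleotide == "AG" and i + 2 < len(seq):
            # Acceptor: the position right after the AG.
            acceptors.append(i + 3)
    return donors, acceptors
```
\end{verbatim}

For a sequence whose positions 3 to 11 read as in the table, for instance
nnwgtxyzagv, the sketch reports a donor at position 4 and an acceptor at
position 11, which are the two columns carrying scores in the table.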
We supply a script for conversion of ASCII to binary files. You can use this
script as a template to make your own scoring information files.

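For orientation, writing and reading such binary files could look like the
following sketch. It is not the conversion script mentioned above. The text
only states that positions are unsigned values and scores are floats, so the
32-bit width and little-endian byte order used here are assumptions, as are
the function names write\_values and read\_values.

\begin{verbatim}
```python
import struct


def write_values(path, values, fmt):
    """Write values as packed binary; fmt is 'I' (unsigned) or 'f' (float)."""
    with open(path, "wb") as handle:
        handle.write(struct.pack("<%d%s" % (len(values), fmt), *values))


def read_values(path, fmt):
    """Read back a file written by write_values with the same fmt."""
    item_size = struct.calcsize("<" + fmt)
    with open(path, "rb") as handle:
        data = handle.read()
    count = len(data) // item_size
    return list(struct.unpack("<%d%s" % (count, fmt), data))


# Example use with the values from the table above (file names hypothetical):
#   write_values("chr1_plus.pos", [4, 11], "I")
#   write_values("chr1_plus.scores", [0.2, 0.3], "f")
```
\end{verbatim}

Scores read back from a 32-bit float file will differ slightly from the
original decimal values, so comparisons should use a small tolerance.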
\section{Remarks}

The \QP project is licensed under the GPL. \\ \noindent
The official \QP project email address is:
\begin{center}
qpalma@tuebingen.mpg.de
\end{center}

%
% Bibliography
%

\begin{thebibliography}{1}

\bibitem[1]{DeBona08}
F.~De~Bona, S.~Ossowski, K.~Schneeberger, and G.~R{\"a}tsch.
\newblock Optimal Spliced Alignment of Short Sequence Reads.
\newblock {\em ECCB 2008}.

\bibitem[2]{Tsochantaridis04}
I.~Tsochantaridis, T.~Hofmann, T.~Joachims, and Y.~Altun.
\newblock Support Vector Machine Learning for Interdependent and Structured Output Spaces.
\newblock {\em Proceedings of the 21st International Conference on Machine Learning}, 2004.

\end{thebibliography}
%
%
%
\end{document}