\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{}

\maketitle
%
%
%
\section{Intro}
20
21 \QP is an alignment tool targeted to align spliced reads produced by ``Next
22 Generation'' sequencing platforms such as $Illumina Genome Analyzer$ or $454$.
23 The basic idea is to use an extended Smith-Waterman algorithm for local
24 alignments that uses the base quality information of the reads directly in the
25 alignment step. Optimal alignment parameters i.e. scoring matrices are inferred
26 using a machine learning technique similar to \emph{Support Vector Machines}.
27 For further details on \QP itself consult the paper \cite{DeBona08}. For
28 details about the learning method see \cite{Tsochantaridis04}.
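
As a minimal illustration of this idea (this is \emph{not} QPalma's actual
scoring function, only a sketch), the contribution of a match or mismatch can
be weighted by the probability that the read base was called correctly, which
is derived from its Phred quality:

\begin{verbatim}
# Minimal sketch of a quality-weighted (mis)match score; QPalma's real
# scoring matrices are learned, this only illustrates the principle.
def quality_weighted_match_score(read_base, genome_base, phred_quality,
                                 match=1.0, mismatch=-1.0):
    # probability that the base caller reported the correct nucleotide
    p_correct = 1.0 - 10.0 ** (-phred_quality / 10.0)
    score = match if read_base == genome_base else mismatch
    return score * p_correct

# A mismatch at a low-quality position is penalized less severely:
print(quality_weighted_match_score('A', 'C', phred_quality=5))   # about -0.68
print(quality_weighted_match_score('A', 'C', phred_quality=40))  # about -1.00
\end{verbatim}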
%We refer to the whole pipeline as the \QP pipeline and \QP respectively.

\section{Quicktour}

Basically \QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for these genomic sequences.
\end{itemize}

\QP has one central configuration file in which all settings for a run are
specified.

Suppose you have set up a project. The project results directory
(\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping} with the subdirectories \emph{main} and \emph{spliced},
\item \emph{alignment} with subdirectories for the different parameters and \emph{heuristic},
\item \emph{remapping}.
\end{itemize}


Via the variable ``result\_dir'' you can specify where all of QPalma's data
should reside. This directory contains the following subdirectories:
\begin{itemize}
\item preprocessing
\item approximation
\item prediction
\item postprocessing, and
\item training
\end{itemize}
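
Purely as an illustration (in practice the \emph{check\_and\_init} command
described below initializes the needed directories), this layout could be
created as follows:

\begin{verbatim}
# Sketch: create the result directory layout listed above.
# The path is only an example.
import os

result_dir = '/home/user/qpalma_project'
for subdir in ('preprocessing', 'approximation', 'prediction',
               'postprocessing', 'training'):
    os.makedirs(os.path.join(result_dir, subdir), exist_ok=True)
\end{verbatim}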

%
%
%
\section{Installation}

QPalma has the following requirements:
\begin{itemize}
\item Numpy
\item In order to use QPalma on a cluster you need the pythongrid package,
which can be found at the following URL:
\item For training you need one of the following optimization toolkits:
\begin{itemize}
\item CPLEX
\item CVXOPT
\item MOSEK
\end{itemize}
\end{itemize}
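
A quick way to check that the required Python packages are importable is
sketched below (the module names are assumptions based on the package names
above; CPLEX and MOSEK also need their respective Python bindings installed):

\begin{verbatim}
# Sketch: check which of the (assumed) Python modules are importable.
import importlib

modules = ['numpy',                      # always required
           'pythongrid',                 # only for cluster usage
           'cplex', 'cvxopt', 'mosek']   # one solver is needed for training

for name in modules:
    try:
        importlib.import_module(name)
        print('found   ' + name)
    except ImportError:
        print('missing ' + name)
\end{verbatim}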


%
%
%
\section{Pipeline}

The full pipeline consists of five steps:

\begin{enumerate}
\item Find alignment seeds for all given reads using a fast suffix array
method (vmatch). This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds and convert them to the QPalma
format, performing some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads have been identified, use the \QPA to
align them to their seed regions.
\item Choose between several post-processing steps in order to refine the
quality of the alignments via filtering.
\end{enumerate}

%
%
%
\section{Usage}

A typical run consists of the following steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration files.
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}

\section{Options}

First one creates datasets using run-specific preprocessing tools. After
dataset creation one checks the sets for consistency using the
\texttt{check\_dataset\_consistency.py} script.

\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checking of the configuration file(s) and initializes the
needed directories.
%
%
%
\section{Training}

QPalma needs some training examples in order to infer the alignment
parameters. For the training you need:

\begin{itemize}
\item training examples, i.e.\ correct alignments (you can also generate these
artificially, see \ref),
\item splice site predictions,
\item flat files of the genomic sequences you want to align to.
\end{itemize}

A training example is defined as a $4$-tuple with the following elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple,
\item the alignment information.
\end{enumerate}

A prediction example is defined as a $3$-tuple with the following elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple.
\end{enumerate}

The sequence information tuple itself consists of
\begin{enumerate}
\item the read id,
\item the chromosome/contig id,
\item the strand,
\item the start of the genomic region we want to align to,
\item the stop of the genomic region.
\end{enumerate}
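
For illustration, the example tuples described above could be represented as
follows (the field names are our own; only the order and meaning of the
entries are taken from the lists above):

\begin{verbatim}
# Sketch of the training/prediction example tuples described above.
from collections import namedtuple

SeqInfo = namedtuple('SeqInfo',
    ['read_id', 'chromosome_id', 'strand', 'start', 'stop'])

TrainingExample = namedtuple('TrainingExample',
    ['seq_info', 'read', 'qualities', 'alignment'])

PredictionExample = namedtuple('PredictionExample',
    ['seq_info', 'read', 'qualities'])

# A (made-up) prediction example:
info = SeqInfo(read_id='read_0001', chromosome_id='Chr1', strand='+',
               start=10500, stop=11200)
example = PredictionExample(seq_info=info, read='ACGTACGT',
                            qualities=(40, 38, 35, 12, 30, 30, 25, 20))
\end{verbatim}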

%
%
%
\section{Format Specifications}

This section introduces all formats and conventions that have to be met by
the user in order to make \QP work.

\subsection{Format of the configuration file}

The configuration file contains all settings \QP needs to perform an
analysis. This includes the paths to the files holding the raw data as well
as settings such as which sequencing platform is being used, the number of
cluster nodes to employ, etc.

Its entries are of the form
\begin{center}
key = value
\end{center}
Lines starting with ``\#'' contain comments.
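
For illustration, a configuration file could look like the following (apart
from \texttt{result\_dir}, the key names are made up for this example):

\begin{verbatim}
# example QPalma configuration (hypothetical keys except result_dir)
result_dir = /home/user/qpalma_project
platform   = illumina
num_nodes  = 10
\end{verbatim}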


\subsection{File Formats}

The format of the file containing the mapped short reads is as follows. Each
line corresponds to one short read and has six tab-separated entries, namely:
\begin{enumerate}
\item unique read id,
\item chromosome/contig id,
\item position of the match in the chromosome/contig,
\item strand,
\item read sequence,
\item read quality.
\end{enumerate}
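
A minimal sketch of reading such a file is given below (the field order
follows the list above; the dictionary keys are our own naming):

\begin{verbatim}
# Sketch: parse the mapped-reads file described above (six tab-separated
# fields per line).
def read_mapped_reads(path):
    reads = []
    with open(path) as handle:
        for line in handle:
            read_id, chromo, position, strand, seq, quality = \
                line.rstrip('\n').split('\t')
            reads.append({'id': read_id,
                          'chromo': chromo,
                          'position': int(position),
                          'strand': strand,
                          'seq': seq,
                          'quality': quality})
    return reads
\end{verbatim}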

\subsection{Read format and internal representation}

Two questions have to be answered here:
\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly do the exon/intron boundaries point to (and are they 0- or 1-based)?
\end{itemize}

We address the first question as follows:
\begin{verbatim}
----gt-------ag------
    *         *
\end{verbatim}
The score is assumed to be stored at the position of the ``g'' of each splice
site (marked by ``*'' above).
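
As a small illustration of this convention (0-based positions, made-up score
values):

\begin{verbatim}
# The sequence from the example above; positions are 0-based.
genomic = '----gt-------ag------'

# Scores are attached to the position of the 'g' of each splice site.
donor_scores    = {4: 1.7}    # 'g' of the 'gt' donor site
acceptor_scores = {14: 0.9}   # 'g' of the 'ag' acceptor site
\end{verbatim}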

\subsection{Splice Scores}

The splice site scores were generated by ... . However, if you train \QP
yourself, you can use splice site predictions of any kind, as long as the
prediction for a site is a real value.


Dependencies so far:
\begin{itemize}
\item SWIG
\item numpy
\item pythongrid
\item Genefinding doIntervalQuery
\end{itemize}

\begin{thebibliography}{1}

\bibitem[1]{DeBona08}
F.~De~Bona, S.~Ossowski, K.~Schneeberger and G.~R{\"a}tsch.
\newblock Optimal Spliced Alignment of Short Sequence Reads.
\newblock {\em ECCB 2008}.

\bibitem[2]{Tsochantaridis04}
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims and Yasemin Altun.
\newblock Support Vector Machine Learning for Interdependent and Structured Output Spaces.
\newblock {\em Proceedings of the 21st International Conference on Machine Learning}, 2004.

\end{thebibliography}
%
%
%
\end{document}