\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{}

\maketitle
%
%
%
\section{Intro}

\QP is an alignment tool designed to align spliced reads produced by ``Next
Generation'' sequencing platforms such as the \emph{Illumina Genome Analyzer} or
\emph{454}. The basic idea is to use an extended Smith-Waterman algorithm for
local alignments that uses the base quality information of the reads directly in
the alignment step. Optimal alignment parameters, i.e.\ scoring matrices, are
inferred using a machine learning technique similar to \emph{Support Vector
Machines}. For further details on \QP itself, consult the paper
\cite{DeBona08}. For details about the learning method, see
\cite{Tsochantaridis04}.
%$SOLid$.
%We refer to the whole pipeline as the \QP pipeline and \QP respectively.

\section{Quicktour}

Basically, \QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for the genomic sequences.
\end{itemize}

\QP has one central configuration file:

Suppose you have a

The project results directory (\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping} with subdirectories \emph{main} and \emph{spliced},
\item \emph{alignment} with subdirectories for the different parameters and \emph{heuristic},
\item \emph{remapping}.
\end{itemize}


Via the variable ``result\_dir'' you can specify where all of QPalma's data should reside.
This directory contains the following subdirectories:
\begin{itemize}
\item preprocessing,
\item approximation,
\item prediction,
\item postprocessing, and
\item training.
\end{itemize}
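
As a sketch of this layout, assuming the variable is set to the hypothetical
path \texttt{/path/to/result\_dir}, the five subdirectories listed above would
be arranged as follows:
\begin{verbatim}
/path/to/result_dir/     <- hypothetical value of result_dir
    preprocessing/
    approximation/
    prediction/
    postprocessing/
    training/
\end{verbatim}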

%
%
%
\section{Installation}

QPalma has the following requirements:
\begin{itemize}
\item NumPy
\item In order to use QPalma on a cluster you need the pythongrid package, which
can be found under the following URL:
\item For training you need one of the following optimization toolkits:
\begin{itemize}
\item CPLEX
\item CVXOPT
\item MOSEK
\end{itemize}
\end{itemize}


%
%
%
\section{Pipeline}

The full pipeline consists of the following steps:

\begin{enumerate}
\item Find alignment seeds for all given reads using a fast suffix array method
(vmatch). This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds: convert them to the QPalma format and perform some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads have been identified, use the \QPA to align
them to their seed regions.
\item Choose between several post-processing steps to refine
the quality of the alignments via filtering.
\end{enumerate}

%
%
%
\section{Usage}

A typical run consists of the following steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration files.
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}

\section{Options}

First, one creates datasets using run-specific preprocessing tools. After
dataset creation, one checks the sets for consistency using the
\emph{check\_dataset\_consistency.py} script.

\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checking of the configuration file(s) and initializes the needed
directories.
%
%
%
\section{Training}

QPalma needs some training examples.
For the training you need:

\begin{itemize}
\item training examples, i.e.\ correct alignments (these can also be generated artificially),
\item splice site predictions,
\item flat files of the genomic sequences you want to align to.
\end{itemize}

A training example is defined as a $4$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple,
\item the alignment information.
\end{enumerate}

A prediction example is defined as a $3$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple,
\item the read itself,
\item the quality tuple.
\end{enumerate}

The sequence information tuple itself consists of:
\begin{enumerate}
\item the read id,
\item the chromosome/contig id,
\item the strand,
\item the start of the genomic region we want to align to,
\item the stop of the genomic region.
\end{enumerate}
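
The nesting of these tuples can be sketched as follows. The element names are
placeholders chosen for illustration; they are not identifiers taken from the
QPalma source code:
\begin{verbatim}
seq_info           = (read_id, chromosome_id, strand, start, stop)
training_example   = (seq_info, read, quality, alignment)
prediction_example = (seq_info, read, quality)
\end{verbatim}
A prediction example is thus a training example without its last element, the
alignment information.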

%
%
%
\section{Format Specifications}

This section introduces all formats and conventions that users are assumed to
follow in order to make \QP work.

\subsection{Format of the configuration file}

The configuration file includes all settings \QP needs to perform an analysis.
This includes paths to the files containing the raw data as well as settings such
as which sequencing platform is being used, the number of cluster nodes to employ, etc.

Its values are given in the form
\begin{center}
key = value
\end{center}
and ``\#'' marks lines containing comments.
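
A configuration file following these conventions might look like the sketch
below. Only the key \texttt{result\_dir} is named in this document; all other
key names and all values are hypothetical and serve only to illustrate the
syntax:
\begin{verbatim}
# Comment line.
# Only result_dir is named in this document; the other keys are hypothetical.
result_dir = /path/to/result_dir
platform = illumina
read_size = 36
number_of_mismatches = 2
cluster_nodes = 10
\end{verbatim}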


\subsection{File Formats}
The format of the file containing the mapped short reads is as follows. Each
line corresponds to one short read and has six tab-separated entries,
namely:
\begin{enumerate}
\item unique read id
\item chromosome/contig id
\item position of match in chromosome/contig
\item strand
\item read sequence
\item read quality
\end{enumerate}
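
As an illustration, a single line of such a file could look as follows, where
\texttt{<TAB>} stands for one tab character. All values are made up, and the
strand symbol and the quality encoding shown here are assumptions, since this
document does not specify them:
\begin{verbatim}
1000001<TAB>chr1<TAB>52314<TAB>+<TAB>ACGTTGCAAGCT<TAB>hhhhhhhhgggf
\end{verbatim}
In this sketch the read sequence and the quality string have the same length,
one quality character per base.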

\subsection{Read format and internal representation}

Two questions arise concerning the internal representation:
\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly are the exon/intron boundaries pointing to (and are they 0- or 1-based)?
\end{itemize}

We address the first question as follows:
\begin{verbatim}
----gt-------ag------
    *         *
\end{verbatim}
The score is assumed to be saved at the position of the g's of the splice sites.

\subsection{Splice Scores}

The splice site scores were generated by \ldots. However, if you train
yourself, you can use splice site predictions of any kind, as long as the
prediction for each site is a real value.


Dependencies so far:
\begin{itemize}
\item SWIG
\item numpy
\item pythongrid
\item Genefinding doIntervalQuery
\end{itemize}


\begin{thebibliography}{1}

\bibitem[1]{DeBona08}
F.~De~Bona, S.~Ossowski, K.~Schneeberger, and G.~R{\"a}tsch.
\newblock Optimal Spliced Alignment of Short Sequence Reads.
\newblock {\em ECCB 2008}.

\bibitem[2]{Tsochantaridis04}
I.~Tsochantaridis, T.~Hofmann, T.~Joachims, and Y.~Altun.
\newblock Support Vector Machine Learning for Interdependent and Structured Output Spaces.
\newblock {\em Proceedings of the 21st International Conference on Machine Learning}, 2004.

\end{thebibliography}
%
%
%
\end{document}