\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{}

\maketitle
%
%
%
\section{Intro}

\QP is an alignment tool for spliced reads produced by ``Next Generation''
sequencing platforms such as the \emph{Illumina Genome Analyzer}, \emph{454} or
\emph{SOLiD}. We use \QP for the alignment tool itself and \QPP for the whole
pipeline built around it.

\QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for these genomic sequences.
\end{itemize}

The project has two central configuration files, \texttt{PipelineConf.py} and
\texttt{QPalmaConf.py}: the former stores the pipeline settings, the latter the
alignment-specific options.

The project results directory (\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping} with subdirectories \emph{main} and \emph{spliced}
\item \emph{alignment} with subdirectories for the different parameters and \emph{heuristic}
\item \emph{remapping}
\end{itemize}
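
The layout above could be created with a few lines of Python. This is only a
sketch and not the code the pipeline actually uses: the helper name
\texttt{create\_result\_dirs} is hypothetical, \emph{heuristic} is assumed to
be a subdirectory of \emph{alignment}, and the parameter-specific
subdirectories are left out because their names depend on the run.
\begin{verbatim}
import os

def create_result_dirs(result_dir):
    """Create the result directory layout (hypothetical helper)."""
    subdirs = [
        os.path.join('mapping', 'main'),
        os.path.join('mapping', 'spliced'),
        # Assumption: 'heuristic' lives below 'alignment'.
        os.path.join('alignment', 'heuristic'),
        'remapping',
    ]
    created = []
    for subdir in subdirs:
        path = os.path.join(result_dir, subdir)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
\end{verbatim}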

%
%
%
\section{Installation}

QPalma has the following requirements:
\begin{itemize}
\item NumPy
\item To use QPalma on a cluster you need the pythongrid package, which
can be found under the following URL:
\item For training you need one of the following optimization toolkits:
\begin{itemize}
\item CPLEX
\item CVXOPT
\item MOSEK
\end{itemize}
\end{itemize}


%
%
%
\section{Pipeline}

The full pipeline consists of the following steps:

\begin{enumerate}
\item Find alignment seeds for all given reads using a fast suffix array
method (vmatch). This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds: convert them to a QPalma format
and run some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads are identified, use the \QPA to align
them to their seed regions.
\item Optionally, apply one of several post-processing steps that refine the
quality of the alignments via filtering.
\end{enumerate}

%
%
%
\section{Usage}

A typical run consists of the following steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration files.
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}

\section{Options}

First, create the datasets using the run-specific preprocessing tools. Then
check them for consistency using the \texttt{check\_dataset\_consistency.py}
script.

\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checks on the configuration file(s) and initializes the needed
directories.
%
%
%
\section{Training}

QPalma needs some training examples. For the training you need:

\begin{itemize}
\item training examples, i.e.\ correct alignments (these can also be generated artificially),
\item splice site predictions,
\item flat files of the genomic sequences you want to align to.
\end{itemize}

A training example is defined as a $4$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple
\item the read itself
\item the quality tuple
\item the alignment information
\end{enumerate}

A prediction example is defined as a $3$-tuple with the elements:
\begin{enumerate}
\item a sequence information tuple
\item the read itself
\item the quality tuple
\end{enumerate}

The sequence information tuple itself consists of
\begin{enumerate}
\item the read id
\item the chromosome/contig id
\item the strand
\item the start of the genomic region we want to align to
\item the stop of the genomic region
\end{enumerate}
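
The tuples described above can be illustrated in Python as follows. This is a
minimal sketch under assumptions: the field names, the helper functions and
all values are made up for illustration, the quality tuple is assumed to hold
one value per base, and the alignment information is left as a placeholder
because its layout is not specified here. Only the order of the elements is
taken from the lists above.
\begin{verbatim}
from collections import namedtuple

# Field names are illustrative assumptions; only the order follows the text.
SeqInfo = namedtuple('SeqInfo',
                     ['read_id', 'chrom', 'strand', 'start', 'stop'])

def make_prediction_example(seq_info, read, quality):
    """3-tuple: sequence information, read, quality tuple."""
    return (seq_info, read, quality)

def make_training_example(seq_info, read, quality, alignment):
    """4-tuple: a prediction example plus the alignment information."""
    return (seq_info, read, quality, alignment)

seq_info = SeqInfo(read_id=1, chrom='chr1', strand='+',
                   start=1000, stop=2000)
read = 'acgtacgt'
quality = (40, 40, 38, 35, 30, 30, 28, 25)  # made-up per-base qualities
alignment = None  # placeholder: the alignment format is not specified here

prediction = make_prediction_example(seq_info, read, quality)
training = make_training_example(seq_info, read, quality, alignment)
\end{verbatim}
A training example built this way differs from the corresponding prediction
example only by its fourth element.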



currentSeqInfo

%
%
%
\section{Data Standards}


\subsection{Read format and internal representation}

The internal representation has to settle two questions:
\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly do the exon/intron boundaries point to, and are they 0- or 1-based?
\end{itemize}

We address the first question as follows: the score sits at the \texttt{g} of
each splice site, marked by an asterisk below.
\begin{verbatim}
----gt-------ag------
    *         *
\end{verbatim}
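
The convention for the score position can be made concrete with a short Python
sketch. The function name \texttt{splice\_score\_positions} is hypothetical,
and the positions are 0-based purely as an assumption of this example; the
text above leaves the question of 0- versus 1-based coordinates open.
\begin{verbatim}
def splice_score_positions(seq):
    """Return 0-based positions carrying donor and acceptor scores.

    Donor sites are 'gt' (score on the g, the first base of the pair);
    acceptor sites are 'ag' (score on the g, the second base).
    """
    seq = seq.lower()
    pairs = range(len(seq) - 1)
    donors = [i for i in pairs if seq[i:i + 2] == 'gt']
    acceptors = [i + 1 for i in pairs if seq[i:i + 2] == 'ag']
    return donors, acceptors

donors, acceptors = splice_score_positions('----gt-------ag------')
print(donors, acceptors)  # prints [4] [14]
\end{verbatim}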


%
%
%
\section{Format Specifications}

The file containing the mapped short reads has one line per short read. Each
line has six tab-separated entries, namely:
\begin{enumerate}
\item unique read id
\item chromosome/contig id
\item position of the match in the chromosome/contig
\item strand
\item read sequence
\item read quality
\end{enumerate}
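
A line in this format could be parsed along the following lines. This is a
sketch, not the parser used by the pipeline: the names \texttt{MappedRead} and
\texttt{parse\_mapped\_read} are hypothetical, the example record is made up,
and converting the match position to an integer is an assumption.
\begin{verbatim}
from collections import namedtuple

# Field names are illustrative; the order follows the list above.
MappedRead = namedtuple(
    'MappedRead',
    ['read_id', 'chrom', 'position', 'strand', 'sequence', 'quality'])

def parse_mapped_read(line):
    """Parse one tab-separated line describing a mapped short read."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 6:
        raise ValueError(
            'expected 6 tab-separated entries, got %d' % len(fields))
    read_id, chrom, position, strand, sequence, quality = fields
    # Assumption: the position is an integer; other fields stay strings.
    return MappedRead(read_id, chrom, int(position), strand,
                      sequence, quality)

example = '1001\tchr1\t12345\t+\tacgtacgt\thhhhhhhh\n'  # made-up record
read = parse_mapped_read(example)
\end{verbatim}
Rejecting lines that do not have exactly six entries catches truncated or
mis-delimited input early.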


\subsection{Splice Scores}

%
%
%
\end{document}