\documentclass{article}
\usepackage{a4}

\begin{document}

\newcommand{\QP}{{\sl QPALMA }}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm }}
\newcommand{\QPH}{{\sl QPALMA approximation }}
\newcommand{\QPP}{{\sl QPALMA pipeline }}

\title{QPalma Documentation}
\author{Fabio De Bona}
\date{}

\maketitle
%
%
\section{Intro}
%
%
\QP is an alignment tool for spliced reads produced by ``Next
Generation'' sequencing platforms such as \emph{Illumina Genome Analyzer},
\emph{454} or \emph{SOLiD}. We refer to the alignment tool itself as \QP and to
the whole pipeline as the \QPP throughout.

\QP assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for the genomic sequences.
\end{itemize}

The project has two central configuration files, \emph{PipelineConf.py} and
\emph{QPalmaConf.py}. The former stores the pipeline settings and the latter the
alignment-specific options.

The project results directory (\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping}, with the subdirectories \emph{main} and \emph{spliced},
\item \emph{alignment}, with subdirectories for the different parameters and \emph{heuristic},
\item \emph{remapping}.
\end{itemize}
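
As an illustration, the layout above could be created with a few lines of
Python. This is only a sketch under assumptions: the helper name
\texttt{init\_result\_dirs} is hypothetical and not taken from the QPalma code
base, and the parameter-specific subdirectories of \emph{alignment} are left out
because their names are not specified here.

```python
import os
import tempfile

# Subdirectories named in the text, relative to result_dir.
# The parameter-specific subdirectories of "alignment" are omitted
# because their names are not specified in this document.
RESULT_SUBDIRS = [
    os.path.join("mapping", "main"),
    os.path.join("mapping", "spliced"),
    "alignment",
    "remapping",
]


def init_result_dirs(result_dir):
    """Create the result directory layout (hypothetical helper)."""
    created = []
    for subdir in RESULT_SUBDIRS:
        path = os.path.join(result_dir, subdir)
        # exist_ok=True makes repeated calls harmless.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Use a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in init_result_dirs(tmp):
            print(os.path.relpath(path, tmp))
```

Calling the helper a second time on the same directory is safe, which is
convenient when a run is restarted.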

%
%
\section{Pipeline}
%
%
The full pipeline usually consists of the following steps:

\begin{enumerate}
\item Find alignment seeds for all given reads using a fast suffix array
method (vmatch). This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds: convert them to the QPalma format and apply some sanity checks.
\item Use the \QPH to identify those reads that have a full seed but might be
spliced anyway.
\item Once all potentially spliced reads have been identified, use the \QPA to
align them to their seed regions.
\item Choose among several post-processing steps to refine
the quality of the alignments via filtering.
\end{enumerate}

%
%
\section{Usage}
%
%
A typical run consists of two steps:
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
configuration files.
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}

\section{Options}

First, create the datasets using run-specific preprocessing tools. Then check
them for consistency using the \emph{check\_dataset\_consistency.py} script.

\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checks on the configuration file(s) and initializes the needed
directories.

\section{Training}

QPalma needs some training examples. For the training you need:

\begin{itemize}
\item training examples, i.e.\ correct alignments (you can generate those artificially, see \ref{}),
\item splice site predictions,
\item a flat file of the genomic sequence,
\item VMatch.
\end{itemize}


\section{Data Standards}


\subsection{Read format and internal representation}

\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly do the exon/intron boundaries point to, and are they 0-based or 1-based?
\end{itemize}

We address the first question as follows:
\begin{verbatim}
----gt-------ag------
    *         *
\end{verbatim}
The score sits at the g's of the splice sites.
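
The convention can be sketched in Python. The function name
\texttt{scored\_positions} is hypothetical, and the indices are 0-based purely
for the sake of the example, since the indexing question is left open above.
For a gt dinucleotide the scored g is the first base; for an ag dinucleotide it
is the second.

```python
from typing import List, Tuple


def scored_positions(sequence: str) -> Tuple[List[int], List[int]]:
    """Return 0-based indices of the g in every gt and every ag.

    Hypothetical helper: 0-based indexing is an assumption of this sketch.
    """
    seq = sequence.lower()
    # In "gt" the g is the first base of the dinucleotide.
    gt_sites = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    # In "ag" the g is the second base, hence the offset of one.
    ag_sites = [i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]
    return gt_sites, ag_sites


gt_sites, ag_sites = scored_positions("----gt-------ag------")
print(gt_sites, ag_sites)  # prints: [4] [14]
```

The two indices correspond to the columns marked with an asterisk in the
diagram above.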


\section{Format Specifications}

The file containing the mapped short reads holds one short read per line. Each
line has six tab-separated entries, namely:
\begin{enumerate}
\item unique read id
\item chromosome/contig id
\item position of match in chromosome/contig
\item strand
\item read sequence
\item read quality
\end{enumerate}
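
A line in this format could be parsed along the following lines. This is a
minimal sketch, not the parser used by QPalma: it assumes that the position is
an integer and keeps the strand as a plain string, and the names
\texttt{MappedRead} and \texttt{parse\_mapped\_read} as well as the example
line are made up for illustration.

```python
from typing import NamedTuple


class MappedRead(NamedTuple):
    """One line of the mapped short read file (field names are hypothetical)."""
    read_id: str
    chromosome: str
    position: int
    strand: str
    sequence: str
    quality: str


def parse_mapped_read(line: str) -> MappedRead:
    """Split one tab-separated line into its six entries."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated entries, got %d" % len(fields))
    read_id, chromosome, position, strand, sequence, quality = fields
    # Assumption: the match position is stored as a decimal integer.
    return MappedRead(read_id, chromosome, int(position), strand, sequence, quality)


# Made-up example line; the strand symbol "+" is an assumption.
example = "read_1\tchr1\t10042\t+\tacgtacgt\thhhhhhhh\n"
read = parse_mapped_read(example)
print(read.chromosome, read.position, read.strand)
```

Rejecting lines with a wrong number of entries early keeps malformed input
from propagating into later pipeline steps.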


\subsection{Splice Scores}

%
%
%
\end{document}