\documentclass{article}
\newcommand{\QP}{{\sl QPALMA}}
\newcommand{\QPA}{{\sl QPALMA alignment algorithm}}
\newcommand{\QPH}{{\sl QPALMA approximation}}
\newcommand{\QPP}{{\sl QPALMA pipeline}}
\title{QPalma Documentation}
\author{Fabio De Bona}
\QP{} is an alignment tool for aligning spliced reads produced by ``Next
Generation'' sequencing platforms such as the \emph{Illumina Genome Analyzer} or
\emph{454}.
The basic idea is to use an extended Smith-Waterman algorithm for local
alignments that uses the base quality information of the reads directly during
the alignment step. Optimal alignment parameters, i.e.\ scoring matrices, are
inferred using a machine learning technique similar to \emph{Support Vector
Machines}. For further details on \QP{} itself, consult the paper~\cite{DeBona08}.
For details about the learning method, see~\cite{Tsochantaridis04}.
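
To make the idea of a quality-aware local alignment more concrete, the following is a minimal sketch in Python. It is not the \QP{} implementation: the scoring function, its constants and the linear gap penalty are hypothetical choices made for illustration only, whereas \QP{} learns its scoring parameters and additionally handles introns. The sketch only shows how a Smith-Waterman recursion can consume per-base quality values.

```python
def quality_aware_score(read_base, genome_base, phred, match=2.0, mismatch=-3.0):
    """Hypothetical scoring rule (an assumption, not QPalma's learned scoring).

    The match reward and mismatch penalty are scaled by the probability that
    the base call is correct, derived from its Phred quality value.
    """
    p_correct = 1.0 - 10 ** (-phred / 10.0)
    if read_base == genome_base:
        return match * p_correct
    return mismatch * p_correct


def smith_waterman(read, quals, genome, gap=-4.0):
    """Return (best_score, (i, j)) of the best local alignment.

    i and j are the 1-based end positions of the alignment in the read and
    in the genome. A linear gap penalty is assumed for simplicity.
    """
    rows, cols = len(read) + 1, len(genome) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + quality_aware_score(
                read[i - 1], genome[j - 1], quals[i - 1]
            )
            # Local alignment: scores are never allowed to drop below zero.
            H[i][j] = max(0.0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


if __name__ == "__main__":
    score, end = smith_waterman("ACGT", [40, 40, 40, 40], "TTACGTTT")
    print(score, end)
```

With this scoring rule a mismatch at a low-quality position is penalized less than one at a high-quality position, which is the intuition behind using base qualities during the alignment step.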
%We refer to the whole pipeline as the \QP pipeline and \QP respectively.
\QP{} assumes that you have the following data:
\begin{itemize}
\item the reads you want to align,
\item partial or full genomic sequences of an organism,
\item splice site scores predicted for the genomic sequences.
\end{itemize}
\QP{} has one central configuration file:

The project results directory (\emph{result\_dir}) then contains the subdirectories
\begin{itemize}
\item \emph{mapping} with subdirs main and spliced
\item \emph{alignment} with subdirs for the different parameters and \emph{heuristic}
\item \emph{remapping}
\end{itemize}
Via the variable ``result\_dir'' you can specify where all of QPalma's data should reside.
This directory contains the following subdirectories:
\begin{itemize}
\item postprocessing, and
\end{itemize}
\section{Installation}

QPalma has the following requirements:
\begin{itemize}
\item In order to use QPalma on a cluster you need the pythongrid package, which
can be found under the following URL:
\item For training you need one of the following optimization toolkits:
\end{itemize}
The full pipeline consists of $n$ steps:
\begin{enumerate}
\item Find alignment seeds using a fast suffix array method (vmatch) for
all given reads. This may take several rounds for subsets of the reads.
\item Preprocess the reads and their seeds. Convert them to a qpalma format with some sanity checks.
\item Use the \QPH{} to identify those reads that have a full seed but might be
\item Once we have identified all potentially spliced reads, we use the \QPA{} to align
those to their seed regions.
\item One can choose between several post-processing steps in order to refine
the quality of the alignments via filtering.
\end{enumerate}
A typical run consists of
\begin{enumerate}
\item Set your parameters (read size, number of mismatches, etc.) in the
\item Start the QPalma pipeline via \emph{start\_pipeline}.
\end{enumerate}
First one creates datasets using run-specific preprocessing tools. After
dataset creation one checks the sets for consistency using the
check\_dataset\_consistency.py script.
\section{QPalma Commands}

\subsection{check\_and\_init}

Performs sanity checking of the configuration file(s). Initializes needed
QPalma needs some training examples.
For the training you need:
\begin{itemize}
\item Training examples, i.e.\ correct alignments (you can generate those artificially)
\item Splice site predictions
\item Flat files of the genomic sequences you want to align to
\end{itemize}
A training example is defined as a $4$-tuple, with elements:
\begin{enumerate}
\item a sequence information tuple
\item the read itself
\item the quality tuple
\item the alignment information
\end{enumerate}

A prediction example is defined as a $3$-tuple, with elements:
\begin{enumerate}
\item a sequence information tuple
\item the read itself
\item the quality tuple
\end{enumerate}

The sequence information tuple itself consists of
\begin{itemize}
\item Chromosome/Contig id
\item Start of the genomic region we want to align to
\item Stop of the genomic region
\end{itemize}
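
The tuple layouts described above can be sketched in Python as follows. All class and field names are hypothetical and chosen for illustration; they are not taken from the QPalma code. The sequence information tuple is modelled with only the three fields listed here, and the type of the alignment information is left open because its layout is not specified in this section.

```python
from typing import Any, NamedTuple, Tuple


class SeqInfo(NamedTuple):
    """Sequence information tuple (hypothetical field names)."""
    chrom_id: str  # chromosome/contig id
    start: int     # start of the genomic region we want to align to
    stop: int      # stop of the genomic region


class PredictionExample(NamedTuple):
    """3-tuple used for prediction."""
    seq_info: SeqInfo
    read: str
    quality: Tuple[int, ...]


class TrainingExample(NamedTuple):
    """4-tuple used for training: a prediction example plus the alignment."""
    seq_info: SeqInfo
    read: str
    quality: Tuple[int, ...]
    alignment: Any  # layout of the alignment information is not specified here


def to_prediction(example: TrainingExample) -> PredictionExample:
    """Drop the alignment information to obtain a prediction example."""
    return PredictionExample(example.seq_info, example.read, example.quality)


if __name__ == "__main__":
    info = SeqInfo("chr1", 1000, 2000)
    train = TrainingExample(info, "acgt", (30, 30, 28, 25), None)
    print(to_prediction(train))
```

Modelling the prediction example as the training example without its last element mirrors the two definitions above, which differ only in the alignment information.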
\section{Format Specifications}

This section introduces all formats and conventions that users are assumed to
follow in order to make \QP{} work.

\subsection{Format of the configuration file}

The configuration file includes all settings \QP{} needs to perform an analysis.
This includes paths to the files containing the raw data as well as settings such as which
sequencing platform is being used, the number of cluster nodes to employ, etc.

Its values are in the form

and ``\#'' for lines containing comments.
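
As an illustration of how such a file could be read, here is a minimal Python sketch. It assumes that each setting is written as \texttt{key = value} on its own line, which is an assumption made for this example and not a statement about the actual \QP{} syntax; only the treatment of lines starting with ``\#'' as comments is taken from the text. The function name and the example key \texttt{read\_size} are hypothetical.

```python
def parse_config(text):
    """Parse settings from configuration text into a dict of strings.

    Assumption: one 'key = value' pair per line. Blank lines and lines
    starting with '#' are skipped.
    """
    settings = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "=" not in line:
            raise ValueError(
                "line %d: expected 'key = value', got %r" % (line_no, raw)
            )
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    example = "# where all data should reside\nresult_dir = /tmp/qpalma\nread_size = 36\n"
    print(parse_config(example))
```

All values are kept as strings in this sketch; converting them to numbers or paths would be a separate validation step.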
\subsection{File Formats}
The format of the file containing the mapped short reads is as follows. Each
line corresponds to one short read. Each line has six tab-separated entries, including:
\begin{itemize}
\item chromosome/contig id
\item position of match in chromosome/contig
\end{itemize}
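
A reader for this file can be sketched in Python as below. The sketch relies only on what is stated above, namely one read per line and six tab-separated entries. It deliberately does not name the columns or assume their order; the function name is hypothetical.

```python
EXPECTED_FIELDS = 6  # six tab-separated entries per line, as stated in the text


def read_mapped_reads(lines):
    """Split each non-empty line into its six tab-separated entries.

    Returns a list of records, each a list of six strings in file order.
    Raises ValueError if a line does not have exactly six entries.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(
                "line %d: expected %d entries, found %d"
                % (line_no, EXPECTED_FIELDS, len(fields))
            )
        records.append(fields)
    return records


if __name__ == "__main__":
    print(read_mapped_reads(["a\tb\tc\td\te\tf\n"]))
```

Because the function accepts any iterable of lines, an open file object can be passed directly.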
\subsection{Read format and internal representation}

\begin{itemize}
\item Which nucleotide bears the score of the splice site?
\item What exactly are the exon/intron boundaries pointing to (and are they
0- or 1-based)?
\end{itemize}

We address the first question as follows:
\begin{verbatim}
----gt-------ag------
\end{verbatim}
The score is assumed to be saved at the position of the g's of the splice sites.
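
The convention can be illustrated with a short Python sketch that reports, for every gt and ag dinucleotide in a sequence, the position of the g that carries the score. The function name is hypothetical, and the use of 0-based positions is an assumption made for this example only.

```python
def splice_score_positions(seq):
    """Return (donor_positions, acceptor_positions) as 0-based indices.

    For a donor site 'gt' the score sits on the g, i.e. the first base of
    the dinucleotide. For an acceptor site 'ag' the score also sits on the
    g, i.e. the second base of the dinucleotide.
    """
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]
    return donors, acceptors


if __name__ == "__main__":
    print(splice_score_positions("----gt-------ag------"))
```

For the schematic sequence shown above, the donor g is at index 4 and the acceptor g at index 14 under the 0-based assumption.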
\subsection{Splice Scores}

The splice site scores were generated by ... . However, if you train the model
yourself, you can use splice site predictions of any kind, as long as the
prediction for each site is a real value.
\begin{thebibliography}{1}

\bibitem[1]{DeBona08}
F.~De~Bona, S.~Ossowski, K.~Schneeberger, and G.~R{\"a}tsch.
\newblock Optimal Spliced Alignment of Short Sequence Reads.
\newblock {\em ECCB 2008}.

\bibitem[2]{Tsochantaridis04}
Ioannis~Tsochantaridis, Thomas~Hofmann, Thorsten~Joachims, and Yasemin~Altun.
\newblock Support Vector Machine Learning for Interdependent and Structured Output Spaces.
\newblock {\em Proceedings of the 21st International Conference on Machine Learning}, 2004.

\end{thebibliography}