my changes
[synmut.git] / synmut.tex
1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4 \documentclass[rmp, twocolumn]{revtex4}
5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
7 \usepackage[english]{babel}
8 \usepackage[utf8x]{inputenc}
9 \usepackage{amsmath,amsfonts,amssymb,eucal,eurosym,textcomp}
10 \usepackage{color}
11 \usepackage{graphicx}
12 \usepackage[caption=false]{subfig}
13 \usepackage{natbib}
14 \usepackage{pslatex}
15 \usepackage[colorlinks,linkcolor=red,citecolor=red]{hyperref}
16 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
17 \graphicspath{{./figures/}}
18 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
19 \newcommand{\comment}[1]{\textit{\textcolor{red}{#1}}}
20 \newcommand{\mut}{\mu}
21 \newcommand{\mfit}{\langle F\rangle}
22 \newcommand{\mexpfit}{\langle e^{F}\rangle}
23 \newcommand{\ox}{r}
24 \newcommand{\co}{\rho}
25 \newcommand{\gt}{g}
26 \newcommand{\locus}{s}
27 \newcommand{\locuspm}{t}
28 \newcommand{\OO}{\mathcal{O}}
29 \newcommand{\rev}{\textit{rev}}
30 \newcommand{\FIG}[1]{Fig.~\ref{fig:#1}}
31 \newcommand{\FIGS}[2]{Figs.~\ref{fig:#1} and~\ref{fig:#2}}
32 \newcommand{\env}{\textit{env}}
33 \newcommand{\shankaregion}{C2-C5}
34 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
35 \renewcommand{\thesubfigure}{\Alph{subfigure}}
36 \newcommand{\Author}{Fabio~Zanini and Richard~A.~Neher}
37 \newcommand{\Title}{Deleterious synonymous mutations hitchhike to high frequency in HIV \env~evolution}
38 \newcommand{\Keywords}{{HIV}, {synonymous}, {population genetics}}
39 \hypersetup{pdfauthor={\Author}, pdftitle={\Title}, pdfkeywords={\Keywords}}
40 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
41 \begin{document}
42 \title{\Title}
43 \author{\Author}
44 \date{\today}
45 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
47 \begin{abstract}
48 \noindent
49 Intrapatient HIV evolution is governed by selection on the protein level in the
50 arms race with the immune system (killer T-cells and antibodies). Synonymous
51 mutations do not have an immunity-related phenotype and are often assumed to be
52 neutral. In this paper, we show that synonymous changes in epitope-rich regions
53 are often deleterious but still reach high frequencies in the viral population. We analyze time
54 series of viral sequences from the \shankaregion~part of {\it env} within individual
55 hosts and observe that synonymous derived alleles rarely reach fixation.
56 Simulations suggest that such synonymous mutations
57 have a (Malthusian) selection coefficient of the order of $-0.001$, and that
58 they are brought up to high frequency by hitchhiking on neighboring beneficial
59 nonsynonymous alleles. We detect a negative correlation between the fixation of an allele and
60 its involvement in RNA stem-loop structures.
61 Deleterious synonymous mutations are not observed as abundantly in other parts of the HIV genome, in which
62 selective sweeps are less dense and hitchhiking not as strong; this behaviour is
63 confirmed by extensive computer simulations.
65 \end{abstract}
66 \maketitle
68 \section{Introduction}
70 HIV evolves rapidly within a single host during the course of the infection.
71 This evolution is driven by strong selection imposed by the host immune system
72 via killer T cells (CTLs) and neutralizing antibodies
73 (ABs)~\citep{rambaut_causes_2004} and facilitated by the high
74 mutation rate of HIV~\citep{mansky_lower_1995}. When the host develops a CTL or
75 AB response against a particular HIV epitope, mutations in the viral genome that
76 reduce or prevent recognition of the epitope frequently emerge. Escape mutations
77 in epitopes targeted by CTLs typically evolve during early infection and spread
78 rapidly through the population~\citep{mcmichael_immune_2009}. During chronic
79 infection, the most rapidly evolving part of the HIV genome are the so called
80 variable loops of the envelope protein gp120, which need to avoid recognition by
81 neutralizing ABs. Mutations in \env, the gene encoding for gp120, spread
82 through the population within a few months (see \figurename~\ref{fig:aft}, solid
83 lines). Consistent with this time scale, it is found that serum from a
84 particular time typically neutralizes virus extracted more than 3-6 month
85 earlier \citep{richman_rapid_2003}.
87 These escape mutations are strongly selected for their effect on the amino acid
88 sequence of the viral proteins. Conversely, synonymous mutations are commonly
89 used as approximately neutral markers in studies of viral evolution. Neutral
90 markers are very useful in practice, because there evolution can be compared to
91 that of putatively functional sites to detect purifying or directional selection
92 \citep{Bhatt:2011p43255,Hurst:2002p32608,Chen:2004p22606}. In most species,
93 however, it is known that synonymous codons don't evolve completely neutrally
94 but that some codons are favored over others \citep{plotkin_synonymous_2011}.
95 In addition to maintaining protein function and avoiding immune
96 recognition, the HIV genome has to ensure efficient processing and translation,
97 nuclear export, and packaging into the viral capsid: all these processes operate at the RNA
98 level and are sensitive to synonymous changes. A few functionally important RNA
99 elements are well characterized. For example, a certain RNA sequence in the HIV
100 genome, called \rev{} response element (RRE), enhances nuclear export of viral
101 transcripts~\citep{fernandes_hiv-1_2012}. Another well studied case is the
102 interaction between viral reverse transcriptase, viral ssRNA, and the host
103 tRNA$^\text{Lys3}$: the latter is required for priming reverse transcription
104 (RT) and bound by a specifical pseudoknotted RNA structure in the viral 5'
105 untranslated region~\citep{barat_interaction_1991, paillart_vitro_2002}.
106 Nucleotide-level fitness effects have been observed beyond RNA structure as
107 well. Recent studies have shown that genetically engineered HIV strains with
108 a skewed codon usage bias towards more or less abundant tRNAs
109 replicate better or worse \comment{replicate better or produce more protein?},
110 respectively~\citep{ngumbela_quantitative_2008, li_codon-usage-based_2012}. A similar conclusion has been reached about
111 influenza, and codon-pessimized influenza strains have been shown to be good
112 live attenuated vaccines in mice~\citep{mueller_live_2010}. Purifying selection
113 beyond the protein sequence is therefore expected
114 \citep{forsdyke_reciprocal_1995,snoeck_mapping_2011}, but it seems reasonable
115 that the bulk of positive selection through the immune system be restricted to amino acid sequences.
120 In this paper, we characterize the dynamics of synonymous mutations in \env{}
121 and show that a substantial fraction of these mutations is deleterious. We
122 further show that, although such synonymous mutations cannot be used as neutral
123 markers, the degree to which they hitchhike with nearby nonsynonymous mutations
124 is very informative both about their on deleterious effect, as well as the
125 strength and nature of selection for nearby positively selected sites.
126 Their ability to hitchhike for extended times, which is a core requirement for
127 our analysis, is rooted in the small recombination rate of
128 HIV~\citep{neher_recombination_2010, batorsky_estimate_2011}. Extending the
129 analysis of fixation probabilities to the nonsynonymous mutations, we show that
130 time dependent selection or strong competition of escape mutations inside the
131 same epitope are necessary to explain the observed patterns of fixation and
132 loss.
134 \section{Results}
136 The central quantity we investigate is the probability of fixation of a
137 mutation, conditional on its population frequency. A neutral mutation
138 segregating at frequency $\nu$ has a probability $P_\text{fix}(\nu) = \nu$ to
139 spread through the population and fix; in the rest of the cases, i.e.~with
140 probability $1-\nu$, it goes extinct. As illustrated in the inset of \FIG{aftsyn},
141 this is a simple consequence of the fact that
142 (a) exactly one of the $N$ individuals in the current population will be
143 the common ancestor of the entire future population at a particular locus and
144 (b) this ancestor has a probability $\nu$ of carrying the mutation (assuming
145 the neutral mutation is not preferentially associated with genomes of high or
146 low fitness).
147 Deleterious or beneficial mutations fix less or
148 more often than neutral ones, respectively. Time series sequence data enable a
149 direct observation of both the current frequency $\nu$ of any particular
150 mutation and its future fate (fixation or extinction).
153 \subsection{Synonymous polymorphisms in \env, C2-V5, are mostly deleterious}
155 \FIG{aft} shows time series data of the frequencies of all mutations observed
156 \env~, C2-V5, in patient p8~\citep{shankarappa_consistent_1999}. Despite many
157 synonymous mutations reaching high frequency, very few fix
158 (panel~\ref{fig:aftsyn}); however, many nonsynonymous mutations fix
159 (panel~\ref{fig:aftnonsyn}).
160 These observations are quantified in \FIG{fixp}, which stratifies the data
161 of 7 (resp.~10) patients according to the frequency at which
162 different mutations are observed (see methods). Considering all mutations in a
163 frequency interval around $\nu_0$ at some time $t_i$, we calculate the fraction
164 that is found at frequency 1, at frequency 0, or at intermediate frequency at
165 later times $t_f$. Plotting these fixed, lost, and polymorphic fraction against
166 the time interval $t_f-t_i$, we see that most synonymous mutations segregate for
167 roughly one year and are lost much more frequently than expected (panel
168 \ref{fig:fixp1}). The long-time probability of fixation versus extinction of
169 synonymous mutations is shown as a function of the initial frequency $\nu_0$
170 in panel~\ref{fig:fixp2} (red line). The nonsynonymous seem to follow more or
171 less the neutral expectation (blue line) -- a point to which we will come back
172 below.
174 \begin{figure}
175 \begin{center}
176 \subfloat{\includegraphics[width=\linewidth]{Shankarappa_allele_freqs_trajectories_syn_p8.pdf}
177 \label{fig:aftsyn}}\\
178 \subfloat{\includegraphics[width=\linewidth]{Shankarappa_allele_freqs_trajectories_nonsyn_p8.pdf}
179 \label{fig:aftnonsyn}}
180 \caption{Time series of allele frequencies in \env, C2-V5, from
181 patient 8~\cite{shankarappa_consistent_1999},
182 divided by synonymous (panel A) and nonsynonymous (panel B) derived alleles.
183 While nonsynonymous mutations frequently fix, very few synonymous
184 mutations do even though they are frequently observed at intermediate
185 frequencies. Colors indicate the position of the site along the C2-V5 region
186 (blue to red). Inset: the fixation probability $P_\text{fix}(\nu)$ of a neutral mutation
187 is simply the likelihood that the future common ancestor is currently carrying
188 it, i.e. its frequency $\nu$.}
189 \label{fig:aft}
190 \end{center}
191 \end{figure}
193 \citet{bunnik_autologous_2008} present a longitudinal dataset on the entire
194 \env~gene of 3 patients at $\sim 5$ time points with approximately 5-20
195 sequences each (see methods). Repeating the above analysis separately on the
196 C2-V5 region studied above and the remainder of \env~ reveal striking
197 differences (see \FIG{fixp2}). Within C2-V5, this data fully confirms the
198 observations made in the data set by \citet{shankarappa_consistent_1999} (red
199 line). In the remainder of \env, however, observed synonymous mutations behave
200 neutrally (green line).
202 These observations suggest that many of the synonymous polymorphisms at
203 intermediate frequencies in the part of \env~that includes the hypervariable
204 regions are deleterious, while outside this regions polymorphisms are mostly
205 roughly neutral. Note that this does not imply that all synonymous mutations are
206 neutral -- only those mutations observed at high frequencies tend to be neutral.
208 \begin{figure}
209 \begin{center}
210 \subfloat{\includegraphics[width=0.9\linewidth]{Shankarappa_fix_loss_dt_times}
211 \label{fig:fixp1}}\\
212 \subfloat{\includegraphics[width=0.9\linewidth]{Bunnik2008_fixmid_syn_ShankanonShanka}
213 \label{fig:fixp2}}
214 \caption{Left panel: time course of loss and fixation of synonymous mutations
215 observed in a frequency interval $\nu_0$. The ultimate fraction of synonymous
216 mutations that fix as a function of intermediate frequency $\nu_0$ is the
217 fixation probability. Right panel: fixation probability of derived synonymous
218 alleles is strongly suppressed in C2-V5 versus other parts of the {\it env}
219 gene, and of nonsynonymous ones. Data from
220 Refs.~\cite{shankarappa_consistent_1999, bunnik_autologous_2008}.}
221 \label{fig:fixp}
222 \end{center}
223 \end{figure}
225 \subsection{Synonymous mutations in C2-V5 tend to disrupt conserved RNA stems}
227 One possible {\it a priori} explanation for lack of fixation of synonymous
228 mutations in C2-V5 are secondary structures in the viral RNA. It has
229 been suggested early on that parts of the viral genome that has the potential to
230 form stems is better conserved than the
231 remainder~\citep{forsdyke_reciprocal_1995,snoeck_mapping_2011}.
233 Recently, the propensity of nucleotides of the HIV genome to form base pairs has
234 been measured using the SHAPE assay (a biochemical reaction preferentially
235 altering unpaired bases)~\citep{watts_architecture_2009}. The SHAPE assay has
236 shown that the variable regions V1 to V5 tend to be unpaired, while the
237 conserved regions between those variable regions form stems. We partition all
238 synonymous alleles observed at intermediate frequencies above 10-15\% depending
239 on their final destiny (fixation or extinction). Subsequently, we align our
240 sequences to the reference NL4-3 strain used in
241 ref.~\citep{watts_architecture_2009} and assign them SHAPE reactivities. As
242 shown in \FIG{SHAPEA} in a cumulative histogram, the reactivities of fixed
243 alleles (red line) are systematically larger than of alleles that are doomed to
244 extinction (blue line) (Kolmogorov-Smirnov test, $P\approx 0.002$).
245 In other words, alleles that are likely to be
246 breaking RNA helices are also more likely to revert and finally be lost from the
247 population. As a control, the average over non-observed but potentially
248 available polymorphisms lies between the two curves (green line), as expected
249 (because only some of these mutations will interfere with stem loop formation).
250 To test the hypothesis that mutations in C2-V5 are lost since break stems
251 in the conserved between the variable loops, we split the synonymous mutations
252 in the extended V1-V5 region further into conserved and variable regions and
253 found that the biggest depression in fixation probability is observed in the
254 conserved stems, while the variable loops show little deviation from the
255 neutral signature, see \FIG{SHAPEB}. This is consistent with important stem
256 structures in conservered regions between loops.
258 In addition to RNA secondary structure, we have considered other possible
259 explanations for a fitness defect of synonymous mutations, in particular codon
260 usage bias (CUB). HIV is known to prefer A-rich codons over highly expressed
261 human codons~\citep{jenkins_extent_2003}. Moreover, codon-optimized
262 and -pessimized viruses have recently been generated and shown to replicate
263 better or worse than wild type strains,
264 respectively~\citep{li_codon-usage-based_2012, ngumbela_quantitative_2008,
265 coleman_virus_2008}. We do not find, however, any evidence for a contribution of
266 CUB to the ultimate fate of synonymous alleles. Several lines of thought support
267 this result. First of all, within a single patient, we do not observe any bias
268 towards more human-like CUB in the synonymous mutations that reach fixation
269 rather than extinction. Second, although codon-optimized HIV seems to perform
270 better {\it in vitro}, the distance in CUB between HIV and human genes is not
271 shrinking at the macroevolutionary level. Third, it is a common phenomenon for
272 retroviruses to use variously different codons from their hosts, and CUB effects
273 on fitness are thought to be so small that divergent nucleotide composition has
274 been suggested as a possible mechanism for viral
275 speciation~\citep{bronson_nucleotide_1994}. Fourth, CUB in the V1-C5 region is
276 not very different from other parts of the HIV genome, whereas the reduced
277 fixation probability is only observed there. In conclusion, although we cannot
278 exclude an effect of CUB on fitness as a general rule, we expect it to be a
279 minor effect in our context. \comment{Need to read literature}
281 \begin{figure}
282 \begin{center}
283 \subfloat{\includegraphics[width=0.9\linewidth]{mixed_Shankarappa_Bunnik2008_Liu_fixation_reactivity_Vandflanking_fromSHAPE}
284 \label{fig:SHAPEA}}\\
285 \subfloat{\includegraphics[width=0.9\linewidth]{Shankarappa_fixmid_syn_V_regions.pdf}\label{fig:SHAPEB}}
286 \caption{Permissible synonymous mutations have
287 high SHAPE reactivities. \comment{Is this C2-V5 only?}
288 Panel A) Synonymous substitions in the C2-V5 region have substantially higher
289 SHAPE reactivities than mutations that reach frequencies above 15\%, but are
290 susequently lost. This difference is quantified by a cumulative histogram
291 ($p=0.002$, Kolmogorov-Smirnov two sample test). Mutations that are never
292 observered show an intermediate distribution of SHAPE values.
293 Panel B shows the fixation probablility of synonymous mutations in C2-V5
294 separately for variable loops and the connecting conserved regions, that
295 harbor conserved RNA stems. As expected, fixation probability is lower inside
296 the conserved regions. Data from Refs.~\cite{shankarappa_consistent_1999,
297 bunnik_autologous_2008, liu_selection_2006}.}
298 \label{fig:SHAPE}
299 \end{center}
300 \end{figure}
303 \subsection{Deleterious mutations are brought to high frequency by hitch-hiking}
305 While the observation that some fraction of synonymous mutations is deleterious
306 is not unexpected, it seems odd that we observe them at high population
307 frequency -- at least in some regions of the genome. The region of \env~ in
308 which we observe deleterious mutations at high frequency, however, is special in
309 that it undergoes frequent adaptive changes to evade recognition by neutralizing
310 antibodies~\cite{williamson_adaptation_2003,richman_rapid_2003}. Due to the limited amount of
311 recombination in HIV~\cite{neher_recombination_2010,batorsky_estimate_2011},
312 deleterious mutations that are linked to adaptive variants can reach high
313 frequency~\citep[genetic hitchhiking,][]{smith_hitch-hiking_1974}.
315 The potential for hitchhiking is already apparent from the allele frequency
316 trajectories in \FIG{aft}, where many mutations appear to change rapidly in
317 frequency as a flock. Deleterious synonymous mutations can be brought to high
318 frequency by selection on linked nonsynonymous sites, a process
319 known as {\it genetic draft}~\citep{gillespie_genetic_2000, neher_genetic_2011}. In order
320 to be advected to high frequency by a linked adaptive mutation, the deleterious
321 effect of the mutation has to be substantially smaller than the adaptive effect.
322 The latter was estimated to be on the order of $s_a = 0.01$ per day~\citep{neher_recombination_2010}.
323 The approximate magnitude of the deleterious effects can be estimated from
324 \FIG{fixp1}, that shows the distribution of times for synonymous
325 alleles to reach the fix or get lost starting from intermediate frequencies. The
326 typical time to loss is of the order of 500 days. If this loss is driven by the
327 deleterious effect of the mutation, this corresponds to deleterious effects of
328 roughly $s_d \sim - 0.002$ per day.
330 \begin{figure}
331 \begin{center}
332 \subfloat{\includegraphics[width=\linewidth]{simulations_graduallydel.pdf}
333 \label{fig:simfixpvar}}\\
334 \subfloat{\includegraphics[width=\linewidth]{simulations_syn.pdf}
335 \label{fig:simsfig}}
336 \caption{Evolutionary parameters compatible with observations.
337 Panel A) The depression in $P_\text{fix}$ depends on the deleterious
338 effect size of the synonymous alleles (in HIV data is about $-0.2$).
339 Increasing this parameter also reduces synonymous
340 diversity, measured by the number of mutations at frequencies $0.25 < \nu < 0.75$,
341 normalized by the number of possible mutations (first inset).
342 Panel B) To assess the parameter space compatible with observations, we run many
343 simulations with random parameters for deleterious effect size, fraction of
344 deleterious synonymous sites, average size of escape mutations (color, blue ($10^{-2.5}$) to red ($10^{-1.5}$)), and rate of
345 introduction of new epitopes (marker size, from $10^{-3}$ to $10^{-2}$ per
346 generation). Mostly simulations with small deleterious effects and relatively
347 high deleterious fractions of synonymous mutations reproduce synonymous HIV-like
348 fixation probability and diversity. Parameters are chosen uniformly on a
349 logarithmic scale from the indicated windows. \comment{should we extend the
350 parameters space maybe to $10^{-4}$?}}
351 \label{fig:simheat}
352 \end{center}
353 \end{figure}
355 To get a better idea of the range of parameters that are compatible with the
356 observations and our interpretation, we perform computer simulations of
357 evolving viral populations assuming a mix of positive and purifying selection
358 and rare recombination.
359 For this purpose, we use the simulation package FFPopSim, which includes a
360 module dedicated to intrapatient HIV evolution~\citep{zanini_ffpopsim:_2012}.
362 The first result of the simulations is that genetic draft can indeed bring weakly
363 deleterious mutations to high frequencies and results in a dependence of the
364 fixation probability on initial frequency that is compatible with observations
365 (see \FIG{simfixpvar}).
366 The fixation probability of synonymous alleles decreases from the neutral
367 expectation to zero as their deleterious effect increases
368 (\FIG{simfixpvar}, from $-0.0003$ in blue to $-0.003$ in red). Along with the
369 reduction in fixation probability, the number of alleles that reach high
370 frequencies, which is a proxy for synonymous diversity, is reduced as well (\FIG{simfixpvar}, top inset).
372 To assess the reduction in fixation probability quantitatively, we calculate
373 the area between the measured fixation probability and the diagonal, which is
374 the neutral expectation (\FIG{simfixpvar}, lower inset). This area spans a range
375 between $-0.5$ (no fixation at all), zero (neutral-like curve), and $+0.5$ (always
376 fixed). The HIV data correspond to an area of approximately $-0.2$ for synonymous
377 changes, and essentially zero for nonsynonymous changes.
379 We pick many random combinations of the following four parameters: effect size
380 and rate of appearance of beneficial escape mutations, and effect size and
381 fraction of deleterious synonymous mutations.
382 Among all simulations, we select the one that show a large depression in
383 fixation probability (>-0.??) of synonymous mutations but, simulteneously, a
384 high synonymous diversity (>0.0? per site), as observed in HIV data. This
385 selection imposes a strong constraint on the parameters, in particular on the
386 ones related to synonymous changes, i.e.~their deleterious fraction and effect
387 size (see \FIG{simsfig}).
388 A high fraction ($\gtrsim 0.8$) of only slightly deleterious synonymous sites,
389 with effect size $|s_d| \lesssim 0.002$, is required. This result fits well the
390 expectation based on the fixation/extinction times above (see \FIG{fixp1}).
391 Moreover, it corresponds to {\it a priori} theoretical expectations:
392 (a) since high frequency bins are enriched in neutral mutations (it is easier
393 for them to hitchhike), the hallmark of deleterious mutations will be visible
394 there only if they are pervasive; (b) genetic draft stops working if the
395 deleterious effect size becomes comparable to the adaptive effect size of escape
396 mutations, because the double mutant has little or no fitness advantage.
398 As long as only features of synonymous mutations are filtered, as performed
399 above, no strong correlation of the two nonsynonymous parameters, effect size
400 and rate of appearance of epitopes, is observed; in \FIG{simsfig}, the points
401 shown possess a variety of shapes and colors. In most of these simulations,
402 however, nonsynonymous mutations almost always fix once they reach high frequencies
403 -- their nonsynonymous fixation area is well above zero. This is incompatible
404 with the blue line in \FIG{fixp}: in an HIV infection, nonsynonymous mutations fix as if they
405 were neutral. Although there are no doubts that nonsynonymous variation in the
406 variable regions is driven by positive selection and not by genetic drift, the
407 biological reason for the neutral-like curve in \FIG{fixp} is unclear.
408 Inspecting the trajectories of nonsynonymous mutations
409 suggests the rapid rise and fall of many alleles. We test two possible
410 mechanisms that are biologically plausible and could explain the transient rise
411 of nonsynonymous mutations: time-dependent selection and within-epitope
412 competition.
414 The former explanation can be formulated as follows: if the immune system
415 recognizes the escape mutant before its fixation, the mutant might cease to be
416 beneficial and disappear soon, despite its quick initial rise in frequency. In support of this idea,
417 \citet{richman_rapid_2003, bunnik_autologous_2008} report antibody responses to
418 escape mutants. These respones are delayed by a few months, roughly matching the
419 average time needed by an escape mutant to cross frequencies of order one.
420 In the alternative explanation, several different escape
421 mutations within the same epitope might arise almost simultaneously and start to
422 spread. Their benefits are not additive, because each of them is
423 essentially sufficient to escape. As a consequence, several escape mutations rise to
424 high frequency rapidly, while the one with the smallest cost in terms of replication,
425 packaging, etc.~is most likely to eventually fix. The emergence of
426 multiple sweeping nonsynonymous mutations in real HIV infections has been shown
427 previously~\citep{moore_limited_2009, bar_early_2012}.
428 We tested both hypotheses in our simulation framework and they both seem to
429 achieve a reduced, neutral-like fixation probability while maintaining a high
430 rate of substitutions, compatibly with HIV measurements. In real HIV infections,
431 both mechanisms are likely to be playing a role.
433 \section{Discussion}
434 Despite several known functional roles for RNA secondary structure in the HIV
435 genome, synonymous mutations are often used as approximately neutral markers in
436 evolutionary studies of viruses. We have shown that the majority of synonymous
437 mutations in the conserved regions C2-C5 of the \env~gene are deleterious.
438 Comparison with recent biochemical studies of binding propensity of bases in RNA
439 genome suggest that these mutations are deleterious, at least in part, because they disrupt
440 stems in RNA secondary structures. Furthermore, we provide evidence that these
441 mutations are brought to high frequency through linkage to adaptive mutations.
442 The latter mutations are only transiently adaptive, either through a
443 coevolution with the immune system or redundant escape within an epitope.
445 Our observations and conclusion rely heavily on longitudinal data in which the
446 dynamics of mutations can be explicitly observed. The fact that deleterious
447 mutations can be brought to high frequencies through hitchhiking underscores
448 the intensity of the coevolution with the immune system. The fact that
449 multiple escape mutations in the same epitope -- as is indeed observed in
450 studies of antibody escape~\citep{moore_limited_2009, bar_early_2012} -- are
451 necessary to explain the patterns of fixation of nonsynonymous mutations points
452 towards a large populations size that rapidly discovers adaptive mutations. A
453 similar point has been made recently by Boltz {\it et al.} in the context of
454 preexisting drug resistance mutations~\citep{boltz_ultrasensitive_2012}.
456 The observed hitchhiking highlights the importance of linkage due to infrequent
457 recombination for the evolution of HIV
458 \citep{neher_recombination_2010,batorsky_estimate_2011,
459 josefsson_majority_2011}. The recombination rate has been estimated to be on the
460 order of $\rho = 10^{-5}$ per base and day. It takes roughly $t_{sw} = s_a^{-1}
461 \log \nu_0$ generations for an adaptive mutation with growth rate $s_a$ to rise
462 from an initially low frequency $\nu_0\sim \mu$ to frequency one. This implies
463 that a region of length $l = (\rho t_{sw})^{-1} = s_a / \rho \log \nu_0$ remains
464 linked to the adaptive mutation. With $s_a=0.01$, $l\approx 100$ bases, hence we
465 expect strong linkage between the variable loops and their surrounding stems,
466 but none far beyond the variable regions, consistent with the lack of signal
467 outside of C1-V5. In case of much stronger selection -- such as observed during
468 early CTL escape or drug resistance evolution -- the linked region is of course
469 much larger.
471 The functional significance of the insulating RNA structure stems between the
472 hyper variable loops has been proposed
473 previously~\citep{watts_architecture_2009, sanjuan_interplay_2011}.
474 \citet{sanjuan_interplay_2011} have shown that insulating stems are relevant for
475 viral fitness {\it in vivo}. Our analysis is limited by the availability of
476 longitudinal data which requires a focus on the the variable regions of \env.
477 Conserved RNA structures exist in different parts of the HIV genome (several are
478 known). In absence of repeated adaptive substitutions in the vicinity that cause
479 hitchhiking, the deleterious synonymous mutations remain at low frequencies and
480 can only be observed by deep sequencing methods.
482 Our study uncovers the
483 subtle balance of evolutionary forces governing intrapatient HIV evolution. The
484 fixation and extinction times and probabilities represent a rich and simple
485 summary statistics useful to characterize sequencing data and compare to
486 models via computer simulations.
487 A similar method has been recently used in a longitudinal study of
488 influenza~\citep{strelkowa_clonal_2012}. The propagators suggested in that
489 paper, however, represent ratios between nonsynonymous
490 mutations and synonymous ones, hence they are inadequate to investigate
491 synonymous changes themselves. The authors also conclude that several
492 beneficial mutations segregate simultaneously in influenza, a scenario
493 remarkably similar to our within-epitope competition picture. While
494 recombination hardly affects evolution within an epitope, it might facilitate
495 immune escape by easing competition between different epitopes
496 \citep{neher_rate_2010,Rouzine:2005p17398}.
498 Our results emphasize the inadequacy of independent site
499 models of HIV evolution and the common assumption that selection is time
500 independent. If genetic variation is only transiently beneficial, existing
501 methods to quantify selection will yield substantial underestimates
502 \citep{williamson_adaptation_2003,neher_rate_2010,OTHER}. To explain the
503 observations regarding the fixation probabilities of non-synonymous mutations,
504 either transient selection, or substational within-epitope competition are
505 necessary. Which mechanism is more widespread is not clear as of now,
506 there is evidence for both~\citep{richman_rapid_2003, moore_limited_2009,
507 bar_early_2012}.
509 \section{Methods}
510 \subsection{Sequence data collection}
511 Longitudinal intrapatient viral RNA sequences were collected for published
512 studies~\citep{shankarappa_consistent_1999,
513 liu_selection_2006, bunnik_autologous_2008} and downloaded from the Los Alamos
514 National Laboratory (LANL) HIV sequence database~\citep{LANL2012}. The sequences from
515 some patients showed signs of HIV compartimentalization into subpopulations and
516 were discarded; a grand total of 11
517 patients with approximately 6 time points each and 10 sequences per time point
518 were analyzed. The time interval or resolution between two consecutive sequences
519 was approximately 6 to 18 months.
521 \comment{How did you determine who to discard. Maybe we should include the PCA
522 plots into the supplement.}
525 \subsection{Sequence analysis}
526 The good sequences were aligned within each patient
527 via the translated amino acid sequence, using
528 Muscle~\citep{edgar_muscle:_2004}, and to the NL4-3 reference sequence used
529 by \citet{watts_architecture_2009} in the SHAPE assay. Within each patient, a
530 consensus RNA sequence at the first time point was used to classify alleles as ancestral or
531 derived at all sites. Problematic sites that included large frequencies of gaps
532 were excluded from the analysis to avoid artefactual substitutions due to
533 alignment errors. Time series of allele frequencies were extracted from the
534 sequences.
536 The synonymity of a mutation was assigned if the rest of the codon was
537 in the ancestral state and using the standard genetic code. Cases where more
538 than one mutation within the codon was observed were discarded. Slightly
539 different criteria for synonymous/nonsynonymous discrimination yielded similar
540 results.
542 \subsection{Fixation probability and secondary structure}
543 For the estimate of times to fixation/extinction, polymorphisms were
544 binned by frequency and the time to reaching the first boundary (fixation or
545 extinction) was stored. For the fixation probability, the long-time limit of the
546 resulting curves was used, excluding polymorphisms that arose late in the
547 clinical history (and would have had no time to reach either boundary).
549 For the correlation analysis with RNA secondary structure, the SHAPE scores were
550 downloaded from the journal website~\citep{watts_architecture_2009}. By virtue
551 of the alignment of the longitudinal sequences with the reference used by
552 \citet{watts_architecture_2009}, SHAPE reactivities were assigned to most sites.
553 Problematic assignments in indel-rich regions were excluded from the analysis.
554 In order to restrict the analysis to synonymous polymorphisms, a lower frequency
555 threshold of 0.15 was used (other thresholds yielded the same results). Since
556 very few polymorphisms hitchhike beyond, say, a frequency of 0.5, this pool is
557 enriched for to-be-lost mutations; hence the ``lost" curve in \FIG{SHAPEA}
558 contains much more points than the ``fixed" one.
560 The V loops and flanking regions were identified manually starting from the
561 annotated reference HXB2 sequence from the LANL HIV database~\citep{LANL2012}. A
562 similar approach was used to label the C2-V5 region sequenced in
563 ref.~\citep{shankarappa_consistent_1999}.
565 \subsection{Computer simulations}
566 Simulations were performed using the recently published software
567 FFPopSim~\citep{zanini_ffpopsim:_2012}. Both full-length HIV genomes and
568 \env{}-only simulations were performed and yielded comparable results. For each
569 set of parameters, approximately 100 simulation runs were averaged over. In each
570 run, a random fitness landscape with specified statistical properties (e.g.
571 density of beneficial sites, average deleterious effect of synonymous changes) was generated.
572 Although the curves shown in \FIG{simfixpvar} are not very smooth, small
573 parameter changes resulted in overall consistent trends across many repetitions.
575 For the discussion of simulation parameters, the areas below or above the neutral
576 diagonal were estimated from the binned fixation probabilities using the linear
577 interpolation between the bin centers. This measure is sufficiently precise for
578 our purposes, because the HIV data are quite scarse themselves.
580 \section*{Acknowledgements}
581 \comment{to be written\dots}
584 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
585 \bibliographystyle{natbib}
586 \bibliography{bib}
587 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
588 \end{document}
589 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%