1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2 % ARTICLE ABOUT FATE OF SYNONYMOUS MUTATIONS IN HIV
3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4 \documentclass[rmp, twocolumn]{revtex4}
5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
7 \newcommand{\Author}{Fabio~Zanini and Richard~A.~Neher}
8 \newcommand{\Title}{Deleterious synonymous mutations hitchhike to high frequency in HIV \env~evolution}
9 \newcommand{\Keywords}{{HIV}, {synonymous}, {population genetics}}
10 \usepackage[english]{babel}
11 \usepackage[utf8x]{inputenc}
12 \usepackage{amsmath,amsfonts,amssymb,eucal,eurosym,textcomp}
13 \usepackage{color}
14 \usepackage{graphicx}
15 \usepackage[caption=false]{subfig}
16 \usepackage{natbib}
17 \usepackage{pslatex}
18 \usepackage[colorlinks,linkcolor=red,citecolor=red]{hyperref}
19 \hypersetup{pdfauthor={\Author}, pdftitle={\Title}, pdfkeywords={\Keywords}}
20 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
21 \graphicspath{{./figures/}}
22 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
23 %\DeclareMathOperator\de{d\!}
24 \newcommand{\comment}[1]{\textit{\textcolor{red}{#1}}}
25 \newcommand{\mut}{\mu}
26 \newcommand{\mfit}{\langle F\rangle}
27 \newcommand{\mexpfit}{\langle e^{F}\rangle}
28 \newcommand{\ox}{r}
29 \newcommand{\co}{\rho}
30 \newcommand{\gt}{g}
31 \newcommand{\locus}{s}
32 \newcommand{\locuspm}{t}
33 \newcommand{\OO}{\mathcal{O}}
34 \newcommand{\env}{\textit{env}}
35 \newcommand{\rev}{\textit{rev}}
36 \newcommand{\FIG}[1]{Fig.~\ref{fig:#1}}
37 \newcommand{\FIGS}[2]{Figs.~\ref{fig:#1} and~\ref{fig:#2}}
38
39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
40 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
41 \begin{document}
42 \title{\Title}
43 \author{\Author}
44 \date{\today}
45 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
46
47 \begin{abstract}
48 \noindent
49
Intrapatient HIV evolution is governed by selection on the protein level in
51 the arms race with the immune system (killer T-cells and antibodies).
52 Synonymous mutations do not have an immunity-related phenotype and are often
53 assumed to be neutral. In this paper, we show that synonymous changes in
54 epitope-rich regions are often deleterious but reach frequencies of order one
55 via genetic hitchhiking. We analyze time series of viral sequences from the
56 V1-C5 part of {\it env}, the envelope gene, within individual hosts and observe
57 that synonymous derived alleles fix in the viral population much less
58 frequently than expected from neutral models. Instead, synonymous changes tend
59 to revert on a time scale of two years, suggesting a selection coefficient of the
60 order of $-1~$\textperthousand. Extensive computer simulations support these
61 findings quantitatively. We explore possible biological causes and detect a
62 negative correlation between fixation of an allele and its involvement in
63 secondary RNA stem-loop structures, indicating that such structures are
functionally relevant for HIV replication {\it in vivo}. This phenomenon is not
65 observed in other parts of the HIV genome, in which selective sweeps are less
66 dense and the genetic architecture less constrained.
67
68 \end{abstract}
69 \maketitle
70
71 \section{Introduction}
72
73 HIV evolves rapidly within a single host during the course of the infection.
74 This evolution is driven by strong selection imposed by the host immune system
75 via killer T cells (CTLs) and neutralizing antibodies
76 (ABs)~\citep{pantaleo_immunopathogenesis_1996} and facilitated by the high
77 mutation rate of HIV~\citep{mansky_lower_1995}. When the host develops a CTL or
78 AB response against a particular HIV epitope, mutations in the viral genome that
79 reduce or prevent recognition of the epitope frequently emerge. Escape mutations
80 in epitopes targeted by CTLs typically evolve during early infection and spread
81 rapidly through the population~\citep{mcmichael_immune_2009}. During chronic
infection, the most rapidly evolving parts of the HIV genome are the so-called
variable loops of the envelope protein gp120, which need to avoid recognition by
neutralizing ABs. Mutations in \env, the gene encoding gp120, spread
85 through the population within a few months (see \figurename~\ref{fig:aft}, solid
86 lines). The (Malthusian) effect size of these beneficial mutations is of the
87 order of $s_a \sim 0.01$~\citep{neher_recombination_2010}.
88
89 These escape mutations are strongly selected for their effect on the amino acid
90 sequence of the viral proteins. Conversely, synonymous mutations are commonly
91 used as approximate neutral markers in studies of viral evolution. Neutral
92 markers are very useful in practice, because they can be used to make inferences
93 about the stochastic forces driving evolution~\citep{yang_statistical_2000}.
94 The viral genome, however, needs to satisfy further constraints in addition to
95 immune escape, such as efficient processing and translation, nuclear export, and
96 packaging into the viral capsid: all these processes operate at the RNA level
97 and are sensitive to synonymous changes. A few functionally important RNA
98 elements are well characterized. For example, a certain RNA sequence in the HIV
99 genome, called \rev{} response element (RRE), enhances nuclear export of viral
100 transcripts~\citep{fernandes_hiv-1_2012}. Another well studied case is the
101 interaction between viral reverse transcriptase, viral ssRNA, and the host
102 tRNA$^\text{Lys3}$: the latter is required for priming reverse transcription
(RT) and bound by a specific pseudoknotted RNA structure in the viral 5'
104 untranslated region~\citep{barat_interaction_1991, paillart_vitro_2002}.
105 Nucleotide-level fitness effects have been observed beyond RNA structure as
106 well. Recent studies have shown that genetically engineered HIV strains with
107 skewed codon usage bias (CUB) patterns towards more or less abundant tRNAs
108 replicate better or worse, respectively~\citep{ngumbela_quantitative_2008,
109 li_codon-usage-based_2012}. A similar conclusion has been reached about
110 influenza, and codon-pessimized influenza strains have been shown to be good
111 live attenuated vaccines in mice~\citep{mueller_live_2010}. Purifying selection
112 beyond the protein sequence is therefore expected, while it seems reasonable
113 that the bulk of positive selection through the immune system be restricted to
114 amino acid sequences.
115
116 %SYNONYMOUS CONSERVATION. DO WE HAVE A PLOT OF GENOME WIDE CONSERVATION, MAYBE
117 %FOR SUPPLEMENT? YES
118
119 In this paper, we characterize the dynamics of synonymous mutations in \env{}
120 and show that a substantial fraction of these mutations is deleterious. We
121 further show that, although such synonymous mutations cannot be used as neutral
122 markers, the degree to which they hitchhike with nearby nonsynonymous mutations
123 is very informative. Their ability to hitchhike for extended times, which is a
124 core requirement for our analysis, is rooted in the small recombination rate of
125 HIV~\citep{neher_recombination_2010, batorsky_estimate_2011}. Extending the
126 analysis of fixation probabilities to the nonsynonymous mutations, we show that
127 time dependent selection or strong competition of escape mutations inside the
128 same epitope are necessary to explain the observed patterns of fixation and
129 loss.
130
131 \section{Results}
132
133 The central quantity we investigate is the probability of fixation of a
134 mutation, conditional on its population frequency. A neutral mutation
135 segregating at frequency $\nu$ has a probability $P_\text{fix}(\nu) = \nu$ to
136 spread through the population and fix; in the rest of the cases, i.e. with
137 probability $1-\nu$, it goes extinct. This is a simple consequence of the fact
138 that (a) exactly one of the $N$ individuals in the current population will be
139 the common ancestor of the entire future population at a particular locus and
140 (b) this ancestor has a probability $\nu$ of carrying the mutation, see
141 illustration in \FIG{fixp}. Deleterious or beneficial mutations fix less or
142 more often than neutral ones, respectively. Time series sequence data enable a
143 direct observation of both the current frequency $\nu$ of any particular
144 mutation and its future fate (fixation or extinction). They therefore represent
145 a simple way to investigate average properties of different classes of
146 mutations.
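
The statement that selected mutations deviate from the neutral line can be made
explicit with the standard single-locus diffusion result for a haploid
Wright-Fisher population. This is only a sketch under assumptions that are not
part of our analysis: an unlinked site, a constant population size $N$, and a
constant selection coefficient $s$ per generation (the symbol $s$ is introduced
here for illustration). In that idealized case,
\begin{equation}
P_\text{fix}(\nu) = \frac{1 - e^{-2Ns\nu}}{1 - e^{-2Ns}},
\end{equation}
which reduces to $P_\text{fix}(\nu) = \nu$ for $s \to 0$, lies below the
diagonal for $s < 0$, and lies above it for $s > 0$. Linked selection, which is
central to the results below, is not captured by this formula.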
147
148 \subsection{Synonymous polymorphisms in \env, C2-V5 are mostly deleterious}
149
\FIG{aft} shows time series data of the frequencies of all mutations observed
in \env, C2-V5, in patient p8~\citep{shankarappa_consistent_1999}. Despite many
152 synonymous mutations reaching high frequency (dashed lines), very few fix. This
153 observation is quantified in panels \FIG{fixp1} and \ref{fig:fixp2}, which
154 stratify the data of 7 (resp. 10) patients according to the frequency at which
155 different mutations are observed (see methods). Considering all mutations in a
156 frequency interval around $\nu_0$ at some time $t_i$, we calculate the fraction
157 that is found at frequency 1, at frequency 0, or at intermediate frequency at
later times $t_f$. Plotting these fixed, lost, and polymorphic fractions against
159 the time interval $t_f-t_i$, we see that most synonymous mutations segregate for
160 roughly two years and are lost much more frequently than expected. The long-time
161 probability of fixation versus extinction is shown as a function of the initial
162 frequency $\nu_0$ in panel~\ref{fig:fixp2} (red line). In contrast to synonymous
mutations, nonsynonymous mutations seem to follow more or less the neutral expectation
164 (blue line) -- a point to which we will come back below.
165
166 \begin{figure}
167 \begin{center}
168 \includegraphics[width=\linewidth]{Shankarappa_allele_freqs_trajectories_syn_nonsynp8.pdf}
169 \caption{Synonymous mutations rarely fix in \env, C2-V5: mutation frequency
trajectories observed in patient 8~\cite{shankarappa_consistent_1999}.
Nonsynonymous and synonymous mutations are shown as solid and dashed lines,
respectively. Colors indicate the position of the site along the C2-V5 region
(red to blue). \comment{Maybe make a figure with synonymous and nonsynonymous
mutations shown separately.} While nonsynonymous mutations frequently fix, very few synonymous
175 mutations do even though they are frequently observed at intermediate
176 frequencies.}
177 \label{fig:aft}
178 \end{center}
179 \end{figure}
180
181 \citet{bunnik_autologous_2008} present a longitudinal dataset on the entire
182 \env~gene of 3 patients at $\sim 5$ time points with approximately 5-20
183 sequences each (see methods). Repeating the above analysis separately on the
C2-V5 region studied above and the remainder of \env{} reveals striking
differences (see \FIG{fixp}). Within C2-V5, these data fully confirm the
186 observations made in the data set by \citet{shankarappa_consistent_1999} (red
187 line). In the remainder of \env, however, observed synonymous mutations behave
188 as if they were neutral (orange line).
189
190 %ARE OBSERVED SYNONYMOUS MUTATIONS OUTSIDE C2-V5 NEUTRAL? (?? SOME!)
191 %DOES LOSS/FIX CORRELATE WITH CONSERVATION? YES.
192 %MAYBE WE COULD HAVE ONE -- COMPLETELY CIRCULAR -- FIGURE SHOWING LOSS/FIX VS CONSERVATION: SUPPLEMENTARY?
193 %CAN WE LOOK AT THE AVERAGE LEVEL OF CONSERVATION STRATIFIED BY MAX FREQ? TRICKY: the maximal freq is achieved by hitchhiking...
194
195 These observations suggest that many of the synonymous polymorphisms in the part
196 of \env~that includes the hypervariable regions are deleterious, while outside
this region polymorphisms are roughly neutral.
198
199 \begin{figure}
200 \begin{center}
201 \subfloat{\includegraphics[width=0.9\linewidth]{Shankarappa_fix_loss_dt_times}
202 \label{fig:fixp1}}\\
203 \subfloat{\includegraphics[width=0.9\linewidth]{Bunnik2008_fixmid_syn_ShankanonShanka}
204 \label{fig:fixp2}}
205 \caption{Left panel: time course of loss and fixation of synonymous mutations
206 observed in a frequency interval $\nu_0$. The ultimate fraction of synonymous
207 mutations that fix as a function of intermediate frequency $\nu_0$ is the
208 fixation probability. Right panel: fixation probability of derived synonymous
209 alleles is strongly suppressed in C2-V5 versus other parts of the {\it env}
210 gene, and of nonsynonymous ones. Data from
211 Refs.~\cite{shankarappa_consistent_1999, bunnik_autologous_2008}.}
212 \label{fig:fixp}
213 \end{center}
214 \end{figure}
215
216 \subsection{Synonymous mutations in C2-V5 tend to disrupt conserved RNA stems}
217
One possible {\it a priori} explanation for the lack of fixation of synonymous
mutations in C2-V5 is the presence of secondary structures in the viral RNA. If any RNA
secondary structures are relevant for HIV replication, mutations in nucleotides
involved in their base pairs are expected to be deleterious and to revert
222 preferentially. Many functionally important secondary structure elements have
223 been characterized, including the RRE~\citep{fernandes_hiv-1_2012} and the 5'
224 UTR pseudoknot interacting with the host
225 tRNA$^\text{Lys3}$~\citep{barat_interaction_1991, paillart_vitro_2002}. It has
been suggested early on that parts of the viral genome that have the potential to
form stems are better conserved than the
228 remainder~\citep{forsdyke_reciprocal_1995}.
229
230 Recently, the propensity of nucleotides of the HIV genome to form base pairs has
231 been measured using the SHAPE assay (a biochemical reaction preferentially
232 altering unpaired bases)~\citep{watts_architecture_2009}. The SHAPE assay has
233 shown that the variable regions V1 to V5 tend to be unpaired, while the
234 conserved regions between those variable regions form stems. We partition all
synonymous alleles observed at intermediate frequencies above 10-15\% according
to their eventual fate (fixation or extinction). Subsequently, we align our
237 sequences to the reference NL4-3 strain used in
238 ref.~\citep{watts_architecture_2009} and assign them SHAPE reactivities. As
239 shown in \FIG{SHAPEA} in a cumulative histogram, the reactivities of fixed
alleles (red line) are systematically larger than those of alleles that are doomed to
241 extinction (blue line) (Kolmogorov-Smirnov test, $P\approx
242 2~\text{\textperthousand}$). In other words, alleles that are likely to be
243 breaking RNA helices are also more likely to revert and finally be lost from the
244 population. As a control, the average over non-observed but potentially
245 available polymorphisms lies between the two curves (green line), as expected
246 (because only some of them will be helix breakers). Furthermore, as a
247 complementary analysis, we split the synonymous mutations in the extended V1-V5
248 region further into conserved and variable regions and found that the biggest
249 depression in fixation probability is observed in the conserved stems, while the
250 variable loops show little deviation from the neutral signature, see
251 \FIG{SHAPEB}.
252
253 In addition to RNA secondary structure, we have considered other possible
254 explanations for a fitness effect of synonymous mutations, in particular codon
usage bias (CUB). HIV is known to use A-rich codons more often than highly
expressed human housekeeping genes~\citep{jenkins_extent_2003}. Moreover, codon-optimized
257 and -pessimized viruses have recently been generated and shown to replicate
258 better or worse than wild type strains,
259 respectively~\citep{li_codon-usage-based_2012, ngumbela_quantitative_2008,
260 coleman_virus_2008}. We do not find, however, evidence for any contribution of
261 CUB to the ultimate fate of synonymous alleles. Several lines of thought support
262 this result. First of all, although codon-optimized HIV seems to perform better
263 {\it in vitro}, the distance in CUB between HIV and human genes is not shrinking
264 at the macroevolutionary level. Second, within a single patient, we do not
265 observe any bias towards more human-like CUB in the synonymous mutations that
reach fixation rather than extinction. Third, it is common for retroviruses to
use codons that differ from those of their hosts, and CUB effects
268 on fitness are thought to be so small that divergent nucleotide composition has
269 been suggested as a possible mechanism for viral
270 speciation~\citep{bronson_nucleotide_1994}. Fourth, CUB in the V1-C5 region is
271 not very different from other parts of the HIV genome, whereas the reduced
272 fixation probability is only observed there. In conclusion, although we cannot
273 exclude an effect of CUB on fitness as a general rule, we expect it to be a
274 minor effect in our context.
275 \begin{figure}
276 \begin{center}
277 \subfloat{\includegraphics[width=0.9\linewidth]{mixed_Shankarappa_Bunnik2008_Liu_fixation_reactivity_Vandflanking_fromSHAPE}
278 \label{fig:SHAPEA}}\\
279 \subfloat{\includegraphics[width=0.9\linewidth]{Shankarappa_fixmid_syn_V_regions.pdf}\label{fig:SHAPEB}}
280 \caption{Watts et al. have measured the reactivity of HIV nucleotides to {\it
281 in vitro} chemical attack and shown that some nucleotides are more likely to
282 be involved in RNA secondary folds. C1-C5 regions, in particular, show
283 conserved stem-loop structures~\citep{watts_architecture_2009}. We show that
284 among all derived alleles in those regions reaching frequencies of order one,
there is a negative correlation between fixation and involvement in base
pairing in an RNA stem (left panel). The rest of the genome does not show any
287 correlation (right panel). There might be too few silent polymorphisms in the
288 first place, or the signal might be masked by non-functional RNA
289 structures. Data from Refs.~\cite{shankarappa_consistent_1999,
290 bunnik_autologous_2008, liu_selection_2006}.}
291 \label{fig:SHAPE}
292 \end{center}
293 \end{figure}
294
295
\subsection{Deleterious mutations are brought to high frequency by hitchhiking}
297
298 While the observation that some fraction of synonymous mutations is deleterious
299 is not unexpected, it seems odd that we observe them at high population
300 frequency -- at least in some regions of the genome. The region of \env~ in
301 which we observe deleterious mutations at high frequency, however, is special in
302 that it undergoes frequent adaptive changes to evade recognition by neutralizing
303 antibodies~\cite{williamson_adaptation_2003}. Due to the limited amount of
304 recombination in HIV~\cite{neher_recombination_2010,batorsky_estimate_2011},
305 deleterious mutations that are linked to adaptive variants can reach high
306 frequency~\citep{smith_hitch-hiking_1974}.
307
308 The potential for hitchhiking is already apparent from the allele frequency
309 trajectories in \FIG{aft}, where many mutations appear to change rapidly in
310 frequency as a flock. Deleterious synonymous mutations can be amplified
311 exponentially by selection on linked nonsynonymous sites, a process known as
312 {\it genetic draft}~\citep{gillespie_genetic_2000, neher_genetic_2011}. In order
313 to be advected to high frequency by a linked adaptive mutation, the deleterious
314 effect of the mutation has to be substantially smaller than the adaptive effect.
315 The latter was estimated to be on the order of $s_a = 0.01$ per day~\citep{neher_recombination_2010}.
The approximate magnitude of the deleterious effects can be estimated from
\FIG{fixp1}, which shows the distribution of times for synonymous
alleles to reach fixation or be lost starting from intermediate frequencies. The
319 typical time to loss is of the order of 500 days. If this loss is driven by the
320 deleterious effect of the mutation, this corresponds to deleterious effects of
321 roughly $s_d \sim - 0.002$ per day.
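
The order-of-magnitude estimate above can be spelled out as follows. As a rough
sketch, assume that a deleterious allele that is no longer supported by a
linked sweep decays exponentially, $\nu(t) \approx \nu_0 \, e^{-|s_d| t}$, so
that the typical time to loss $t_\text{loss}$ (a symbol introduced here for
illustration) is of order $1/|s_d|$, up to a logarithmic factor in $\nu_0$:
\begin{equation}
|s_d| \sim \frac{1}{t_\text{loss}} \approx \frac{1}{500~\text{days}} = 0.002~\text{per day}.
\end{equation}
This is about five times smaller than $s_a \approx 0.01$ per day, in line with
the requirement that the deleterious effect be smaller than the adaptive one.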
322
323 To get a better idea of the range of parameters that are compatible with the
324 observations and our interpretation, we perform computer simulations of
325 evolving viral populations under selection and rare recombination. For this
326 purpose, we use the recently published package FFPopSim, which includes a module
327 dedicated to intrapatient HIV evolution~\citep{zanini_ffpopsim:_2012}. We
328 analyze many combinations of parameters such as population size, recombination
rate, selection coefficient and density of escape mutations, and deleterious effect
of synonymous mutations.
331
332 The main result of the simulations is that genetic draft can indeed bring weakly
333 deleterious mutations to high frequencies and result in a dependence of the
334 fixation probability on initial frequency that is compatible with observations.
First of all, since neutral mutations are much more likely to rise to high
frequency than deleterious ones, the majority of the synonymous mutations needs
to be slightly deleterious in order to observe a significant reduction of $P_\text{fix}$.
338 In order to further quantify the reduction in fixation probability, we look at
339 the difference between the neutral curve ($P_\text{fix}(\nu) = \nu$) and the
340 measured fixation probability and calculate its area (see inset of
341 \FIG{simfixpvar}). The minimal and maximal values for this area are zero
342 (neutral-like curve) and 0.5 (no fixation at all), respectively. The HIV data
343 correspond to an area under the diagonal of approximately 0.2 for synonymous
344 changes, and a very small area over the diagonal for nonsynonymous changes.
345 Various simulation curves are shown in \FIG{simfixpvar}. Then, in
346 \FIGS{simheat1}{simheat2}, we explore the parameter space: the combinations that
347 yield areas close to the experimental result are roughly indicated by ellipses.
348 The two crucial parameters that control the fixation probability are the
349 following: (a) the deleterious effects of hitchhikers compared to the beneficial
350 effects of escape mutants, and (b) the density of escape mutations. Intuitively,
351 a higher density of escape mutations (i.e., epitopes) enables a larger degree of
352 genetic draft, because escape mutations from different epitopes start to combine
353 and their effects add up. In \FIG{simheat1}, we show that this is indeed the
354 case in simulations.
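
The area used above can be written explicitly. Denoting it by $A$ (a symbol
introduced here for convenience) and taking it as positive when the measured
curve lies below the diagonal,
\begin{equation}
A = \int_0^1 \left[ \nu - P_\text{fix}(\nu) \right] d\nu ,
\end{equation}
so that $A = 0$ for a neutral curve, $P_\text{fix}(\nu) = \nu$, and
$A = \int_0^1 \nu \, d\nu = 1/2$ if no mutation fixes, $P_\text{fix}(\nu) = 0$.
With this sign convention, the synonymous HIV data correspond to $A \approx 0.2$.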
355
356 \begin{figure}
357 \begin{center}
358 \subfloat{\includegraphics[width=0.9\linewidth]{fixation_loss_shortgenome_distance_ada_frac_del_eff_coi_various.pdf}
359 \label{fig:simfixpvar}}\\
360 \subfloat{\includegraphics[width=0.9\linewidth]{fixation_loss_shortgenome_area_ada_frac_del_eff_coi_0_01_nescepi_6_heat.pdf}
361 \label{fig:simheat1}}\\
362 \subfloat{\includegraphics[width=0.9\linewidth]{fixation_loss_shortgenome_area_ada_frac_del_eff_coi_0_01_nescepi_6_nonsyn_heat.pdf}
363 \label{fig:simheat2}}
364 \caption{\comment{USE A GOOD FIG WITH THE CORRECT $s_a$!} The depression in $P_\text{fix}$ depends on the deleterious effect size
365 of the synonymous alleles (panel A). Simulations on the escape competition
366 scenario show that the density of selective sweeps and the size of the
367 deleterious effects of synonymous mutations are the main driving forces of the
368 phenomenon. A convex fixation probability is recovered, as seen in the data,
369 along the diagonal (panel B): more dense sweeps can support more deleterious
370 linked mutations. The density of sweeps is limited, however, by the
371 nonsynonymous fixation probability, which is quite close to neutrality (panel
372 C). Moreover, strong competition between escape mutants is required, so that
373 several escape mutants are ``found'' by HIV within a few months of antibody
374 production.}
375 \label{fig:simheat}
376 \end{center}
377 \end{figure}
378
379 However, if hitchhiking is driven by nonsynonymous mutations that are
380 unconditionally beneficial, we should find that nonsynonymous mutations almost
381 always fix once they reach high frequencies -- in contrast with \FIG{fixp} that
382 shows that nonsynonymous mutations fix as if they were neutral. We know,
383 however, that nonsynonymous variation in the variable regions is driven by
positive selection. Inspecting the trajectories of nonsynonymous mutations
suggests the rapid rise and fall of many alleles. We test two possible
386 mechanisms that are biologically plausible and could explain the transient rise
387 of nonsynonymous mutations: time-dependent selection and within-epitope
388 competition. If the immune system recognizes the escape mutant before its
389 fixation, the mutant might cease to be beneficial and disappear despite its
390 quick initial rise in frequency. In support of this idea,
391 \citet{richman_rapid_2003, bunnik_autologous_2008} report antibody responses to
escape mutants. These responses are delayed by a few months, roughly matching the
393 average sweep time of an escape mutant. Alternatively, several different escape
394 mutations in the same epitope can arise almost simultaneously and start to
395 spread. Their fitness benefits are not additive, because each of them is
396 essentially sufficient to escape. As a consequence, several escape mutations rise to
397 high frequency, while the escape with the smallest cost in terms of replication,
398 packaging, etc. is most likely to
eventually fix. In simulations, this kind of epistatic interaction within
400 epitopes reduces the fixation probability. The emergence of
401 multiple sweeping nonsynonymous mutations in real HIV infections has been shown
402 previously~\citep{moore_limited_2009, bar_early_2012}.
403 See the supplementary material for examples of successful simulations in both scenarios.
404
405 \section{Discussion}
406 Despite several known functional roles for RNA secondary structure in the HIV
407 genome, synonymous mutations are often used as approximately neutral markers in
408 evolutionary studies of viruses. We have shown that the majority of synonymous
409 mutations in the conserved regions C2-C5 of the \env~gene are deleterious.
Comparison with recent biochemical studies of the base-pairing propensity of bases in the RNA
genome suggests that these mutations are deleterious, at least in part, because they disrupt
412 stems in RNA secondary structures. Furthermore, we provide evidence that these
413 mutations are brought to high frequency through linkage to adaptive mutations.
414 The latter mutations are only transiently adaptive, either through a
415 coevolution with the immune system or redundant escape within an epitope.
416
Our observations and conclusions rely heavily on longitudinal data in which the
418 dynamics of mutations can be explicitly observed. The fact that deleterious
419 mutations can be brought to high frequencies through hitchhiking underscores
420 the intensity of the coevolution with the immune system. The fact that
421 multiple escape mutations in the same epitope -- as is indeed observed in
422 studies of antibody escape~\citep{moore_limited_2009, bar_early_2012} -- are
423 necessary to explain the patterns of fixation of nonsynonymous mutations points
towards a large population size that rapidly discovers adaptive mutations. A
425 similar point has been made recently by Boltz {\it et al.} in the context of
426 preexisting drug resistance mutations~\citep{boltz_ultrasensitive_2012}.
427
428 The observed hitchhiking highlights the importance of linkage due to infrequent
429 recombination for the evolution of HIV
430 \citep{neher_recombination_2010,batorsky_estimate_2011,
431 josefsson_majority_2011}. The recombination rate has been estimated to be on the
order of $\rho = 10^{-5}$ per base and day. It takes roughly $t_{sw} = s_a^{-1}
|\log \nu_0|$ generations for an adaptive mutation with growth rate $s_a$ to rise
from an initially low frequency $\nu_0\sim \mu$ to frequency one. This implies
that a region of length $l = (\rho t_{sw})^{-1} = s_a / (\rho |\log \nu_0|)$ remains
linked to the adaptive mutation. With $s_a=0.01$, $l\approx 100$ bases, which is
consistent with strong linkage between the variable loops and the stems in
between. Furthermore, we do not expect hitchhiking to extend far beyond
the variable regions, consistent with the lack of signal outside of C1-V5. In
440 case of much stronger selection -- such as observed during early CTL escape or
441 drug resistance evolution -- the linked region is of course much larger.
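
For concreteness, the estimate of the linked region can be evaluated
numerically. The value $\nu_0 \sim \mu \sim 10^{-5}$ used here is an assumption
made for illustration only; it enters through the natural logarithm, so the
result is not sensitive to it:
\begin{equation}
t_{sw} = \frac{|\log \nu_0|}{s_a} \approx \frac{11.5}{0.01} \approx 1.15 \times 10^{3},
\qquad
l = \frac{1}{\rho \, t_{sw}} \approx \frac{1}{10^{-5} \times 1.15 \times 10^{3}} \approx 87~\text{bases},
\end{equation}
i.e., of the order of 100 bases.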
442
443 The functional significance of the insulating RNA structure stems between the
hypervariable loops has been proposed
445 previously~\citep{watts_architecture_2009, sanjuan_interplay_2011}.
446 \citet{sanjuan_interplay_2011} have shown that insulating stems are relevant for
viral fitness {\it in vivo}. Our analysis is limited by the availability of
longitudinal data, which requires a focus on the variable regions of \env.
Conserved RNA structures most likely exist in different parts
of the HIV genome (several are known). In the absence of repeated adaptive substitutions in the vicinity
451 that cause hitchhiking, the deleterious synonymous mutations remain at low
452 frequencies and can only be observed by deep sequencing methods.
453
454 As far as population genetics models are concerned, our study uncovers the
455 subtle balance of evolutionary forces governing intrapatient HIV evolution. The
fixation and extinction times and probabilities represent rich yet simple
summary statistics against which sequencing data and computer simulations can be tested. A
458 similar method has been recently used in a longitudinal study of
459 influenza~\citep{strelkowa_clonal_2012}. The propagators suggested in that
460 paper, however, represent ratios between (certain kinds of) nonsynonymous
461 mutations and synonymous ones, hence they are inadequate to investigate
462 synonymous changes themselves. Those authors also conclude that several
463 beneficial mutations segregate simultaneously in influenza, a scenario
464 remarkably similar to our within-epitope competition picture. These results
465 jointly suggest that viral evolution proceeds by multiple concurrent sweeps
rather than by successive fixation~\citep{desai_beneficial_2007, neher_rate_2010}.
467
468 Finally, our results emphasize the inadequacy of independent site
469 models of HIV evolution, especially in the light of transient effects on
470 sweeping sites, such as time-dependent selection and within-epitope negative
471 epistasis. Although a final word about which mechanism is more
472 widespread is yet to be spoken, both intuition and biological evidence from the
473 literature support a mixed scenario~\citep{richman_rapid_2003,
474 moore_limited_2009, bar_early_2012}. Note also that, unlike influenza, HIV does
recombine, if rarely; hence clonal interference as studied in
476 ref.~\citep{strelkowa_clonal_2012} is only a short-term effect.
477
478 \section{Methods}
479 \subsection{Sequence data collection}
480 Longitudinal intrapatient viral RNA sequences were collected for published
481 studies~\citep{shankarappa_consistent_1999,
482 liu_selection_2006, bunnik_autologous_2008} and downloaded from the Los Alamos
483 National Laboratory (LANL) HIV sequence database~\citep{LANL2012}. The sequences from
some patients showed signs of HIV compartmentalization into subpopulations and
were discarded; a grand total of 11
patients with approximately 6 time points each and 10 sequences per time point
were analyzed. The time interval between two consecutive samples
was approximately 6 to 18 months.
489
\subsection{Sequence analysis}
The retained sequences were aligned within each patient
492 via the translated amino acid sequence, using
493 Muscle~\citep{edgar_muscle:_2004}, and to the NL4-3 reference sequence probed
494 by \citet{watts_architecture_2009}. Within each patient, a consensus RNA
495 sequence at the first time point was used to classify alleles as ancestral or
496 derived at all sites. Problematic sites that included large frequencies of gaps
497 were excluded from the analysis, because variable regions are known to be
498 subject to frequent indels, while our analysis is limited to nucleotide
499 substitutions. Time series of allele frequencies were extracted from the
500 sequences.
501
A mutation was classified as synonymous or nonsynonymous using the standard
genetic code, provided the rest of the codon was in the ancestral state. Cases where more
504 than one mutation within the codon was observed were discarded. Slightly
505 different criteria for synonymous/nonsynonymous discrimination yielded similar
506 results.
507
508 \subsection{Fixation probability and secondary structure}
509 For the estimate of times to fixation/extinction, polymorphisms were
510 binned by frequency and the time to reaching the first boundary (fixation or
511 extinction) was stored. For the fixation probability, the long-time limit of the
512 resulting curves was used, excluding polymorphisms that arose late in the
513 clinical history (and would have had no time to reach either boundary).
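
One way to write the estimator, given here as a sketch with notation introduced
for illustration: if $n_\text{fix}(\nu_0)$ and $n_\text{lost}(\nu_0)$ denote
the numbers of polymorphisms in the frequency bin around $\nu_0$ that
eventually reach fixation or extinction, respectively, the long-time limit
corresponds to
\begin{equation}
\hat{P}_\text{fix}(\nu_0) = \frac{n_\text{fix}(\nu_0)}{n_\text{fix}(\nu_0) + n_\text{lost}(\nu_0)} ,
\end{equation}
where polymorphisms that are still segregating at the last time point do not
contribute.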
514
515 For the correlation analysis with RNA secondary structure, the SHAPE scores were
516 downloaded from the journal website~\citep{watts_architecture_2009}. By virtue
517 of the alignment of the longitudinal sequences with the reference used by
518 \citet{watts_architecture_2009}, SHAPE reactivities were assigned to most sites.
519 Problematic assignments in indel-rich regions were excluded from the analysis.
520 In order to restrict the analysis to synonymous polymorphisms, a lower frequency
521 threshold of 0.15 was used (other thresholds yielded the same results). Since
522 very few polymorphisms hitchhike beyond, say, a frequency of 0.5, this pool is
enriched for to-be-lost mutations; hence the ``lost'' curve in \FIG{SHAPEA}
contains many more points than the ``fixed'' one.
525
526 The V loops and flanking regions were identified manually starting from the
527 annotated reference HXB2 sequence from the LANL HIV database~\citep{LANL2012}. A
528 similar approach was used to label the C2-V5 region sequenced in
529 ref.~\citep{shankarappa_consistent_1999}.
530
531 \subsection{Computer simulations}
532 Simulations were performed using the recently published software
533 FFPopSim~\citep{zanini_ffpopsim:_2012}. Both full-length HIV genomes and
534 \env{}-only simulations were performed and yielded comparable results. For each
535 set of parameters, approximately 100 simulation runs were averaged over. In each
536 run, a random fitness landscape with specified statistical properties (e.g.
537 density of beneficial sites, average deleterious effect of synonymous changes) was generated.
538 Although the curves shown in \FIG{simfixpvar} are not very smooth, small
539 parameter changes resulted in overall consistent trends across many repetitions.
540
541 For the discussion of simulation parameters, the areas below or above the neutral
542 diagonal were estimated from the binned fixation probabilities using the linear
543 interpolation between the bin centers. This measure is sufficiently precise for
our purposes, because the HIV data are quite scarce themselves.
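
Concretely, and with notation introduced here for illustration, let $A$ be the
area between the neutral diagonal and the measured curve, $\nu_k$ the bin
centers, and $d_k = \nu_k - \hat{P}_\text{fix}(\nu_k)$ the deviation of the
binned fixation probability $\hat{P}_\text{fix}$ from the diagonal in bin $k$.
Linear interpolation between bin centers then amounts to the trapezoidal rule,
\begin{equation}
A \approx \sum_{k} \frac{d_k + d_{k+1}}{2} \, (\nu_{k+1} - \nu_k) ,
\end{equation}
which is positive for curves below the diagonal and negative for curves above
it. Whether the end points $\nu = 0$ and $\nu = 1$, where the deviation
vanishes, are included in the sum is left open in this sketch.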
545
546 \section*{Acknowledgements}
547 \comment{to be written\dots}
548
549
550 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
551 \bibliographystyle{natbib}
552 \bibliography{bib}
553 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
554 \end{document}
555 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
556