Methods
[synmut.git] / synmut.tex
1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2 % ARTICLE ABOUT FATE OF SYNONYMOUS MUTATIONS IN HIV
3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4 \documentclass[rmp, twocolumn]{revtex4}
5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
7 \newcommand{\Author}{Fabio~Zanini and Richard~A.~Neher}
8 \newcommand{\Title}{Deleterious synonymous mutations hitchhike to high frequency in HIV \env~evolution}
9 \newcommand{\Keywords}{{HIV}, {synonymous}, {population genetics}}
10 \usepackage[english]{babel}
11 \usepackage[utf8x]{inputenc}
12 \usepackage{amsmath,amsfonts,amssymb,eucal,eurosym,textcomp}
13 \usepackage{color}
14 \usepackage{graphicx}
15 \usepackage[caption=false]{subfig}
16 \usepackage{natbib}
17 \usepackage{pslatex}
18 \usepackage[colorlinks,linkcolor=red,citecolor=red]{hyperref}
19 \hypersetup{pdfauthor={\Author}, pdftitle={\Title}, pdfkeywords={\Keywords}}
20 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
21 \graphicspath{{./figures/}}
22 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
23 %\DeclareMathOperator\de{d\!}
24 \newcommand{\comment}[1]{\textit{\textcolor{red}{#1}}}
25 \newcommand{\mut}{\mu}
26 \newcommand{\mfit}{\langle F\rangle}
27 \newcommand{\mexpfit}{\langle e^{F}\rangle}
28 \newcommand{\ox}{r}
29 \newcommand{\co}{\rho}
30 \newcommand{\gt}{g}
31 \newcommand{\locus}{s}
32 \newcommand{\locuspm}{t}
33 \newcommand{\OO}{\mathcal{O}}
34 \newcommand{\env}{\textit{env}}
35 \newcommand{\rev}{\textit{rev}}
36 \newcommand{\FIG}[1]{Fig.~\ref{fig:#1}}
37 \newcommand{\FIGS}[2]{Figs.~\ref{fig:#1} and~\ref{fig:#2}}
38
39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
40 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
41 \begin{document}
42 \title{\Title}
43 \author{\Author}
44 \date{\today}
45 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
46
47 \begin{abstract}
48 \noindent
49 Intrapatient HIV evolution is goverened by selection on the protein level in the
50 arms race with the immune system (killer T-cells and antibodies). Synonymous
51 mutations do not have an immunity-related phenotype and are often assumed to be
52 neutral. In this paper, we show that synonymous changes in epitope-rich regions
53 are often deleterious but still reach frequencies of order one. We analyze time
54 series of viral sequences from the V1-C5 part of {\it env} within individual
55 hosts and observe that synonymous derived alleles rarely fix in the
56 viral population. Simulations suggest that such synonymous mutations
57 have a (Malthisuan) selection coefficient of the order of $-0.001$, and that
58 they are brought up to high frequency by linkage to neighbouring beneficial
59 nonsynonymous alleles (genetic draft). As far as the biological causes are
60 concerned, we detect a negative correlation between fixation of an allele and
61 its involvement in evolutionarily conserved RNA stem-loop structures.
62 This phenonenon is not observed in other parts of the HIV genome, in which
63 selective sweeps are less dense and the genetic architecture less constrained.
64 \end{abstract}
65
66
67 \maketitle
68
69 \section{Introduction}
70 HIV evolves rapidly within a single host during the course of the infection.
71 This evolution is driven by strong selection imposed by the host immune system
72 via killer T cells (CTLs) and neutralizing antibodies
73 (AB)~\citep{pantaleo_immunopathogenesis_1996} and facilitated by the high
74 mutation rate of HIV~\citep{mansky_lower_1995}. When the host develops a CTL or
75 AB response against a particular HIV epitope, mutations in the viral genome that reduce or prevent
76 recognition of the epitope frequently emerge. Escape mutations in epitopes
77 targeted by CTLs typically evolve during early infection and spread rapidly through
78 the population~\citep{mcmichael_immune_2009}. During chronic infection, the most
79 rapidly evolving part of the HIV genome are the so called variable loops of the
80 envelope protein gp120, which need to avoid recognition by neutralizing ABs.
81 Mutations in \env~, the gene encoding for gp120, spread through the population
82 within a few months (see \figurename~\ref{fig:aft}, solid lines).
83 The (Malthusian) effect size of these beneficial mutations is of the
84 order of $s_a \sim 0.01$~\citep{neher_recombination_2010}.
85
86 These escape mutations are strongly selected for their effect on the amino acid
87 sequence of the viral proteins. Conversely, synonymous mutations are commonly
88 used as approximate neutral markers in studies of viral evolution. Neutral
89 markers are very useful in practice, because they can be used to make inferences
90 about the stochastic forces driving evolution~\citep{yang_statistical_2000}.
91 The viral genome, however, needs to satisfy further constraints in addition to
92 immune escape, such as efficient processing and translation, nuclear export, and
93 packaging into the viral capsid: all these processes operate at the RNA level
94 and are sensitive to synonymous changes. A
95 few functionally important RNA elements are well characterized. For example, a certain RNA
96 sequence, called \rev{} response element (RRE), is used by HIV to enhance
97 nuclear export of some of its transcripts~\citep{fernandes_hiv-1_2012}. Another
98 well studied case is the interaction between viral reverse transcriptase, viral
99 ssRNA, and the host tRNA$^\text{Lys3}$: the latter is required for priming
100 reverse transcription (RT) and bound by a specifical pseudoknotted RNA structure
101 in the viral 5' untranslated region~\citep{barat_interaction_1991,
102 paillart_vitro_2002}. Furthermore, recent studies have shown that genetically
103 engineered HIV strains with skewed codon usage bias (CUB) patterns towards more
104 or less abundant tRNAs replicate better or worse,
105 respectively~\citep{ngumbela_quantitative_2008, li_codon-usage-based_2012}.
106 Purifying selection beyond the protein sequence is therefore expected, while it
107 seems reasonable that the bulk of positive selection through the immune system
108 should be restricted to amino acid sequences.
109
110 INFLUENZA PSEUDO VACCINE.
111
112 %SYNONYMOUS CONSERVATION. DO WE HAVE A PLOT OF GENOME WIDE CONSERVATION, MAYBE
113 %FOR SUPPLEMENT? YES
114
115 Here, we characterize the dynamics of synonymous mutations in \env{} and show
116 that a substantial fraction of these mutations are deleterious. We further show
117 that, although such synonymous mutations cannot be used as neutral markers, the
118 degree to which they hitchhike with nearby nonsynonymous mutations is very
119 informative. Their ability to hitchhike for extended times, which is a core
120 requirement for our analysis, is rooted in the small recombination rate of
121 HIV~\citep{neher_recombination_2010, batorsky_estimate_2011}. Extending the
122 analysis of fixation probabilities to the nonsynonymous mutations, we show that
123 time dependent selection or strong competition of escape mutations inside the
124 same epitope are necessary to explain the observed patterns of fixation and
125 loss.
126
127 \section{Results}
128 The central quantity we investigate is the probability of fixation
129 of a mutation, conditional on its population frequency.
130 A neutral mutation segregating at frequency $\nu$ has a probability $\nu$ to
131 spread through the population and fix; in the rest of the cases, i.e. with probability
132 $1-\nu$, it goes extinct. This is a simple consequence of the fact that (a) exactly
133 one of the $N$ individuals in the current population will be the common ancestor
134 of the entire future population at a
135 particular locus and (b) this ancestor has a probability $\nu$ of carrying the
136 mutation, see illustration in \FIG{fixp}. Deleterious or beneficial
137 mutations fix less or more often than neutral ones, respectively. Time series
138 sequence data enable a direct observation of both the current frequency $\nu$ of
139 any particular mutation and its future fate (fixation or extinction).
140 They therefore represent a simple way to investigate average properties
141 of different classes of mutations.
142
143 \subsection{Synonymous polymorphisms in \env, C2-V5 are mostly deleterious}
144
145 \FIG{aft} shows time series data of the frequencies of all mutations observed
146 \env~, C2-V5, in patient p8~\citep{shankarappa_consistent_1999,
147 liu_selection_2006}. Despite many synonymous mutations reaching high frequency,
148 very few fix. This observation is quantified in panels \ref{fig:fixp1} and
149 \ref{fig:fixp2}, which stratify the data of 7-10 patients according to the
150 frequency at which different mutations are observed (see methods). Considering
151 all mutations in a frequency interval around $\nu_0$ at some time $t_i$, we
152 calculate the fraction that is found at frequency 1, at frequency 0, or at
153 intermediate frequency at later times $t_f$. Plotting these fixed, lost, and
154 polymorphic fraction against the time interval $t_f-t_i$, we see that most
155 synonymous mutations segregate for roughly one year and are lost much more
156 frequently than expected. The long-time probability of fixation versus
157 extinction is shown as a function of the initial frequency $\nu_0$ in
158 panel~\ref{fig:fixp2}. In contrast to synonymous mutations, the nonsynonymous
159 seem to follow more a less the neutral expectation -- a point to which we will
160 come back below.
161 \begin{figure}
162 \begin{center}
163 \includegraphics[width=\linewidth]{Shankarappa_allele_freqs_trajectories_syn_nonsynp8}
164 \caption{Synonymous mutations rarely fix in \env, C2-V5: mutation frequency
165 trajectories observed in patient 8~\cite{shankarappa_consistent_1999};
166 Nonsynonymous and synonymous mutations are shown as solid and dashed lines,
167 respectively. Colors indicate the position of the site along the C2-V5 region
168 (red to blue) MAYBE MAKE FIGURE WITH SYNONYMOUS AND NONSYN
169 SEPARATELY. While nonsynonymous mutations frequently fix, very few synonymous
170 mutations do even though they are frequently observed at intermediate
171 frequencies.}
172 \label{fig:aft}
173 \end{center}
174 \end{figure}
175
176 \citet{bunnik_autologous_2008} present a longitudinal dataset on the entire
177 \env~gene of 3 patients at $\sim 5$ time points with approximately 5-20
178 sequences each (see methods). Repeating the above analysis separately on the C2-V5 region
179 studied above and the remainder of \env~ reveal strikingly different behavior
180 inside and outside the hypervariable region. Within C2-V5, this data fully
181 confirms the observations made in the data set by
182 \citet{shankarappa_consistent_1999}. In the remainder of \env, however, observed
183 synonymous mutations behave as if they were neutral; see \FIG{fixp}.
184
185 %ARE OBSERVED SYNONYMOUS MUTATIONS OUTSIDE C2-V5 NEUTRAL? (?? SOME!)
186 %DOES LOSS/FIX CORRELATE WITH CONSERVATION? YES.
187 %MAYBE WE COULD HAVE ONE -- COMPLETELY CIRCULAR -- FIGURE SHOWING LOSS/FIX VS CONSERVATION: YES.
188 %CAN WE LOOK AT THE AVERAGE LEVEL OF CONSERVATION STRATIFIED BY MAX FREQ? TRICKY: the maximal freq is achieved by hitchhiking...
189
190 These observations suggest that many of the synonymous polymorphisms in the part
191 of \env~that includes the hypervariable regions are deleterious, while outside
192 this regions polymorphisms are mostly roughly neutral.
193
194 \begin{figure}
195 \begin{center}
196 \subfloat{\includegraphics[width=0.9\linewidth]{Shankarappa_fix_loss_dt_times}
197 \label{fig:fixp1}}\\
198 \subfloat{\includegraphics[width=0.9\linewidth]{Bunnik2008_fixmid_syn_ShankanonShanka}
199 \label{fig:fixp2}}
200 \caption{Left panel: time course of loss and fixation of synonymous mutations
201 observed in a frequency interval $\nu_0$. The ultimate fraction of synonymous
202 mutations that fix as a function of intermediate frequency $\nu_0$ is the
203 fixation probability. Right panel: fixation probability of derived synonymous
204 alleles is strongly suppressed in C2-V5 versus other parts of the {\it env}
205 gene, and of nonsynonymous ones. Data from
206 Refs.~\cite{shankarappa_consistent_1999, bunnik_autologous_2008}.}
207 \label{fig:fixp}
208 \end{center}
209 \end{figure}
210
211
212 \subsection{Synonymous mutations in C2-V5 tend to disrupt conserved RNA stems}
213 One possible {\it a priori} explanation for lack of fixation of synonymous
214 mutations in C2-V5 are secondary structures in the viral RNA. If any RNA
215 secondary structures are relevant for HIV replication, mutations in nucleotides
216 involved in those base pairs are expected to be deleterious and to revert
217 preferentially. Many functionally important secondary structure elements have
218 been characterized, including the RRE~\citep{fernandes_hiv-1_2012} the
219 5' UTR pseudoknot interacting with thehost tRNA$^\text{Lys3}$~\citep{barat_interaction_1991,
220 paillart_vitro_2002}. It has been suggested early on that parts of the viral
221 genome that has the potential to form stems is better conserved that the
222 remainder~\citep{forsdyke_reciprocal_1995}.
223
224 Recently, the propensity of nucleotides of the HIV genome to form base pairs has
225 been measured using the SHAPE assay (a biochemical reaction preferentially
226 altering unpaired bases)~\citep{watts_architecture_2009}. The SHAPE assay has
227 shown that the variable regions V1 to V5 tend to be unpaired, while the
228 conserved regions between those variable regions form stems. We partition all
229 synonymous alleles observed at intermediate frequencies above 10-15\% depending
230 on their final destiny (fixation or extinction). Subsequently, we align our
231 sequences to the reference NL4-3 strain used in
232 ref.~\citep{watts_architecture_2009} and assign them SHAPE reactivities. As
233 shown in \FIG{SHAPEA} in a cumulative histogram, the reactivity of fixed alleles
234 are systematically larger than of alleles that are doomed to extinction
235 (Kolmogorov-Smirnov test, $P\approx 2~\text{\textperthousand}$). In other
236 words, alleles that are likely to be breaking RNA helices are also more likely
237 to revert and finally be lost from the population. As a complementary analysis,
238 we split the synonymous mutations in the C2-V5 region further into conserved and
239 variable regions and found that the biggest depression in fixation probability
240 is observed in the conserved stems, while the variable loops show little
241 deviation from the neutral signature, see \FIG{SHAPEB}.
242
243 In addition to RNA secondary structure, we have considered other possible
244 explanations for a fitness effect of synonymous mutations, in particular codon
245 usage bias (CUB). HIV is known to prefer A-rich codons over highly expressed
246 human housekeeping genes~\citep{jenkins_extent_2003}. Moreover, codon-optimized
247 and -pessimized viruses have recently been generated and shown to replicate
248 better or worse than wild type strains,
249 respectively~\citep{li_codon-usage-based_2012, ngumbela_quantitative_2008,
250 coleman_virus_2008}. We do not find, however, evidence for any contribution of
251 CUB to the ultimate fate of synonymous alleles. Several lines of thought support
252 this result. First of all, although codon-optimized HIV seems to perform better
253 {\it in vitro}, the distance in CUB between HIV and human genes is not shrinking
254 at the macroevolutionary level. Second, within a single patient, we do not
255 observe any bias towards more human-like CUB in the synonymous mutations that
256 reach fixation rather than extinction. Third, it is a common phenomenon for
257 retroviruses to use variously different codons from their hosts, and CUB effects
258 on fitness are thought to be so small that divergent nucleotide composition has
259 been suggested as a possible mechanism for viral
260 speciation~\citep{bronson_nucleotide_1994}. Fourth, CUB in the V1-C5 region is
261 not very different from other parts of the HIV genome, whereas the reduced
262 fixation probability is only observed there. In conclusion, although we cannot
263 exclude an effect of CUB on fitness as a general rule, we expect it to be a
264 minor effect in our context.
265 \begin{figure}
266 \begin{center}
267 \subfloat{\includegraphics[width=0.9\linewidth]{mixed_Shankarappa_Bunnik2008_Liu_fixation_reactivity_Vandflanking_fromSHAPE}
268 \label{fig:SHAPEA}}\\
269 \subfloat{\includegraphics[width=0.9\linewidth]{Shankarappa_fixmid_syn_V_regions.pdf}\label{fig:SHAPEB}}
270 \caption{Watts et al. have measured the reactivity of HIV nucleotides to {\it
271 in vitro} chemical attack and shown that some nucleotides are more likely to
272 be involved in RNA secondary folds. C1-C5 regions, in particular, show
273 conserved stem-loop structures~\citep{watts_architecture_2009}. We show that
274 among all derived alleles in those regions reaching frequencies of order one,
275 there is a negative correlation between fixation and involvement in a base
276 pairing in a RNA stem (left panel). The rest of the genome does not show any
277 correlation (right panel). There might be too few silent polymorphisms in the
278 first place, or the signal might be masked by non-functional RNA
279 structures. Data from Refs.~\cite{shankarappa_consistent_1999,
280 bunnik_autologous_2008, liu_selection_2006}.}
281 \label{fig:SHAPE}
282 \end{center}
283 \end{figure}
284
285
286 \subsection{Deleterious mutations are brought to high frequency by hitch-hiking}
287 While the observation that some fraction of synonymous mutations is deleterious
288 is not unexpected, it seems odd that we observe them at high population
289 frequency -- at least in some regions of the genome. The region of \env~ in which
290 we observe deleterious mutations at high frequency is special in that it
291 undergoes frequent adaptive changes to evade recognition by neutralizing
292 antibodies~\cite{williamson_adaptation_2003}. Due to the limited amount of
293 recombination in HIV~\cite{neher_recombination_2010,batorsky_estimate_2011},
294 deleterious mutations that are linked to adaptive variants can reach high
295 frequency~\citep{smith_hitch-hiking_1974}.
296
297 The potential for hitchhiking is already apparent from the allele frequency
298 trajectories in \FIG{aft}, where many mutations appear to change rapidly in
299 frequency as a flock. Deleterious synonymous mutations can be amplified
300 exponentially by selection on linked nonsynonymous sites, a process known as
301 {\it genetic draft}~\citep{gillespie_genetic_2000, neher_genetic_2011}. In order
302 to be advected to high frequency by a linked adaptive mutation, the deleterious
303 effect of the mutation has to be substantially smaller than the adaptive effect.
304 The latter was estimated to be on the order of $s_a = 0.01$ per day~\citep{neher_recombination_2010}.
305 The approximate magnitude of the deleterious effects can be estimated from
306 \FIG{fixp1}, that shows the distribution of times for synonymous
307 alleles to reach the fix or get lost starting from intermediate frequencies. The
308 typical time to loss is of the order of 500 days. If this loss is driven by the
309 deleterious effect of the mutation, this corresponds to deleterious effects of
310 roughly $s_d \sim - 0.002$ per day.
311
312 To get a better idea of the range of parameters that are compatible with the
313 observations and our interpretation, we perform computer simulations of
314 evolving viral populations under selection and rare recombination. For this
315 purpose, we use the recently published package FFPopSim, which includes a module
316 dedicated to intrapatient HIV evolution~\citep{zanini_ffpopsim:_2012}. We
317 analyze many combinations of parameters such as population size, recombination
318 rate, selection coefficient and density of escape mutations, deleterious effect
319 of synonymous mutation.
320
321 The main result of the simulations is that genetic draft can indeed bring weakly
322 deleterious mutations to high frequencies and result in a dependence of the
323 fixation probability on initial frequency that is compatible with observations.
324 First of all, since neutral mutations
325 are much more likely to rise to high frequency than deleterious ones, the
326 majority of the synonymous mutations needs to be slightly deleterious observe a
327 significant reduction of $P_\text{fix}$.
328 In order to further quantify the reduction in fixation probability, we look at the
329 difference between the neutral curve ($P(\nu) = \nu$) and the measured fixation
330 probability and calculate its area (see inset of \FIG{simfixpvar}). The minimal and maximal values for this
331 area are zero (neutral-like curve) and 0.5 (no fixation at all),
332 respectively. The convex curve seen in the HIV data corresponds to an area of
333 approximately 0.17. Various simulation curves are shown in \FIG{simfixpvar}, and
334 the area of the data curve is shown in the inset.
335 In \FIGS{simheat1}{simheat2}, we explore the parameter space: the combinations that yield areas close to the
336 experimental result are roughly indicated by ellipses. The two crucial parameters
337 that control the fixation probability are the following: (a) the deleterious
338 effects of hitchhikers compared to the beneficial effects of escape mutants, and
339 (b) the density of escape mutations. Intuitively, a higher density of escape
340 mutations (i.e., epitopes) enables a larger degree of genetic draft, because
341 escape mutations start to combine and their effects add up. In \FIG{simheat1},
342 we show that this is indeed the case in simulations.
343
344 \begin{figure}
345 \begin{center}
346 \subfloat{\includegraphics[width=0.9\linewidth]{fixation_loss_shortgenome_distance_ada_frac_del_eff_coi_various.pdf}
347 \label{fig:simfixpvar}}\\
348 \subfloat{\includegraphics[width=0.9\linewidth]{fixation_loss_shortgenome_area_ada_frac_del_eff_coi_0_01_nescepi_6_heat.pdf}
349 \label{fig:simheat1}}\\
350 \subfloat{\includegraphics[width=0.9\linewidth]{fixation_loss_shortgenome_area_ada_frac_del_eff_coi_0_01_nescepi_6_nonsyn_heat.pdf}
351 \label{fig:simheat2}}
352 \caption{The depression in $P_\text{fix}$ depends on the deleterious effect size
353 of the synonymous alleles (panel A). Simulations on the escape competition
354 scenario show that the density of selective sweeps and the size of the
355 deleterious effects of synonymous mutations are the main driving forces of the
356 phenomenon. A convex fixation probability is recovered, as seen in the data,
357 along the diagonal (panel B): more dense sweeps can support more deleterious
358 linked mutations. The density of sweeps is limited, however, by the
359 nonsynonymous fixation probability, which is quite close to neutrality (panel
360 C). Moreover, strong competition between escape mutants is required, so that
361 several escape mutants are ``found'' by HIV within a few months of antibody
362 production.}
363 \label{fig:simheat}
364 \end{center}
365 \end{figure}
366
367
368 However, if hitch-hiking is driven by nonsynonymous mutations that are
369 unconditionally beneficial, we should find that nonsynonymous mutations almost
370 always fix once they reach high frequencies -- in contrast with \FIG{fixp} that
371 shows that nonsynonymous mutations fix as if they were neutral. We know,
372 however, that nonsynonymous variation in the variable regions is driven by
373 positive selection. Inspecting the trajectories of nonsynonymous mutations
374 suggest the rapid rise and fall of many alleles. We test two possible such
375 mechanisms that are biologically plausible and could explain the transient rise
376 of nonsynonymous mutations: time-dependent selection and within-epitope
377 competition. If the immune system recognizes the escape mutant before its
378 fixation, the mutant might cease to be beneficial and disappear despite its
379 quick initial rise in frequency. In support of this idea,
380 \citet{richman_rapid_2003, bunnik_autologous_2008} report antibody responses to
381 escape mutants. These respones are delayed by a few months, roughly matching the
382 average sweep time of an escape mutant. Alternatively, several different escape
383 mutations in the same epitope can arise almost simultaneously and start to
384 spread. Their fitness benefits are not additive, because each of them is
385 essentially sufficient to escape. As a consequence, several escape mutations rise to
386 high frequency, while the escape with the smallest cost in terms of replication,
387 packaging, etc. is most likely to
388 eventually fix. In simulations, this kind of epistatic interactions within
389 epitopes reduces the fixation probability. The emergence of
390 multiple sweeping nonsynonymous mutations in real HIV infections has been shown
391 previously~\citep{moore_limited_2009, bar_early_2012}.
392 See the supplementary material for examples of successful simulations in both scenarios.
393
394 \section{Discussion}
395 Despite several known functional roles for RNA secondary structure in the HIV
396 genome, synonymous mutations are often used as approximately neutral markers in
397 evolutionary studies of viruses. We have shown that the majority of synonymous
398 mutations in the conserved regions C2-C5 of the \env~gene are deleterious.
399 Comparison with recent biochemical studies of binding propensity of bases in RNA
400 genome suggest that these mutations are deleterious, at least in part, because they disrupt
401 stems in RNA secondary structures. Furthermore, we provide evidence that these
402 mutations are brought to high frequency through linkage to adaptive mutations.
403 The latter mutations are only transiently adaptive, either through a
404 coevolution with the immune system or redundant escape within an epitope.
405
406 Our observations and conclusion rely heavily on longitudinal data in which the
407 dynamics of mutations can be explicitly observed. The fact that deleterious
408 mutations can be brought to high frequencies through hitchhiking underscores
409 the intensity of the coevolution with the immune system. The fact that
410 multiple escape mutations in the same epitope -- as is indeed observed in
411 studies of antibody escape~\citep{moore_limited_2009, bar_early_2012} -- are
412 necessary to explain the patterns of fixation of nonsynonymous mutations points
413 towards a large populations size that rapidly discovers adaptive mutations. A
414 similar point has been made recently by Boltz {\it et al.} in the context of
415 preexisting drug resistance mutations~\citep{boltz_ultrasensitive_2012}.
416
417 The observed hitchhiking highlights the importance of linkage due to infrequent
418 recombination for the evolution of HIV
419 \citep{neher_recombination_2010,batorsky_estimate_2011,
420 josefsson_majority_2011}. The recombination rate has been estimated to be on the
421 order of $\rho = 10^{-5}$ per base and day. It takes roughly $t_{sw} = s_a^{-1}
422 \log \nu_0$ generations for an adaptive mutation with growth rate $s_a$ to rise
423 from an initially low frequency $\nu_0\sim \mu$ to frequency one. This implies
424 that a region of length $l = (\rho t_{sw})^{-1} = s_a / \rho \log \nu_0$ remains
425 linked to the adaptive mutation. With $s_a=0.01$, $l\approx 100$ bases which is
426 consistent with strong linkage between the variable loops and the stems in
427 between. Furthermore, we do not expect hitchhiking to extend far beyond
428 the variable regions consistent with the lack of signal out side of C5-V5. In
429 case of much stronger selection -- such as observed during early CTL escape or
430 drug resistance evolution -- the linked region is of course much larger.
431
432 The functional significance of the insulating RNA structure stems between the
433 hyper variable loops has been proposed
434 previously~\citep{watts_architecture_2009, sanjuan_interplay_2011}.
435 \citet{sanjuan_interplay_2011} have shown that insulating stems are relevant for
436 viral fitness {\it in vivo}. Our analysis is limited by the availability of
437 longitudinal data which requires a focus on the the variable regions of \env~.
438 Conserved RNA structures most likely exist in different parts
439 of the HIV genome (several are known). In absence of repeated adaptive substitutions in the vicinity
440 that cause hitchhiking, the deleterious synonymous mutations remain at low
441 frequencies and can only be observed by deep sequencing methods.
442
443 As far as population genetics models are concerned, our study uncovers the
444 subtle balance of evolutionary forces governing intrapatient HIV evolution. The
445 fixation and extinction times and probabilities represent a rich and simple
446 summary statistics to test sequencing data and computer simulation upon. A
447 similar method has been recently used in a longitudinal study of
448 influenza~\citep{strelkowa_clonal_2012}. The propagators suggested in that
449 paper, however, represent ratios between (certain kinds of) nonsynonymous
450 mutations and synonymous ones, hence they are inadequate to investigate
451 synonymous changes themselves. Those authors also conclude that several
452 beneficial mutations segregate simultaneously in influenza, a scenario
453 remarkably similar to our within-epitope competition picture. These results
454 jointly suggest that viral evolution proceeds by multiple concurrent sweeps
455 rather then by successive fixation~\citep{desai_beneficial_2007, neher_rate_2010}.
456
457 Finally, our results emphasize the inadequacy of independent site
458 models of HIV evolution, especially in the light of transient effects on
459 sweeping sites, such as time-dependent selection and within-epitope negative
460 epistasis. Although a final word about which mechanism is more
461 widespread is yet to be spoken, both intuition and biological evidence from the
462 literature support a mixed scenario~\citep{richman_rapid_2003,
463 moore_limited_2009, bar_early_2012}. Note also that, unlike influenza, HIV does
464 recombine if rarely, hence clonal interference as studied in
465 ref.~\citep{strelkowa_clonal_2012} is only a short-term effect.
466
467 \section{Methods}
468 \subsection{Sequence data collection}
469 Longitudinal intrapatient viral RNA sequences were collected for published
470 studies~\citep{shankarappa_consistent_1999,
471 liu_selection_2006, bunnik_autologous_2008} and downloaded from the Los Alamos
472 National Laboratory (LANL) HIV sequence database~\citep{LANL2012}. The sequences from
473 some patients showed signs of HIV compartimentalization into subpopulations and
474 were discarded; a grand total of 11
475 patients with approximately 6 time points each and 10 sequences per time point
476 were analyzed. The time interval or resolution between two ocnsecutive sequences
477 was approximately 6 to 18 months.
478
479 \subsection{Sequence analisys}
480 The good sequences were aligned within each patient
481 via the translated amino acid sequence, using
482 Muscle~\citep{edgar_muscle:_2004}, and to the NL4-3 reference sequence probed
483 by \citet{watts_architecture_2009}. Within each patient, a consensus RNA
484 sequence at the first time point was used to classify alleles as ancestral or
485 derived at all sites. Problematic sites that included large frequencies of gaps
486 were excluded from the analysis, because variable regions are known to be
487 subject to frequent indels, while our analysis is limited to nucleotide
488 substitutions. Time series of allele frequencies were extracted from the
489 sequences.
490
491 The synonymity of a mutation was assigned if the rest of the codon was
492 in the ancestral state and using the standard genetic code. Cases where more
493 than one mutation within the codon was observed were discarded. Slightly
494 different criteria for synonymous/nonsynonymous discrimination yielded similar
495 results.
496
497 \subsection{Fixation probability and secondary structure}
498 For the estimate of times to fixation/extinction, polymorphisms were
499 binned by frequency and the time to reaching the first boundary (fixation or
500 extinction) was stored. For the fixation probability, the long-time limit of the
501 resulting curves was used, excluding polymorphisms that arose late in the
502 clinical history (and would have had no time to reach either boundary).
503
504 For the correlation analysis with RNA secondary structure, the SHAPE scores were
505 downloaded from the journal website~\citep{watts_architecture_2009}. By virtue
506 of the alignment of the longitudinal sequences with the reference used by
507 \citet{watts_architecture_2009}, SHAPE reactivities were assigned to most sites.
508 Problematic assignments in indel-rich regions were excluded from the analysis.
509 In order to restrict the analysis to synonymous polymorphisms, a lower frequency
510 threshold of 0.15 was used (other thresholds yielded the same results). Since
511 very few polymorphisms hitchhike beyond, say, a frequency of 0.5, this pool is
512 enriched for to-be-lost mutations; hence the "lost" curve in \FIG{SHAPEA}
513 contains much more points than the "fixed" one.
514
515 The V loops and flanking regions were identified manually starting from the
516 annotated reference HXB2 sequence from the LANL HIV database~\citep{LANL2012}. A
517 similar approach was used to label the C2-V5 region sequenced in
518 ref.~\citep{shankarappa_consistent_1999}.
519
520 \subsection{Computer simulations}
521 Simulations were performed using the recently published software
522 FFPopSim~\citep{zanini_ffpopsim:_2012}. Both full-length HIV genomes and
523 \env{}-only simulations were performed and yielded comparable results. For each
524 set of parameters, approximately 100 simulation runs were averaged over. In each
525 run, a random fitness landscape with specified statistical properties (e.g.
526 density of beneficial sites, average deleterious effect of synonymous changes) was generated.
527 Although the curves shown in \FIG{simfixpvar} are not very smooth, small
528 parameter changes resulted in overall consistent trends across many repetitions.
529
530 For the discussion of simulation parameters, the areas below or above the neutral
531 diagonal were estimated from the binned fixation probabilities using the linear
532 interpolation between the bin centers. This measure is sufficiently precise for
533 our purposes, because the HIV data are quite scarse themselves.
534
535 \section*{Acknowledgements}
536 \comment{to be written\dots}
537
538
539 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
540 \bibliographystyle{natbib}
541 \bibliography{bib}
542 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
543 \end{document}
544 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
545