We then raise it to the power p so that we can weight variants in a nonlinear fashion with respect to this fraction. We give guidance on the choice of p in the Supplementary Material online (supplementary fig. S12, Supplementary Material online).
We use the folded site frequency spectrum in calculating d i , as the frequency difference between the core variant and the second variant is independent of whether the derived or ancestral allele of the nearby allele is in linkage with the derived or ancestral core allele. In a region under long-term balancing selection, the average d i between a core SNP and the surrounding variants is expected to be elevated.
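As a rough illustration of the weighting described above, the following Python sketch computes a frequency-similarity weight from folded frequencies and averages it over nearby variants. The function names, the scaling by the maximum possible folded-frequency difference, and the default value of p are assumptions made for illustration only; the exact definition of d i is the one given in the main text and Supplementary Material.

```python
def fold(freq):
    """Fold an allele frequency onto [0, 0.5] (minor allele frequency)."""
    return min(freq, 1.0 - freq)


def similarity_weight(core_freq, other_freq, p=2.0):
    """Hypothetical similarity weight between a core and a nearby variant.

    Assumption: the weight is one minus the folded frequency difference,
    scaled by the largest difference possible given the core frequency,
    then raised to the power p (p=2.0 is an arbitrary example value).
    """
    f0, fi = fold(core_freq), fold(other_freq)
    max_diff = max(f0, 0.5 - f0)  # always >= 0.25, so never zero
    fraction = max(0.0, 1.0 - abs(fi - f0) / max_diff)  # clamp rounding noise
    return fraction ** p


def mean_similarity(core_freq, nearby_freqs, p=2.0):
    """Average weight over the variants surrounding the core variant."""
    weights = [similarity_weight(core_freq, f, p) for f in nearby_freqs]
    return sum(weights) / len(weights)
```

Because both frequencies are folded first, swapping which allele is labeled derived at either site leaves the weight unchanged, which is the property the paragraph above relies on.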
However, d i alone is not optimally powered to detect balancing selection, as its value will be sensitive to changes in the mutation rate in the surrounding region, and it does not take into account the probability of observing each allele frequency under neutrality. Our approach is inspired by previous summary statistics based on the site frequency spectrum Tajima ; Fay and Wu S2 , Supplementary Material online.
This behavior is expected because higher frequency alleles will tend to have a longer TMRCA and therefore higher diversity. The exception to this trend is neutral SNPs of frequency 0.
To address this possible shortcoming, we developed a version of the statistic based on a folded site frequency spectrum. This formulation is available in the Supplementary Material online. Although our statistic can be calculated on any window size, previous work has suggested that the effects of balancing selection localize to a narrow region surrounding the balanced site Gao et al.
Ultimately, the optimal window size depends on the recombination rate, as it breaks up allelic classes. In the Supplementary Material online, we present some mathematical formulations to suggest reasonable window sizes. We used forward simulations Haller and Messer to calculate the power of our approach to detect balancing selection relative to other commonly utilized statistics. Initially, we simulated a single, overdominant mutation for each simulation replicate in an equilibrium demographic model, varied over a range of balancing selection equilibrium frequencies and onset times see Materials and Methods.
We also simulated genomic regions in which all variants were selectively neutral. As a reference, we also measured the likelihood-based statistic, T2. In order to make a fair comparison between these methods, we first determined the most powerful window size for each method using simulations supplementary fig. S5 , Supplementary Material online. For the summary statistics, a 1-kb window size did well across a range of selection timings and equilibrium frequencies. This 1-kb region matches the approximate size of the ancestral region, in which there have been no expected recombination events between allelic classes see Supplementary Material online.
S6 , Supplementary Material online. Furthermore, this roughly matches the expected number of informative sites in a 1-kb region under selection see Supplementary Material online. Therefore, a window of 20 total informative sites is roughly equal to the expected ancestral region size, which is roughly equal to the window at which all these methods achieve optimal power.
For this reason, we used a 1-kb window or 20 informative sites, as applicable, when calculating each statistic. S4-S17, Supplementary Material online. However, unlike T2, our method does not require an outgroup sequence or computationally expensive grids of simulations. Power of methods to detect ancient balancing selection. Power was calculated based on simulation replicates containing only neutral variants (True Negatives) or containing a balanced variant that was introduced (True Positives).
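One common way to turn such replicates into a power estimate is to fix a false positive rate on the neutral scores and count how many selected replicates exceed the resulting threshold. The sketch below assumes that approach; the function name, the `fpr` parameter, and its default value are hypothetical and are not taken from the paper.

```python
def power_at_fpr(neutral_scores, selected_scores, fpr=0.05):
    """Fraction of selected replicates scoring above a neutral threshold.

    The threshold is chosen so that at most a fraction `fpr` of the neutral
    (True Negative) replicates score strictly above it.
    """
    if not 0.0 <= fpr < 1.0:
        raise ValueError("fpr must be in [0, 1)")
    ranked = sorted(neutral_scores, reverse=True)
    # Number of neutral replicates allowed above the threshold
    # (small epsilon guards against floating-point truncation).
    n_allowed = int(fpr * len(ranked) + 1e-9)
    threshold = ranked[n_allowed]
    hits = sum(1 for score in selected_scores if score > threshold)
    return hits / len(selected_scores)
```

For example, with neutral scores 0 through 99 and `fpr=0.05`, the threshold is 94, so selected scores of 90, 95, 99, and 100 give a power of 0.75.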
Columns correspond to simulations of balanced alleles at equilibrium frequencies 0. Rows correspond to older and more recent selection, beginning , and , generations prior to sampling, respectively. Under an expansion scenario, the performance of all methods decreased (supplementary fig. S7, Supplementary Material online), consistent with results from previous studies DeGiorgio et al. The effect of a population bottleneck on power was less drastic and led to a slight increase in power to detect more recent selection supplementary fig.
S8 , Supplementary Material online. Population substructure can confound scans for selection Schierup et al. To investigate the power of our method in these scenarios, we simulated two models of population substructure. First, we considered a model of two completely subdivided populations. We pooled together 50 individuals from each subpopulation with which to perform the statistical calculations.
In this case, the power of all methods to detect balancing selection at equilibrium frequency 0. S9 , Supplementary Material online. This matches expectation, as this situation is expected to drastically increase the number of variants at frequency 0. Next, we considered a two-pulse model of ancient admixture.
We selected this model because it approximates Neanderthal admixture into humans Vernot and Akey , which may be thought to confound scans for selection in humans. Power with Neanderthal admixture stayed roughly the same as without (supplementary fig. S10, Supplementary Material online). This is as expected, as most haplotypes introduced through admixture are expected to be at very low frequency. We next examined the power for all methods under models of variable mutation rates, recombination rates, and sample sizes.
As expected, the power of all methods was positively correlated with mutation rate supplementary figs. S13 and S15 , Supplementary Material online , and negatively correlated with recombination rate supplementary figs. S14 and S16 , Supplementary Material online. A higher mutation rate provides more variants that can accumulate within an allelic class, whereas a lower recombination rate allows for longer haplotypes upon which mutations can accumulate.
S19 and S20 , Supplementary Material online. In practice, the sample size used to calculate the frequency of each variant may differ between variants. We found that this decreases power very slightly, and that lower values of p perform better in this scenario supplementary fig.
S11 , Supplementary Material online. Finally, power remained high under frequency-dependent selection supplementary fig. S18 , Supplementary Material online , and when a lower selection coefficient was simulated supplementary fig. S17 , Supplementary Material online.
This matches expectation, as frequency-dependent selection is expected to maintain haplotypes in the population for long time periods, causing allelic class build-up. A lower selective coefficient would be expected to lower the probability of maintenance of the balanced allele in the population, but conditioned on this maintenance, should not affect power, as we observed. S4 , Supplementary Material online. We focused on regions that passed sequencing accessibility and repeat filters see Materials and Methods.
Although this phenomenon has not, to our knowledge, been described for population sizes near that of humans, it has been detailed for lower effective population sizes Ewens and Thomson. We analyzed the autosomes and X-chromosome separately.
Because our method is substantially better powered to detect older selection, we focus on signals of selection that predate the split of modern populations. For this reason, we further filtered for loci that were top-scoring in at least half of the populations tested see Materials and Methods.
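The cross-population filter described above can be sketched as follows. This is a minimal illustration that assumes each population's top-scoring loci are already available as a collection of locus identifiers; the function name, argument names, and data layout are hypothetical.

```python
from collections import Counter


def loci_shared_across_populations(top_loci_by_population, min_fraction=0.5):
    """Keep loci that are top-scoring in at least `min_fraction` of populations.

    `top_loci_by_population` maps a population label to an iterable of locus
    identifiers that were top-scoring in that population.
    """
    counts = Counter(
        locus
        for loci in top_loci_by_population.values()
        for locus in set(loci)  # count each locus once per population
    )
    needed = min_fraction * len(top_loci_by_population)
    return {locus for locus, count in counts.items() if count >= needed}
```

With four populations and the default threshold, a locus must be top-scoring in at least two of them to be retained.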
Together, these variants comprise 2, distinct autosomal and 86 X-chromosomal loci, and these signatures overlapped autosomal and 29 X-chromosomal genes. Trans-species haplotypes are defined as two or more variants that are found in tight linkage and are shared between humans and a primate outgroup (in our case, chimpanzee). These haplotypes are highly unlikely to occur by chance, unlike trans-species SNPs, which are expected to be observed in the genome due to recurrent mutations Gao et al.
Our scan identified several loci that have been previously implicated as putative targets of balancing selection (see Supplementary Material online). Several major signals occurred on chromosome 6 near the HLA, a region long presumed to be subject to balancing selection Hedrick ; Hughes and Nei. In particular, we found a strong signal in the HLA at a locus influencing response to hepatitis B infection, rs Thursz et al.
Several additional top sites in our scan matched those from DeGiorgio et al. These include sites that tag phenotypic associations Welter et al. In addition to passing the Genomes strict filter and the RepeatMasker test, these haplotypes also passed Hardy-Weinberg filtering (see Materials and Methods). One of our top-scoring regions fell within an intron of the cell adhesion molecule 2 gene, CADM2.
In the remaining six populations, the haplotype was at folded frequency 0. Signal of balancing selection at CADM2. The signal of selection is located in an intron of CADM2.
The purple dashed line marks two regulatory variants found on the balanced haplotype. Bottom: Approximate haplotype spans for each population. To elucidate the potential mechanisms contributing to the signal in this region, we overlapped multiple genomic data sets to identify potential functional variants that were tightly linked with our haplotype signature. Second, multiple variants in this region colocalized EUR r 2 between 0. Several SNPs on this haplotype, particularly rs and rs, fall in enhancers in several brain tissues, including the hippocampus Boyle et al.
Taken collectively, these data suggest that our haplotype tags a region of regulatory potential that may influence the expression of CADM2, and potentially implicate cognitive or neuronal phenotypes in the selective pressure at this site. We identified a novel region of interest within the intron of WFS1, a transmembrane glycoprotein localized primarily to the endoplasmic reticulum (ER). WFS1 functions in protein assembly Takei et al. In the remaining five populations, this haplotype was at frequency 0.
Signal of balancing selection at the WFS1 gene. The purple dashed line marks five regulatory variants found on the balanced haplotype.
Our identified high-scoring haplotype tags several functional and phenotypic variant associations. Second, multiple variants in this region are associated with expression-level changes of WFS1 in numerous tissues The GTEx Consortium ; these variants are strongly tagged by our high-scoring haplotype EUR r 2 between 0. Taken collectively, these data suggest that our haplotype tags a region of strong regulatory potential that is likely to influence the expression of WFS1.
Informed by previous theory on allelic-class build-up Hey ; Hudson ; Charlesworth , we developed a novel summary statistic to detect the signature of balancing selection, and measured efficacy and robustness of our approach using simulations.
Although our method does not require knowledge of ancestral states for each variant from outgroup sequences, this information can improve power at extreme equilibrium frequencies.
Although our method outperforms existing summary statistic methods, it is not as powerful as the computationally intensive approach of T2, which uses simulations to calculate likelihoods of observed data DeGiorgio et al. To improve power, we considered utilizing information on rates of substitutions, but this did not substantially improve discriminatory power (see supplementary methods, Supplementary Material online).
Alternative possibilities could include the following: (1) consideration of the region past the ancestral region surrounding the balanced variant, or (2) deviations in the frequency spectrum beyond just nearly identical frequencies to the balanced SNP. As expected from theory, we also note that models of population structure can produce our haplotype signature, emphasizing the requirement to perform scans on individual populations.
Balancing selection can cause a similar signature in self-fertilizing species, though we focused on out-crossed species in this report.
Previous work has shown that, given the same selection coefficient, the signature of balancing selection can be wider in self-fertilizing species due to a lower effective recombination rate Nordborg et al. However, a lower recombination rate also means that background selection leaves a wider footprint on the genome in these species, which can reduce levels of polymorphism Agrawal and Hartfield. Furthermore, a decrease in the frequency of heterozygotes, owing to selfing, can reduce or eliminate the effects of heterozygote advantage.
Instead, modes of balancing selection such as frequency-dependent, temporally varying, or spatially varying selection may be more significant.
We have also assumed a single causal variant throughout. However, there may be more than one variant at a locus experiencing balancing selection.
This situation is thought to occur throughout the HLA region Hedrick. Assuming the maintenance of multiple variants, this scenario would also increase the regional TMRCA, leading to allelic class build-up, perhaps spanning a larger window than in our single-variant models Lenz et al.
The dynamics of this type of situation could be the focus of future work. Although it is impossible to know the true selective pressure underlying our highlighted loci, our results suggest that balancing selection could contribute to the genetic architecture of complex traits in human populations. At the CADM2 locus, functional genomics data suggest that our haplotype signature may connect to brain-related biology.
Intriguingly, a recent report also noted a strong signature of selection at this locus in canines Freedman et al. That said, the phenotypes that have resulted in a historical fitness trade-off at this locus are far from obvious.
Similarly, speculation on the potential phenotypes subject to balancing selection at WFS1 should also be interpreted cautiously. It is known that autosomal recessive, loss of function mutations in this gene cause Wolfram Syndrome. The temporal structure of selection is then important. The selective divergence is then given by 4. The two sums on the right represent respectively: the divergence contribution from fitness variation within intervening generations; and the divergence contribution from temporal consistency in fitness variation across intervening generations.
The multilocus case similarly involves exponential decay averaged over all linked sites under selection [ 42 ]. Thus, among-locus temporal autocovariances Cov s i , s j can make a substantial contribution to the overall selective divergence. The variance resulting from this effect accumulates at a slower linear rate with time because there are t variance terms in Eq 4 ; C in S1 Text —a selective random walk [ 43 ].
Selection that changes in a more predictable manner could in principle generate no overall divergence at all—if selection reverses direction concurrently at many loci, negative covariances can be created in Eq 4 shrinking the overall divergence. This effect occurs when a mean selective bias in the cohort displaces allele frequencies and thus perturbs the effects of drift regardless of whether there is among-locus variation in total selection coefficients.
We show that the selective perturbation to the drift variance has the form where c is a frequency-independent constant of order 1 D in S1 Text. Wright-Fisher and Moran , but in general it is possible that the exact form of the selective drift perturbation depends on population specifics. In the following analysis the exact expression for the selective drift perturbation will not be important; we only use the fact that it scales with , which implies that its effects are negligibly small in the populations of interest here Methods.
Combining variance contributions we have 5 where D t is the frequency-independent variance coefficient in the absence of selection. The variance coefficient C t p is thus partitioned respectively into a frequency-independent genetic drift component, a frequency-dependent selective drift perturbation, and a frequency-dependent selective divergence. This quantity is challenging to analyze because it is determined by the structure of linkage disequilibrium. We thus performed forward-time population genetic simulations using SLiM [ 44 ] to supplement our theoretical results see Methods for simulation details.
For simplicity, we focus on three archetypal scenarios in an unstructured, demographically stable population closed to migration: a continual influx of deleterious mutations, no non-neutral mutations the control case , and a continual influx of unconditionally beneficial mutations. We also calculate total selection coefficients for all segregating mutations to investigate how the selective divergence term in Eq 5 behaves as a function of p. To make the magnitude of the latter easier to interpret, we show total selection coefficient variance on a per-generation scale where is the time-averaged total selection coefficient.
Our positive selection simulations confirm this prediction, consistently creating positive excess variance Fig 2A and 2C. On the other hand, there is no consistent deviation from binomial variance in the negative selection simulations: increases with major allele frequency so rapidly that the overall selective divergence term in Eq 5 is independent of frequency Fig 2A and 2B.
A Forward-time population genetic simulations consistently show elevated excess variance under positive selection only. C In contrast, the selective divergence shows clear frequency dependence under positive selection, thus producing excess variance at intermediate frequencies. Stars indicate which panel A simulations are shown in panels B and C respectively. We next investigated whether binomial allele frequency variance is observed empirically.
In two fruit fly D. We rule out measurement error as driving this pattern, because the major sources of pooled sequencing error population sampling, read sampling, unequal individual contributions to pooled DNA also create binomial variance rather than a systematic frequency-dependent bias E in S1 Text ; [ 45 , 46 ]. Moreover, as will be discussed in the next section, systematically elevated variance cannot be explained by a few large effect loci, implying that a substantial fraction of SNPs across the genome are involved in the observed pattern.
Hence we also rule out mutation bias and gene drive as being the main driver of elevated variance at intermediate frequencies since these processes do not have the requisite scale. We deduce that the pattern observed in Fig 3 is due to selection, consistent with the theoretical prediction that selective divergence tends to cause elevated variance at intermediate frequencies. C t p is calculated in 2. We subtract the constant min p C t p from C t p in each replicate to prevent differences in the overall magnitude of C t p between replicates from obscuring p dependence within each replicate.
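The per-replicate baseline subtraction described above amounts to the following. This is a minimal sketch in which each replicate's C t p values are assumed to be stored as a list ordered by frequency bin; the function name and data layout are illustrative only.

```python
def subtract_replicate_minimum(c_by_frequency_bin):
    """Shift one replicate's variance coefficients so the smallest is zero.

    Removing the replicate-specific constant min_p C_t(p) leaves only the
    dependence on p, so replicates with different overall magnitudes can be
    compared on the same axis.
    """
    baseline = min(c_by_frequency_bin)
    return [value - baseline for value in c_by_frequency_bin]
```

Two replicates whose curves differ only by a constant offset map to the same shifted curve, which is why the overall magnitude no longer obscures the frequency dependence.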
Similar results are found in a wild D. melanogaster population [ 15 ] (S1 Fig), although this population is not closed and elevated variance could also be attributed to migration.
The migration divergence thus depends on the structure of differentiation between focal and source populations. However, since we do not know the structure of population differentiation or even what the source population might be , we remain agnostic about the influence of migration in the ref. Next we explored the behavior over time of the elevated variance shown in Fig 3 by following its accumulation within a frequency cohort for two studies in which allele frequencies were measured more than twice [ 11 , 15 ].
We find that excess variance accumulates over the course of the entire Barghi et al. Sustained divergence is what we expect to occur from selection in a novel but constant laboratory environment. By contrast, excess variance in wild D. melanogaster populations [ 15 ] does not accumulate continually over time, with fluctuations evident in each cohort (Fig 4B).
Fluctuations imply a concurrent reversal in the direction of non-neutral allele frequency change across many loci such that non-neutral divergence is partly lost to a subsequent coordinated non-neutral convergence. Bearing in mind that migration may contribute to this pattern, the fluctuations shown in Fig 4 are compatible with temporally fluctuating selection affecting a large proportion of the genome, as proposed by ref. However, while ref.
A similar lack of annual periodicity is found in allele frequency temporal autocovariances [ 28 ]. These results suggest a more complex selective or migratory regime of which seasonal fluctuations are only a part. In the previous section we argued that selection is most likely responsible for elevated allele frequency divergence at intermediate frequencies in three Drosophila studies with the possible exception of the ref.
We next used the theory developed above to estimate the typical magnitude of total selection coefficients associated with elevated divergence we also apply our analysis to ref. This quantity determines the selective divergence in Eq 5 , and has the convenient property of measuring the absolute magnitude of s regardless of sign.
Since we only have measurements separated by t generations, we actually estimate where is the time-averaged selection coefficient. To estimate from Eq 5 , we need to eliminate the non-selective divergence contributions of genetic drift D t and measurement error which was not included in Eq 5. Several lines of evidence support the view that selection strongly influences genetic variation in Drosophila [ 8 , 12 , 28 , 48 ]. Since our method relies on contrasting behavior at different frequencies, the effect of selection on extreme frequency alleles is used as a reference and is therefore not directly inferred.
We expect the effects of selection to be even greater at extreme frequencies where most deleterious mutations are segregating and recent neutral mutations are most tightly linked to selected backgrounds. The power of our approach stems from aggregating allele frequency behavior over many loci, thereby leveraging the sheer number of variants measured with whole-genome sequencing to discern a selective signal.
Heuristically, the sampling error in the lower bound estimate 6 is proportional to where L is the number of independent loci used to estimate C t p. Intuitively, variants across the genome experience a detectable non-neutral shift as a collective even though the underlying allele frequency changes may be indistinguishable from drift at individual loci.
Our approach is a departure from the widespread use of frequency-independent C t for neutral mutations [ 30 ]. Thus, selection makes N e frequency-dependent for neutral mutations over short timescales i. The origin of this non-binomial allele frequency variance is variation in the selective background of alleles at different loci. Selection does not need to be consistent over time to have this effect: stochastically fluctuating selection with no temporal consistency can also generate non-binomial allele frequency variance.
However, temporally consistent selection generates divergence more rapidly, and temporal covariances can be responsible for most of the selective divergence Results. Thus, it seems likely that temporally consistent selection is at least partly responsible for the patterns documented here.
Note, however, that in contrast to ref. These cross-measurement covariances do not contribute to the divergence observed at t generations, and are only a subset of the covariances contributing to the divergence observed at 2 t generations Eq 4.
Therefore, the patterns of variance accumulation documented here are related but not equivalent to the patterns documented in ref. Allele frequency divergence captures the cumulative genome-wide influence of both temporally stable and fluctuating selection between two measurements.
The relative contribution from temporal covariances in total selection coefficients depends on the intensity of selective fluctuations as well as the persistence time of linkage disequilibrium Results , and would require generational allele frequency measurements to quantify. We found that the frequency structure of allele frequency divergence is informative about the underlying structure of direct selection Fig 2.
Elevated divergence of intermediate-frequency alleles is difficult to explain if only negative selection on unconditionally deleterious mutations is occurring. Quantifying the bounds on how much selection is possible, and how much selection actually occurs in natural populations, is a long-running controversy [ 50 , 51 ].
This implies a substantial risk of overestimating the amount of direct selection when, as is commonly done, selection coefficients are inferred at individual loci and then attributed to direct selection.
Our results indicate that improving the sensitivity of single-locus selection coefficient inferences, or better controlling for multiple comparisons, will likely not resolve this issue.
Our total selection coefficient estimates are also substantially larger than direct selection coefficients of individual alleles estimated from diversity patterns in Drosophila [ 8 ]. Total selection coefficients in Fig 2B and 2C computed using Eq 2 from genotype data at generation 10 4. SNP frequency data were obtained from the open access resources published in [ 15 ] wild D.
We performed no additional SNP filtering. We use bootstrapping to estimate the variability of the quantities plotted in Figs 3 — 5. These quantities are calculated as an average over loci, where nearby loci are unlikely to be statistically independent due to linkage. Bootstrap sampling is then applied to these windows. The plotted vertical lines span the 2. M is frequency-independent because measurement error is binomial E in S1 Text ; [ 45 , 46 ].
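The window-based bootstrap described above can be sketched as a block bootstrap: contiguous windows of loci, rather than individual loci, are resampled with replacement so that linkage within a window is preserved. The function name, the window size, the number of bootstrap replicates, and the percentile bounds in this sketch are all hypothetical parameters chosen for illustration, not values taken from the paper.

```python
import random


def block_bootstrap_interval(values, window_size, lower_pct, upper_pct,
                             n_boot=1000, seed=0):
    """Percentile interval for the mean of `values` under a block bootstrap.

    `values` holds one per-locus quantity in genomic order. Loci are grouped
    into consecutive windows of `window_size`, and whole windows are
    resampled with replacement.
    """
    rng = random.Random(seed)
    windows = [values[i:i + window_size]
               for i in range(0, len(values), window_size)]
    means = []
    for _ in range(n_boot):
        resampled = [rng.choice(windows) for _ in windows]
        flat = [v for window in resampled for v in window]
        means.append(sum(flat) / len(flat))
    means.sort()
    lower = means[int(lower_pct / 100 * (n_boot - 1))]
    upper = means[int(upper_pct / 100 * (n_boot - 1))]
    return lower, upper
```

A call such as `block_bootstrap_interval(per_locus_values, 100, 2.5, 97.5)` would return the bounds of a central 95% interval; fixing `seed` makes the result reproducible.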
Our analysis relies on detecting differences in C t p between cohorts with different values of p. Same as Fig 3 but for the Bergland et al. Each curve represents a different seasonal iterate e. Same as Fig 4A but including all 10 replicates from Barghi et al. Abstract Resolving the role of natural selection is a basic objective of evolutionary biology. Author summary Natural selection is the process fundamentally driving evolutionary adaptation; yet the specifics of how natural selection molds the genome are contentious.
Funding: The author s received no specific funding for this work. Introduction One of the central problems of evolutionary biology is to delineate the role of natural selection in shaping genetic variation. Evolutionary processes depend on both changes in genetic variability and changes in allele frequencies over time. The study of evolution can be performed on different scales. Microevolution reflects changes in DNA sequences and allele frequencies within a species over time.
These changes may be due to mutations, which can introduce new alleles into a population. In addition, new alleles can be introduced in a population by gene flow, which occurs during breeding between two populations that carry unique alleles. In contrast with microevolution, macroevolution reflects large-scale changes at the species level, which result from the accumulation of numerous small changes on the microevolutionary scale.
An example of macroevolution is the evolution of a new species.