Introduction
Phenotypic data analysis
Pipeline for the analysis of GBS data adapted from Wickland et al. 2013
Filtering
Allele frequency estimation
Estimation of the effective population size
LD decay
Scan for selection signatures mapping with FST leveraging replicated selection
Test for selection: "Sauron plot"
Window based analysis
Comparison of significance thresholds and candidate gene identification
Haplotype estimation and haplotype block calculation
Generation of the LDheatmap
This repository contains scripts for the analysis of replicated experimental evolution studies.
Replicated experimental evolution has been implemented with several model organisms including bacteria, viruses, Drosophila melanogaster and yeast. However, replicated experimental evolution has not been applied to a large population in any agricultural species. Replicating selection can help to identify selected sites and differentiate them from sites affected only by drift. We developed a replicated set of experimental evolution maize populations, each with large population size (~5,000 plants), and implemented three generations of selection for height. We analysed the sequence data with window-based analysis, performed a haplotype block analysis, and created pairwise LD heatmaps of all regions putatively under selection. The preprint is available as Tost et al., 2024.
In the following you can find detailed explanations of the different analysis steps:
The truncation thresholds of the 5% tallest or 5% shortest plants within the populations can be calculated like this:
quantile(Short_Population_1$PlantHeight, 0.05, na.rm = TRUE)
quantile(Short_Population_2$PlantHeight, 0.05, na.rm = TRUE)
quantile(Tall_Population_1$PlantHeight, 0.05, na.rm = TRUE)
quantile(Tall_Population_2$PlantHeight, 0.05, na.rm = TRUE)
The phenotypic data analysis was conducted to evaluate the effect of selection on the phenotype. Therefore, the trait measurements in the subpopulations selected in opposite directions can be compared. We conducted a t-test between the subpopulations selected in opposite directions to test for significance.
Short_plants_2016 <- data[PlantHeight_group == "Selected for short plant height" & year == "2016",]
Tall_plants_2016 <- data[PlantHeight_group == "Selected for tall plant height" & year == "2016",]
Short_plants_2020 <- data[PlantHeight_group == "Selected for short plant height" & year == "2020",]
Tall_plants_2020 <- data[PlantHeight_group == "Selected for tall plant height" & year == "2020",]
t.test(Short_plants_2020$PlantHeight,Short_plants_2016$PlantHeight,
alternative = "less",
paired = TRUE)
t.test(Tall_plants_2020$PlantHeight,Tall_plants_2016$PlantHeight,
alternative = "greater",
paired = TRUE)
t.test(Short_plants_2020$PlantHeight,Tall_plants_2020$PlantHeight,
alternative = "less",
paired = TRUE)
Additionally we also used the trait measurments from the base population and all generations of selection to show the decrease and increase in the selected trait. In our case the selected trait was plant height.
Measured plant height in all years in the subpopulations selected for short plant (green) and tall plant height (purple).
The computation of the t-test statistic and the script for plotting the measured phenotypes across all years are available in the Phenotypic_data_analysis.R
script.
The realized heritability was calcaluated according to (Lush, 1937). This is done by the Realized_heritability_calculations.R
script. To calculate the realized heritability the selection differential and the response to selection are calculated between the different generations of selection. In the following example the realized heritability was calculated between generation 0 and generation 1 of the subpopulations short plants 1.
calculated_realized_herit <- function(Data_Gen1,
Data_Gen2){
mean_Gen_1 <- mean(Data_Gen1$PlantHeight, na.rm = TRUE)
mean_Gen_2 <- mean(Data_Gen2$PlantHeight, na.rm = TRUE)
Selected_prop <- quantile(Data_Gen1$PlantHeight, 0.05, na.rm = TRUE)
Selection_Differential <- Selected_prop - mean_Gen_1
Response_to_Selection <- mean_Gen_2 - mean_Gen_1
herit_BS <- as.numeric(Response_to_Selection)/as.numeric(Selection_Differential)
return(herit_BS)
}
calculated_realized_herit(Data_Gen1 = Short_1_2016,
Data_Gen2 = Short_1_2017)
Pipeline for the analysis of GBS data adapted from Wickland et al. 2013
For the analysis of our raw reads from paired-end genotyping-by-sequencing (GBS) with ApeKI according to Elshire et al. 2011, we used the GB-eaSy pipeline from Wickland et al. 2013.
The pipeline consists out of several steps, which comprises:
- Demultiplexing and trimming of the adapter sequence
- Alignement to the reference genome
- Create a list of sorted bam files
- Generate pileup and SNP calling
- Filtering for quality parameters
After the VCF file was created with the GB-eaSy pipeline from Wickland et al. 2013 and filtered for some quality parameters we did some additional filtering in R. All listed Filterung steps were conducted by the Filtering_for_coverage_average_RD_missingness.R
script. The Rscript contains the Filtering functions for the following filtering steps.
- Filtering for individual samples with low coverage
- Filtering for read depth per sample
- Filtering for missingness
- Removal of non-diallelic and non-polymorphic markers
The Rscript contains many functions from thevcfR
package from Knaus and Grünwald 2018
The allele frequency estimation was done with the Allele_frequency_estimation.R
rscript.
The effective population size was calculated using the known demographic parameters of the populations and based off of the formula:
where Estimation_of_effective_population_size.R
rscript.
In field trials, the assumption of random mating is violated because of the effects of assortative mating due to varying flowering dates and limited spatial pollen dispersion (Allard, 1999). Therefore, we evaluated the flowering dates of the 96 randomly chosen plants from each subpopulation to approximate the number of simultaneously flowering tassels and silks. It was assumed that silks remain receptive up to 5 days after silk emergence (Nieh et al., 2014), so that we calculated the number of flowering tassels during this time interval and projected it onto the entire subpopulation. This calculation of this is also available in the Estimation_of_effective_population_size.R
rscript.
The extent of linkage disequilibrium (LD) was estimated based on 100,000 SNP markers with TASSEL v5 using the squared correlation between markers over a window size of 2,000 bp (Bradbury et al., 2007).
We filtered the individuals out with a marker coverage below 0.6
. For this analysis, we required that every SNP marker was observed at least 80 times in each subpopulation. This filtering resulted in 1,243,604 SNP markers. Additionally, we thinned the data set by randomly sampling 10,000 markers per chromosome.
The calculation can be found in the script Calc_LD_decay_calc_with_TASSEL_based_on_thinned_out_VCF.bash
.
LD decay was modeled with a nonlinear regression model as expected value with N as the number of individuals at each site and C as the recombination coefficient between sites (Remington et al., 2001). The calculation and plotting script is available as Plot_LD_decay_based_on_TASSEL_output.R
.
Our scan for selection was based on the
according to Weir and Cockerham, 1984.
The sum of Fst_Sum_calculation.R
.
We implemented a window-based analysis to assess linked selection. A cubic smoothing spline was applied to the single-marker based splineAnalyze()
function from the GenWin package are available in the Window_based_analysis.R
.
Significance thresholds for selection were calculated three ways: 1) based on the empirical distribution; 2) based on drift simulations; and 3) based on the false discovery rate for selection (FDRfS) Turner and Miller (2012).
The calculation of significance thresholds and the plotting functions are contained in the Significance_thresholds_and_plotting.R
.
The significance thresholds based on the empirical distribution, were calculated by taking the 99.9th and 99.99th percentile of the empirical distribution of the
FST_sig_thres_1 <- quantile(FST_values_od_cor$Fst, probs = 0.9999, na.rm = TRUE)
FST_sig_thres_2 <- quantile(FST_values_od_cor$Fst, probs = 0.999, na.rm = TRUE)
The significance threshold is stored, so it can be used later directly for plotting.
FDRfS for all possible values of the statistics
The function calculate_FDR_for_selection()
generates a table to show all posible values of a statistic and the number of observed markers diverged between subpopulation selected in the same and opposite directions at a certain value. The FDR for selection is received by dividing the number of observed markers diverged between subpopulation selected in the same direction by the number of observed markers diverged between subpopulation selected in opposite directions.
The different thresholds are compared here:
The plotting functions are available in the Plot_Manhattan_plot_with_WStat.R
script.
The data were phased using fastPHASE version 1.4.8 (Sheet and Stephens, 2006) for further dissection. We phased all regions putatively under selection with 10 iterations of the expectation-maximization (EM) algorithm (Sheet and Stephens, 2006). To phase our data, we prepare the regions with the rscript Prepare_data_for_fastPhase.R
script. The haplotype block calculation and plotting of the blocks are contained in the Create_Haplotype_figure.R
script.
We also created a pairwise LD heatmap with the R package LDheatmap Shin et al., 2006 to supplement our haplotype investigation. We looked at LD across putatively selected regions in the different subpopulations. This calculation is contained in the Create_Haplotype_figure.R
script.
Here a both plots combined for the identified region on chromosome 3: