---
title: "Part 1: Bayesian Meta-analysis of BBS phenotypes"
abstract: 'This is a supplementary file for the paper "Meta-analysis of genotype-phenotype associations in Bardet-Biedl Syndrome uncovers differences among causative genes". This document describes the Bayesian analysis reported in the main manuscript. The complete source code for the analysis can be found at https://github.com/martinmodrak/bbs-metaanalysis-bayes or Zenodo, DOI: 10.5281/zenodo.3243264'
output:
  pdf_document:
    toc: true
---
```{r setup_main, echo=FALSE, message = FALSE, warning=FALSE}
knitr::opts_chunk$set(echo=FALSE)
library(rstan)
library(brms)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(skimr)
library(readxl)
library(here)
library(tidyverse)
library(tidybayes)
library(bayesplot)
library(cowplot)
library(knitr)
library(svglite)
source(here("data_processing.R"))
source(here("models.R"))
source(here("models_funcs.R"))
source(here("plots.R"))
if(!dir.exists(here(stored_fits_dir))) {
  dir.create(here(stored_fits_dir))
}
include_validation <- FALSE
multiverse_part <- if(include_validation) { "Part 4" } else {"Part 3"}
```
## The data
First let us examine some of the properties of the dataset we are working with - a brief summary follows.
```{r}
data <- read_main_data()
skim_with(
  numeric = list(hist = NULL, sd = NULL, p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL, n_unique = n_unique),
  character = list(empty = NULL),
  factor = list(top_counts = NULL, ordered = NULL)
)
data %>% select(-age_corr, -ID, -age_numbers, -age_numbers_groups_guessed, -age_std_for_model, -loss_of_function_certain) %>% skim()
```
Note in particular that both age and sex are missing in almost half of the records. The data on individual phenotypes (all the numeric columns) is also largely incomplete. Some minor cleaning is required to use age, as it is stored as character (a combination of age ranges and exact ages). For some phenotypes we get values that are not 0 or 1 - those correspond to patients that were monitored in multiple studies, but whose phenotype data was inconsistent between studies. In our analysis, we treat those patients as exhibiting the phenotype.
`loss_of_function` encodes whether the particular mutation certainly leads to the loss of function of the protein (the protein is truncated in both alleles).
For some analyses, we group the genes together into *functional groups*, which are defined as follows:
```{r}
data %>% select(gene, functional_group) %>% distinct() %>% group_by(functional_group) %>% summarise(genes = paste(gene, collapse = ",")) %>% kable()
```
And here are the counts of patients per mutated gene as observed in the data:
```{r}
data %>% group_by(gene) %>% summarise(count = length(gene)) %>% kable()
```
While we include all of the genes in our computational model, we will mostly show only the most frequent mutations in the results here; those include BBS1 through BBS10 and BBS12.
```{r}
genes_to_show <- genes_to_show_from_data(data)
data_long <- data_long_from_data(data)
```
The phenotypes present in the data are:
```{r}
paste0(phenotype_long_from_phenotype(phenotypes_to_use)," (", phenotypes_to_use,")")
```
The data shows considerable between-study (source) variability (showing only the BBSome genes for clarity):
```{r, fig.width=6.6666, fig.height=4}
data_long %>%
  group_by(source, phenotype, gene) %>%
  filter(functional_group == "BBSome", gene != "BBS18") %>%
  summarise(n_patients = length(phenotype_value), proportion = sum(phenotype_value) / n_patients) %>%
  ggplot(aes(x = gene, y = proportion, size = n_patients, color = gene)) +
  geom_jitter(alpha = 0.3, height = 0, width = 0.3) +
  facet_wrap(~phenotype, ncol = 3) +
  scale_x_discrete(labels = bbs_labeller) +
  scale_y_continuous("Proportion of phenotype") +
  scale_size_continuous(range = c(0.5, 4)) +
  guides(color = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Here, every point represents the proportion of patients with a given mutation that manifested a given phenotype in one study. The point size represents the number of patients with the mutation in the study. We see that even studies with a relatively large number of patients show very different proportions. We are therefore confident that allowing for between-study heterogeneity is important for analyzing the data correctly. In Part 2, we also show an attempt to model the studies as homogeneous and find it to be a bad fit, although most of the qualitative conclusions are supported in both models, as described in `r multiverse_part`.
## Handling missingness
The most problematic aspect of missing data is the missingness in the phenotype data. There are two distinct sources of missingness: a) a study missing the phenotype value for only some patients, or b) a study not reporting the status of the phenotype for any patient.
Our analysis assumes the phenotype data to be missing at random, i.e. that the decision to not report a given phenotype in a study and missingness for individual patients is independent of the prevalence of the phenotype in the study population. This is probably not true for missingness at the study level, as the investigators are plausibly more likely to report more prevalent phenotypes and more likely to ignore a phenotype that was not observed in any patient. Similarly, if data for a specific patient omit a given phenotype (the state of the phenotype is reported as missing data in the original study), it is more likely the phenotype was not present.
However, we were unable to find a good way to account for this phenomenon in our analysis. Since we focus on comparing the prevalence of individual phenotypes across different mutations within a single study and do not compare phenotypes against each other, this should only be a significant issue if the rate of missingness in phenotype values were correlated with the prevalence of individual mutations present in a study (e.g., if studies with high obesity missingness also tended to have an overabundance of mutations in BBS3). Let's check whether this is the case:
```{r, fig.width=8,fig.height=4}
phenotype_stats <- data_long %>%
  group_by(ID, source) %>%
  summarise(n_pheno = length(phenotype)) %>%
  group_by(source) %>%
  summarise(avg_n_pheno = mean(n_pheno), n_patients = length(ID))
data_long %>%
  group_by(source, gene, functional_group) %>%
  summarise(n_mutations = length(unique(ID))) %>%
  inner_join(phenotype_stats, by = c("source" = "source")) %>%
  mutate(mutation_prevalence = n_mutations / n_patients) %>%
  filter(functional_group %in% c("BBSome", "BBS3"), gene != "BBS18") %>%
  ggplot(aes(x = avg_n_pheno, y = mutation_prevalence, size = n_patients, weight = n_patients)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  geom_jitter(alpha = 0.3) +
  guides(size = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~gene, scales = "free", nrow = 2, labeller = bbs_labeller_facet)
```
In the figure above, each dot is a single study and shows the average number of phenotypes reported per patient vs. the prevalence of mutations in individual genes. Point size corresponds to the number of patients in a given study. Looking at the figure, a strong association between missingness and specific mutations seems implausible, but we can't completely rule out that it biases our results. This is a limitation of our approach and should be taken into account when interpreting our conclusions.
There is large missingness in age and sex data as well. While our main analysis ignores age and sex, Part 2 shows that the between-study terms already absorb most of the age and sex differences. We have also tried to impute age and sex data and show that models including imputed age and/or sex provide almost identical results (see `r multiverse_part`).
## The model - an accessible explanation
*An exact mathematical formulation is given in the following section. This section may be safely skipped.*
Like all models, the model we use simplifies and abstracts the medical reality in the hope that we can arrive at useful conclusions. Our model is a member of the generalized linear model family, using a logit link and hierarchical terms in a fully Bayesian treatment. Let's unpack this a little, starting with what a logit link does. In the following, we will describe how we handle a single phenotype, as the estimation for individual phenotypes is mostly independent.
Our model tries to estimate theoretical *true prevalence* of the phenotype in a population - i.e. the probability that a randomly selected patient from the population will exhibit this phenotype. But all we observe is that each individual either exhibits the phenotype or not. Depending on the number of individuals enrolled in a study, the *observed prevalence* will jump more or less around this true prevalence - let's have a look at an example:
```{r, fig.width=8, fig.height=4}
plot_example <- function(true_prevalences, n_cases, between_study_variability = 0, n_samples = 20) {
  vertical_labels <- theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
  inv_logit <- function(x) { 1 / (1 + exp(-x)) }
  per_study_geom <-
    if(between_study_variability != 0) {
      geom_errorbar(aes(x = study, ymax = per_study_prevalence, ymin = per_study_prevalence), color = "blue")
    } else { NULL }
  per_mutation_geom <-
    if(length(unique(true_prevalences)) > 1) {
      geom_hline(aes(yintercept = true_prevalence), color = "darkred", size = 1)
    } else { NULL }
  tibble(mutation = paste0("Phenotype ", 1:length(true_prevalences)), true_prevalence = true_prevalences) %>%
    crossing(tibble(study = paste0("Study ", 1:length(n_cases), ", n = ", n_cases), n_cases = n_cases)) %>%
    mutate(
      log_odds_base = log(true_prevalence / (1 - true_prevalence)),
      overall_prevalence = inv_logit(mean(log_odds_base)),
      per_study_prevalence =
        rnorm(n(), log_odds_base, between_study_variability) %>% inv_logit()
    ) %>%
    crossing(tibble(sample = 1:n_samples)) %>%
    mutate(observed_prevalence = rbinom(n(), n_cases, per_study_prevalence) / n_cases) %>%
    ggplot(aes(x = study, y = observed_prevalence)) +
    per_mutation_geom +
    per_study_geom +
    #geom_hline(aes(yintercept = overall_prevalence), color = "darkgreen", size = 2, linetype = "twodash" ) +
    geom_jitter(width = 0.3, height = 0, alpha = 0.3) +
    facet_wrap(~ mutation) +
    expand_limits(y = c(0, 1)) +
    vertical_labels
}
plot_example(c(0.63, 0.83, 0.96), c(5,11,23,33))
```
```{r}
mut_in_study <- data %>% group_by(gene, source) %>%
  summarise(count = length(gene))
max_mut_in_study <- mut_in_study %>% arrange(desc(count)) %>% head(n = 1)
studies_less_than_5 <- round(mean(mut_in_study$count < 5) * 100)
```
The red line shows the true prevalence of a phenotype (assuming it is the same across multiple studies). Each point shows the observed prevalence of a single possible realization of a study. We see that the observed prevalence can differ substantially from the true value and that the spread decreases with increasing $n$. In our data, however, we cannot expect large $n$: the largest is `r max_mut_in_study$count` patients with a mutation in `r max_mut_in_study$gene` in a single study (`r max_mut_in_study$source`), and in `r studies_less_than_5`% of cases there are fewer than 5 patients with a given mutation in the same study.
Also note that the observed values cluster at discrete "levels": among 5 patients, for example, you can only observe a prevalence of 0.2 or 0.4 and nothing in between. Further, you can see that with high prevalence, there is less variation in the observed prevalence.
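The discreteness of those levels and the narrowing spread with larger $n$ can be checked with a quick simulation (base R only; the prevalence and sample sizes are just illustrative values, not taken from the dataset):

```r
set.seed(42)
true_prev <- 0.63
# observed prevalence in 1000 hypothetical studies of 5 patients each...
obs_small <- rbinom(1000, size = 5, prob = true_prev) / 5
# ...and in 1000 hypothetical studies of 50 patients each
obs_large <- rbinom(1000, size = 50, prob = true_prev) / 50
# with n = 5, only multiples of 1/5 can ever be observed
all(obs_small %in% (0:5) / 5)
# larger studies cluster much more tightly around the true prevalence
sd(obs_small) > sd(obs_large)
```

Both comparisons evaluate to `TRUE`.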
For mathematical convenience, the model does not work with prevalence directly, but with *odds*. Odds are just another way to express prevalence - for example, when the prevalence is $20\%$, we expect one patient to exhibit the phenotype for every four patients not exhibiting the phenotype, leading to odds $1:4$ or $0.25$ and log odds (base 10 here) of roughly $-0.6$. Unlike prevalence, which is constrained between 0 and 1, odds can be any non-negative number.
Most of the results of the model are reported as comparisons of odds in different populations. For this, we use the ratio of the corresponding odds. E.g., when the odds ratio is 2, we expect the first population to have twice as many patients exhibiting the phenotype per healthy patient as the second population.
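The conversions between prevalence, odds and odds ratios from the two preceding paragraphs can be verified with a few lines of base R (using the same numbers as in the text):

```r
prevalence <- 0.2
odds <- prevalence / (1 - prevalence)  # 0.25, i.e. odds of 1:4
log10(odds)                            # roughly -0.6, the log odds (base 10)
# an odds ratio compares two populations:
odds_other <- 0.125
odds / odds_other                      # 2, i.e. twice as many affected per healthy patient
# and prevalence is recovered from odds:
odds / (1 + odds)                      # back to the prevalence of 0.2
```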
Technically, the model works with the logarithm of odds. The logit function transforms a probability (prevalence) to log odds, hence the logit "link" used in the model. What is the linear part?
Our model assumes that the true odds of a phenotype are a function of four numbers (*coefficients*): the overall odds of the phenotype in the population, a modifier for the gene the patient has damaged, and a modifier for the study the patient is enrolled in. One additional modifier is added when the mutation is a certain loss-of-function (cLOF). These numbers are multiplied to arrive at the final odds for the patient. Assuming only cLOF mutations for simplicity, this means that while the odds of a phenotype are allowed to vary between genes and the overall rate of a phenotype may vary between studies, the odds ratio between different genes is the same across all studies. Let's look at an example:
```{r, fig.width=8, fig.height=3}
set.seed(1211)
tibble(gene = paste("Gene", 1:3), odds = c(1, 5, 3)) %>%
  crossing(tibble(source = paste("Study ", 1:2), multiplier = 1:2)) %>%
  crossing(tibble(type = c("Allowed", "Disallowed 1", "Disallowed 2"), noise_sd = c(0, 1, 1))) %>%
  mutate(odds = if_else(multiplier == 1, odds, rlnorm(n(), log(odds * multiplier), sd = noise_sd))) %>%
  ggplot(aes(x = gene, y = odds, color = source, group = source, shape = source)) +
  geom_line() + geom_point() +
  scale_y_log10("true odds") +
  facet_wrap(~type) +
  base_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Above, on the leftmost panel, we see that the two studies differ in the odds of the phenotype for each gene, but the ratio of odds for Gene 1 to Gene 2 (and 3) is the same in both studies (since the odds are shown on log scale this manifests as a constant gap between the two lines). This type of between-study variation is allowed. On the other two panels, the odds ratio for Gene 1 to Gene 2 (and 3) differs between studies - this type of variation is not allowed by the model.
Another way to describe the allowed case is that, for both studies a mutation in Gene 2 makes a patient five times more likely to exhibit the phenotype than a patient with a mutation in Gene 1, although the base rate of the phenotype may vary between studies.
The cLOF coefficient is held constant for all genes in a given phenotype, meaning that the odds of any given phenotype are multiplied by a single, gene-independent number when the mutation is cLOF. Once again this means that relative odds are the same among cLOF and other mutations, but absolute odds can be higher (or lower) in cLOF mutations.
This is the "linear" part of the model - we multiply odds, which is the same as adding the logarithm of the odds and addition is a neatly linear thing.
Now the "hierarchical" part. This ties the coefficients in the model in two important ways:
i) it assumes that small differences in odds across genes and studies are more likely than large differences and updates the estimates accordingly,
ii) *partial pooling*: the degree to which odds are allowed to "jump around" across both genes and studies is informed by the data, e.g., if the odds are similar for all genes except one, the model will put higher weight on the possibility that the difference in the last mutation is just noise and shrink its estimate towards the average for other genes. On the other hand, if the odds vary wildly across all genes, the model will assume it is more likely that this is a true variation and it will not shrink the estimate much. The amount of shrinkage also depends on the number of observations as estimates in which there is a larger number of observations are shrunk less. The variability across studies is pooled in a similar way.
Together, those two features result in a low risk of overfitting the data, even though we have very few observations for most study-gene-phenotype combinations.
We also allow for a correlation between phenotypes, e.g., that some phenotypes occur frequently together while others rarely manifest in the same patient. Once again, the amount of correlation is estimated from the data.
Finally, the "Bayesian" part: We follow the Bayesian paradigm, so our estimates of the model coefficients are not single numbers, but rather distributions - some values are more likely than others, but the data are insufficient to let us determine the coefficients with high certainty. Therefore, we never report exact numbers, but rather 50% and 95% *credible intervals* of the distribution. Unlike confidence intervals in frequentist analysis, a 95% credible interval can be directly interpreted as the interval that contains the true value with 95% probability - assuming our model is correct (which it is not, but we hope it is still a useful abstraction).
## The model - mathematical formulation
We use a generalized linear model with logit link and hierarchical terms. Let us dive into the details.
For the model, we expand the data into long form, i.e. each row in the dataset corresponds to a combination of patient and reported phenotype (a patient with reported values for 3 phenotypes would correspond to 3 rows in the long form dataset). The model is specified with the following brms formula, using the Bernoulli family with logit link function:
```{r}
cat(strwrap(as.character(paste(as.character(main_model_def$formula)[c(2,1,3)], collapse = " ")), width = 80, prefix = "\n\t", initial = "" ))
```
This can be expressed in mathematical notation as:
$$
\begin{aligned}
Y_i &\sim Bernoulli(\mu_i) \\
\mathrm{logit}(\mu_i) &= \alpha + \beta^1_{p_i} + \beta^2_{p_i,g_i} + \beta^3_{p_i,s_i} + c_i\beta^4_{p_i}
\end{aligned}
$$
Where $p_i \in \{1, ..., P \}$ is the index of the phenotype for the $i$-th row, $g_i \in \{1,...,G\}$ is the index of the mutated gene for the $i$-th row and $s_i \in \{1,...,S\}$ is the index of the source study for the $i$-th row. $c_i$ is 1 when the mutation on the $i$-th row is cLOF and 0 otherwise. $\alpha$ is the intercept and $\beta^1, \beta^2$ and $\beta^3$ model the overall phenotype prevalence, phenotype prevalence specific to a given mutation and between-study variability in phenotype prevalence, respectively. $\beta^4$ models the phenotype-specific effect of cLOF.
Note that this is very similar to running a separate regression for each phenotype, with two exceptions: the overall intercept $\alpha$ is explicitly shared between phenotypes and the structure of the priors introduces some information flow between the other coefficients.
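The wide-to-long expansion described at the beginning of this section can be sketched as follows (a minimal illustration with made-up patients and phenotype columns; the actual transformation is implemented in `data_processing.R`):

```r
library(dplyr)
library(tidyr)

# toy wide-form data: one row per patient, NA = phenotype not reported
toy <- tibble(
  ID   = 1:3,
  gene = c("BBS1", "BBS2", "BBS1"),
  OBE  = c(1, NA, 0),
  RD   = c(1, 1, NA)
)

toy_long <- toy %>%
  pivot_longer(c(OBE, RD),
               names_to = "phenotype",
               values_to = "phenotype_value") %>%
  filter(!is.na(phenotype_value))  # rows with unreported phenotypes are dropped

nrow(toy_long)  # 4: patient 1 contributes two rows, patients 2 and 3 one row each
```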
The priors we use for the parameters are:
$$
\begin{aligned}
\alpha &\sim N(0, 2) \\
\beta^1 &\sim N(0, \sigma_1) \\
\sigma_1 &\sim N(0, 2) \\
\
\beta^2 &\sim \mathcal{N}_P(\boldsymbol{0}, \boldsymbol{\Sigma}) \\
\boldsymbol{\Sigma} &= \boldsymbol{\sigma}_2 \bar{\boldsymbol{\Sigma}} \\
\bar{\boldsymbol{\Sigma}} &\sim LKJ_P(1) \\
\sigma_{2,p} &\sim N(0, 2) \\
\
\beta^3_{p,s} &\sim N(0, \sigma_{3,p})\\
\sigma_{3,p} &\sim N(0, 2)\\
\
\beta^4_{p} &\sim N(0,2)
\end{aligned}
$$
Note that the prior on $\beta^2$ is a $P$-dimensional multivariate normal $\mathcal{N}_P$, explicitly modelling the correlation $\bar{\boldsymbol{\Sigma}}$ between the prevalence of individual phenotypes and the per-phenotype variance $\sigma_{2,p}$, while the other priors are univariate normal.
The $N(0, 2)$ priors on the various parameters are mildly skeptical in that they make it unlikely for any single parameter to correspond to an odds ratio larger than $\sim 50$. In `r multiverse_part`, we show that the results are almost identical when different priors are used.
## Main results
First, a summary of the model fit as posterior intervals for main model parameters (omitting correlation parameters for brevity):
```{r}
fit <- fit_base_model(main_model_def, data_long)
s <- summary(fit)
s$random$gene <- s$random$gene[!grepl("^cor", rownames(s$random$gene)),]
s
```
```{r}
data_for_prediction <- data_for_prediction_base_model(main_model_def, genes_to_show, phenotypes_to_show)
data_for_prediction_BBSome <- data_for_prediction %>% filter(functional_group_for_gene(gene) == "BBSome", gene != "BBS18")
```
The fitted model parameters themselves are however hard to interpret, as they operate on the log odds scale and there might be notable covariance in the posterior. It is also hard to say how to handle the between-study variability of coefficients. And this variability is substantial - note that the `sd` parameters under `~source` (corresponding to $\sigma_3$) admit ranges from $\sim 1.1$ to $\sim 5.1$, so the odds of a phenotype given a mutation can plausibly differ between studies by $1.96 \times \pm 1.1 = \pm 2.16$ to $1.96 \times \pm 5.09 = \pm 9.98$ _on the log scale_ (95% of the mass of a normal distribution is within $1.96 \times \sigma$ of the mean).
Instead, we will focus on model predictions. In particular, the results we report can be interpreted as if a new study is drawn at random from the same population of studies as we used (i.e. matching all the inclusion criteria) and we directly observe true odds of all phenotypes for all mutations in this study. That is, the predictions do include between-study variability, our uncertainty about the population of studies, our uncertainty about overall prevalence of the individual phenotypes, our uncertainty about the strength of links between mutations and phenotypes and our uncertainty about correlations between the presence of individual phenotypes. The predictions do NOT include the sampling uncertainty of the hypothetical new study. For example when the hypothetical study has the true odds of a phenotype, given a mutation in BBS12, equal to $1:2$ ($0.5$), a study on 20 patients can easily observe odds of $3:17$ ($0.18$) or $11:9$ ($1.22$) simply due to chance, but we will treat the hypothetical study here as having odds of $0.5$ and ignore this additional noise in our results.
Since those odds can vary wildly between studies, we will focus on various odds ratios (OR) within a single hypothetical study.
## Summary for functional groups
```{r, fig.width=8, fig.height=6}
differences_func_group_plot <- plot_gene_phenotype_differences_estimates(fit, data_for_prediction %>% mutate(group = functional_group), genes_to_show = genes_to_show, group_title = "Functional group", data_original = data_long %>% mutate(group = functional_group))
differences_func_group_plot
```
The plot above shows posterior 95% (thin) and 50% (thick) credible intervals for the ratio of the odds of a phenotype given a random mutation within a functional group to the odds of the phenotype given a random mutation across all groups shown. All mutations are assumed to be equally likely - the odds are not weighted by the frequency of the mutations in the dataset. Odds ratios are shown on the log scale and each phenotype has its own scale. Gray dots show the same odds ratio calculated for individual studies included in the meta-analysis. Dots outside the dashed lines correspond to studies where the empirical odds ratio is 0 or infinity. Dot size represents the number of relevant cases in the study.
## Summary for BBSome genes
```{r, fig.width=10, fig.height=7}
differences_bbsome_plot <- plot_gene_phenotype_differences_estimates(fit, data_for_prediction_BBSome, genes_to_show = genes_to_show, data_original = data_long, wider = TRUE) + theme(axis.text.x = element_text(angle = 45, hjust = 0.7, vjust =0.7))
differences_bbsome_plot
```
The plot above shows posterior 95% (thin) and 50% (thick) credible intervals for the ratio of the odds of a phenotype given a mutation in a gene to the odds of the phenotype given a random mutation across all genes shown. All mutations are assumed to be equally likely - the odds are not weighted by the frequency of the mutations in the dataset. Odds ratios are shown on the log scale and each phenotype has its own scale. Gray dots show the same odds ratio calculated for individual studies included in the meta-analysis. Dots below the dashed lines correspond to studies where the empirical odds ratio is 0 or infinity. Dot size represents the number of relevant cases in the study.
## Summary for Chaperonins
```{r, fig.width=8, fig.height=6}
data_for_prediction_chaperonins <- data_for_prediction %>% filter(functional_group_for_gene(gene) == "Chaperonins")
differences_chaperonins_plot <- plot_gene_phenotype_differences_estimates(fit, data_for_prediction_chaperonins, data_original = data_long)
differences_chaperonins_plot
```
This is the same plot as for the BBSome genes, only showing the members of the chaperonins group.
## Pairwise comparisons of mutations in BBSome genes
The summary plots above are not well suited to infer pairwise comparisons, as the estimates for the individual genes / functional groups are not independent. In particular, non-overlapping marginal credible intervals in the summary plot imply that there is a consistent difference, but the converse is not true. If there is a strong positive correlation, there might be a consistent difference even when the above plot would show mostly overlapping marginal posterior intervals.
Pairwise comparisons also have the benefit of better interpretability, as we do not need to rely on odds ratio of the phenotype against some average, which might not be clinically meaningful. In pairwise comparisons, we can directly work with odds ratios for the phenotype given the two mutations in question.
### Conservative estimates
```{r, fig.width=8, fig.height=6}
plot_pairwise_differences(main_model_def, fit, data_for_prediction_BBSome, "heatmap_min")
```
The most conservative (closest to one) pairwise odds ratios within 95% posterior credible intervals.
The reported odds ratios are for gene on the horizontal axis against the gene on the vertical axis.
This shows the pairs where we are fairly certain there is a systematic difference, together with the minimal magnitude of this difference consistent with the data.
### Extreme estimates
```{r, fig.width=8, fig.height=6}
plot_pairwise_differences(main_model_def, fit, data_for_prediction_BBSome, plot_type = "heatmap_max")
```
The most extreme (furthest from one) pairwise odds ratios within 95% posterior credible intervals.
The direction of the effect is not reported as effects in both directions might be similarly plausible - the odds ratios are transformed to be larger than one in all cases.
This shows the maximal differences consistent with our model and lets us constrain the differences between mutations for some phenotypes. We see that the data does not let us put tight constraints on most differences - the tightest we get is an OR of 3, which would still be a very important difference for the clinical prognosis.
### Everything at once
```{r, fig.width=10, fig.height=8}
plot_pairwise_differences(main_model_def, fit, data_for_prediction_BBSome, plot_type = "linerange_all")
```
Posterior 95% (thin) and 50% (thick) credible intervals for the odds ratio of a phenotype given a mutation in the gene on the horizontal axis against the gene on the vertical axis. Color indicates the widest central posterior credible interval that does not include one. We deliberately do not apply any strong cutoff (such as the 95% interval excluding one), as the tail probabilities have high variance (e.g., where there is less data, one extra positive case could plausibly move this quantity from say 93% to 96%). Odds ratios are shown on the log scale.
This plot integrates the information shown in the plots above, and some more.
## Type of mutation - loss of function
What is the effect of whether the mutation certainly leads to the complete loss of function of the protein? Since this information (further referred to simply as cLOF, for certain loss of function) does not vary per gene, let us look at the corresponding odds ratio per phenotype.
```{r, fig.width = 8, fig.height = 3}
# Here the LOF coefficient directly corresponds to the odds ratios, so we do not use predictions, but it is equivalent
summary_50 <- summary(fit, prob = 0.50)$fixed %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  mutate(lower_50 = exp(`l-50% CI`), upper_50 = exp(`u-50% CI`)) %>%
  select(parameter, lower_50, upper_50)
summary_wide <- summary(fit, prob = posterior_interval)$fixed %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  mutate(lower = exp(`l-95% CI`), upper = exp(`u-95% CI`)) %>%
  select(parameter, lower, upper)
summary_50 %>%
  inner_join(summary_wide, by = c("parameter" = "parameter")) %>%
  filter(parameter != "Intercept") %>%
  mutate(phenotype = gsub("(phenotype)|(:loss_of_function_certain)", "", parameter)) %>%
  ggplot(aes(x = phenotype, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 1, color = "darkred") +
  geom_linerange(aes(ymin = lower_50, ymax = upper_50), size = 2) +
  geom_linerange() +
  scale_y_log10("Odds ratio cLOF vs. others", labels = simple_num_format) +
  base_theme
```
For most phenotypes we (expectedly) see that the phenotype is more likely in cLOF mutations. For LIV, RD and REP, the data is not very conclusive. The surprising part is the high posterior probability assigned to cLOF mutations having a less severe phenotype in HEART. This might nevertheless be due to biases in reporting or due to higher lethality of the HEART phenotype in cLOF mutations (which would keep such patients out of the dataset). It is however likely not a result of low sample size: there are `r sum(!is.na(data$HEART))` patients with the HEART phenotype reported, spread roughly equally between cLOF and other mutations.
```{r saving_figures_for_publication}
std_width = 8
std_height = 6
img_path = here("tmp_pics")
if(!dir.exists(img_path)) {
dir.create(img_path)
}
#types = c(".svg", ".png")
types = c(".pdf")
for(type in types) {
differences_func_group_plot %>%
  ggsave(paste0("functional_groups", type), plot = ., path = img_path, height = std_height, width = 4.2)
differences_bbsome_plot %>%
  ggsave(paste0("bbsome_overall", type), plot = ., path = img_path, height = std_height, width = 8)
differences_chaperonins_plot %>%
  ggsave(paste0("chaperonins_overall", type), plot = ., path = img_path, height = std_height, width = std_width)
plot_pairwise_differences(main_model_def, fit, data_for_prediction_BBSome, plot_type = "heatmap_min") %>%
  ggsave(paste0("heatmap_min", type), plot = ., path = img_path, height = std_height, width = std_width)
plot_pairwise_differences(main_model_def, fit, data_for_prediction_BBSome, plot_type = "heatmap_max") %>%
  ggsave(paste0("heatmap_max", type), plot = ., path = img_path, height = std_height, width = std_width)
plot_pairwise_differences(main_model_def, fit, data_for_prediction_BBSome, plot_type = "linerange_all") %>%
  ggsave(paste0("linerange_all", type), plot = ., path = img_path, height = 8, width = 10)
}
```
## Discrepancies between frequentist and Bayesian analysis
Here we try to understand why the frequentist and Bayesian analyses provided different results for some of the questions we asked. In all cases the answer is that there is substantial between-study variability which, when taken into account, prevents us from making firm conclusions about some of the comparisons that are significant in the frequentist analysis (which ignores between-study variability).
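The mechanism can be sketched with a deterministic toy example (made-up counts, not the study data): when two genes have identical phenotype prevalence within each study, but one gene was mostly reported by a study with a high baseline prevalence, pooling manufactures an apparent gene effect that vanishes once the study is accounted for. Here we adjust crudely via fixed study effects; the main model uses random study intercepts, but the logic is the same.

```r
# Made-up counts. Within each study both genes have the same prevalence
# (0.8 in s1, 0.2 in s2), but gene A was mostly reported in s1.
cells <- expand.grid(gene = c("A", "B"), study = c("s1", "s2"), y = c(1, 0))
cells$n <- c(20, 4, 1, 5,  # y = 1: (A, s1), (B, s1), (A, s2), (B, s2)
             5, 1, 4, 20)  # y = 0: same order
d <- cells[rep(seq_len(nrow(cells)), cells$n), c("gene", "study", "y")]

# Pooled analysis ignores the study; the gene coefficient is strongly negative
pooled_coef <- coef(glm(y ~ gene, family = binomial, data = d))[["geneB"]]
# Adjusting for study removes the apparent gene effect entirely
adjusted_coef <- coef(glm(y ~ gene + study, family = binomial, data = d))[["geneB"]]
round(c(pooled = pooled_coef, adjusted = adjusted_coef), 3)
```

The real data is of course noisier than this extreme case, but the cases below follow the same pattern to varying degrees.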
### CI phenotype and BBS7
Looking at raw proportions in the whole dataset, BBS7 has the highest prevalence of the CI phenotype (and the difference was significant for the frequentist analysis):
```{r}
bbs_genes_to_check <- paste0("BBS0", c(1,2,4,5,7,8,9))
data_long %>% filter(phenotype == "CI", gene %in% bbs_genes_to_check) %>% group_by(gene) %>%
summarise(prevalence_CI = mean(phenotype_value), n = n()) %>% arrange(desc(prevalence_CI)) %>%
kable()
```
The picture however gets complicated when we look at individual studies:
```{r, fig.width = 6, fig.height = 2.5}
plot_diff_average_to_single_gene <- function(data_long, target_phenotype, target_gene, genes_to_check, others_label = "others") {
pos = position_jitterdodge(dodge.width = 0.3, jitter.width = 0, jitter.height = 0.02)
data_long %>% filter(phenotype == target_phenotype, gene %in% genes_to_check) %>%
mutate(type = if_else(gene == target_gene, target_gene, others_label)) %>%
group_by(type, source) %>%
summarise(proportion = mean(phenotype_value), num_cases = length(phenotype_value)) %>%
droplevels() %>%
ggplot(aes(x = type, y = proportion, group = source, color = source)) +
    geom_line(alpha = 0.5, position = pos) +
    geom_point(aes(size = num_cases), position = pos, alpha = 0.5) +
    guides(color = FALSE) +
    scale_y_continuous(paste("Prevalence", target_phenotype)) +
base_theme
}
plot_diff_average_to_single_gene(data_long, "CI", "BBS07", bbs_genes_to_check)
```
Here, each dot is a study, with size indicating the number of cases. We show the prevalence for BBS7 and the prevalence across the other mutations; lines connect estimates from the same study. This shows that while BBS7 indeed has a high CI prevalence (100% in most studies), in some studies the CI prevalence for BBS7 is lower than for the other mutations:
```{r}
data_long %>% filter(phenotype == "CI") %>% group_by(source) %>%
summarise(prevalence_CI_BBS07 = mean(phenotype_value[gene == "BBS07"]), prevalence_CI_others = mean(phenotype_value[gene != "BBS07"]), n_BBS07 = sum(gene=="BBS07"), n_others = sum(gene != "BBS07")) %>% filter(prevalence_CI_BBS07 < prevalence_CI_others) %>% arrange(desc(prevalence_CI_BBS07)) %>%
kable()
```
This includes the Ullah et al. study, which has the largest number of BBS7 cases with a reported CI phenotype in the whole dataset. So while there is some evidence for BBS7 having an unusually high prevalence of CI, we cannot rule out that this is caused by between-study variability.
The above plot showed the total prevalence across all other mutations; we can also compare the prevalence between individual mutations:
```{r, fig.width = 8, fig.height = 3}
plot_all_diffs_to_single_gene <- function(data_long, target_phenotype, target_gene, genes_to_check) {
pos = position_dodge(width = 0.3)
data_long %>% filter(phenotype == target_phenotype, gene %in% genes_to_check) %>%
group_by(source, gene) %>%
summarise(proportion = mean(phenotype_value), num_cases = length(phenotype_value)) %>%
group_by(source) %>%
filter(target_gene %in% gene) %>%
mutate(diff = proportion[gene == target_gene] - proportion) %>%
ungroup() %>%
filter(gene != target_gene) %>%
droplevels() %>%
ggplot(aes(x = gene, y = diff, group = source, color = source)) +
#geom_line(alpha = 0.5, position = pos) +
geom_point(aes(size = num_cases), position = pos, alpha = 0.5) +
guides(color = FALSE) +
scale_y_continuous(paste("Difference in prevalence of", target_phenotype, "to",target_gene)) +
base_theme +
theme(axis.title.y = element_text(size = 10))
}
plot_all_diffs_to_single_gene(data_long, "CI", "BBS07", bbs_genes_to_check)
```
Once again, each dot is a study; the dot size corresponds to the number of patients with the mutation on the x-axis, and the y-axis shows the difference in prevalence against BBS7 (positive values mean BBS7 has the higher prevalence). Most studies show no difference or a positive one, but some show a negative one, and for BBS8 there are actually more studies showing an effect in the opposite direction. This means the model can conclude neither that BBS7 has a higher prevalence of CI than each other mutation individually, nor (a weaker claim) that it has a higher CI prevalence than the average of the other mutations.
### HEART phenotype and BBS2
The case of BBS2 is similar to that of BBS7: looking at proportions across the whole dataset, BBS2 has the highest prevalence of HEART (significant in the frequentist analysis), but the sample sizes are even smaller than in the previous case:
```{r}
data_long %>% filter(phenotype == "HEART", gene %in% bbs_genes_to_check) %>% group_by(gene) %>%
summarise(prevalence_HEART = mean(phenotype_value), n = n()) %>% arrange(desc(prevalence_HEART)) %>%
kable()
```
Let's look at BBS2 compared to the prevalence across all other genes:
```{r, fig.width = 6, fig.height = 2.5}
plot_diff_average_to_single_gene(data_long, "HEART", "BBS02", bbs_genes_to_check)
```
We see one study showing higher prevalence, one showing lower prevalence, and two showing equal prevalence. All the other studies had either only BBS2 cases or no BBS2 cases at all, and so cannot directly contribute to the comparison, despite showing a low prevalence of HEART. Once again, we cannot rule out that the difference between BBS2 and the other genes in the HEART phenotype is simply due to between-study variability.
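The logic of which studies can inform such a comparison can be sketched in a few lines, using hypothetical study rosters (not the actual sources): only a study reporting both the target gene and at least one comparator provides a within-study contrast.

```r
# Hypothetical study rosters (illustration only)
rosters <- list(
  s1 = c("BBS02", "BBS01"),  # has BBS2 and a comparator: informative
  s2 = c("BBS02"),           # BBS2 only: no within-study contrast
  s3 = c("BBS01", "BBS04")   # no BBS2: no within-study contrast
)
# A study is informative if it reports the target gene plus at least one other
informative <- vapply(
  rosters,
  function(genes) "BBS02" %in% genes && length(setdiff(genes, "BBS02")) > 0,
  logical(1)
)
names(rosters)[informative]  # "s1"
```

Studies like `s2` and `s3` still shift the overall pooled proportions, which is exactly how between-study variability can masquerade as a gene effect.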
### LIV BBS2 to BBS1 comparison
In the data BBS2 has higher LIV prevalence than BBS1 (again statistically significant, despite the small sample sizes):
```{r}
data_long %>% filter(phenotype == "LIV", gene %in% c("BBS01","BBS02")) %>% group_by(gene) %>%
summarise(prevalence_LIV = mean(phenotype_value), num_cases = n()) %>%
kable()
```
Let's plot the between-study variability:
```{r, fig.width = 6, fig.height = 2.5}
plot_diff_average_to_single_gene(data_long, "LIV", target_gene = "BBS01", genes_to_check = c("BBS02", "BBS01"), others_label = "BBS02")
```
There is one study that shows an increase for BBS2 and one that shows a decrease, and the one showing a decrease is actually the study with the larger sample size. The difference seems to be driven by studies that have BBS1 patients and no or very few BBS2 patients, and can thus be attributed to between-study variability.
### LIV phenotype and functional groups
For the LIV phenotype the frequentist analysis disagrees with the Bayesian analysis presented here. In particular, the frequentist analysis shows BBS3 as the least likely to result in LIV and chaperonins as the most likely, while the analysis shown here reports exactly the opposite trend. First, let's look at the aggregate frequencies of the LIV phenotype:
```{r}
data_long %>% filter(phenotype == "LIV") %>% group_by(functional_group) %>%
summarise(proportion_LIV = mean(phenotype_value)) %>% kable()
```
Indeed, this supports the frequentist conclusions (it is in fact what they are based on). However, this aggregate view ignores the between-study variability, so let's look at what the individual studies show:
```{r, fig.width= 8, fig.height=2.5}
pos = position_jitterdodge(dodge.width = 0.3, jitter.width = 0, jitter.height = 0.02)
plot_liv_source <- data_long %>% filter(phenotype == "LIV", functional_group != "Others") %>%
group_by(functional_group, source) %>%
summarise(proportion_LIV = mean(phenotype_value), num_cases = length(phenotype_value)) %>%
droplevels() %>%
ggplot(aes(x = functional_group, y = proportion_LIV, group = source, color = source)) +
  geom_line(alpha = 0.5, position = pos) +
  geom_point(aes(size = num_cases), position = pos, alpha = 0.5) +
  guides(color = FALSE) +
  scale_y_continuous("Prevalence LIV") +
base_theme
plot_liv_source
for(type in ".png") {
ggsave(paste0("liv_by_source", type), plot = plot_liv_source, path = img_path, height = std_height, width = std_width)
}
```
In the plot above, each dot is the proportion of patients with LIV given a mutation in a gene from one of the functional groups in a single study. Lines connect values from the same study (note that only one study had mutations from all groups, and many had mutations in just one group); points from the same study also share a color. Point size represents the number of patients. We added some vertical jitter to differentiate the points; notably, all the points near zero and one are in fact exactly zero and one.
This plot tells a different story: most individual studies, especially those with a larger number of cases, show an increase in the LIV phenotype for the BBSome relative to chaperonins. There is only one larger study including both BBS3 and BBSome mutations, and it has more LIV-positive cases for BBS3. The overall opposite trend is driven by a) the only larger studies reporting BBS3 having an unusually low proportion of LIV in general, and b) studies reporting mutations only for BBS3 or only for the BBSome having an unusually low proportion of LIV.
However, it seems that the evidence in this direction is not very compelling.