-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
154 lines (107 loc) · 6.76 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
```{r make_sticker, echo = FALSE, out.width= "100%"}
library(hexSticker)
library(jcolors)
sticker(expression({plot.new(); text(0.5, 0.5, "F", font=2, cex = 10, col = scales::alpha("#0091b6", 0.35))}),
package="FreeHiCLite",
p_size = 7, s_x=0.8, s_y=.75, s_width=1.3, s_height=1,
p_y = 1.35,
h_size = 2,
h_color = scales::alpha(unname(jcolors()[5]), 0.75),
h_fill = scales::alpha(unname(jcolors()[5]), 1),
filename="man/figures/sticker.png")
```
<img src="man/figures/sticker.png" align="right" width="15%" height="15%" />
## Overview
The `FreeHiCLite` package is designed for simulate Hi-C contact matrix.
The original package [FreeHi-C](https://github.com/yezhengSTAT/FreeHiC) is short for **Fr**agment interactions **e**mpirical **e**stimation for fast simulation of **Hi-C** data. It is a data-driven Hi-C data simulator for simulating and augmenting Hi-C datasets. FreeHi-C employs a non-parametric strategy for estimating an interaction distribution of genome fragments and simulates Hi-C reads from interacting fragments. Data from FreeHi-C exhibit higher fidelity to the biological Hi-C data. FreeHi-C not only can be used to study and benchmark a wide range of Hi-C analysis methods but also boosts power and enables false discovery rate control for differential interaction detection algorithms through data augmentation. Different from FreeHi-C (v1.0), a spike-in module is added enabling the simulation of true differential chromatin interactions.
FreeHi-C is designed for studies that are prone to simulate Hi-C interactions from the real data and add deviations from the true ones. Therefore, FreeHi-C requires real Hi-C sequencing data (FASTQ format) as input along with user-defined simulation parameters. FreeHi-C will eventually provide the simulated genomics contact counts in a sparse matrix format (BED format) which is compatible with the standard input of downstream Hi-C analysis.
## Installation
Install the development version using the **devtools** package:
```{r, eval = FALSE}
devtools::install_github("baconzhou/FreeHiCLite")
```
## Usage
Load the package:
```{r, message = FALSE, warning = FALSE}
library(FreeHiCLite)
```
### Read hic data
The [`.hic`](https://github.com/aidenlab/juicer/wiki/Data) file is a highly compressed binary file, which is developed in the [Aiden Lab](http://aidenlab.org/). Which can be used in [juicebox](https://aidenlab.org/juicebox/) for contact matrix visualization.
The `.hic` file is formatted as [HiC Format](https://github.com/aidenlab/Juicebox/blob/master/HiCFormatV8.md). To program with `.hic` file, they provide [straw](https://github.com/aidenlab/straw) and [Dump](https://github.com/aidenlab/juicer/wiki/Data-Extraction) to extract the information from the `.hic` file. The `readJuicer()` adopts most from the C++ version of [straw](https://github.com/aidenlab/straw).
The `.hic` file only contains two units of resolution, and each unit contains a fix set of resolutions.
1. Base-pair-delimited resolutions (**BP**): 2.5M, 1M, 500K, 250K, 100K, 50K, 25K, 10K, and 5K.
2. Fragment-delimited resolutions (**FRAG**): 500f, 250f, 100f, 50f, 20f, 5f, 2f, 1f.
```{r read-hic}
## Remote file location. The reomte file include downloading, it may take a while
remoteFilePath <- "https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic"
## Local file location
localFilePath <- system.file("extdata", "example.hic", package = "FreeHiCLite")
## Chromosomes needs to be extra
chromosomes <- c("chr1", "chr2")
## Pairs needs to be extra
pairs <- c("1_1", "1_2")
unit <- "BP"
resolution <- 500000L
```
#### Extract hic information
We can extract some basic information from a `.hic` file via:
```{r hic-info}
juicerInfo <- readJuicerInformation(localFilePath, verbose = TRUE)
```
Besides, `juicerInfo()` also provides chromosome size information, if the genomeID is not one of the genome they provided. [check here](https://github.com/aidenlab/juicer/wiki/Pre#usage)
```{r hic-info-size}
head(juicerInfo[["chromosomeSizes"]])
```
#### Read hic file
To read a remote `.hic` file. Provide a remote url to `readJuicer()`. The full list of remote links can be found in [http://aidenlab.org/data.html](http://aidenlab.org/data.html).
```{r read-remote, eval =FALSE}
## pass chrosomes into function, it will contains all the interaction pairs
datRemote <- readJuicer(file=remoteFilePath, chromosomes=chromosomes, pairs = NULL, unit=unit, resolution=resolution)
```
```{r read-local}
## pass chrosomes into function, it will contains all the interaction pairs
datLoc <- readJuicer(file=localFilePath, chromosomes=chromosomes, pairs = NULL, unit=unit, resolution=resolution)
str(datLoc)
```
#### Write to disk file
The [juicebox](https://aidenlab.org/juicebox/) also provide a function [Pre](https://github.com/aidenlab/juicer/wiki/Pre) to generate `.hic` file from different format. In `FreeHiCLite`, we provide a function `writeJuicer()` to write the contact matrix into a [short with score format](https://github.com/aidenlab/juicer/wiki/Pre#short-with-score-format). You can use `convertJuicer()` to view the matrix, or directly use `writeJuicer()`.
```{r convert-juicer}
head(convertJuicer(datLoc[['contact']][[1]], "1", "1"))
```
```{r write, eval=FALSE}
writeJuicer(datLoc, file, overwrite = TRUE)
```
### Add spikeIn
To add spikeIn into contact matrix. The `FreeHiCLite` provides `FreeSpikeIn()` to add spikein.
```{r spikein}
kernelSmooth <- TRUE
bandwidth <- 50000L
## Create a random spikeIn
contact <- datLoc[['contact']][[1]]
Ns <- 0.1 * NROW(contact)
spikeIn <- contact[sample(1:NROW(contact), Ns), ]
spikeIn[,3] <- spikeIn[,3] * sample(seq(0, 10, 0.5), Ns, replace=TRUE) # 3rd is the counts
spikeInContact <- FreeSpikeIn(contact, spikeIn, kernelSmooth = kernelSmooth, bandwidth = bandwidth)
print(summary(contact[,3]))
print(summary(spikeInContact[,3]))
```
### Simulate from contact matrix
Here we use `FreeHiC()` function to perform simulation.
```{r}
simuContact <- FreeHiC(spikeInContact, seqDepth = 2 * sum(contact[,3]), resolution = 5000L)
print(summary(simuContact[,3]))
```
## References
1. Ye Zheng and Sündüz Keleş. ["FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation."](https://www.nature.com/articles/s41592-019-0624-3) Nature Methods (2019).
2. Neva C. Durand, Muhammad S. Shamim, Ido Machol, Suhas S. P. Rao, Miriam H. Huntley, Eric S. Lander, and Erez Lieberman Aiden. ["Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments."](https://www.cell.com/cell-systems/fulltext/S2405-4712(16)30219-8) Cell Systems 3(1), 2016.