Some ideas on things to learn to get started with bioinformatics

ctmrbio/bioinformatics_curriculum

Bioinformatics curriculum

This document is an incomplete list of suitable topics to study to learn the basics of DNA sequencing-based bioinformatics.

Learn to work on the Linux command line

As most bioinformatics tools are written to run in a Linux environment, it is important to learn how to work on the command line. High-performance computing resources are also normally accessed via a terminal interface. There is a lot to learn, but once you know the 20 or so most used commands you can start to be productive on the command line!

  • Basic file handling and navigation: ls, cd, mkdir, cp, mv, rm, cat, less/more, chmod, etc...
  • At least one terminal editor: nano, emacs, vim
  • Learn to use terminal multiplexers: screen or preferably tmux
  • Advanced tools:
    • Pipes and input/output redirection (|, <, >, stdin, stdout, and stderr)
    • Shell scripts
    • Regular expressions in grep, sed, awk...
    • Non-standard power tools, e.g. GNU parallel, eBay's tsv-utils, etc.
  • Learn to work remotely over SSH
    • Connect to remote computers
    • Transfer files to/from a remote computer using a command line interface, e.g. scp, sftp, rsync, lftp
    • Work on shared cluster systems with job submission systems (e.g. Slurm on UPPMAX)
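
The regular expressions mentioned in the list above are worth practising early, since the same idea carries over between grep, sed, awk, and most programming languages (the exact syntax differs a little between tools). As a minimal sketch in Python, assuming some made-up FASTA-style header lines with an invented naming scheme, this keeps only the lines matching a pattern, roughly like `grep -E` would, and pulls out two fields:

```python
import re

# Hypothetical FASTA-style lines; the naming scheme is invented for illustration.
lines = [
    ">sample01_read1 length=150",
    "ACGTACGT",
    ">sample02_read1 length=98",
    "GGCCTTAA",
]

# Match header lines and capture the sample name and the read length.
header = re.compile(r"^>(sample\d+)_read\d+ length=(\d+)$")


def parse_headers(lines):
    """Return (sample, length) tuples for every line that matches the pattern."""
    hits = []
    for line in lines:
        match = header.match(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


print(parse_headers(lines))
```

Pasting the pattern into an online regex tester is a good way to see which part matches what.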

ExplainShell does an amazing job at explaining the different components of a command line. Try it out!

Linux books:

Programming

Knowing basic programming is essential for a bioinformatician. Programming is often used to handle input and output files, pre-process data files, create plots, and create workflows that run several different tools in a specified order. A well-constructed bioinformatics data analysis is reproducible, meaning that anyone can run the same analysis on a different computer using the same input files and get the same output. This is challenging in practice, so treat all scripts written in the course of a bioinformatics analysis project as the "log book" or "lab book" of how the analysis was actually performed. And while it can be rewarding to do a quick analysis of some output files at the command line, you should make it a habit to always put everything you do to the data in a script that you can come back to in the future, when you have forgotten exactly what you did.
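
To make the "lab book" idea a bit more concrete, here is a minimal sketch of a script that records how a result was produced along with the result itself. The function names, the `min_length` parameter, and the example sequences are all made up for illustration; a real project would read files from disk and probably log tool versions as well.

```python
import hashlib
import json


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def analyse(fasta_text, min_length=4):
    """Compute GC content per sequence and record how the result was produced."""
    sequences = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    kept = {
        n: round(gc_content(s), 3)
        for n, s in sequences.items()
        if len(s) >= min_length
    }
    return {
        # Checksum of the input, so a later reader can check the same data was used.
        "input_sha256": hashlib.sha256(fasta_text.encode()).hexdigest(),
        # The parameters that shaped the result.
        "parameters": {"min_length": min_length},
        "gc_content": kept,
    }


example = ">seq1\nACGT\nGGCC\n>seq2\nAT\n"
report = analyse(example)
print(json.dumps(report, indent=2))
```

Saving a report like this next to the output files means the parameters and input checksum travel with the results.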

Version control systems

There are several version control systems (also called revision control systems) that one can use to maintain a versioned history of, for example, program code. The most popular version control system in widespread use today is Git. Three common places to publish code are GitHub, GitLab, and Bitbucket. They all work pretty much the same.

There are some very good guides and tutorials listed below that will introduce you to the vocabulary and concepts concerning version control. Version control rocks(!) and is a crucial tool in a bioinformatician's tool belt, so take the opportunity to learn it as soon as possible. When you start writing code, you will eventually encounter a situation where you want to make changes to the code, but without losing the older version (that you know worked). Version control makes it possible to go back in time to older versions of the code, without having to mess with copies of files called my_code_version-20181015_final_final2.py. It will make your life so much easier and you will enjoy it!

Tutorials

Here are some nice introductions to version control:

Academic accounts on GitHub

GitHub has some nice resources for research/education. Check out their education portal! You can also get a free researcher account that enables unlimited free private repositories.

Python

Python is the most common (and in my opinion the easiest to learn) programming language. It is typically available on all Linux systems. Some resources for learning about Python in general:

There was a big debate a couple of years ago about which version of Python to learn. That discussion is no longer relevant: you should learn Python 3. Start by installing the latest available version (3.6 or later). There are several ways to download and install Python, but I recommend learning to use conda. There is a conda getting started guide that is OK once you are familiar with the command line.

Unfortunately, there is no de facto standard integrated development environment (IDE) for Python like there is for R (i.e. RStudio, see more below). The most common alternatives are probably Microsoft's Visual Studio Code and JetBrains' PyCharm; both are great and cross-platform. VS Code is free for everyone, and a free community edition (without professional support) is available for PyCharm. Another important Python programming tool you should learn is Jupyter. It is a tool for working with interactive programming notebooks where you can combine blocks of Markdown-formatted notes with individually executable code blocks (with inline plots!). It grew out of the IPython project but is not specific to Python: the name refers to the languages Julia, Python, and R (JuPyteR), and it now runs more than a hundred different language kernels. It is often used in bioinformatics analyses and is becoming more and more common as a way to share how analyses and plots were made for scientific papers.

R

R is by far the most commonly used language/environment for any type of data analysis that requires statistics. A bioinformatician has to be familiar with R. There is a very good Integrated Development Environment (IDE) available for R: RStudio. Ensure you become familiar with R, RStudio, and R Markdown (kind of like Jupyter notebooks, but focused on R).

Databases

There are several kinds of database systems, but the most common are relational database systems (often called SQL databases). There are others, especially NoSQL databases, that are gaining popularity (MongoDB is a NoSQL database that is seeing some use in bioinformatics applications). A bioinformatician can definitely benefit from learning the basics of SQL and of a NoSQL system.
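
A low-threshold way to try SQL is SQLite, which ships with Python as the `sqlite3` module and needs no server. The sketch below is only an illustration: the `samples` and `reads` tables and their contents are invented, and it runs entirely in memory. It shows the core relational idea of storing data in separate tables and combining them with a join.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, body_site TEXT)"
)
conn.execute(
    "CREATE TABLE reads (sample_id INTEGER REFERENCES samples(id), read_count INTEGER)"
)

# Made-up example rows.
conn.executemany(
    "INSERT INTO samples (id, name, body_site) VALUES (?, ?, ?)",
    [(1, "S1", "gut"), (2, "S2", "skin"), (3, "S3", "gut")],
)
conn.executemany(
    "INSERT INTO reads (sample_id, read_count) VALUES (?, ?)",
    [(1, 1000), (1, 500), (2, 700), (3, 300)],
)

# Join the two tables and sum the read counts per body site.
query = """
    SELECT s.body_site, SUM(r.read_count) AS total_reads
    FROM samples AS s
    JOIN reads AS r ON r.sample_id = s.id
    GROUP BY s.body_site
    ORDER BY s.body_site
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.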

A word on coding style

Using a consistent coding style is important to ensure code readability (you are going to read your code much more than you write it). Python has a style document called PEP 8 (Python Enhancement Proposal number 8), which is a great starting point for a standard Python coding style. Every Python programmer should read PEP 8 and try their best to follow it, to make it easy for other Python programmers to read and understand their code. In addition, have a look at The Zen of Python (i.e. PEP 20).
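
As a small illustration of why this matters, here are two versions of the same function. Both names and the example are made up for this sketch, and the second version follows only a handful of PEP 8's recommendations, so treat it as a taste and not as a summary of the document.

```python
# Hard to read: cryptic names, cramped spacing, several statements on one line.
def rc(s):
    d={'A':'T','C':'G','G':'C','T':'A'};r=''
    for b in s[::-1]:r+=d[b]
    return r


# Closer to PEP 8: descriptive snake_case names, a constant in capitals,
# spaces after colons and commas, one statement per line, and a docstring.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


print(reverse_complement("AACG"))
```

Both functions return the same result; only the second one is pleasant to come back to in six months.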

There are style guides for R and SQL as well. A decent style guide for R is explained in the R for Data Science book (see link below).

Workflows

Workflow managers are tools that help you write reliable and easy-to-use bioinformatics workflows. They make it easy to run several different programs one after another, or sometimes in parallel. This is an advanced bioinformatics topic that will be most useful after you have learnt the basics of programming (in either Python or R) and started using established bioinformatics tools to process your data.
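
At their core, workflow managers keep track of which steps depend on which, and run each step only after its dependencies are done. The sketch below illustrates just that idea with Python's standard `graphlib` module (Python 3.9+). The step names are hypothetical, and real workflow managers do much more (tracking files, resuming failed runs, submitting cluster jobs), so this is not how any particular tool is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the set of steps it depends on.
dependencies = {
    "quality_filter": set(),
    "remove_host_reads": {"quality_filter"},
    "taxonomic_profile": {"remove_host_reads"},
    "assemble": {"remove_host_reads"},
    "report": {"taxonomic_profile", "assemble"},
}


def run_workflow(dependencies, run_step):
    """Run every step after all of its dependencies; return the execution order."""
    order = []
    for step in TopologicalSorter(dependencies).static_order():
        run_step(step)
        order.append(step)
    return order


# Here "running" a step just prints its name.
executed = run_workflow(dependencies, lambda step: print("running", step))
```

Note that `taxonomic_profile` and `assemble` do not depend on each other, so a workflow manager would be free to run them in parallel.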

General data analysis

General techniques

  • Ordination: PCA/PCoA, NMDS, t-SNE, OPLS-DA etc.
  • Classification: LDA, Decision trees, Random Forests, SVM, ANN, ROC curves, supervised/unsupervised, etc.
  • Clustering: Hierarchical, k-means, etc.
  • GUSTAME is a very useful field guide to multivariate statistics
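
To give a feel for one of the techniques listed above, here is a bare-bones sketch of k-means clustering in plain Python. The points and starting centroids are made up, and in practice you would use an established library implementation with proper initialisation; the point is only to show the two alternating steps (assign points, then move centroids).

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2D points, starting from the given list of centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, clusters


# Two made-up, well-separated groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With real data the result depends on the starting centroids, which is why library implementations typically run several random restarts.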

Statistics

Python

R

Databases

Bioinformatics stuff

Overviews/course materials for 16S and shotgun

General sequencing stuff

16S read processing and OTU picking

16S taxonomic annotation

General 16S tools

Shotgun metagenomics

Online resources for bioinformatics questions
