Uses in bioinformatics
Managing sequence data on the command line
This section is aimed at those wanting to use the command line, and perhaps computing clusters, to work with sequence information (that is, genetic, genomic, proteomic and other bioinformatic data). While we recommend using a dedicated programming language for more complex work (code written this way is easier to share, test and replicate), the command line remains a useful environment for acquiring and managing data, especially as the first part of an analysis pipeline.
Sequence archives and repositories
Publicly funded research that generates sequence information is generally obliged to share its data, and this is typically done by depositing it with one of the major sequence repositories, for example GenBank at the National Center for Biotechnology Information (NCBI), or Ensembl at the European Bioinformatics Institute (EBI). Given the scale of these repositories, and the varied working practices of labs across the world, you will sometimes come across poorly annotated, mislabeled or otherwise inaccurate data. Keep this in mind, but remember that the vast majority of deposited data is accurate and well catalogued!
Exercises
Let’s work with some real data to answer some real research questions. Firstly, we are going to acquire a small genome to work with. Go to NCBI GenBank, and in the search dropdown menu, select Genome. While you are here, notice the other databases that NCBI hosts (each of the items in the dropdown is a database). Many of these are quite niche, but common ones to use are Genome, Gene, Protein and SRA, the latter for raw, unassembled sequence reads.
With Genome as your search database, enter “SARS-CoV-2”, and run the search. You should get one result (if you get more, choose the top result). Note the information you get here, and follow the link to the reference genome by clicking “Severe acute respiratory syndrome coronavirus 2”, near the top of the page. On this screen, download both the nucleotide and amino acid versions of the genome by clicking the links next to “Download sequences in FASTA format”. Move these two files to a suitable folder to work in, and unzip them. Note their file extensions: faa is “FASTA, amino acids” and fna is “FASTA, nucleotides” (these are just plain text files, like almost all files we work with on the command line). FASTQ files hold the same kind of sequence data as FASTA, but add a quality score for each base, and typically contain raw, untrimmed reads.
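The move-and-unzip step above might look something like the sketch below on the command line. It assumes the downloads are gzip-compressed (.gz); if yours arrive in another format, a different tool will be needed. The file names genome.fna.gz and proteome.faa.gz are hypothetical placeholders, as are the toy sequences, which are created only so that the sketch runs on its own.

```shell
# Hypothetical names throughout: real downloads have longer, accession-based names.
workdir="$(mktemp -d)"   # stand-in for a project folder of your choosing

# Create two tiny gzip-compressed placeholders so this sketch is self-contained.
# With real data, you would move the downloaded files into the folder instead.
printf '>toy_genome made-up record\nACGTACGTAC\n' | gzip > "$workdir/genome.fna.gz"
printf '>toy_protein made-up record\nMKTLLV\n' | gzip > "$workdir/proteome.faa.gz"

# Decompress: gunzip replaces each .gz file with its uncompressed version.
gunzip "$workdir/genome.fna.gz" "$workdir/proteome.faa.gz"

# List what we now have, then peek at the first line of each file
# to confirm they really are plain text.
ls -lh "$workdir"
head -n 1 "$workdir/genome.fna" "$workdir/proteome.faa"
```

After this, the folder should contain one .fna and one .faa file, and the .gz versions should be gone.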
We’ll start with the nucleotide genome: view the file with less to get a feel for what the data looks like.
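Alongside paging through the file with less, a few non-interactive commands can summarise what it contains. The sketch below is one way to do this rather than the only way. It uses a made-up toy sequence in a temporary file so that it runs on its own; with real data, point the fna variable (a name chosen here for illustration) at your unzipped .fna file instead.

```shell
# Toy stand-in for the downloaded genome (made-up sequence, temporary file).
fna="$(mktemp)"
printf '>toy_genome made-up record\nACGTACGTAC\nGGCCTTAAGG\nACGT\n' > "$fna"

# Interactive viewing: -S stops long lines from wrapping; press q to quit.
# less -S "$fna"

# Header lines start with ">", so counting them gives the number of records.
n_records="$(grep -c '^>' "$fna")"

# Total sequence length: drop header lines, strip newlines, count characters.
# The final tr removes the padding that some versions of wc add.
n_bases="$(grep -v '^>' "$fna" | tr -d '\n' | wc -c | tr -d ' ')"

echo "records: $n_records"
echo "bases: $n_bases"
```

For the toy file this reports 1 record and 24 bases. A single-record genome file, as expected for a small virus, should likewise report one record.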
Now let’s have a look at the proteome. Again, open it with less and have a look at how it is structured and annotated, noting the differences compared to the nucleotide version of this genome.
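To compare the proteome with the nucleotide file, it can help to pull out just the header lines and the length of each protein. The following is a minimal sketch under the same assumptions as before: the two toy proteins and their annotations are made up, and faa is a hypothetical variable name that you would point at your unzipped .faa file.

```shell
# Toy stand-in for the downloaded proteome (made-up proteins, temporary file).
faa="$(mktemp)"
printf '>prot1 toy protein A [Toy virus]\nMKTLLV\n>prot2 toy protein B [Toy virus]\nMSDNGPQ\nNQR\n' > "$faa"

# Interactive viewing, as in the exercise; press q to quit.
# less "$faa"

# Show only the header lines, which carry the annotation for each protein.
grep '^>' "$faa"

# Count the proteins: one header line per record.
n_proteins="$(grep -c '^>' "$faa")"
echo "proteins: $n_proteins"

# Length of each protein: start a new tally at every header line,
# add up the sequence lines that follow, and print "identifier length".
lengths="$(awk '/^>/ { if (name) print name, len; name = substr($1, 2); len = 0; next }
                { len += length($0) }
                END { if (name) print name, len }' "$faa")"
echo "$lengths"
```

Unlike the nucleotide file, a proteome file normally holds many records, one per protein, so the header count here is the quickest way to see that difference.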