Metagenome means “multiple genomes”, and metagenomics is the culture-independent study of the genomic content of an environmental sample. With the advent of powerful and economical next-generation sequencing technology, sequencing has become cheaper and faster, and the study of genes and phenotypes is shifting from single organisms to whole communities present in natural environmental samples. Once sequence data are obtained from an environmental sample, the challenge is to process, assemble, and bin the metagenome data so as to obtain as accurate and complete a representation as possible of the populations present in the community, or a high-confidence draft assembly. In this paper we describe the existing bioinformatics workflow for processing metagenomic data. Next, we examine one way of parallelizing a sequence similarity program on a High Performance Computing (HPC) cluster, since sequence similarity search is the most frequently used technique throughout the metagenome data processing and analysis steps. To address the challenges involved in analyzing the result file produced by the sequence similarity program, we developed a web application called the Contig Analysis Tool (CAT). We then applied these tools and techniques to real-world virome metagenomic data, i.e., the genomes of all the viruses present in an environmental sample, obtained from microbial mats in hot springs in Yellowstone National Park. The assembly and binning of virome data are particularly challenging for three reasons: (1) existing databases contain few viral sequences to search against; (2) there is no reference genome; and (3) viruses have no phylogenetic marker genes like the ones present in bacteria and archaea. We show how we overcame these problems by performing sequence similarity searches using CRISPR data and by analyzing sequence composition using tetranucleotide analysis.
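To illustrate what sequence composition by tetranucleotide analysis can look like, the following is a minimal Python sketch, not the project's actual implementation. It assumes plain counting of overlapping 4-mers on a single strand and compares two contigs by the Euclidean distance between their frequency profiles. The function names and the example sequences are hypothetical, and binning tools often add refinements that are not shown here, such as merging reverse complements or normalizing the counts.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the DNA alphabet, in a fixed order.
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each tetranucleotide in a DNA sequence.

    Overlapping windows of length 4 are counted; windows containing any
    character other than A, C, G, or T (for example N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero profile.
        return {k: 0.0 for k in ALL_TETRAMERS}
    return {k: counts[k] / total for k in ALL_TETRAMERS}


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two tetranucleotide frequency profiles."""
    return sum((profile_a[k] - profile_b[k]) ** 2 for k in ALL_TETRAMERS) ** 0.5


if __name__ == "__main__":
    # Hypothetical contigs, used only to show the calls.
    contig_1 = "ACGTACGTTGCAACGTACGT"
    contig_2 = "GGGGCCCCGGGGCCCCGGGG"
    p1 = tetranucleotide_frequencies(contig_1)
    p2 = tetranucleotide_frequencies(contig_2)
    print(f"distance between contigs: {profile_distance(p1, p2):.3f}")
```

Contigs whose profiles lie close together under such a distance are candidates for the same bin, which is useful when no reference genome or marker gene is available.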
Gosrani, Sheetal, "Metagenome – Processing and Analysis" (2012). Master's Projects. 222.