The information technology and bioinformatics laboratories of the Institut de Génomique (IG) are involved in generating data and in their primary and secondary processing. In these initial stages, the data produced by the sequencers are analyzed to assess their quality, then filtered, interpreted and annotated. The resulting information is then organized and distributed in a comprehensible form to the biology research teams.
System infrastructure, computation, storage and network
The IT infrastructure of Genoscope and the CNRGH is data-centric. Storage capacity, provided by network-attached file servers, is on the order of 1 PB (petabyte). The computation resources directly connected to the data servers are mainly dual-processor x86_64 servers (about 500 cores, typically 8 GB of memory per core).
Production management
Samples and sequencing operations are tracked by a laboratory information management system (LIMS) developed in-house or with contract organizations. These tools provide daily monitoring of operations and track every process from sample receipt through DNA extraction to sequencing and computational analysis. They also centralize the metrics used for quality control of the generated data.
Data production quality control
The 'raw' data generated by sequencing are processed by a set of IT procedures that compute 'quality metrics' intended to verify that the sequencing operations were carried out in compliance with the specifications. The results of these calculations (e.g. 'sequence coverage rate', duplication rate, contamination) are then reviewed and validated by the quality team before the data are released to the scientific teams.
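As an illustration only, two of the metrics named above could be computed along the following lines. This is a minimal sketch and not the procedure used in production: it assumes that coverage means average depth (total sequenced bases divided by target size) and that a duplicate is a read whose sequence exactly matches an earlier read. The function names and the thresholds passed to `passes_qc` are hypothetical.

```python
def mean_coverage(read_lengths, target_size):
    """Average depth: total sequenced bases divided by the target size in bases.

    Assumption: 'coverage' is interpreted here as mean depth.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return sum(read_lengths) / target_size


def duplication_rate(read_sequences):
    """Fraction of reads that are exact copies of another read in the set.

    Assumption: duplicates are detected by identical sequence only;
    real tools usually rely on alignment positions instead.
    """
    total = len(read_sequences)
    if total == 0:
        return 0.0
    unique = len(set(read_sequences))
    return (total - unique) / total


def passes_qc(coverage, duplication, min_coverage, max_duplication):
    """Compare computed metrics with thresholds (hypothetical values)."""
    return coverage >= min_coverage and duplication <= max_duplication


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "TTGA", "CCGA"]
    cov = mean_coverage([len(r) for r in reads], target_size=8)
    dup = duplication_rate(reads)
    print(cov, dup, passes_qc(cov, dup, min_coverage=1.5, max_duplication=0.3))
```

In practice the thresholds would come from the sequencing specifications, and the computed values would still go to the quality team for review rather than being accepted automatically.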
Bioinformatics pipelines
Data interpretation is carried out by suites of software known as bioinformatics pipelines. The scientific software is developed by our teams and by a broad scientific community, which makes numerous tools available to the biology teams. At the CNG, we support pipelines in various fields, including:
- Detection and annotation of variants: the varscope pipeline detects and automatically annotates the polymorphisms present in a genome from reads obtained by Illumina sequencing. The data may come from targeted sequencing, exome sequencing or whole-genome sequencing. The pipeline detects polymorphisms of various types: point mutations, single-nucleotide polymorphisms (SNPs), small insertions and deletions, copy number variations (CNVs), etc.
- Analysis of tumor genomes: a specific extension of the pipeline catalogs the mutations that appear in the genome of tumor cells (somatic mutations) by comparison with the genome of healthy tissue (germline genome).
- RNAseq analysis: the pipeline analyzes RNAseq data (obtained by sequencing messenger RNA). It estimates gene expression levels in the samples, maps genetic mutations and identifies splicing events that modify or potentially impair the function associated with a gene.
- Epigenetics: this pipeline maps the data generated by bisulfite sequencing and identifies the methylated positions in the genome. It also enables researchers to analyze the methylation rates of their samples.
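
Two of the ideas in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code of the pipelines themselves. A variant is represented as a hypothetical `(chromosome, position, ref, alt)` tuple, somatic candidates are taken to be the variants called in the tumor but absent from the matched healthy tissue, and the methylation rate at a cytosine is the fraction of bisulfite reads in which that cytosine remained unconverted. All function names are illustrative.

```python
def somatic_candidates(tumor_variants, normal_variants):
    """Variants called in the tumor sample but not in the healthy tissue.

    Assumption: each variant is a hashable (chrom, pos, ref, alt) tuple.
    Real somatic callers also model read depth, allele fraction and
    sequencing error; a plain set difference is only a first approximation.
    """
    return set(tumor_variants) - set(normal_variants)


def methylation_rate(methylated_reads, unmethylated_reads):
    """Fraction of reads supporting methylation at one cytosine.

    After bisulfite treatment, methylated cytosines are still read as C,
    while unmethylated cytosines are read as T.
    Returns None when the position has no coverage.
    """
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total


if __name__ == "__main__":
    tumor = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")]
    normal = [("chr1", 100, "A", "T")]
    print(somatic_candidates(tumor, normal))
    print(methylation_rate(methylated_reads=3, unmethylated_reads=1))
```

Returning `None` for an uncovered position, rather than 0, keeps "no data" distinct from "unmethylated" in downstream summaries.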