Growing demand for processing capacity (1,000, then 10,000, then 100,000 cores) and storage capacity (10, then 100, then 1,000 PB) requires technologies different from those used in departmental systems, even large ones:
- file servers and the NFS protocol must give way to distributed file systems such as Lustre or GPFS,
- the interconnection of processing nodes relies on low-latency technologies such as InfiniBand,
- the power supply and peripherals must be appropriate,
- the floor area must enable installation of the computers and their subsequent upgrades.
For all these reasons, we established a partnership with the CEA TGCC Very Large Computing Center at Bruyères-le-Châtel. The TGCC teams have mastered these technologies and give us access to configurations several orders of magnitude larger than a departmental system.
The CEA TGCC is an infrastructure dedicated to high-performance computing, capable of hosting petaflop-scale supercomputers and designed around a data-centric architecture. At the TGCC, the CCRT will deploy an extension dedicated to users of the France Génomique project.
The data storage and processing e-infrastructure set up by the CEA/DIF teams will give France Génomique users a medium-term storage space (on the scale of scientific projects lasting several years) of several petabytes, connected to several thousand scalar compute cores by a high-performance interconnect.
Shared with the CCRT's infrastructure, the e-infrastructure is also designed to be scalable so that it can meet the genomic challenges of the future.
Equipment and capacities
The France Génomique dedicated configuration is as follows:
- 180 dual-processor nodes (Intel Sandy Bridge E5-2680, 2.7 GHz, 8 cores per processor) with 128 GB of memory per node, i.e. 2,880 cores (Bull),
- 2 Bullx S6410 very-large-memory systems with 2 TB of memory,
- 9 hybrid blades fitted with NVIDIA Kepler GPUs.
This is an extension of the Airain configuration of the CCRT installed at the TGCC. Data hosting will be implemented using the following storage configuration:
- Medium-term storage with a global file system of 5 PB, of which 2 PB is on disk (Lustre + IBM HPSS hierarchical storage system),
- System for archiving the initial data.
Main achievements:
To characterize a set of 83 protein families with no known function, comprising some 60,000 sequences, the Genoscope researchers conducted a modeling campaign on the CCRT Titane supercomputer. This phase, which would have required 280,000 hours of sequential computation, was completed on 4,000 processors in just 70 hours. In addition to the results themselves, the researchers created a catalog of specific structural signatures for each of the families studied. The catalog will give biochemists valuable information for the discovery of new enzymatic activities.
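The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes ideal (linear) scaling, with no scheduling or I/O overhead; the variable names are illustrative only.

```python
# Figures taken from the text above.
total_cpu_hours = 280_000   # total computation for the modeling phase
processors = 4_000          # processors used on Titane

# Assumption: ideal (linear) scaling across all processors.
ideal_wall_clock_hours = total_cpu_hours / processors

# The same workload expressed as time on a single processor.
single_processor_years = total_cpu_hours / 24 / 365

print(f"Ideal wall-clock time: {ideal_wall_clock_hours:.0f} hours")
print(f"Single-processor equivalent: {single_processor_years:.0f} years")
```

Under that assumption the result matches the 70 hours reported, which suggests the workload parallelized almost perfectly.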
Genoscope has been using the computing resources of the TGCC/CCRT for several years, particularly via the DARI calls for projects. In that context, the TARA OCEANS project received over 3.5 million hours of computation to study the diversity of marine organisms. To that end, various sequence analysis tools were run on the infrastructure: BLAST, BLAT, InterProScan and CDDsearch. Specific codes were designed and deployed to adapt the tools to the technical constraints of operating on TGCC machines (massive data parallelization, execution monitoring, rerunning of failed jobs, short unit jobs).
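The adaptation pattern described above (splitting the data into short unit jobs, running them in parallel, and rerunning only those that fail) can be illustrated with a minimal sketch. This is not the code deployed at the TGCC: the function names, the thread pool and the simulated failure are all assumptions made for illustration, and the worker is a stand-in for a real analysis tool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split the input into small batches so that each job stays short."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_with_retries(chunks, worker, max_attempts=3, max_workers=4):
    """Run worker on every chunk in parallel; rerun only the chunks that failed."""
    results = {}
    pending = list(range(len(chunks)))
    for _ in range(max_attempts):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {idx: pool.submit(worker, chunks[idx]) for idx in pending}
        # Leaving the with-block waits for all submitted jobs to finish.
        failed = []
        for idx, future in futures.items():
            if future.exception() is None:
                results[idx] = future.result()
            else:
                failed.append(idx)
        pending = failed
    return results, pending


# Hypothetical worker: counts the records in a chunk and fails once,
# on its first call for the chunk starting with "seq3".
_seen = set()
_lock = threading.Lock()


def flaky_count_worker(chunk):
    key = chunk[0]
    with _lock:
        first_time = key not in _seen
        _seen.add(key)
    if first_time and key == "seq3":
        raise RuntimeError("simulated transient failure")
    return len(chunk)


sequences = [f"seq{i}" for i in range(10)]
chunks = split_into_chunks(sequences, chunk_size=3)
results, unresolved = run_with_retries(chunks, flaky_count_worker)
print(f"{len(results)} chunks completed, {len(unresolved)} unresolved")
```

Keeping each unit small limits the work lost when a job fails, since only that unit has to be rerun.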
Accreditations / Quality system:
The CEA/DIF teams have developed internationally recognized expertise in the management of very large volumes of data (contributions to open source developments, EOFS steering, etc.) and in the design and management of very large computing centers. User assistance and support teams are available to help users optimize their use of the center's resources.
A dedicated application support team has been set up by the Institute of Genomics (CEA) on behalf of France Génomique.