The classification of genes in international databases needs reconsidering. How can a gene be matched reliably to the function of its corresponding protein without experimental testing? Researchers at the CEA's François Jacob Institute of Biology have found a way.
The human genome contains approximately 25,000 genes. That is a lot, more than bacteria in any case (around 4,000 genes), but surprisingly fewer than mice (30,000) and paramecia (40,000!). New sequencing technologies have enabled a leap in the inventory of genomes and the genes within them. All genes have a precise function carried out by way of the proteins resulting from their expression. But which genes are responsible for which functions? Testing the 88 million known genes recorded in the European database UniProt one by one to answer that question would be impossible.
"Scientist look at similarities between proteins to extrapolate from one protein function to another," explains Véronique de Berardinis, biologist at the François Jacob Institute of Biology. "But exactly how similar do two proteins need to be to say that they have the same function?" Often done automatically by computers, the determination of similarities is sometimes dubious due to a lack of experimental data, and thus a limited grasp of the complexity within the families of proteins, where a change in just a couple of amino acids can change the function of the entire protein. For example, some human proteins are annotated in databases in the same way that the equivalent protein in Escherichia coli is. Thus, "the predicted functions for a significant number of proteins are questionable," affirms Ms. de Berardinis. Inversely, different proteins can ultimately have the same activity, a phenomenon called convergence. "This is the case for two protein families, MetA and MetX, both involved in the fabrication of methionine, an essential amino acid for living organisms," underlines the biologist. "Both families were identified nearly 40 years ago, and both ensure a specific step in this metabolic pathway, but in different manners. To complete our understanding of the complexity of these two families, we selected around 100 representative proteins and tested their activities." What they found was that, contrary to what was thought to be known, many of the proteins in the two families had the same function. Thus, in this case, aligning functions and protein sequences was not sufficient. What finally gave a solution was the minutiose study of the three-dimensional structure of these enzymes and particularly their active sites (where the chemical reaction takes place).
"Our studies showed that functions are dependent on the topology of the active site," explains Ms. de Berardinis. Their results have worldwide repercussions because UniProt will now be based on the functional annotation rules proposed by the CEA's researchers. "The 10,000 annotations for the MetA and MetX proteins in UniProt have been updated and all new genomes will be correctly annotated for these two essential families," adds Ms. de Berardinis. Furthermore, the team's work showed that 10% of the MetX proteins are, in reality, involved in the biosynthesis of cysteine (another essential amino acid) by way of a previously unidentified molecule, O-succinyl-L-serine.
The team did not stop there. It also showed that the MetA and MetX families came under evolutionary pressure twice in the past to converge toward a similar function.