There are billions of nucleotides in the human genome, and researchers once thought it could encode as many as 100,000 protein-coding genes. One of the main goals of the Human Genome Project was to identify protein-coding genes in the genomic sequence. When the vast majority of the sequence was completed around 2003, however, there seemed to be only about 20,000 protein-coding genes, which occupy only about two percent of the human genome. Since then, we've learned that there are other sequences with important functions that do not code for protein, such as regulatory RNA sequences. Now researchers have suggested that there are 7,200 gene segments in the human genome that may be used to generate new proteins. This work has been reported in Nature Biotechnology.
Many short DNA sequences called open reading frames (ORFs), stretches that begin with a start codon and run to a stop codon and so could in principle be read as instructions for a protein, have been found in the genome. There is evidence that some of these ORFs are transcribed, and that many have biological functions, but few are included in reference databases, and they have remained relatively obscure.
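To make the idea of an ORF concrete, here is a minimal Python sketch that scans a DNA string for stretches that start with ATG and end at the first in-frame stop codon. It is only an illustration and not the method used in the study, which relied on experimental data from many labs. The function name `find_orfs` and the `min_codons` parameter are hypothetical, and the sketch assumes the standard genetic code and looks only at the forward strand.

```python
# Illustrative only: a naive forward-strand ORF scan, not the study's method.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons (a hypothetical knob) counts the codons before the stop,
    including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first ATG seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATGACC"
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```

A scan like this finds candidate ORFs almost everywhere in a genome, which is part of why evidence that a ribosome actually engages with a sequence matters when deciding which ones deserve a place in a catalog.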
Researchers are seeking to add these ORFs to reference databases so that other scientists can find them when the sequences are relevant to their work. Scientists often compare the sequences they study against reference databases to learn more about them, such as whether they appear in other species or whether they carry mutations.
As a first step, the researchers assembled ORFs that interact with parts of the ribosome, an organelle that generates proteins from mRNA, into a standardized catalog, even though much of the data had been obtained by different labs in various ways.
The study authors wanted to answer some fundamental questions as well, such as exactly what constitutes a gene or protein, and whether ribosomes only generate proteins, or if ribosomes can also make other types of molecules. Now, they have suggested that reference databases for the human genome should be revised. Ensembl-GENCODE is integrating the new ORF catalog, and others such as UniProt and HGNC are supporting the effort.
"It's tremendously exciting to enable the research community with our new catalog," said Dr. Sebastiaan van Heesch, a group leader at the Princess Máxima Center for pediatric oncology. "It's too soon to say whether all of the unexplored sections of DNA truly represent proteins, but we can clearly see that something unexplored is happening across the human genome and that the world should be paying attention."
"For too long, the scientific community has been mostly left in the dark about these ORFs," said Jonathan Mudge of the European Bioinformatics Institute (EMBL-EBI). "We're very proud that our work will be able to let researchers across the world start to study them. This is the point at which they enter the mainstream of genomic and medical science, an effort which we expect to have wide-ranging ripple effects."
Sources: Max Delbrück Center for Molecular Medicine, Nature Biotechnology