The Human Genome Project was declared complete in 2003, but it wasn't exactly finished. Most of the sequence, about 92 percent, had been fully deciphered, particularly the sections that contain protein-coding genes. The genome also holds long stretches of repetitive sequences, however, that can be very difficult to unravel using traditional or advanced DNA sequencing techniques. Now the gaps in the sequence have finally been filled in. The work has been reported in Science.
Those repetitive sequences were once dismissed as "junk DNA," but researchers have been finding more sections of that junk that have important biological functions. Since they do not code for protein, studying them can be extremely challenging. But not only are they thought to be connected to some diseases, they may be essential to certain biological functions, making them important to understand.
The effort to map the elusive portions of the genome was named the Telomere-to-Telomere (T2T) Consortium, because the caps that sit on the ends of chromosomes and protect them are called telomeres. Like the dense middles of chromosomes, called centromeres, telomeres are also full of repetitive sequences that are hard to sequence. Those centromeres are also a critical part of DNA replication and cell division.
In the early days of sequencing, specific sections of the genome could be amplified; a selected sequence was targeted with short strands of DNA called primers, which match short sections on the ends of those specific sequences. Once the sequence has been amplified into many copies by an enzyme, each base can be tagged with a fluorescent molecule, and a machine then reads the sequence of fluorescent colors as bases of DNA.
More advanced sequencing methods took a different approach. In next-gen sequencing, portions of the genome are chopped into tiny parts that are then sequenced and finally assembled together like puzzle pieces to create a long sequence. Repetition in the genome is difficult for both methods to deal with, so a third-generation sequencing technique was engineered. In third-generation or nanopore sequencing, much longer reads are possible. A single molecule of DNA is passed through a nanopore, and every base is read electronically.
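To see why repeats trip up short reads, consider a toy illustration. The sketch below is not taken from the T2T work; the sequences, the read lengths, and the reads helper are all made up for the example, and the reads are idealized (error-free, one starting at every position). Two sequences that differ only in how many times "CAG" repeats yield exactly the same set of short reads, while reads long enough to span the repeat tell them apart.

```python
def reads(sequence, length):
    """Return the set of every substring of the given length (idealized reads)."""
    return {sequence[i:i + length] for i in range(len(sequence) - length + 1)}


# Hypothetical sequences: identical flanks, different numbers of "CAG" repeats.
flank_left, flank_right = "TTGACC", "GGATCA"
seq_a = flank_left + "CAG" * 10 + flank_right  # 30-base repeat
seq_b = flank_left + "CAG" * 14 + flank_right  # 42-base repeat

short_len = 8   # shorter than either repeat
long_len = 40   # long enough to span the repeat in seq_a

# Short reads never reach from one flank to the other, so they cannot
# reveal how many repeat units lie in between.
short_same = reads(seq_a, short_len) == reads(seq_b, short_len)

# Long reads capture both flanks together with the repeat between them.
long_same = reads(seq_a, long_len) == reads(seq_b, long_len)

print(short_same)  # True: the short reads are indistinguishable
print(long_same)   # False: the long reads differ
```

Real assemblers face the same ambiguity at a far larger scale, which is one reason longer reads help in regions like centromeres and telomeres.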
Merfin is another tool that researchers created for this work. It automatically detects and corrects mistakes made in the sequencing process.
Stretches of identical base pairs, such as AAA, can be difficult for current technologies to read, explained postdoctoral researcher Giulio Formenti, PhD, who developed Merfin. "There are often errors in those sequences, even now. Merfin corrects them."
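As a small illustration of what such stretches look like in sequence data, the sketch below locates runs of identical bases in a string. It is not Merfin's method and does no error correction; the function name homopolymer_runs, the min_length threshold, and the example sequence are all assumptions made for the example.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=3):
    """Return (start, base, length) for each run of identical bases
    that is at least min_length long."""
    runs = []
    position = 0
    for base, group in groupby(sequence):
        run_length = len(list(group))
        if run_length >= min_length:
            runs.append((position, base, run_length))
        position += run_length
    return runs


# Hypothetical sequence containing a run of four Ts and a run of three As.
example = "GATTTTACAAAGGC"
print(homopolymer_runs(example))  # [(2, 'T', 4), (8, 'A', 3)]
```

A run that is read one base too long or too short leaves every other base in place, which is part of what makes such errors easy to miss.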
The researchers are hoping that the techniques used to finish the human genome sequence, which were presented in a Nature Methods paper, will help scientists understand diseases that are associated with structural repeats in the centromere. "We are finally digging into what we once called junk DNA, because we could not understand it or look at it accurately," Formenti said. "Now that these sequences are no longer missing from the human reference genome, we can begin to map the origins of these diseases."
Cancer has been linked to centromere defects, for example. When some heterochromatic centromere genes are overactive, cancer cells divide uncontrollably. Now that the complete human genome sequence is available, scientists can learn more about these mysterious regions.
Sources: Rockefeller University, Nature Methods, Science