Gene sequences are made up of nucleotide bases, which the cell's machinery 'reads' in triplets. Each set of three bases forms one codon, and most codons correspond to an amino acid; the exceptions are the stop codons, which tell the cell to stop translating a gene into protein (the start codon marks where translation begins and also codes for the amino acid methionine). The genome contains a lot of redundancy and repetition, including long stretches of repeated sequence. Most proteins also contain repeats; around 70 percent of human proteins carry a sequence in which one amino acid is repeated over and over again, with a few other amino acids interspersed. These features are called low complexity regions (LCRs), and many organisms have them. They have been shown to be involved in various cellular processes, including DNA binding and cell adhesion. Researchers have now developed a technique to identify and assess LCRs to learn more about their roles.
This method was used to study all of the proteins carried by eight species, including humans and a bacterium. The researchers determined that LCRs can vary from one species to another, but they often have a similar function: helping their protein move into a larger complex. Some LCRs differed between species in ways that seemed to correspond to species-specific functions, like generating cell walls in plants.
In this work, the researchers generated a dotplot matrix for each protein, which allowed them to represent its amino acid sequence visually. In a dotplot of this kind, a sequence is compared against itself, so a stretch dominated by one amino acid shows up as a dense block along the diagonal. Every protein studied was assigned a matrix, and with computational tools, thousands were compared at once. The LCRs could then be categorized by the amino acid that was most prevalent. Proteins were also categorized by the type and number of LCRs they had.
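To make the dotplot idea concrete, here is a minimal Python sketch. It is an illustration only, not the researchers' published pipeline: the function names, the toy sequence, the window size of 10, and the density threshold of 0.5 are all assumptions chosen for this example. It builds a self-comparison matrix for one sequence, flags windows where the block on the diagonal is dense with matches, and labels each flagged region by its most common amino acid.

```python
from collections import Counter


def self_dotplot(seq):
    """Return an n x n matrix with 1 where residues i and j are identical."""
    return [[1 if a == b else 0 for b in seq] for a in seq]


def match_density(matrix, start, end):
    """Fraction of matching cells in the square block [start, end) on the diagonal."""
    size = end - start
    hits = sum(matrix[i][j] for i in range(start, end) for j in range(start, end))
    return hits / (size * size)


def find_dense_regions(seq, window=10, threshold=0.5):
    """Slide a window along the diagonal and merge windows that pass the threshold.

    `window` and `threshold` are illustrative assumptions, not published values.
    Regions are returned as half-open (start, end) index pairs.
    """
    matrix = self_dotplot(seq)
    regions = []
    for start in range(len(seq) - window + 1):
        if match_density(matrix, start, start + window) >= threshold:
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], start + window)
            else:
                regions.append((start, start + window))
    return regions


def dominant_residue(seq, region):
    """Return the most common amino acid (one-letter code) within a region."""
    start, end = region
    return Counter(seq[start:end]).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy sequence (hypothetical): a lysine-rich stretch between two varied flanks.
    toy = "MADEGHILNPQRSTVWY" + "KKKKEKKKKKKK" + "ACDFGHILMNPQ"
    for region in find_dense_regions(toy):
        print(region, dominant_residue(toy, region))
```

Because each window overlaps its neighbors, the flagged region extends slightly past the lysine run itself; a real analysis would need a more careful way to set boundaries and to handle stretches enriched in more than one amino acid.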
For example, the human protein RPA43 contains three LCRs rich in lysine. RPA43 is a subunit of a crucial enzyme called RNA polymerase I, which synthesizes ribosomal RNA. The lysine-rich LCRs appear to help the protein integrate into the nucleolus, the organelle where ribosomes are assembled.
Some LCRs were found to be conserved among species, meaning they don't change much even when the species carrying them are very far apart on the evolutionary tree. Highly conserved organelles, including the nucleolus and the nuclear speckle, carry many of these conserved LCRs. LCRs that play a role in general functions, like assembling the extracellular matrix, were also very similar across species.
"These sequences seem to be important for the assembly of certain parts of the nucleolus," noted co-lead study author Byron Lee, an MIT graduate student. "Some of the principles that are known to be important for higher order assembly seem to be at play because the copy number, which might control how many interactions a protein can make, is important for the protein to integrate into that compartment."
There were differences too. Plants tend to have unique LCRs in the proteins that make up scaffolding in their cell walls, and these are not found in other organisms.
"There's so much to explore, because we can expand this map to essentially any species. That gives us the opportunity and the framework to identify new biological assemblies," Lee noted.
A freely available tool called BLAST can be used to study and compare protein sequences.
Sources: Massachusetts Institute of Technology, eLife