One of the challenges facing cancer researchers is establishing a clearly defined classification system. Cancer is a diverse disease with dozens of general groups. Each broad group, such as liver or breast cancer, is often further divided into subtypes. Breast cancer is commonly split by receptor status into estrogen receptor-positive (ER), progesterone receptor-positive (PR), HER2-positive (HER2+), and triple-negative breast cancer (TNBC) subtypes, with TNBC the least understood.
TNBC is essentially a grouping of several other subtypes of breast cancer. It is a diverse subtype that has resisted further classification. Several studies have searched for biomarkers that could separate one group from another, but none has produced an efficient method for doing so.
A team from Université Laval in Quebec proposed using machine learning to interpret data from other studies and identify biomarkers unique to TNBC. Machine learning refers to algorithms that learn from examples to interpret and classify data with minimal human oversight. In theory, applying it to the pooled data could surface biomarkers that were overlooked because no single study showed enough evidence to justify further work.
The team conducted their analysis using data from 877 patients in The Cancer Genome Atlas (TCGA). They identified twenty initial gene biomarkers whose expression was altered compared to healthy cells. Of these twenty, the three with the highest statistical significance were chosen: TBC1D9, MFGE8, and SLC16A6. These genes were carried forward into the rest of the study for validation.
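To make the selection step concrete, here is a minimal sketch of ranking genes by how strongly their expression differs between tumor and normal samples and keeping the top three. It is an illustration only, not the team's method: the article does not say which test or algorithm was used, so a plain Welch t-statistic stands in for "statistical significance". The gene names (GENE_A through GENE_E), the sample sizes, and every expression value are synthetic and hypothetical, and the helper names welch_t and top_candidates are invented for this example.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / ((variance(a) / len(a) + variance(b) / len(b)) ** 0.5)


def top_candidates(tumor, normal, k=3):
    """Rank genes by |t| between tumor and normal expression; return the top k names."""
    scores = {gene: abs(welch_t(tumor[gene], normal[gene])) for gene in tumor}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Synthetic data only: hypothetical genes, with a made-up shift in mean
# expression for tumor samples. None of these numbers come from TCGA.
rng = random.Random(0)
shifts = {"GENE_A": 3.0, "GENE_B": -2.5, "GENE_C": 2.0, "GENE_D": 0.0, "GENE_E": 0.0}
normal = {g: [rng.gauss(10.0, 1.0) for _ in range(30)] for g in shifts}
tumor = {g: [rng.gauss(10.0 + s, 1.0) for _ in range(30)] for g, s in shifts.items()}

selected = top_candidates(tumor, normal)
print(selected)
```

Ranking by the absolute value of the statistic treats over- and under-expression alike, which matches the article's description of "altered expression". A real analysis would also need to correct for testing many genes at once.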
An analysis of these genes against patient outcomes found that TBC1D9 overexpression was associated with a better TNBC prognosis and a better progression-free (PF) survival rate. MFGE8 overexpression, on the other hand, was associated with a poor prognosis. Both findings were validated in separate datasets, supporting the method as an effective way to identify biomarker candidates for TNBC. SLC16A6 lacked enough support and was dropped from further testing.
Although the study did not proceed past data analysis, follow-up work was done on both candidates. An examination of possible binding partners predicted that TBC1D9 plays a role in overall cellular stability. MFGE8, by contrast, was linked to proteins with roles in pro-tumor activities such as tumor immunity and metastasis.
With scientists producing a treasure trove of data every year, data analysis is beginning to play a critical role in every field. In breast cancer, TNBC is the most diverse subtype and has thus far evaded easy classification. This study identified several promising candidates through machine learning and further elucidated how two of them might be involved in TNBC.
The team concluded, “The approach described in this study combines multiple disciplines linking clinical information, -omics data, machine learning algorithms and bioinformatics tools, and has proved to be useful and adequate to provide candidate genes that deserve to be pursued further.”
Sources: Nature Scientific Reports, Real Engineering