One of the ambitious goals of modern biology is to disentangle the links between genome and phenome in living organisms. In other words, how does the information encoded in an organism's DNA relate to the forms that organism takes and the functions it performs? Answering that question involves many complexities. One successful approach was simply to look at patterns of gene presence and absence, sampled in a balanced way across the tree of life, and to use supervised clustering to define sets of genes that link distant organisms to a common function (see here). We are looking to extend these models to metagenomic inference and to go beyond proteins to understand gene regulation across diversity.
The computational tool "PredictTrophicMode" was developed as one aspect of this project.
Heatmaps like this form the core of the predictive models. Each row is a cluster of proteins that act in a similar functional process, and each column is an organism. The colors represent a weighted score reflecting how many proteins each organism has for that process. Predictions are based on machine learning inference of the patterns found in heatmaps like this.
Another way to visualize the multidimensional data displayed in heatmaps is to reduce the dimensionality using principal component analysis. When this is done, clusters of organisms emerge based on their capacity for a functional process.
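As a rough illustration of the principal-component view, here is a minimal Python sketch. It is not the PredictTrophicMode implementation: the function name `pca_project`, the toy score matrix, and the use of a plain SVD from NumPy are all assumptions made for this example. It centers a small protein-cluster-by-organism matrix and projects each organism onto the first two principal components.

```python
import numpy as np

def pca_project(scores, n_components=2):
    """Project organisms (columns of `scores`) onto the top principal components.

    `scores` is a hypothetical matrix: rows are protein clusters,
    columns are organisms, values are per-process scores.
    """
    # Transpose so that rows are organisms and columns are features.
    X = np.asarray(scores, dtype=float).T
    # Center each feature (protein cluster) at zero mean.
    X = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Coordinates of each organism along the leading components.
    coords = U[:, :n_components] * S[:n_components]
    # Fraction of total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:n_components]

# Toy data (invented for illustration): 4 protein clusters x 6 organisms.
# The first three organisms score high on clusters 1-3, the last three low.
scores = np.array([
    [0.9, 0.8, 1.0, 0.1, 0.0, 0.2],
    [0.7, 0.9, 0.8, 0.0, 0.1, 0.1],
    [0.8, 1.0, 0.9, 0.2, 0.1, 0.0],
    [0.1, 0.0, 0.2, 0.1, 0.2, 0.0],
])

coords, explained = pca_project(scores)
for i, (pc1, pc2) in enumerate(coords):
    print(f"organism {i}: PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

In this toy case the two groups of organisms fall on opposite sides of the first component, which is the kind of separation described above. The sign of a principal component is arbitrary, so only the relative positions are meaningful.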
We found that the genes that are predictive of phagocytosis in eukaryotes do not have a single evolutionary origin. They come from archaea, from bacteria, and from apparent innovations within the eukaryote lineage.
We linked cell eating (phagocytosis) in eukaryotes to its prokaryote roots by searching across diversity.