TY - JOUR
T1 - Density-based binning of gene clusters to infer function or evolutionary history using GeneGrouper
AU - McFarland, Alexander G.
AU - Kennedy, Nolan W.
AU - Mills, Carolyn E.
AU - Tullman-Ercek, Danielle
AU - Huttenhower, Curtis
AU - Hartmann, Erica Marie
N1 - Funding Information:
This work was supported by the Searle Leadership Fund (E.M.H.); the Biotechnology Training Program (A.G.M.); the Army Research Office [W911NF-19-1-0298 to D.T.-E.]; the National Science Foundation Graduate Research Fellowship Program [DGE-1842165 to N.W.K.]; and the National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases [R24DK110499 to C.H.].
Funding Information:
The authors extend their gratitude to all users who helped test GeneGrouper. This research was supported in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University, which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology.
Publisher Copyright:
© 2022 Oxford University Press. All rights reserved.
PY - 2022/2/1
Y1 - 2022/2/1
AB - Motivation: Identifying variant forms of gene clusters of interest in phylogenetically proximate and distant taxa can help to infer their evolutionary histories and functions. Conserved gene clusters may differ by only a few genes, but these small differences can in turn induce substantial phenotypes, such as through the formation of pseudogenes or insertions that interrupt regulation. Particularly as microbial genomes and metagenomic assemblies become increasingly abundant, unsupervised grouping of similar, but not necessarily identical, gene clusters into consistent bins can provide a population-level understanding of their gene content variation and functional homology. Results: We developed GeneGrouper, a command-line tool that uses a density-based clustering method to group gene clusters into bins. GeneGrouper demonstrated high recall and precision in benchmarks for the detection of the 23-gene Salmonella enterica LT2 Pdu gene cluster and four-gene Pseudomonas aeruginosa PAO1 Mex gene cluster among 435 genomes spanning mixed taxa. In a subsequent application investigating the diversity and impact of gene-complete and -incomplete LT2 Pdu gene clusters in 1130 S. enterica genomes, GeneGrouper identified a novel, frequently occurring pduN pseudogene. When investigated in vivo, introduction of the pduN pseudogene negatively impacted microcompartment formation. We next demonstrated the versatility of GeneGrouper by clustering distant homologous gene clusters and variable gene clusters found in integrative and conjugative elements.
UR - http://www.scopus.com/inward/record.url?scp=85130443526&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85130443526&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btab752
DO - 10.1093/bioinformatics/btab752
M3 - Article
C2 - 34734968
AN - SCOPUS:85130443526
VL - 38
SP - 612
EP - 620
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 3
ER -