TY - GEN
T1 - LSHvec
T2 - 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
AU - Shi, Lizhen
AU - Chen, Bo
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/1/18
Y1 - 2021/1/18
AB - Drawing from the analogy between natural language and "genomic sequence language", we explored the applicability of word embeddings from natural language processing (NLP) to representing DNA reads in metagenomics studies. Here, a k-mer is the equivalent of a word in NLP, and k-mers have been widely used in analyzing sequence data. However, directly replacing word embeddings with k-mer embeddings is problematic for two reasons. First, the number of distinct k-mers far exceeds the number of distinct words in a natural-language vocabulary, making the model too large to be stored in memory. Second, sequencing errors create many novel k-mers (noise), which significantly degrade model performance. In this work, we introduce LSHvec, a model that leverages Locality Sensitive Hashing (LSH) for k-mer encoding to overcome these challenges. After k-mers are LSH encoded, we adopt skip-gram with negative sampling to learn k-mer embeddings. Experiments on labeled metagenomic datasets demonstrate that k-mer encoding using LSH not only accelerates training and reduces the memory required to store the model, but also achieves higher accuracy than alternative encoding methods. We validate that LSHvec is robust on reads with high sequencing error rates and works well with any sequencing technology. In addition, the trained low-dimensional k-mer embeddings can potentially be used for accurate metagenomic read clustering and taxonomic classification. Finally, we demonstrate the unprecedented capability of LSHvec by participating in the second round of CAMI challenges and show that LSHvec is able to handle metagenome datasets that exceed terabytes in size through distributed training across multiple nodes.
KW - locality sensitive hashing
KW - metagenomic analysis
KW - neural network
KW - word embedding
UR - http://www.scopus.com/inward/record.url?scp=85112397298&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112397298&partnerID=8YFLogxK
U2 - 10.1145/3459930.3469521
DO - 10.1145/3459930.3469521
M3 - Conference contribution
AN - SCOPUS:85112397298
T3 - Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
BT - Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
PB - Association for Computing Machinery, Inc
Y2 - 1 August 2021 through 4 August 2021
ER -