LSHvec: A vector representation of DNA sequences using locality sensitive hashing and FastText word embeddings

Lizhen Shi, Bo Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Scopus citations

Abstract

Drawing on the analogy between natural language and "genomic sequence language", we explored the applicability of word embeddings from natural language processing (NLP) to representing DNA reads in metagenomic studies. Here, a k-mer is the counterpart of a word in NLP, and k-mers are widely used in analyzing sequence data. However, directly replacing word embeddings with k-mer embeddings is problematic for two reasons. First, the number of distinct k-mers far exceeds the number of distinct words in a natural-language vocabulary, making the model too large to be stored in memory. Second, sequencing errors create many novel k-mers (noise), which significantly degrade model performance. In this work, we introduce LSHvec, a model that leverages Locality Sensitive Hashing (LSH) for k-mer encoding to overcome these challenges. After the k-mers are LSH encoded, we adopt the skip-gram model with negative sampling to learn k-mer embeddings. Experiments on labeled metagenomic datasets demonstrate that k-mer encoding using LSH can not only reduce training time and the memory required to store the model, but also achieve higher accuracy than alternative encoding methods. We validate that LSHvec is robust on reads with high sequencing error rates and works well with any sequencing technology. In addition, the trained low-dimensional k-mer embeddings can potentially be used for accurate metagenomic read clustering and taxonomic classification. Finally, we demonstrate the unprecedented capability of LSHvec by participating in the second round of CAMI challenges and show that LSHvec is able to handle metagenome datasets that exceed terabytes in size through distributed training across multiple nodes.
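The encoding step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which LSH family LSHvec uses, so the example below assumes one common choice: random hyperplane projection over one-hot encoded k-mers. All function names, the read, and the parameters (`k`, `num_bits`, `seed`) are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random

BASES = "ACGT"

def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def one_hot(kmer):
    """Flatten a k-mer into a 4*k vector of 0/1 values (unknown bases map to zeros)."""
    vec = []
    for base in kmer:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def make_hyperplanes(k, num_bits, seed=0):
    """Draw num_bits random hyperplanes in the 4*k-dimensional one-hot space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(4 * k)] for _ in range(num_bits)]

def lsh_encode(kmer, hyperplanes):
    """Map a k-mer to an integer bucket id, one sign bit per hyperplane.

    Assumption: random-hyperplane LSH; k-mers that differ in few positions
    tend to agree on most sign bits.
    """
    vec = one_hot(kmer)
    code = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

# Hypothetical read and parameters, for illustration only.
read = "ACGTACGTTAGC"
k = 5
planes = make_hyperplanes(k, num_bits=8)

# Each read becomes a "sentence" of bucket ids that a skip-gram model could consume.
tokens = [lsh_encode(km, planes) for km in kmers(read, k)]
print(tokens)
```

With this kind of encoding the vocabulary is capped at 2**num_bits bucket ids regardless of k, whereas the raw k-mer vocabulary can grow up to 4**k. The resulting token sequences could then be passed to a skip-gram trainer in place of words.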

Original language: English (US)
Title of host publication: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450384506
State: Published - Jan 18 2021
Event: 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States
Duration: Aug 1 2021 to Aug 4 2021

Publication series

Name: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

Conference

Conference: 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Country/Territory: United States
City: Virtual, Online
Period: 8/1/21 to 8/4/21

Keywords

  • locality sensitive hashing
  • metagenomic analysis
  • neural network
  • word embedding

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics
