Efficient estimation of children’s language exposure in two bilingual communities

Margaret Cychosz*, Anele Villanueva, Adriana Weisleder

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

10 Scopus citations


Purpose: The language that children hear early in life is associated with their speech-language outcomes. This line of research relies on naturalistic observations of children’s language input, often captured with daylong audio recordings. However, the large quantity of data that daylong recordings generate requires novel analytical tools to feasibly parse thousands of hours of naturalistic speech. This study outlines a new approach to efficiently process and sample from daylong audio recordings made in two bilingual communities, Spanish–English in the United States and Quechua–Spanish in Bolivia, to derive estimates of children’s language exposure.

Method: We employed a general sampling-with-replacement technique to efficiently estimate two key elements of children’s early language environments: (a) proportion of child-directed speech (CDS) and (b) dual language exposure. Proportions estimated from random sampling of 30-s segments were compared to those from annotations over the entire daylong recording (every other segment), as well as to parental report of dual language exposure.

Results: Approximately 49 min from each recording, or just 7% of the overall recording, was required to reach a stable proportion of CDS and bilingual exposure. In both speech communities, strong correlations were found between bilingual language estimates made using random sampling and all-day annotation techniques. A strong association was additionally found for CDS estimates in the United States, but this was weaker at the Bolivian site, where CDS was less frequent. Dual language estimates from the audio recordings did not correspond well to estimates derived from parental report collected months apart.

Conclusions: Daylong recordings offer tremendous insight into children’s daily language experiences, but they will not become widely used in developmental research until data processing and annotation time substantially decrease. We show that annotation based on random sampling is a promising approach to efficiently estimate ambient characteristics from daylong recordings that cannot currently be estimated via automated methods.
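
The random-sampling procedure summarized under Method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the segment labels are synthetic, the function names are hypothetical, and the stability rule (the running proportion varies by no more than `tol` over the last `window` draws) is an assumption chosen for illustration, since the abstract does not state the criterion used in the study.

```python
import random


def running_proportions(labels, n_draws, seed=0):
    """Draw 30-s segments with replacement and return the running
    proportion of positive segments (e.g., CDS) after each draw."""
    rng = random.Random(seed)
    hits = 0
    props = []
    for i in range(1, n_draws + 1):
        hits += rng.choice(labels)  # with replacement: a segment may recur
        props.append(hits / i)
    return props


def draws_until_stable(props, window=20, tol=0.02):
    """Return the first draw count at which the last `window` running
    proportions span at most `tol`, or None if that never happens.
    This stopping rule is an illustrative assumption."""
    for end in range(window, len(props) + 1):
        recent = props[end - window:end]
        if max(recent) - min(recent) <= tol:
            return end
    return None


# Synthetic recording (assumption): 1,400 segments of 30 s each,
# of which 20% are labeled as child-directed speech (1 = CDS).
labels = [1] * 280 + [0] * 1120
props = running_proportions(labels, n_draws=400)
n_stable = draws_until_stable(props)

if n_stable is not None:
    minutes = n_stable * 30 / 60
    print(f"Stable after {n_stable} draws (~{minutes:.1f} min annotated)")
    print(f"Estimated proportion: {props[n_stable - 1]:.3f}")
else:
    print("Running proportion did not stabilize within the draws taken")
```

In practice each draw corresponds to a human annotating one 30-s segment, so the draw count at stabilization translates directly into annotation time; the sampled estimate would then be compared against the proportion obtained from the fuller annotation.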

Original language: English (US)
Pages (from-to): 3843-3866
Number of pages: 24
Journal: Journal of Speech, Language, and Hearing Research
Issue number: 10
State: Published - Oct 2021

ASJC Scopus subject areas

  • Speech and Hearing
  • Language and Linguistics
  • Linguistics and Language

