Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study

Bibliographic Details
Main Author: Shahan Ali Memon (18812038) (author)
Other Authors: Saquib Razak (18812041) (author), Ingmar Weber (149886) (author)
Published: 2020
_version_ 1864513512752021504
author Shahan Ali Memon (18812038)
author2 Saquib Razak (18812041)
Ingmar Weber (149886)
author2_role author
author
author_facet Shahan Ali Memon (18812038)
Saquib Razak (18812041)
Ingmar Weber (149886)
author_role author
dc.creator.none.fl_str_mv Shahan Ali Memon (18812038)
Saquib Razak (18812041)
Ingmar Weber (149886)
dc.date.none.fl_str_mv 2020-01-27T09:00:00Z
dc.identifier.none.fl_str_mv 10.2196/13347
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/Lifestyle_Disease_Surveillance_Using_Population_Search_Behavior_Feasibility_Study/26022313
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Health sciences
Epidemiology
Health services and systems
Public health
Information and computing sciences
Data management and data science
Human-centred computing
noncommunicable diseases
lifestyle disease surveillance
infodemiology
infoveillance
Google Trends
Web search
nowcasting
public health
digital epidemiology
dc.title.none.fl_str_mv Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description Background: As the process of producing official health statistics for lifestyle diseases is slow, researchers have explored using Web search data as a proxy for lifestyle disease surveillance. Existing studies, however, are prone to at least one of the following issues: ad hoc keyword selection, overfitting, insufficient predictive evaluation, lack of generalization, and failure to compare against trivial baselines.
Objective: The aims of this study were to (1) employ a corrective approach improving previous methods; (2) study the key limitations in using Google Trends for lifestyle disease surveillance; and (3) test the generalizability of our methodology to other countries beyond the United States.
Methods: For each of the target variables (diabetes, obesity, and exercise), prevalence rates were collected. After a rigorous keyword selection process, data from Google Trends were collected. These data were denormalized to form spatio-temporal indices. L1-regularized regression models were trained to predict prevalence rates from the denormalized Google Trends indices. Models were tested on a held-out set and compared against baselines from the literature as well as a trivial "last year equals this year" baseline. A similar analysis was done using a multivariate spatio-temporal model in which the previous year’s prevalence was included as a covariate. This model was modified to create a time-lagged regression analysis framework. Finally, a hierarchical time-lagged multivariate spatio-temporal model was created to account for subnational trends in the data. The model trained on US data was then applied to Canada in a transfer learning framework.
Results: In the US context, our proposed models outperformed the prior work as well as the trivial baselines. In terms of mean absolute error (MAE), the best of our proposed models yields a 24% improvement (from 0.72 to 0.55; P<.001) for diabetes, an 18% improvement (from 1.20 to 0.99; P=.001) for obesity, and a 34% improvement (from 2.89 to 1.95; P<.001) for exercise. Our proposed across-country transfer learning framework also shows promising results, with average Spearman and Pearson correlations of 0.70 for diabetes, and of 0.90 and 0.91, respectively, for obesity.
Conclusions: Although our proposed models beat the baselines, we find the modeling of lifestyle diseases to be a challenging problem, one that requires an abundance of data as well as creative modeling strategies. This study shows a low-to-moderate validity of Google Trends in the context of lifestyle disease surveillance, even when applying novel corrective approaches, including a proposed denormalization scheme. We envision qualitative analyses to be a more practical use of Google Trends in the context of lifestyle disease surveillance. For quantitative analyses, the highest utility of Google Trends is in the context of transfer learning, where low-resource countries could benefit from high-resource countries by using proxy models.
Other Information: Published in: Journal of Medical Internet Research. License: https://creativecommons.org/licenses/by/4.0/. See article on publisher's website: https://dx.doi.org/10.2196/13347
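The comparison against a trivial "last year equals this year" baseline, scored by mean absolute error, can be illustrated with a minimal sketch. This is not the authors' code: the prevalence values and the model predictions below are made up for illustration, and the function names (`mean_absolute_error`, `persistence_forecast`, `relative_improvement`) are hypothetical helpers, not part of the study.

```python
from statistics import fmean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have equal length")
    return fmean(abs(a - p) for a, p in zip(actual, predicted))


def persistence_forecast(series):
    """'Last year equals this year': predict each year with the previous year's value."""
    return series[:-1]


def relative_improvement(baseline_mae, model_mae):
    """Fractional reduction in MAE relative to the baseline."""
    return (baseline_mae - model_mae) / baseline_mae


# Made-up yearly prevalence rates (percent) for one region; NOT data from the study.
prevalence = [9.0, 9.4, 9.5, 10.1, 10.3]
actual = prevalence[1:]  # only years that have a previous year can be scored
baseline_pred = persistence_forecast(prevalence)

# Hypothetical model outputs for the same years, purely for illustration.
model_pred = [9.3, 9.6, 9.9, 10.4]

baseline_mae = mean_absolute_error(actual, baseline_pred)
model_mae = mean_absolute_error(actual, model_pred)
improvement = relative_improvement(baseline_mae, model_mae)

print(f"baseline MAE={baseline_mae:.3f}, model MAE={model_mae:.3f}, "
      f"improvement={improvement:.1%}")
```

A model is only worth reporting if its MAE falls below that of the persistence forecast, which is why the abstract lists the omission of such a baseline among the weaknesses of earlier studies.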
eu_rights_str_mv openAccess
id Manara2_84a67037eb813e85ddd71a726d208ad2
identifier_str_mv 10.2196/13347
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/26022313
publishDate 2020
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study