Abstract
Purpose: Smoking is a modifiable risk behavior that causes health problems including cancer and respiratory disease. The literature also shows that smoking behaviors established in adolescence are likely to persist into adulthood, a pattern seen in countries worldwide. In South Korea, despite many efforts to reduce smoking among adolescents, it remains a significant social problem. To be effective, an intervention targeting adolescent smoking must understand and address the factors that underlie and influence the behavior. These factors can be surfaced from data using an appropriate approach. Machine learning is well suited to revealing patterns in large, complex datasets that are useful for predicting outcomes (Chekround, 2016). For example, machine learning has been used to predict readmission among in-patients (Mortazavi, 2016; Frizzell, 2016). However, this approach had not yet been applied to adolescent risk behaviors such as smoking. Therefore, the goal of this study was to identify the predictors of adolescents' smoking behaviors in South Korea using a machine-learning approach.
Methods: The 2015 Korean Youth Risk Behaviors Web-based Survey (KYRBS) was the data source for this study. The KYRBS is an annual, nationwide survey conducted in South Korea to examine health behaviors, including cigarette smoking, individual hygiene, and alcohol consumption. The 2015 KYRBS data were collected via self-report questionnaires completed by 68,043 students in grades 7 through 12 at 800 randomly selected schools in South Korea. For this study, we used the 5,123 surveys in which the smoking-related items were completed. We followed the machine-learning pipeline developed by Fayyad (1996) and Yoon (2015). To reduce the "curse of dimensionality," in which a large number of interrelated variables in a large dataset interferes with the accuracy of the machine-learning model, we selected clinically meaningful features based on the conceptual framework for adolescent risk behaviors (Jessor, 1991). We then applied three machine-learning algorithms implemented in Weka (J48, Naïve Bayes, and Logistic Regression) to build a predictive model of smoking behavior among the adolescents represented in the KYRBS dataset. The final model was selected based not only on the accuracy of the predictive model but also on the F-measure, which is calculated from precision and recall.
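The selection criteria described above can be illustrated with a minimal Python sketch. It is not the study's Weka workflow: the confusion-matrix counts are hypothetical placeholders rather than KYRBS results, the model names are labels only, and ranking by accuracy first with the F-measure as a tie-breaker is an assumption, since the abstract does not state how the two measures were combined.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Confusion:
    """Binary confusion-matrix counts, with 'smoker' as the positive class."""
    tp: int  # smokers predicted as smokers
    fp: int  # non-smokers predicted as smokers
    fn: int  # smokers predicted as non-smokers
    tn: int  # non-smokers predicted as non-smokers


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c: Confusion) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_measure(c: Confusion) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def select_model(results: Dict[str, Confusion]) -> str:
    """Pick the model with the highest accuracy; ties go to the higher F-measure.

    The ordering rule is an assumption made for this sketch.
    """
    return max(results, key=lambda m: (accuracy(results[m]), f_measure(results[m])))


# Hypothetical counts for illustration only; these are NOT the study's data.
results = {
    "J48": Confusion(tp=40, fp=10, fn=20, tn=30),
    "NaiveBayes": Confusion(tp=45, fp=15, fn=15, tn=25),
    "Logistic": Confusion(tp=50, fp=10, fn=10, tn=30),
}

for name, c in results.items():
    print(f"{name}: accuracy={accuracy(c):.3f}, F-measure={f_measure(c):.3f}")
print("selected:", select_model(results))
```

Reporting the F-measure alongside accuracy matters when one class is much rarer than the other, because a model can reach high accuracy by predicting the majority class while still missing most cases of the minority class.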
Results: Through the feature selection process, we classified 40 features into three predictive categories. Among the three machine-learning algorithms we applied, Logistic Regression demonstrated the highest accuracy (84.0% of adolescent smokers were correctly classified; F-measure = 0.795). In this model, grade (-0.06) and alcohol consumption (-0.56) were the top two features with the highest coefficients. In other words, being a middle school student and never having drunk alcohol were strongly associated with smoking.
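As a reading aid, the reported coefficients can be converted to odds ratios. The sketch below assumes that the values are logistic regression coefficients on the log-odds scale and that smoking is the positive class; the abstract does not describe how grade or alcohol consumption were coded, so the direction of each effect depends on that coding. The dictionary keys are illustrative names, not variable names from the KYRBS.

```python
import math

# Coefficients as reported in the abstract; the log-odds scale is assumed.
coefficients = {"grade": -0.06, "alcohol_consumption": -0.56}


def odds_ratio(beta: float) -> float:
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(beta)


for name, beta in coefficients.items():
    print(f"{name}: coefficient={beta:+.2f}, odds ratio={odds_ratio(beta):.3f}")
```

Under these assumptions, a coefficient of -0.56 corresponds to an odds ratio of about 0.57 and a coefficient of -0.06 to about 0.94, meaning the odds of being classified as a smoker fall as the coded value of each predictor rises.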
Conclusion: Our study demonstrates that a machine-learning approach is effective in identifying behavioral predictors from a large, complex dataset, in this case the behavioral predictors associated with smoking in the KYRBS. However, our results were inconsistent with those reported in the literature: previous studies showed that increasing grade and previous alcohol consumption were associated with adolescents' smoking behaviors (Mendol, 2013; Talip, 2015). Further study of the association between smoking behaviors and alcohol consumption among Korean adolescents is needed. Although this study had some limitations (e.g., the KYRBS data are cross-sectional), our machine-learning approach shows promise, and subsequent research using longitudinal data can take trends in these associations into account when creating a predictive model.
Sigma Membership
Gamma
Type
Poster
Format Type
Text-based Document
Study Design/Type
N/A
Research Approach
N/A
Keywords:
Adolescents, Cigarette Smoking, Machine Learning
Recommended Citation
Chung, Sophia and Li, Youngji, "An application of machine learning for the identification of adolescent smoking risk factors" (2017). INRC (Congress). 230.
https://www.sigmarepository.org/inrc/2017/posters_2017/230
Conference Name
28th International Nursing Research Congress
Conference Host
Sigma Theta Tau International
Conference Location
Dublin, Ireland
Conference Year
2017
Rights Holder
All rights reserved by the author(s) and/or publisher(s) listed in this item record unless relinquished in whole or part by a rights notation or a Creative Commons License present in this item record.
All permission requests should be directed accordingly and not to the Sigma Repository.
All submitting authors or publishers have affirmed that when using material in their work where they do not own copyright, they have obtained permission of the copyright holder prior to submission and the rights holder has been acknowledged as necessary.
Review Type
Abstract Review Only: Reviewed by Event Host
Acquisition
Proxy-submission