Regression Tree Approach to Estimation of Health Insurance Premium

Başak Bulut Karageyik

Research Article

Year 2023, Volume: 16 Issue: 2, 81 - 99, 31.12.2023

Abstract

This paper proposes an approach to predicting health insurance premiums that combines traditional generalized linear models (GLM) with machine-learning-based regression tree analysis. The study first fits a GLM to real complementary health insurance data to assess variable importance, focusing on the variables that have a large impact on premium estimates. It then investigates whether the variables identified as significant by the GLM are also identified as significant by regression tree analysis. In the machine learning application, the effect of stratified sampling, carried out in accordance with the data structure with respect to the risk variables considered in premium estimation, is also analyzed. The study contributes to the actuarial understanding of premium estimation and provides insurers with a concrete framework for navigating complex health insurance data. By integrating the advantages of GLM and regression trees, it offers a comprehensive comparison that helps insurers adapt to changing risk factors. It represents an innovative attempt to incorporate regression tree methodology into insurance analysis, providing a novel and accurate estimation of premium amounts.

Keywords

Actuarial premium estimation, Regression tree, Machine learning techniques, Generalized linear models
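
The two machine learning ingredients named in the abstract, stratified sampling on a risk variable and a regression tree split, can be illustrated with a minimal sketch. This is not the author's implementation: the field names (`age`, `gender`, `premium`), the toy portfolio, and the helper functions `stratified_split`, `sse`, and `best_split` are all hypothetical, and only a single tree split chosen by the usual sum-of-squared-errors criterion is shown rather than a full tree.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_split(rows, stratum_key, test_fraction=0.2, seed=0):
    """Split rows into train/test so each stratum keeps about the same share."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[stratum_key]].append(row)
    train, test = [], []
    for members in by_stratum.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, feature, target):
    """Find the threshold on one numeric feature that minimises the total SSE
    of the two resulting groups, i.e. a single regression tree split."""
    levels = sorted({row[feature] for row in rows})
    best = (None, sse([row[target] for row in rows]))
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2  # midpoint between adjacent observed values
        left = [r[target] for r in rows if r[feature] <= threshold]
        right = [r[target] for r in rows if r[feature] > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy portfolio: the premium jumps at age 40 (illustration only).
policies = [
    {"age": age, "gender": gender, "premium": 100.0 if age < 40 else 250.0}
    for gender in ("F", "M")
    for age in (20, 25, 30, 35, 40, 45, 50, 55, 60, 65)
]

# Stratify on gender so both groups appear in train and test in equal shares.
train, test = stratified_split(policies, "gender", test_fraction=0.2, seed=1)
threshold, remaining_sse = best_split(train, "age", "premium")
```

On this toy data the chosen threshold separates the two premium levels, so the remaining squared error is zero. A full regression tree would apply the same search recursively to each resulting group and over every candidate risk variable.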



Details

Primary Language	English
Subjects	Statistical Analysis, Risk Analysis
Section	Articles
Authors	Başak Bulut Karageyik 0000-0003-4080-9165
Early Pub Date	29 December 2023
Publication Date	31 December 2023
Submission Date	4 December 2023
Acceptance Date	28 December 2023
Published in Issue	Year 2023 Volume: 16 Issue: 2

Cite

IEEE	B. Bulut Karageyik, “Regression Tree Approach to Estimation of Health Insurance Premium”, JSSA, vol. 16, no. 2, pp. 81–99, 2023.
