Research Article

A Robust Initial Basic Subset Selection Method for Outlier Detection Algorithms in Linear Regression

Yıl 2024, Sayı: 10, 76 - 85, 31.12.2024
https://doi.org/10.52693/jsas.1512794

Abstract

The main motivation of this study is to develop an efficient algorithm for diagnosing and detecting outliers in linear regression up to a reasonable level of contamination. The algorithm first obtains a robust version of the hat matrix at the linear-algebra level. The basic subset obtained in this first stage is then refined through concentration steps, as defined in the fast-LTS (Least Trimmed Squares) regression algorithm. The method can also be plugged into other algorithms as an initial basic subset selection stage. The algorithm is effective against outliers in both the X and Y directions up to a contamination level of 25%. Its complexity increases linearly with the number of observations and parameters, and it is quite fast because the initial subset selection does not require iterative calculations. The success of the algorithm at a specific contamination level is demonstrated through simulations.
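The concentration steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it covers only simple regression with a closed-form OLS fit, and the initial subset below is hand-picked from clean observations, whereas the paper's actual contribution is selecting that initial subset robustly via a robust hat matrix.

```python
# Illustrative sketch of fast-LTS-style concentration steps [2]:
# repeatedly fit on a subset, then keep the h observations with the
# smallest squared residuals until the subset stabilizes.
import random


def ols_fit(xs, ys):
    """Closed-form OLS for the simple model y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def concentration_steps(x, y, subset, h, iters=10):
    """Refine `subset` by concentration; usually converges in a few steps."""
    for _ in range(iters):
        b0, b1 = ols_fit([x[i] for i in subset], [y[i] for i in subset])
        resid2 = [(y[i] - b0 - b1 * x[i]) ** 2 for i in range(len(x))]
        new = sorted(range(len(x)), key=lambda i: resid2[i])[:h]
        if set(new) == set(subset):
            break
        subset = new
    return subset, (b0, b1)


random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [5 + 2 * xi + random.gauss(0, 0.5) for xi in x]
for i in range(20):              # contaminate 20% of points in both X and Y
    x[i] += 10
    y[i] -= 10
h = int(0.75 * n)                # keep 75% of observations, matching a 25% breakdown
init = list(range(20, 30))       # hand-picked clean start (the paper selects this robustly)
subset, (b0, b1) = concentration_steps(x, y, init, h)
print(f"b0={b0:.2f}, b1={b1:.2f}")
```

Because every outlier has an enormous residual under any fit anchored on clean points, the concentration steps quickly discard all contaminated indices and the final coefficients land near the true values (5, 2). The quality of the initial subset is what the paper's robust hat-matrix construction is designed to guarantee.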

References

  • [1] X. Gao and Y. Feng, “Penalized weighted least absolute deviation regression,” Statistics and Its Interface, vol. 11, no. 1, pp. 79–89, 2018.
  • [2] P. J. Rousseeuw and K. Van Driessen, “Computing LTS regression for large data sets,” Data Mining and Knowledge Discovery, vol. 12, pp. 29–45, 2006.
  • [3] D. C. Hoaglin and R. E. Welsch, “The hat matrix in regression and ANOVA,” The American Statistician, vol. 32, no. 1, pp. 17–22, 1978.
  • [4] J. W. Tukey et al., Exploratory Data Analysis. Reading, MA, 1977, vol. 2.
  • [5] A. S. Hadi and J. S. Simonoff, “Procedures for the identification of multiple outliers in linear models,” Journal of the American Statistical Association, vol. 88, no. 424, pp. 1264–1272, 1993.
  • [6] N. Billor, A. S. Hadi, and P. F. Velleman, “BACON: blocked adaptive computationally efficient outlier nominators,” Computational Statistics & Data Analysis, vol. 34, no. 3, pp. 279–298, 2000.
  • [7] N. Billor, S. Chatterjee, and A. S. Hadi, “A re-weighted least squares method for robust regression estimation,” American Journal of Mathematical and Management Sciences, vol. 26, no. 3-4, pp. 229–252, 2006.
  • [8] D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, 2005.
  • [9] S. Barratt, G. Angeris, and S. Boyd, “Minimizing a sum of clipped convex functions,” Optimization Letters, vol. 14, pp. 2443–2459, 2020.
  • [10] S. Chatterjee and M. Mächler, “Robust regression: A weighted least squares approach,” Communications in Statistics–Theory and Methods, vol. 26, no. 6, pp. 1381–1394, 1997.
  • [11] M. H. Satman, “A new algorithm for detecting outliers in linear regression,” International Journal of Statistics and Probability, vol. 2, no. 3, p. 101, 2013.
  • [12] L. Huo, T.-H. Kim, and Y. Kim, “Robust estimation of covariance and its application to portfolio optimization,” Finance Research Letters, vol. 9, no. 3, pp. 121–134, 2012.
  • [13] P. J. Rousseeuw, “Least median of squares regression,” Journal of the American Statistical Association, vol. 79, no. 388, pp. 871–880, 1984.
  • [14] D. M. Hawkins and D. Olive, “Applications and algorithms for least trimmed sum of absolute deviations regression,” Computational Statistics & Data Analysis, vol. 32, no. 2, pp. 119–134, 1999.
  • [15] D. M. Hawkins, D. Bradu, and G. V. Kass, “Location of several outliers in multiple-regression data using elemental sets,” Technometrics, vol. 26, no. 3, pp. 197–208, 1984.
  • [16] D. De Menezes, D. M. Prata, A. R. Secchi, and J. C. Pinto, “A review on robust M-estimators for regression analysis,” Computers & Chemical Engineering, vol. 147, p. 107254, 2021.
  • [17] P. Rousseeuw and V. Yohai, “Robust regression by means of S-estimators,” in Robust and Non-linear Time Series Analysis: Proceedings of a Workshop Organized by the Sonderforschungsbereich 123 “Stochastische Mathematische Modelle”, Heidelberg 1983. Springer, pp. 256–272, 1984.
  • [18] M. H. Satman, “A genetic algorithm based modification on the LTS algorithm for large data sets,” Communications in Statistics–Simulation and Computation, vol. 41, no. 5, pp. 644–652, 2012.
  • [19] M. H. Satman, S. Adiga, G. Angeris, and E. Akadal, “LinRegOutliers: A Julia package for detecting outliers in linear regression,” Journal of Open Source Software, vol. 6, no. 57, p. 2892, 2021.
  • [20] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, “Julia: A fast dynamic language for technical computing,” arXiv preprint arXiv:1209.5145, 2012.

Details

Primary Language English
Subjects Econometric and Statistical Methods
Section Research Articles
Authors

Mehmet Hakan Satman 0000-0002-9402-1982

Early View Date December 24, 2024
Publication Date December 31, 2024
Submission Date July 8, 2024
Acceptance Date December 16, 2024
Published in Issue Year 2024, Issue: 10

How to Cite

IEEE M. H. Satman, “A Robust Initial Basic Subset Selection Method for Outlier Detection Algorithms in Linear Regression,” JSAS, no. 10, pp. 76–85, Dec. 2024, doi: 10.52693/jsas.1512794.