Detection and Diagnostic Methods of Multiple Influential Points in Binary Logistic Regression Model in Animal Breeding

Burcu Mestav

doi:10.29133/yyutbd.638226

Research Article

Detection and Diagnosis Methods for Multiple Influential Points in the Binary Logistic Regression Model in Animal Breeding

Year 2019, Volume: 29 Issue: 4, 677 - 688, 31.12.2019


Abstract

Multiple influential points adversely affect parameter estimates in binary logistic regression models and lead to misinterpretation of the results. An influential point is a data point that does not follow the general trend of the rest of the data and has an extreme value in x. Because the presence of roughly 10% or more influential points in a data set affects the parameter estimates, detecting and diagnosing these points is very important. Graphical methods (such as scatter plots and box plots) and analytical methods are used to detect and diagnose multiple influential points. The most widely used diagnostic methods include Pearson residuals, Studentized residuals, the hat matrix, Cook's distance, DFFITS, and DFBETA. However, when multiple influential points are present, these methods run into masking problems and fail at diagnosis. To cope with this problem, many statisticians have developed and proposed new methods such as the Generalized Standardized Pearson Residual (GSPR) and Generalized Weights (GW). This study worked with a data set on weaning weight (WW), yearling live weight (YW), fleece weight (FW), and fertility rate (FR) obtained from Romney sheep, which contains multiple influential points (15%), and modelled the effect of WW, YW, and FW on FR with a binary logistic regression model. The aim of the study is to detect multiple influential points by graphical methods and to examine how well the commonly used and the newly developed methods diagnose these data points. The study found that the commonly used methods mask multiple influential points, whereas the newly proposed methods diagnose them successfully.

Keywords

Multiple Influential Point, Generalized Standardized Pearson Residual (GSPR), Generalized Difference in Fits (GDFFITS)


Abstract

Multiple influential points adversely affect parameter estimation in binary logistic regression models and lead to misinterpretation of results. An influential point is a data point that does not follow the overall trend of the remaining data and has an extreme value in terms of x. Since the presence of approximately 10% or more influential points in a dataset affects parameter estimates, the detection and diagnosis of these points matter greatly. Graphical methods (such as scatter plots and box plots) and analytical methods are used in the detection and diagnosis of multiple influential points. Among the commonly used diagnostic methods are Pearson residuals, Standardized Pearson Residuals (SPR), Cook's Distance (CD), the hat matrix, DFFITS, and DFBETA. However, these methods suffer from masking problems and fail to diagnose the points when multiple influential points are present. To overcome this problem, many statisticians have developed and proposed new diagnostic methods, such as the Generalized Standardized Pearson Residual (GSPR) and Generalized Weights (GW). This study used a dataset containing multiple influential points (15%) for weaning weight (WW), yearling weight (YW), fleece weight (FW), and fertility rate (FR) of Romney ewes and modelled the effects of the WW, YW, and FW variables on FR with a binary logistic regression model. The aim is to detect the multiple influential points by graphical methods and to examine the performance of the commonly used and the newly developed methods in diagnosing these data points. The results show that the commonly used methods mask multiple influential points, whereas the newly proposed methods identify these points successfully.
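The single-case diagnostics named above, and the group-deletion idea behind GSPR, can be sketched in a few lines. The following is a minimal illustration in plain Python, not the paper's analysis: the eleven-point data set is invented, the helper names (`fit_logistic`, `diagnostics`, `cooks_distance`) are hypothetical, the Cook's distance formula is one common form (some texts omit the division by the number of parameters), and the treatment of deleted rows (dividing by the square root of 1 + h instead of 1 - h) follows the GSPR definition as it is usually attributed to Imon and Hadi (2008), which is an assumption here because the abstract does not state it.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def information(X, pi):
    """X'VX with V = diag(pi_i * (1 - pi_i))."""
    p = len(X[0])
    return [[sum(r[j] * r[k] * q * (1 - q) for r, q in zip(X, pi)) for k in range(p)]
            for j in range(p)]

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood fit by Newton-Raphson; each row of X starts with 1 (intercept)."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        pi = [sigmoid(sum(b * v for b, v in zip(beta, r))) for r in X]
        score = [sum(r[j] * (t - q) for r, t, q in zip(X, y, pi)) for j in range(len(beta))]
        step = solve(information(X, pi), score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def diagnostics(X, y, deleted=()):
    """Fit on the rows not in `deleted`, then compute residuals for every row.

    With `deleted` empty this gives ordinary Pearson residuals, leverages and
    standardized Pearson residuals. With a non-empty suspect set it gives a
    group-deleted (GSPR-style) residual: 1 - h for retained rows, 1 + h for
    deleted rows (assumed definition, see the lead-in).
    """
    keep = [i for i in range(len(y)) if i not in deleted]
    XR = [X[i] for i in keep]
    beta = fit_logistic(XR, [y[i] for i in keep])
    pi = [sigmoid(sum(b * v for b, v in zip(beta, r))) for r in X]
    info = information(XR, [pi[i] for i in keep])
    out = []
    for i, (r, t, q) in enumerate(zip(X, y, pi)):
        v = q * (1 - q)
        h = v * sum(a * b for a, b in zip(r, solve(info, list(r))))
        pearson = (t - q) / math.sqrt(v)
        sign = 1 if i in deleted else -1
        out.append({"pearson": pearson, "leverage": h,
                    "std_resid": pearson / math.sqrt(1 + sign * h)})
    return out

def cooks_distance(d, p):
    """One common logistic-regression form; only meaningful for the full fit."""
    return d["std_resid"] ** 2 * d["leverage"] / (p * (1 - d["leverage"]))

# Invented toy data: ten ordinary rows plus one row with extreme x and y against the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0]
X = [[1.0, float(v)] for v in x]

full = diagnostics(X, y)               # classical single-fit diagnostics
gen = diagnostics(X, y, deleted={10})  # refit without the suspect row

print("full-fit standardized residual, row 10:", full[10]["std_resid"])
print("full-fit Cook's distance, row 10:", cooks_distance(full[10], 2))
print("group-deleted residual, row 10:", gen[10]["std_resid"])
```

On this toy data the extreme row pulls the full fit toward itself, so its full-fit residual stays moderate, while the residual computed from the fit without that row is much larger in magnitude. Real masking involves several such rows shielding one another, and choosing the suspect set (for example from graphical inspection, as the abstract describes) is a separate step that this sketch does not cover.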

Keywords

Multiple Influential Point, Generalized Standardized Pearson Residual (GSPR)


Details

Primary Language	English
Subjects	Zootechny (Other)
Journal Section	Articles
Authors	Burcu Mestav 0000-0003-0864-5279
Publication Date	December 31, 2019
Acceptance Date	November 28, 2019
Published in Issue	Year 2019 Volume: 29 Issue: 4

Cite

APA	Mestav, B. (2019). Detection and Diagnostic Methods of Multiple Influential Points in Binary Logistic Regression Model in Animal Breeding. Yuzuncu Yıl University Journal of Agricultural Sciences, 29(4), 677-688. https://doi.org/10.29133/yyutbd.638226


Yuzuncu Yil University Journal of Agricultural Sciences by Van Yuzuncu Yil University Faculty of Agriculture is licensed under a Creative Commons Attribution 4.0 International License.