Statistical modelling and the identification of significant variables in large data sets are now common problems. This paper deals with the statistical analysis of two large-dimensional data sets: we first conduct a seismic hazard sensitivity analysis using seismic data from Greece acquired during the years 1962–2003, and then analyze trauma data collected in an annual registry conducted during 2005 by the Hellenic Trauma and Emergency Surgery Society, involving 30 General Hospitals in Greece. The main purpose of both analyses is to extract high-level knowledge for the domain user or decision-maker. Eight nonparametric classifiers derived from data mining methods (Multilayer Perceptron (MLP) Neural Networks, Radial Basis Function Neural Networks (RBFN), Bayesian Networks, Support Vector Machines (SVMs), Classification and Regression Trees (C&RT), Chi-square Automatic Interaction Detection (CHAID), the C5.0 algorithm, and Quick, Unbiased, Efficient Statistical Tree (QUEST)) are employed in this work and are compared to Logistic Regression and ℓ1-norm SVM in terms of overall classification accuracy, sensitivity, specificity, and area under the ROC curve (AUROC). The goal of this paper is twofold: to assess the importance of several input variables in order to detect possible risk factors for large earthquakes or to help prevent trauma deaths, and to examine which classifiers are best suited to large-dimensional data analysis, that is, which ones effectively detect complex nonlinear relationships and can potentially lead to more accurate predictions.
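The four comparison criteria named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical and chosen only for illustration. The sketch assumes binary labels coded as 0/1 with both classes present, and computes AUROC as the fraction of (positive, negative) pairs in which the positive case receives the higher score, counting ties as one half.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
area = auroc(y_true, scores)
print(metrics, area)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is one reason the four criteria are usually reported together.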
Parpoula, Christina; Drosou, Krystallenia; and Koukouvinos, Christos
"Large-Scale Statistical Modelling via Machine Learning Classifiers,"
Journal of Statistics Applications & Probability: Vol. 2, Article 3.
Available at: https://dc.naturalspublishing.com/jsap/vol2/iss3/3