Chang-Yun Lin
Department of Applied Mathematics and Institute of Statistics, National Chung Hsing University
In the analysis of designed experiments, it is usually assumed that the responses are normally distributed and that the model is linear. In practice, experimenters may encounter other types of responses and uncertain model structures. In such cases, traditional methods such as ANOVA and regression are not suitable for data analysis and model selection. We introduce random forest analysis, a powerful machine learning method capable of analyzing various types of data with complicated model structures. To perform model selection and factor identification with the random forest method, we propose a forward stepwise algorithm and develop Python code that maximizes the out-of-bag (OOB) score and the R² score. Three examples with different types of designs and responses are provided. We compare the performance of the proposed method with that of some frequently used analysis methods. The results show that forward stepwise random forest analysis requires the least data preprocessing and selects models with high prediction accuracy.
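To make the idea of forward stepwise selection with a random forest concrete, the following is a minimal sketch and not the authors' code. It assumes scikit-learn's `RandomForestRegressor`, whose `oob_score_` attribute is the R² computed on out-of-bag samples. The function name `forward_stepwise_rf`, its parameters, the stopping rule `min_gain`, and the synthetic two-level design are all hypothetical and chosen only for illustration.

```python
import itertools

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def forward_stepwise_rf(X, y, n_estimators=100, min_gain=0.01, seed=0):
    """Greedy forward selection of columns of X by random forest OOB score.

    Hypothetical helper: at each step, add the factor that gives the largest
    OOB score, and stop when the improvement is no greater than min_gain.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            rf = RandomForestRegressor(
                n_estimators=n_estimators,
                oob_score=True,   # score each tree on the runs it did not see
                bootstrap=True,
                random_state=seed,
            )
            rf.fit(X[:, cols], y)
            scores[j] = rf.oob_score_
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score <= min_gain:
            break  # no candidate improves the OOB score enough
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score


# Synthetic illustration (not data from the paper): a replicated 2^6 full
# factorial in which only factors 0 and 3 affect the response.
rng = np.random.default_rng(0)
design = np.array(list(itertools.product([-1.0, 1.0], repeat=6)))
X = np.tile(design, (2, 1))  # two replicates, 128 runs
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0.0, 0.5, size=X.shape[0])

selected, best_score = forward_stepwise_rf(X, y)
print("selected factors:", selected)
print("OOB score:", round(best_score, 3))
```

Because the OOB score is computed on runs left out of each bootstrap sample, it serves as a built-in estimate of prediction accuracy and no separate validation set is needed. For a categorical response, `RandomForestClassifier` could be substituted, in which case `oob_score_` is an accuracy rather than an R².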