RANDOM FORESTS FOR CLASSIFICATION IN ECOLOGY

Cutler, D. R. et al. Random forests for classification in ecology. Ecology 88, 2783–2792 (2007).

Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature.

Reviewing this paper on request. I don't think I've seen the term "partial dependence" in other RF papers, so I'm reading up on what it means. I'll post once I've sorted it out.


"Partial dependence plots (Hastie et al. 2001; see also Appendix C) may be used to graphically characterize relationships between individual predictor variables and predicted probabilities of species presence obtained from RF"

"The partial dependence of the function f on the variable Xj is the expectation of f with respect to all the variables except Xj."

$f_j(X_j) = E_{X_{-j}}[f(X)]$

   

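Roughly, you can think of it as something like a conditional expectation (strictly speaking, the other variables are averaged out over their marginal distribution rather than conditioned on). You treat the output as a function of all the predictors, vary only one or two of them, and look at the response value itself. I'm a bit puzzled about why they chose the word "dependence".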
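To make the definition concrete, here is a minimal sketch of how that expectation is usually estimated in practice: fix Xj at a grid value, leave every other predictor at its observed values, and average the predictions. The names `partial_dependence`, `toy_predict`, and `toy_data` are my own hypothetical ones, and `toy_predict` is only a stand-in for a fitted RF's predicted probability, not anything from the paper.

```python
from typing import Callable, List, Sequence

Row = List[float]


def partial_dependence(
    predict: Callable[[Row], float],
    data: Sequence[Row],
    j: int,
    grid: Sequence[float],
) -> List[float]:
    """Estimate f_j(x) = E_{X_{-j}}[f(X)] by averaging over the observed rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = list(row)  # copy so the original data is untouched
            modified[j] = value   # force predictor j to the grid value
            total += predict(modified)
        curve.append(total / len(data))
    return curve


# Hypothetical stand-in for a black-box model (assumption, not the paper's model).
def toy_predict(row: Row) -> float:
    x0, x1 = row
    return 0.25 * x0 + 0.5 * x1


toy_data = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
pd_curve = partial_dependence(toy_predict, toy_data, j=0, grid=[0.0, 1.0, 2.0])
print(pd_curve)
```

Because the function only calls `predict`, any classifier that returns a score or probability could be plugged in, which matches the "any black-box classifier" point in the paper.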


"RF do not have simple representations such as a formula (e.g., logistic regression) or pictorial graph (e.g., classification trees) that characterizes the entire classification function, and this lack of simple representation can make ecological interpretation difficult. Partial dependence plots for one or two predictor variables at a time may be constructed for any ‘blackbox’ classifier (Hastie et al. 2001:333). If the classification function is dominated by individual variable and low order interactions, then these plots can be an effective tool for visualizing the classification results, but they are not helpful for characterizing or interpreting high-order interactions."

PD plots are introduced in this context: they can help when interpreting the results. It is true that RF output is fundamentally awkward to interpret.
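A small sketch of why these plots say little about high-order interactions, under assumptions of my own: the toy function `three_way` is a pure three-way interaction, and the data are the eight corners of a cube with coordinates of -1 or 1. Both the names and the setup are hypothetical illustrations, not taken from the paper. With this setup the one-variable and two-variable partial dependence values all average out to zero, even though the function clearly depends on every predictor.

```python
from itertools import product


def partial_dependence_at(predict, data, idx, point):
    """Average predict over data with the predictors in idx fixed at point."""
    total = 0.0
    for row in data:
        modified = list(row)  # copy so the original row is untouched
        for j, value in zip(idx, point):
            modified[j] = value
        total += predict(modified)
    return total / len(data)


# Hypothetical black box: a pure three-way interaction (assumption).
def three_way(row):
    return row[0] * row[1] * row[2]


# All 8 corners of the cube {-1, 1}^3, so each predictor is balanced around zero.
cube = [list(corner) for corner in product([-1.0, 1.0], repeat=3)]

# One predictor at a time: fix X0 and average over X1 and X2.
one_var = [partial_dependence_at(three_way, cube, [0], [v]) for v in (-1.0, 1.0)]

# Two predictors at a time: fix (X0, X1) and average over X2.
two_var = [
    partial_dependence_at(three_way, cube, [0, 1], [a, b])
    for a, b in product((-1.0, 1.0), repeat=2)
]

print(one_var, two_var)
```

The flat curves here come from the symmetric toy data; real data would rarely cancel this neatly, but the same averaging is what blurs higher-order structure.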

"RF is not a tool for traditional statistical inference. It is not suitable for ANOVA or hypothesis testing. It does not compute P values, or regression coefficients, or confidence intervals. The variable importance measure in RF may be used to subjectively identify ecologically important variables for interpretation, but it does not automatically choose subsets of variables in the way that variable subset selection methods do. Rather, RF characterizes and exploits structure in high dimensional data for the purposes of classification and prediction."
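For reference, a rough sketch of the permutation idea behind this kind of variable importance: shuffle one predictor, then see how much accuracy drops. This is a simplification under my own assumptions. RF itself computes the measure on out-of-bag samples per tree, whereas this sketch permutes a column of a single dataset and uses a hypothetical `toy_classifier` in place of a fitted forest. All names here are illustrative.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, j, seed=0):
    """Drop in accuracy after shuffling predictor j across rows."""
    rng = random.Random(seed)
    column = [row[j] for row in rows]
    rng.shuffle(column)  # break the link between predictor j and the labels
    permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


# Hypothetical stand-in classifier that only uses predictor 0 (assumption).
def toy_classifier(row):
    return 1 if row[0] > 0.5 else 0


data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_classifier(row) for row in rows]

importances = [
    permutation_importance(toy_classifier, rows, labels, j) for j in range(2)
]
print(importances)
```

The result is a ranking, not a selection: predictor 1 scores zero here because the toy classifier ignores it, but deciding where to cut off "important" variables is still left to the analyst, as the quote says.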