[R Analysis] The Random Forest Algorithm


Aaron 2019. 1. 17. 23:15

Related posts: [Data Analysis] The Random Forest Algorithm

[R Analysis] Random Forest Parameter Tuning

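randomForest(x, y = NULL, # x and y can be passed separately, but the formula interface is more common

xtest = NULL, ytest = NULL, # if a test set is supplied, it is evaluated while the model is fit (usually omitted)

ntree = 500, # number of trees

mtry = n, # number of candidate predictors sampled at each split

replace = TRUE) # by default, random forest draws its samples with replacement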
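The two arguments that shape the forest most, mtry and replace, can be illustrated in base R without fitting anything. The sketch below writes out the defaults randomForest() uses for mtry and draws a single bootstrap sample by hand. The variable names and the seed are illustrative assumptions, not part of the package.

```r
# Default mtry used by randomForest(), written out by hand for iris.
p <- ncol(iris) - 1                       # 4 predictors (Species is the response)
mtry_classification <- floor(sqrt(p))     # default when the response is a factor
mtry_regression <- max(floor(p / 3), 1)   # default when the response is numeric

# One bootstrap sample: draw n row indices with replacement (replace = TRUE).
set.seed(1)                               # arbitrary seed, only for reproducibility
n <- nrow(iris)
boot_idx <- sample(n, size = n, replace = TRUE)
oob_idx <- setdiff(seq_len(n), boot_idx)  # rows never drawn are "out-of-bag" (OOB)

# Share of distinct rows that made it into the bootstrap sample.
in_bag_share <- length(unique(boot_idx)) / n
in_bag_share
```

For iris this gives mtry = 2 for classification, matching the "No. of variables tried at each split" line in the model output. Each tree sees only part of the rows, and the rows it never saw are what the OOB error estimate is computed on.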

> install.packages('randomForest')

> library(randomForest)

# 1. Sampling

> sn <- sample(1:nrow(iris), size = nrow(iris)*0.7)

> train <- iris[sn,] # random 70% of the rows (training data)

> test <- iris[-sn,] # remaining 30% of the rows (test data)
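Because sample() is random, the split (and every number that follows) changes from run to run. A minimal sketch of a reproducible version of the same 70/30 split is shown here. The seed value 123 is an arbitrary assumption.

```r
# Reproducible 70/30 split of iris; the seed value is arbitrary.
set.seed(123)
n <- nrow(iris)
sn <- sample(seq_len(n), size = round(n * 0.7))  # 105 row indices, no replacement

train <- iris[sn, ]    # 70% of the rows
test  <- iris[-sn, ]   # the remaining 30%

# Sanity checks: sizes and class balance of the training set.
nrow(train)
nrow(test)
table(train$Species)
```

The class counts in train will generally not be exactly equal, since the rows are drawn without regard to Species.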

# 2. Build the model

> forest_m <- randomForest(Species ~ ., data=train)

> forest_m

Call:

randomForest(formula = Species ~ ., data = train)

Type of random forest: classification # the task type (classification)

Number of trees: 500 # number of trees (default: 500)

No. of variables tried at each split: 2 # the mtry value

OOB estimate of error rate: 4.76% # out-of-bag (OOB) misclassification rate

Confusion matrix:

setosa versicolor virginica class.error # columns = predicted class, rows = actual class

setosa 38 0 0 0.00000000

versicolor 0 35 2 0.05405405 # 2 actual versicolor samples were misclassified as virginica

virginica 0 3 27 0.10000000 # 3 actual virginica samples were misclassified as versicolor
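The class.error column and the OOB error rate both follow directly from the counts in the confusion matrix. As a check, the sketch below re-enters the counts printed above by hand and recomputes both values in base R. The object names are illustrative.

```r
# Confusion matrix from the output above: rows = actual, columns = predicted.
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(38,  0,  0,
                0, 35,  2,
                0,  3, 27),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was predicted wrongly.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all off-diagonal counts divided by the total.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

round(class_error, 8)      # 0, 0.05405405, 0.1
round(oob_error * 100, 2)  # 4.76
```

The 5 misclassified rows out of 105 training rows give the reported 4.76%.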

# Inspect the fitted model

> forest_m$predicted # OOB predictions of the fitted model for the training data

135 21 74 80 108 56 90 4

versicolor setosa versicolor versicolor virginica versicolor versicolor setosa

> forest_m$importance # impurity-based importance of each predictor (mean decrease in Gini)

MeanDecreaseGini

Sepal.Length 6.625629

Sepal.Width 1.436852

Petal.Length 29.485235

Petal.Width 31.479693 # Petal.Width is the most important predictor

> forest_m$mtry # mtry value used by the model

[1] 2

> forest_m$ntree # ntree value used by the model

[1] 500
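The importance shown above is the Gini-based measure only. If the forest is fitted with importance = TRUE, the package also records a permutation-based measure (mean decrease in accuracy). The following is a minimal sketch, assuming the randomForest package is installed. It fits on the full iris data for brevity, and the seed and the name forest_imp are illustrative.

```r
library(randomForest)

set.seed(123)  # arbitrary seed for reproducibility
# importance = TRUE asks the forest to also compute permutation importance.
forest_imp <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1: mean decrease in accuracy (permutation-based)
imp_accuracy <- importance(forest_imp, type = 1)
# type = 2: mean decrease in node impurity (Gini), as in forest_m$importance
imp_gini <- importance(forest_imp, type = 2)

imp_accuracy
imp_gini

# Dot charts of both measures side by side.
varImpPlot(forest_imp)
```

The two measures do not always rank the predictors identically, so it can be worth looking at both before dropping a variable.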

# 3. Predict with the model

> new_data <- iris[10,-5] + 0.2

> predict(forest_m, newdata = new_data, type = 'class') # majority vote across the 500 trees ('class' is accepted as an alias for 'response')

setosa

Levels: setosa versicolor virginica

> iris[10,'Species']

[1] setosa

Levels: setosa versicolor virginica
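The predicted class is only the winner of the vote. The vote itself can be inspected with the other type values of predict(). This is a minimal sketch, assuming the randomForest package is installed. It refits on the full iris data so that the block is self-contained, and the seed is arbitrary.

```r
library(randomForest)

set.seed(123)  # arbitrary seed for reproducibility
forest_m <- randomForest(Species ~ ., data = iris)

# Same perturbed observation as in the text: row 10 with 0.2 added to each measurement.
new_data <- iris[10, -5] + 0.2

# Predicted class (majority vote).
predict(forest_m, newdata = new_data, type = "response")

# Share of trees voting for each class.
probs <- predict(forest_m, newdata = new_data, type = "prob")

# Raw vote counts instead of shares.
votes <- predict(forest_m, newdata = new_data, type = "vote", norm.votes = FALSE)

probs
votes
```

The vote counts add up to ntree (500 here), and the shares are those counts divided by ntree.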

# 4. Evaluate the model

# 4-1) Accuracy (%) on the test data

> prd_v <- predict(forest_m, newdata = test, type = 'class')

> sum(prd_v == test$Species) / nrow(test) * 100

[1] 95.55556

# 4-2) Accuracy (%) on the training data

> prd_v2 <- predict(forest_m, newdata = train, type = 'class')

> sum(prd_v2 == train$Species) / nrow(train) * 100

[1] 100
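An accuracy of 100% on the training data is expected, because every row was used to grow many of the trees, so it says little about generalization. The OOB predictions stored in the model give a fairer estimate from the training data alone. A per-class confusion matrix for the test set shows where the errors fall. The following is a minimal sketch, assuming the randomForest package is installed. The seed and the object names are illustrative.

```r
library(randomForest)

set.seed(123)  # arbitrary seed for reproducibility
sn <- sample(seq_len(nrow(iris)), size = round(nrow(iris) * 0.7))
train <- iris[sn, ]
test  <- iris[-sn, ]
forest_m <- randomForest(Species ~ ., data = train)

# Test-set predictions and confusion matrix (rows = actual, columns = predicted).
prd_test <- predict(forest_m, newdata = test)
conf_test <- table(actual = test$Species, predicted = prd_test)
conf_test

# Accuracy (%) on the test set; mean() of a logical vector is the share of TRUEs.
test_acc <- mean(prd_test == test$Species) * 100

# OOB accuracy (%) on the training rows: each row is predicted only by
# the trees that did not see it during fitting.
oob_acc <- mean(forest_m$predicted == train$Species, na.rm = TRUE) * 100

c(test = test_acc, oob = oob_acc)
```

The OOB accuracy is usually close to the test accuracy, while the accuracy from calling predict() on the training data stays near 100%.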

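# 5. Visualize the model

> layout(matrix(c(1,2),nrow=1),widths=c(4,1)) # left panel for the plot, narrow right panel for the legend

> par(mar=c(5,4,4,0)) # remove the right margin

> plot(forest_m) # error rate (OOB and per class) against the number of trees

> par(mar=c(5,0,4,2)) # remove the left margin

> plot(c(0,1),type="n", axes=FALSE, xlab="", ylab="") # empty panel that only holds the legend

> legend("top", colnames(forest_m$err.rate),col=1:4,cex=0.8,fill=1:4)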

Reference: KIC Campus, Machine Learning-Based Big Data Analysis Training Course
