Decision Trees and Random Forests

Author: 北欧森林 | Published 2021-05-02 18:36
PART I Decision Trees

Basics of decision trees



When does a decision tree stop growing?
(I) All leaf nodes are pure, with an entropy of zero;
(II) no available split can achieve a prespecified minimum change in purity;
(III) the number of observations in a leaf node reaches the prespecified minimum.
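In the party package used below, these stopping rules map onto arguments of ctree_control(). A minimal sketch — the values shown are the documented defaults, and the Temp-vs-weather model is only an illustration:

```r
library(party)

# Map the stopping rules onto ctree_control() arguments:
#   (I) a pure node cannot yield a significant split test, so growth
#       stops there automatically;
#  (II) mincriterion: 1 - p-value a split test must exceed to keep growing;
# (III) minsplit / minbucket: minimum observations needed to attempt a
#       split / minimum observations allowed in any terminal node.
ctrl <- ctree_control(mincriterion = 0.95,  # split only if test p-value < 0.05
                      minsplit     = 20,    # need >= 20 obs to try a split
                      minbucket    = 7)     # every leaf keeps >= 7 obs
data(airquality)
toytree <- ctree(Temp ~ Wind + Solar.R, data = airquality, controls = ctrl)
```

Tightening mincriterion (e.g. 0.99) or raising minbucket yields smaller, more conservative trees.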

  1. Load and inspect the dataset
data(airquality)
str(airquality)

# 'data.frame': 153 obs. of  6 variables:
# $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
# $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
# $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
# $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
# $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
# $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
  2. Impute missing values in the Ozone variable
set.seed(888)
airquality[is.na(airquality$Ozone), 1] <- sample(airquality[!is.na(airquality$Ozone), 1], 37)  # impute by sampling from the non-missing values
summary(airquality$Ozone)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 1.00   16.00   30.00   41.31   63.00  168.00 
  3. Fit the model
library(party)
airct <- ctree(Ozone ~ ., data = airquality, controls = ctree_control(maxsurrogate = 3))
airct

# Conditional inference tree with 4 terminal nodes
# 
# Response:  Ozone 
# Inputs:  Solar.R, Wind, Temp, Month, Day 
# Number of observations:  153 
# 
# 1) Temp <= 76; criterion = 1, statistic = 47.479
#   2)*  weights = 61 
# 1) Temp > 76
#   3) Wind <= 6.3; criterion = 1, statistic = 24.235
#     4)*  weights = 21 
#   3) Wind > 6.3
#     5) Temp <= 84; criterion = 0.964, statistic = 7.182
#       6)*  weights = 45 
#     5) Temp > 84
#       7)*  weights = 26 
  4. Visualize the results
plot(airct)
[Figure: plot of the fitted conditional inference tree]

Inspect each node in detail:

plot(airct, inner_panel = node_boxplot, edge_panel = function(...) invisible(), tnex = 1)
[Figure: tree with box plots drawn in the inner nodes]
inner <- nodes(airct, c(1, 3, 5))  # nodes 1, 3 and 5 are the inner (splitting) nodes; node 7 is terminal
layout(matrix(1:length(inner), ncol = length(inner)))
out <- sapply(inner, function(i) {
  splitstat <- i$psplit$splitstatistic
  x <- airquality[[i$psplit$variableName]][splitstat > 0]
  plot(x, splitstat[splitstat > 0], main = paste("Node", i$nodeID),
       xlab = i$psplit$variableName, ylab = "Statistic", ylim = c(0, 10),
       cex.axis = 1.2, cex.lab = 1.2, cex.main = 1.2)
  abline(v = i$psplit$splitpoint, lty = 4)
})
[Figure: split statistics at the inner nodes]

A continuous variable can be split at many candidate points. A test statistic measures how good each candidate split is: the larger the statistic, the better the split.

  5. Predict with the decision tree
ind <- sample(2, nrow(airquality), replace = TRUE, prob = c(0.7, 0.3))
newData <- airquality[ind == 2, ]
newpred <- predict(airct, newdata = newData)
plot(newpred, newData$Ozone, xlab = "Ozone value predicted by decision tree",
     ylab = "Observed ozone value")
[Figure: predicted vs. observed ozone values]
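The scatter plot can be backed by a numeric summary of agreement. A short sketch computing RMSE and the Pearson correlation by hand (neither is produced by party itself), continuing with the newpred and newData objects from above:

```r
# Summarise prediction accuracy on the hold-out rows
rmse <- sqrt(mean((as.numeric(newpred) - newData$Ozone)^2))  # root mean squared error
r    <- cor(as.numeric(newpred), newData$Ozone)              # Pearson correlation
c(RMSE = rmse, correlation = r)
```

The exact values depend on the random split, so none are shown here.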
PART II Random Forests
  1. Fit the random forest model
aircf <- cforest(Ozone ~ ., data = airquality)
aircf

# Random Forest using Conditional Inference Trees
# 
# Number of trees:  500 
# 
# Response:  Ozone 
# Inputs:  Solar.R, Wind, Temp, Month, Day 
# Number of observations:  153 
  2. Evaluate the predictions
predforest <- predict(aircf, newdata = newData)
plot(predforest, newData$Ozone, ylab = "Observed ozone value",
     xlab = "Predicted ozone value based on random forest")
[Figure: random forest predictions vs. observed values]
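A natural follow-up is to ask which predictors the forest relies on most. party provides a permutation-based importance measure, varimp(); a brief sketch using the fitted aircf object (values depend on the random seed, so none are shown):

```r
# Permutation-based variable importance for the conditional forest
vi <- sort(varimp(aircf), decreasing = TRUE)
vi  # larger values = predictor contributes more to prediction accuracy
barplot(rev(vi), horiz = TRUE, las = 1,
        main = "Variable importance (cforest)")
```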
PART III Model-based recursive partitioning
airmod <- mob(Ozone ~ Temp + Day | Solar.R + Wind + Month, data = airquality)
# variables after the "|" symbol are the partitioning variables
plot(airmod)
[Figure: model-based recursive partitioning tree]
airmod
# 1) Wind <= 6.3; criterion = 0.999, statistic = 25.078
#   2)*  weights = 23 
# Terminal node model
# Gaussian GLM with coefficients:
# (Intercept)         Temp          Day  
#    -96.4060       2.0736      -0.1832  
# 
# 1) Wind > 6.3
#   3)*  weights = 123 
# Terminal node model
# Gaussian GLM with coefficients:
# (Intercept)         Temp          Day  
#    -90.0534       1.5811       0.2236  
# 

coef(airmod)
#   (Intercept)     Temp        Day
# 2   -96.40597 2.073623 -0.1831996
# 3   -90.05342 1.581143  0.2235920

sctest(airmod, node = 1)  # the sctest() generic comes from the strucchange package
#             Solar.R         Wind        Month
# statistic 6.0347325 25.077906381 22.955024196
# p.value   0.9661175  0.001153747  0.003099691

Wind and Month show significant parameter instability (p < 0.05) at the root node; Wind, having the largest statistic, is therefore chosen for the first split.

References

  1. Professor Zhongheng Zhang's DXY (丁香园) course: Decision Trees and Random Forests
  2. Annals of Translational Medicine: Big-data Clinical Trial Column
  3. Zhang Z. Decision tree modeling using R. Annals of Translational Medicine.
