MLlib in Practice: Naive Bayes

Author: wlu | Published 2017-02-25 20:43

Introduction

This article works through a complete text-classification exercise with the pipeline provided by the Spark (1.5.0) ml package. The pipeline chains together word splitting (tokenize), term-frequency counting (TF), feature-vector computation (TF-IDF), and Naive Bayes model training.
The Naive Bayes model is trained and tested on the "20 Newsgroups" dataset, a collection of approximately 20,000 newsgroup documents partitioned roughly evenly across 20 newsgroups. I use the '20news-bydate.tar.gz' file, because that version of the dataset is already split into train and test sets, which makes training and evaluating the model convenient.

20news-bydate.tar.gz - 20 Newsgroups sorted by date; duplicates and some headers removed (18846 documents)
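For readers who want to reproduce this, the archive can be fetched and unpacked roughly as follows (the URL is the dataset maintainer's download page as I know it, and the target directory matches the paths used in the code below; both are assumptions you may need to adjust):

```shell
# Download the date-split 20 Newsgroups archive and unpack it.
# This produces 20news-bydate-train/ and 20news-bydate-test/,
# each containing one subfolder per newsgroup.
wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
mkdir -p /root/work/test
tar -xzf 20news-bydate.tar.gz -C /root/work/test/
```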

Introduction to the Naive Bayes Algorithm

Naive Bayes is a supervised classification algorithm: given an input feature vector, it predicts the class with the highest posterior probability, under the assumption that the features are conditionally independent given the class.
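Concretely (this is the standard formulation, added here for completeness since the original post omits the formula): with class priors P(c) and per-term likelihoods P(x_i | c) estimated from the training set, a document with terms x_1, ..., x_n is assigned

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(x_i \mid c)
       \;=\; \arg\max_{c}\;\Big(\log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)\Big)
```

The product form follows from the "naive" conditional-independence assumption; implementations such as spark.ml work with the equivalent log form to avoid numerical underflow.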


A few notes on the code:

  • Each document's class is determined by the subfolder it sits in, so the code needs the subfolder name. Calling sc.wholeTextFiles(...) yields raw data of type RDD[(String, String)], where _1 is the file's absolute path and _2 is the file's content; the subfolder name is then cut out of _1 with f.split("/").takeRight(2).head.
  • The pipeline framework is built on DataFrame, so the RDD has to be converted to a DataFrame:

    ```scala
    import sqlContext.implicits._
    labelNameAndData.toDF("id", "sentence").cache()
    ```

  • All transformations use the classes provided by ml as-is, with no customization or modification; the current model reaches 82% accuracy on the test set.
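The label-extraction step described above can be checked in isolation. The path and the shortened label array below are illustrative values of my own, not taken from the post:

```scala
// _1 from wholeTextFiles is an absolute path; the class label is the
// name of the file's parent folder, i.e. the second-to-last path segment.
val path = "file:/root/work/test/20news-bydate-train/sci.space/60940"
val labelName = path.split("/").takeRight(2).head

// Mapping each label name to a numeric id, as the broadcast zipWithIndex map does
// (illustrative subset; the real program lists all 20 newsgroup names):
val uniqueLabels = Array("alt.atheism", "comp.graphics", "sci.space")
val labelIdMap = uniqueLabels.zipWithIndex.toMap

println(labelName)             // sci.space
println(labelIdMap(labelName)) // 2
```

In the full program the label array is broadcast so that each partition builds the name-to-id map once instead of shipping it per record.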

Code:
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.{Logging, SparkConf, SparkContext}


object NBTest extends App with Logging {
  // Load one split of the dataset and return a DataFrame of
  // (numeric label, document text).
  def createRawDf(s: String) = {
    //sc.setLogLevel("INFO")
    val fileNameData = sc.wholeTextFiles(s)

    val uniqueLabels = Array("alt.atheism", "comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med", "sci.space", "soc.religion.christian", "talk.politics.guns", "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc")
    val uniqueLabelsBc = sc.broadcast(uniqueLabels)

    val labelNameAndData = fileNameData
      // _1 is the absolute file path; keep its parent folder name (the newsgroup).
      .map { case (f, data) => (f.split("/").takeRight(2).head, data) }
      .mapPartitions {
        itrs =>
          val labelIdMap = uniqueLabelsBc.value.zipWithIndex.toMap
          itrs.map {
            // ml expects a numeric (Double) label column.
            case (labelName, data) => (labelIdMap(labelName).toDouble, data)
          }
      }

    import sqlContext.implicits._
    labelNameAndData.toDF("id", "sentence").cache()

  }

  // Tokenize -> term frequencies (hashing trick) -> TF-IDF -> Naive Bayes.
  def createTrainPpline() = {
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")

    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    //val vecAssembler = new VectorAssembler().setInputCols(Array("features")).setOutputCol("id")

    val nb = new NaiveBayes().setFeaturesCol("features").setLabelCol("id")

    new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
  }

  val conf = new SparkConf().setMaster("local[2]").setAppName("nb")
    .set("spark.ui.enabled", "false")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  val training = createRawDf("file:////root/work/test/20news-bydate-train/*")

  val ppline = createTrainPpline()
  val nbModel = ppline.fit(training)

  val test = createRawDf("file:////root/work/test/20news-bydate-test/*")
  val testRes = nbModel.transform(test)

  // Note: in Spark 1.5 MulticlassClassificationEvaluator defaults to the "f1"
  // metric; "precision" here is micro-averaged precision, which for multiclass
  // classification equals overall accuracy.
  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol("id")
    .setMetricName("precision")
  val accuracy = evaluator.evaluate(testRes)
  println("Test Error = " + (1.0 - accuracy))

}
```

Original link: https://www.haomeiwen.com/subject/wuhnwttx.html