Daily Quick Notes, Continuously Updated (2018-05-02)


Author: 叨逼叨小马甲 | Published 2018-05-02 23:31
  1. Dimensionality reduction methods (see the sketch after this list):
  • principal component analysis
  • canonical correlation analysis
  • singular value decomposition
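A minimal sketch of these three methods using scikit-learn on assumed toy data (the data, component counts, and variable names are illustrative, not from the original note):

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(100, 10)   # 100 samples, 10 features (toy data)
Y = rng.rand(100, 4)    # a second "view" of the data, used only by CCA

# principal component analysis: project onto directions of maximal variance
X_pca = PCA(n_components=3).fit_transform(X)

# singular value decomposition (truncated): low-rank projection without centering
X_svd = TruncatedSVD(n_components=3).fit_transform(X)

# canonical correlation analysis: directions in X and Y that are maximally correlated
X_cca, Y_cca = CCA(n_components=2).fit_transform(X, Y)

print(X_pca.shape, X_svd.shape, X_cca.shape)   # (100, 3) (100, 3) (100, 2)
```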
  2. Raw-data preprocessing: three steps
  • data preprocessing
  • feature engineering
  • feature selection; feature selection itself comes in three flavours (see the sketch after this list):
    - filter: score features and select the best subset
    - wrapper: generate a subset ----> learning algorithm, in a loop
    - embedded method: generate a subset ----> learning algorithm + performance, in a loop
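A minimal sketch of the three feature-selection styles, again with scikit-learn and made-up data (the chosen scorers and estimators are just one possible instantiation):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# filter: score each feature independently, keep the best subset
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# wrapper: loop "generate a subset -> learning algorithm" (recursive feature elimination)
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# embedded: the learner's own training (here an L1 penalty) decides which features survive
X_embedded = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear")
).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```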
  3. The process of machine learning


    [Figure: the process of machine learning]
  4. Some classification algorithms (a fitting sketch follows the figure below)

  • nearest neighbour
  • Linear svm
  • RBF svm
  • Gaussian process
  • decision tree
  • random forest
  • neural net
  • ada boost
  • naive bayes
  • QDA


    [Figure: comparison of classification algorithms]
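A small sketch fitting a few of the listed classifiers on a toy two-moons dataset (the dataset and hyperparameters are assumptions for illustration only):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "nearest neighbour": KNeighborsClassifier(3),
    "linear SVM": SVC(kernel="linear", C=0.025),
    "RBF SVM": SVC(kernel="rbf", gamma=2, C=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name:18s} test accuracy = {clf.score(X_test, y_test):.2f}")
```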
  5. Several algorithm families (a combined sketch follows this list)
    A. Regression

    • Ordinal regression: data in rank-ordered categories
    • Poisson regression: predicts event counts
    • Fast forest quantile regression: predicts a distribution
    • Linear regression: fast training, linear model
    • Bayesian linear regression: linear model, small data sets
    • Neural network regression: accurate, long training times
    • Decision forest regression: accurate, fast training times
    • Boosted decision tree regression: accurate, fast training times, large memory footprint
    B. Clustering
    • K-means: unsupervised learning
    C. Anomaly detection
    • PCA-based anomaly detection: fast training times
    • One-class SVM: under 100 features, aggressive boundary
    D. Two-class classification
    • Two-class SVM: under 100 features, linear model
    • Two-class averaged perceptron: fast training, linear model
    • Two-class Bayes point machine: fast training, linear model
    • Two-class decision forest
    • Two-class logistic regression
    • Two-class boosted decision tree
    • Two-class decision jungle
    • Two-class locally deep SVM
    • Two-class neural network
    E. Multiclass classification
    • Multiclass logistic regression
    • Multiclass neural network
    • Multiclass decision forest
    • Multiclass decision jungle
    • One-vs-all multiclass: depends on the underlying two-class classifier
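A combined sketch touching one representative algorithm from each family above (toy data and illustrative settings; these are assumptions, not the note's own examples):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# A. regression: a fast-training linear model
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)
print("linear regression R^2:", LinearRegression().fit(X, y).score(X, y))

# B. clustering: unsupervised K-means
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# C. anomaly detection: a one-class SVM flags outliers as -1
outlier_flags = OneClassSVM(nu=0.05).fit_predict(X)

# D/E. two-class and multiclass classification with logistic regression
iris_X, iris_y = load_iris(return_X_y=True)
binary_mask = iris_y < 2
two_class = LogisticRegression(max_iter=1000).fit(iris_X[binary_mask], iris_y[binary_mask])
multiclass = LogisticRegression(max_iter=1000).fit(iris_X, iris_y)
print("two-class accuracy:", two_class.score(iris_X[binary_mask], iris_y[binary_mask]))
print("multiclass accuracy:", multiclass.score(iris_X, iris_y))
```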
  6. Semi-supervised learning
    Sits between supervised and unsupervised learning: a small portion of the data is labeled, while most of it is not. It can achieve high accuracy, and compared with supervised learning its training cost is much lower.
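A minimal sketch of the idea: hide most labels (scikit-learn marks unlabeled points with -1) and let a semi-supervised model propagate the few known labels. The dataset and the 90% masking rate are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

y_partial = y.copy()
unlabeled = rng.rand(len(y)) < 0.9   # pretend ~90% of the labels are unknown
y_partial[unlabeled] = -1            # -1 means "no label"

model = LabelSpreading().fit(X, y_partial)
accuracy = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print("accuracy on the originally unlabeled points:", accuracy)
```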

  7. Reinforcement learning
    From a sequence of actions, the agent learns to maximize a feedback (reward) function, where the feedback can mark actions as "bad" or "good". Reinforcement learning is often used in autonomous driving, where decisions are made from the stream of feedback coming from the surrounding environment. A small Q-learning sketch follows the figure below.


    [Figure: reinforcement learning]
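A minimal tabular Q-learning sketch on a hypothetical five-state corridor (this toy environment is my own illustration, not from the note): the agent learns, from good/bad feedback, which action to take in each state.

```python
import numpy as np

n_states, n_actions = 5, 2      # positions 0..4; actions: 0 = move left, 1 = move right
goal = n_states - 1             # reaching the right end yields reward +1 ("good action")
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.RandomState(0)

for episode in range(200):
    s = 0
    while s != goal:
        # epsilon-greedy action choice
        a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
        r = 1.0 if s_next == goal else 0.0
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned policy: move right (1) in every non-goal state
```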
  8. Taxonomy of machine learning algorithms


    [Figure: taxonomy of machine learning algorithms]
  9. A tip
    If the results are very good during training but very poor at the evaluation stage, the model is most likely overfitting.

  10. Three common validation methods (a sketch follows this list)

    • Hold-out validation: set aside a dedicated validation set; suitable for large data samples

    • K-fold cross validation: split the training set into k equal folds; suitable for small data samples


      [Figure: k-fold cross validation]
    • Leave-one-out cross validation (LOOCV): a special case of k-fold cross validation, repeated until every observation has served once as the validation data.
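A minimal sketch of the three schemes with scikit-learn (the iris data and logistic regression are assumed placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# hold-out validation: keep one fixed validation split aside
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", clf.fit(X_tr, y_tr).score(X_val, y_val))

# k-fold cross validation: split the data into k equal folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold accuracy:", cross_val_score(clf, X, y, cv=kfold).mean())

# leave-one-out: every observation is used exactly once as the validation set
print("LOOCV accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```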

  11. Several ways to evaluate a model


  • A. Accuracy, precision, recall
    To judge which model performs best, use the F score; its defining equations are shown in the figure below.


    [Figure: definitions of precision, recall, and the F score]

    The larger F is, the better.
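For reference, assuming the figure showed the usual confusion-matrix definitions, the standard formulas are:

```latex
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```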

  • B. ROC curves


    [Figures: ROC curves]

    An advantage of ROC curves is that they are not affected by the class distribution (i.e., they remain informative for imbalanced classes).

  • C. AUC (area under the curve)


    [Figure: AUC illustration]

    The higher the AUC, the better.
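A small sketch computing an ROC curve and its AUC with scikit-learn (an assumed imbalanced toy dataset, to echo the class-distribution point above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# imbalanced toy data: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # the points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))       # higher is better
```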

  • D. R-squared, the coefficient of determination, in [0, 1]
    It is a standard way of measuring how well the model fits the data.


    [Figure: definition of R-squared]

    Drawbacks: R-squared only ever grows and never decreases as more terms are added, so a model with more variables always has a larger R-squared and may be mistakenly judged the better model; in addition, with a higher-order model the noise in the training data is easily mistaken for signal, i.e., the noise gets fitted by the model.
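For reference, assuming the figure showed the usual definition, the coefficient of determination is:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```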

  12. A tip
    Sometimes a model with very high accuracy is not actually useful. For example, a dataset in which 99% of the samples have no cancer and only 1% have cancer is a case of imbalanced class distribution; here it may be necessary to build two models, model A to detect the presence of cancer and model B to confirm its absence.
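A minimal sketch of the trap (simulated labels with an assumed 1% positive rate): a classifier that always predicts the majority class scores about 99% accuracy while detecting no cancer cases at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.RandomState(0)
y = (rng.rand(1000) < 0.01).astype(int)   # 1 = has cancer (~1% of samples)
X = rng.rand(1000, 5)                     # the features are irrelevant here

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)
print("accuracy:", accuracy_score(y, y_pred))               # ~0.99, looks great
print("recall on cancer cases:", recall_score(y, y_pred))   # 0.0, clinically useless
```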

  13. The bias-and-variance problem
    Underfitting corresponds to high bias.
    Overfitting corresponds to high variance.
    When judging a model: if it performs well on the training set but poorly on the validation set, that is a high-variance problem (overfitting); if it performs poorly on both the training set and the validation set, that is a high-bias problem (underfitting).
    Remedies (see the figure and the sketch below):


    [Figure: remedies for high bias and high variance]
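A small diagnostic sketch of the rule above (the toy 1-D data and polynomial degrees are assumptions): compare training and validation scores for models of increasing complexity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.rand(60, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.randn(60)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 4, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:2d}: train R^2 = {model.score(X_tr, y_tr):.2f}, "
          f"validation R^2 = {model.score(X_val, y_val):.2f}")

# expected pattern: degree 1 scores poorly on both sets (high bias / underfit);
# as the degree grows, the gap between train and validation widens (high variance / overfit)
```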
