2020-07-[01,02] Summer Study Daily Update Plan (Classif

Author: Reza_ | Published: 2020-07-02 00:15

Machine learning

classification

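  I watched two of Professor Hung-yi Lee's lectures: classification and logistic regression. Both lean heavily on probability theory and, to be honest, I did not really follow the mathematical derivations; it felt like I was studying math the whole time.
  Below is the code from the homework solution, which solves the classification problem using nothing but numpy. I find that genuinely impressive. The comments that document the functions come from the original solution; the shorter notes are ones I added (originally in Chinese).
  For a classification problem there are two ways to build the model: 1. a discriminative model (logistic regression in this homework); 2. a generative model. Both are somewhat involved and I have not fully figured them out yet, so I will fill in the details later.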
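
As a rough summary of how the two approaches differ (this is the standard textbook formulation written in my own notation, not something copied from the lecture slides), for two classes $C_0$ and $C_1$:

```latex
\begin{align*}
\text{Discriminative:}\quad & P(C_1 \mid x) = \sigma(w^\top x + b),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\text{Generative:}\quad & P(C_1 \mid x) =
  \frac{P(x \mid C_1)\,P(C_1)}{P(x \mid C_0)\,P(C_0) + P(x \mid C_1)\,P(C_1)}
\end{align*}
```

The discriminative model learns $w$ and $b$ directly by gradient descent; the generative model first estimates $P(x \mid C_k)$ and $P(C_k)$ from the data and then applies Bayes' rule.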

Discriminative model

Data loading and preprocessing

import numpy as np

np.random.seed(0)
X_train_fpath = './hw2_data/X_train'
Y_train_fpath = './hw2_data/Y_train'
X_test_fpath = './hw2_data/X_test'
output_fpath = './output_{}.csv'

# Parse csv files to numpy array
with open(X_train_fpath) as f:
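    next(f)
    # next(f) skips the header row, so only the data rows are parsed below
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
# Strip the trailing newline from each line, split on commas, and drop the first column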
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)

def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of training data will be reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data

    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column] ,0).reshape(1, -1)
        X_std  = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)
    # Z-score standardization, the same as in the linear regression homework
    return X, X_mean, X_std

def _train_dev_split(X, Y, dev_ratio = 0.25):
    # This function splits data into training set and development set.
    train_size = int(len(X) * (1 - dev_ratio))
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]
# Split the data by ratio into a training set and a development set (I think "validation set" is the more accurate name)

# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _= _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)
    
# Split data into training set and development set
dev_ratio = 0.1
X_train, Y_train, X_dev, Y_dev = _train_dev_split(X_train, Y_train, dev_ratio = dev_ratio)

train_size = X_train.shape[0]
dev_size = X_dev.shape[0]
test_size = X_test.shape[0]
data_dim = X_train.shape[1]
print('Size of training set: {}'.format(train_size))
print('Size of development set: {}'.format(dev_size))
print('Size of testing set: {}'.format(test_size))
print('Dimension of data: {}'.format(data_dim))
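
To convince myself of what the normalization step does, here is a minimal sketch on toy data. The helper names `zscore_fit` and `zscore_apply` and the toy arrays are my own, not part of the homework code; the point is only that the test data reuses the mean and standard deviation computed on the training data.

```python
import numpy as np

def zscore_fit(X):
    # Column-wise mean and standard deviation of the training data.
    return X.mean(axis=0, keepdims=True), X.std(axis=0, keepdims=True)

def zscore_apply(X, mean, std, eps=1e-8):
    # eps guards against division by zero for constant columns.
    return (X - mean) / (std + eps)

# Hypothetical toy data: 3 training rows, 1 test row, 2 features.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

mean, std = zscore_fit(toy_train)
toy_train_n = zscore_apply(toy_train, mean, std)
toy_test_n = zscore_apply(toy_test, mean, std)  # reuses the training statistics

print(toy_train_n.mean(axis=0))  # close to 0 in every column
print(toy_train_n.std(axis=0))   # close to 1 in every column
print(toy_test_n)
```

Unlike `_normalize`, this sketch returns new arrays instead of overwriting its input, which makes it easier to experiment with.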

Helper functions

def _shuffle(X, Y):
    # This function shuffles two equal-length list/array, X and Y, together.
    randomize = np.arange(len(X))
    np.random.shuffle(randomize)
    return (X[randomize], Y[randomize])
# Shuffle X and Y together using the same random permutation of indices

def _sigmoid(z):
    # Sigmoid function can be used to calculate probability.
    # The output is clipped to [1e-8, 1 - 1e-8] so that the logarithms in the cross entropy stay finite.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))
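# The sigmoid returns a value between 0 and 1; np.clip() replaces every value below the minimum with the minimum and every value above the maximum with the maximum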

def _f(X, w, b):
    # This is the logistic regression function, parameterized by w and b
    #
    # Arguments:
    #     X: input data, shape = [batch_size, data_dimension]
    #     w: weight vector, shape = [data_dimension, ]
    #     b: bias, scalar
    # Output:
    #     predicted probability of each row of X being positively labeled, shape = [batch_size, ]
    return _sigmoid(np.matmul(X, w) + b)

def _predict(X, w, b):
    # This function returns a truth value prediction for each row of X 
    # by rounding the result of logistic regression function.
    return np.round(_f(X, w, b)).astype(int)
    # Round the predicted probability and cast it to int: the labels are binary, so a prediction is either 0 or 1

def _accuracy(Y_pred, Y_label):
    # This function calculates prediction accuracy
    acc = 1 - np.mean(np.abs(Y_pred - Y_label))
    return acc
# Returns the accuracy (the fraction of correct predictions)
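
Here is a small sketch of why the clipping in `_sigmoid` matters, as far as I understand it. The function name `sigmoid_clipped` and the extreme inputs are my own choices for illustration: without the clip, the sigmoid would return exactly 0 or 1 for such inputs and the logarithm in the loss would be infinite.

```python
import numpy as np

def sigmoid_clipped(z, eps=1e-8):
    # Same idea as _sigmoid: squash to (0, 1), then keep away from exact 0 and 1.
    z = np.asarray(z, dtype=float)
    with np.errstate(over='ignore'):  # exp(-z) overflows to inf for very negative z
        p = 1.0 / (1.0 + np.exp(-z))
    return np.clip(p, eps, 1 - eps)

z = np.array([-1000.0, 0.0, 1000.0])
p = sigmoid_clipped(z)

print(p)              # values stay strictly inside (0, 1)
print(np.log(p))      # finite everywhere
print(np.log(1 - p))  # finite everywhere
```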

Cross-entropy loss and gradient functions

Binary cross-entropy function
def _cross_entropy_loss(y_pred, Y_label):
    # This function computes the cross entropy.
    #
    # Arguments:
    #     y_pred: probabilistic predictions, float vector
    #     Y_label: ground truth labels, bool vector
    # Output:
    #     cross entropy, scalar
    cross_entropy = -np.dot(Y_label, np.log(y_pred)) - np.dot((1 - Y_label), np.log(1 - y_pred))
    return cross_entropy
# The cross entropy is used as the loss function

def _gradient(X, Y_label, w, b):
    # This function computes the gradient of cross entropy loss with respect to weight w and bias b.
    y_pred = _f(X, w, b)
    pred_error = Y_label - y_pred
    w_grad = -np.sum(pred_error * X.T, 1)
    b_grad = -np.sum(pred_error)
    return w_grad, b_grad
# The gradient has the same form as in linear regression; the only addition is the gradient for the bias term b
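
Since I could not follow the derivation, one way to at least check the gradient formula is to compare it with a finite-difference approximation of the loss. The sketch below is my own check on random toy data, not part of the homework; the functions are re-declared under shorter names so the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return np.clip(1.0 / (1.0 + np.exp(-z)), 1e-8, 1 - 1e-8)

def loss(X, y, w, b):
    # Summed binary cross entropy, same formula as _cross_entropy_loss.
    p = sigmoid(X @ w + b)
    return -np.dot(y, np.log(p)) - np.dot(1 - y, np.log(1 - p))

def analytic_grad(X, y, w, b):
    # Same formula as _gradient: -X^T (y - p) for w, -sum(y - p) for b.
    err = y - sigmoid(X @ w + b)
    return -X.T @ err, -np.sum(err)

# Hypothetical toy problem: 20 samples, 3 features, random binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

w_grad, b_grad = analytic_grad(X, y, w, b)

# Central finite differences, one coordinate at a time.
h = 1e-6
num_w = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = h
    num_w[j] = (loss(X, y, w + e, b) - loss(X, y, w - e, b)) / (2 * h)
num_b = (loss(X, y, w, b + h) - loss(X, y, w, b - h)) / (2 * h)

print(np.max(np.abs(w_grad - num_w)))  # should be tiny
print(abs(b_grad - num_b))             # should be tiny
```

If the two gradients agree to several decimal places, the analytic formula matches the loss it is supposed to differentiate.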

Training

# Zero initialization for weights and bias
w = np.zeros((data_dim,)) 
b = np.zeros((1,))
# Initial values for the weights w and the bias b

# Some parameters for training    
max_iter = 10
batch_size = 8
learning_rate = 0.2
# Hyperparameters

# Keep the loss and accuracy at every iteration for plotting
train_loss = []
dev_loss = []
train_acc = []
dev_acc = []

# Calculate the number of parameter updates
step = 1

# Iterative training
for epoch in range(max_iter):
    # Random shuffle at the beginning of each epoch
    X_train, Y_train = _shuffle(X_train, Y_train)
        
    # Mini-batch training
    for idx in range(int(np.floor(train_size / batch_size))):
        X = X_train[idx*batch_size:(idx+1)*batch_size]
        Y = Y_train[idx*batch_size:(idx+1)*batch_size]
        # Take only one mini-batch of data at a time

        # Compute the gradient
        w_grad, b_grad = _gradient(X, Y, w, b)
            
        # gradient descent update
        # learning rate decay with time
        w = w - learning_rate/np.sqrt(step) * w_grad
        b = b - learning_rate/np.sqrt(step) * b_grad
        # The step size learning_rate / sqrt(step) keeps shrinking as training proceeds
        step = step + 1
            
    # Compute loss and accuracy of training set and development set
    y_train_pred = _f(X_train, w, b)
    Y_train_pred = np.round(y_train_pred)
    train_acc.append(_accuracy(Y_train_pred, Y_train))
    train_loss.append(_cross_entropy_loss(y_train_pred, Y_train) / train_size)

    y_dev_pred = _f(X_dev, w, b)
    Y_dev_pred = np.round(y_dev_pred)
    dev_acc.append(_accuracy(Y_dev_pred, Y_dev))
    dev_loss.append(_cross_entropy_loss(y_dev_pred, Y_dev) / dev_size)

print('Training loss: {}'.format(train_loss[-1]))
print('Development loss: {}'.format(dev_loss[-1]))
print('Training accuracy: {}'.format(train_acc[-1]))
print('Development accuracy: {}'.format(dev_acc[-1]))
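
Two details of the training loop are easier to see with numbers. The sketch below uses the same `learning_rate` and `batch_size` as above; the step values and the row count `n_rows` are hypothetical and only there for illustration.

```python
import numpy as np

learning_rate = 0.2
batch_size = 8

# 1. The effective step size shrinks like 1 / sqrt(step).
steps = np.array([1, 4, 100, 10000])
effective_lr = learning_rate / np.sqrt(steps)
for s, lr in zip(steps, effective_lr):
    print('step {:>5d}: step size {:.4f}'.format(int(s), lr))

# 2. floor(n_rows / batch_size) full batches are used per epoch;
#    the remaining rows are skipped in that epoch.
n_rows = 50  # hypothetical training set size
num_batches = int(np.floor(n_rows / batch_size))
leftover = n_rows - num_batches * batch_size
print('batches per epoch: {}, rows skipped: {}'.format(num_batches, leftover))
```

Because the data is reshuffled at the start of every epoch, the skipped rows are different each time, so no sample is left out permanently.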

Plotting the loss and accuracy curves

import matplotlib.pyplot as plt

# Loss curve
plt.plot(train_loss)
plt.plot(dev_loss)
plt.title('Loss')
plt.legend(['train', 'dev'])
plt.savefig('loss.png')
plt.show()

# Accuracy curve
plt.plot(train_acc)
plt.plot(dev_acc)
plt.title('Accuracy')
plt.legend(['train', 'dev'])
plt.savefig('acc.png')
plt.show()

[Figures: accuracy curve and loss curve]

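Generative model

I rushed through probability theory, so I still do not really understand this model; for now I can only paste the code.

Data preprocessing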

# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)

# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _= _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)

Computing the class means and covariance

# Compute in-class mean
X_train_0 = np.array([x for x, y in zip(X_train, Y_train) if y == 0])
X_train_1 = np.array([x for x, y in zip(X_train, Y_train) if y == 1])

mean_0 = np.mean(X_train_0, axis = 0)
mean_1 = np.mean(X_train_1, axis = 0)  

# Compute in-class covariance
cov_0 = np.zeros((data_dim, data_dim))
cov_1 = np.zeros((data_dim, data_dim))

for x in X_train_0:
    cov_0 += np.dot(np.transpose([x - mean_0]), [x - mean_0]) / X_train_0.shape[0]
for x in X_train_1:
    cov_1 += np.dot(np.transpose([x - mean_1]), [x - mean_1]) / X_train_1.shape[0]

# Shared covariance is taken as a weighted average of individual in-class covariance.
cov = (cov_0 * X_train_0.shape[0] + cov_1 * X_train_1.shape[0]) / (X_train_0.shape[0] + X_train_1.shape[0])
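
The per-sample loops above can also be written as a single matrix product. This is a sketch on random toy data (the arrays `X0` and `X1` and the helper `class_cov` are my own, standing in for `X_train_0` and `X_train_1`), with a check against `np.cov` using the same 1/N normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(size=(30, 4))           # hypothetical class-0 samples
X1 = rng.normal(loc=1.0, size=(50, 4))  # hypothetical class-1 samples

def class_cov(X):
    # Maximum-likelihood covariance: divide by N, matching the loop above.
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

cov_0, cov_1 = class_cov(X0), class_cov(X1)

# Shared covariance: average of the class covariances weighted by class size.
n0, n1 = X0.shape[0], X1.shape[0]
shared = (n0 * cov_0 + n1 * cov_1) / (n0 + n1)

# np.cov with bias=True uses the same 1/N normalization.
print(np.allclose(cov_0, np.cov(X0, rowvar=False, bias=True)))
print(shared.shape)
```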

Computing the weights and bias, and reporting the result

# Compute inverse of covariance matrix.
# Since covariance matrix may be nearly singular, np.linalg.inv() may give a large numerical error.
# Via SVD decomposition, one can get matrix inverse efficiently and accurately.
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)

# Directly compute weights and bias
w = np.dot(inv, mean_0 - mean_1)
b =  (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1))\
    + np.log(float(X_train_0.shape[0]) / X_train_1.shape[0]) 

# Compute accuracy on training set
Y_train_pred = 1 - _predict(X_train, w, b)
print('Training accuracy: {}'.format(_accuracy(Y_train_pred, Y_train)))
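
For reference, this is the closed-form solution that the code above appears to implement. It is the standard result for two Gaussian classes with a shared covariance $\Sigma$; the symbols $\mu_0, \mu_1$ (for `mean_0`, `mean_1`) and $N_0, N_1$ (the class counts) are my own notation.

```latex
\begin{align*}
P(C_0 \mid x) &= \sigma(w^\top x + b) \\
w &= \Sigma^{-1}(\mu_0 - \mu_1) \\
b &= -\tfrac{1}{2}\,\mu_0^\top \Sigma^{-1} \mu_0
     + \tfrac{1}{2}\,\mu_1^\top \Sigma^{-1} \mu_1
     + \ln\frac{N_0}{N_1}
\end{align*}
```

With this choice of $w$ and $b$ the sigmoid gives the probability of class 0, which would explain why the prediction is flipped with `1 - _predict(X_train, w, b)` before it is compared with the labels.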

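Probability theory

1. Method of moments and maximum likelihood estimation
2. Criteria for evaluating estimators: efficiency, unbiasedness, consistency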
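
To make "unbiasedness" concrete, here is a small simulation of my own (the true variance, sample size and number of trials are arbitrary choices): the maximum likelihood estimate of the variance divides by n and is biased, while dividing by n - 1 gives an unbiased estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, trials = 5, 200_000

# Each row is one small sample of size n from a normal distribution.
samples = rng.normal(scale=np.sqrt(true_var), size=(trials, n))

mle_var = samples.var(axis=1, ddof=0)       # divide by n (maximum likelihood)
unbiased_var = samples.var(axis=1, ddof=1)  # divide by n - 1

print(mle_var.mean())       # tends towards true_var * (n - 1) / n = 3.2
print(unbiased_var.mean())  # tends towards true_var = 4.0
```

The covariance in the generative model above also divides by the class size N, so it is the maximum likelihood version of the estimate.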

Article link: https://www.haomeiwen.com/subject/iandqktx.html