Machine Learning: The Titanic Disaster

Author: Jiangyouhua | Published: 2021-12-15 13:41


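Hi everyone, I'm Jiang Youhua.
The Titanic disaster is one of Kaggle's introductory cases. There are plenty of write-ups online, but because the libraries have been upgraded since, some of the methods they use no longer work.

This post walks through the Titanic project, which is where many newcomers, ourselves included, get started.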

I. Notes

None of these approaches are my own; they come from earlier write-ups. I only add a few explanations for beginners like me.

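Kaggle has a built-in notebook editor that requires you to log in, and it works much like Colaboratory. In both, the cells of a notebook share one session, so anything imported or defined in a cell you have already run can be used in later cells. If a name turns out to be missing, the session has probably been restarted and the earlier cells need to be run again.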
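The Titanic task is to decide whether a given passenger survived the disaster. It is a typical binary classification task.
If you run it in Colaboratory, you need to download the data yourself. For me Colaboratory responds faster than Kaggle, so I started with Colaboratory and am still using it.

If you also want to work in Colaboratory, download the Titanic data and upload it to session storage through Colaboratory's Files panel.
If you want to work on Kaggle, open the [Code] page and click New Notebook, or the large plus sign in the top-left corner.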

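II. Getting started

1. Run the first code cell.

Kaggle
When the notebook opens, a block of Python code is already waiting for us. Click the run button at its top left. The output below tells us where the data files are loaded from.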

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv

Colaboratory
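In Google Drive, click "+ New" in the top-left corner and choose "More" > "Google Colaboratory" to open a new notebook.
Click the "Files" icon on the left, then click "Upload to session storage" and upload the CSV files you just downloaded.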
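Since the only difference between the two environments is where the CSV files live, one way to keep a single notebook for both is to pick the path at run time. This is a minimal sketch, assuming the Kaggle path shown above and a file uploaded as train.csv in Colaboratory. The helper name find_data_file is made up for this example.

```python
import os

# Hypothetical helper: return the first candidate path that exists on disk.
def find_data_file(candidates, exists=os.path.exists):
    for path in candidates:
        if exists(path):
            return path
    raise FileNotFoundError(f"none of these files exist: {candidates}")

# Kaggle mounts the competition data under /kaggle/input; in Colaboratory
# the uploaded file sits in the working directory.
TRAIN_CANDIDATES = ["/kaggle/input/titanic/train.csv", "train.csv"]

# Usage (uncomment in a notebook where one of the files is present):
# import pandas as pd
# titanic = pd.read_csv(find_data_file(TRAIN_CANDIDATES))
```

The exists parameter is there only so that the lookup can be tried without touching the file system.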

2. Take a look at the data.

Kaggle

import pandas as pd
titanic = pd.read_csv('/kaggle/input/titanic/train.csv')
print(titanic.describe())
print(titanic.info)

Colaboratory

import pandas as pd
titanic = pd.read_csv('train.csv')
print(titanic.describe())
print(titanic.info)

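This is the output on Kaggle; Colaboratory elides some columns. From here on the code is the same in both environments. Note that print(titanic.info) omits the call parentheses, so the second part of the output is the representation of the bound method (a preview of the DataFrame) rather than the per-column summary that titanic.info() prints: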

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
<bound method DataFrame.info of      PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ...    ...   
886                              Montvila, Rev. Juozas    male  27.0      0   
887                       Graham, Miss. Margaret Edith  female  19.0      0   
888           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1   
889                              Behr, Mr. Karl Howell    male  26.0      0   
890                                Dooley, Mr. Patrick    male  32.0      0   

     Parch            Ticket     Fare Cabin Embarked  
0        0         A/5 21171   7.2500   NaN        S  
1        0          PC 17599  71.2833   C85        C  
2        0  STON/O2. 3101282   7.9250   NaN        S  
3        0            113803  53.1000  C123        S  
4        0            373450   8.0500   NaN        S  
..     ...               ...      ...   ...      ...  
886      0            211536  13.0000   NaN        S  
887      0            112053  30.0000   B42        S  
888      2        W./C. 6607  23.4500   NaN        S  
889      0            111369  30.0000  C148        C  
890      0            370376   7.7500   NaN        Q  

[891 rows x 12 columns]>

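Columns:
- PassengerId: passenger ID;
- Survived: whether the passenger survived, 1 for survived, 0 for not;
- Pclass: ticket class, 1, 2 or 3, with 1st class the best and the others following in order;
- Age: age;
- SibSp: number of siblings and spouses aboard;
- Parch: number of parents and children aboard;
- Ticket: ticket number;
- Fare: fare paid;
- Name: name;
- Sex: sex;
- Cabin: cabin number;
- Embarked: port of embarkation.
About the data:
- Seven columns are numeric (the ones describe() reports above); the other five (Name, Sex, Ticket, Cabin, Embarked) are strings.
- Some values are missing; for example, Age has only 714 values for 891 rows.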
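To see exactly which columns have gaps, the missing values can be counted per column. The sketch below uses a tiny hand-made frame so that it runs on its own; the values are illustrative, and in the notebook the same two calls would be applied to titanic.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame, only to illustrate; replace it with the real `titanic`.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
    "Pclass": [3, 1, 3, 1],
})

# Number of missing values per column, largest first.
missing = sample.isnull().sum().sort_values(ascending=False)
print(missing)

# Per-column dtype and non-null count; note the parentheses.
sample.info()
```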

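3. Feature engineering.

Let's set the term "feature engineering" aside for a moment.
What we need to do is sort out the relevant factors. What is a relevant factor? When an accident happens, the people held responsible are the relevant factors. But we are concerned with the event rather than the people, so the relevant factors are whatever changes the outcome of the event. Here they are what we call features.

Before machine learning, how would we have solved this problem? We would pick the features, assign each a weight, combine them into a single score, and turn that score into a probability. With machine learning, our own work is the first step, or a little more than the first step, done so that the model can reach the best solution. That work is what is called feature engineering.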
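The hand-made approach described above (features, weights, a combined score, a probability) can be sketched in a few lines. The weights and the bias below are invented purely for illustration and are not fitted to the data; a logistic function turns the score into a number between 0 and 1.

```python
import math

# Hypothetical hand-picked weights: positive values push towards "survived".
WEIGHTS = {"is_female": 2.5, "is_child": 1.0, "pclass": -0.9}
BIAS = 0.5

def survival_probability(is_female, is_child, pclass):
    # Combine the weighted features into one score ...
    score = (BIAS
             + WEIGHTS["is_female"] * is_female
             + WEIGHTS["is_child"] * is_child
             + WEIGHTS["pclass"] * pclass)
    # ... and squash the score into a probability with the logistic function.
    return 1.0 / (1.0 + math.exp(-score))

woman_first_class = survival_probability(is_female=1, is_child=0, pclass=1)
man_third_class = survival_probability(is_female=0, is_child=0, pclass=3)
print(woman_first_class, man_third_class)
```

Logistic regression, used later in this post, has the same form but learns the weights from the data instead of taking them from a guess.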

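Now that we have a rough idea of feature engineering, let's carry it out.

First, pick the features:

- Women and children. The film makes this very clear. This involves sex and age.
- Status. This is how society works, and it probably applies here too. It involves cabin class, fare and name. Why name? Because the names on the passenger list carry titles that reflect social standing. Cabin class and fare correspond to each other, so we use cabin class (Pclass), whose data is simpler.
- Family. In classical novels, someone begging a bandit for mercy pleads "an eighty-year-old mother above me and a seven-year-old child below", so passengers with parents or children are probably more likely to be rescued. Conversely, those with many siblings are more likely to be left behind.
So our features are these six: Sex, Age, Pclass, Name, Parch, SibSp.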

Second, clean up the data.

Sex: convert the strings to numbers, 0 for male and 1 for female;

titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1

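Age: split into adults and children, 0 for age >= 16 and 1 otherwise. Missing ages are filled with the median (28.0), so those passengers count as adults (IsChild = 0). The new column is named IsChild.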

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic.loc[titanic['Age'] >= 16, 'IsChild'] = 0
titanic.loc[titanic['Age'] < 16, 'IsChild'] = 1

Name: map the social title to a number in a new column, Title. Find the word in the name that is followed by a period, then map it.

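import re

def get_title(name):
  # A title is the run of letters immediately before a period,
  # e.g. "Mr" in "Braund, Mr. Owen Harris". The raw string keeps "\." intact.
  title_search = re.search(r'([A-Za-z]+)\.', name)
  if title_search:
    return title_search.group(1)
  return ''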

titles = titanic['Name'].apply(get_title)
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 8, "Mlle": 9,
                 "Mme": 10, "Don": 11, "Lady": 12, "Countess": 13, "Jonkheer": 14, "Sir": 15, "Capt": 16, "Ms": 17
                 }
for k, v in title_mapping.items():
  titles[titles == k] = v

titanic["Title"] = titles
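To check the pattern before applying it to the whole column, it can be tried on a few names taken from the output above. This standalone sketch repeats the function so that it runs on its own.

```python
import re

def get_title(name):
    # Same pattern as above: letters immediately followed by a period.
    match = re.search(r"([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
]
extracted = [get_title(n) for n in names]
print(extracted)  # ['Mr', 'Miss', 'Rev']
```

Any title that is not a key of title_mapping stays a string after the loop, so it is worth confirming that every extracted title is covered before training.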

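Pclass, Parch, SibSp: unchanged; no data is missing.

In addition, to evaluate how useful the features are, we also need to convert the other columns to numbers.

Embarked: missing values are filled with the port where the most passengers boarded (S).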

titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
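Rather than hard-coding 'S', the most common port can be computed from the column itself. This is a sketch on a small hand-made Series, not the real data; the codes are the same as in the code above.

```python
import numpy as np
import pandas as pd

# Tiny hand-made column for illustration; use titanic['Embarked'] in practice.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

most_common = embarked.mode()[0]      # most frequent non-missing value
codes = {"S": 0, "C": 1, "Q": 2}      # same codes as in the text
encoded = embarked.fillna(most_common).map(codes)
print(most_common, encoded.tolist())
```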

Cabin: too much work to process, so we drop it.

Third, analysis of variance (ANOVA F-test).

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt

predictors = ["Sex", "Age", "IsChild", "Pclass", "Title", "Parch", "SibSp",  "Fare", "Embarked"]

# Score each feature with the ANOVA F-test; a smaller p-value means a stronger link to Survived.
selector = SelectKBest(f_classif, k = 5)
selector.fit(titanic[predictors], titanic['Survived'])
# Plot -log10(p-value) so that stronger features get taller bars.
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

Figure: bar chart of the feature scores (-log10 of each p-value).
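To make the chart easier to read, here is what the selector does on data where the answer is known. This is a sketch on synthetic data (not the Titanic set): one column follows the label and one is pure noise. The column names are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_samples = 200
y = rng.integers(0, 2, size=n_samples)

# Synthetic data, not the Titanic set: "signal" follows y, "noise" does not.
signal = y + rng.normal(0.0, 0.5, size=n_samples)
noise = rng.normal(0.0, 1.0, size=n_samples)
X = np.column_stack([signal, noise])
names = ["signal", "noise"]

selector = SelectKBest(f_classif, k=1).fit(X, y)
scores = -np.log10(selector.pvalues_)   # the quantity plotted above
for name, score in zip(names, scores):
    print(f"{name}: {score:.2f}")

# get_support() marks which columns the selector would keep.
kept = [name for name, keep in zip(names, selector.get_support()) if keep]
print(kept)
```

The same get_support() call on the Titanic selector lists the five columns that k = 5 would retain.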

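Finally, settle on the features.

Reading the chart:

Sex has the greatest influence; keep it.
Age has less influence than IsChild; drop Age and keep IsChild.
Pclass ranks second; keep it.
Title ranks third; keep it.
Parch and SibSp seem to have little influence. This differs from what we expected, possibly because the effect does not show up directly; that is, the groups these features separate are neither independent nor distinctive.
Fare has fairly high influence, but since Pclass is already included we set it aside for now.
Embarked: one port dominates, so we do not use it.

predictors = ['Sex', 'IsChild', 'Pclass', 'Title', 'Parch', 'SibSp']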

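Using sklearn.
sklearn (scikit-learn) is a very powerful third-party machine learning library for Python. For now we are "API engineers": we only need to call it and read the output.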

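4. Machine learning.

Linear regression.

Split the dataset into 3 parts, using 2 for training and 1 for validation. Run this 3 times and output the average.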
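How the 3-way split works is easiest to see on a handful of dummy rows. This sketch uses nine made-up samples and prints which rows are used for training and which for validation in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(-1, 1)   # nine dummy samples

kf = KFold(n_splits=3)            # no shuffling, as in the code below
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train={train} test={test}")
```

Without shuffling, each validation fold is a contiguous block of rows, so the row order of the file matters; KFold(n_splits=3, shuffle=True, random_state=1) would randomize it.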

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn import model_selection
from sklearn import metrics

kf = KFold(n_splits=3)
predictors = ["Sex", "IsChild", "Pclass", "Title", "Parch", "SibSp"]

alg = LinearRegression()
# For a regressor, cross_val_score reports R^2 per fold, not classification accuracy.
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores)
print(scores.mean())

predictions = []
for train_index, test_index in kf.split(titanic):
  train_predictors = titanic[predictors].iloc[train_index, :]
  test_predictors = titanic[predictors].iloc[test_index, :]
  train_target = titanic['Survived'].iloc[train_index]
  alg.fit(train_predictors, train_target) 
  test_predictions = alg.predict(test_predictors)
  predictions.append(test_predictions)

predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
accuracy = sum(predictions == titanic['Survived']) / len(predictions)
print(accuracy)

0.7968574635241302

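Logistic regression

max_iter defaults to 100. If you see the warning "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.", increase it, for example LogisticRegression(random_state=1, max_iter=300). The result (0.801) is close to that of linear regression (0.797).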

from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state=1)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores)
print(scores.mean())

0.8013468013468014
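Besides raising max_iter, a common way to help the solver converge is to standardize the features first. This is a sketch on synthetic data (not the Titanic set), assuming a pipeline of StandardScaler and LogisticRegression; whether it changes the score on the real data would need to be checked.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_samples = 300
# Two synthetic features on very different scales, like Pclass and Fare.
small = rng.integers(1, 4, size=n_samples).astype(float)
large = rng.normal(30.0, 50.0, size=n_samples)
X = np.column_stack([small, large])
y = (large - 20.0 * small + rng.normal(0.0, 20.0, size=n_samples) > 0).astype(int)

# The scaler is fitted inside each training fold, so no validation data leaks in.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
scores = cross_val_score(model, X, y, cv=KFold(n_splits=3))
print(scores.mean())
```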

Random forest

This gives the best result of the models tried here.

from sklearn.ensemble import RandomForestClassifier

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())

0.8215488215488215

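Ensemble of models

Some people hold that combining two fairly strong models yields an even stronger one. Personally I find this puzzling: is "two turkeys don't make an eagle" not true after all? It may be useful in some places, so I note it down here for reference.

*Here the prediction of the higher-scoring model is weighted by 2, the lower-scoring one by 1, and the sum is divided by 3.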

from sklearn.ensemble import GradientBoostingClassifier

algorithms = [
    RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2),
    LogisticRegression(random_state=1, max_iter=250)   
]
predictions = []
for train_index, test_index in kf.split(titanic):
    train_predictors = titanic[predictors].iloc[train_index, :]
    test_predictors = titanic[predictors].iloc[test_index, :]
    train_target = titanic['Survived'].iloc[train_index]
    full_test_prediction = []

    for i, alg in enumerate(algorithms):
        alg.fit(train_predictors, train_target) 
        test_prediction = alg.predict_proba(test_predictors.astype(float))[:, 1]
        full_test_prediction.append(test_prediction)

    test_predictions = (full_test_prediction[0] * 2 + full_test_prediction[1])/3
    predictions.append(test_predictions)

predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
accuracy = sum(predictions == titanic['Survived'])/len(predictions)
print(accuracy)

0.8204264870931538
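scikit-learn also offers this kind of weighted probability averaging ready-made as VotingClassifier with voting="soft". The sketch below runs on synthetic data from make_classification as a stand-in for the Titanic columns, so its score is not comparable to the number above. It should behave like the hand-written loop, apart from how an exact 0.5 tie is resolved.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for titanic[predictors] and titanic['Survived'].
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=1, n_estimators=50,
                                      min_samples_split=4, min_samples_leaf=2)),
        ("lr", LogisticRegression(random_state=1, max_iter=250)),
    ],
    voting="soft",      # average the predicted probabilities
    weights=[2, 1],     # random forest counts twice, as in the loop above
)
scores = cross_val_score(ensemble, X, y, cv=KFold(n_splits=3))
print(scores.mean())
```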

A final note

When Age, Fare and Embarked are added to the predictors, the four models output:

0.7968574635241302
0.7968574635241302
0.8361391694725029
0.8271604938271605

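Linear regression is unchanged.
Logistic regression drops to the same value as linear regression; also, with the default max_iter of 100 it fails to converge, so a larger value is needed.
Random forest improves considerably.
The ensemble improves along with the random forest.

Overfitting.

Here the random forest's accuracy of 0.8215488215488215 is a fairly high value, but in machine learning the highest score is not necessarily the best. As with all statistics, the sample is finite, so it carries its own biases. Emphasizing those biases gives better results on the sample but larger errors on a wider population. In machine learning this is called overfitting.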
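One simple way to spot overfitting is to compare accuracy on the rows the model was trained on with accuracy on held-out rows. This is a sketch on synthetic noisy data (not the Titanic set), using a random forest with default depth so that it can memorize the training rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic noisy data (20% of labels assigned at random), not the Titanic set.
X, y = make_classification(n_samples=300, n_features=6, flip_y=0.2,
                           random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1)

# Accuracy on the very rows the forest was fitted on.
train_accuracy = forest.fit(X, y).score(X, y)
# Accuracy on rows held out during fitting (3-fold cross-validation).
cv_accuracy = cross_val_score(forest, X, y, cv=KFold(n_splits=3)).mean()
print(train_accuracy, cv_accuracy)
```

A large gap between the two numbers suggests the model has learned the quirks of the sample; settings such as min_samples_split and min_samples_leaf, used earlier in this post, limit how closely the trees can fit individual rows.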

That's all for today. I'm Jiang Youhua; see you next time.
