[Practice Project] Credit Card Default Analysis


Author: wnight | Published 2019-03-13 21:03

Dataset used in this project

Credit card data from a Taiwanese bank covering April to September 2005; the dataset contains 25 fields in total.



Git repository

https://github.com/cystanford/credit_default
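
Before running the analysis, a quick sanity check on the raw file can confirm what the description above claims. The snippet below is a minimal sketch (not part of the original post); it assumes UCI_Credit_Card.csv from the repository above has been downloaded into the working directory.

import pandas as pd

# Assumes UCI_Credit_Card.csv was downloaded from the repository above
data = pd.read_csv('./UCI_Credit_Card.csv')
print(data.shape)                  # expect 30000 rows x 25 columns
print(data.columns.tolist())       # field names; the last one is the target 'default.payment.next.month'
print(data.isnull().sum().sum())   # total count of missing values
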
Project approach:
First explore the data, then preprocess it, and finally train several classifiers to compare their classification performance.

Code


# -*- coding: utf-8 -*-
# Credit card default rate analysis
import pandas as pd
from sklearn.model_selection import learning_curve, train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt
import seaborn as sns
import time
# Load the data
data = pd.read_csv('./UCI_Credit_Card.csv')
# Explore the data
print(data.shape) # dataset dimensions
print(data.describe()) # summary statistics
# Distribution of next month's default label
next_month = data['default.payment.next.month'].value_counts()
print(next_month)
df = pd.DataFrame({'default.payment.next.month': next_month.index,'values': next_month.values})
print(df)
plt.figure(figsize = (6,6))
plt.title('Credit card default counts\n(default: 1, no default: 0)')
sns.set_color_codes("pastel")
sns.barplot(x = 'default.payment.next.month', y="values", data=df)
# Annotate each bar with its count, centered above the bar
for xx, yy in zip(df['default.payment.next.month'],df['values']):
    plt.text(xx, yy+0.1, str(yy), ha='center')
plt.show()
# Feature selection: drop the ID column and split off the target column
data.drop(['ID'], inplace=True, axis=1) # the ID column carries no predictive information
target = data['default.payment.next.month'].values
columns = data.columns.tolist()
columns.remove('default.payment.next.month')
features = data[columns].values
# Hold out 30% as the test set (stratified on the target), the rest for training
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify = target, random_state = 1)
    
# Build the candidate classifiers
classifiers = [
    SVC(random_state = 1, kernel = 'rbf'),    
    DecisionTreeClassifier(random_state = 1, criterion = 'gini'),
    RandomForestClassifier(random_state = 1, criterion = 'gini'),
    KNeighborsClassifier(metric = 'minkowski'),
    AdaBoostClassifier(random_state=1),
]
# Classifier names (used as Pipeline step names)
classifier_names = [
            'svc', 
            'decisiontreeclassifier',
            'randomforestclassifier',
            'kneighborsclassifier',
            'adaboostclassifier',
]
# Parameter grids for each classifier
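# Each key follows the '<pipeline step name>__<parameter>' convention so GridSearchCV can route it to the right Pipeline step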
classifier_param_grid = [
            {'svc__C':[1], 'svc__gamma':[0.01]},
            {'decisiontreeclassifier__max_depth':[6,9,11]},
            {'randomforestclassifier__n_estimators':[3,5,6]} ,
            {'kneighborsclassifier__n_neighbors':[4,6,8]},
            {'adaboostclassifier__n_estimators': [10, 50, 100]},
]
 
# Run GridSearchCV parameter tuning for a given classifier pipeline
def GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, param_grid, score = 'accuracy'):
    response = {}
    gridsearch = GridSearchCV(estimator = pipeline, param_grid = param_grid, scoring = score, cv = 3)
    # Search for the best parameters and the corresponding cross-validated accuracy
    search = gridsearch.fit(train_x, train_y)
    print("GridSearch最优参数:", search.best_params_)
    print("GridSearch最优分数: %0.4lf" %search.best_score_)
    predict_y = gridsearch.predict(test_x)
    print("准确率 %0.4lf" %accuracy_score(test_y, predict_y))
    response['predict_y'] = predict_y
    response['accuracy_score'] = accuracy_score(test_y,predict_y)
    return response
 
for model, model_name, model_param_grid in zip(classifiers, classifier_names, classifier_param_grid):
    pipeline = Pipeline([
            ('scaler', StandardScaler()),
            (model_name, model)
    ])
    start = time.time()
    result = GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, model_param_grid, score='accuracy')
    end = time.time()
    print(end - start, 's')

Results

GridSearch best parameters: {'svc__C': 1, 'svc__gamma': 0.01}
GridSearch best score: 0.8174
Accuracy: 0.8172
68.59484457969666 s
GridSearch best parameters: {'decisiontreeclassifier__max_depth': 6}
GridSearch best score: 0.8186
Accuracy: 0.8113
1.8460278511047363 s
GridSearch best parameters: {'randomforestclassifier__n_estimators': 6}
GridSearch best score: 0.7998
Accuracy: 0.7994
2.297856330871582 s
GridSearch best parameters: {'kneighborsclassifier__n_neighbors': 8}
GridSearch best score: 0.8040
Accuracy: 0.8036
154.36387968063354 s
GridSearch best parameters: {'adaboostclassifier__n_estimators': 10}
GridSearch best score: 0.8187
Accuracy: 0.8129
13.483576774597168 s

Conclusions

  • Among the classifiers tried, AdaBoost generally performs well; here 10 weak learners are already enough, and raising the count to 50 or 100 brings no meaningful gain.
  • For this binary classification problem, the SVM also reaches comparable accuracy.
  • With 30,000 rows, KNN and SVM already take minutes to train; at the scale of hundreds of millions of rows, efficiency becomes the real engineering challenge (a sketch of a more scalable linear baseline follows this list).
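
The last point above suggests one direction for much larger data: a linear model trained with stochastic gradient descent scales far better than a kernel SVM or KNN. The snippet below is only a rough sketch of that idea and is not part of the original project; it assumes the train_x/train_y/test_x/test_y split and the imports from the code above, and uses SGDClassifier with a hinge loss, which behaves like a linear SVM but trains in roughly linear time in the number of rows.

# Hypothetical sketch (not in the original project): a linear SVM trained with SGD
# as a more scalable alternative to the kernel SVC and KNN above.
from sklearn.linear_model import SGDClassifier

sgd_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # same scaling step as the other pipelines
    ('sgdclassifier', SGDClassifier(loss='hinge', random_state=1)),  # hinge loss ~ linear SVM
])
start = time.time()
sgd_pipeline.fit(train_x, train_y)
print("Accuracy: %0.4f" % accuracy_score(test_y, sgd_pipeline.predict(test_x)))
print(time.time() - start, 's')
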

Reference

https://time.geekbang.org/column/article/85577
