基于样本数据(年龄、教育、工龄、地址、收入、负债率、信用卡负债、其他负债、违约),挖掘显著影响违约的因子。步骤如下:

import pandas as pd

In [2]:

filename = ‘/Users/frontc/book/ppdam/bankloan.xls’

In [3]:

data = pd.read_excel(filename)

In [29]:

data

Out[29]:

年龄 教育 工龄 地址 收入 负债率 信用卡负债 其他负债 违约
0 41 3 17 12 176 9.3 11.359392 5.008608 1
1 27 1 10 6 31 17.3 1.362202 4.000798 0
2 40 1 15 14 55 5.5 0.856075 2.168925 0
3 41 1 15 14 120 2.9 2.658720 0.821280 0
4 24 2 2 0 28 17.3 1.787436 3.056564 1
5 41 2 5 5 25 10.2 0.392700 2.157300 0
6 39 1 20 9 67 30.6 3.833874 16.668126 0
7 43 1 12 11 38 3.6 0.128592 1.239408 0
8 24 1 3 4 19 24.4 1.358348 3.277652 1
9 36 1 0 13 25 19.7 2.777700 2.147300 0
10 27 1 0 1 16 1.7 0.182512 0.089488 0
11 25 1 4 0 23 5.2 0.252356 0.943644 0
12 52 1 24 14 64 10.0 3.929600 2.470400 0
13 37 1 6 9 29 16.3 1.715901 3.011099 0
14 48 1 22 15 100 9.1 3.703700 5.396300 0
15 36 2 9 6 49 8.6 0.817516 3.396484 1
16 36 2 13 6 41 16.4 2.918216 3.805784 1
17 43 1 23 19 72 7.6 1.181952 4.290048 0
18 39 1 6 9 61 5.7 0.563274 2.913726 0
19 41 3 0 21 26 1.7 0.099008 0.342992 0
20 39 1 22 3 52 3.2 1.154816 0.509184 0
21 47 1 17 21 43 5.6 0.587552 1.820448 0
22 28 1 3 6 26 10.0 0.431600 2.168400 0
23 29 1 8 6 27 9.8 0.402192 2.243808 0
24 21 2 1 2 16 18.0 0.241920 2.638080 1
25 25 4 0 2 32 17.6 2.140160 3.491840 0
26 45 2 9 26 69 6.7 0.707319 3.915681 0
27 43 1 25 21 64 16.7 0.951232 9.736768 0
28 33 2 12 8 58 18.4 3.084208 7.587792 0
29 26 3 2 1 37 14.2 0.204906 5.049094 0
670 23 2 3 4 24 6.3 0.551880 0.960120 0
671 27 1 0 7 18 12.8 0.582912 1.721088 0
672 34 1 6 1 20 1.2 0.042480 0.197520 0
673 35 1 0 5 34 11.1 1.369962 2.404038 1
674 24 2 4 4 20 3.7 0.324120 0.415880 0
675 48 1 30 8 101 6.4 1.874560 4.589440 0
676 26 2 8 1 40 11.8 0.443680 4.276320 0
677 40 1 6 9 36 2.1 0.390852 0.365148 1
678 34 1 9 8 48 9.3 0.419616 4.044384 0
679 35 1 17 4 42 3.0 0.093240 1.166760 0
680 30 1 7 2 33 25.4 1.165098 7.216902 1
681 20 1 4 0 14 9.7 0.200984 1.157016 1
682 36 4 1 17 30 11.5 0.324300 3.125700 0
683 21 1 1 1 16 6.3 0.141120 0.866880 0
684 34 1 18 10 53 10.5 0.840315 4.724685 0
685 35 1 7 5 39 16.1 1.701609 4.577391 1
686 35 3 1 4 20 7.9 0.853200 0.726800 0
687 34 1 10 1 33 10.3 2.501664 0.897336 1
688 33 1 12 12 68 10.8 1.365984 5.978016 0
689 30 1 4 2 18 10.7 0.227268 1.698732 0
690 24 2 0 5 16 7.3 0.024528 1.143472 0
691 47 1 31 8 253 7.2 9.308376 8.907624 0
692 53 1 0 26 27 28.9 2.754459 5.048541 1
693 22 3 0 2 20 4.7 0.219020 0.720980 0
694 48 2 6 1 66 12.1 2.315940 5.670060 0
695 36 2 6 15 27 4.6 0.262062 0.979938 1
696 29 2 6 4 21 11.5 0.369495 2.045505 0
697 33 1 15 3 32 7.6 0.491264 1.940736 0
698 45 1 19 22 77 8.4 2.302608 4.165392 0
699 37 1 12 14 44 14.7 2.994684 3.473316 0

700 rows × 9 columns

In [17]:

x = data.iloc[:,:8].as_matrix() # iloc是取特定的行和列,逗号前为行,逗号后为列。前闭后开。这个是自变量

In [18]:

y = data.iloc[:,8].as_matrix() # 取的是index=8,即第九列。这个是因变量

In [19]:

from sklearn.linear_model import LogisticRegression as LR # 逻辑回归

from sklearn.linear_model import RandomizedLogisticRegression as RLR # 随机逻辑回归

In [20]:

rlr = RLR() # 初始化随机逻辑回归模型

In [21]:

rlr.fit(x,y) # 训练模型

Out[21]:

RandomizedLogisticRegression(C=1, fit_intercept=True,

​ memory=Memory(cachedir=None), n_jobs=1, n_resampling=200,

​ normalize=True, pre_dispatch=’3*n_jobs’, random_state=None,

​ sample_fraction=0.75, scaling=0.5, selection_threshold=0.25,

​ tol=0.001, verbose=False)

In [22]:

rlr.get_support() # 获取特征的筛选结果

Out[22]:

array([False, False, True, True, False, True, True, False], dtype=bool)

In [23]:

rlr.scores_ # 查看各特征的分数

Out[23]:

array([ 0.095, 0.09 , 1. , 0.4 , 0. , 0.985, 0.55 , 0.02 ])

In [24]:

x = data[data.columns[rlr.get_support()]].as_matrix() # 筛选好特征

/Users/frontc/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py:1275: VisibleDeprecationWarning: boolean index did not match indexed array along dimension 0; dimension is 9 but corresponding boolean dimension is 8

result = getitem(key)

In [26]:

lr = LR() # 建立逻辑回归模型

In [27]:

lr.fit(x,y) # 用筛选后的特征数据来训练模型

Out[27]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

​ intercept_scaling=1, max_iter=100, multi_class=’ovr’, n_jobs=1,

​ penalty=’l2’, random_state=None, solver=’liblinear’, tol=0.0001,

​ verbose=0, warm_start=False)

In [28]:

lr.score(x,y) # 查看模型的平均正确率

Out[28]:

0.81428571428571428