The Full Financial Risk-Control Modeling Workflow in One Article (Python)

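This article walks through building an A-card (application scorecard) model. With an A-card model, a bank or other financial institution can assess an applicant's risk from personal information and historical data, which helps it judge whether the borrower has both the ability and the willingness to repay a loan.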

▍Contents

1. Introduction

2. Target definition and data preparation

3. Installing the scorecardpy package

4. Data inspection

5. Data filtering

pip install scorecardpy

# -*- coding:utf-8 -*-

import scorecardpy as sc
import pandas as pd
import numpy as np

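# Sample dataset bundled with scorecardpy
dat = sc.germancredit()

# Inspect the shape of the data and a random sample of rows
print("Shape:",dat.shape)
print("Random sample of 5 rows:\n",dat.sample(5))

# Share of missing values per variable, formatted as a percentage
result1 = (dat.isnull().sum()/dat.shape[0]).map(lambda x:"{:.2%}".format(x))
print(result1)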


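# other_debtors_or_guarantors is mostly missing (most rows hold the value "none"), so it adds nothing to the model and is dropped
# other_installment_plans also has a high share of missing values and can be dropped as well
dat["other_installment_plans"].value_counts()
# telephone carries little modeling value and is dropped. When collecting data, still check that it was filled in,
# both to verify the customer and for follow-up operations; that is outside the scope of this article.
# Drop the three columns
dat = dat.drop(columns=["other_debtors_or_guarantors","other_installment_plans","telephone"])

# Show summary information for the remaining columns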
dat.info()

Data columns (total 18 columns):
 #   Column                                                    Non-Null Count  Dtype   
---  ------                                                    --------------  -----   
 0   status_of_existing_checking_account                       1000 non-null   category
 1   duration_in_month                                         1000 non-null   int64   
 2   credit_history                                            1000 non-null   category
 3   purpose                                                   1000 non-null   object  
 4   credit_amount                                             1000 non-null   int64   
 5   savings_account_and_bonds                                 1000 non-null   category
 6   present_employment_since                                  1000 non-null   category
 7   installment_rate_in_percentage_of_disposable_income       1000 non-null   int64   
 8   personal_status_and_sex                                   1000 non-null   category
 9   present_residence_since                                   1000 non-null   int64   
 10  property                                                  1000 non-null   category
 11  age_in_years                                              1000 non-null   int64   
 12  housing                                                   1000 non-null   category
 13  number_of_existing_credits_at_this_bank                   1000 non-null   int64   
 14  job                                                       1000 non-null   category
 15  number_of_people_being_liable_to_provide_maintenance_for  1000 non-null   int64   
 16  foreign_worker                                            1000 non-null   category
 17  creditability                                             1000 non-null   object

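# After the manual screening, let sc.var_filter decide automatically which variables are usable for modeling
dt_s = sc.var_filter(dat,y="creditability",iv_limit=0.02)

print(dat.shape) # data after the manual drops
print(dt_s.shape) # data after variable filtering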

(1000, 18)
(1000, 12)
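
sc.var_filter keeps a variable only when its information value (IV) reaches iv_limit. To make that threshold concrete, here is a minimal sketch of how WOE and IV are computed for one variable that has already been binned. The helper name woe_iv and the toy data are made up for illustration, and the sign convention (bad share over good share) is an assumption; the IV itself does not depend on the sign.

```python
import math
from collections import Counter

def woe_iv(values, labels):
    """Return ({bin: woe}, iv) for a binned variable; labels are 1 = bad, 0 = good."""
    bad = Counter(v for v, y in zip(values, labels) if y == 1)
    good = Counter(v for v, y in zip(values, labels) if y == 0)
    total_bad, total_good = sum(bad.values()), sum(good.values())
    woe, iv = {}, 0.0
    for b in sorted(set(values)):
        # Share of all bads / all goods that fall into this bin
        bad_share = bad[b] / total_bad
        good_share = good[b] / total_good
        # Assumes every bin holds at least one good and one bad record
        woe[b] = math.log(bad_share / good_share)
        iv += (bad_share - good_share) * woe[b]
    return woe, iv

# Toy data: the "low" bin is mostly bad, the "high" bin is mostly good
values = ["low"] * 4 + ["high"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
woe, iv = woe_iv(values, labels)
print(woe, round(iv, 4))
```

With iv_limit=0.02, a variable whose IV comes out below 0.02 would be filtered out as having almost no predictive power.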

train,test = sc.split_df(dt=dt_s,y="creditability").values()
# Method: random sampling

# Distribution of y in the training data:
train.creditability.value_counts()

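# scorecardpy bins with a decision tree by default, method="tree"
# Here we use chi-square binning instead, method="chimerge"
# The result is a dict of DataFrames; use pandas.concat() to view them all together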
bins = sc.woebin(dt_s,y="creditability",method="chimerge")
bins_df = pd.concat(bins).reset_index().drop(columns="level_0")
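
As a rough illustration of the idea behind method="chimerge": textbook ChiMerge starts from fine-grained bins and repeatedly merges the adjacent pair with the smallest chi-square statistic, that is, the pair whose good/bad mix is most alike. The sketch below computes that statistic for neighbouring bins. chi2_adjacent and the counts are hypothetical, and scorecardpy's own implementation may differ in its details (stopping rules, minimum bin size).

```python
def chi2_adjacent(bin_a, bin_b):
    """Pearson chi-square for two adjacent bins, each given as (good_count, bad_count)."""
    rows = [bin_a, bin_b]
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    n = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Three initial bins as (good, bad) counts, ordered by the variable's value
fine_bins = [(50, 5), (48, 6), (30, 20)]
pair_chi2 = [chi2_adjacent(a, b) for a, b in zip(fine_bins, fine_bins[1:])]
merge_at = pair_chi2.index(min(pair_chi2))  # the most similar adjacent pair
print(pair_chi2, "-> merge bins", merge_at, "and", merge_at + 1)
```

Here the first two bins have nearly the same bad rate, so they are the pair a ChiMerge pass would merge first.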

# Plot the bin distribution of each variable. All 11 variables that survived filtering are plotted before deciding which to keep.

import matplotlib.pyplot as plt

sc.woebin_plot(bins["duration_in_month"])
sc.woebin_plot(bins["installment_rate_in_percentage_of_disposable_income"])
sc.woebin_plot(bins["present_employment_since"])
sc.woebin_plot(bins["savings_account_and_bonds"])
sc.woebin_plot(bins["purpose"])
sc.woebin_plot(bins["status_of_existing_checking_account"])
sc.woebin_plot(bins["credit_history"])
sc.woebin_plot(bins["housing"])
sc.woebin_plot(bins["property"])

# At this step, check each variable's monotonicity, number of bins, IV, and whether every bin makes sense, then move on to manual adjustment
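
Of the checks listed above, monotonicity is the easiest to automate. A small sketch, assuming you pass in the WOE values of one variable's bins in bin order (for example taken from a woe column of the bin table, if your scorecardpy version provides one); is_monotonic is a hypothetical helper, not part of scorecardpy.

```python
def is_monotonic(woe_values):
    """True if the sequence never changes direction (ties are allowed)."""
    diffs = [b - a for a, b in zip(woe_values, woe_values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

print(is_monotonic([-0.8, -0.1, 0.3, 0.9]))  # steadily rising WOE
print(is_monotonic([0.5, -0.2, 0.4]))        # dips and recovers
```

A variable that fails the check is not necessarily unusable (a U-shaped age effect can be real), but it is a candidate for manual re-binning.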

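# Bin adjustment
# scorecardpy supports both automatic and user-defined bins. Here the bins are set manually, based on business experience.

# Manual binning
break_adj = {
    'age_in_years':[26,35,45],
    'credit_amount':[750,3000,5500]
}
bins_adj = sc.woebin(dt_s,y="creditability",breaks_list=break_adj) # bins after adjustment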
bins_adj_df = pd.concat(bins_adj).reset_index().drop(columns="level_0")
bins_adj_df[bins_adj_df.variable.isin(["age_in_years",'credit_amount'])]

sc.woebin_plot(bins_adj["age_in_years"])
sc.woebin_plot(bins_adj['credit_amount'])
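
To see how a break list such as [26,35,45] turns values into bins, here is a small sketch. It assumes left-closed, right-open intervals, so the four bins are [-inf,26), [26,35), [35,45) and [45,inf); bin_index is a hypothetical helper and not a scorecardpy function.

```python
from bisect import bisect_right

def bin_index(value, breaks):
    """Index of the left-closed, right-open interval that contains value."""
    return bisect_right(breaks, value)

age_breaks = [26, 35, 45]
for age in (19, 26, 34, 45, 70):
    print(age, "-> bin", bin_index(age, age_breaks))
```

A value equal to a break point falls into the bin that starts at that break, which is worth remembering when choosing round numbers such as ages or amounts as cut points.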

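# Convert the train and test sets to WOE values using the adjusted bins
train_woe = sc.woebin_ply(train, bins_adj)
test_woe = sc.woebin_ply(test, bins_adj)

# Logistic regression, which is widely used in financial modeling
from sklearn.linear_model import LogisticRegression

y_train = train_woe.loc[:,"creditability"]
X_train = train_woe.loc[:,train_woe.columns!="creditability"]
y_test = test_woe.loc[:,"creditability"]
X_test = test_woe.loc[:,test_woe.columns!="creditability"]

lr = LogisticRegression(penalty='l1',C=0.9,solver='saga',n_jobs=-1)
lr.fit(X_train,y_train)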

LogisticRegression(C=0.9, n_jobs=-1, penalty='l1', solver='saga')

[[0.7477907  0.77429024 0.03820958 0.33340333 0.42839855 0.33196916
  1.18999712 0.51742719 0.62251866 0.6756973  0.98183774]]
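
The fitted coefficients live on the log-odds scale. A scorecard rescales them into points with three parameters: a base score points0 at reference odds odds0, and pdo, the number of points that corresponds to doubling the odds. The sketch below shows the usual scaling. The default values (600, 1/19, 50) are assumptions picked to mirror common scorecard settings, and score_from_log_odds is a hypothetical helper.

```python
import math

def score_from_log_odds(log_odds, points0=600, odds0=1/19, pdo=50):
    """Map the model's log-odds of being bad to a score; higher score = lower risk."""
    factor = pdo / math.log(2)
    offset = points0 + factor * math.log(odds0)
    return offset - factor * log_odds

at_reference = score_from_log_odds(math.log(1 / 19))  # odds equal to odds0
at_doubled = score_from_log_odds(math.log(2 / 19))    # odds of default doubled
print(round(at_reference, 2), round(at_doubled, 2))
```

At the reference odds the score equals points0, and doubling the odds of default lowers it by pdo points. The log-odds for one applicant is the model intercept plus the sum of each coefficient times that applicant's WOE value.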

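# The higher the VIF, the stronger the effect of multicollinearity
# Rule of thumb used in financial risk modeling: VIF > 4 is taken as a sign of multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor

def checkVIF(df):
    name = df.columns
    x = df.to_numpy(dtype=float)  # plain ndarray; np.matrix is deprecated
    VIF_list = [variance_inflation_factor(x,i) for i in range(x.shape[1])]
    VIF = pd.DataFrame({'feature':name,"VIF":VIF_list})
    max_VIF = max(VIF_list)
    print(max_VIF)
    return VIF
checkVIF(X_train)  # VIF of the training features (target column excluded)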
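
For intuition about what the number means: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it directly with NumPy on synthetic data. It is an illustration only, with a hypothetical helper vif that adds an intercept to each regression; statsmodels regresses on the columns exactly as given, so its numbers can differ when the input has no constant column.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns."""
    X = np.asarray(X, dtype=float)
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r_squared = 1.0 - residual.var() / target.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic features: x3 is almost a copy of x1, x2 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly identical columns get a VIF far above the threshold of 4, while the independent column stays close to 1.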

train_pred = lr.predict_proba(X_train)[:,1]
test_pred =  lr.predict_proba(X_test)[:,1]

train_perf = sc.perf_eva(y_train,train_pred,title="train")
test_perf = sc.perf_eva(y_test,test_pred,title="test")
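
sc.perf_eva reports KS and AUC for each data set. As a reminder of what the two numbers measure, the sketch below computes both in plain Python from labels (1 = bad) and predicted probabilities: KS is the largest gap between the cumulative share of bads and the cumulative share of goods as the cutoff moves down the ranking, and AUC is the share of bad/good pairs in which the bad record gets the higher probability. ks_auc and the toy data are hypothetical.

```python
def ks_auc(labels, scores):
    """Return (KS, AUC) for binary labels (1 = bad) and predicted probabilities."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    ks = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        # Process all records that share the same score together (ties)
        j, d_tp, d_fp = i, 0, 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                d_tp += 1
            else:
                d_fp += 1
            j += 1
        area += d_fp * (tp + d_tp / 2)  # trapezoid under the ROC curve
        tp += d_tp
        fp += d_fp
        ks = max(ks, abs(tp / n_pos - fp / n_neg))
        i = j
    return ks, area / (n_pos * n_neg)

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ks, auc = ks_auc(labels, scores)
print(round(ks, 4), round(auc, 4))
```

A large gap between the train and test values of either metric is a sign of overfitting, which is why both sets are evaluated.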

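# Build the scorecard and score both sets; perf_psi then compares the two score distributions
card = sc.scorecard(bins_adj, lr, X_train.columns)
train_score = sc.scorecard_ply(train, card, print_step=0)
test_score = sc.scorecard_ply(test, card, print_step=0)
sc.perf_psi(
    score = {'train':train_score,'test':test_score},
    label = {'train':y_train,'test':y_test}
)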
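
The population stability index (PSI) compares how records are spread over score bands in two samples. A minimal sketch of the formula, assuming the scores have already been counted into the same bands for both samples; psi and the counts are hypothetical, and empty bands would need special handling that is left out here.

```python
import math

def psi(expected_counts, actual_counts):
    """PSI between two samples counted over the same score bands (no empty bands)."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share, a_share = e / e_total, a / a_total
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

train_bands = [250, 250, 250, 250]   # records per score band in training
same_shape = [25, 25, 25, 25]        # smaller sample, same distribution
shifted = [100, 200, 300, 400]       # distribution has drifted upward
print(psi(train_bands, same_shape), round(psi(train_bands, shifted), 4))
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and a PSI above 0.25 as a significant shift, though the exact cutoffs vary between teams.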


