Project: Prosper, a US Online Lending Platform

2022-01-14 03:27 Category: Funds to Account

1. Data source

Prosper Loan Data | Kaggle

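1.1 Project overview

Prosper is a P2P online lending platform in the United States. Founded in 2005, it now has more than 980,000 members and over USD 200 million in loans. It is the pioneer of the US online lending industry and the second-largest platform of its kind.

Prosper Loan Data is a sample project: Joshua Schnessl uploaded data obtained from the Udacity Data Analyst Nanodegree to Kaggle for anyone interested to analyze.

Questions to address

1. Which users are more likely to default

2. Build a model to predict whether a customer will default

1.2 Data overview

The dataset covers Prosper loans from 2005 to 2014: 113,937 records and 81 variables in total.

Variable descriptions

2. Data

2.1 Inspecting the data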

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid',palette='tab10')
%matplotlib inline

Load the data

data=pd.read_csv('.\\prosperLoanData.csv')
data.info()
data.describe()
data.shape

(113937, 81)

data.dtypes.value_counts()

float64    50
object     17
int64      11
bool        3
dtype: int64

data.head()

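The output shows that the data needs the following processing:

1. Several variables take many different values and need to be converted

2. Some variables have missing values, for example CreditGrade

2.2 Data transformation

1 LoanStatus

The platform divides loan status into 12 types:

Cancelled, Chargedoff (charged off, investors take a loss), Completed (repaid normally, no loss to investors), Current (repayment in progress), Defaulted (bad debt, investors take a loss), FinalPaymentInProgress (final payment in progress, no loss to investors), and Past Due (late payment, no loss to investors; the data splits this status into six buckets by days overdue, which brings the total to 12). Only 5 loans are Cancelled, so they are dropped.

Loans are first split by whether they are still within the normal repayment period or already closed, and closed loans are then split by whether investors took a loss. This gives three groups: Current (repayment in progress), Defaulted (Defaulted and Chargedoff), and Completed (Completed, FinalPaymentInProgress, and Past Due).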

def loan_status(s): 
    if s=='Chargedoff': 
        a='Defaulted' 
    elif s=='Defaulted':
        a = 'Defaulted' 
    elif s=='Cancelled':
        a='Cancelled' 
    elif s == 'Current': 
        a = 'Current' 
    else: 
        a='Completed' 
    return a

#Add the converted values as a new column, Status
data['Status']=data['LoanStatus'].apply(loan_status)
#Drop the Cancelled group, which has only 5 rows
data=data[data['Status']!='Cancelled']
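
The same grouping can be written as a lookup table instead of an if/elif chain. This is only a sketch on a hand-made Series: the sample values and the names `GROUPS` and `group_status` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample of raw LoanStatus values (not taken from the dataset)
raw = pd.Series(['Current', 'Chargedoff', 'Defaulted', 'Completed',
                 'FinalPaymentInProgress', 'Past Due (1-15 days)', 'Cancelled'])

# Explicit mapping for the statuses that do not fall into 'Completed'
GROUPS = {'Chargedoff': 'Defaulted', 'Defaulted': 'Defaulted',
          'Current': 'Current', 'Cancelled': 'Cancelled'}

def group_status(s):
    # Anything not listed (Completed, FinalPaymentInProgress, Past Due ...)
    # is treated as 'Completed', mirroring the else branch of loan_status
    return GROUPS.get(s, 'Completed')

status = raw.map(group_status)
status = status[status != 'Cancelled']   # drop the cancelled loans
print(status.value_counts())
```

Keeping the mapping in one dictionary makes it easier to see at a glance which raw statuses count as a loss to investors.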

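2 BankcardUtilization

This is the ratio of the borrower's total credit card balance to the total credit card limit at the time the credit profile was pulled. It is split into 5 groups (Mild Use, Medium Use, Heavy Use, Super Use, No Use). Borrowers with a missing value (NA) are labeled No Use. The intent is for 0 to count as No Use as well, although the function below, as written, places a ratio of exactly 0 in Mild Use.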

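def bank_card_use(s,onefourth=0.31,twofourth=0.6):
    #Default cut points; the caller can pass the 25% and 50% quantiles instead
    if s<=onefourth:
        b='Mild Use'    #a ratio of exactly 0 also lands here
    elif s<=twofourth:
        b='Medium Use'
    elif s<=1:
        b='Heavy Use'
    elif s>1:
        b='Super Use'
    else:
        b='No Use'      #NaN fails every comparison above
    return b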

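#Add the converted values as a new column, BankCardUse
oneFourth=data['BankcardUtilization'].quantile(0.25)
twoForth=data['BankcardUtilization'].quantile(0.5)
#Pass the quantiles explicitly; otherwise the function falls back to its defaults
data['BankCardUse']=data['BankcardUtilization'].apply(bank_card_use,args=(oneFourth,twoForth))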
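
For larger data the same binning can be done without a row-by-row apply. The sketch below uses `np.select` on made-up ratios; the names `util`, `q1`, `q2` and `bank_card_use_vec` are assumptions for illustration, and the cut points are simply the default values from the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical utilization ratios, including a missing value
util = pd.Series([0.0, 0.2, 0.5, 0.9, 1.3, np.nan])

# Cut points passed in explicitly; 0.31 and 0.6 are the defaults used above
q1, q2 = 0.31, 0.6

conditions = [util.isna(), util <= q1, util <= q2, util <= 1, util > 1]
labels = ['No Use', 'Mild Use', 'Medium Use', 'Heavy Use', 'Super Use']

# np.select takes the first condition that is True for each element
bank_card_use_vec = pd.Series(np.select(conditions, labels, default='No Use'),
                              index=util.index)
print(bank_card_use_vec.tolist())
```

Because the missing-value check comes first in `conditions`, the handling of NA is explicit rather than a side effect of failed comparisons.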

3 LoanOriginationDate

Prosper operated under two different models before and after July 1, 2009, so a new column splits the data into two periods.

def date_phase(s):
    if s>='2009-07-01':
        c='After 200907'
    else:
        c='Before 200907'
    return c

#Add the converted values as a new column, DatePhase
data['DatePhase']=data['LoanOriginationDate'].apply(date_phase)
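
The function above compares date strings, which works only while every value has the same year-month-day text layout. A minimal sketch of the same split on parsed dates follows; the sample timestamps and the names `dates`, `cutoff` and `phase` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical origination timestamps stored as text
dates = pd.Series(['2008-03-15 00:00:00', '2009-07-01 00:00:00',
                   '2013-11-20 00:00:00'])

cutoff = pd.Timestamp('2009-07-01')
parsed = pd.to_datetime(dates)          # real datetimes, not strings

# Loans originated on the cutoff day itself count as 'After 200907'
phase = pd.Series(np.where(parsed >= cutoff, 'After 200907', 'Before 200907'),
                  index=dates.index)
print(phase.tolist())
```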

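4 Credit score conversion

The data has two credit score fields, CreditScoreRangeUpper and CreditScoreRangeLower, for the upper and lower bounds. Their midpoint is used as the credit score.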

data['CreditScore']=((data.CreditScoreRangeUpper+data.CreditScoreRangeLower)
                     /2).round(0)

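5 TotalProsperLoans

TotalProsperLoans is the number of loans the borrower has taken on Prosper. Borrowers are split into new customers (0 or NA) and returning customers.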

def customer_clarify(s):
    if s>0:
        c='Previous Borrower'
    else:
        c='New Borrower'
    return c
#Add a customer classification column
data['CustomerClarify']=data['TotalProsperLoans'].apply(customer_clarify)

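2.3 Exploratory analysis

2.3.1 The higher the income, the lower the default rate

Compute the default rate from the loan status counts in each annual income range, and plot it to see the relationship between income and default directly.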

incomeRange=data.groupby(['IncomeRange','Status'])['Status'].count().unstack()
index=['Not displayed', 'Not employed', '$0', '$1-24,999', '$25,000-49,999','$50,000-74,999', 
       '$75,000-99,999', '$100,000+']
incomeRange=incomeRange.reindex(index)
incomeRange['defaultRate']=incomeRange['Defaulted']/(incomeRange['Defaulted']+incomeRange['Completed'])

fig,ax1=plt.subplots(1,1,figsize=(18,10))
ax2=ax1.twinx()
incomeRange[['Defaulted','Completed']].plot.bar(ax=ax1)
incomeRange[['defaultRate']].plot(marker='o',ax=ax2)
plt.setp(ax1.get_xticklabels(),rotation=0,fontsize=14)
ax1.legend(fontsize='large')
ax2.legend(loc='center right',fontsize='large')
ax2.grid(False)
plt.title('DefaultRate by IncomeRange',fontsize=16)

The plot shows that the default rate gradually falls as annual income rises.
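
The default rate used throughout this section is Defaulted / (Defaulted + Completed), with Current loans left out. As a small worked sketch on invented records (the `toy` frame and its values are assumptions, not real Prosper data), `pd.crosstab` gives the same counts as the groupby/unstack pattern:

```python
import pandas as pd

# Hypothetical toy records; the real data has many more rows and categories
toy = pd.DataFrame({
    'IncomeRange': ['$1-24,999'] * 4 + ['$100,000+'] * 4,
    'Status': ['Defaulted', 'Defaulted', 'Completed', 'Current',
               'Defaulted', 'Completed', 'Completed', 'Completed'],
})

counts = pd.crosstab(toy['IncomeRange'], toy['Status'])
# Current loans are excluded: their outcome is not known yet
closed = counts['Defaulted'] + counts['Completed']
default_rate = counts['Defaulted'] / closed
print(default_rate)
```

In this toy example the lower income range has 2 defaults out of 3 closed loans, and the higher one has 1 out of 4.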

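2.3.2 Borrowers with low debt levels default less often than those with high debt levels

Common sense suggests that borrowers with a low debt-to-income ratio (DebtToIncomeRatio) are better able to repay, so they should be less likely to default than borrowers with a high ratio.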

fig,ax=plt.subplots(1,1,figsize=(18,10))
data[data['Status']=='Completed']['DebtToIncomeRatio'].hist(bins=1000,
                                                            color='g',label='Completed')
data[data['Status']=='Defaulted']['DebtToIncomeRatio'].hist(bins=1000,
                                                           color='b',label='Defaulted')
ax.set_xlim([0,1])
plt.legend()
plt.title('DefaultRate by DebtToIncomeRatio',fontsize=16)
plt.setp(ax.get_xticklabels(),rotation=0,fontsize=14)
ax.legend(fontsize='large')

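As the figure shows, among borrowers with a DebtToIncomeRatio below 0.6, defaulted loans are fewer than non-defaulted loans. The figure also shows that most borrowers have a ratio below 0.25, which suggests the platform's overall risk is under control.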
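
Overlaid histograms show counts, so a per-bin default rate can make the comparison easier to read. The following is a rough sketch with made-up values; the two bins, their labels and the names `toy`, `bins` and `rate_by_bin` are assumptions chosen only to illustrate the 0.6 threshold mentioned above.

```python
import pandas as pd

# Hypothetical closed loans (no Current rows)
toy = pd.DataFrame({
    'DebtToIncomeRatio': [0.10, 0.20, 0.30, 0.50, 0.70, 0.80, 0.90, 0.15],
    'Status': ['Completed', 'Completed', 'Defaulted', 'Completed',
               'Defaulted', 'Defaulted', 'Completed', 'Completed'],
})

# Two illustrative bins split at 0.6; intervals are (0, 0.6] and (0.6, 10]
bins = pd.cut(toy['DebtToIncomeRatio'], bins=[0, 0.6, 10],
              labels=['<=0.6', '>0.6'])
is_default = toy['Status'].eq('Defaulted')
# The mean of a boolean column is the share of True values
rate_by_bin = is_default.groupby(bins, observed=True).mean()
print(rate_by_bin)
```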

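2.3.3 Borrowers with very high card utilization are more likely to default

Group by BankCardUse and Status to see how credit card utilization relates to default.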

bankCardUse=data.groupby(['BankCardUse','Status'])['Status'].count().unstack()
bankCardUse=bankCardUse.reindex(['Mild Use','Medium Use','Heavy Use','Super Use','No Use'],axis=0)
bankCardUse['defaultedRate']=bankCardUse['Defaulted']/(bankCardUse['Defaulted']+bankCardUse['Completed'])


fig,ax1=plt.subplots(1,1,figsize=(18,10))
ax2=ax1.twinx()
bankCardUse[['Defaulted','Completed']].plot.bar(ax=ax1)
bankCardUse['defaultedRate'].plot(marker='o',ax=ax2)

plt.setp(ax1.get_xticklabels(),rotation=0,fontsize=14)
ax1.legend(fontsize='large')
ax2.legend(loc='center right',fontsize='large')
ax2.grid(False)
plt.title('DefaultRate by BankCardUse',fontsize=16)

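The plot shows that borrowers with very high utilization are more likely to default. The default rate for Super Use and No Use is above 40%, so risk controls should be tightened for these two groups.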

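2.3.4 Borrowers with low consumer credit scores are more likely to default

A higher credit score (CreditScore) indicates that the borrower has behaved more honestly and fairly in past financial dealings, so default should be less likely.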

CreditScore=data.groupby(['CreditScore','Status'])['Status'].count().unstack()

fig,ax=plt.subplots(1,1,figsize=(18,10))
CreditScore['Completed'].plot(label='Completed')
CreditScore['Defaulted'].plot(label='Defaulted')
ax.set_xlim([400,1000])

plt.setp(ax.get_xticklabels(),rotation=0,fontsize=14)
ax.legend(fontsize='large')
plt.title('DefaultRate by CreditScore',fontsize=16)

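The plot shows:

1. For loans whose borrower has a credit score below 560, defaults outnumber non-defaults

2. Most loans go to borrowers with a credit score above 600, which also suggests that applicants with low scores find it hard to obtain a loan

2.3.5 Borrowers with a high credit grade (before 2009.07.01) are less likely to default

Group by CreditGrade and Status to get the distribution of credit grade and default before 2009.07.01.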

CreditGrade=data.groupby(['CreditGrade','Status'])['Status'].count().unstack()
CreditGrade=CreditGrade.reindex(['NC','HR','E','D', 'C','B', 'A', 'AA'])
CreditGrade['defaultedRate']=CreditGrade['Defaulted']/(CreditGrade['Defaulted']+CreditGrade['Completed'])

fig,ax1=plt.subplots(1,1,figsize=(18,10))
ax2=ax1.twinx()
CreditGrade[['Defaulted','Completed']].plot.bar(ax=ax1)
CreditGrade['defaultedRate'].plot(marker='o',ax=ax2)

plt.setp(ax1.get_xticklabels(),rotation=0,fontsize=14)
ax1.legend(fontsize='large')
ax2.legend(loc='center right',fontsize='large')
ax2.grid(False)
plt.title('DefaultRate by CreditGrade')

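The plot shows:

1. The higher the credit grade, the lower the default rate

2. Most borrowers on the platform have a credit grade of D or above

2.3.6 Borrowers with a high credit rating (after 2009.07.01) are less likely to default

ProsperRating (Alpha) is the credit rating variable used after 2009.07.01. Group by ProsperRating (Alpha) and Status to get the distribution of credit rating and default after 2009.07.01.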

ProsperRating=data.groupby(['ProsperRating (Alpha)','Status'])['Status'].count().unstack()
ProsperRating=ProsperRating.reindex(['HR','E','D', 'C','B', 'A', 'AA'])
ProsperRating['defaultedRate']=ProsperRating['Defaulted']/(ProsperRating['Defaulted']+ProsperRating['Completed'])

fig,ax1=plt.subplots(1,1,figsize=(18,10))
ax2=ax1.twinx()
ProsperRating[['Defaulted','Completed']].plot.bar(ax=ax1)
ProsperRating['defaultedRate'].plot(marker='o',ax=ax2)

ax1.legend(fontsize='large')
ax2.legend(loc='center right',fontsize='large')
plt.setp(ax1.get_xticklabels(),rotation=0,fontsize=14)
ax2.grid(False)
plt.title('DefaultedRate by ProsperRating')

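The plot shows:

1. The higher the credit rating, the lower the default rate

2. Compared with the credit grades used before 2009.07.01, the NC grade has been removed and the overall default rate is lower, which suggests the platform made an effective adjustment to its risk-control model

2.3.7 Borrowers with delinquencies in the past seven years are more likely to default

The number of delinquencies in the past seven years (DelinquenciesLast7Years) reflects a person's credit record over that period. Borrowers with one or more delinquencies are expected to be more likely to default on a loan.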

DelinquenciesLast7Years=data.groupby(['DelinquenciesLast7Years','Status'])['Status'].count().unstack()
data7=DelinquenciesLast7Years.reset_index().drop(['Current'],axis=1)
Delinquencies0=data7[data7['DelinquenciesLast7Years']==0.0].set_index('DelinquenciesLast7Years')
DelinquenciesNot0=data7[data7['DelinquenciesLast7Years']!=0.0].set_index('DelinquenciesLast7Years').sum()
DelinquenciesNot0=pd.DataFrame(DelinquenciesNot0.values,index=['Completed','Defaulted'],columns=['Delinquencies'])
DelinquenciesNot0=DelinquenciesNot0.T
Delinquencies=pd.concat([Delinquencies0,DelinquenciesNot0],axis=0)
Delinquencies.rename(index={0.0:'NoDelinquencies'},inplace=True)
Delinquencies['DefauledRate']=Delinquencies['Defaulted']/(Delinquencies['Defaulted']+Delinquencies['Completed'])

fig,axes=plt.subplots(2,1,figsize=(18,10))
ax2=axes[1].twinx()
DelinquenciesLast7Years[['Defaulted','Completed']].plot(ax=axes[0])
Delinquencies[['Defaulted','Completed']].plot.bar(ax=axes[1])
Delinquencies[['DefauledRate']].plot(ax=ax2)

plt.setp(axes[1].get_xticklabels(),rotation=0,fontsize=14)
ax2.grid(False)
axes[0].set_title('Delinquencies Last 7 Years')
axes[1].set_title('NoDelinquencies vs Delinquencies')

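The plot shows:

1. Borrowers with a delinquency record in the past seven years are more likely to default than those without one

2. Most borrowers on the platform have DelinquenciesLast7Years equal to 0, which suggests the platform's risk is under control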
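
The comparison between borrowers with and without past delinquencies can also be built from a single flag column, which avoids the manual reshaping. This is a sketch on invented rows; the `toy` frame, the `History` column and the `summary` table are assumptions for illustration.

```python
import pandas as pd

# Hypothetical loans; one Current row is included to show it being excluded
toy = pd.DataFrame({
    'DelinquenciesLast7Years': [0, 0, 0, 0, 2, 5, 1, 0],
    'Status': ['Completed', 'Completed', 'Defaulted', 'Completed',
               'Defaulted', 'Defaulted', 'Completed', 'Current'],
})

closed = toy[toy['Status'] != 'Current'].copy()
# Collapse the count into two groups: zero vs. one or more delinquencies
closed['History'] = (closed['DelinquenciesLast7Years'] > 0).map(
    {True: 'Delinquencies', False: 'NoDelinquencies'})

summary = pd.crosstab(closed['History'], closed['Status'])
summary['DefaultedRate'] = summary['Defaulted'] / (
    summary['Defaulted'] + summary['Completed'])
print(summary)
```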

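2.3.8 Borrowers who have held their employment status longer are less likely to default

EmploymentStatusDuration measures how stable a person's work situation is: the longer the current employment status has lasted, the lower the default probability should be.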

EmploymentStatus=data.groupby(['EmploymentStatusDuration','Status'])['Status'].count().unstack()
EmploymentStatus['DefaultedRate']=EmploymentStatus['Defaulted']/(EmploymentStatus['Defaulted']+EmploymentStatus['Completed'])

fig,axes=plt.subplots(2,1,figsize=(18,10),sharex=True)
EmploymentStatus[['Defaulted','Completed']].plot(ax=axes[0])
EmploymentStatus[['DefaultedRate']].plot(color='k',marker='o',ax=axes[1])
axes[0].set_xlim([-5,120])
axes[1].set_ylim([0.2,0.4])
axes[1].grid(False)

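The plot shows that the default probability gradually decreases as EmploymentStatusDuration increases.

2.4 Handling missing values

First, look at the missing-data picture across all variables.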

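#Per column: whether any value is missing, and the non-null count
missing=pd.concat([data.isnull().any(),data.count()],axis=1)
missing.rename(columns={0:'has_missing',1:'count'},inplace=True)
#The largest non-null count stands in for the total number of rows
Max=missing['count'].max()
missing['missing_count']=Max-missing['count']
missing['missing_rate']=missing['missing_count']/Max
miss=missing[missing['count']<Max]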

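The table above shows that CreditGrade and ProsperRating (Alpha), the two credit rating variables, have many missing values. This is because the platform used different rating methods before and after July 2009.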
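
A missing-value table like this can also be produced more directly from `isnull()`. The sketch below runs on a tiny invented frame; the `toy` data and the `report` column names are assumptions, not the notebook's own variables.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two of its three columns
toy = pd.DataFrame({
    'CreditGrade': ['A', None, None, 'C'],
    'BorrowerRate': [0.10, 0.20, 0.15, 0.30],
    'DebtToIncomeRatio': [0.2, np.nan, 0.4, 0.1],
})

missing_count = toy.isnull().sum()        # NaN/None per column
missing_rate = toy.isnull().mean()        # share of rows missing per column
report = pd.DataFrame({'missing_count': missing_count,
                       'missing_rate': missing_rate})
# Keep only columns that actually have gaps, worst first
report = report[report['missing_count'] > 0].sort_values('missing_rate',
                                                         ascending=False)
print(report)
```

Dividing by the row count via `mean()` does not depend on at least one column being complete, which the max-of-counts approach quietly assumes.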

2.4.1 Missing values in CreditScore

CreditScore is missing in 590 records, about 0.5% of the data. Since the share is small, fill with the median.

#Assign the result back; fillna(inplace=True) on a column selection may not update data
data['CreditScore']=data['CreditScore'].fillna(data['CreditScore'].median())

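2.4.2 Missing values in BorrowerState

BorrowerState is missing in 5,512 records, a missing rate of 4.84%. That share is fairly large, so missing values are treated as their own category, set for now to "NOTA".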

data['BorrowerState']=data['BorrowerState'].fillna('NOTA')

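2.4.3 Missing values in DebtToIncomeRatio

DebtToIncomeRatio is missing in 8,554 records, a missing rate of 7.51%, which is a large share. Based on the distribution of the ratio, each missing DebtToIncomeRatio is assigned a random value between 0.01 and 0.4.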

import random
def rand_missing(s):
    if s>=0:
        a=s
    else:
        a=random.uniform(0.01,0.4)
    return a

data['DebtToIncomeRatio']=data['DebtToIncomeRatio'].apply(rand_missing)
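
Random imputation gives different values on every run unless the generator is seeded. A minimal reproducible sketch follows; the seed, the sample Series and the names `ratio`, `mask` and `filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratios with two missing entries
ratio = pd.Series([0.25, np.nan, 0.10, np.nan, 0.55])

rng = np.random.default_rng(0)            # fixed seed so reruns match
mask = ratio.isna()
filled = ratio.copy()
# Draw one value per missing entry from the same 0.01-0.4 range used above
filled[mask] = rng.uniform(0.01, 0.4, size=mask.sum())
print(filled)
```

Only the missing positions are touched, so the observed ratios stay exactly as they were.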

2.4.4 Missing values in DelinquenciesLast7Years

DelinquenciesLast7Years is missing in 987 records. These loans have a relatively high default rate on the platform, so all missing values are set to 1.

data['DelinquenciesLast7Years']=data['DelinquenciesLast7Years'].fillna(1)

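2.4.5 Missing values in EmploymentStatusDuration

EmploymentStatusDuration is missing in 7,621 records, a large share, and these records have a high default rate. On the reasoning that more stable employment means stronger repayment ability, the missing values are set to 48.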

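data['EmploymentStatusDuration']=data['EmploymentStatusDuration'].fillna(48)

2.4.6 Missing values in InquiriesLast6Months

InquiriesLast6Months is missing in 696 records, a small share, and their default rate is close to that of the data as a whole, so the missing values are set to 2.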

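data['InquiriesLast6Months']=data['InquiriesLast6Months'].fillna(2)

2.4.7 Missing values in ProsperRating (Alpha)

There are about 30,000 records after 2009. Filtering them for missing ProsperRating (Alpha) gives about 144 records, a small share, so these rows are deleted directly.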

ProsperRatingM=data[(data['ProsperRating (Alpha)'].isnull())&(data['DatePhase']=='After 200907')]
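data.drop(ProsperRatingM.index,axis=0,inplace=True)

3. Modeling

3.1 Converting string variables to numeric variables

Some categorical variables in the data are strings; replace them with numbers.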

def switch_data(df):
    #Numeric codes for each categorical column (same codes as before)
    mappings={
        'Status':{'Defaulted':0,'Completed':1,'Current':2},
        'IsBorrowerHomeowner':{False:0,True:1},
        'CreditGrade':{'NC':0,'HR':1,'E':2,'D':3,'C':4,'B':5,'A':6,'AA':7},
        'ProsperRating (Alpha)':{'HR':0,'E':1,'D':2,'C':3,'B':4,'A':5,'AA':6},
        'IncomeRange':{'Not displayed':0,'Not employed':1,'$0':2,
                       '$1-24,999':3,'$25,000-49,999':4,'$50,000-74,999':5,
                       '$75,000-99,999':6,'$100,000+':7},
        'BankCardUse':{'No Use':0,'Mild Use':1,'Medium Use':2,
                       'Heavy Use':3,'Super Use':4},
        'CustomerClarify':{'New Borrower':0,'Previous Borrower':1},
    }
    #Work on the frame that was passed in rather than the global data
    for col,mapping in mappings.items():
        #map() yields numeric columns; missing values stay NaN
        df[col]=df[col].map(mapping)
    return df

#Apply the conversion
data=switch_data(data)
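
For ordered ratings, pandas can derive the integer codes from an explicit category order, which keeps the ordering in one place. The sketch below is an alternative on a made-up Series, not the method used above; the names `grades`, `order` and `codes` are assumptions.

```python
import pandas as pd

# Hypothetical grades, including a missing value
grades = pd.Series(['AA', 'HR', None, 'C', 'NC'])
order = ['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']   # worst to best

cat = pd.Categorical(grades, categories=order, ordered=True)
# Position in `order` becomes the code; missing values become -1
codes = pd.Series(cat.codes, index=grades.index)
print(codes.tolist())
```

Note that the -1 code for missing values differs from the mapping above, where missing grades remain NaN, so it would need handling before modeling.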

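3.2 Modeling (before 2009.07.01)

3.2.1 Building the model

To evaluate the classifier, the data is split into a training set and a test set. To learn how important each variable is for default, a random forest is a reasonable choice.

from sklearn.model_selection import train_test_split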
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve,auc

data=data[data['Status']!=2]
before2009=data[data['DatePhase']=='Before 200907']
Y=before2009['Status']
X=before2009[['CreditGrade','CustomerClarify','IncomeRange',
              'DebtToIncomeRatio','DelinquenciesLast7Years','BorrowerRate',
              'IsBorrowerHomeowner','ListingCategory (numeric)',
              'EmploymentStatusDuration','InquiriesLast6Months',
              'CreditScore','BankCardUse']]
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=0)
rfr=RandomForestClassifier()
rfr.fit(X_train,Y_train)
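
A fitted random forest exposes per-variable importances through its `feature_importances_` attribute. The sketch below shows how they could be read, using synthetic data rather than the Prosper columns; `X_demo`, `y_demo`, `model` and the two feature names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in features; only 'signal' is related to the label
X_demo = pd.DataFrame({'signal': rng.normal(size=n),
                       'noise': rng.normal(size=n)})
y_demo = (X_demo['signal'] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Importances sum to 1; a larger value means the feature reduced impurity more
importance = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importance.sort_values(ascending=False))
```

Applied to the model above, the same pattern would be `pd.Series(rfr.feature_importances_, index=X.columns)`.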

3.2.2 Evaluating the model

result=rfr.predict(X_test)
def check(r,pr):
    count=len(pr)
    sum=0
    for i in range(count):
        if r[i]==pr[i]:
            sum+=1
    percent=round(sum/count,4)
    return percent

percent=check(list(Y_test.values),list(result))
print('Before 2009.07.01: model accuracy:',percent)

Before 2009.07.01: model accuracy: 0.6552
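
An accuracy figure is easier to judge next to the share of the most common class, since a model that always predicts that class already reaches that score. The sketch below uses invented labels; `y_true`, `y_pred`, `baseline` and `confusion` are assumptions for illustration, not results from the model above.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: 1 = Completed, 0 = Defaulted (same coding as Status)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1])

accuracy = (y_true == y_pred).mean()
# Accuracy of always predicting the most frequent class
baseline = max((y_true == 1).mean(), (y_true == 0).mean())
# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(pd.Series(y_true, name='actual'),
                        pd.Series(y_pred, name='predicted'))
print(accuracy, baseline)
print(confusion)
```

The confusion table also separates missed defaults from false alarms, which a single accuracy number hides.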

3.3 Modeling (after 2009.07.01)

3.3.1 Building the model

afterData=data[data['Status'] != 2]
after2009=afterData[afterData['DatePhase']=='After 200907']
Y_a=after2009['Status']
X_a=after2009[['ProsperRating (Alpha)','CustomerClarify','IncomeRange','DebtToIncomeRatio','DelinquenciesLast7Years','BorrowerRate','IsBorrowerHomeowner','ListingCategory (numeric)','EmploymentStatusDuration','InquiriesLast6Months','CreditScore','BankCardUse']]
X_train_a, X_test_a, Y_train_a, Y_test_a = train_test_split(X_a, Y_a, test_size=0.3, random_state=0)
rfr_a=RandomForestClassifier()
rfr_a.fit(X_train_a,Y_train_a)

3.3.2 Evaluating the model

result_a=rfr_a.predict(X_test_a)
def check_a(r,pr):
    count=len(pr)
    sum=0
    for i in range(count):
        if r[i]==pr[i]:
            sum+=1
    percent=round(sum/count,4)
    return percent

percent_a=check_a(list(Y_test_a.values),list(result_a))
print('After 2009.07.01: model accuracy:',percent_a)

After 2009.07.01: model accuracy: 0.7419

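4. Summary

After 2009.07.01 Prosper changed its model, and the model accuracy rose from 65.52% to 74.19%. This suggests that the adjustment helps predict default and limit risk.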
