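The 【阿旭机器学习实战】 series introduces machine learning algorithms and models through hands-on case studies.

The goal of this article is to use decision tree and random forest models to predict how likely an employee is to leave, and to help the HR department understand why employees leave.

Contents
1. Getting the data
2. Data preprocessing
3. Analyzing the data
3.1 Correlation analysis
3.2 Running a T-Test
4. Building the prediction models: Decision Tree V.S. Random Forest
5. Model evaluation
5.1 ROC curve
5.2 Analyzing feature importance with decision trees

1. Getting the data
The dataset, source code, and project notes for this article are available from the WeChat official account 阿旭算法与机器学习 (reply "ML35").

# Import libraries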
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

# Read the data into a pandas DataFrame df
df = pd.read_csv('HR_comma_sep.csv', index_col=None)

2. Data preprocessing
# Check whether any column has missing values
df.isnull().any()

satisfaction_level False
last_evaluation False
number_project False
average_montly_hours False
time_spend_company False
Work_accident False
left False
promotion_last_5years False
sales False
salary False
dtype: bool

# Sample rows
df.head()

   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  sales  salary
0                0.38             0.53               2                   157                   3              0     1                      0  sales     low
1                0.80             0.86               5                   262                   6              0     1                      0  sales  medium
2                0.11             0.88               7                   272                   4              0     1                      0  sales  medium
3                0.72             0.87               5                   223                   5              0     1                      0  sales     low
4                0.37             0.52               2                   159                   3              0     1                      0  sales     low

Note: the left column (renamed to turnover in the next step) is the label: 1 means the employee left, 0 means the employee stayed. All other columns are features.

# Rename the columns
df = df.rename(columns={'satisfaction_level': 'satisfaction',
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales': 'department',
                        'left': 'turnover'})

# Move the prediction label 'turnover' to the first column
front = df['turnover']
df.drop(labels=['turnover'], axis=1, inplace=True)
df.insert(0, 'turnover', front)
df.head()

   turnover  satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion  department  salary
0         1          0.38        0.53             2                  157               3             0          0       sales     low
1         1          0.80        0.86             5                  262               6             0          0       sales  medium
2         1          0.11        0.88             7                  272               4             0          0       sales  medium
3         1          0.72        0.87             5                  223               5             0          0       sales     low
4         1          0.37        0.52             2                  159               3             0          0       sales     low
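The rename-and-reorder step above can also be written more compactly. The following is a minimal sketch on a hypothetical three-row frame (the `demo` name and its values are made up for illustration, not taken from the dataset); it relies on `DataFrame.pop`, which removes a column and returns it in one call.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the HR data
demo = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "left": [1, 1, 0],
    "sales": ["sales", "hr", "IT"],
})

# Same renaming idea as in the article, restricted to the columns present here
demo = demo.rename(columns={"satisfaction_level": "satisfaction",
                            "sales": "department",
                            "left": "turnover"})

# pop() removes the label column and returns it; insert() puts it back at position 0
demo.insert(0, "turnover", demo.pop("turnover"))

print(list(demo.columns))
```

This avoids the temporary variable and the in-place drop, but the result is the same as the three-line version used in the article.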
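3. Analyzing the data
The dataset has 14,999 rows and 10 columns (the turnover label plus 9 features). The overall turnover rate is about 24%, and the mean satisfaction level is about 0.61.
df.shape

(14999, 10)

# Column data types
df.dtypes

turnover int64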
satisfaction float64
evaluation float64
projectCount int64
averageMonthlyHours int64
yearsAtCompany int64
workAccident int64
promotion int64
department object
salary object
dtype: object

turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate

0 0.761917
1 0.238083
Name: turnover, dtype: float64

# Summary statistics
df.describe()

           turnover  satisfaction    evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident     promotion
count  14999.000000  14999.000000  14999.000000  14999.000000         14999.000000    14999.000000  14999.000000  14999.000000
mean       0.238083      0.612834      0.716102      3.803054           201.050337        3.498233      0.144610      0.021268
std        0.425924      0.248631      0.171169      1.232592            49.943099        1.460136      0.351719      0.144281
min        0.000000      0.090000      0.360000      2.000000            96.000000        2.000000      0.000000      0.000000
25%        0.000000      0.440000      0.560000      3.000000           156.000000        3.000000      0.000000      0.000000
50%        0.000000      0.640000      0.720000      4.000000           200.000000        3.000000      0.000000      0.000000
75%        0.000000      0.820000      0.870000      5.000000           245.000000        4.000000      0.000000      0.000000
max        1.000000      1.000000      1.000000      7.000000           310.000000       10.000000      1.000000      1.000000

# Per-group means (employees who stayed vs. employees who left)
turnover_Summary = df.groupby('turnover')
# numeric_only=True skips the string columns department and salary (required on pandas 2.x)
turnover_Summary.mean(numeric_only=True)

          satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion
turnover
0             0.666810    0.715473      3.786664           199.060203        3.380032      0.175009   0.026251
1             0.440098    0.718113      3.855503           207.419210        3.876505      0.047326   0.005321
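Because the label is coded 0/1, the turnover rate is simply the mean of the label column, and the per-group table is a grouped mean over the numeric columns. The sketch below illustrates both on a hypothetical eight-row frame (`toy` and its values are invented for illustration only).

```python
import pandas as pd

# Hypothetical eight-row frame; values are made up for illustration
toy = pd.DataFrame({
    "turnover": [1, 0, 0, 1, 0, 0, 0, 1],
    "satisfaction": [0.40, 0.70, 0.65, 0.10, 0.80, 0.60, 0.75, 0.45],
    "salary": ["low", "low", "medium", "low", "high", "medium", "high", "medium"],
})

# The mean of a 0/1 label equals the share of rows labelled 1
overall_rate = toy["turnover"].mean()

# Same number via normalized value counts, as in the article
rate_from_counts = toy["turnover"].value_counts(normalize=True).loc[1]

# Grouped means; numeric_only=True leaves out the string column salary
group_means = toy.groupby("turnover").mean(numeric_only=True)

print(overall_rate, rate_from_counts)
print(group_means)
```

On the real data the same two calls reproduce the 0.238083 rate and the per-group table shown above.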
3.1 Correlation analysis
# Correlation matrix over the numeric columns (department and salary are still strings at this point)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
corr

                     turnover  satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion
turnover             1.000000     -0.388375    0.006567      0.023787             0.071287        0.144822     -0.154622  -0.061788
satisfaction        -0.388375      1.000000    0.105021     -0.142970            -0.020048       -0.100866      0.058697   0.025605
evaluation           0.006567      0.105021    1.000000      0.349333             0.339742        0.131591     -0.007104  -0.008684
projectCount         0.023787     -0.142970    0.349333      1.000000             0.417211        0.196786     -0.004741  -0.006064
averageMonthlyHours  0.071287     -0.020048    0.339742      0.417211             1.000000        0.127755     -0.010143  -0.003544
yearsAtCompany       0.144822     -0.100866    0.131591      0.196786             0.127755        1.000000      0.002120   0.067433
workAccident        -0.154622      0.058697   -0.007104     -0.004741            -0.010143        0.002120      1.000000   0.039245
promotion           -0.061788      0.025605   -0.008684     -0.006064            -0.003544        0.067433      0.039245   1.000000

Positively correlated features:
projectCount VS evaluation: 0.349333
projectCount VS averageMonthlyHours: 0.417211
averageMonthlyHours VS evaluation: 0.339742
Negatively correlated features:
satisfaction VS turnover: -0.388375
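To see at a glance which features move most with the label, the turnover column of the correlation matrix can be ranked by absolute value. The sketch below uses the turnover correlations copied from the matrix above; on the full DataFrame the same Series could be obtained with something like `df.corr(numeric_only=True)['turnover'].drop('turnover')` (the variable names here are illustrative).

```python
import pandas as pd

# Correlations with turnover, copied from the correlation matrix in this section
corr_with_turnover = pd.Series({
    "satisfaction": -0.388375,
    "evaluation": 0.006567,
    "projectCount": 0.023787,
    "averageMonthlyHours": 0.071287,
    "yearsAtCompany": 0.144822,
    "workAccident": -0.154622,
    "promotion": -0.061788,
})

# Rank by absolute value so strong negative correlations are not pushed to the bottom
order = corr_with_turnover.abs().sort_values(ascending=False).index
ranked = corr_with_turnover.reindex(order)

print(ranked)
```

Satisfaction stands out clearly, followed at a distance by workAccident and yearsAtCompany. Pearson correlation only captures linear association, so a low value here does not rule out a non-linear relationship with turnover.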
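# Compare the satisfaction of employees who stayed and employees who left
emp_population = df['satisfaction'][df['turnover'] == 0].mean()
emp_turnover_satisfaction = df[df['turnover'] == 1]['satisfaction'].mean()

print('Mean satisfaction (stayed): ' + str(emp_population))
print('Mean satisfaction (left): ' + str(emp_turnover_satisfaction))

Mean satisfaction (stayed): 0.666809590479516
Mean satisfaction (left): 0.44009801176140917

3.2 Running a T-Test
We run a t-test to check whether the satisfaction of employees who left differs significantly from the satisfaction of employees who stayed.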
import scipy.stats as stats
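stats.ttest_1samp(a=df[df['turnover'] == 1]['satisfaction'],  # satisfaction sample of employees who left
                  popmean=emp_population)  # mean satisfaction of employees who stayed

Ttest_1sampResult(statistic=-51.3303486754725, pvalue=0.0)

The p-value is reported as 0.0 (too small to represent in floating point), so the difference between the two groups is statistically significant.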
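The one-sample statistic is t = (sample mean - popmean) / (s / sqrt(n)), where s is the sample standard deviation. The sketch below recomputes it by hand and compares it with `stats.ttest_1samp`. It uses synthetic satisfaction scores (the arrays `left` and `stayed` are randomly generated stand-ins, not the real data). Since the stayed group is itself a sample rather than a known population, the sketch also runs Welch's two-sample test as an alternative; that is a substitution on my part, not what the article does.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two satisfaction samples (assumed, not the real data)
rng = np.random.default_rng(0)
left = rng.normal(loc=0.44, scale=0.26, size=500)
stayed = rng.normal(loc=0.67, scale=0.22, size=1500)

# One-sample t statistic computed by hand against the stayed-group mean
popmean = stayed.mean()
n = len(left)
t_manual = (left.mean() - popmean) / (left.std(ddof=1) / np.sqrt(n))

# Same test through SciPy
t_scipy, p_scipy = stats.ttest_1samp(left, popmean)

# Alternative: Welch's two-sample t-test, which treats both groups as samples
t_welch, p_welch = stats.ttest_ind(left, stayed, equal_var=False)

print(t_manual, t_scipy, p_scipy)
print(t_welch, p_welch)
```

The hand-computed value and the SciPy value agree. The Welch statistic is somewhat smaller in magnitude because it also accounts for the sampling variability of the stayed group.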
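# Degrees of freedom. Strictly this is n - 1 for a one-sample t-test; at this sample size the difference is negligible.
degree_freedom = len(df[df['turnover'] == 1])
LQ = stats.t.ppf(0.025, degree_freedom)  # lower critical value (2.5% quantile)
RQ = stats.t.ppf(0.975, degree_freedom)  # upper critical value (97.5% quantile)

print('t-distribution lower critical value: ' + str(LQ))
print('t-distribution upper critical value: ' + str(RQ))

t-distribution lower critical value: -1.9606285215955626
t-distribution upper critical value: 1.9606285215955621

The test statistic of about -51.33 lies far outside the interval between these two critical values, so the null hypothesis of equal means is rejected at the 5% level.

# Kernel density estimate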
fig = plt.figure(figsize=(15, 4))
# fill=True replaces the deprecated shade=True in seaborn 0.11 and later
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'evaluation'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'evaluation'], color='r', fill=True, label='turnover')
ax.set(xlabel='Employee Evaluation', ylabel='Density')
ax.legend()
plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')

Text(0.5, 1.0, 'Employee Evaluation Distribution - Turnover V.S. No Turnover')

# Kernel density estimate
fig = plt.figure(figsize=(15, 4))
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'averageMonthlyHours'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'averageMonthlyHours'], color='r', fill=True, label='turnover')
ax.legend()
ax.set(xlabel='Employee Average Monthly Hours', ylabel='Density')
plt.title('Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')

Text(0.5, 1.0, 'Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')

# Kernel density estimate
fig = plt.figure(figsize=(15, 4))
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'satisfaction'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'satisfaction'], color='r', fill=True, label='turnover')
plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')
ax.legend()

<matplotlib.legend.Legend at 0x281a5a6b820>

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
# Convert the string columns to integer category codes
df['department'] = df['department'].astype('category').cat.codes
df['salary'] = df['salary'].astype('category').cat.codes

# Build X and y
target_name = 'turnover'
X = df.drop('turnover', axis=1)
y = df[target_name]

# Split the data into training and test sets
# Note: stratify=y keeps the share of employees who left the same in the training and test sets as in the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)

df.head()

   turnover  satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion  department  salary
0         1          0.38        0.53             2                  157               3             0          0           7       1
1         1          0.80        0.86             5                  262               6             0          0           7       2
2         1          0.11        0.88             7                  272               4             0          0           7       2
3         1          0.72        0.87             5                  223               5             0          0           7       1
4         1          0.37        0.52             2                  159               3             0          0           7       1
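Two details of this step are easy to miss: `cat.codes` numbers the categories in sorted (alphabetical) order, which is why low becomes 1 and medium becomes 2 in the table above, and `stratify=y` preserves the class ratio in both splits. The following sketch illustrates both on small made-up inputs (`salary_values`, `X_demo`, and `y_demo` are hypothetical). The ordered encoding at the end is an optional variation, not something the article uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# --- Category codes follow sorted category order ---
salary_values = ["low", "medium", "medium", "low", "high"]
salary = pd.Series(salary_values).astype("category")
code_to_label = dict(enumerate(salary.cat.categories))
print(code_to_label)

# Optional variation: an explicit order so that codes follow low < medium < high
ordered = pd.Categorical(salary_values, categories=["low", "medium", "high"], ordered=True)
ordered_codes = ordered.codes.tolist()
print(ordered_codes)

# --- stratify=y keeps the class ratio in both splits ---
y_demo = pd.Series([1] * 240 + [0] * 760)       # 24% positives, similar to the dataset
X_demo = pd.DataFrame({"feature": range(1000)})  # placeholder feature
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=123, stratify=y_demo
)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Tree-based models split on thresholds, so the arbitrary alphabetical codes still work, but an ordered encoding can make the resulting splits on salary easier to interpret.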
4. Building the prediction models: Decision Tree V.S. Random Forest
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Decision tree
dtree = tree.DecisionTreeClassifier(
    criterion='entropy',
    # max_depth=3,  # maximum depth of the tree; can be used to limit overfitting
    min_weight_fraction_leaf=0.01  # minimum share of samples a leaf must contain (as a fraction); limits overfitting
)
dtree = dtree.fit(X_train, y_train)
print("\n\n ---Decision Tree---")
# Note: this AUC is computed from hard class predictions, not from predicted probabilities
dt_roc_auc = roc_auc_score(y_test, dtree.predict(X_test))
print("Decision Tree AUC = %2.2f" % dt_roc_auc)
print(classification_report(y_test, dtree.predict(X_test)))

# Random forest
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1000,
    max_depth=None,  # maximum depth of each tree; can be used to limit overfitting
    min_samples_split=10,  # minimum number of samples a node needs before it is split further
    # min_weight_fraction_leaf=0.02  # minimum share of samples a leaf must contain (as a fraction); limits overfitting
)
rf.fit(X_train, y_train)
print("\n\n ---Random Forest---")
# Note: as above, this AUC is computed from hard class predictions
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))

 ---Decision Tree---
Decision Tree AUC = 0.93
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      1714
           1       0.93      0.89      0.91       536

    accuracy                           0.96      2250
   macro avg       0.95      0.93      0.94      2250
weighted avg       0.96      0.96      0.96      2250

 ---Random Forest---
Random Forest AUC = 0.97
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1714
           1       0.99      0.94      0.97       536

    accuracy                           0.98      2250
   macro avg       0.99      0.97      0.98      2250
weighted avg       0.98      0.98      0.98      2250

5. Model evaluation
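One point worth keeping in mind when reading the AUC figures from section 4: they were computed by passing `predict` output (hard 0/1 labels) to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the resulting area equals balanced accuracy, the mean of the recall on each class. The area under the full ROC curve needs scores such as `predict_proba`. The sketch below shows the difference on a synthetic dataset (generated with `make_classification`; all names and parameters here are illustrative assumptions, not the article's data).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with roughly 24% positives (assumed stand-in for the HR data)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.76, 0.24], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=123, stratify=y
)

clf = DecisionTreeClassifier(criterion="entropy",
                             min_weight_fraction_leaf=0.01,
                             random_state=0)
clf.fit(X_train, y_train)

# AUC from hard labels: one operating point, equal to balanced accuracy
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))
bal_acc = balanced_accuracy_score(y_test, clf.predict(X_test))

# AUC from predicted probabilities: area under the full ROC curve
auc_from_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(auc_from_labels, bal_acc, auc_from_scores)
```

If the legend of a ROC plot is meant to report the area under the plotted curve, the probability-based value is the one to use.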
5.1 ROC curve
# ROC curve
from sklearn.metrics import roc_curve
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, dtree.predict_proba(X_test)[:, 1])

plt.figure()

# Random forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)

# Decision tree ROC
plt.plot(dt_fpr, dt_tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc='lower right')
plt.show()

5.2 Analyzing feature importance with decision trees
## Plot the random forest feature importances ##
importances = rf.feature_importances_
feat_names = df.drop(['turnover'], axis=1).columns

indices = np.argsort(importances)[::-1]  # feature indices sorted from most to least important
plt.figure(figsize=(12, 6))
plt.title('Feature importances by RandomForest')
plt.bar(range(len(indices)), importances[indices], color='lightblue', align='center')
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical', fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()

## Plot the decision tree feature importances ##
importances = dtree.feature_importances_
feat_names = df.drop(['turnover'], axis=1).columns

indices = np.argsort(importances)[::-1]  # feature indices sorted from most to least important
plt.figure(figsize=(12, 6))
plt.title('Feature importances by Decision Tree')
plt.bar(range(len(indices)), importances[indices], color='lightblue', align='center')
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical', fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()
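The `feature_importances_` values plotted above are impurity-based: they sum to 1 and are derived from the training data. As a cross-check, permutation importance on the held-out set measures how much the score drops when a feature is shuffled. This is a different technique from the one the article uses. The sketch below compares the two on synthetic data in which only the first feature determines the label (the data and all names are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on feature 0, the other three are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances (what the bar charts in this section plot)
impurity = model.feature_importances_
order = np.argsort(impurity)[::-1]          # most important first
cumulative = np.cumsum(impurity[order])     # the "Cumulative" step line

# Permutation importance on held-out data, as an independent cross-check
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

print(order, cumulative)
print(perm.importances_mean)
```

When the two rankings agree, as they do here, the impurity-based chart is more trustworthy. Large disagreements are a hint that impurity-based importance may be inflated for some features.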