

news 2025/11/30 10:10:02

The 【阿旭机器学习实战】 (A Xu Machine Learning in Practice) series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn together.

The main task of this article is to use decision tree and random forest models to predict how likely an employee is to leave, and to help the HR department understand why employees leave.

Contents
1. Get the data
2. Data preprocessing
3. Analyze the data
3.1 Correlation analysis
3.2 T-test
4. Build the prediction models: Decision Tree vs. Random Forest
5. Model evaluation
5.1 ROC curve
5.2 Feature importance from the decision tree

1. Get the data

Follow the WeChat official account (GZH) 阿旭算法与机器学习 and reply "ML35" to get the dataset, source code, and project documentation for this article.

    # Import libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib as matplot
    import seaborn as sns
    %matplotlib inline  # Jupyter magic; remove when running as a plain script

    # Read the data into a pandas DataFrame named df
    df = pd.read_csv('HR_comma_sep.csv', index_col=None)

2. Data preprocessing

    # Check for missing data
    df.isnull().any()

    satisfaction_level       False
    last_evaluation          False
    number_project           False
    average_montly_hours     False
    time_spend_company       False
    Work_accident            False
    left                     False
    promotion_last_5years    False
    sales                    False
    salary                   False
    dtype: bool

    # Sample rows
    df.head()

       satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  sales  salary
    0                0.38             0.53               2                   157                   3              0     1                      0  sales     low
    1                0.80             0.86               5                   262                   6              0     1                      0  sales  medium
    2                0.11             0.88               7                   272                   4              0     1                      0  sales  medium
    3                0.72             0.87               5                   223                   5              0     1                      0  sales     low
    4                0.37             0.52               2                   159                   3              0     1                      0  sales     low

Note: the "turnover" column (called left in the raw file and renamed below) is the label: 1 means the employee left, 0 means the employee stayed. All other columns are features.

    # Rename the columns
    df = df.rename(columns={'satisfaction_level': 'satisfaction',
                            'last_evaluation': 'evaluation',
                            'number_project': 'projectCount',
                            'average_montly_hours': 'averageMonthlyHours',
                            'time_spend_company': 'yearsAtCompany',
                            'Work_accident': 'workAccident',
                            'promotion_last_5years': 'promotion',
                            'sales': 'department',
                            'left': 'turnover'})

    # Move the prediction label (turnover) to the first column
    front = df['turnover']
    df.drop(labels=['turnover'], axis=1, inplace=True)
    df.insert(0, 'turnover', front)
    df.head()

       turnover  satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion  department  salary
    0         1          0.38        0.53             2                  157               3             0          0       sales     low
    1         1          0.80        0.86             5                  262               6             0          0       sales  medium
    2         1          0.11        0.88             7                  272               4             0          0       sales  medium
    3         1          0.72        0.87             5                  223               5             0          0       sales     low
    4         1          0.37        0.52             2                  159               3             0          0       sales     low

3. Analyze the data

The dataset has 14,999 records with 10 columns each (9 features plus the label). The overall turnover rate is about 24%, and the mean satisfaction is 0.61.

    df.shape

    (14999, 10)

    # Data type of each column
    df.dtypes

    turnover                 int64
    satisfaction           float64
    evaluation             float64
    projectCount             int64
    averageMonthlyHours      int64
    yearsAtCompany           int64
    workAccident             int64
    promotion                int64
    department              object
    salary                  object
    dtype: object

    turnover_rate = df.turnover.value_counts() / len(df)
    turnover_rate

    0    0.761917
    1    0.238083
    Name: turnover, dtype: float64

    # Summary statistics
    df.describe()

               turnover  satisfaction    evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident     promotion
    count  14999.000000  14999.000000  14999.000000  14999.000000         14999.000000    14999.000000  14999.000000  14999.000000
    mean       0.238083      0.612834      0.716102      3.803054           201.050337        3.498233      0.144610      0.021268
    std        0.425924      0.248631      0.171169      1.232592            49.943099        1.460136      0.351719      0.144281
    min        0.000000      0.090000      0.360000      2.000000            96.000000        2.000000      0.000000      0.000000
    25%        0.000000      0.440000      0.560000      3.000000           156.000000        3.000000      0.000000      0.000000
    50%        0.000000      0.640000      0.720000      4.000000           200.000000        3.000000      0.000000      0.000000
    75%        0.000000      0.820000      0.870000      5.000000           245.000000        4.000000      0.000000      0.000000
    max        1.000000      1.000000      1.000000      7.000000           310.000000       10.000000      1.000000      1.000000

    # Mean of each feature, grouped by label
    turnover_Summary = df.groupby('turnover')
    # numeric_only=True is needed in pandas 2.0 and later, because department and salary are still strings here
    turnover_Summary.mean(numeric_only=True)

              satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion
    turnover
    0             0.666810    0.715473      3.786664           199.060203        3.380032      0.175009   0.026251
    1             0.440098    0.718113      3.855503           207.419210        3.876505      0.047326   0.005321

3.1 Correlation analysis

    # Correlation matrix (numeric columns only; see the note on numeric_only above)
    corr = df.corr(numeric_only=True)
    # corr = (corr)
    sns.heatmap(corr,
                xticklabels=corr.columns.values,
                yticklabels=corr.columns.values)
    corr

                         turnover  satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion
    turnover             1.000000     -0.388375    0.006567      0.023787             0.071287        0.144822     -0.154622  -0.061788
    satisfaction        -0.388375      1.000000    0.105021     -0.142970            -0.020048       -0.100866      0.058697   0.025605
    evaluation           0.006567      0.105021    1.000000      0.349333             0.339742        0.131591     -0.007104  -0.008684
    projectCount         0.023787     -0.142970    0.349333      1.000000             0.417211        0.196786     -0.004741  -0.006064
    averageMonthlyHours  0.071287     -0.020048    0.339742      0.417211             1.000000        0.127755     -0.010143  -0.003544
    yearsAtCompany       0.144822     -0.100866    0.131591      0.196786             0.127755        1.000000      0.002120   0.067433
    workAccident        -0.154622      0.058697   -0.007104     -0.004741            -0.010143        0.002120      1.000000   0.039245
    promotion           -0.061788      0.025605   -0.008684     -0.006064            -0.003544        0.067433      0.039245   1.000000

Positively correlated features:
- projectCount vs. evaluation: 0.349333
- projectCount vs. averageMonthlyHours: 0.417211
- averageMonthlyHours vs. evaluation: 0.339742

Negatively correlated features:
- satisfaction vs. turnover: -0.388375

    # Compare the satisfaction of employees who left and employees who stayed
    emp_population = df['satisfaction'][df['turnover'] == 0].mean()
    emp_turnover_satisfaction = df[df['turnover'] == 1]['satisfaction'].mean()

    print('Satisfaction of employees who stayed: ' + str(emp_population))
    print('Satisfaction of employees who left: ' + str(emp_turnover_satisfaction))

    Satisfaction of employees who stayed: 0.666809590479516
    Satisfaction of employees who left: 0.44009801176140917

3.2 T-test

Run a t-test to check whether the satisfaction of employees who left differs significantly from that of employees who stayed.

    import scipy.stats as stats
    stats.ttest_1samp(a=df[df['turnover'] == 1]['satisfaction'],  # satisfaction sample of employees who left
                      popmean=emp_population)                     # mean satisfaction of employees who stayed

    Ttest_1sampResult(statistic=-51.3303486754725, pvalue=0.0)

The t-test reports a p-value that is effectively 0, so the two groups differ significantly.

    # Strictly, the degrees of freedom are n - 1; with n this large the quantiles are practically identical
    degree_freedom = len(df[df['turnover'] == 1])
    LQ = stats.t.ppf(0.025, degree_freedom)  # left boundary (2.5% quantile) of the 95% interval
    RQ = stats.t.ppf(0.975, degree_freedom)  # right boundary (97.5% quantile) of the 95% interval
    print('The t-distribution left boundary: ' + str(LQ))
    print('The t-distribution right boundary: ' + str(RQ))

    The t-distribution left boundary: -1.9606285215955626
    The t-distribution right boundary: 1.9606285215955621

The test statistic of about -51.33 lies far outside this interval, which agrees with the p-value.

    # Kernel density estimate of the evaluation score
    # (newer seaborn versions use fill=True instead of shade=True)
    fig = plt.figure(figsize=(15, 4),)
    ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'evaluation'], color='b', shade=True, label='no turnover')
    ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'evaluation'], color='r', shade=True, label='turnover')
    ax.set(xlabel='Employee Evaluation', ylabel='Frequency')
    ax.legend()
    plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')

    Text(0.5, 1.0, 'Employee Evaluation Distribution - Turnover V.S. No Turnover')

    # Kernel density estimate of the average monthly hours
    fig = plt.figure(figsize=(15, 4))
    ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'averageMonthlyHours'], color='b', shade=True, label='no turnover')
    ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'averageMonthlyHours'], color='r', shade=True, label='turnover')
    ax.legend()
    ax.set(xlabel='Employee Average Monthly Hours', ylabel='Frequency')
    plt.title('Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')

    Text(0.5, 1.0, 'Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')

    # Kernel density estimate of the satisfaction score
    fig = plt.figure(figsize=(15, 4))
    ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'satisfaction'], color='b', shade=True, label='no turnover')
    ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'satisfaction'], color='r', shade=True, label='turnover')
    plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')
    ax.legend()

    <matplotlib.legend.Legend at 0x281a5a6b820>

    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve

    # Convert the string columns to integer codes
    df['department'] = df['department'].astype('category').cat.codes
    df['salary'] = df['salary'].astype('category').cat.codes

    # Build X and y
    target_name = 'turnover'
    X = df.drop('turnover', axis=1)
    y = df[target_name]

    # Split the data into training and test sets.
    # Note: stratify=y keeps the proportion of employees who left the same in the
    # training and test sets as in the full dataset.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
    df.head()

       turnover  satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion  department  salary
    0         1          0.38        0.53             2                  157               3             0          0           7       1
    1         1          0.80        0.86             5                  262               6             0          0           7       2
    2         1          0.11        0.88             7                  272               4             0          0           7       2
    3         1          0.72        0.87             5                  223               5             0          0           7       1
    4         1          0.37        0.52             2                  159               3             0          0           7       1

4. Build the prediction models: Decision Tree vs. Random Forest

    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import classification_report
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import tree
    from sklearn.tree import DecisionTreeClassifier

    # Decision tree
    dtree = tree.DecisionTreeClassifier(
        criterion='entropy',
        #max_depth=3,                   # depth of the tree; can be used to prevent overfitting
        min_weight_fraction_leaf=0.01   # minimum fraction of samples a leaf must contain; prevents overfitting
        )
    dtree = dtree.fit(X_train, y_train)
    print("\n\n ---Decision Tree---")
    dt_roc_auc = roc_auc_score(y_test, dtree.predict(X_test))
    print("Decision Tree AUC = %2.2f" % dt_roc_auc)
    print(classification_report(y_test, dtree.predict(X_test)))

    # Random forest
    rf = RandomForestClassifier(
        criterion='entropy',
        n_estimators=1000,
        max_depth=None,                 # depth of each tree; can be used to prevent overfitting
        min_samples_split=10,           # minimum number of samples required to split a node
        #min_weight_fraction_leaf=0.02  # minimum fraction of samples a leaf must contain; prevents overfitting
        )
    rf.fit(X_train, y_train)
    print("\n\n ---Random Forest---")
    rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
    print("Random Forest AUC = %2.2f" % rf_roc_auc)
    print(classification_report(y_test, rf.predict(X_test)))

     ---Decision Tree---
    Decision Tree AUC = 0.93
                  precision    recall  f1-score   support

               0       0.97      0.98      0.97      1714
               1       0.93      0.89      0.91       536

        accuracy                           0.96      2250
       macro avg       0.95      0.93      0.94      2250
    weighted avg       0.96      0.96      0.96      2250

     ---Random Forest---
    Random Forest AUC = 0.97
                  precision    recall  f1-score   support

               0       0.98      1.00      0.99      1714
               1       0.99      0.94      0.97       536

        accuracy                           0.98      2250
       macro avg       0.99      0.97      0.98      2250
    weighted avg       0.98      0.98      0.98      2250

5. Model evaluation

5.1 ROC curve

    # ROC curve
    from sklearn.metrics import roc_curve
    rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
    dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, dtree.predict_proba(X_test)[:, 1])

    plt.figure()

    # Random forest ROC
    plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)

    # Decision tree ROC
    plt.plot(dt_fpr, dt_tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)

    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Graph')
    plt.legend(loc="lower right")
    plt.show()

5.2 Feature importance from the decision tree

    ## Plot the feature importances of the random forest ##
    importances = rf.feature_importances_
    feat_names = df.drop(['turnover'], axis=1).columns

    indices = np.argsort(importances)[::-1]
    plt.figure(figsize=(12, 6))
    plt.title("Feature importances by RandomForest")
    plt.bar(range(len(indices)), importances[indices], color='lightblue', align="center")
    plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
    plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical', fontsize=14)
    plt.xlim([-1, len(indices)])
    plt.show()

    ## Plot the feature importances of the decision tree ##
    importances = dtree.feature_importances_
    feat_names = df.drop(['turnover'], axis=1).columns

    indices = np.argsort(importances)[::-1]
    plt.figure(figsize=(12, 6))
    plt.title("Feature importances by Decision Tree")
    plt.bar(range(len(indices)), importances[indices], color='lightblue', align="center")
    plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
    plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical', fontsize=14)
    plt.xlim([-1, len(indices)])
    plt.show()
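A note on the t-test in section 3.2: the one-sample test there treats the mean satisfaction of employees who stayed as a fixed, known constant. A two-sample Welch test, which accounts for the variance in both groups, is a common alternative. The sketch below is not the article's method; it computes only the Welch statistic and degrees of freedom using the standard library, the helper name `welch_t` is hypothetical, and the numbers are made up for illustration rather than taken from the dataset.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Made-up satisfaction scores, for illustration only
left = [0.38, 0.11, 0.37, 0.41, 0.10, 0.45]
stayed = [0.80, 0.72, 0.58, 0.66, 0.91, 0.49, 0.75]

t_stat, dof = welch_t(left, stayed)
print(t_stat, dof)
```

With SciPy, which the article already imports, the corresponding call would be `stats.ttest_ind(a, b, equal_var=False)`, which also returns the p-value.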

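A note on the AUC values in sections 4 and 5.1: they are computed from `predict` (hard 0/1 labels), while the ROC curves are drawn from `predict_proba`. For hard labels, the AUC reduces to the average of the recall on the two classes, which matches the reported numbers: (0.98 + 0.89) / 2 ≈ 0.93 for the decision tree and (1.00 + 0.94) / 2 = 0.97 for the random forest. The following plain-Python sketch shows how the two ways of scoring can differ; the helper `auc_from_scores` is hypothetical and the data is made up for illustration.

```python
def auc_from_scores(y_true, scores):
    """Rank-based AUC: chance that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9]

# Hard labels at a 0.5 threshold, comparable to what predict() returns
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = auc_from_scores(y_true, proba)  # uses the full ranking
auc_hard = auc_from_scores(y_true, hard)    # uses only the thresholded labels
print(auc_proba, auc_hard)
```

To make the number in the ROC legend agree with the plotted curve, one option is to pass probabilities to scikit-learn instead, for example `roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])`.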