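The two main ways of handling missing values are:
❏ dropping the rows that contain missing values;
❏ filling in the missing values.
We first replace the value 0 in the serum_insulin column with None; the column then has 374 missing values:
print(pima['serum_insulin'].isnull().sum())
pima['serum_insulin'] = pima['serum_insulin'].map(lambda x: x if x != 0 else None)
print(pima['serum_insulin'].isnull().sum())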
# 0
# 374
Replacing the zeros in every affected column shows that the number of missing values differs from column to column:
columns = ['serum_insulin', 'bmi', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness']
for c in columns:
    pima[c].replace([0], [None], inplace=True)
print(pima.isnull().sum())
# times_pregnant 0
# plasma_glucose_concentration 5
# diastolic_blood_pressure 35
# triceps_thickness 227
# serum_insulin 374
# bmi 11
# pedigree_function 0
# age 0
# onset_diabetes 0
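# dtype: int64
At this point describe no longer reports statistics for the columns that contain missing values (most likely because replacing 0 with None left those columns with object dtype, which describe skips by default):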
print(pima.describe())
# times_pregnant pedigree_function age onset_diabetes
# count 768.000000 768.000000 768.000000 768.000000
# mean 3.845052 0.471876 33.240885 0.348958
# std 3.369578 0.331329 11.760232 0.476951
# min 0.000000 0.078000 21.000000 0.000000
# 25% 1.000000 0.243750 24.000000 0.000000
# 50% 3.000000 0.372500 29.000000 0.000000
# 75% 6.000000 0.626250 41.000000 1.000000
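# max 17.000000 2.420000 81.000000 1.000000
We can still compute the mean and standard deviation of such a column by hand:
print(pima['plasma_glucose_concentration'].mean(), pima['plasma_glucose_concentration'].std())
# 121.6867627785059 30.53564107280403
The simplest way to handle missing data is to drop the affected rows, which we do with the dropna method. Almost half of the rows are discarded. From a machine learning perspective, the remaining rows are complete and clean, but we are not using as much data as possible: nearly half of the observations are ignored.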
pima_dropped = pima.dropna()
rows = pima.shape[0]
rows_dropped = pima_dropped.shape[0]
num_rows_lost = round(100 * (rows - rows_dropped) / rows)
print('lost {}% rows'.format(num_rows_lost))
# lost 49% rows
The class proportions below show that dropping the rows has little effect on the share of diabetes cases:
print(pima['onset_diabetes'].value_counts(normalize=True))
print(pima_dropped['onset_diabetes'].value_counts(normalize=True))
# onset_diabetes
# 0 0.651042
# 1 0.348958
# Name: proportion, dtype: float64
# onset_diabetes
# 0 0.668367
# 1 0.331633
# Name: proportion, dtype: float64
Next we compare the mean of each column before and after dropping the rows:
pima_mean = pima.mean()
pima_dropped_mean = pima_dropped.mean()
print(pima_mean)
print(pima_dropped_mean)
# times_pregnant 3.845052
# plasma_glucose_concentration 121.686763
# diastolic_blood_pressure 72.405184
# triceps_thickness 29.15342
# serum_insulin 155.548223
# bmi 32.457464
# pedigree_function 0.471876
# age 33.240885
# onset_diabetes 0.348958
# dtype: object
# times_pregnant 3.30102
# plasma_glucose_concentration 122.627551
# diastolic_blood_pressure 70.663265
# triceps_thickness 29.145408
# serum_insulin 156.056122
# bmi 33.086224
# pedigree_function 0.523046
# age 30.864796
# onset_diabetes 0.331633
# dtype: object
The relative change of each column's mean after dropping the rows is:
mean_percent = (pima_dropped_mean - pima_mean) / pima_mean
print(mean_percent)
# times_pregnant -0.141489
# plasma_glucose_concentration 0.007731
# diastolic_blood_pressure -0.024058
# triceps_thickness -0.000275
# serum_insulin 0.003265
# bmi 0.019372
# pedigree_function 0.108439
# age -0.071481
# onset_diabetes -0.04965
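# dtype: object
A bar chart shows the percentage change of each column:
import matplotlib.pyplot as plt
ax = mean_percent.plot(kind='bar', title='% change in average column values')
ax.set_ylabel('% change')
plt.show()
The mean of times_pregnant (number of pregnancies) fell by 14% after the rows with missing values were dropped, which is a large change, and pedigree_function (the diabetes pedigree function) rose by 11%, also a big jump. Dropping rows can therefore seriously distort the shape of the data, so we should keep as much data as possible.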
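To make the cost of dropping rows concrete, here is a minimal sketch on a made-up toy table. It is not the Pima dataset; the values and the helper percent_rows_lost are hypothetical. It compares dropping every incomplete row with dropping only the rows that are incomplete in selected columns, using the subset parameter of dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "plasma_glucose_concentration": [148, 85, np.nan, 89, 137, 116],
    "serum_insulin": [np.nan, np.nan, np.nan, 94, 168, np.nan],
    "bmi": [33.6, 26.6, 23.3, 28.1, np.nan, 25.6],
    "onset_diabetes": [1, 0, 1, 0, 1, 0],
})


def percent_rows_lost(before, after):
    """Return the percentage of rows removed, rounded to the nearest integer."""
    return round(100 * (len(before) - len(after)) / len(before))


# Dropping every row with any missing value keeps only fully observed rows.
dropped_any = toy.dropna()

# Dropping only on selected columns keeps rows whose gaps are elsewhere.
dropped_subset = toy.dropna(subset=["plasma_glucose_concentration", "bmi"])

print(percent_rows_lost(toy, dropped_any))     # 83
print(percent_rows_lost(toy, dropped_subset))  # 33
```

In this toy table the sparsely observed serum_insulin column is what makes a plain dropna so expensive. Restricting the check to the other columns keeps four of the six rows, although the kept rows still contain missing insulin values that would need separate handling.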
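We now train scikit-learn's k-nearest neighbors (KNN) classifier on the data with the incomplete rows dropped. The best number of neighbors is 7, at which point the model's cross-validated accuracy is about 73.5%: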
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X_dropped = pima_dropped.drop('onset_diabetes', axis=1)
print('learning from {} rows'.format(X_dropped.shape[0]))
y_dropped = pima_dropped['onset_diabetes']

knn_para = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, knn_para)
grid.fit(X_dropped, y_dropped)
print(grid.best_score_, grid.best_params_)
# learning from 392 rows
# 0.7348263550795197 {'n_neighbors': 7}
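The same drop-then-search workflow can be tried without the Pima file. The following is a minimal sketch on synthetic data. The column names f1, f2, f3 and label, the 20% missing rate, and the explicit cv=5 are all assumptions made for illustration, not part of the original experiment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic features and a label that depends on the first two features.
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["f1", "f2", "f3"])
y = (X["f1"] + X["f2"] > 0).astype(int)
data = X.assign(label=y)

# Knock out roughly 20% of one feature to simulate missing values.
mask = rng.random(n) < 0.2
data.loc[mask, "f3"] = np.nan

# Drop every row that contains a missing value, then split features and label.
data_dropped = data.dropna()
X_dropped = data_dropped.drop("label", axis=1)
y_dropped = data_dropped["label"]
print("learning from {} rows".format(X_dropped.shape[0]))

# Search the same range of neighbor counts with 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 2, 3, 4, 5, 6, 7]}, cv=5)
grid.fit(X_dropped, y_dropped)
print(grid.best_score_, grid.best_params_)
```

Here best_score_ is the mean cross-validated accuracy of the best setting, so it is measured only on the rows that survived dropna. The discarded rows never influence the model or its score.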