1. Hands-on tasks

(1) Load the iris dataset (iris.txt) into iris_df, use seaborn.lmplot to find the outliers in the class (species) column, and handle the other outliers at the same time.
import pandas as pd
from sklearn.datasets import load_iris
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
iris = load_iris()
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['target'] = iris['target']
import pandas as pd
import matplotlib.pyplot as plt
# reload iris_df from the provided iris.txt; this is the copy that gets cleaned below
iris_df = pd.read_csv('iris.txt', sep=',')
iris_df
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.lmplot(x='sepal_length', y='sepal_width', col='class', data=iris_df)
sns.lmplot(x='petal_length', y='petal_width', col='class', data=iris_df)
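# Optional extra check (a sketch, not part of the original flow): seaborn's pairplot
# shows every pairwise feature relationship at once, which also makes outliers easy to spot.
sns.pairplot(iris_df, hue='class')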
iris_df['class'].drop_duplicates()
# The statements above show anomalies in the class column, and sepal_width and sepal_length also contain outliers.
# class should have only 3 categories: change 'versicolor' to 'Iris-versicolor' and 'Iris-setossa' to 'Iris-setosa'.
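# A complementary check (sketch, not in the original flow): value_counts() also shows how
# many rows carry each (possibly misspelled) label, so rare typo labels stand out immediately.
print(iris_df['class'].value_counts())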
iris_df.loc[iris_df['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_df.loc[iris_df['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
sns.lmplot(x='sepal_length', y='sepal_width', col='class', data=iris_df)  # replot to confirm there are now 3 classes
# inspect the data distribution with a histogram
iris_df.loc[iris_df['class'] == 'Iris-setosa', 'sepal_width'].hist()
# drop the Iris-setosa rows whose sepal_width is below 2.5 cm
iris_df = iris_df.loc[(iris_df['class'] != 'Iris-setosa') | (iris_df['sepal_width'] >= 2.5)]
iris_df.loc[iris_df['class'] == 'Iris-setosa', 'sepal_width'].hist()
# list the outliers
iris_df.loc[(iris_df['class'] == 'Iris-versicolor') & (iris_df['sepal_length'] < 1.0)]
# the near-zero Iris-versicolor sepal_length values were recorded in metres; multiply by 100 to convert them to centimetres
iris_df.loc[(iris_df['class'] == 'Iris-versicolor') & (iris_df['sepal_length'] < 1.0), 'sepal_length'] *= 100
iris_df.loc[iris_df['class'] == 'Iris-versicolor', 'sepal_length'].hist()

(2) Use isnull and describe to inspect and handle the missing values.
# list the samples with missing values
iris_df.isnull().sum()
iris_df.describe()
iris_df.loc[iris_df['petal_width'].isnull()]
# fill the missing values with the mean of that class and list the modified samples
avg_value = iris_df.loc[iris_df['class'] == 'Iris-setosa', 'petal_width'].mean()
iris_df.loc[(iris_df['class'] == 'Iris-setosa') & (iris_df['petal_width'].isnull()), 'petal_width'] = avg_value
iris_df.loc[(iris_df['class'] == 'Iris-setosa') & (iris_df['petal_width'] == avg_value)]
# check whether any missing values remain
iris_df.isnull().sum()
# convert the class names to integer labels (e.g. Iris-setosa becomes 0)
class_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
iris_df['class'] = iris_df['class'].map(class_mapping)
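# For reference only (a sketch, not part of the original flow): sklearn's LabelEncoder would
# produce the same 0/1/2 codes, because it assigns codes in sorted label order.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print(le.fit_transform(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']))  # [0 1 2]
print(le.classes_)  # the label order behind the codes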
iris_df
# save the cleaned data
iris_df.to_csv('iris-clean.csv', index=False)
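# A more general per-class imputation (a sketch, not part of the original flow): fill each
# feature's missing values with the mean of its own class in a single groupby/transform pass.
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
filled = iris_df.copy()
filled[feature_cols] = filled.groupby('class')[feature_cols].transform(lambda s: s.fillna(s.mean()))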
(3) Import sklearn's built-in load_iris dataset and get the feature matrix and target (label) array.
from sklearn.datasets import load_iris
iris = load_iris()
iris_X = iris.data
iris_Y = iris.target
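# Quick sanity check (sketch): the iris feature matrix should be 150 x 4 and the target a length-150 vector.
print(iris_X.shape)  # (150, 4)
print(iris_Y.shape)  # (150,)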
(4) Use KNeighborsClassifier() for classification and prediction.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

def knn_function(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
    clf = KNeighborsClassifier()  # build the model
    clf.fit(X_train, Y_train)     # train the model
    predict_test = clf.predict(X_test)
    print('Predicted values:\n', predict_test)
    print('True values:\n', Y_test)
    score = clf.score(X_test, Y_test, sample_weight=None)  # compute the accuracy
    print('Accuracy:\n', score)
    return clf
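# Note on reproducibility (an aside, not part of the original flow): train_test_split above
# draws a different random split on every call, so the printed accuracy varies from run to run.
# Passing the optional random_state (and stratify) arguments makes the split repeatable, e.g.:
# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0, stratify=Y)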
knn_function(iris_X, iris_Y)

(5) Import iris-clean.csv, get the feature matrix and target array, call knn_function(), and save the model.
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv('iris-clean.csv')
# get the feature matrix and the target (label) array
iris_XX = iris.loc[0:, 'sepal_length':'petal_width'].values
iris_YY = iris['class'].values
# call the function
knn_model = knn_function(iris_XX, iris_YY)
# save the model
with open('knn_model.pkl', 'wb') as f:
    pickle.dump(knn_model, f)
# load the saved model
with open('knn_model.pkl', 'rb') as f:
    model = pickle.load(f)
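# Alternative persistence (a sketch, not part of the original flow): joblib is also commonly
# used for sklearn models; the filename 'knn_model.joblib' below is just an example.
import joblib
joblib.dump(knn_model, 'knn_model.joblib')
model_jl = joblib.load('knn_model.joblib')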
# how the model's performance depends on the choice of training set
model_accuracies = []
for repetition in range(1000):
    X_train, X_test, Y_train, Y_test = \
        train_test_split(iris_XX, iris_YY, test_size=0.3)
    # score the model loaded from knn_model.pkl on each new test split
    score = model.score(X_test, Y_test, sample_weight=None)
    model_accuracies.append(score)
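# note: sns.distplot is deprecated in recent seaborn releases; sns.histplot(model_accuracies, kde=True)
# is the modern equivalent if distplot is unavailable.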
sns.distplot(model_accuracies)
plt.show()

(6) Hyperparameters and tuning: using sklearn's built-in iris data as an example, choose the KNN model, tune the hyperparameter K, and use 10-fold cross-validation to find the optimal K among the values 1~25.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
Y = iris.target
# split into training and test sets; the test set is 33% of the data, random seed 10
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=10)
k_range = range(1, 26)
cv_scores = []
for n in k_range:
    clf = KNeighborsClassifier(n)
    scores = cross_val_score(clf, X_train, Y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()
# choose the optimal K (read off the plot above)
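# The optimal K can also be picked programmatically from the cross-validation scores
# (a sketch using k_range and cv_scores from above) instead of reading it off the plot.
best_k = k_range[cv_scores.index(max(cv_scores))]
print('Best K by 10-fold CV:', best_k)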
best_clf = KNeighborsClassifier(n_neighbors=5)
best_clf.fit(X_train, Y_train)
print('Parameters:', best_clf.get_params())
print('Accuracy:', best_clf.score(X_test, Y_test))
print('Predicted values:', best_clf.predict(X_test))

2. Dataset download
https://gitee.com/qxh200000/c_-code/commit/1af2468e6b7f1bd8cd3b890018031c6fa6dff9bd