Contents

- Basic syntax
  - Load data: pd.read_csv
  - Dataset size: shape
  - Column names: columns
  - Preview a few rows: head()
  - Summary statistics: describe()
- Basic operations
  - Missing values: dropna (by row, by column, with a subset), dropping columns after splitting (.drop), imputation (SimpleImputer()) and its extension
  - Selecting the target: single target, multiple features
  - Output: to_csv
  - Categorical variables: dropping categorical columns (select_dtypes()), ordinal encoding (OrdinalEncoder()), one-hot encoding (OneHotEncoder()), counting unique values (unique() and nunique())
- Modeling
  - Basic workflow
  - Decision tree model: DecisionTreeRegressor (define, load, split with train_test_split, fit, predict, evaluate with mean_absolute_error, example)
  - Random forest model: RandomForestRegressor (define, fit, predict, evaluate, examples 1 and 2)
  - Simple functions: a generic MAE function, random forest MAE
  - Complex functions: choosing decision-tree leaf nodes
- Pipelines: Pipeline (introduction, usage)
- Calculations: column mean (round), dates (datetime)
# Basic syntax

## Load data: pd.read_csv

Load a CSV file and store it as a pandas DataFrame:

```python
import pandas as pd

# Path of the file to read
iowa_file_path = "../input/home-data-for-ml-course/train.csv"
# Read the data and save it as a DataFrame (using train.csv as the example)
home_data = pd.read_csv(iowa_file_path)
```

## Dataset size: shape

```python
home_data.shape
```

Result:

```
(1460, 81)
```

## Column names: columns
```python
home_data.columns
```

Result:

```
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
      dtype='object')
```

## Preview a few rows: head()
View the first five rows of data:

```python
home_data.head()
```

Result: a table of the first five rows (omitted here).

## Summary statistics: describe()
Print a statistical summary of a DataFrame:

```python
# Print a summary of the home_data dataset
home_data.describe()
```

Result (note: the output shown below comes from the Melbourne housing data used later in this post):

```
              Rooms         Price      Distance      Postcode      Bedroom2  \
count  13580.000000  1.358000e+04  13580.000000  13580.000000  13580.000000
mean       2.937997  1.075684e+06     10.137776   3105.301915      2.914728
std        0.955748  6.393107e+05      5.868725     90.676964      0.965921
min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000
25%        2.000000  6.500000e+05      6.100000   3044.000000      2.000000
50%        3.000000  9.030000e+05      9.200000   3084.000000      3.000000
75%        3.000000  1.330000e+06     13.000000   3148.000000      3.000000
max       10.000000  9.000000e+06     48.100000   3977.000000     20.000000

           Bathroom           Car       Landsize  BuildingArea    YearBuilt  \
count  13580.000000  13518.000000   13580.000000   7130.000000  8205.000000
mean       1.534242      1.610075     558.416127    151.967650  1964.684217
std        0.691712      0.962634    3990.669241    541.014538    37.273762
min        0.000000      0.000000       0.000000      0.000000  1196.000000
25%        1.000000      1.000000     177.000000     93.000000  1940.000000
50%        1.000000      2.000000     440.000000    126.000000  1970.000000
75%        2.000000      2.000000     651.000000    174.000000  1999.000000
max        8.000000     10.000000  433014.000000  44515.000000  2018.000000

          Lattitude    Longtitude  Propertycount
count  13580.000000  13580.000000   13580.000000
mean     -37.809203    144.995216    7454.417378
std        0.079260      0.103916    4378.581772
min      -38.182550    144.431810     249.000000
25%      -37.856822    144.929600    4380.000000
50%      -37.802355    145.000100    6555.000000
75%      -37.756400    145.058305   10331.000000
max      -37.408530    145.526350   21650.000000
```
Interpreting the result: this is a statistical summary of the dataset, describing each column. The top row lists the columns; the left-hand side lists eight statistics computed for each column.

The first number, count, shows how many rows have non-missing values. Missing values arise for many reasons. For example, the size of a second bedroom (Bedroom2) would not be collected when surveying a one-bedroom house, so such a house does not contribute to Bedroom2's count.

The second value is mean, the average. std is the standard deviation, which measures how spread out the values are. To interpret min, 25%, 50%, 75% and max, imagine sorting each column from lowest to highest. The first (smallest) value is min and the last is max. A quarter of the way through the sorted list is the 25% value (with 10,000 rows, roughly the 2,500th value); the 50% and 75% values are defined analogously.
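These definitions can be checked directly on a small made-up series; the numbers below are purely illustrative. The 25% row of `describe()` is exactly the 0.25 quantile of the column.

```python
import pandas as pd

# A tiny illustrative series with no missing values
s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

desc = s.describe()
print(desc["count"])            # number of non-missing rows
print(desc["min"], desc["max"]) # smallest and largest values
# The 25% row of describe() is just the 0.25 quantile
print(desc["25%"] == s.quantile(0.25))
```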
# Basic operations

## Missing values

### Drop missing values: dropna

After dropping, it is worth checking `home_data.shape` to see how much data was removed.

**Drop a row if it contains any missing value**

If even one value in a row is missing, the row is treated as incomplete and removed. Before doing this, make sure no column is missing entirely (otherwise every row would be dropped).

```python
home_data = home_data.dropna(axis=0)
```

or, equivalently (`how="any"` is the default):

```python
home_data = home_data.dropna(axis=0, how="any")
```

**Drop a row only if all of its values are missing**

```python
home_data = home_data.dropna(axis=0, how="all")
```

**Drop a column with fewer than 10 non-null entries**

```python
home_data = home_data.dropna(axis="columns", thresh=10)
```

**Subset: drop rows where several given columns are all missing**

Remove the rows where both Alley and FireplaceQu are empty:

```python
home_data = home_data.dropna(axis="index", how="all", subset=["Alley", "FireplaceQu"])
```

### Drop columns with missing values after splitting: .drop
Suppose the training and validation sets are already split and a series of operations has been applied. To check whether dropping the columns that contain missing values gives a better MAE than keeping them, on the same split, we can use:

```python
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop those columns in both training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE after dropping columns with missing values:")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
```

### Imputation: SimpleImputer()
By default, SimpleImputer replaces missing values with the mean of each column. Its parameters:

- `missing_values`: int, float, str, np.nan (the default) or None; what counts as a missing value.
- `strategy`: "mean" (default), "median", "most_frequent" or "constant". "mean" fills a column's missing values with that column's mean, "median" with its median, "most_frequent" with its mode; "constant" fills with a user-defined value given by `fill_value`.
- `fill_value`: str or numeric, default None. Used when `strategy="constant"`. If left as None, missing values are replaced with 0 for numeric data and with the string "missing_value" for string or object data.
- `verbose`: int, default 0; controls the imputer's verbosity.
- `copy`: boolean, default True; work on a copy of the data. False modifies the data in place.
- `add_indicator`: boolean, default False. If True, appends extra 0/1 columns of the same length, where 0 marks a position that was not missing and 1 marks a position that was.
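A small sketch of `strategy="constant"`, `fill_value` and `add_indicator` on a made-up two-column frame (the column names here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame: one missing value in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# constant: every missing value becomes fill_value; add_indicator appends
# one 0/1 column per column that contained missing values
imp = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
out = pd.DataFrame(imp.fit_transform(df))

print(out.shape)  # (3, 4): 2 imputed columns + 2 indicator columns
```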
```python
from sklearn.impute import SimpleImputer

# Impute, producing new training and validation features (column names are lost for now)
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Put the real column names back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE after imputation:")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
```
### Extending imputation

We impute the missing values as before; in addition, for each column that had missing entries in the original dataset, we add a new column marking which entries were missing before imputation.

```python
# Make copies to avoid changing the original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull()
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull()

# Impute, producing new training and validation features (column names are lost for now)
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Put the real column names back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE after extended imputation:")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
```
## Selecting the target from the dataset

### Single target

Dot notation (only works when the column name is a valid identifier, e.g. an English name):

```python
y = home_data.Price
```

Brackets with a quoted name (works for any column name, including Chinese):

```python
y = home_data["Price"]
```

Result: Price is a column of the dataset; the full set of Price values is called the target, returned as a Series:

```
1       181500
2       223500
3       140000
4       250000
6       307000
         ...
1451    287090
1454    185000
1455    175000
1456    210000
1457    266500
```

By convention, the prediction target is named y.

### Multiple features
Define the features and select them: the variable X holds the dataset restricted to the two features 'LotArea' and 'LotConfig'.

```python
home_data_features = ["LotArea", "LotConfig"]
X = home_data[home_data_features]
```

Result:

```
      LotArea LotConfig
1        9600       FR2
2       11250    Inside
3        9550    Corner
4       14260       FR2
6       10084    Inside
...       ...       ...
1451     9262    Inside
1454     7500    Inside
1455     7917    Inside
1456    13175    Inside
1457     9042    Inside
```

By convention, the known feature data is named X.

## Output: to_csv
Generate a CSV file, submission.csv, containing Id and SalePrice:

```python
output = pd.DataFrame({"Id": test_data.Id, "SalePrice": test_preds})
output.to_csv("submission.csv", index=False)
```

## Categorical variables
If the data is not numeric, it needs special handling. As a rule of thumb, one-hot encoding usually performs best and dropping the categorical columns usually performs worst, but it varies from case to case.
### Drop categorical columns: select_dtypes()

Drop the non-numeric columns:

```python
drop_X_train = X_train.select_dtypes(exclude=["object"])
drop_X_valid = X_valid.select_dtypes(exclude=["object"])

print("MAE:")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
```

### Ordinal encoding: OrdinalEncoder()

```python
from sklearn.preprocessing import OrdinalEncoder

# Make copies to avoid changing the original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply the ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE:")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
```
### One-hot encoding: OneHotEncoder()

Set `handle_unknown="ignore"` to avoid errors when the validation data contains classes not present in the training data; `sparse=False` makes the encoder return the encoded columns as a NumPy array rather than a sparse matrix (in scikit-learn 1.2 and later this parameter is named `sparse_output`).

```python
from sklearn.preprocessing import OneHotEncoder

# Apply the one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove the original categorical columns (e.g. a Color column)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add the one-hot encoded columns (e.g. Red/Yellow/Green)
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE:")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
```
### Counting unique values: unique() and nunique()

unique() returns the distinct values after de-duplication, while nunique() directly returns the number of distinct values. With dropna=True (the default) missing values are excluded from the count; with dropna=False they are included.

```python
import pandas as pd
import numpy as np

s1 = pd.Series(["A", 7, 6, 3, 4, 1, 2, 3, 5, 4, 1, 1])
print("distinct values of s1, s1.unique():", s1.unique())
print("number of distinct values, len(s1.unique()):", len(s1.unique()))
print("number of distinct values, s1.nunique():", s1.nunique())

# When NaN / None are present
print("*" * 30)
s2 = pd.Series(["A", 7, 6, 3, np.nan, np.nan, 4, 1, 2, 3, 5, 4, 1, 1, pd.NaT, None])
print("distinct values of s2, s2.unique():", s2.unique())
print("number of distinct values, len(s2.unique()):", len(s2.unique()))
print("number of distinct values, s2.nunique():", s2.nunique())
print("number of distinct values (incl. missing), s2.nunique(dropna=False):", s2.nunique(dropna=False))
print("number of distinct values (excl. missing), s2.nunique(dropna=True):", s2.nunique(dropna=True))
```

Result:

```
distinct values of s1, s1.unique(): ['A' 7 6 3 4 1 2 5]
number of distinct values, len(s1.unique()): 8
number of distinct values, s1.nunique(): 8
******************************
distinct values of s2, s2.unique(): ['A' 7 6 3 nan 4 1 2 5 NaT None]
number of distinct values, len(s2.unique()): 11
number of distinct values, s2.nunique(): 8
number of distinct values (incl. missing), s2.nunique(dropna=False): 11
number of distinct values (excl. missing), s2.nunique(dropna=True): 8
```

# Modeling
## Basic workflow

1. Define: decide what type of model it will be (decision tree, random forest, etc.) and its basic parameters.
2. Fit: capture patterns from the provided dataset.
3. Predict: predict the values of interest.
4. Evaluate: determine how accurate the model's predictions are.
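The four steps can be sketched end to end on a tiny synthetic dataset; the data and column names below are illustrative, not from the course files:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Tiny synthetic dataset: one feature, one numeric target
X = pd.DataFrame({"size": [50, 60, 70, 80, 90, 100, 110, 120]})
y = pd.Series([100, 120, 140, 160, 180, 200, 220, 240])
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)          # 1. define
model.fit(train_X, train_y)                            # 2. fit
val_predictions = model.predict(val_X)                 # 3. predict
val_mae = mean_absolute_error(val_y, val_predictions)  # 4. evaluate
print(val_mae)
```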
## Decision tree model: DecisionTreeRegressor

The fitting step cannot handle non-numeric fields; if the dataset contains letters, symbols, Chinese characters, etc., they need special handling first.

### Definition

A decision tree is a non-parametric supervised learning method. It derives decision rules from data carrying features and labels and presents those rules as a tree structure, solving classification and regression problems. Each internal node of the tree tests an attribute, each branch represents one outcome of the test, and each leaf node represents a predicted result.

### Load the data

```python
from sklearn.tree import DecisionTreeRegressor

# Define the model; giving random_state a number guarantees the same result on every run
iowa_model = DecisionTreeRegressor(random_state=1)

# Prediction target: the price
y = home_data.SalePrice
# Model features
feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                 "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
# Define the feature set
X = home_data[feature_names]
```
### Split the data: train_test_split(X, y, random_state=0)

X: feature set; y: target; train_X: training features; val_X: validation features; train_y: training target; val_y: validation target. The random_state argument guarantees the same split on every run.

```python
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
```

Other parameters: `train_size`, the share of the data used for training (an absolute row count if given as an integer), and `test_size`, the share used for validation (likewise a count if an integer).
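The fraction-versus-integer behaviour of `train_size`/`test_size` can be checked quickly; the list below is purely illustrative:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# As fractions: 80% of the rows for training, 20% for validation
a_train, a_val = train_test_split(data, train_size=0.8, test_size=0.2, random_state=0)
print(len(a_train), len(a_val))  # 8 2

# As integers: absolute row counts instead of proportions
b_train, b_val = train_test_split(data, train_size=7, test_size=3, random_state=0)
print(len(b_train), len(b_val))  # 7 3
```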
### Fit: .fit(train_X, train_y)

```python
iowa_model.fit(train_X, train_y)
```

### Predict: .predict(val_X)

Get predictions on the validation data:

```python
val_predictions = iowa_model.predict(val_X)
```

### Evaluate: mean_absolute_error(val_y, val_predictions)

Compute the mean absolute error on the validation data:

```python
val_mae = mean_absolute_error(val_y, val_predictions)
```

### Example
https://www.kaggle.com/code/hyon666666/exercise-underfitting-and-overfitting?scriptVersionId=119421539

```python
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = "../input/home-data-for-ml-course/train.csv"

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE: {:,.0f}".format(val_mae))

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = min(scores, key=scores.get)

# Fill in argument to make optimal size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model
final_model.fit(X, y)
```

## Random forest model: RandomForestRegressor
### Define

```python
import pandas as pd

# Load the data
melbourne_file_path = "../input/melbourne-housing-snapshot/melb_data.csv"
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop the rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
# Choose the target and features
y = melbourne_data.Price
melbourne_features = ["Rooms", "Bathroom", "Landsize", "BuildingArea",
                      "YearBuilt", "Lattitude", "Longtitude"]
X = melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
```

### Fit: .fit(train_X, train_y)

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
```

### Predict: .predict(val_X)

```python
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
```

### Evaluate: mean_absolute_error(val_y, melb_preds)

Compute the mean absolute error on the validation data:

```python
val_mae = mean_absolute_error(val_y, melb_preds)
```

### Example 1
```python
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Set up filepaths
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")

# Import helpful libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the data, and separate the target
iowa_file_path = "../input/train.csv"
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice

# Create X (After completing the exercise, you can return to modify this line!)
features = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

# Select columns corresponding to features, and preview the data
X = home_data[features]
X.head()

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define a random forest model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
```
### Example 2

Get the data:

```python
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex1 import *
print("Setup Complete")
```

Split the data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv("../input/train.csv", index_col="Id")
X_test_full = pd.read_csv("../input/test.csv", index_col="Id")

# Obtain target and predictors
y = X_full.SalePrice
features = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
```

Preview part of the data:

```python
X_train.head()
```

```
     LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  TotRmsAbvGrd
Id
619    11694       2007      1828         0         2             3             9
871     6600       1962       894         0         1             2             5
93     13360       1921       964         0         1             2             5
818    13265       2002      1689         0         2             3             7
303    13704       2001      1541         0         2             3             6
```

Define five different random forest models:

```python
from sklearn.ensemble import RandomForestRegressor

# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion="mae", random_state=0)  # "absolute_error" in scikit-learn >= 1.0
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]
```

Define an MAE helper function:

```python
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)
```

Compute the MAE of each random forest:

```python
for i in range(0, len(models)):
    mae = score_model(models[i])
    print("Model %d MAE: %d" % (i + 1, mae))
```

## Simple functions
### A generic MAE function

model: the model to score. X_t is the parameter inside the function for the training features, and X_train is the outer variable holding them; thanks to the default `X_t=X_train`, you do not need to pass it when calling the function, as X_t automatically picks up the X_train defined above. The other defaults work the same way: y_t takes the training target y_train, and X_v/y_v take the validation features X_valid and validation target y_valid. Usage: `mae = score_model(model)`.

```python
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)
```

### Random forest MAE

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```

## Complex functions
### Choosing the number of decision-tree leaf nodes

Choosing too many or too few leaf nodes leads to overfitting or underfitting. Overfitting: capturing spurious patterns that will not recur in the future, making predictions less accurate. Underfitting: failing to capture the relevant patterns, again making predictions less accurate. A utility function helps compare the MAE scores of different values of max_leaf_nodes:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
```

Use a for loop to compare the accuracy of models built with different values of max_leaf_nodes:

```python
# Different max_leaf_nodes values give different MAEs
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
```

Result:

```
Max leaf nodes: 5        Mean Absolute Error: 347380
Max leaf nodes: 50       Mean Absolute Error: 258171
Max leaf nodes: 500      Mean Absolute Error: 243495
Max leaf nodes: 5000     Mean Absolute Error: 254983
```

From this, 500 is a reasonable number of leaf nodes.

A more compact version:

```python
# Candidate leaf-node counts
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# One line to compute the MAE for each candidate
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
# Pick the best leaf-node count
best_tree_size = min(scores, key=scores.get)
```

# Pipelines: Pipeline
## Introduction

A pipeline is a simple way to keep data preprocessing and modelling code organized. Specifically, a pipeline bundles the preprocessing and modelling steps together so you can use the whole bundle as if it were a single step.
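A minimal sketch of that idea, bundling an imputer and a model into one estimator; the toy arrays below are illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Toy numeric features with one missing entry
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 1.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# The pipeline behaves like a single model: fit and predict
# run the imputation step automatically
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(preds.shape)  # (4,)
```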
## Usage

Load the data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the training data
X_full = pd.read_csv("../input/train.csv", index_col="Id")
# Read the test data
X_test_full = pd.read_csv("../input/test.csv", index_col="Id")

# Drop the rows where SalePrice is missing
X_full.dropna(axis=0, subset=["SalePrice"], inplace=True)
# Put the SalePrice column into y
y = X_full.SalePrice
# Remove the SalePrice column from X_full
X_full.drop(["SalePrice"], axis=1, inplace=True)

# Break off a validation set from the training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
```

Select the numeric and string columns:
```python
# Select object-dtype columns (usually strings) with fewer than 10 distinct values;
# the cardinality limit keeps the categorical variables manageable
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]

# Select the int64 and float64 columns
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ["int64", "float64"]]
```

Create new training, validation and test sets:

```python
# Keep only the selected columns
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
```

Build the pipeline:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data: the "constant" imputation strategy
numerical_transformer = SimpleImputer(strategy="constant")

# Preprocessing for categorical data: "most_frequent" imputation plus one-hot encoding
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Bundle the preprocessing for numerical and categorical data;
# numerical_cols and categorical_cols are the column lists selected above
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# Define a random forest model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle the preprocessing and modelling code in a pipeline
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", model),
])

# Fit the model
clf.fit(X_train, y_train)

# Predict
preds = clf.predict(X_valid)

# Evaluate the model
print("MAE:", mean_absolute_error(y_valid, preds))
```

# Calculations
## Mean of a column: round

Compute the mean of a column, rounded to the nearest integer. home_data is the DataFrame loaded with pandas; LotArea is one of its columns.

```python
avg_lot_size = round(home_data["LotArea"].mean())
```

## Dates: datetime

Compute the age of the newest house as of today (the current year minus its build year). datetime.datetime.now().year is the current year; YearBuilt is the column holding the year each house was built.

```python
import datetime
newest_home_age = datetime.datetime.now().year - home_data["YearBuilt"].max()
```