湖南网站建设开发,wordpress域名配置,学校网站建设问卷调查,移动商城个人中心更新时间#xff1a;2023-2-19
相关链接
#xff08;1#xff09;2023年美赛C题Wordle预测问题一建模及Python代码详细讲解 #xff08;2#xff09;2023年美赛C题Wordle预测问题二建模及Python代码详细讲解 #xff08;3#xff09;2023年美赛C题Wordle预测问题三、四…
更新时间2023-2-19
相关链接
12023年美赛C题Wordle预测问题一建模及Python代码详细讲解 22023年美赛C题Wordle预测问题二建模及Python代码详细讲解 32023年美赛C题Wordle预测问题三、四建模及Python代码详细讲解 42023年美赛C题Wordle预测问题25页论文
1 数据分析与特征工程
1将2023-3-1的EERIR样本加入到数据集中和所有数据集一起预处理和做特征工程。
特征工程中和问题一第二问中类同的是提取了’w1’,‘w2’,‘w3’,‘w4’,‘w5’,‘Vowel_fre’,Consonant_fre’这几个特征再加上时间特征包括年、月、日、季节、样本序号。
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as snsdf pd.read_excel(data/Problem_C_Data_Wordle.xlsx,header1)
data df.drop(columnsUnnamed: 0)
data.loc[len(data)] [2023-3-1,np.nan,eerie,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
data[Date] pd.to_datetime(data[Date])
df data.copy()
df[Words] df[Word].apply(lambda x:str(list(x))[1:-1].replace(,).replace( ,))
df[w1], df[w2],df[w3], df[w4],df[w5] df[Words].str.split(,,n4).str
dfsmall [str(chr(i)) for i in range(ord(a),ord(z)1)]
letter_map dict(zip(small,range(1,27)))
letter_map{‘a’: 1, ‘b’: 2, ‘c’: 3, ‘d’: 4, ‘e’: 5, ‘f’: 6, ‘g’: 7, ‘h’: 8, ‘i’: 9, ‘j’: 10, ‘k’: 11, ‘l’: 12, ‘m’: 13, ‘n’: 14, ‘o’: 15, ‘p’: 16, ‘q’: 17, ‘r’: 18, ‘s’: 19, ‘t’: 20, ‘u’: 21, ‘v’: 22, ‘w’: 23, ‘x’: 24, ‘y’: 25, ‘z’: 26} df[w1] df[w1].map(letter_map)
df[w2] df[w2].map(letter_map)
df[w3] df[w3].map(letter_map)
df[w4] df[w4].map(letter_map)
df[w5] df[w5].map(letter_map)
df.set_index(Date,inplaceTrue)
df.sort_index(ascendingTrue,inplaceTrue)df2统计元音辅音频率
Vowel [a,e,i,o,u]
Consonant list(set(small).difference(set(Vowel)))
def count_Vowel(s):c 0for i in range(len(s)):if s[i] in Vowel:c1return c
def count_Consonant(s):c 0for i in range(len(s)):if s[i] in Consonant:c1return cdf[Vowel_fre] df[Word].apply(lambda x:count_Vowel(x))
df[Consonant_fre] df[Word].apply(lambda x:count_Consonant(x)) 3提取时间特征
df[year] df.index.year
df[qtr] df.index.quarter
df[mon] df.index.month
df[week] df.index.week
df[day] df.index.weekday
df[ix] range(0,len(data))
time_features [year,qtr,mon,week,day,ix]
df3构造数据集
from sklearn.preprocessing import StandardScaler
features [w1,w2,w3,w4,w5,Vowel_fre,Consonant_fre]time_features
label [1 try,6 tries,6 tries,6 tries,6 tries,6 tries,7 or more tries (X)]
Trian_all df[featureslabel].copy().dropna()
X Trian_all[features]# 标准化
ss StandardScaler()
X_1 ss.fit_transform(X)
Y_1 Trian_all[label[0]]
Y_2 Trian_all[label[1]]
Y_3 Trian_all[label[2]]
Y_4 Trian_all[label[3]]
Y_5 Trian_all[label[4]]
Y_6 Trian_all[label[5]]
Y_7 Trian_all[label[6]]
Trian_all 2 模型预测与评估 from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegressionX_train, X_test, y_train, y_test train_test_split(X_1,Y_1, test_size0.1, random_state0)reg LinearRegression().fit(X_train, y_train)
p_pred reg.predict(X_test)
test_df pd.DataFrame(y_test,columnslabel)
test_df[pred_1] p_pred
# 计算误差
from sklearn.metrics import mean_squared_error
RMSE_1 np.sqrt(mean_squared_error(test_df[label[0]],test_df[pred_1]))
print(f第1个模型RMSE误差是{RMSE_1})
# 预测结果可视化
test_df[[label[0],pred_1]].plot()
plt.legend()
plt.savefig(img/3.png,dpi300)
plt.show()然后训练7个回归模型。
第1个模型RMSE误差是0.901305736956438
剩余的6个模型复制粘贴代码改一下标签就行。 3 预测EERIE难度
# 训练所有模型from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
features [w1,w2,w3,w4,w5,Vowel_fre,Consonant_fre]time_features
label [1 try,2 tries,3 tries,4 tries,5 tries,6 tries,7 or more tries (X)]
Trian_all df[featureslabel].copy().dropna()
X Trian_all[features]# 标准化
ss StandardScaler()
X_1 ss.fit_transform(X)
Y_1 Trian_all[label[0]]
Y_2 Trian_all[label[1]]
Y_3 Trian_all[label[2]]
Y_4 Trian_all[label[3]]
Y_5 Trian_all[label[4]]
Y_6 Trian_all[label[5]]
Y_7 Trian_all[label[6]]reg1 LinearRegression().fit(X_1, Y_1)
reg2 LinearRegression().fit(X_1, Y_2)
reg3 LinearRegression().fit(X_1, Y_3)
reg4 LinearRegression().fit(X_1, Y_4)
reg5 LinearRegression().fit(X_1, Y_5)
reg6 LinearRegression().fit(X_1, Y_6)
reg7 LinearRegression().fit(X_1, Y_7)X_pred ss.fit_transform(np.array(df.loc[2023-3-1][features]).reshape(1,-1))
p_pred1 reg1.predict(X_pred)
p_pred2 reg2.predict(X_pred)
p_pred3 reg3.predict(X_pred)
p_pred4 reg4.predict(X_pred)
p_pred5 reg5.predict(X_pred)
p_pred6 reg6.predict(X_pred)
p_pred7 reg7.predict(X_pred)print(p_pred1,p_pred2,p_pred3,p_pred4,p_pred5,p_pred6,p_pred7)进行预测3月1号的EERIR的百分比结果如下
[0.46327684] [5.77683616] [22.67231638] [32.97457627] [23.68361582] [11.5819209] [2.81920904]
评价模型的好坏除了上面的RMSE还有MAPE、MAE、MSE等误差的计算方法都可以计算一下做一个表格来评价模型。
改进的地方就是可以对比其他的机器学习回归模型比如KNN回归、随机森林回归等。
3 Code
Code获取在浏览器中输入betterbench.top/#/40/detail或者Si我其他问题在我主页查看或者文章首部点击