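Progress 24/12/15

Yesterday's review
  • Intermediate Machine Learning: categorical variables
  • Read two articles on how to ask good questions, then posted a question in the Q&A area
  • Hands-on: completed the Housing Prices Competition for Kaggle Learn Users from start to finish on my own and submitted successfully
  • Intermediate Machine Learning: pipelines (I had previously mistranslated "pipeline" as "workflow")

Today's plan
  • Intermediate Machine Learning: cross-validation
  • Intermediate Machine Learning: XGBoost
  • Intermediate Machine Learning: data leakage
  • Use everything above to improve the competition score

Cross-Validation

Cross-validation gives a more reliable measure of model performance. The larger the validation set, the more reliable the evaluation; but for a dataset of fixed size, a larger validation set means a smaller training set, which is not what we want.

Cross-validation splits the data into several folds and runs one experiment per fold, each time holding out a different fold as the validation set, so that every known data point is used for validation once.

The advantage is reliability; the drawback is that the cost is multiplied by the number of folds. If a single run takes an acceptable amount of time, cross-validation is a good choice. If a run is slow and the dataset is already large, cross-validation is not worth it.

Using cross-validation to choose the best parameter:

    # Keep only the numeric columns.
    # y (the target) is assumed to have been separated from train_data earlier.
    numeric_cols = [cname for cname in train_data.columns
                    if train_data[cname].dtype in ['int64', 'float64']]
    X = train_data[numeric_cols].copy()
    X_test = test_data[numeric_cols].copy()

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score

    def get_score(n_estimators):
        """Return the average MAE over 3 CV folds of random forest model.

        Keyword argument:
        n_estimators -- the number of trees in the forest
        """
        my_pipeline = Pipeline(steps=[
            ('preprocessor', SimpleImputer()),
            ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
        ])
        # scikit-learn reports negative MAE, so flip the sign
        scores = -1 * cross_val_score(my_pipeline, X, y,
                                      cv=3,
                                      scoring='neg_mean_absolute_error')
        return scores.mean()

    # Try 50, 100, ..., 400 trees and record the mean MAE for each
    n_list = list(range(50, 401, 50))
    results = {}
    for ns in n_list:
        mean_s = get_score(ns)
        results[ns] = mean_s
    print(results)

    import matplotlib.pyplot as plt
    %matplotlib inline

    plt.plot(list(results.keys()), list(results.values()))
    plt.show()

As a follow-up, the hyperparameter optimization course is worth studying, starting with grid search.

XGBoost

According to the course, gradient boosting is the most accurate modeling technique for structured data, and it has achieved state-of-the-art results on a variety of datasets in Kaggle competitions.

A random forest essentially learns with many individual decision trees, which makes it an ensemble method. Another ensemble method is gradient boosting.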
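To make the idea of gradient boosting concrete, here is a minimal sketch in plain Python. It assumes squared-error loss, a single numeric feature, and one-split "stumps" as the weak learners, so each new stump is simply fit to the residuals of the ensemble so far. The function names (`fit_stump`, `boost`, `predict`, `mse`) and the toy data are hypothetical and are not taken from the course or from XGBoost, which is far more elaborate.

```python
# Toy gradient boosting for squared-error loss (illustrative only).

def fit_stump(xs, residuals):
    """Fit a one-split regressor: a threshold plus a constant on each side."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return float("inf"), mean, mean
    return best[1], best[2], best[3]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly add a stump fit to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
few = boost(xs, ys, n_estimators=5)
many = boost(xs, ys, n_estimators=100)
print("training MSE after 5 rounds:", mse(few, xs, ys))
print("training MSE after 100 rounds:", mse(many, xs, ys))
```

In this sketch `n_estimators` and `learning_rate` play roles loosely analogous to the XGBRegressor parameters of the same name: smaller steps generally need more rounds. The numbers printed are training error only, so they say nothing about overfitting.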
Basic procedure: start with a simple base model, use it to make predictions, and compute the loss. Then use that loss to fit a new model. Specifically, the new model's parameters are chosen so that adding it to the ensemble reduces the loss.

XGBoost stands for extreme gradient boosting, an implementation focused on performance and speed.

    from xgboost import XGBRegressor

    my_model = XGBRegressor()
    my_model.fit(X_train, y_train)

    # More parameters: number of boosting rounds, learning rate, and parallel jobs
    my_model = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=4)
    # Note: newer XGBoost releases may expect early_stopping_rounds in the
    # XGBRegressor constructor rather than in fit().
    my_model.fit(X_train, y_train,
                 early_stopping_rounds=5,        # stop automatically when the validation score stops improving
                 eval_set=[(X_valid, y_valid)],  # data used to evaluate each round
                 verbose=False)

Data Leakage

Data leakage makes a model look very accurate during training and validation, but its accuracy is poor once it is used to make real predictions.

There are two types of data leakage: target leakage and train-test contamination.

Target leakage

Target leakage is a matter of timing, that is, the chronological order in which data becomes available. Any data generated after the moment the target is determined must not appear among the predictors.

Example: people who get sick take antibiotics. If whether a person took antibiotics appears in the training data, the model can use it during training and validation to tell accurately who got sick. At prediction time, however, whether a person will get sick in the future has no necessary connection to whether they are taking antibiotics now, so what the model learned turns out to be wrong.

Train-test Contamination

This kind of leakage occurs when validation or test data influences the training process in some way. It can be hard to notice, so pay attention to when preprocessing happens relative to the split (for example, fitting an imputer on the full dataset before splitting).

One piece of advice from the course: When using cross-validation, it’s even more critical that you do your preprocessing inside the pipeline!

Consider the following dataset:

  • card: 1 if credit card application accepted, 0 if not
  • reports: Number of major derogatory reports
  • age: Age in years plus twelfths of a year
  • income: Yearly income (divided by 10,000)
  • share: Ratio of monthly credit card expenditure to yearly income
  • expenditure: Average monthly credit card expenditure
  • owner: 1 if owns home, 0 if rents
  • selfempl: 1 if self-employed, 0 if not
  • dependents: 1 + number of dependents
  • months: Months living at current address
  • majorcards: Number of major credit cards held
  • active: Number of active credit accounts

    expenditures_cardholders = X.expenditure[y]
    expenditures_noncardholders = X.expenditure[~y]

    print('Fraction of those who did not receive a card and had no expenditures: %.2f' \
          % ((expenditures_noncardholders == 0).mean()))
    print('Fraction of those who received a card and had no expenditures: %.2f' \
          % ((expenditures_cardholders == 0).mean()))

Output:

    Fraction of those who did not receive a card and had no expenditures: 1.00
    Fraction of those who received a card and had no expenditures: 0.02

    # Drop the potentially leaky predictors
    potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
    X2 = X.drop(potential_leaks, axis=1)

    # Evaluate the model with leaky predictors removed
    cv_scores = cross_val_score(my_pipeline, X2, y,
                                cv=5,
                                scoring='accuracy')

    print("Cross-val accuracy: %f" % cv_scores.mean())
    # Accuracy drops substantially

This usually happens only with datasets you build yourself; standard datasets rarely have the problem. If you cannot trace the origin of every column in detail, excluding everything that might leak may be the better choice.

Another useful rule of thumb: any data that you could obtain in the same way in the real prediction setting does not count as leakage when used for training.

In a real application you also need to consider whether the predictions are actually useful.

An example that deepens understanding:

Step 4: Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea. Take all surgeries by each surgeon and calculate the infection rate among those surgeons. For each patient in the data, find out who the surgeon was and plug in that surgeon’s average infection rate as a feature.

Does this pose any target leakage issues? Does it pose any train-test contamination issues?

This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful). You have target leakage if a given patient’s outcome contributes to the infection rate for his surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon’s infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky. You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test-set.
The result would be that your model could look very accurate on the test set, even if it wouldn’t generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.

A very helpful example. Intuitively the feature looks harmless, but because data on this surgery is scarce to begin with, a single outcome can noticeably affect the feature. With a large amount of data, whether one particular patient was infected would barely change the rate. In principle, though, whenever the outcome takes part in computing a variable used for prediction, that is data leakage, and in this example leakage has unquestionably occurred.

Hands-on: XGBoost with a pipeline

    # This Python 3 environment comes with many helpful analytics libraries installed
    # It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
    # For example, here's several helpful packages to load

    import numpy as np  # linear algebra
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

    # Input data files are available in the read-only "../input/" directory
    # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

    # You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
    # You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

    # Load original data
    from sklearn.model_selection import train_test_split

    X_full = pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv')
    X_test = pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv')

    # Drop rows without a target, then separate the target from the predictors
    X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
    y = X_full.SalePrice
    X_full.drop(['SalePrice'], axis=1, inplace=True)

    # X_train, X_valid, y_train, y_valid = train_test_split(X_full, y, train_size=0.8, test_size=0.2,
    #                                                       random_state=0)

    print("Load data successfully.")

    # print(X_full.isnull().sum()[X_full.isnull().sum() > 0])
    # # Drop columns with too many missing values
    # X_drop_cols = [col for col in X_full.columns if X_full[col].isnull().sum() > 100]
    # X_full.drop(X_drop_cols, axis=1, inplace=True)

    numerical_cols = [col for col in X_full.columns if X_full[col].dtype in ['int64', 'float64']]
    categorical_cols = [col for col in X_full.columns if X_full[col].dtype == 'object']

    # print(X_drop_cols)

    # Define the pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.model_selection import cross_val_score

    # Numeric columns: fill missing values with a constant
    numerical_transformer = SimpleImputer(strategy='constant')

    # Categorical columns: fill with the most frequent value, then one-hot encode
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('one_hot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ]
    )

    def get_score(model):
        """Return the mean MAE of the given model over 3 CV folds."""
        my_pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('model', model)
        ])
        scores = -1 * cross_val_score(my_pipeline, X_full, y,
                                      cv=3,
                                      scoring='neg_mean_absolute_error')
        return scores.mean()

    print("get_score defined.")

    # Pick the best model
    from xgboost import XGBRegressor
    # my_model = XGBRegressor(n_estimators=2000,
    #                         learning_rate=0.01,
    #                         random_state=0,
    #                         n_jobs=4)
    # s = get_score(my_model)
    # print(f"MAE is {s}")

Cross-validated MAE for each configuration tried:

Handling of columns with missing values
  • Original model: 17468
  • Drop columns with more than 10 missing values: 17562
  • Drop columns with more than 40 missing values: 17524
  • Drop columns with more than 100 missing values: 17516

No columns dropped, varying n_estimators (originally 500)
  • 200: 17489
  • 300: 17467
  • 400: 17463
  • 450: 17467

Learning rate changed from 0.05 to 0.01, varying n_estimators
  • 450: 17818
  • 600: 17504
  • 700: 17403
  • 800: 17343
  • 900: 17319
  • 1000: 17307
  • 1500: 17271
  • 2000: 17268

    final_model = XGBRegressor(n_estimators=2000,
                               learning_rate=0.01,
                               random_state=0,
                               n_jobs=4)
    final_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', final_model)
    ])

    # Train on all labeled data, then predict on the competition test set
    final_pipeline.fit(X_full, y)

    predictions = final_pipeline.predict(X_test)
    print("Predictions on test set:", predictions)

    output = pd.DataFrame({'Id': X_test.Id,
                           'SalePrice': predictions})
    output.to_csv('submission.csv', index=False)
    print("Sub saved")

Final loss: 14898, which moved the ranking to 140/4711.
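Returning to the surgeon example in the data leakage section: the quoted answer says target leakage can be avoided by computing each surgeon's infection rate from only the surgeries that came before the patient being predicted, and that this "may be a little tricky". One way this could be done is sketched below in plain Python. The function name `prior_infection_rates`, the tuple layout, and the toy records are assumptions made for illustration; they do not come from the course.

```python
from collections import defaultdict

def prior_infection_rates(surgeries):
    """Compute a leakage-free surgeon feature.

    surgeries: list of (surgeon_id, infected) tuples in chronological order,
               where infected is 0 or 1.
    Returns one value per surgery: the surgeon's infection rate over their
    EARLIER surgeries only, or None if the surgeon has no earlier surgery.
    """
    counts = defaultdict(int)       # surgeries seen so far, per surgeon
    infections = defaultdict(int)   # infections seen so far, per surgeon
    rates = []
    for surgeon, infected in surgeries:
        if counts[surgeon] == 0:
            rates.append(None)
        else:
            rates.append(infections[surgeon] / counts[surgeon])
        # Update the running totals only after the feature has been emitted,
        # so a patient's own outcome never feeds into their own feature.
        counts[surgeon] += 1
        infections[surgeon] += int(infected)
    return rates

# Hypothetical toy history: (surgeon, infected)
history = [("A", 0), ("B", 1), ("A", 1), ("A", 0), ("B", 0)]
print(prior_infection_rates(history))
# [None, None, 0.0, 0.5, 1.0]
```

To also avoid train-test contamination, the running totals would have to be built from training rows only, with test rows receiving a rate computed from that training history. The `None` entries would still need an imputation strategy, which this sketch leaves open.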