
news 2026/1/16 23:38:54

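1. Project Overview

With the rapid growth of the mobile internet, the O2O (Online to Offline) model has become a highlight of e-commerce. Coupons are an effective marketing tool, widely used to attract new customers and reactivate existing ones. Traditional random distribution, however, is inefficient: it disturbs users and can damage the brand. Personalized coupon distribution is therefore key to better marketing results. This article walks through how to predict coupon usage with machine learning so that coupons can be targeted precisely.

2. Data Preparation

2.1 Data Sources and Collection

The dataset has an offline part and an online part. The offline data contains user ID, merchant ID, coupon ID, discount rate, distance, coupon receive date, and consumption date. The online data contains user ID, merchant ID, action type, coupon ID, discount rate, coupon receive date, and consumption date.

2.2 Data Preprocessing

Data preprocessing is a key step in machine learning. Missing values in the raw data are stored as the string 'null' and ultimately need to become np.nan. For the distance field, we first replace 'null' with -1 so the column can be cast to integers, then map -1 to np.nan, which pandas aggregations skip. Finally, we convert data types so that every numeric field has the correct type.

    # Handle missing values and outliers
    t2.replace('null', -1, inplace=True)
    t2.distance = t2.distance.astype(int)
    t2.replace(-1, np.nan, inplace=True)

3. Feature Engineering

Feature engineering is an important lever for model performance. We build features from the following angles.

3.1 Coupon-Related Features

- Coupon type (0 for a direct discount, 1 for a spend-threshold coupon)
- Coupon discount rate
- Minimum spend of a spend-threshold coupon
- Historical number of appearances
- Historical number of redemptions
- Historical redemption rate
- Historical redemption time rate
- Day of the week on which the coupon was received
- Day of the month on which the coupon was received
- Number of times the user has received this coupon
- Number of times the user has redeemed this coupon
- The user's redemption rate for this coupon

    import os
    from datetime import date

    import numpy as np
    import pandas as pd

    def get_coupon_related_feature(dataset3, filename='coupon3_feature'):
        # Convert a discount string to a rate: "0.9" stays 0.9, "100:10" becomes 1 - 10/100
        def calc_discount_rate(s):
            s = str(s)
            s = s.split(':')
            if len(s) == 1:
                return float(s[0])
            else:
                return 1.0 - float(s[1]) / float(s[0])

        # Spend threshold of a spend-threshold coupon (the "man" part of "man:jian")
        def get_discount_man(s):
            s = str(s)
            s = s.split(':')
            if len(s) == 1:
                return 'null'
            else:
                return int(s[0])

        # Amount taken off by a spend-threshold coupon (the "jian" part of "man:jian")
        def get_discount_jian(s):
            s = str(s)
            s = s.split(':')
            if len(s) == 1:
                return 'null'
            else:
                return int(s[1])

        # Whether the coupon is a spend-threshold coupon
        def is_man_jian(s):
            s = str(s)
            s = s.split(':')
            if len(s) == 1:
                return 0
            else:
                return 1.0

        # Day of the week the coupon was received (1 = Monday ... 7 = Sunday)
        dataset3['day_of_week'] = dataset3.date_received.astype(str).apply(
            lambda x: date(int(x[0:4]), int(x[4:6]), int(x[6:8])).weekday() + 1)
        # Day of the month the coupon was received
        dataset3['day_of_month'] = dataset3.date_received.astype(str).apply(lambda x: int(x[6:8]))
        # Days between the receive date and the reference date 2016-06-30
        # (the source describes this as the distance from the start of the month)
        dataset3['days_distance'] = dataset3.date_received.astype(str).apply(
            lambda x: (date(int(x[0:4]), int(x[4:6]), int(x[6:8])) - date(2016, 6, 30)).days)
        # Spend threshold of spend-threshold coupons
        dataset3['discount_man'] = dataset3.discount_rate.apply(get_discount_man)
        # Amount taken off by spend-threshold coupons
        dataset3['discount_jian'] = dataset3.discount_rate.apply(get_discount_jian)
        # Whether the coupon is a spend-threshold coupon
        dataset3['is_man_jian'] = dataset3.discount_rate.apply(is_man_jian)
        # Discount rate, with spend-threshold coupons converted to an equivalent rate
        dataset3['discount_rate'] = dataset3.discount_rate.apply(calc_discount_rate)
        # Total number of rows per coupon
        d = dataset3[['coupon_id']].copy()
        d['coupon_count'] = 1
        d = d.groupby('coupon_id').agg('sum').reset_index()
        dataset3 = pd.merge(dataset3, d, on='coupon_id', how='left')
        # Assumes a 'features' output directory exists
        dataset3.to_csv(os.path.join('features', filename + '.csv'), index=False)
        return dataset3

3.2 Merchant-Related Features

- Number of times the merchant's coupons were received
- Number of the merchant's coupons received but not redeemed
- Number of the merchant's coupons received and redeemed
- Redemption rate of the merchant's received coupons
- Mean/min/max discount rate of the merchant's redeemed coupons
- Number of distinct users who redeemed the merchant's coupons, and their share of the distinct users who received them
- Average number of the merchant's coupons redeemed per user
- Number of distinct coupons of the merchant that were redeemed
- Share of distinct redeemed coupons among all distinct coupons received
- Average number of redemptions per coupon type for the merchant
- Mean time rate of the merchant's redeemed coupons
- Mean/min/max user-merchant distance among the merchant's redeemed coupons

    def get_merchant_related_feature(feature3, filename='merchant3_feature'):
        merchant3 = feature3[['merchant_id', 'coupon_id', 'distance', 'date_received', 'date']]
        # Distinct merchants
        t = merchant3[['merchant_id']].drop_duplicates()
        # Total number of sales per merchant
        t1 = merchant3[merchant3.date != 'null'][['merchant_id']].copy()
        t1['total_sales'] = 1
        t1 = t1.groupby('merchant_id').agg('sum').reset_index()
        # Number of sales in which a coupon was redeemed
        t2 = merchant3[(merchant3.date != 'null') & (merchant3.coupon_id != 'null')][['merchant_id']].copy()
        t2['sales_use_coupon'] = 1
        t2 = t2.groupby('merchant_id').agg('sum').reset_index()
        # Total number of coupons issued by the merchant
        t3 = merchant3[merchant3.coupon_id != 'null'][['merchant_id']].copy()
        t3['total_coupon'] = 1
        t3 = t3.groupby('merchant_id').agg('sum').reset_index()
        # User-merchant distance for redeemed coupons, converted to numeric values
        t4 = merchant3[(merchant3.date != 'null') & (merchant3.coupon_id != 'null')][['merchant_id', 'distance']].copy()
        t4.replace('null', -1, inplace=True)
        t4.distance = t4.distance.astype(int)
        t4.replace(-1, np.nan, inplace=True)
        # Minimum user-merchant distance for redeemed coupons
        t5 = t4.groupby('merchant_id').agg('min').reset_index()
        t5.rename(columns={'distance': 'merchant_min_distance'}, inplace=True)
        # Maximum user-merchant distance for redeemed coupons
        t6 = t4.groupby('merchant_id').agg('max').reset_index()
        t6.rename(columns={'distance': 'merchant_max_distance'}, inplace=True)
        # Mean user-merchant distance for redeemed coupons
        t7 = t4.groupby('merchant_id').agg('mean').reset_index()
        t7.rename(columns={'distance': 'merchant_mean_distance'}, inplace=True)
        # Median user-merchant distance for redeemed coupons
        t8 = t4.groupby('merchant_id').agg('median').reset_index()
        t8.rename(columns={'distance': 'merchant_median_distance'}, inplace=True)
        # Merge the features above
        merchant3_feature = pd.merge(t, t1, on='merchant_id', how='left')
        merchant3_feature = pd.merge(merchant3_feature, t2, on='merchant_id', how='left')
        merchant3_feature = pd.merge(merchant3_feature, t3, on='merchant_id', how='left')
        merchant3_feature = pd.merge(merchant3_feature, t5, on='merchant_id', how='left')
        merchant3_feature = pd.merge(merchant3_feature, t6, on='merchant_id', how='left')
        merchant3_feature = pd.merge(merchant3_feature, t7, on='merchant_id', how='left')
        # (the listing is cut off at this point in the source)

4. Dataset Visualization

4.1 Class Distribution of the Prediction Label

Samples labeled 1 make up a very small share of the data, so this is a binary classification problem with extreme class imbalance.

4.2 Feature Correlation Analysis

4.3 Distribution of Total Sales per Merchant

4.4 Distribution of Days Between Coupon Receipt and the Start of the Month

There are too many features to show in limited space, so only some of the feature distributions are visualized here.

5. Splitting the Training and Validation Sets

The competition has ended, so we split the data manually into training, validation, and test sets. The test set is used to compare the performance of the different models.

    df_columns = dataset12_x.columns.values
    print('feature count: {}'.format(len(df_columns)))

    X_train, X_valid, y_train, y_valid = train_test_split(dataset12_x, dataset12_y, test_size=0.1, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
    print('train: {}, valid: {}, test: {}'.format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

    feature count: 53
    train: 327856, valid: 40477, test: 36429

6. Modeling with Xgboost

Xgboost is an efficient gradient boosting framework that handles classification, regression, and other machine learning tasks. It improves accuracy by combining many weak learners (usually decision trees) while optimizing a loss function.

    xgb_params = {
        'eta': 0.01,
        'min_child_weight': 20,
        'colsample_bytree': 0.5,
        'max_depth': 15,
        'subsample': 0.9,
        'lambda': 2.0,
        'eval_metric': 'auc',
        'objective': 'binary:logistic',
        'nthread': -1,
        'silent': 1,  # deprecated in newer XGBoost releases in favor of 'verbosity'
        'booster': 'gbtree'
    }

    # dtrain and watchlist are built earlier (not shown in the source)
    pre_xgb_model = xgb.train(dict(xgb_params),
                              dtrain,
                              evals=watchlist,
                              verbose_eval=50)

Cross-validation to find the best number of boosting rounds:

    print('--- cv train to choose best_num_boost_round')
    cv_result = xgb.cv(dict(xgb_params),
                       dtrain,
                       num_boost_round=5000,
                       early_stopping_rounds=100,
                       verbose_eval=100,
                       show_stdv=False)
    best_num_boost_rounds = len(cv_result)
    # Average the AUC over the last rounds before early stopping
    mean_train_auc = cv_result.loc[best_num_boost_rounds-11 : best_num_boost_rounds-1, 'train-auc-mean'].mean()
    mean_test_auc = cv_result.loc[best_num_boost_rounds-11 : best_num_boost_rounds-1, 'test-auc-mean'].mean()
    print('best_num_boost_rounds = {}'.format(best_num_boost_rounds))
    print('mean_train_auc = {:.7f} , mean_test_auc = {:.7f}\n'.format(mean_train_auc, mean_test_auc))

    [0]     train-auc:0.87954   test-auc:0.87309
    [100]   train-auc:0.90277   test-auc:0.89217
    [200]   train-auc:0.90981   test-auc:0.89533
    [300]   train-auc:0.91590   test-auc:0.89786
    [400]   train-auc:0.92089   test-auc:0.89978
    [500]   train-auc:0.92522   test-auc:0.90138
    [600]   train-auc:0.92873   test-auc:0.90252
    [700]   train-auc:0.93169   test-auc:0.90334
    [800]   train-auc:0.93411   test-auc:0.90396
    [900]   train-auc:0.93610   test-auc:0.90444
    [1000]  train-auc:0.93786   test-auc:0.90482
    [1100]  train-auc:0.93937   test-auc:0.90512
    [1200]  train-auc:0.94078   test-auc:0.90540
    [1300]  train-auc:0.94218   test-auc:0.90564
    [1400]  train-auc:0.94347   test-auc:0.90583
    [1500]  train-auc:0.94468   test-auc:0.90595
    [1600]  train-auc:0.94578   test-auc:0.90607
    [1700]  train-auc:0.94686   test-auc:0.90616
    [1800]  train-auc:0.94787   test-auc:0.90626
    [1900]  train-auc:0.94886   test-auc:0.90632
    [2000]  train-auc:0.94986   test-auc:0.90636

6.1 Feature Importance Analysis

6.2 Performance Evaluation

    # Predict on the training set
    predict_train = xgb_model.predict(dtrain)
    after_xgb_train_auc = evaluate_score(predict_train, y_train)

    # Predict on the validation set
    predict_valid = xgb_model.predict(dvalid)
    after_xgb_valid_auc = evaluate_score(predict_valid, y_valid)

    # Predict on the test set
    dtest = xgb.DMatrix(X_test, feature_names=df_columns)
    predict_test = xgb_model.predict(dtest)
    after_xgb_test_auc = evaluate_score(predict_test, y_test)

    print('train auc = {:.7f} , valid auc = {:.7f} , test auc = {:.7f}\n'.format(after_xgb_train_auc, after_xgb_valid_auc, after_xgb_test_auc))

    train auc = 0.9042264 , valid auc = 0.8958611 , test auc = 0.8960916

6.3 Model Performance Before and After Tuning

After tuning, the AUC improves to varying degrees on the training, validation, and test sets.

6.4 ROC Curve of the Predictions

7. Modeling with Random Forest (RandomForest)

Random forest is an ensemble learning method that builds many decision trees and aggregates their predictions to improve overall performance. It handles high-dimensional data well and has some resistance to overfitting.

Selecting hyperparameters with RandomizedSearchCV:

    # Create a classifier (or regressor)
    rf_clf = RandomForestClassifier()

    # Parameter search space (list or distribution)
    param_dist = {
        'n_estimators': [100, 500, 1000, 1500, 2000],
        'max_depth': [3, 5, 8, 12, 15],
        'max_features': [2, 5, 10],
        'min_samples_split': [2, 4, 6, 8, 10, 12],
        'bootstrap': [True, False],
        'criterion': ['gini', 'entropy'],
    }

    n_iter_search = 20
    random_search_cv = RandomizedSearchCV(rf_clf,
                                          param_distributions=param_dist,
                                          n_iter=n_iter_search,
                                          cv=5,
                                          n_jobs=-1,
                                          verbose=1)

Training the RF model with the best parameters:

    rf_model = RandomForestClassifier(
        n_estimators=3000,
        criterion='gini',
        max_depth=12,
        min_samples_split=1000,
        min_samples_leaf=6,
        min_weight_fraction_leaf=0.0,
        max_features='sqrt',
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        bootstrap=True,
        n_jobs=-1,
        random_state=42,
        verbose=1,
        warm_start=False,
        max_samples=None
    )

    train auc = 0.6242629 , valid auc = 0.6232508 , test auc = 0.6194697

8. Stochastic Gradient Descent (SGD)

SGD is an optimization algorithm that updates model parameters using randomly selected samples, which reduces computation and speeds up convergence. It suits large-scale and online machine learning tasks. The SGD model is built and evaluated in the same way as above; the details are omitted here.

9. Model Comparison

    import matplotlib.pyplot as plt
    import numpy as np

    splits = ['Training set', 'Validation set', 'Test set']
    model_aucs = {
        'Xgboost': (xgb_train_auc, xgb_valid_auc, xgb_test_auc),
        'RandomForest': (rf_train_auc, rf_valid_auc, rf_test_auc),
        'SGD': (sgd_train_auc, sgd_valid_auc, sgd_test_auc),
    }

    x = np.arange(len(splits))  # one group of bars per data split
    width = 0.25                # width of a single bar
    multiplier = 0

    fig, ax = plt.subplots(layout='constrained', figsize=(30, 15))

    # Draw one bar per model inside each group, shifted by the bar width
    for attribute, measurement in model_aucs.items():
        offset = width * multiplier
        rects = ax.bar(x + offset, measurement, width, label=attribute)
        ax.bar_label(rects, padding=3, fontsize=26)
        multiplier += 1

    ax.set_ylabel('AUC', fontsize=26)
    ax.set_title('Evaluation performance of the different models', fontsize=40)
    ax.set_xticks(x + width, splits, fontsize=26)
    ax.legend(loc='upper left', fontsize=26)
    ax.set_ylim(0, 1.5)

    plt.show()

We compared three models: Xgboost, random forest, and SGD. Xgboost achieves a higher AUC than the other two models on the training, validation, and test sets.

10. Conclusion

By analyzing user behavior and coupon usage, we built a machine-learning model that predicts coupon usage. The model predicts whether a user will redeem a coupon they have received, which helps businesses target their marketing campaigns more precisely. Future work could refine feature selection, tune model parameters further, or try other types of machine learning algorithms to improve prediction accuracy.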
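Every score reported above is an AUC obtained through an `evaluate_score` helper that the listings never define. As a rough illustration of what such a metric measures, the sketch below computes ROC AUC directly from its pairwise definition: the share of (positive, negative) sample pairs in which the positive sample gets the higher score. The function name `auc_score` and the toy labels and scores are hypothetical and not taken from the original project; this is a dependency-free sketch, not the author's implementation.

```python
def auc_score(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    Hypothetical helper for illustration; not the evaluate_score from the article.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (made up): two unredeemed coupons (0) and two redeemed coupons (1)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))
```

The double loop is quadratic in the number of samples, so it only suits small examples. On a dataset of the size used here, a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice. Because AUC depends only on how samples are ranked, it remains informative under the heavy class imbalance noted in section 4.1, where plain accuracy would not be.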


http://www.dnsts.com.cn/news/63782.html