网站计数器php,电脑平面制图入门教程,.电子商务网站规划,网站建设成本报表目录 0. 本栏目竞赛汇总表1. 本文主旨2. AI工程架构3. 数据预处理模块3.1 配置数据路径和处理参数3.2 配置API参数3.3 配置输出路径 4. AI并行处理模块4.1 定义LLM客户端类4.2 定义数据处理函数4.3 定义JSON保存函数4.4 定义数据分片函数4.5 定义分片处理函数4.5 定义文件名排序… 目录 0. 本栏目竞赛汇总表1. 本文主旨2. AI工程架构3. 数据预处理模块3.1 配置数据路径和处理参数3.2 配置API参数3.3 配置输出路径 4. AI并行处理模块4.1 定义LLM客户端类4.2 定义数据处理函数4.3 定义JSON保存函数4.4 定义数据分片函数4.5 定义分片处理函数4.5 定义文件名排序函数 5. 数据整合模块5.1 加载数据并生成分片5.2 初始化LLM客户端并测试5.3 并行处理数据生成5.4 合并处理结果5.5 保存最终结果 0. 本栏目竞赛汇总表
Kaggle竞赛汇总
1. 本文主旨
大白话由于在上一篇文章的数据探索中我们发现了部分训练数据的错误解释存在缺失因此直接使用GPT_4o人设提示词工程对训练集数据存在的错误解释缺失问题的处理。通过本文可收获技能API调用AI接口、人设提示词工程案例、复杂的数据处理与缓存处理。上文回顾Eedi大模型蒸馏方案01-竞赛信息解读与数据理解
2. AI工程架构 #mermaid-svg-WrBrU24qK69ILMTS {font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}#mermaid-svg-WrBrU24qK69ILMTS .error-icon{fill:#552222;}#mermaid-svg-WrBrU24qK69ILMTS .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-WrBrU24qK69ILMTS .edge-thickness-normal{stroke-width:2px;}#mermaid-svg-WrBrU24qK69ILMTS .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-WrBrU24qK69ILMTS .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-WrBrU24qK69ILMTS .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-WrBrU24qK69ILMTS .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-WrBrU24qK69ILMTS .marker{fill:#333333;stroke:#333333;}#mermaid-svg-WrBrU24qK69ILMTS .marker.cross{stroke:#333333;}#mermaid-svg-WrBrU24qK69ILMTS svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-WrBrU24qK69ILMTS .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-WrBrU24qK69ILMTS .cluster-label text{fill:#333;}#mermaid-svg-WrBrU24qK69ILMTS .cluster-label span{color:#333;}#mermaid-svg-WrBrU24qK69ILMTS .label text,#mermaid-svg-WrBrU24qK69ILMTS span{fill:#333;color:#333;}#mermaid-svg-WrBrU24qK69ILMTS .node rect,#mermaid-svg-WrBrU24qK69ILMTS .node circle,#mermaid-svg-WrBrU24qK69ILMTS .node ellipse,#mermaid-svg-WrBrU24qK69ILMTS .node polygon,#mermaid-svg-WrBrU24qK69ILMTS .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-WrBrU24qK69ILMTS .node .label{text-align:center;}#mermaid-svg-WrBrU24qK69ILMTS .node.clickable{cursor:pointer;}#mermaid-svg-WrBrU24qK69ILMTS .arrowheadPath{fill:#333333;}#mermaid-svg-WrBrU24qK69ILMTS .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-WrBrU24qK69ILMTS .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-WrBrU24qK69ILMTS .edgeLabel{background-color:#e8e8e8;text-align:center;}#mermaid-svg-WrBrU24qK69ILMTS .edgeLabel rect{opacity:0.5;background-color:#e8e8e8;fill:#e8e8e8;}#mermaid-svg-WrBrU24qK69ILMTS .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-WrBrU24qK69ILMTS .cluster text{fill:#333;}#mermaid-svg-WrBrU24qK69ILMTS .cluster span{color:#333;}#mermaid-svg-WrBrU24qK69ILMTS div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-WrBrU24qK69ILMTS :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;}#mermaid-svg-WrBrU24qK69ILMTS .process*{fill:#f9f!important;stroke:#333!important;stroke-width:2px!important;}#mermaid-svg-WrBrU24qK69ILMTS .process span{fill:#f9f!important;stroke:#333!important;stroke-width:2px!important;}#mermaid-svg-WrBrU24qK69ILMTS .data*{fill:#bbf!important;stroke:#333!important;stroke-width:2px!important;}#mermaid-svg-WrBrU24qK69ILMTS .data span{fill:#bbf!important;stroke:#333!important;stroke-width:2px!important;}#mermaid-svg-WrBrU24qK69ILMTS .decision*{fill:#ff9!important;stroke:#333!important;stroke-width:2px!important;}#mermaid-svg-WrBrU24qK69ILMTS .decision span{fill:#ff9!important;stroke:#333!important;stroke-width:2px!important;}#mermaid-svg-WrBrU24qK69ILMTS .module*{fill:#fff!important;stroke:#333!important;stroke-width:4px!important;}#mermaid-svg-WrBrU24qK69ILMTS .module span{fill:#fff!important;stroke:#333!important;stroke-width:4px!important;} 数据整合模块 初始化客户端 加载数据 并行处理生成 合并结果 保存CSV AI并行处理模块 定义数据处理函数 定义LLM客户端 定义JSON保存函数 定义分片函数 定义排序函数 数据预处理模块 配置路径和参数 导入依赖库 配置API和输出 3. 数据预处理模块
3.1 配置数据路径和处理参数
data_path ~/work/eedi_synthetic_data/MalAlgoQA_format.csv
index_start 0
index_end len(df)
step 100
max_workers 23.2 配置API参数
model_config dict(openai_api_base https://testshellapi.kimi.asia/v1, api_key ****,model gpt-4o,default_system_prompt ##TaskYou are a Mathematics teacher. Your task is to reason and identify the ConstructName and SubjectName and then the misconception behind the user input Incorrect Answers with the Question.ConstructName is Most granular level of knowledge related to question, appears to describe the specific mathematical method or procedure used to solve the question. It explains the technique or approach needed to reach the answer.SubjectName is More general context than the construct, represents the broader mathematical topic or category that the question belongs to.Misconceptions are a mistake in conceptual understanding and they have relations with all the applications of those concepts. For example, a single misconception on the connections among proportional relationships (part/whole, part/part, whole/part) can cause problems in identifying those patterns in drawings and can be the cause of failing to realize all parts must be of equal size, therefore associating the denominator of the fraction with the total number of parts regardless their size.Answer concisely what misconception it is to lead to getting the incorrect answer.Do not use The misconception is to start your answers.Do not mention the concrete details of the question or answers. ##User inputQuestion: The question textA: multiple choice answer A textB: multiple choice answer B textC: multiple choice answer C textD: multiple choice answer D textCorrect Answer: The correct answer text##You should answer in the following JSON format{ConstructName: here writes the constructName,SubjectName: here writes the SubjectNameMisconceptionAName: here writes the answer As misconception.,MisconceptionBName: here writes the answer Bs misconception.,MisconceptionCName: here writes the answer Cs misconception.,MisconceptionDName: here writes the answer Ds misconception.,}, # system prompt,default_temperature 0.5,max_tokens 256,
)3.3 配置输出路径
cache_folder f./cache_{model_config[model]}_model_misconceptions_result
if not os.path.exists(cache_folder):os.makedirs(cache_folder)
output_data_path fmisconception_data_{os.path.splitext(os.path.basename(data_path))[0]}_{model_config[model]}.csv4. AI并行处理模块
4.1 定义LLM客户端类
class LLMChat:def __init__(self, openai_api_base, api_key, model, default_temperature, default_system_prompt, max_tokens512):self.client OpenAI(api_key api_key,base_urlopenai_api_base,)self.model modelself.default_temperature default_temperatureself.default_system_prompt default_system_promptself.max_tokens max_tokensdef chat(self, user_prompt, system_promptNone, temperatureNone):if not system_prompt:system_prompt self.default_system_promptif not temperature:temperature self.default_temperaturechat_response self.client.chat.completions.create(modelself.model,temperaturetemperature,messages[{role: system, content: system_prompt},{role: user, content: user_prompt},],max_tokensself.max_tokens,response_format{type: json_object})return chat_response.choices[0].message.content4.2 定义数据处理函数
def process_row(args, debugFalse):user_prompt Question: {question}A: {answer_a}B: {answer_b}C: {answer_c}D: {answer_d}Correct Answer: {correct_answer}index, row argsca row[CorrectAnswer]correctanswer row[fAnswer{ca}Text]input_user_prompt user_prompt.format(questionrow[QuestionText],answer_arow[AnswerAText],answer_brow[AnswerBText],answer_crow[AnswerCText],answer_drow[AnswerDText],correct_answercorrectanswer,)ret_data {}try:ret_data vc.chat(input_user_prompt)if debug:print(ret_data\n)except Exception as e:print(fAn exception occur {str(e)})ret_data[error] str(e)passif debug:print(system: , model_config[default_system_prompt])print(* 50)print(user_input: , input_user_prompt)print(* 50)print(assistant: , ret_data)return ret_data4.3 定义JSON保存函数
def save_json(fn, obj):with open(fn, w) as f:json.dump(obj, f, ensure_asciiFalse, indent4)print(fsave file to {fn})4.4 定义数据分片函数
def slice_range(start, end, step):if step 0:raise ValueError(步长必须大于0)result []while start end:result.append(start)start stepif result[-1] end:result.append(end)return result4.5 定义分片处理函数
def process_pairs(sliced_range):slices []for first, second in zip(sliced_range, sliced_range[1:]):slices.append([first, second])return slices4.5 定义文件名排序函数
def natural_sort_key(filename):parts re.findall(r\d, filename)return tuple(map(int, parts))5. 数据整合模块
5.1 加载数据并生成分片
df pd.read_csv(data_path)
df.head()
sliced_range process_pairs(slice_range(index_start, index_end, step))df数据检查
5.2 初始化LLM客户端并测试
vc LLMChat(**model_config)
r process_row((7, df.iloc[7]), debugTrue)5.3 并行处理数据生成
for slices in tqdm(sliced_range, totallen(sliced_range)):output_filepath f{cache_folder}/cache_res_{slices[0]}.jsonif os.path.exists(output_filepath):print(fcache file exists, skip {output_filepath})continuedf_tasks df.iloc[slices[0]:slices[1]]results []with ProcessPoolExecutor(max_workersmax_workers) as executor:results list(tqdm(executor.map(process_row, df_tasks.iterrows()), totallen(df_tasks)))save_json(output_filepath, results)5.4 合并处理结果
f_names glob.glob(f{cache_folder}/*.json)
sorted_filenames sorted(f_names, keynatural_sort_key)
f_names sorted_filenamesresults []
for fn in f_names:with open(fn, r) as f:batch_results json.load(f)results.extend(batch_results)l len(results)
results [json.loads(r) for r in results]5.5 保存最终结果
df df.iloc[:l]
gen_df pd.DataFrame(results)
df pd.concat([df, gen_df], axis1)
df.to_csv(output_data_path, indexFalse)(To be continued)