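Authors: Jeffrey Rengifo and Tomás Murúa, from Elastic

Exploring what Elasticsearch synonyms can do in a RAG application.

Synonyms let us search documents using different words that share a meaning, so users find what they are looking for regardless of the exact words they type. You might think that, because RAG applications use semantic/vector search, part of what synonyms do is already covered by semantic search, since synonyms are by definition semantically related words.

Is that true? Can semantic search really replace synonyms? In this article we analyze the impact of using synonyms in a RAG application.

Steps

1. Configure the endpoint
2. Configure synonyms
3. Index documents
4. Semantic search
5. Synonyms and RAG

Configure the inference endpoint

For this example, we will implement a RAG (Retrieval-Augmented Generation) system in an HR setting, with and without synonyms. We will index different documents that use variants of the term PTO (Paid Time Off), such as "vacation" or "holiday". Then we will configure synonyms to show how these relationships improve the relevance and accuracy of search.

First, let's create an endpoint that uses the ELSER model through the inference API by running the following command in Kibana DevTools: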
PUT _inference/sparse_embedding/code-wave_inference
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

Configure synonyms

What are synonyms in Elasticsearch?
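In Elasticsearch, synonyms are words or phrases with the same or a similar meaning. They are stored as synonym sets and can be managed either as files or through an API. They let users find relevant information even when they use different terms to refer to the same concept.

For example, if we create a synonym set in which "holiday" and "vacation" are synonyms of "Paid Time Off", an employee who searches for any one of those terms will find documents related to all of them.

You can read more about them in this article.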
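To make the idea concrete, here is a minimal Python sketch of what an equivalent-synonyms rule such as "holidays, paid time off" means for a query. It is a simplified illustration only: Elasticsearch applies synonyms inside the analysis chain as a token graph, not by rewriting strings, and the function names below (`parse_rule`, `expand_query`) are hypothetical helpers, not part of any Elasticsearch API.

```python
import re


def parse_rule(rule):
    """Split a comma-separated rule like 'holidays, paid time off' into terms."""
    return [term.strip().lower() for term in rule.split(",") if term.strip()]


def _phrase_pattern(phrase):
    # Match the phrase only on word boundaries, so 'holidays' does not
    # match inside a longer word.
    return re.compile(rf"\b{re.escape(phrase)}\b")


def expand_query(query, rules):
    """Return the normalized query plus one variant per swapped-in synonym."""
    normalized = " ".join(query.lower().split())
    variants = [normalized]
    for rule in rules:
        terms = parse_rule(rule)
        for term in terms:
            pattern = _phrase_pattern(term)
            if not pattern.search(normalized):
                continue
            for other in terms:
                if other == term:
                    continue
                # A function replacement avoids backslash-escape surprises.
                variant = pattern.sub(lambda _m, o=other: o, normalized)
                if variant not in variants:
                    variants.append(variant)
    return variants


if __name__ == "__main__":
    rules = ["holidays, paid time off"]
    print(expand_query("Holidays policy", rules))
    # ['holidays policy', 'paid time off policy']
    print(expand_query("sick leave", rules))
    # ['sick leave']
```

The point of the sketch is that the relationship is explicit and symmetric: either term in the rule leads to the other, and a query that contains neither is left untouched.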
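Let's create a synonym set using the synonyms API:

PUT _synonyms/code-wave_synonyms
{
  "synonyms_set": [
    {
      "synonyms": "holidays, paid time off"
    }
  ]
}

Note that a synonym set must be configured before it can be applied to an index.

Now let's define the settings and mappings for our data: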
PUT /code-wave_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_filter": {
          "type": "synonym_graph",
          "synonyms_set": "code-wave_synonyms",
          "updateable": true
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonyms_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "standard",
        "copy_to": "semantic_field",
        "fields": {
          "synonyms": {
            "type": "text",
            "analyzer": "standard",
            "search_analyzer": "my_search_analyzer"
          }
        }
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "code-wave_inference"
      }
    }
  }
}
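We will use a semantic_text field for semantic search, and the synonym graph token filter to handle multi-word synonyms.

We also created a text_field.synonyms subfield alongside text_field, so we can search against either one. Note that both are of type text, which gives us finer control over whether a query takes synonyms into account.

Finally, we use copy_to to copy the value of text_field into the semantic_text version of the field, which enables both full-text and semantic queries.

Index documents

We will now index our documents using the bulk API: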
POST _bulk
{"index":{"_index":"code-wave_index","_id":"1"}}
{"semantic_field":"Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.","text_field":"Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}
{"index":{"_index":"code-wave_index","_id":"2"}}
{"semantic_field":"Holidays: Paid public holidays recognized each calendar year.","text_field":"Holidays: Paid public holidays recognized each calendar year."}
{"index":{"_index":"code-wave_index","_id":"3"}}
{"semantic_field":"Sick leave: Paid sick leave of up to 15 days per year.","text_field":"Sick leave: Paid sick leave of up to 15 days per year."}
{"index":{"_index":"code-wave_index","_id":"4"}}
{"semantic_field":"Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!","text_field":"Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"}
{"index":{"_index":"code-wave_index","_id":"5"}}
{"semantic_field":"Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations.","text_field":"Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations."}
{"index":{"_index":"code-wave_index","_id":"6"}}
{"semantic_field":"Holidays travel: Find the best deals for your holidays flights and accommodations this season.","text_field":"Holidays travel: Find the best deals for your holidays flights and accommodations this season."}
{"index":{"_index":"code-wave_index","_id":"7"}}
{"semantic_field":"Holidays music: Stream your favorite holidays classics and discover new seasonal hits.","text_field":"Holidays music: Stream your favorite holidays classics and discover new seasonal hits."}
{"index":{"_index":"code-wave_index","_id":"8"}}
{"semantic_field":"Holidays decorations: Our store offers a wide range of holidays decorations to make your home festive.","text_field":"Holidays decorations: Our store offers a wide range of holidays decorations to make your home festive."}
{"index":{"_index":"code-wave_index","_id":"9"}}
{"semantic_field":"Holidays movies: Check out our list of must-watch holidays movies for cozy winter nights.","text_field":"Holidays movies: Check out our list of must-watch holidays movies for cozy winter nights."}
{"index":{"_index":"code-wave_index","_id":"10"}}
{"semantic_field":"Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food.","text_field":"Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food."}
{"index":{"_index":"code-wave_index","_id":"11"}}
{"semantic_field":"Holidays weather: Stay updated with our holidays weather forecast to plan your activities.","text_field":"Holidays weather: Stay updated with our holidays weather forecast to plan your activities."}
{"index":{"_index":"code-wave_index","_id":"12"}}
{"semantic_field":"Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list.","text_field":"Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list."}
{"index":{"_index":"code-wave_index","_id":"13"}}
{"semantic_field":"Holidays traditions: Explore unique holidays traditions celebrated around the world.","text_field":"Holidays traditions: Explore unique holidays traditions celebrated around the world."}

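We are now ready to search, but first let's make sure the synonyms work by searching for holidays:

GET code-wave_index/_search
{
  "_source": {
    "excludes": ["*embeddings", "*chunks"]
  },
  "query": {
    "multi_match": {
      "query": "holidays",
      "fields": ["text_field^10", "text_field.synonyms^0.6"]
    }
  }
}

We tune the boosts so that synonym matches score lower than matches on the original word.

Check the response: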
{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 12, "relation": "eq"},
    "max_score": 5.2014494,
    "hits": [
      {
        "_index": "code-wave_index",
        "_id": "2",
        "_score": 3.0596757,
        "_source": {
          "text_field": "Holidays: Paid public holidays recognized each calendar year.",
          "semantic_field": {
            "inference": {
              "inference_id": "code-wave_inference",
              "model_settings": {"task_type": "sparse_embedding"}
            },
            "text": "Holidays: Paid public holidays recognized each calendar year."
          }
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "1",
        "_score": 3.023004,
        "_source": {
          "text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.",
          "semantic_field": {
            "inference": {
              "inference_id": "code-wave_inference",
              "model_settings": {"task_type": "sparse_embedding"}
            },
            "text": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."
          }
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "13",
        "_score": 2.9230676,
        "_source": {
          "text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world.",
          "semantic_field": {
            "inference": {
              "inference_id": "code-wave_inference",
              "model_settings": {"task_type": "sparse_embedding"}
            },
            "text": "Holidays traditions: Explore unique holidays traditions celebrated around the world."
          }
        }
      },
      ...
    ]
  }
}

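We can see that when we search for "holidays", the second hit is document 1, which does not contain that word at all: it matched through the synonym "Paid time off".

Hybrid search

Hybrid search lets us combine the results of full-text and semantic queries into a single normalized result set, using RRF (Reciprocal Rank Fusion) to balance the scores coming from the different retrievers.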
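As a rough illustration of how RRF balances retrievers, here is a minimal Python sketch. It assumes the usual RRF formula, where a document's score is the sum of 1 / (rank_constant + rank) over every retriever that returned it, with a rank constant of 60 (the Elasticsearch default). The function name `rrf_fuse` and the toy rankings are made up for this example and are not taken from the article's index.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters, not the retriever's raw score.
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical rankings from a semantic and a lexical retriever.
    semantic = ["a", "b", "c"]
    lexical = ["c", "a"]
    for doc_id, score in rrf_fuse([semantic, lexical]):
        print(doc_id, round(score, 6))
```

Because only ranks are used, scores from very different scales (BM25 and sparse-vector similarity) can be merged without normalization. It also explains the numbers in the hybrid response in this article: a document ranked first by only one retriever gets 1/61 ≈ 0.016393, which is consistent with the score shown for documents 1 and 10, while the top hit's 0.031754 is consistent with 1/62 + 1/64.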
GET code-wave_index/_search
{
  "_source": "text_field",
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "nested": {
                "path": "semantic_field.inference.chunks",
                "query": {
                  "sparse_vector": {
                    "inference_id": "code-wave_inference",
                    "field": "semantic_field.inference.chunks.embeddings",
                    "query": "holidays"
                  }
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "holidays",
                "fields": ["text_field.synonyms"]
              }
            }
          }
        }
      ]
    }
  }
}

Response:
{
  "took": 11,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 13, "relation": "eq"},
    "max_score": 0.03175403,
    "hits": [
      {
        "_index": "code-wave_index",
        "_id": "7",
        "_score": 0.03175403,
        "_source": {"text_field": "Holidays music: Stream your favorite holidays classics and discover new seasonal hits."}
      },
      {
        "_index": "code-wave_index",
        "_id": "13",
        "_score": 0.031257633,
        "_source": {"text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world."}
      },
      {
        "_index": "code-wave_index",
        "_id": "4",
        "_score": 0.031009614,
        "_source": {"text_field": "Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"}
      },
      {
        "_index": "code-wave_index",
        "_id": "2",
        "_score": 0.030834913,
        "_source": {"text_field": "Holidays: Paid public holidays recognized each calendar year."}
      },
      {
        "_index": "code-wave_index",
        "_id": "6",
        "_score": 0.03079839,
        "_source": {"text_field": "Holidays travel: Find the best deals for your holidays flights and accommodations this season."}
      },
      {
        "_index": "code-wave_index",
        "_id": "11",
        "_score": 0.02964427,
        "_source": {"text_field": "Holidays weather: Stay updated with our holidays weather forecast to plan your activities."}
      },
      {
        "_index": "code-wave_index",
        "_id": "5",
        "_score": 0.029418126,
        "_source": {"text_field": "Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations."}
      },
      {
        "_index": "code-wave_index",
        "_id": "12",
        "_score": 0.028991597,
        "_source": {"text_field": "Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list."}
      },
      {
        "_index": "code-wave_index",
        "_id": "1",
        "_score": 0.016393442,
        "_source": {"text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}
      },
      {
        "_index": "code-wave_index",
        "_id": "10",
        "_score": 0.016393442,
        "_source": {"text_field": "Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food."}
      }
    ]
  }
}

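This query returns documents that are relevant both semantically and textually.

Synonyms and RAG

In this section we evaluate how synonyms and semantic search improve queries in a RAG system. We will use a common question about days off as the example:

"How many vacation days are provided for holidays?"

For this question, the information we want is in document 1. Document 2 is close to what we want but not exact, and it is the result we get when we search without synonyms. Let's look at their contents:

[1] Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.
[2] Holidays: Paid public holidays recognized each calendar year.

Both documents contain information about days off, but only document 2 specifically uses the term "holidays", so we can use them to test how synonyms and semantic search behave in Playground.

You can open Playground from Search > Playground. From there, configure the LLM you want to use and select the index we created to be sent as context. You can read more about Playground and its configuration here.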
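To show why the cut-off on retrieved documents matters in a RAG flow, here is a minimal Python sketch of turning search hits into LLM context. It is an illustration under stated assumptions, not Playground's actual implementation: the helper names (`build_context`, `build_prompt`), the prompt wording, and the hit ordering in the example are all hypothetical. Only the hit shape (`_id`, `_source.text_field`) follows the responses shown in this article.

```python
def build_context(hits, top_k=3, field="text_field"):
    """Format the first top_k hits as numbered passages."""
    passages = []
    for position, hit in enumerate(hits[:top_k], start=1):
        passages.append(f"[{position}] {hit['_source'][field]}")
    return "\n".join(passages)


def build_prompt(question, hits, top_k=3):
    """Assemble a prompt that restricts the LLM to the retrieved context."""
    context = build_context(hits, top_k=top_k)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Made-up ordering for illustration: the PTO document lands in fourth place.
    hits = [
        {"_id": "2", "_source": {"text_field": "Holidays: Paid public holidays recognized each calendar year."}},
        {"_id": "13", "_source": {"text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world."}},
        {"_id": "4", "_source": {"text_field": "Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"}},
        {"_id": "1", "_source": {"text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}},
    ]
    print(build_prompt("How many vacation days are provided for holidays?", hits))
```

With this ordering the passage that states "20 days of paid vacation" never reaches the prompt, so the model has nothing to answer from. Anything that moves the right document into the top positions, such as a synonym match, changes what the LLM can say.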
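With Playground configured, clicking the query button shows that synonyms are deactivated.

For each question, we send the top three results of the previous query to the LLM as context.

Now let's ask Playground the question and check the result with synonyms deactivated.

Because the document stating how many vacation days employees get per year is not among the top three search results, the LLM cannot answer the question. In this case, the closest result is document [2].

Note: by clicking "Snippet", we can see the exact Elasticsearch content behind the answer.

Let's clear the chat history, activate synonyms, and ask the same question again.

Note that when you enable both the semantic_text field and the text field, Playground automatically generates a hybrid search query.

Let's repeat the question, now with synonyms activated.

This time the answer does draw on the document we were looking for, because synonyms allow document [1] to be sent to the LLM.

Conclusion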
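In this article we found that synonyms are a fundamental component of search systems, even when semantic search is in use: semantic search does not necessarily cover what synonyms do.

Synonyms let us control which documents to boost for a given use case, and improve accuracy by tuning relevance. Semantic search, on the other hand, is useful for recall: it can bring in potentially relevant results without us having to add a synonym for every related term.

With hybrid search we can run synonym and semantic search together and get the best of both worlds. In Playground, if we select a combination of semantic and text fields as search fields, it builds the hybrid query for us automatically.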
Original article: Are synonyms important in RAG? - Elasticsearch Labs