In LangChain, a document transformer is a tool that processes documents before they are passed to other LangChain components. By cleaning, processing, and transforming documents, these tools ensure that LLMs and other LangChain components receive data in a format that optimizes their performance.

In the previous chapter we covered document loaders; after loading, documents still need to be transformed.
Text Splitters
Text splitters break text documents into smaller, more manageable units. Ideally, these chunks are sentences or paragraphs, so that the context and relationships within the text are preserved. Splitters also account for the limits of an LLM's processing capacity: by creating smaller chunks, an LLM can analyze the information more effectively within its context window.
- CharacterTextSplitter
- RecursiveCharacterTextSplitter
- Split by tokens
- Semantic Chunking
- HTMLHeaderTextSplitter
- MarkdownHeaderTextSplitter
- RecursiveJsonSplitter
- Split Code
CharacterTextSplitter
```python
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
```

- separator: the delimiter used to identify natural break points in the text. Here it is set to "\n\n", so the splitter looks for double newlines as potential split points.
- chunk_size: the target size of each chunk, in characters. Here it is 1000, so the splitter aims to create chunks roughly 1000 characters long.
- chunk_overlap: the number of characters that consecutive chunks share. Here it is 200, so each chunk includes the final 200 characters of the previous chunk. This overlap helps ensure that no important information is lost at the boundary between chunks.
- length_function: the function used to measure the length of a chunk. Here it is the built-in len, which counts the characters in a string.
- is_separator_regex: whether the separator should be interpreted as a regular expression. Here it is False, meaning the separator is a plain string rather than a regex pattern.
CharacterTextSplitter splits text on the specified separator (by default "\n\n"). The chunk_size parameter sets the maximum size of each chunk, and splits only happen where the separator allows. If a string begins with n characters, followed by a separator, followed by m characters before the next separator, the first chunk will have size n whenever chunk_size < n + m + len(separator).
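A minimal sketch of this rule (the string and the sizes are hypothetical, chosen only to make the arithmetic visible):

```python
from langchain_text_splitters import CharacterTextSplitter

# "aaaaa" (n=5) + separator "\n\n" (len 2) + "bbb" (m=3):
# chunk_size=6 is smaller than n + m + len(separator) = 10,
# so the text splits at the separator and the first chunk has size n=5.
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=6, chunk_overlap=0)
print(splitter.split_text("aaaaa\n\nbbb"))  # expected: ['aaaaa', 'bbb']
```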
```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("book.pdf")
pages = loader.load_and_split()

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_text(pages[0].page_content)
print(len(texts))  # 4
texts[0]
```
```
Our goal with this book is to provide the guidance and framework for you, the reader, to grow on \nthe path to being a truly excellent database reliability engineer (DBRE). When naming the book we \nchose to use the words reliability engineer , rather than administrator. \nBen Treynor, VP of Engineering at Google, says the following about reliability engi‐ neering: \nfundamentally doing work that has historically been done by an operations team, but using engineers with software \nexpertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, \nsubstitute automation for human labor. \nToday’s database professionals must be engineers, not administrators. We build things. We create \nthings. As engineers practicing devops, we are all in this together, and nothing is someone else’s \nproblem. As engineers, we apply repeatable processes, establ ished knowledge, and expert judgment
```
texts[1]

```
things. As engineers practicing devops, we are all in this together, and nothing is someone else’s \nproblem. As engineers, we apply repeatable processes, establ ished knowledge, and expert judgment \nto design, build, and operate production data stores and the data structures within. As database \nreliability engineers, we must take the operational principles and the depth of database expertise \nthat we possess one ste p further. \nIf you look at the non -storage components of today’s infrastructures, you will see sys‐ tems that are \neasily built, run, and destroyed via programmatic and often automatic means. The lifetimes of these \ncomponents can be measured in days, and sometimes even hours or minutes. When one goes away, \nthere is any number of others to step in and keep the quality of service at expected levels. \nOur next goal is that you gain a framework of principles and practices for the design, building, and
```

RecursiveCharacterTextSplitter
The key difference is that if a resulting chunk is still larger than the desired chunk_size, the splitter keeps splitting it until all final chunks are within the specified size limit. It is parameterized by a list of characters.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    separators=["\n\n", "\n", " ", ""],
    chunk_size=50,
    chunk_overlap=40,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_text(pages[0].page_content)
print(len(texts))
texts[2]
```
```
book is to provide the guidance and framework for
```

texts[3]

```
provide the guidance and framework for you, the
```

In the context of text splitting, "recursive" means the splitter repeatedly applies its splitting logic to the resulting chunks until they meet a criterion, such as being shorter than a specified maximum length. This is especially useful for very long texts that need to be broken into smaller, more manageable pieces at several levels of granularity.
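To see the recursion in action, here is a small sketch (the sample string and chunk_size are hypothetical): with no "\n\n" or "\n" present, the splitter falls back to splitting on spaces until every chunk fits.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# No blank lines or newlines in the sample, so the splitter falls back
# through its separator list ("\n\n" -> "\n" -> " ") to satisfy chunk_size.
sample = "alpha beta gamma delta epsilon zeta eta theta"
splitter = RecursiveCharacterTextSplitter(chunk_size=12, chunk_overlap=0)
print(splitter.split_text(sample))  # chunks of at most 12 characters
```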
Split By Tokens
Original text: "The quick brown fox jumps over the lazy dog."

Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

In this example, the text is split into tokens on whitespace and punctuation, and each word becomes a separate token. In practice, tokenization can be more complex, especially for languages with different writing systems, or for special cases such as "don't", which might be split into "do" and "n't".
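As a concrete sketch, the tiktoken library (assumed installed via pip install tiktoken) shows that real tokenizers work at the subword level and attach whitespace to tokens rather than splitting exactly on it:

```python
import tiktoken

# Encode the sentence with a real BPE tokenizer, then decode each token id
# individually to see the actual token boundaries.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("The quick brown fox jumps over the lazy dog.")
print([enc.decode([i]) for i in ids])
```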
There are various tokenizers.

TokenTextSplitter uses the tiktoken library.
```python
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=1)
texts = text_splitter.split_text(pages[0].page_content)
texts[0]
```
```
Our goal with this book is to provide the guidance
```

texts[1]

```
guidance and framework for you, the reader, to
```

SpacyTextSplitter uses the spacy library.
```python
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(pages[0].page_content)
```

NLTKTextSplitter uses the nltk library.
```python
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(pages[0].page_content)
```

We can even use a Hugging Face tokenizer.
```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=10
)
texts = text_splitter.split_text(pages[0].page_content)
```

HTMLHeaderTextSplitter
HTMLHeaderTextSplitter is a structure-aware chunker for web pages: it splits text at HTML elements and attaches metadata for the headers relevant to each chunk. It can return chunks one element at a time, or combine elements that share the same metadata, preserving semantic grouping and the document's structural context. It can also be combined with other text splitters as part of a chunking pipeline (see the sketch after the example output below).
```python
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""
```
```python
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
```
```
[Document(page_content='Foo'),
 Document(page_content='Some intro text about Foo. \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]
```
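As noted above, this splitter can feed other splitters in a pipeline. A minimal sketch (the chunk_size is illustrative) that caps the size of each header-scoped chunk while keeping its header metadata:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Further split each header-scoped Document; metadata is carried over.
chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=0)
splits = chunk_splitter.split_documents(html_header_splits)
```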
MarkdownHeaderTextSplitter

Similar to HTMLHeaderTextSplitter, but specialized for Markdown files.
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
```
```
[Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
 Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
 Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
```

RecursiveJsonSplitter
```python
import requests

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)
json_chunks
```

```
[{'openapi': '3.0.2', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'required': True, 'schema': {'title': 'Session Id', 'type': 'string', 'format': 'uuid'}, 'name': 'session_id', 'in': 'path'}, {'required': False, 'schema': {'title': 'Include Stats', 'type': 'boolean', 'default': False}, 'name': 'include_stats', 'in': 'query'}, {'required': False, 'schema': {'title': 'Accept', 'type': 'string'}, 'name': 'accept', 'in': 'header'}]}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'responses': {'200': {'description': 'Successful Response', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/TracerSession'}}}}}}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'responses': {'422': {'description': 'Validation Error', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/HTTPValidationError'}}}}}, 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}},
 ...
 {'components': {'securitySchemes': {'API Key': {'type': 'apiKey', 'in': 'header', 'name': 'X-API-Key'}, 'Tenant ID': {'type': 'apiKey', 'in': 'header', 'name': 'X-Tenant-Id'}, 'Bearer Auth': {'type': 'http', 'scheme': 'bearer'}}}}]
```
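If you need Document objects or JSON-formatted strings rather than Python dicts, the same splitter also exposes create_documents and split_text; a brief sketch reusing the splitter and json_data defined above:

```python
# Documents built from the chunked dicts
docs = splitter.create_documents(texts=[json_data])

# The chunks serialized as JSON strings
texts = splitter.split_text(json_data=json_data)
print(texts[0])
```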
Split Code

"Split Code" in Langchain refers to dividing source code into smaller, more manageable segments or chunks.
```python
from langchain_text_splitters import Language

[e.value for e in Language]
```

```
['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl']
```

```python
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
```
```
[Document(page_content='def hello_world():\n    print("Hello, World!")'),
 Document(page_content='# Call the function\nhello_world()')]
```

```python
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
```
```
[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'),
 Document(page_content='// Call the function\nhelloWorld();')]
```