Document chunking is the process of splitting large documents into smaller, coherent pieces that can be retrieved and processed independently.
The goal is to create chunks that are semantically coherent while being optimally sized for retrieval and processing.
Fixed-size chunking is the simplest approach: it splits text into chunks based on fixed parameters such as chunk size and overlap.
Algorithm: Fixed-Size Chunking
Input: text, chunk_size, overlap_size (with overlap_size < chunk_size, so the loop in step 2 terminates)
Output: list of chunks
1. Initialize: position = 0, chunks = []
2. While position < length(text):
a. Extract chunk from position to (position + chunk_size)
b. If chunk ends mid-word, backtrack to last word boundary
c. Add chunk to chunks list
d. Advance position by (chunk_size - overlap_size)
3. Return chunks
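Below is a minimal Python sketch of the pseudocode above. It measures chunk size in characters; the function name, default values, and the sample call are illustrative rather than taken from any particular library.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap_size: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters."""
    if chunk_size <= 0 or overlap_size < 0 or overlap_size >= chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap_size < chunk_size")

    chunks = []
    position = 0
    while position < len(text):
        end = min(position + chunk_size, len(text))
        # Step 2b: if the chunk ends mid-word, backtrack to the last word boundary.
        if end < len(text) and not text[end].isspace():
            boundary = text.rfind(" ", position, end)
            if boundary > position:
                end = boundary
        chunks.append(text[position:end].strip())   # Step 2c: collect the chunk
        position += chunk_size - overlap_size       # Step 2d: overlapping stride
    return chunks

# Example: chunk a long string into ~200-character pieces with a 40-character overlap.
pieces = fixed_size_chunks("Document chunking splits large documents into pieces. " * 50,
                           chunk_size=200, overlap_size=40)
```

Note that, as in the pseudocode, the stride in step 2d is fixed, so a backtrack longer than overlap_size would leave a small gap between consecutive chunks; choosing an overlap larger than the longest expected word avoids this.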