Document Loading and Extraction

Document loading and extraction is the first step. You'll retrieve data from the various folders or repositories that act as your sources of truth: PDF and Word documents, websites, markdown files, Confluence and SharePoint pages, perhaps records from databases (SQL or others), images, videos and various other formats. The data you collect will typically be structured, unstructured or multimodal. Regardless of format, the primary goal during this stage is to extract usable textual information that can be indexed, embedded and fed into the language models.

Key aspects of this step include:

  • Connectors to Sources: Connectors are responsible for accessing different Systems of Record (SOR). For example, you might have a connector for an AWS S3 bucket, one for a SharePoint document library, and another for a SQL database. These connectors handle authentication and data fetching from the source (see the connector sketch after this list).

  • Format-Specific Loaders: Data comes in varied formats – unstructured text (PDFs, DOCX, HTML), semi-structured (JSON, CSV, XML) and structured (database tables). Specialized document loaders parse each format. For instance, a PDF loader will extract text from PDF files (using libraries or OCR if needed), while a CSV loader will read the rows and columns. Using the right loader ensures that all content (including text in tables or image captions) is captured; a loader-dispatch sketch follows this list.

  • Extraction of Content: Once a file is loaded, the text content is extracted. Sometimes this is straightforward (as with plain text files). Other times, advanced processing is needed. For example, PDFs might contain images or scanned pages that require OCR to extract the text, and web pages contain HTML that must be parsed to isolate the visible text.

  • Handling Complex Structures: If a document contains multiple data types (text, tables, images), you may use processors to handle each type. One strategy is to extract text from all parts and label it. In some pipelines, images might be passed through an image-to-text model (for captioning) if their content is important. The goal is not to lose valuable information during extraction.
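
To make the connector idea concrete, here is a minimal sketch of an S3 connector using boto3. The bucket name and prefix in the usage comment are placeholders, and credentials are assumed to come from the standard AWS credential chain; a connector for SharePoint or a SQL database would follow the same authenticate–enumerate–fetch pattern.

```python
import boto3


def fetch_s3_documents(bucket: str, prefix: str = ""):
    """Yield (key, raw bytes) for every object under a prefix in an S3 bucket.

    A minimal connector sketch: authentication is assumed to come from the
    usual AWS credential chain (environment variables, IAM role, etc.).
    """
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            yield obj["Key"], body


# Hypothetical usage: bucket and prefix are placeholders.
# for key, raw in fetch_s3_documents("my-knowledge-base", "policies/"):
#     print(key, len(raw))
```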
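
The loader and extraction bullets can be combined into a simple dispatch-by-extension routine. The sketch below is one possible shape rather than a full implementation: it assumes pypdf for PDFs, BeautifulSoup for HTML and the standard csv module, and it leaves out OCR, DOCX and the other formats a production pipeline would also need to handle.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup   # HTML parsing
from pypdf import PdfReader     # PDF text extraction


def extract_text(path: Path) -> str:
    """Route a file to a format-specific extractor and return plain text."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(str(path))
        # Scanned PDFs return little or no text here; those pages need OCR.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix in {".html", ".htm"}:
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        return soup.get_text(separator=" ", strip=True)  # visible text only
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return "\n".join(", ".join(row) for row in csv.reader(f))
    # Fall back to treating everything else as plain text.
    return path.read_text(encoding="utf-8", errors="replace")
```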

After this step, the data from various sources is unified in a basic textual form (often as a set of in-memory Document objects or records). For example, using LangChain, one could use PyPDFLoader to turn a PDF into a list of Document objects (one per page). Each document typically comes with metadata (filename, source, perhaps creation date) attached to the extracted text.
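
A minimal LangChain version of this step might look like the sketch below. The file name is a placeholder, and the import path assumes a recent langchain-community release (older releases exposed the loader under langchain.document_loaders).

```python
from langchain_community.document_loaders import PyPDFLoader

# Placeholder path; each page becomes one Document with text and metadata.
loader = PyPDFLoader("quarterly_report.pdf")
documents = loader.load()

for doc in documents[:2]:
    print(doc.metadata)            # e.g. {'source': 'quarterly_report.pdf', 'page': 0}
    print(doc.page_content[:200])  # first 200 characters of the page text
```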

Why it matters: Proper document loading ensures that no relevant data is missed. It also respects document structure: preserving page breaks and section headings, for example, can later help with chunking. This step is also where volume is addressed. If there are thousands of documents, ingestion might be done in batches or through a pipeline that streams data for efficiency. Robust connectors and loaders are crucial for a scalable pipeline.
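
To make the volume point concrete, ingestion is often wrapped in a small streaming/batching layer so that thousands of files never need to be held in memory at once. A minimal sketch, assuming a local folder of documents and a batch size of 100 (both arbitrary choices for illustration):

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Lazily group an iterable into fixed-size batches."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


# Hypothetical usage: walk a document folder without reading everything at once.
source_dir = Path("./documents")
files = (p for p in source_dir.rglob("*") if p.is_file())
for batch in iter_batches(files, batch_size=100):
    # Each batch would be extracted, cleaned, chunked, embedded and indexed
    # here before the next batch is pulled from disk.
    print(f"Processing {len(batch)} files")
```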

In summary, Document Loading and Extraction is about connecting to where the data lives, and pulling it into the system in text form. It sets the stage for cleaning and organizing that text in subsequent steps.