
PDF Transform

The PDF transform extracts text from PDF files. Autolabel supports two modes: direct text extraction, for PDFs that contain embedded text, and optical character recognition (OCR) text extraction, for PDFs whose pages are images, such as scans. To use this transform, follow these steps:

Installation

For direct text extraction, install the pdfplumber package:

pip install pdfplumber

For OCR text extraction, install the pdf2image and pytesseract packages:

pip install pdf2image pytesseract

OCR text extraction also requires the Tesseract engine itself, which pip install pytesseract does not provide. See the Tesseract docs for installation instructions.
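Before running the transform, it can help to confirm that the dependencies for the chosen mode are present. The sketch below is only an illustration: missing_pdf_dependencies is a hypothetical helper, not part of Autolabel, and it assumes the Tesseract executable is named tesseract and is on the PATH.

```python
import importlib.util
import shutil


def missing_pdf_dependencies(ocr_enabled):
    """Return the names of missing dependencies for the chosen extraction mode.

    Hypothetical helper for illustration; Autolabel does not ship this function.
    """
    # Packages named in the installation steps above.
    packages = ["pdf2image", "pytesseract"] if ocr_enabled else ["pdfplumber"]
    missing = [name for name in packages if importlib.util.find_spec(name) is None]

    # Assumption: the Tesseract binary is called "tesseract" and is on the PATH.
    if ocr_enabled and shutil.which("tesseract") is None:
        missing.append("tesseract (system binary)")
    return missing


print(missing_pdf_dependencies(ocr_enabled=True))
```

An empty list means everything needed for that mode was found.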

Parameters for this transform

  1. file_path_column: the name of the column containing the paths of the PDF files to extract text from
  2. ocr_enabled: a boolean indicating whether to use OCR text extraction instead of direct text extraction
  3. page_format: a format string applied to each page of the PDF file. The following fields can be used in the format string:
    • page_num: the number of the page
    • page_content: the text extracted from the page
  4. page_sep: the string inserted between consecutive pages in the output
  5. lang: a string indicating the language of the text in the PDF file. See the [tesseract docs](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) for a full list of supported languages
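To make the two extraction modes concrete, here is a minimal sketch of how per-page text could be obtained with the packages installed above. It is not Autolabel's actual implementation: extract_page_texts is a hypothetical function, and the default lang of "eng" is an assumption.

```python
def extract_page_texts(file_path, ocr_enabled=False, lang="eng"):
    """Return one string per page of the PDF at file_path.

    Hypothetical sketch; the "eng" default for lang is an assumption.
    """
    if ocr_enabled:
        # Imported lazily so direct extraction works without the OCR packages.
        from pdf2image import convert_from_path
        import pytesseract

        # Render each page to an image, then run OCR on it.
        images = convert_from_path(file_path)
        return [pytesseract.image_to_string(image, lang=lang) for image in images]

    import pdfplumber

    with pdfplumber.open(file_path) as pdf:
        # extract_text() may return None for a page without text.
        return [page.extract_text() or "" for page in pdf.pages]
```

In this sketch, lang only affects the OCR branch, since it is passed to Tesseract.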

Output Format

The page_format and page_sep parameters define how the text extracted from the pdf will be formatted. For example, if the pdf file contained 2 pages with "Hello," on the first page and "World!" on the second, a page_format of {page_num} - {page_content} and a page_sep of \n would result in the following output:

"1 - Hello,\n2 - World!"

The metadata column contains a dict with the field "num_pages" indicating the number of pages in the pdf file.
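The formatting described above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not Autolabel's own code; format_pdf_output is a hypothetical name, and page numbering is assumed to start at 1, as in the example.

```python
def format_pdf_output(pages, page_format, page_sep):
    """Combine per-page text into one string plus a metadata dict.

    Hypothetical helper; mirrors the page_format / page_sep behavior described above.
    """
    content = page_sep.join(
        page_format.format(page_num=num, page_content=text)
        for num, text in enumerate(pages, start=1)  # assumption: pages are 1-based
    )
    metadata = {"num_pages": len(pages)}
    return content, metadata


content, metadata = format_pdf_output(
    ["Hello,", "World!"], "{page_num} - {page_content}", "\n"
)
print(repr(content))  # '1 - Hello,\n2 - World!'
print(metadata)       # {'num_pages': 2}
```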

Using the transform

Below is an example config entry for a PDF transform that uses OCR to extract text from the files listed in the file_path column:

{
  ..., # other config parameters
  "transforms": [
    ..., # other transforms
    {
      "name": "pdf",
      "params": {
        "file_path_column": "file_path",
        "ocr_enabled": true,
        "page_format": "Page {page_num}: {page_content}",
        "page_sep": "\n\n"
      },
      "output_columns": {
        "content_column": "content",
        "metadata_column": "metadata"
      }
    }
  ]
}
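If the config is built as a Python dict rather than loaded from a JSON file, the boolean is written True instead of true. The sketch below shows only the transform entry from the example; the other config parameters elided above are still required in a real config, and the variable names are illustrative.

```python
import json

# The PDF transform entry from the example above, as a Python dict.
pdf_transform = {
    "name": "pdf",
    "params": {
        "file_path_column": "file_path",
        "ocr_enabled": True,  # JSON `true` becomes Python `True`
        "page_format": "Page {page_num}: {page_content}",
        "page_sep": "\n\n",
    },
    "output_columns": {
        "content_column": "content",
        "metadata_column": "metadata",
    },
}

# Only the "transforms" key is shown; a full config needs its other parameters too.
config_fragment = {"transforms": [pdf_transform]}

# Serializing back to JSON restores the lowercase boolean.
as_json = json.dumps(config_fragment, indent=2)
print(as_json)
```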

Run the transform

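from autolabel import LabelingAgent, AutolabelDataset

# config is the configuration shown above; "data.csv" is a placeholder dataset path
agent = LabelingAgent(config)
ds = AutolabelDataset("data.csv", config=config)
ds = agent.transform(ds)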

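This runs the transformation. The extracted text is written to the column named by content_column (content in the example above), and the page count to the column named by metadata_column. Both are available through ds.df on the AutolabelDataset.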