Webpage Transform
The Webpage transform supports loading and processing webpage urls. Given a url, this transform will send the request to load the webpage and then parse the webpage returned to collect the text to send to the LLM.
Use this transform yourself here in a Colab -
In order to use this transform, use the following steps:
Installation¶
Use the following command to download all dependencies for the webpage transform. beautifulsoup4
must be version 4.12.2
or higher.
Make sure to do this before running the transform.
Parameters for this transform¶
url_column: str (Required)
: The column to retrieve the url from. This is the webpage that will be loaded by the transform.timeout: int (Optional: Default = 5)
: The timeout to wait until for loading the webpage. The request to the webpage will timeout after this. We will log an error and send an empty response after the timeout is reached.headers: Dict[str,str] (Optional: Default = {})
: Any headers that need to be passed into the webpage load request. Underneath we use requests to get the webpage and the headers are passed to request.
Using the transform¶
Below is an example of a webpage transform to extract text from a webpage:
{
..., # other config parameters
"transforms": [
..., # other transforms
{
"name": "webpage_transform",
"params": {
"url_column": "url"
},
"output_columns": {
"content_column": "webpage_content",
}
}
]
}
Run the transform¶
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = agent.transform(ds)
This runs the transformation. We will see the content in the webpage_content column. Access this using ds.df
in the AutolabelDataset.