WebBaseLoader
This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages, look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.
If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using FireCrawlLoader or the faster option SpiderLoader.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| WebBaseLoader | langchain-community | ✅ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| WebBaseLoader | ✅ | ✅ |
Setup
Credentials
WebBaseLoader does not require any credentials.
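If you want to identify your scraper to the sites you fetch, you can set the USER_AGENT environment variable before creating a loader; langchain-community reads it when building request headers and warns when it is unset (exact behavior may vary by version). The value below is just an illustrative placeholder:
import os

# Optional: identify your scraper; langchain-community picks this up
# for the User-Agent request header (behavior is version-dependent)
os.environ["USER_AGENT"] = "my-langchain-app/0.1"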
Installation
To use WebBaseLoader, you first need to install the langchain-community Python package, along with beautifulsoup4 for HTML parsing.
%pip install -qU langchain-community beautifulsoup4
Initialization
Now we can instantiate our loader object and load documents:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.example.com/")
To bypass SSL verification errors during fetching, you can set the "verify" option:
loader.requests_kwargs = {'verify':False}
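You can also limit which parts of the page get parsed by passing BeautifulSoup arguments through bs_kwargs, which the loader forwards to the BeautifulSoup constructor. A minimal sketch using a SoupStrainer (a standard beautifulsoup4 feature; check your installed versions):
import bs4

# Only parse <p> tags; everything else is skipped, which trims
# navigation and boilerplate out of page_content
loader = WebBaseLoader(
    "https://www.example.com/",
    bs_kwargs={"parse_only": bs4.SoupStrainer("p")},
)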
Initialization with multiple pages
You can also pass in a list of pages to load from.
loader_multiple_pages = WebBaseLoader(
["https://www.example.com/", "https://google.com"]
)
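Calling load on this loader returns one Document per URL, fetched sequentially in the order given:
docs = loader_multiple_pages.load()
print(len(docs))  # 2, one Document per URL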
Load
docs = loader.load()
docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
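Since WebBaseLoader supports lazy loading (see the features table above), you can also stream documents one at a time instead of materializing the full list, which helps when loading many pages:
pages = []
for doc in loader.lazy_load():
    # process or persist each page here before the next one is fetched
    pages.append(doc)
print(pages[0].metadata)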
Load multiple URLs concurrently
You can speed up the scraping process by scraping and parsing multiple URLs concurrently.
There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can change the requests_per_second parameter to increase the max concurrent requests. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!
%pip install -qU nest_asyncio
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00, 8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms ')]
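If you are working inside async code, recent versions of the loader also expose alazy_load, an async generator that yields Documents as pages are fetched. A minimal sketch (availability depends on your langchain-community version):
import asyncio

async def fetch_docs():
    results = []
    # pages are fetched concurrently, subject to requests_per_second
    async for doc in loader.alazy_load():
        results.append(doc)
    return results

docs = asyncio.run(fetch_docs())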