WebBaseLoader
This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages, look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.
If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using FireCrawlLoader or the faster option SpiderLoader.
Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
WebBaseLoader | langchain-community | ✅ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support |
---|---|---|
WebBaseLoader | ✅ | ✅ |
Setup
Credentials
WebBaseLoader does not require any credentials.
Installation
To use the WebBaseLoader, you first need to install the langchain-community Python package.
%pip install -qU langchain-community beautifulsoup4
Initialization
Now we can instantiate our loader object and load documents:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.example.com/")
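If you need to customize the request headers (for example, to set a User-Agent), you can pass a header_template dictionary at construction time. A minimal sketch; the header value here is illustrative:
# Supply custom request headers via header_template
# (the User-Agent string below is just an illustrative placeholder)
loader_with_headers = WebBaseLoader(
    "https://www.example.com/",
    header_template={"User-Agent": "my-app/0.1"},
)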
To bypass SSL verification errors during fetching, you can set the "verify" option:
loader.requests_kwargs = {'verify': False}
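Because requests_kwargs is forwarded to the underlying HTTP request, other standard request options can be set the same way. A sketch, assuming a requests-style timeout parameter is honored:
# requests_kwargs is passed through to the HTTP call, so other
# standard options such as a timeout can be combined with verify
loader.requests_kwargs = {'verify': False, 'timeout': 10}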
Initialization with multiple pages
You can also pass in a list of pages to load from.
loader_multiple_pages = WebBaseLoader(
["https://www.example.com/", "https://google.com"]
)
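Calling load() on this loader returns one Document per URL:
# Each page becomes its own Document in the returned list
docs = loader_multiple_pages.load()
print(len(docs))  # 2, one Document per URL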
Load
docs = loader.load()
docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
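As the features table above indicates, WebBaseLoader also supports lazy loading, which is useful when loading many pages without holding every Document in memory at once. A short sketch using the standard lazy_load() method:
# lazy_load() yields Documents one at a time instead of
# materializing the whole list up front
for doc in loader.lazy_load():
    print(doc.metadata['source'])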
Load multiple URLs concurrently
You can speed up the scraping process by scraping and parsing multiple URLs concurrently.
There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can change the requests_per_second parameter to increase the max concurrent requests. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!
%pip install -qU nest_asyncio
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00, 8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms ')]
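If you are already inside an async context, you can also iterate pages as they arrive with alazy_load(), the async counterpart of lazy_load(). A minimal sketch:
# alazy_load() yields Documents asynchronously as pages finish downloading
async for doc in loader.alazy_load():
    print(doc.metadata['source'])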