How to Build a LangChain Chatbot Pt. 1: Using LangChain’s S3DirectoryLoader with FlashBlade

Follow our step-by-step guide to build a Q&A LangChain chatbot using a Pure Storage FlashBlade object storage S3 bucket.

With the rising popularity of generative artificial intelligence (AI) and frameworks like LangChain, companies and teams are in a race to leverage the technology against their repositories of data. LangChain is a powerful open source framework that simplifies the development of large language model (LLM) applications such as chatbots, generative question answering, summarization, code analysis, and more.

In this first post of a three-part series, we’ll demonstrate how to create a Q&A chatbot with LangChain and a Pure Storage FlashBlade® S3 bucket.

Step 1: FlashBlade Prep

To create a chatbot that can answer questions from a repository of data on a FlashBlade S3 bucket, you’ll need to load data and get it ready for chunking, embedding, and indexing in a vectorstore for similarity searching.

Let’s start with the FlashBlade side before we dive into the coding portion. In most cases, you would already have a FlashBlade S3 bucket with data residing in it, so the following sub-steps are complete:

Create a FlashBlade S3 Account, User, and Bucket
Securely store a FlashBlade S3 User secret key and access key
Configure a minimum of 1 Data VIP on FlashBlade. This will be the endpoint IP for LangChain configuration later

The above information can be gathered through the FlashBlade GUI, the CLI, or API calls.
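One way to keep the access key and secret key out of source code is to read them from environment variables at startup. The sketch below is a minimal illustration, not a FlashBlade or LangChain API: the variable names (FB_BUCKET, FB_ACCESS_KEY, FB_SECRET_KEY, FB_DATA_VIP) and the FlashBladeS3Settings class are hypothetical and chosen only for this example.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlashBladeS3Settings:
    """Connection details gathered in Step 1 (hypothetical container)."""
    bucket: str
    access_key: str
    # Keep the secret out of repr() so it does not leak into logs.
    secret_key: str = field(repr=False)
    endpoint_url: str = ""


def load_flashblade_settings(env=os.environ) -> FlashBladeS3Settings:
    """Read settings from environment variables; the names are assumptions."""
    required = ("FB_BUCKET", "FB_ACCESS_KEY", "FB_SECRET_KEY", "FB_DATA_VIP")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return FlashBladeS3Settings(
        bucket=env["FB_BUCKET"],
        access_key=env["FB_ACCESS_KEY"],
        secret_key=env["FB_SECRET_KEY"],
        # The Data VIP becomes the S3 endpoint URL used in Step 2.
        endpoint_url="https://" + env["FB_DATA_VIP"],
    )
```

The resulting values can then be passed to the loader in Step 2 in place of the literal placeholder strings.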

Step 2: LangChain Configuration and Data Loading

First things first, make sure LangChain, unstructured, and boto3 are installed.

pip install langchain unstructured boto3

Now we can start our Python application by importing the LangChain S3DirectoryLoader, initializing the loader with our FlashBlade information, and loading the bucket data as a list:

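from langchain.document_loaders import S3DirectoryLoader

# Point the loader at the FlashBlade bucket instead of AWS S3.
loader = S3DirectoryLoader(
    "FB Bucket Name",                            # bucket created in Step 1
    aws_access_key_id="FB User Access Key",
    aws_secret_access_key="FB User Secret Key",
    endpoint_url="https://FB Data VIP Address",  # FlashBlade Data VIP
)

# Download and parse every object in the bucket into a list of Documents.
documents = loader.load()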

Running this code returns a list similar to this, with one Document entry for each document in the bucket:

[Document(page_content='I have many leather-bound books and my apartment smells of rich mahogany.', metadata={'source': 's3://flashblade-bucket/anchorman.docx'}),
Document(page_content='I award you no points, and may God have mercy on your soul.', metadata={'source': 's3://flashblade-bucket/billymadison.docx'})]
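Before moving on to chunking, it can help to sanity-check what was loaded. The following is a minimal sketch that counts documents per file extension and totals the extracted text. So that it runs without LangChain installed, it uses a small stand-in Document class with the same two attributes shown in the output above (page_content and metadata); the summarize helper is a hypothetical name, not part of LangChain.

```python
from collections import Counter
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Document:
    """Stand-in exposing the same two attributes as the loaded documents."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(documents):
    """Return (documents per file extension, total characters of text)."""
    by_extension = Counter(
        PurePosixPath(doc.metadata.get("source", "")).suffix or "(none)"
        for doc in documents
    )
    total_chars = sum(len(doc.page_content) for doc in documents)
    return by_extension, total_chars


# Sample data copied from the example output above.
documents = [
    Document(
        "I have many leather-bound books and my apartment smells of rich mahogany.",
        {"source": "s3://flashblade-bucket/anchorman.docx"},
    ),
    Document(
        "I award you no points, and may God have mercy on your soul.",
        {"source": "s3://flashblade-bucket/billymadison.docx"},
    ),
]

by_extension, total_chars = summarize(documents)
print(by_extension, total_chars)
```

With real data, the same function should work on the list returned by loader.load(), since it only reads page_content and metadata["source"].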

Since the S3DirectoryLoader is using boto3 under the hood, there are some parameters we can change to increase the throughput performance coming from the FlashBlade. Here’s an example of the settings:

max_concurrent_requests = 1000
max_queue_size = 10000
multipart_threshold = 64MB
multipart_chunksize = 16MB
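The setting names above match the AWS CLI's S3 configuration keys. In boto3 itself, the closest equivalents appear to be keyword arguments of boto3.s3.transfer.TransferConfig, which takes sizes in bytes. The sketch below converts the values under two assumptions: that "MB" means 1024 * 1024 bytes, and that max_queue_size has no direct TransferConfig counterpart, so it is left out. The helper names are hypothetical, and how to hand such a config to S3DirectoryLoader is not covered here.

```python
import re

# Assumption: sizes use binary multiples, as in the AWS CLI.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as '64MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def transfer_kwargs(settings: dict) -> dict:
    """Map the settings above onto TransferConfig keyword names."""
    return {
        "max_concurrency": int(settings["max_concurrent_requests"]),
        "multipart_threshold": parse_size(settings["multipart_threshold"]),
        "multipart_chunksize": parse_size(settings["multipart_chunksize"]),
        # max_queue_size is intentionally omitted (no direct equivalent assumed).
    }


def build_transfer_config(settings: dict):
    """Create a boto3 TransferConfig; boto3 is imported only when called."""
    from boto3.s3.transfer import TransferConfig

    return TransferConfig(**transfer_kwargs(settings))


settings = {
    "max_concurrent_requests": "1000",
    "max_queue_size": "10000",
    "multipart_threshold": "64MB",
    "multipart_chunksize": "16MB",
}
kwargs = transfer_kwargs(settings)
```

A config built this way can be passed as the Config argument to boto3 transfer calls such as download_file when working with the bucket directly.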

For more performance-based analysis, check out this blog written by Joshua Robinson that compares performance across several S3 transfer tools. (Spoiler: boto3 is not the fastest option.)

As LangChain evolves, it would be useful for S3DirectoryLoader to support additional data transfer tools, such as s5cmd.

Stay Tuned for More Tutorials

We now have our LangChain code connecting to an on-premises, high-performance FlashBlade without a large lift! In the next installment of this blog series, we’ll cover taking this newly loaded data and chunking it up, embedding the chunks, creating the vectorstore, and persisting that index to storage.

By using FlashBlade as the fast, scalable data platform underneath this chatbot framework, we can keep our data in a centralized location that both ingests data from various sources and retrieves large amounts of it quickly. That lets AI practitioners use in-house data sets, rich in business context, to train and deploy more accurate models.

Written By: Jonathan Cardadeiro
