LangChain in Chains #24: Utilizing DataFrame Agent

Analyzing Data in a Pandas DataFrame Using Agents

Okan Yenigün
Dev Genius

Photo by Rubaitul Azad on Unsplash

After covering the fundamentals of prompting, let’s explore agents further. In this section, we will learn how to analyze dataframes using LLMs with the help of an agent.

First, let’s load some dummy data for demonstration purposes. We will use a dataset from the pandas-dev GitHub account.

import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/baseball.csv"
)
df
Dataframe. Image by the author.

The create_pandas_dataframe_agent function, which ships in the langchain_experimental package, builds an agent that answers questions about a pandas DataFrame by writing and running Python code against it.

import os

from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "your-openai-key"  # placeholder; use your own key

# The agent executes model-generated Python code. Newer langchain_experimental
# releases may also require passing allow_dangerous_code=True to opt in.
agent = create_pandas_dataframe_agent(
    ChatOpenAI(temperature=0, model="gpt-4-turbo-2024-04-09"),
    df,
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
)

agent.invoke("how many rows are there?")

"""
> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'len(df)'}`


100The dataframe `df` contains 100 rows.

> Finished chain.
"""

"""
{'input': 'how many rows are there?',
'output': 'The dataframe `df` contains 100 rows.'}
"""

AgentType.OPENAI_FUNCTIONS tells the agent to use OpenAI's function-calling interface. Instead of producing free text that LangChain has to parse, the model returns a structured call that names a tool and its arguments. The trace above shows exactly that: the model invokes the python_repl_ast tool with the query len(df), LangChain runs that Python code against the DataFrame, and the model turns the result into the final answer.
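To make the tool step more concrete, here is a minimal sketch of what running such a query against a DataFrame could look like. It is not LangChain's actual implementation: run_tool_query and toy_df are hypothetical names invented for illustration, and the sketch has no sandboxing or error handling. It only assumes that the tool receives a string of Python code and returns the value of its last expression.

```python
import ast

import pandas as pd


def run_tool_query(query: str, df: pd.DataFrame):
    """Run model-generated code against `df`; return the value of its last expression.

    Hypothetical, simplified stand-in for a Python REPL tool (no sandboxing).
    """
    namespace = {"df": df, "pd": pd}
    tree = ast.parse(query)
    if not tree.body:
        return None
    *setup, last = tree.body
    # Execute every statement except the last one (imports, assignments, ...).
    if setup:
        exec(compile(ast.Module(body=setup, type_ignores=[]), "<tool>", "exec"), namespace)
    # If the last statement is an expression, evaluate it so its value can be returned.
    if isinstance(last, ast.Expr):
        return eval(compile(ast.Expression(body=last.value), "<tool>", "eval"), namespace)
    # Otherwise just execute it; there is no value to return.
    exec(compile(ast.Module(body=[last], type_ignores=[]), "<tool>", "exec"), namespace)
    return None


# Hypothetical toy data standing in for the baseball DataFrame.
toy_df = pd.DataFrame({"player": ["a", "b", "a"], "year": [2006, 2006, 2007]})

# These dicts mimic the tool arguments shown in the verbose trace.
for tool_args in ({"query": "len(df)"}, {"query": "df['player'].nunique()"}):
    print(tool_args["query"], "->", run_tool_query(tool_args["query"], toy_df))
```

Because the code comes from a model and runs in your own process, treat the real agent as something that executes arbitrary code and only point it at data and environments you trust.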

agent.invoke("how many unique players are there?")

"""
> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df['player'].nunique()"}`


82There are 82 unique players in the dataframe.

> Finished chain.
"""

"""
{'input': 'how many unique players are there?',
'output': 'There are 82 unique players in the dataframe.'}
"""

agent.invoke("What is the average of year?")

"""
> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "import pandas as pd\ndf = pd.DataFrame({\n 'id': [88641, 88643, 88645, 88649, 88650],\n 'player': ['womacto01', 'schilcu01', 'myersmi01', 'helliri01', 'johnsra05'],\n 'year': [2006, 2006, 2006, 2006, 2006],\n 'stint': [2, 1, 1, 1, 1],\n 'team': ['CHN', 'BOS', 'NYA', 'MIL', 'NYA'],\n 'lg': ['NL', 'AL', 'AL', 'NL', 'AL'],\n 'g': [19, 31, 62, 20, 33],\n 'ab': [50, 2, 0, 3, 6],\n 'r': [6, 0, 0, 0, 0],\n 'h': [14, 1, 0, 0, 1],\n 'X2b': [1, 0, 0, 0, 0],\n 'X3b': [0, 0, 0, 0, 0],\n 'hr': [1, 0, 0, 0, 0],\n 'rbi': [2, 0, 0, 0, 0],\n 'sb': [1, 0, 0, 0, 0],\n 'cs': [1, 0, 0, 0, 0],\n 'bb': [4, 0, 0, 0, 0],\n 'so': [4, 1, 0, 2, 4],\n 'ibb': [0, 0, 0, 0, 0],\n 'hbp': [0, 0, 0, 0, 0],\n 'sh': [3, 0, 0, 0, 0],\n 'sf': [0, 0, 0, 0, 0],\n 'gidp': [0, 0, 0, 0, 0]\n})\ndf['year'].mean()"}`


2006.0The average of the year column in the dataframe is 2006.
"""

"""
{'input': 'What is the average of year?',
'output': 'The average of the year column in the dataframe is 2006.'}
"""
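Look closely at the last trace: rather than querying the existing df, the generated code rebuilt df from five rows (presumably the preview rows the agent places in the prompt) and took the mean of those. The answer of 2006 therefore describes only those five rows, so it is worth checking against the full data. The sketch below reproduces the effect with hypothetical data; full_df and preview_df are illustrative names, not part of the agent.

```python
import pandas as pd

# Hypothetical data: the first five rows share one year, later rows differ.
full_df = pd.DataFrame({"year": [2006] * 5 + [2007] * 5})

# What the generated code effectively did: rebuild a DataFrame from the preview rows.
preview_df = pd.DataFrame(full_df.head(5).to_dict(orient="list"))

preview_mean = preview_df["year"].mean()
full_mean = full_df["year"].mean()

print("mean over preview rows:", preview_mean)
print("mean over all rows:", full_mean)
if preview_mean != full_mean:
    print("The preview-based answer does not match the full data.")
```

With the real dataset, running df['year'].mean() yourself is a quick way to confirm or correct the agent's answer.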

We can build a simple Streamlit app around this agent. It accepts a CSV file and lets us ask questions about its contents.

UI. Image by the author.
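import os

import pandas as pd
import streamlit as st
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_openai import OpenAI

os.environ["OPENAI_API_KEY"] = "your-key"  # placeholder; use your own key
llm = OpenAI()

#### FRONT END
st.title("Data Analysis")

data = st.file_uploader("Upload CSV file", type="csv")

query = st.text_area("Enter your query")
button = st.button("Generate Response")

#### SERVICE
def get_response(df, query):
    # Build an agent for the uploaded DataFrame and run the user's question through it.
    # Newer langchain_experimental releases may also require allow_dangerous_code=True here.
    agent = create_pandas_dataframe_agent(llm, df, verbose=True)
    return agent.invoke(query)

if button:
    # Guard against clicking the button before a file or a query is provided.
    if data is None or not query.strip():
        st.warning("Upload a CSV file and enter a query first.")
    else:
        df = pd.read_csv(data)
        response = get_response(df, query)
        st.write(response["output"])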

The program reads the CSV file each time the button is clicked. This could be optimized with caching, but that is beyond the scope of this discussion.
