September 26, 2023

Disclaimer

The stock data used in this article is entirely fictitious. It is purely for demo purposes. Please don’t use this data to make any financial decisions.

This article will show how to connect a Kafka broker, streaming example stock tick data, to SingleStoreDB. We’ll then query the data using English sentences through LangChain, which provides a basic question-and-answer capability for the tick data. We’ll build a Python application, through several design iterations, to use OpenAI’s Whisper to ask questions through speech and use speech synthesis to reply.

The notebook file, SQL, and Python code are available on GitHub.

Introduction

The ability to ask questions of a database system using natural language is nothing new. However, it has become much easier to implement with modern tools like LangChain and OpenAI’s Whisper. In this article, we’ll see how.

Create a SingleStoreDB Cloud Account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We’ll use the following settings:

  • Workspace Group Name: Whisper Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: whisper-demo
  • Size: S-00
  • Settings: None selected

Once the workspace is available, we’ll make a note of our password and host. The host will be available from Whisper Demo Group > Overview > Workspaces > whisper-demo > Connect > Connect Directly > SQL IDE > Host. We’ll need this information later for a Python application. We’ll temporarily allow access from anywhere by configuring the firewall under Whisper Demo Group > Firewall.

Create a Database and Tables

From the left navigation pane, we’ll select DEVELOP > SQL Editor to create a timeseries_db database and the tick and stock_sentiment tables, as follows:

CREATE DATABASE IF NOT EXISTS timeseries_db;

USE timeseries_db;

CREATE TABLE IF NOT EXISTS tick (
  symbol VARCHAR(10),
  ts DATETIME SERIES TIMESTAMP,
  open NUMERIC(18, 2),
  high NUMERIC(18, 2),
  low NUMERIC(18, 2),
  price NUMERIC(18, 2),
  volume INT,
  KEY(ts)
);

CREATE TABLE IF NOT EXISTS stock_sentiment (
  headline VARCHAR(250),
  positive FLOAT,
  negative FLOAT,
  neutral FLOAT,
  url TEXT,
  publisher VARCHAR(30),
  ts DATETIME,
  symbol VARCHAR(10)
);

Create a Pipeline

Pipelines allow us to create streaming ingest feeds from various sources, such as Kafka, S3 and HDFS, using a single command. With pipelines, we can also perform ETL operations. For our use case, we’ll create a simple pipeline in SingleStoreDB as follows:

CREATE PIPELINE tick
AS LOAD DATA KAFKA 'public-kafka.memcompute.com:9092/stockticker'
BATCH_INTERVAL 2500
INTO TABLE tick
FIELDS TERMINATED BY ','
(symbol,ts,open,high,low,price,volume);

We’ll control the rate of data ingestion using the BATCH_INTERVAL. Initially, we’ll set this to 2500 milliseconds.
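
If the ingestion rate needs tuning later, the interval can be changed with an ALTER statement. A minimal sketch using standard SingleStoreDB pipeline syntax; the value of 5000 is purely illustrative, and a running pipeline must be stopped before it can be altered:

ALTER PIPELINE tick SET BATCH_INTERVAL 5000;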

We’ll configure the pipeline to start from the earliest offset, as follows:

ALTER PIPELINE tick SET OFFSETS EARLIEST;

and we’ll test the pipeline before we start running it, as follows:

TEST PIPELINE tick LIMIT 1;

The output should be similar to the following:

+--------+---------------------+-------+-------+-------+-------+--------+
| symbol | ts                  | open  | high  | low   | price | volume |
+--------+---------------------+-------+-------+-------+-------+--------+
| AIV    | 2023-09-05 06:47:53 | 44.89 | 44.89 | 44.88 | 44.89 | 719    |
+--------+---------------------+-------+-------+-------+-------+--------+

Remember, this is fictitious data.

We’ll now start the pipeline:
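
START PIPELINE tick;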

Load Stock Sentiment Data

We’ll now load the data into the stock_sentiment table. The data are derived from a GitHub Gist and are a small subset of the data from Kaggle. The CSV file already includes the sentiment values. We’ll download this file and then use a MySQL client to connect to SingleStoreDB Cloud, as follows:

mysql --local-infile -u admin -h <host> -P 3306 --default-auth=mysql_native_password -p

The <host> should be replaced with the value we obtained from SingleStoreDB Cloud earlier. We’ll load the data into SingleStoreDB from the CSV file, as follows:

USE timeseries_db;

LOAD DATA LOCAL INFILE '/path/to/stock_sentiment.csv'
INTO TABLE stock_sentiment
IGNORE 1 LINES
COLUMNS TERMINATED BY '\t';

The /path/to/ should be replaced with the actual path to where the CSV file is located.
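
A quick sanity check after the load is to count the rows in the table; this is just a generic verification query, not part of the original workflow:

SELECT COUNT(*) FROM stock_sentiment;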

Load the Notebook

We’ll use the notebook file available on GitHub. From the left navigation pane in SingleStoreDB Cloud, we’ll select Notebooks. In the top right of the web page is New Notebook, with a pulldown that has two options:

  1. New Notebook
  2. Import From File

We’ll select the second option, locate the notebook file we downloaded from GitHub, and load it into SingleStoreDB Cloud.

Run the Notebook

We’ll start by selecting the Connection (whisper-demo) and Database (timeseries_db) using the drop-down menus above the notebook.

Analyzing Time Series Data

Part 1 of the notebook contains a set of Time Series operations on the data in the tick table. These operations were described in greater detail in a previous article.

LangChain OnlinePDFLoader

Part 2 of the notebook loads the contents of a PDF document and vector embeddings into a table called fintech_docs. These operations were described in greater detail in a previous article. Replace <PDF document URL> with the link to your chosen FinTech document:

from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("<PDF document URL>")

data = loader.load()

The Fully Qualified Domain Name (FQDN) where the PDF file is located must be added to the firewall by selecting the Edit Firewall option in the top right.
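
Since the previous article covers Part 2 in detail, only a brief sketch of the general approach is repeated here: the loaded pages are split into chunks, embedded and stored in the fintech_docs table, and a question is answered by retrieving the most similar chunks and passing them to ChatGPT. The class names come from the LangChain and OpenAI libraries of that period; the chunk sizes and the connection assumption are illustrative rather than the notebook’s exact code:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import SingleStoreDB

# split the PDF contents into smaller chunks for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)
docs = splitter.split_documents(data)

# store the chunks and their vector embeddings in the fintech_docs table;
# this assumes the SingleStoreDB connection details are available to the
# vectorstore, for example via the SINGLESTOREDB_URL environment variable
docsearch = SingleStoreDB.from_documents(
    docs, OpenAIEmbeddings(), table_name = "fintech_docs"
)

# retrieve the most relevant chunks to pass to ChatGPT as context
results = docsearch.similarity_search(
    "What are the best investment opportunities in Blockchain?"
)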

We’ll use ChatGPT to answer questions about the PDF file, such as:

"What are the best investment opportunities in Blockchain?"

LangChain SQL Agent

Part 3 of the notebook contains LangChain agent operations on the data in the tick and stock_sentiment tables. This will be the main focus of this article.

First, we’ll install some libraries:

!pip install langchain --quiet
!pip install openai --quiet

Now, we’ll enter our OpenAI API Key:

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

Next, we’ll need some imports:

from langchain import OpenAI, SQLDatabase
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents import create_sql_agent

We’ll configure the LangChain toolkit and agent, as follows:

# connection_url is the connection string for the selected connection and
# database (SingleStoreDB Cloud notebooks provide this variable automatically)
db = SQLDatabase.from_uri(connection_url)

llm = OpenAI(temperature = 0, verbose = False)

toolkit = SQLDatabaseToolkit(db = db, llm = llm)

agent_executor = create_sql_agent(
  llm = OpenAI(temperature = 0), toolkit = toolkit, verbose = False
)

We’ll now test the code using an example query, as follows:

agent_executor.run(
"""
    From the tick table, which stock symbol saw the least volatility in share trading in the dataset?
"""
)

Here is some example output:

'The stock symbol with the least volatility in share trading in the dataset is FTR.'

We’ll make this a little more interactive using the following code:

query = input("Please enter your question:")
agent_executor.run(query)

We’ll test this with an example question:

Using the symbol A, what is the most positive sentiment in the stock_sentiment table and the current best price for this symbol from the tick table?

Here is some example output:

'The most positive sentiment for symbol A is 0.9576 and the current best price is 46.43.'
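
For comparison, an equivalent hand-written SQL query might look something like the following. This is an illustration of one possible reading of the question, not the SQL that the agent actually generated:

SELECT
    (SELECT MAX(positive) FROM stock_sentiment WHERE symbol = 'A') AS most_positive_sentiment,
    (SELECT MAX(price) FROM tick WHERE symbol = 'A') AS best_price;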

To see the chain output, we’ll set verbose to True, as follows:

agent_executor = create_sql_agent(
  llm = OpenAI(temperature = 0), toolkit = toolkit, verbose = True
)

An example of chain output from another database was shown in a previous article.

Bonus: Build a Visual Python Application

We’ll now build a simple Python application that uses OpenAI’s Whisper to ask questions about the database system. A great article inspired this application.

Install the Required Software

For this article, this is the software that was required on a clean installation of Ubuntu 22.04.2 running in a VMware Fusion Virtual Machine:

sudo apt install ffmpeg
sudo apt install libespeak1
sudo apt install portaudio19-dev
sudo apt install python3-tk
sudo apt install python3-pil python3-pil.imagetk

as well as the following packages:

langchain
matplotlib
openai
openai-whisper
pyaudio
pymysql
pyttsx3
wave

These can be found in the requirements.txt file on GitHub. Install them as follows:

pip install -r requirements.txt

openai-whisper may take a while to install.

We’ll need to provide an OpenAI API Key in our environment. For example:

export OPENAI_API_KEY="<OpenAI API Key>" 

Replace <OpenAI API Key> with your key.

In each of the following applications, we have the following code:

s2_password = "<password>"
s2_host = "<host>"

We’ll replace <password> and <host> with the values that we saved earlier from SingleStoreDB Cloud.
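
These values are used to build the database connection inside each application. A minimal sketch of what that might look like, assuming the admin user, the default port 3306 and the timeseries_db database created earlier:

from langchain import SQLDatabase

connection_url = f"mysql+pymysql://admin:{s2_password}@{s2_host}:3306/timeseries_db"
db = SQLDatabase.from_uri(connection_url)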

First Iteration

Let’s start with a simple visual Python application using Tkinter. We’ll also add voice recognition using OpenAI’s Whisper. The application allows up to 20 seconds of recorded speech. It can be run as follows:

python3 record-transcribe.py

Example output is shown in Figure 1.

Figure 1. First Iteration.
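
At its core, the first iteration records a short clip with PyAudio, writes it to a WAV file and hands it to the local Whisper model, whose transcription is then passed to the LangChain agent. The following is a minimal sketch of that flow rather than the full Tkinter application; the filename, model size and recording length are illustrative assumptions:

import pyaudio
import wave
import whisper

CHUNK, RATE, SECONDS, FILENAME = 1024, 44100, 20, "recording.wav"

# record up to 20 seconds of audio from the default microphone
audio = pyaudio.PyAudio()
stream = audio.open(format = pyaudio.paInt16, channels = 1, rate = RATE,
                    input = True, frames_per_buffer = CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()

# save the recording as a WAV file
with wave.open(FILENAME, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
audio.terminate()

# transcribe locally with Whisper and hand the text to the LangChain agent
# (agent_executor is configured as in the notebook section above)
model = whisper.load_model("base")
question = model.transcribe(FILENAME)["text"]
answer = agent_executor.run(question)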

Second Iteration

In the next iteration, we’ll add an audio waveform. We’ll run the program as follows:

python3 record-transcribe-visualise.py

Example output is shown in Figure 2.

Figure 2. Second Iteration.
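
Drawing the waveform needs nothing more than the raw samples from the recorded WAV file. A minimal sketch using matplotlib (NumPy is installed alongside it), reusing the hypothetical recording.wav file from the previous sketch; in the application itself the figure is embedded in the Tkinter window:

import wave

import matplotlib.pyplot as plt
import numpy as np

# read the recorded samples back from the WAV file
with wave.open("recording.wav", "rb") as wf:
    rate = wf.getframerate()
    samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype = np.int16)

# plot amplitude against time
plt.plot(np.arange(len(samples)) / rate, samples)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()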

Third Iteration

In the third and final iteration, we’ll remove the application’s text-based response and replace it with speech synthesis. We’ll run the program as follows:

python3 record-transcribe-visualise-speak.py

Example output is shown in Figure 3.

Figure 3. Third Iteration.
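
The spoken response can be produced with pyttsx3, which relies on eSpeak on Linux (hence libespeak1 in the system packages above). A minimal sketch of the general idea, speaking the answer returned by the agent:

import pyttsx3

# speak the agent's answer instead of displaying it as text
engine = pyttsx3.init()
engine.say(answer)
engine.runAndWait()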

In the code for all three iterations, we can call the OpenAI Whisper API instead of using the local Whisper installation. To do this, we would uncomment these lines of code in the transcribe_audio function:

# transcript = openai.Audio.transcribe(
#     model = "whisper-1",
#     file = audio_file,
#     language = "en"
# )

And comment out this line of code:

transcript = model.transcribe(filename)

However, calling the OpenAI Whisper API would incur additional costs. The local Whisper installation already provides excellent results.

Summary

In this article, we’ve seen that without creating vector embeddings, we were able to access our data quite effectively using LangChain. However, our queries need to be more focused, and we need to understand the database schema before asking questions. Integrating a speech capability enables our applications to be more widely used.