Sunday, March 17, 2024

Natural Language to SQL generation - Part II

Using natural language to query the data is an easy way to answer business questions. Which we have covered the basics in the previous blog post, in this post we will see about how can this work on my data when tables and column names are meaningless? It is possible when using Autonomous database in place. There is no magic, if the tables and column names are not descriptive, then we can help the LLM interpret the meaning of tables and columns using the build-in database feature called “Comments”. Yes comments are descriptive notes about tables and column purpose or usage. And better the comment, the more likely the LLM will know how to use the table or column to generate the right query.
 
Let’s take an example, my database has three tables. their table and column names are meaningless.
 
rajesh@ATP19C> create table t1(
  2    c1 number,
  3    c2 varchar2(2),
  4    c3 varchar2(40),
  5    c4 varchar2(30),
  6    c5 number,
  7    c6 varchar2(20),
  8    c7 number,
  9    c8 varchar2(11),
 10    c9 number,
 11    c10 varchar2(40) );
 
rajesh@ATP19C> create table t2(
  2    c1 number,
  3    c2 varchar2(20),
  4    c3 varchar2(40),
  5    c4 varchar2(1),
  6    c5 number,
  7    c6 varchar2(20),
  8    c7 varchar2(40),
  9    c8 varchar2(10),
 10    c9 varchar2(30),
 11    c10 number ,
 12    c11 varchar2(40),
 13    c12 number,
 14    c13 number,
 15    c14 varchar2(25),
 16    c15 varchar2(30),
 17    c16 number,
 18    c17 varchar2(50),
 19    c18 varchar2(14),
 20    c19 number,
 21    c20 number,
 22    c21 Date,
 23    c22 Date,
 24    c23    varchar2(1)  );
 
rajesh@ATP19C> create table t3(
  2    c1 number,
  3    c2 number,
  4    c3 date,
  5    c4 number,
  6    c5 number,
  7    c6 number,
  8    c7 number );   
 
There is a zero chance, that a Natural language query will know that these tables represent, countries , customers and sales Information, we can fix that ambiguity by adding database comments.
 
rajesh@ATP19C> set feedback off
rajesh@ATP19C> insert /*+ append */ into t1 select * from sh.countries;
rajesh@ATP19C> insert /*+ append */ into t2 select * from sh.customers;
rajesh@ATP19C> insert /*+ append */ into t3 select * from sh.sales;
rajesh@ATP19C> commit;
rajesh@ATP19C>
rajesh@ATP19C> comment on table t1 is 'contains country name, region , subregion and ISO code';
rajesh@ATP19C> comment on column t1.c1 is 'country ids. use this column to join with other tables';
rajesh@ATP19C> comment on column t1.c2 is 'country ISO code';
rajesh@ATP19C> comment on column t1.c3 is 'country name';
rajesh@ATP19C> comment on column t1.c4 is 'country subregion name';
rajesh@ATP19C> comment on column t1.c5 is 'country subregion ids';
rajesh@ATP19C> comment on column t1.c6 is 'country region name';
rajesh@ATP19C> comment on column t1.c7 is 'country region ids';
rajesh@ATP19C>
rajesh@ATP19C>
rajesh@ATP19C> comment on table t2 is 'contains customer information like name, gender, date of birth, marital status, postal address';
rajesh@ATP19C> comment on column t2.c1 is 'customer ids. primary key column use this column to join with other tables';
rajesh@ATP19C> comment on column t2.c2 is 'first name of the customer';
rajesh@ATP19C> comment on column t2.c3 is 'last name of the customer';
rajesh@ATP19C> comment on column t2.c4 is 'customer gender';
rajesh@ATP19C> comment on column t2.c5 is 'customer year of birth';
rajesh@ATP19C> comment on column t2.c6 is 'customer marital status';
rajesh@ATP19C> comment on column t2.c7 is 'customer street address';
rajesh@ATP19C> comment on column t2.c8 is 'postal code of the customer';
rajesh@ATP19C> comment on column t2.c9 is 'city where the customer lives';
rajesh@ATP19C> comment on column t2.c11 is 'customer geography: state or province';
rajesh@ATP19C> comment on column t2.c13 is 'foreign key to the countries table';
rajesh@ATP19C> comment on column t2.c14 is 'customer main phone number';
rajesh@ATP19C> comment on column t2.c15 is 'customer income level';
rajesh@ATP19C> comment on column t2.c16 is 'customer credit limit';
rajesh@ATP19C> comment on column t2.c17 is 'customer email id';
rajesh@ATP19C>
rajesh@ATP19C> comment on table t3 is 'contains sales information about each customers';
rajesh@ATP19C> comment on column t3.c2 is 'foreign key to the customer table';
rajesh@ATP19C> comment on column t3.c6 is 'product quantity sold with the transaction';
rajesh@ATP19C> comment on column t3.c7 is 'invoiced amount to the customer';
rajesh@ATP19C>
rajesh@ATP19C>
 
 
that’s it, those meaning less tables and column names can be understood by LLM using Select AI. This Select AI “profile” encapsulates the information needed to interact with LLM. It includes details like AI provider, the model to use, the source tables used for natural language queries and whether comments should be passed to the model for SQL generation etc.
 
rajesh@ATP19C> begin
  2   dbms_cloud_ai.create_profile( 'MY_AI_PROFILE',
  3            '{"provider": "cohere",
  4            "credential_name": "MY_AI_CRED",
  5            "comments":"true",
  6            "object_list": [
  7                                        {"owner":"RAJESH","name":"T1"},
  8                                        {"owner":"RAJESH","name":"T2"},
  9                                        {"owner":"RAJESH","name":"T3"}
 10                                      ]      }' );
 11  end;
 12  /
 
PL/SQL procedure successfully completed.
 
Now we can start asking questions using natural language against the complex schema, even though the tables and column names are meaningless, the LLM is able to identify the tables and column names through comments and generate queries.
 
rajesh@ATP19C> exec dbms_cloud_ai.set_profile('MY_AI_PROFILE');
 
PL/SQL procedure successfully completed.
 
rajesh@ATP19C> Select AI showsql how many customers are there ;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT COUNT(*) AS customer_count
FROM RAJESH.T2
 
 
rajesh@ATP19C> Select AI showsql what are our total sales;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT SUM(T3.C7) AS TotalSales
FROM T1 T1 JOIN T2 T2 ON T1.C1 = T2.C13 JOIN T3 ON T2.C1 = T3.C2
 
 
rajesh@ATP19C> Select AI showsql list total sales by gender;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT t3.c4, SUM(t3.c6)
FROM rajesh.t3 JOIN rajesh.t2 ON rajesh.t3.c1 = rajesh.t2.c1
GROUP BY t3.c4
ORDER BY SUM(t3.c6) DESC
 
 
rajesh@ATP19C> Select AI showsql list customers born in the year 1980;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT t2.C16 AS customer_credit_limit, t2.C4 AS customer_gender, t2.C6 AS custo
mer_marital_status, t2.C11 AS customer_province, t2.C9 AS customer_city, t2.C15
AS customer_income_level, t2.C5 AS customer_year_of_birth, t2.C17 AS customer_em
ail, t2.C3 AS customer_last_name, t2.C2 AS customer_first_name
FROM rajesh.t2
WHERE t2.C5 = 1980
 
 
Properly describing your data will help you use natural language to get answers. Comments not only help an LLM successfully formulate queries, they also help you understand your data as well, starting with Oracle database 23c, not just comments even Annotations can help LLM to get the best out of data models, stay tuned.