Sunday, March 17, 2024

Natural Language to SQL generation - Part II

Using natural language to query the data is an easy way to answer business questions. Which we have covered the basics in the previous blog post, in this post we will see about how can this work on my data when tables and column names are meaningless? It is possible when using Autonomous database in place. There is no magic, if the tables and column names are not descriptive, then we can help the LLM interpret the meaning of tables and columns using the build-in database feature called “Comments”. Yes comments are descriptive notes about tables and column purpose or usage. And better the comment, the more likely the LLM will know how to use the table or column to generate the right query.
 
Let’s take an example, my database has three tables. their table and column names are meaningless.
 
rajesh@ATP19C> create table t1(
  2    c1 number,
  3    c2 varchar2(2),
  4    c3 varchar2(40),
  5    c4 varchar2(30),
  6    c5 number,
  7    c6 varchar2(20),
  8    c7 number,
  9    c8 varchar2(11),
 10    c9 number,
 11    c10 varchar2(40) );
 
rajesh@ATP19C> create table t2(
  2    c1 number,
  3    c2 varchar2(20),
  4    c3 varchar2(40),
  5    c4 varchar2(1),
  6    c5 number,
  7    c6 varchar2(20),
  8    c7 varchar2(40),
  9    c8 varchar2(10),
 10    c9 varchar2(30),
 11    c10 number ,
 12    c11 varchar2(40),
 13    c12 number,
 14    c13 number,
 15    c14 varchar2(25),
 16    c15 varchar2(30),
 17    c16 number,
 18    c17 varchar2(50),
 19    c18 varchar2(14),
 20    c19 number,
 21    c20 number,
 22    c21 Date,
 23    c22 Date,
 24    c23    varchar2(1)  );
 
rajesh@ATP19C> create table t3(
  2    c1 number,
  3    c2 number,
  4    c3 date,
  5    c4 number,
  6    c5 number,
  7    c6 number,
  8    c7 number );   
 
There is a zero chance, that a Natural language query will know that these tables represent, countries , customers and sales Information, we can fix that ambiguity by adding database comments.
 
rajesh@ATP19C> set feedback off
rajesh@ATP19C> insert /*+ append */ into t1 select * from sh.countries;
rajesh@ATP19C> insert /*+ append */ into t2 select * from sh.customers;
rajesh@ATP19C> insert /*+ append */ into t3 select * from sh.sales;
rajesh@ATP19C> commit;
rajesh@ATP19C>
rajesh@ATP19C> comment on table t1 is 'contains country name, region , subregion and ISO code';
rajesh@ATP19C> comment on column t1.c1 is 'country ids. use this column to join with other tables';
rajesh@ATP19C> comment on column t1.c2 is 'country ISO code';
rajesh@ATP19C> comment on column t1.c3 is 'country name';
rajesh@ATP19C> comment on column t1.c4 is 'country subregion name';
rajesh@ATP19C> comment on column t1.c5 is 'country subregion ids';
rajesh@ATP19C> comment on column t1.c6 is 'country region name';
rajesh@ATP19C> comment on column t1.c7 is 'country region ids';
rajesh@ATP19C>
rajesh@ATP19C>
rajesh@ATP19C> comment on table t2 is 'contains customer information like name, gender, date of birth, marital status, postal address';
rajesh@ATP19C> comment on column t2.c1 is 'customer ids. primary key column use this column to join with other tables';
rajesh@ATP19C> comment on column t2.c2 is 'first name of the customer';
rajesh@ATP19C> comment on column t2.c3 is 'last name of the customer';
rajesh@ATP19C> comment on column t2.c4 is 'customer gender';
rajesh@ATP19C> comment on column t2.c5 is 'customer year of birth';
rajesh@ATP19C> comment on column t2.c6 is 'customer marital status';
rajesh@ATP19C> comment on column t2.c7 is 'customer street address';
rajesh@ATP19C> comment on column t2.c8 is 'postal code of the customer';
rajesh@ATP19C> comment on column t2.c9 is 'city where the customer lives';
rajesh@ATP19C> comment on column t2.c11 is 'customer geography: state or province';
rajesh@ATP19C> comment on column t2.c13 is 'foreign key to the countries table';
rajesh@ATP19C> comment on column t2.c14 is 'customer main phone number';
rajesh@ATP19C> comment on column t2.c15 is 'customer income level';
rajesh@ATP19C> comment on column t2.c16 is 'customer credit limit';
rajesh@ATP19C> comment on column t2.c17 is 'customer email id';
rajesh@ATP19C>
rajesh@ATP19C> comment on table t3 is 'contains sales information about each customers';
rajesh@ATP19C> comment on column t3.c2 is 'foreign key to the customer table';
rajesh@ATP19C> comment on column t3.c6 is 'product quantity sold with the transaction';
rajesh@ATP19C> comment on column t3.c7 is 'invoiced amount to the customer';
rajesh@ATP19C>
rajesh@ATP19C>
 
 
that’s it, those meaning less tables and column names can be understood by LLM using Select AI. This Select AI “profile” encapsulates the information needed to interact with LLM. It includes details like AI provider, the model to use, the source tables used for natural language queries and whether comments should be passed to the model for SQL generation etc.
 
rajesh@ATP19C> begin
  2   dbms_cloud_ai.create_profile( 'MY_AI_PROFILE',
  3            '{"provider": "cohere",
  4            "credential_name": "MY_AI_CRED",
  5            "comments":"true",
  6            "object_list": [
  7                                        {"owner":"RAJESH","name":"T1"},
  8                                        {"owner":"RAJESH","name":"T2"},
  9                                        {"owner":"RAJESH","name":"T3"}
 10                                      ]      }' );
 11  end;
 12  /
 
PL/SQL procedure successfully completed.
 
Now we can start asking questions using natural language against the complex schema, even though the tables and column names are meaningless, the LLM is able to identify the tables and column names through comments and generate queries.
 
rajesh@ATP19C> exec dbms_cloud_ai.set_profile('MY_AI_PROFILE');
 
PL/SQL procedure successfully completed.
 
rajesh@ATP19C> Select AI showsql how many customers are there ;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT COUNT(*) AS customer_count
FROM RAJESH.T2
 
 
rajesh@ATP19C> Select AI showsql what are our total sales;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT SUM(T3.C7) AS TotalSales
FROM T1 T1 JOIN T2 T2 ON T1.C1 = T2.C13 JOIN T3 ON T2.C1 = T3.C2
 
 
rajesh@ATP19C> Select AI showsql list total sales by gender;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT t3.c4, SUM(t3.c6)
FROM rajesh.t3 JOIN rajesh.t2 ON rajesh.t3.c1 = rajesh.t2.c1
GROUP BY t3.c4
ORDER BY SUM(t3.c6) DESC
 
 
rajesh@ATP19C> Select AI showsql list customers born in the year 1980;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT t2.C16 AS customer_credit_limit, t2.C4 AS customer_gender, t2.C6 AS custo
mer_marital_status, t2.C11 AS customer_province, t2.C9 AS customer_city, t2.C15
AS customer_income_level, t2.C5 AS customer_year_of_birth, t2.C17 AS customer_em
ail, t2.C3 AS customer_last_name, t2.C2 AS customer_first_name
FROM rajesh.t2
WHERE t2.C5 = 1980
 
 
Properly describing your data will help you use natural language to get answers. Comments not only help an LLM successfully formulate queries, they also help you understand your data as well, starting with Oracle database 23c, not just comments even Annotations can help LLM to get the best out of data models, stay tuned.
 

Monday, March 11, 2024

Natural Language to SQL generation - Part I

Oracle Autonomous database enable you to query data using Natural language, by combining generative AI large language models (LLM) with Oracle SQL empowers you to describe what we want (declarative intent) and let the database generate the SQL query relevant to your schema.
 
Enabling this feature is a new package DBMS_CLOUD_AI available in autonomous database, that enables the use of LLM for generating SQL’s from natural language prompts, this package provides the access to the user specified LLM along with the knowledge of user accessible database metadata. This enables producing runnable SQL queries applicable to the schema.  In this blogpost we will see how to get started with Select AI in the autonomous database.
 
Sign into your autonomous database instance with admin privileged account and add your user account to the ACL list and grant access to DBMS_CLOUD_AI package.
 
admin@ATP19C> grant connect,resource to rajesh identified by "*************";
 
Grant succeeded.
 
admin@ATP19C> grant execute on dbms_cloud_ai to rajesh;
 
Grant succeeded.
 
admin@ATP19C> begin
  2    dbms_network_acl_admin.append_host_ace(
  3        host => 'api.cohere.ai',
  4        ace  => xs$ace_type(privilege_list => xs$name_list('http'),
  5                            principal_name => 'RAJESH',
  6                            principal_type => xs_acl.ptype_db));
  7  end;
  8  /
 
PL/SQL procedure successfully completed.
 
admin@ATP19C>
 
for this setup I will be using the Cohere as “AI Provider”. Although openAI an Azure openAI also currently supported in Autonomous database.
 
To enable Cohere to generate SQL from natural language prompts, obtain API keys from Cohere free Trial account (https://dashboard.cohere.com/) and click on the Dashboard, and click the API keys on the left navigation.
 
 
Then create a database credential to your AI provider account, in case of Cohere, the password is uniquely generated token for API usage.



 
admin@ATP19C> conn rajesh/"*************"@atp19c_vpn
Connected.
rajesh@ATP19C>
rajesh@ATP19C> begin
  2     dbms_cloud.create_credential(
  3             credential_name => 'MY_AI_CRED',
  4             username => <your_signin_user_account_for_cohere>,
  5             password => <your_uniquely_generated_token_for_API_usage>);
  6  end;
  7  /
 
PL/SQL procedure successfully completed.
 
Then, create your AI profile, with the list of schemas to consider and optionally any tables.
 
rajesh@ATP19C> begin
  2   dbms_cloud_ai.create_profile( 'MY_AI_PROFILE',
  3              '{"provider": "cohere",
  4             "credential_name": "MY_AI_CRED",
  5             "object_list": [
  6                                               {"owner":"SH","name":"PROMOTIONS"},
  7                                               {"owner":"SH","name":"PRODUCTS"},
  8                                               {"owner":"SH","name":"CUSTOMERS"},
  9                                               {"owner":"SH","name":"TIMES"},
 10                                               {"owner":"SH","name":"COUNTRIES"},
 11                                               {"owner":"SH","name":"COSTS"},
 12                                               {"owner":"SH","name":"CHANNELS"},
 13                                               {"owner":"SH","name":"SALES"},
 14                                               {"owner":"SH","name":"SUPPLEMENTARY_DEMOGRAPHICS"}
 15                                             ]       }' );
 16  end;
 17  /
 
PL/SQL procedure successfully completed.
 
Finally, we need to set our AI profile for each database session.
 
rajesh@ATP19C> exec dbms_cloud_ai.set_profile('MY_AI_PROFILE');
 
PL/SQL procedure successfully completed.
 
Let’s look at few examples, we will start simply. How many customers are there ? using the keyword AI we run the SQL commands, we see the result set as well as  SQL query generated by our LLM adding the keyword “showsql” after the “Select AI”
 
rajesh@ATP19C> Select AI how many customers are there;
 
CUSTOMER_COUNT
--------------
         55500
 
rajesh@ATP19C> Select AI showsql how many customers are there;
 
RESPONSE
---------------------------------------------------
SELECT COUNT(*) AS customer_count
FROM SH.CUSTOMERS
 
The next query highlights using the LLM broader knowledge. Find the customers that are in India.
 
rajesh@ATP19C> Select AI how many customers are there in the country India;
 
CUSTOMER_COUNT
--------------
             0
 
rajesh@ATP19C> Select AI showsql how many customers are there in the country India;
 
RESPONSE
---------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT COUNT(*) AS customer_count
FROM SH.CUSTOMERS c
JOIN SH.COUNTRIES country ON c.COUNTRY_ID = country.COUNTRY_ID
WHERE country.COUNTRY_NAME = 'India'
 
 
You can use the chat option to interact with the LLM as you would a chatbot with a single message. In the following example, a developer may need to create a table. By describing the table you want, it generates a CREATE TABLE statement you can use directly or modify to suite your needs
 
 
rajesh@ATP19C> Select AI chat show the create table statement for a table called agents with agent name and age;
 
RESPONSE
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here is the SQL code to create a table called `Agents` with columns for agent name and age:
```sql
CREATE TABLE Agents (
    agent_name VARCHAR(255) NOT NULL,
    age INT NOT NULL
);
```
 
This code creates a table with a `VARCHAR` column for the agent's name and an `INT` column for the agent's age. The `NOT NULL` constraint prevents from inserting null values in these columns.
 
You can add more columns as needed, such as an ID column for uniqueness and indexing for faster lookup.
 
Once the table is created, you can add data to it using SQL insert statements or other data manipulation techniques.
 
Even a complex sql like this
 
rajesh@ATP19C> Select AI what is Top 3 selling products;
 
PROD_NAME                                          SUM(AMOUNT_SOLD)       RANK
-------------------------------------------------- ---------------- ----------
Envoy Ambassador                                         15011642.5          1
Mini DV Camcorder with 3.5" Swivel LCD                    8314815.4          2
17" LCD w/built-in HDTV Tuner                            7189171.77          3
 
rajesh@ATP19C> Select AI showsql what is Top 3 selling products;
 
RESPONSE
---------------------------------------------------------------------------------
SELECT p.prod_name, SUM(amount_sold), RANK() OVER (ORDER BY SUM(amount_sold) DESC) as rank
FROM sh.sales, sh.products p
WHERE sales.prod_id = p.prod_id
GROUP BY p.prod_name
ORDER BY rank
FETCH FIRST 3 ROWS ONLY
 
You can also use ‘chat’ to understand what a query is doing, as shown below. 
 
rajesh@ATP19C> select AI chat
  2  with rws as (
  3  select amount_sold , prod_category, cust_gender
  4  from sh.sales
  5  join sh.products using (prod_id)
  6  join sh.customers using (cust_id)
  7  group by prod_category, cust_gender )
  8  select *
  9  from rws
 10  pivot( count(*) for cust_gender in ('F','M') )
 11  /
 
RESPONSE
-----------------------------------------------------------------------------------
 
Sorry, unfortunately a valid SELECT statement could not be generated for your natural language prompt. Here is some more information to help you further:
 
Here is the output after executing the SQL query on the given table and subquery:
```sql
with rws as (
    select amount_sold, prod_category, cust_gender
    from sh.sales
    join sh.products using (prod_id)
    join sh.customers using (cust_id)
    group by prod_category, cust_gender
)
select *
from rws
pivot(count(*) for cust_gender in ('F', 'M'))
```
| ProdCategory | AmountSold | CustGender | SalesCount |
| ------------ | ----------- | ----------- | ----------- |
| Food        | 1245.78     | M         | 262 |
| Food        | 3054.68     | M         | 390 |
| Food        | 1434.34     | M         | 337 |
| Food        | 738.28      | M         | 165 |
| Total Foods | 6263.88     | M         | 1030 |
| Dresses     | 4699.66     | F         | 704 |
| Dresses     | 3789.84     | F         | 557 |
| Dresses     | 497.40      | F         | 83 |
| Total Dresses| 8972.86     | F         | 1344 |
| Total       | 15226.74    | M/F        | 2237 |
 
The pivot table summarizes the amount sold and the number of sales according to the product category and customer gender. The amounts are aggregated with the `count` function and categorized into 'M' and 'F' for male and female customer genders.  The `total` row is a compilation of the aggregated am
ounts and gender counts.
 
Are there any other outputs you would like to know? Feel free to ask and I'll be more than happy to assist you!
 
 
The ability to generate SQL queries using natural language to get useful results sounds amazing, you can get the ease of use without SQL expertise, since this feature is integrated with database because it is SQL it is readily available with analytics and applications.
 
Finally for the stateless environment like Database actions SQL worksheet in Autonomous database, we can use the DBMS_CLOUD_AI.GENERATE functions in our sql query, since this allows to specify the AI profile directly for stateless invocations.
 
rajesh@ATP19C> col response for a80
rajesh@ATP19C> select dbms_cloud_ai.generate(prompt=>'how many customers are there',
  2             profile_name=>'MY_AI_PROFILE',
  3             action=>'showsql')  as response
  4  from dual;
 
RESPONSE
--------------------------------------------------------------------------------
SELECT COUNT(*) AS customer_count
FROM SH.CUSTOMERS


Wednesday, March 6, 2024

SQL Transpiler

One of the flexibilities of Oracle database is that it let’s you to migrate from Third party database with minimal change in the application source code.
 
Let’s say we have application supported on SQL Server database that got queries like this (involving sql-server specific functions “DAY” )
 
select *
from emp
where day(HIREDATE) < 10;
 
as part of application database migration to Oracle database, the above queries will end up with errors
 
demo@PDB1> select *
  2  from emp
  3  where day(HIREDATE) < 10 ;
where day(HIREDATE) < 10
      *
ERROR at line 3:
ORA-00904: "DAY": invalid identifier
 
To fix that error, we will either
 
Re-write the queries using the equivalent function available in Oracle database  to_number(to_char(hiredate,’DD’)), which will require the application code change, followed by testing.
 
In order to support migration with no application changes, most experience professional will create custom function in database like this.

 
demo@PDB1> create or replace function day(p_hiredate date)
  2  return number
  3  as
  4  begin
  5     return to_number(to_char(p_hiredate,’DD’)) ;
  6  end;
  7  /
 
Function created.
 
demo@PDB1> select *
  2  from emp
  3  where day(HIREDATE) < 10;
 
        ID      EMPNO ENAME       JOB              MGR HIREDATE           SAL       COMM     DEPTNO
---------- ---------- ----------- --------- ---------- ----------- ---------- ---------- ----------
         4       7566 JONES       MANAGER         7839 02-APR-1981       2975                    20
         6       7698 BLAKE       MANAGER         7839 01-MAY-1981       2850                    30
         7       7782 CLARK       MANAGER         7839 09-JUN-1981       2450                    10
        10       7844 TURNER      SALESMAN        7698 08-SEP-1981       1500          0         30
        12       7900 JAMES       CLERK           7698 03-DEC-1981        950                    30
        13       7902 FORD        ANALYST         7566 03-DEC-1981       3000                    20
 
6 rows selected.
 
demo@PDB1>
 
However having the function call in the predicates, will lead to context switch between the sql and pl/sql runtime engine and lead to additional performance issue.
 
So how can we avoid that? If we are with Oracle 21c and above we can take the advantage of SQL (scalar) Macro technology

 
demo@PDB1> select *
  2  from emp
  3  where day(HIREDATE) < 10 ;
 
Execution Plan
----------------------------------------------------------
Plan hash value: 3956160932
 
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    41 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| EMP  |     1 |    41 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DAY"("HIREDATE")<10)
 
demo@PDB1>
demo@PDB1> create or replace function day(p_hiredate date)
  2  return varchar2
  3  sql_macro(scalar)
  4  as
  5  begin
  6     return q'# to_number(to_char(p_hiredate,’DD’)) #';
  7  end;
  8  /
 
Function created.
 
demo@PDB1> select *
  2  from emp
  3  where day(HIREDATE) < 10 ;
 
Execution Plan
----------------------------------------------------------
Plan hash value: 3956160932
 
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    41 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| EMP  |     1 |    41 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("HIREDATE"),'DD'))<10)
 
demo@PDB1>
 
 
with SQL Macro in place, the function call in the predicates got replace with the expression in the function, this will reduce the context switch between sql and pl/sql runtime environments.
 
Of course you could use the same technology in 23c now. However in 23c, Oracle goes one step further and introduces an automatic conversion with the SQL Transpiler even for non-macro functions. You only need to setup and enable the transpiler functionality with the new parameter SQL_TRANSPILER. No further manual interaction is required. The SQL Transpiler is disabled by default. You can enable it with an ALTER SYSTEM or ALTER SESSION command and change the parameter value accordingly. When SQL_TRANSPILER is set to ON, the SQL transpiler feature is enabled and PL/SQL functions are automatically transpiled (converted) into SQL expressions whenever possible. When this parameter is set to OFF, the SQL Transpiler feature is disabled

 
demo@FREEPDB1> create or replace function day(p_hiredate date)
  2  return number
  3  as
  4  begin
  5     return to_number(to_char(p_hiredate,'DD')) ;
  6  end;
  7  /
 
Function created.
 
demo@FREEPDB1> set autotrace traceonly exp
demo@FREEPDB1> select * from emp where day(hiredate) <= 10;
 
Execution Plan
----------------------------------------------------------
Plan hash value: 3956160932
 
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    38 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| EMP  |     1 |    38 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DAY"("HIREDATE")<=10)
 
demo@FREEPDB1> alter session set sql_transpiler = ON;
 
Session altered.
 
demo@FREEPDB1> select * from emp where day(hiredate) <= 10;
 
Execution Plan
----------------------------------------------------------
Plan hash value: 3956160932
 
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    38 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| EMP  |     1 |    38 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("HIREDATE"),'DD'))<=10)
 
demo@FREEPDB1>
 
the function call in the predicates is replace with expression and can be seen in the Predicate Information section, this Indicates that transpilation has occurred.