Natural Language Queries to SQL with Code-Llama

Introduction

Utilizing natural language queries for SQL database interaction represents a revolutionary paradigm shift, streamlining the complexities of database management and querying. This innovative approach harnesses the power of natural language processing (NLP) techniques, enabling users to engage with relational databases using everyday language, free from the constraints of intricate SQL coding. It empowers users to pose inquiries or request data from databases in a conversational manner, mirroring human interactions. By bridging the chasm between human language and SQL, this technology democratizes database accessibility, simplifying data extraction and comprehension for non-technical users grappling with extensive datasets.

Natural Language Queries (NLQ) offer a transformative boost to data accessibility and usability, expediting decision-making processes across diverse domains, encompassing realms from business intelligence to data analysis and reporting. This approach proves particularly advantageous for data administrators, affording them the capability to pose inquiries naturally, circumventing the need for direct code composition.

Technologies

Code-Llama, developed by Meta AI, is a sophisticated large language model (LLM) chatbot. This AI model is extensively trained on a vast corpus of text and code, granting it the capability to perform a variety of tasks, including text generation, language translation, creative content creation, and providing informative responses to questions. One of its standout features is its ability to convert natural language queries into SQL queries.

Code-Llama is a specialized version of the Llama-2 language tool, fine-tuned specifically for tasks involving code-related content. It has been meticulously trained on code-centric datasets to offer a deeper and more specialized understanding, yielding precise outputs in the realm of code-specific domains. Code-Llama is available in three parameter sizes: 7B, 13B, and 34B, catering to different use cases. Additionally, it boasts a unique “Fill in the Middle” (FIM) feature, enabling it to seamlessly insert code snippets into existing code structures.

Furthermore, there are two distinct fine-tuned iterations of Code-Llama. The first, Code-Llama Python, has undergone additional training on a substantial dataset of 100 billion tokens of Python code, enhancing its proficiency in Python-related tasks. The second, Code-Llama Instruct, excels in understanding and interpreting natural language queries, effectively discerning user expectations and providing relevant responses to prompts.

LLama 2 architecture used for Code llama

The text-to-SQL task is a complex challenge that revolves around automatically translating natural language text into SQL queries. This challenge is particularly daunting due to the inherent ambiguity in natural language queries, as they can encompass various entities and relationships within a database. Code-Llama plays a pivotal role in addressing this task by first comprehending the underlying meaning of the given natural language query and subsequently crafting the corresponding SQL query.

The process employed by Code-Llama initiates with a natural language parser, which dissects the natural language query into its fundamental components. This parsing phase discerns the entities, relationships, and required operations specified within the query. Subsequently, Code-Llama harnesses this comprehensive representation to generate a semantically accurate SQL query, tailored for execution within a database environment.

Code-Llama has demonstrated remarkable effectiveness in the realm of translating natural language queries into SQL. In a recent research endeavor, Code-Llama exhibited an impressive capability to produce correct SQL queries for 85% of the queries contained within a standard benchmark dataset. This achievement signifies a substantial advancement compared to previous methodologies for tackling the text-to-SQL challenge.

Prerequisites

To test it out, first, you need to head to [E2E Cloud](https://myaccount.e2enetworks.com), and sign in or sign up for an account. Once in, click on ‘TIR - AI Platform’ on top, to switch to the AI Platform.

You can now create a notebook. Free tier is available, and that suffices for the following experiment.

Create Notebook on TIR

Once that’s done, you can Launch the Notebook, and move on to the next step.

How to Use and Querying Examples

Meta has released their code on their github repository for Code-llama. One can also download the model by sending a request to MetaAI for the model access: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

However, instead of going through the hassle of doing so, one can also directly use the Code-Llama model inference API through HuggingFace (https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf)

To use the model API, here is a simple code that can be run directly on TIR - on E2E Cloud:

!pip install transformers
!pip install accelerate

This piece of code downloads the required libraries to TIR notebook.

from IPython.display import Markdown, display

This import is important for displaying the output in a more readable manner.

import requests
input="Find the names of all employees who are from California."
API_URL = "https://api-inference.huggingface.co/models/codellama/CodeLlama-34b-Instruct-hf"
headers = {"Authorization": "Bearer hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
        "inputs": input,
})

while(output[0]['generated_text']!=input):
    input=output[0]['generated_text']
    output=query({'inputs':input})

This is a piece of code for direct inference from the Code-Llama Inference 34B parameters model. The input shown here is just an example input and it can be changed depending on the prompt the user wants.

It should also be noted here, that the Authorization API key is unique to each person and can be generated directly through HuggingFace (https://huggingface.co/settings/tokens).

It should also be noted that the output here is being generated multiple times, to ensure that the SQL query generated is complete, therefore it is run until the inference API does not generate any new output from the input.

display(Markdown(output[0]['generated_text']))

This final cell is for reading the final output. The output can be found below.

Example queries and their outputs from Code-Llama Inference 34B parameters:

Query: “Retrieve the names and email addresses of all customers who made a purchase in the last month and spent more than $100.”

Output:

    SELECT name, email FROM customers
    WHERE id IN ( SELECT customer_id FROM orders WHERE total_amount > 100 AND created_at >DATE_SUB(NOW(), INTERVAL 1 MONTH) );

Query: “Find the average salary of employees in each department, along with the total number of employees in each department, for departments with more than 10 employees. Additionally, display the department name and the highest salary within each of those departments.”

Output:
    SELECT
        d.name AS department_name
        , AVG(e.salary) AS average_salary
        , COUNT(e.id)AS total_employees
        , MAX(e.salary) AS highest_salary
    FROM
        departments d
        JOIN employees e ON d.id = e.department_id
    GROUP BY
        d.name
    HAVING
        COUNT(e.id) > 10

It should be ensured that the prompts that are entered should not be ambiguous and should contain the entire information required by the model to generate a valid SQL query which will work well. An example of such prompt and it’s output is given below.

Query:
“Retrieve the names and email addresses of all employees in the 'Sales' department who joined the company after January 1, 2022, sorted by their hire date in ascending order.”

Output:
    SELECT
        first_name,
        last_name,
    email
    FROM
        employees
    WHERE
        department_id = (
            SELECT
            id
            FROM
                departments
            WHERE
                name = 'Sales'
    )
        AND hire_date > '2022-01-01'
    ORDER BY
        hire_date ASC

Challenges and Solutions

Variability in SQL Dialects

Challenge: SQL dialects vary depending on the database system (MySQL, Oracle, PostgreSQL), making it essential to generate compatible SQL queries. Solution: Tailor queries to the specific SQL database system in the appropriate dialect.

Handling Complex Queries

Challenge: Complex SQL queries involving multiple joins, subqueries, and aggregations can be difficult to generate accurately from natural language input. Solution: Implement a robust query generation engine capable of handling a wide range of SQL constructs. Incorporate algorithms that deconstruct complex queries into simpler subqueries and reassemble them.

Lack of Data Context

Challenge: Natural language queries often lack contextual information, crucial for accurate query generation (e.g., understanding “Show me the sales for this year”). Solution: Enhance context-awareness by considering the user’s session history or explicitly incorporating contextual information into the query.

Handling Large and Complex Databases

Challenge: Mapping natural language queries to the appropriate schema can be challenging in databases with numerous tables and columns. Solution: Implement a robust schema mapping mechanism utilizing database metadata to aid query generation. Encourage user feedback to refine mapping accuracy.

Error Handling and User Feedback

Challenge: Effective error handling and user feedback mechanisms are essential when Code-Llama generates incorrect SQL queries. Solution: Develop a user-friendly interface that explains errors and provides correction suggestions. Leverage user feedback to continually enhance query accuracy.

Security and Data Privacy

Challenge: Ensuring security and data privacy when granting natural language access to databases is critical. Solution: Implement strong access control and authentication mechanisms. Restrict query execution to prevent unauthorized access to sensitive data.

User Acceptance and Learning Curve

Challenge: Users may initially struggle to adapt to querying databases using natural language. Solution: Offer user training and comprehensive documentation to help users become proficient with the system. Introduce the technology gradually to ease the transition.

Addressing these challenges during the development and deployment of a Text-to-SQL system like Code-Llama is paramount to ensure accuracy, usability, and security in real-world applications. Continual improvement, driven by user feedback and advancements in NLP and AI technologies, will further enhance the efficacy of such systems.

Conclusion

Code-Llama stands out as an exceptional tool for transforming natural language queries into precise SQL statements, outperforming even state-of-the-art models, particularly in code-specific and text-to-SQL applications—a primary focus of this article. Its ease of use is a testament to its usability, offering a seamless experience for SQL query generation.

However, it’s crucial to underscore that the effectiveness of Code-Llama hinges on the specificity and precision of the provided natural language queries. To maximize its utility, users should ensure their queries leave no room for ambiguity or assumptions. The tool thrives when users articulate their questions with meticulous clarity.

In the real world, Text-to-SQL models like Code-Llama find applications across diverse scenarios, greatly benefiting data administrators and professionals in various fields. These tools empower users to interact with databases more intuitively and efficiently.

Nonetheless, it’s imperative to recognize that Code-Llama, while a powerful asset, should not be viewed as a standalone solution. Rather, it should be regarded as a valuable tool that enhances human capabilities and simplifies complex tasks. When used in conjunction with domain expertise and best practices, Code-Llama can significantly streamline data-related workflows and amplify productivity.