Goglides Dev 🌱

Cover image for Code Translation: SQL Query to equivalent pandas statement
Mallikarjun for goglides

Posted on

Code Translation: SQL Query to equivalent pandas statement

Today we will deep dive into learning How to convert SQL query to Pandas code using LLMs

Project Workflow: Code Translation-SQL Query to equivalent pandas statement

  1. Introduction
  2. Background
  3. Tools and Technologies
  4. About Project Components
  5. How It Works
  6. Benefits
  7. Challenges and Considerations
  8. Future Directions
  9. Questions asked
  10. Conclusion

You can visit our GitHub page for complete code.

1. Introduction:

In this project, We are focusing on translation of SQL queries to pandas statements. We have developed a comprehensive solution that bridges the gap between SQL and pandas. SQL and pandas are both most used tools to have in your data wrangling and analysis toolbox, and we are looking to make the transition to panda-land as smooth as possible. Data professionals are the intended audience. We assume of course that you are familiar with SQL.

2. Background:

This project is a response to the reality that many data professionals find themselves in — they often have to migrate their workflows between SQL and Python environments in the course of doing data analysis. By automating this translation process we hope to smooth out this transition, and save time and effort while increasing the flexibility and efficiency of these data analysis workflows.

3. Tools and Technologies:

We have used Python programming language as a base to develop this application, OpenAI's Language Model is incorporated to accurately understand and convert SQL queries to pandas code.
also uses Streamlit, a web interface tool where users can enter their SQL queries and get the equivalent pandas code output

4. About Project Components:

A) Streamlit: Streamlit helps us to transform Python scripts into interactive data science/Machine learning web apps in minutes, instead of weeks with regular frontend tools. Build dashboards, generate reports, or create chat apps. Once the app is created, we can use the Community Cloud platform to deploy, manage, and share.

B) OpenAI Models: OpenAI models are, like, super powerful and, you know, efficient in generating and understanding natural language texts. it is like, based on GPT (Generative pre-trained transformers) architecture. We have, like, totally used the GPT-3.5-turbo model in our project!

5. How It Works:

A prompt is a piece of text that guides a large language model (LLM) to generate a response. For example, if I want to summarize given documents/Text, My prompt will look like this

Provide a summary of the given Text.

We get an SQL query from the Streamlit interface as input! then, combine the SQL query with the prompt; to make it a complete prompt, which is utilized by a large language model (LLM).

Based on the given prompt(both instruction and query) LLM generates a response which is pandas code in this project.

6. Benefits:

For any SQL professional making transition from SQL to pandas, this tool is their default go to choice leveraging their existing SQL expertise, users can easily grasp pandas data manipulations, fuctionalities and accelerate their learning curve.

7. Challenges and Considerations:

A well-defined prompt is crucial for maximizing the performance of LLM. Ambiguous prompts can lead to unexpected or undesirable responses, highlighting how crucially important clarity and precision is.

8. Future Directions:

I have implemented this super exciting and fun project to take input as SQL query and produce an output as a pandas statement, You can totally extend this whole thing with any input (like literally any database query) and an output (from, like, any python framework) of your choice or,tweaking to meet your requirements.

9. Questions asked:

Q1. SQL Query Input(Simple):

SELECT * FROM Customers
WHERE CustomerID > 80;
Enter fullscreen mode Exit fullscreen mode

Pandas Output:

import pandas as pd
# Assuming `df` is the pandas DataFrame that contains the Customers data
result = df[df['CustomerID'] > 80]
Enter fullscreen mode Exit fullscreen mode

Q2. SQL Query Input(Intermediate):

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
Enter fullscreen mode Exit fullscreen mode

Pandas Output:

import pandas as pd
# Assuming customers and orders are pandas dataframes
result = pd.merge(customers, orders, how='outer',on='CustomerID')[['CustomerName', 'OrderID']].sort_values(by='CustomerName')
Enter fullscreen mode Exit fullscreen mode

Q3. SQL Query Input(Expert):

WITH employee_ranking AS (
    RANK() OVER (ORDER BY salary ASC) as ranking
  FROM employee
FROM employee_ranking
WHERE ranking <= 5
ORDER BY ranking
Enter fullscreen mode Exit fullscreen mode

Pandas Output:

import pandas as pd
employee = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'last_name': ['Doe', 'Smith', 'Johnson', 'Brown', 'Lee'],
    'first_name': ['John', 'Jane', 'Bob', 'Sarah', 'Emma'],
    'salary': [50000, 60000, 55000, 48000, 70000]
employee_ranking = employee.assign(ranking=employee['salary'].rank(method='min')).copy()
employee_ranking = employee_ranking[employee_ranking['ranking'] <= 5].sort_values(by='ranking')

result = employee_ranking[['employee_id', 'last_name', 'first_name', 'salary']]
Enter fullscreen mode Exit fullscreen mode

10. Conclusion:

Our project provides a unique solution for data professionals who want to translate their code from SQL to pandas. It significantly bridges the gap between the two most widely used tools. we can further customize this solution to replace the input and output tools with the user's individual needs and preferences.

Top comments (0)