I’ve been wanting to use SQL in combination with Python for a while. As a data analyst I use SQL all the time, and it’s great for pulling data from a large database. But it’s not always the best tool for complicated analysis. I’ve been brushing up on Python lately and using it for things like linear regression, but I wasn’t sure how to actually use SQL within Python.
Why does this matter? Until now I was loading data into my Python scripts inefficiently: I manually ran the SQL queries, exported the results as a CSV file, then imported the CSV file into a Python script. It took a while, and it was a pain to update whenever my query or the CSV file changed.
I recently figured out how to use Python to get data directly from our database, load it into a pandas DataFrame, and work with that DataFrame, all in one script. That’s what I’ll be covering in this blog post.
To get the final code, check out my GitHub repository: run-sql-in-python.
Notes
Before we start here are a few important things to note:
I am not going over Python basics like how to use Anaconda Distribution, etc. (hint: I’m releasing a course later this month that goes into those details). This article assumes you already have a basic knowledge of SQL and Python.
You need to have a database already established. In my case I use both PostgreSQL and Google BigQuery frequently. So these are the two types of databases we’ll connect to.
If you’re wondering why I used these specific libraries: they’re widely used, and I’m familiar with most of them.
Git/GitHub is optional, but it’s good practice to use something to track changes. I won’t go into how Git/GitHub works, but all of this code lives in the repo linked above.
All examples in this blog post use simplified, fictional data for educational purposes only. In the real world, tables typically contain many more rows and columns, but I’ve simplified the data here to keep the focus on the technique and to protect private information.
Anyway, on to what you’ll need for this.
Requirements
Here's what you'll need before starting:
IDE: Visual Studio Code
Database: PostgreSQL or Google BigQuery
Package Manager: Anaconda Distribution
Version Control: Git/Github (Optional)
Code
Below are the two scripts I created, connecting to each database in its own Python script.
PostgreSQL
Below is the Python script which I use to connect to a PostgreSQL database.
Libraries
Import libraries.
psycopg2 and sqlalchemy: Connect to our PostgreSQL database
psycopg2 - the driver that handles the low-level PostgreSQL interaction
sqlalchemy - the higher-level interface with more functionality
pandas: To load the query results into a pandas DataFrame.
google.cloud: Use Secret Manager
Manage credentials (e.g. the database password) with Secret Manager
Essential for security.
Process
First we set up the environment with the necessary Python libraries.
import psycopg2
from google.cloud import secretmanager
import pandas as pd
from sqlalchemy import create_engine
Then we set up Secret Manager from Google. We initialize the Secret Manager client and then retrieve the secret, in my case the database password, which ends up in secret_string_key. (Note: whatever account runs the script needs permission to read the secret, e.g. the Secret Manager Secret Accessor role.)
project_id = 'project-id-123'
# Secret Manager client
secretmanager_client = secretmanager.SecretManagerServiceClient()
# Secret name
secret_name_key = "SERVICE_ACCOUNT_KEY"
# Retrieve the secret from Secret Manager
request_key = {"name": f"projects/{project_id}/secrets/{secret_name_key}/versions/latest"}
response_key = secretmanager_client.access_secret_version(request_key)
secret_string_key = response_key.payload.data.decode("UTF-8")
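If you end up retrieving more than one secret, it can be handy to wrap this pattern in a small helper. Here’s a minimal sketch using the same client calls as above (the get_secret function is my own convenience wrapper, not part of the library):
def get_secret(client, project_id, secret_name):
    """Fetch the latest version of a secret and return it as a string."""
    name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
    response = client.access_secret_version({"name": name})
    return response.payload.data.decode("UTF-8")

# Equivalent to the code above:
# secret_string_key = get_secret(secretmanager_client, project_id, "SERVICE_ACCOUNT_KEY")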
Next, we define the connection parameters for the PostgreSQL database, pulling the password in from Secret Manager. This keeps sensitive information out of the script itself (the host, database, and user values below are placeholders).
# Connection parameters with credentials
conn_params = {
"host": "host-name",
"database": "database-name",
"user": "user-name",
"password": secret_string_key}
We build a connection URI and use it to create a SQLAlchemy engine, which is what we’ll use to run SQL and fetch the query results.
# Construct database connection string for SQLAlchemy
db_uri = f"postgresql+psycopg2://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['database']}"
# Create SQLAlchemy engine
engine = create_engine(db_uri)
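One thing to know: create_engine() is lazy, so it doesn’t actually connect until first use. If you want to verify the host and credentials up front, a quick sanity check like this works (my own addition, not part of the original script):
# Optional: force a connection to verify credentials before running real queries
with engine.connect() as conn:
    print("Successfully connected to PostgreSQL")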
In a try block we write the actual query as a multi-line string for readability. In this simple example we select the id and timestamp from a fictional customers table where the timestamp falls on the date 2024-05-14. We also parameterize the query to improve security and prevent SQL injection attacks, but you can replace this with any query.
Then we use pandas’s read_sql_query, which runs the query over a connection from the SQLAlchemy engine and returns the results in a DataFrame called df.
try:
    # Execute SQL query using SQLAlchemy engine with parameterization
    query = """
        SELECT id, timestamp
        FROM customers
        WHERE timestamp::date = %(date)s
    """
    df = pd.read_sql_query(query, engine, params={'date': '2024-05-14'})
    # Print query results
    print(df)
The except block handles any exceptions that occur while connecting to the database or running the query. If there’s an error, it prints the error message, which is handy for debugging.
except Exception as e:
    # Handle any exceptions
    print("An error occurred:", e)
Final Code
To view the final code on GitHub: postgresql.py.
import psycopg2
from google.cloud import secretmanager
import pandas as pd
from sqlalchemy import create_engine
project_id = 'project-id-123'
# Secret Manager client
secretmanager_client = secretmanager.SecretManagerServiceClient()
# Secret name
secret_name_key = "SERVICE_ACCOUNT_KEY"
# Retrieve the secret from Secret Manager
request_key = {"name": f"projects/{project_id}/secrets/{secret_name_key}/versions/latest"}
response_key = secretmanager_client.access_secret_version(request_key)
secret_string_key = response_key.payload.data.decode("UTF-8")
# Connection parameters with credentials
conn_params = {
"host": "host-name",
"database": "database-name",
"user": "user-name",
"password": secret_string_key}
# Construct database connection string for SQLAlchemy
db_uri = f"postgresql+psycopg2://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['database']}"
# Create SQLAlchemy engine
engine = create_engine(db_uri)
try:
    # Execute SQL query using SQLAlchemy engine with parameterization
    query = """
        SELECT id, timestamp
        FROM customers
        WHERE timestamp::date = %(date)s
    """
    df = pd.read_sql_query(query, engine, params={'date': '2024-05-14'})
    # Print query results
    print(df)
except Exception as e:
    # Handle any exceptions
    print("An error occurred:", e)
BigQuery
Below is the Python script that I use to connect to Google BigQuery.
Libraries
Import libraries and modules.
pandas: To load the query results into a Pandas DataFrame.
google.cloud: Use the BigQuery client library to run queries in BigQuery
Process
First we set up the environment with the necessary Python libraries.
import pandas as pd
from google.cloud import bigquery
Then we initialize the BigQuery client.
# BigQuery client
client = bigquery.Client()
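One assumption worth knowing: bigquery.Client() picks up the project and credentials from your environment (Application Default Credentials). If you’d rather be explicit, you can pass the project ID yourself, for example:
# Explicitly set the project instead of relying on the environment
client = bigquery.Client(project="project-id-123")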
We write the actual query as a multi-line string for readability. In this simple example we are getting the id and timestamp from a fictional customers table (located in BigQuery). But you can replace this with any query.
# Perform a query
query = """
SELECT id, timestamp
FROM `project-id-123.database-name.customers`
"""
In a try block we’ll use the BigQuery client to execute the query in our BigQuery database. This one doesn’t use query parameterization because we aren’t taking any user inputs (see the parameterized sketch after the except block below). Then we convert the results to a DataFrame named df with to_dataframe(). Finally, we print df.
try:
    query_job = client.query(query)
    df = query_job.to_dataframe()
    # Print DataFrame
    print(df)
The except block handles any exceptions that occur while connecting to BigQuery or running the query. If there’s an error, it prints the error message, which is handy for debugging.
except Exception as e:
    # Handle any exceptions
    print("An error occurred during query execution:", e)
Final Code
To view the final code on GitHub: bigquery.py.
import pandas as pd
from google.cloud import bigquery
# BigQuery client
client = bigquery.Client()
# Perform a query
query = """
SELECT id, timestamp
FROM `project-id-123.database-name.customers`
"""
try:
    query_job = client.query(query)
    df = query_job.to_dataframe()
    # Print DataFrame
    print(df)
except Exception as e:
    # Handle any exceptions
    print("An error occurred during query execution:", e)
Conclusion
In conclusion, integrating SQL with Python provides an efficient way to interact with a database directly within your Python scripts. This approach not only streamlines data retrieval by eliminating the manual export/import step, it’s also the first step toward more advanced analysis in Python.