How to Work with SQL Databases Efficiently in Pandas


A major benefit of working with SQL data in pandas is that we can manipulate large amounts of data much more efficiently. If you have been writing aggregation queries using GROUP BY in SQL, or worse, looping through database records one by one in Python, try performing these tasks in pandas instead: it can be orders of magnitude more efficient.
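
For example, here is a minimal sketch (with made-up data, not from the table we will create below) of a pandas groupby that replaces a SQL GROUP BY aggregation:

import pandas as pd

# Equivalent of: SELECT subject, AVG(score) FROM scores GROUP BY subject;
df = pd.DataFrame({"subject": ["Math", "Math", "Literature"], "score": [60, 80, 90]})
avg_scores = df.groupby("subject")["score"].mean()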

In this post, we will introduce how to write data to and read data from a SQL database using pandas. Importantly, common pitfalls are demonstrated with simple code and the errors they raise, which can help us avoid them and work more efficiently in practice.


Preparation

In this post, we will spin up a local MySQL server with Docker. All SQL databases will work similarly.

# Create a volume to persist the data.
$ docker volume create mysql8-data

# Create the container for MySQL.
$ docker run --name mysql8 -d -e MYSQL_ROOT_PASSWORD=root -p 13306:3306 -v mysql8-data:/var/lib/mysql mysql:8

# Connect to the local MySQL server in Docker.
$ docker exec -it mysql8 mysql -u root -proot

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.31    |
+-----------+
1 row in set (0.00 sec)

Then we will create a virtual environment and install all the libraries needed, which include pandas, SQLAlchemy, pymysql, and cryptography:

conda create --name pandas-sql python=3.11
conda activate pandas-sql

pip install -U "pandas>=1.5.0,<1.6.0"
pip install -U "SQLAlchemy>=2.0.0,<2.1.0"
pip install -U "pymysql>=1.0.0,<1.1.0"
pip install -U "cryptography>=40.0.0,<40.1.0"

Note that version ranges are specified for the libraries so maintenance upgrades can be picked up automatically when you run these commands in the future.

In particular, SQLAlchemy will be used to interact with our MySQL database (with cryptography as the cryptographic package for authentication), and pymysql is the driver specifically for MySQL.


Create the database and table

When working with a SQL database, it’s better to create the database and table in advance, so we can properly define the data types and indexes for efficient storage and reading.

CREATE DATABASE `data`;

CREATE TABLE `data`.`student_scores` (
  `student_id` smallint NOT NULL,
  `subject` varchar(50) NOT NULL,
  `score` tinyint DEFAULT '0',
  PRIMARY KEY (`student_id`,`subject`),
  KEY `ix_subject` (`subject`),
  KEY `ix_score` (`score`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
;

We can create tables with pandas on the fly, but the indexes will not be created, as we will see soon.


Write to SQL with pandas

Let’s first check out how to write data to a SQL table using pandas.

We need to create a data frame whose rows map to the records to be written:

import pandas as pd

df_to_sql = pd.DataFrame(
    columns=["student_id", "subject", "score"],
    data=[
        [1, "Literature", 90],
        [1, "Math", 60],
        [2, "Literature", 80],
        [2, "Math", 80],
        [3, "Literature", 70],
        [3, "Math", 95],
    ],
)

The columns parameter specifies the columns, or fields, of the data frame, and data is specified as a list of lists, each of which corresponds to a row to be written.

Before we can write this data frame to the database, we need to create an SQLAlchemy connection to interact with our database:

from sqlalchemy import create_engine

# Connection URL format: dialect+driver://username:password@host:port/database.
db_url = "mysql+pymysql://root:root@localhost:13306/data"

# Keep up to 5 connections in the pool, recycle them hourly,
# and commit each statement automatically.
engine = create_engine(
    db_url, pool_size=5, pool_recycle=3600, isolation_level="AUTOCOMMIT"
)
con = engine.connect()

Then we can write the data frame created above to our MySQL database using the connection created. However, there are several pitfalls that should be avoided.


Pitfalls when writing to SQL using pandas

Let’s explore different arguments of the DataFrame.to_sql() method and check out the pitfalls that should be avoided in practice.

Let’s first run it with no additional arguments:

df_to_sql.to_sql("student_scores", con=con)

When this command is run, we will see the error:

ValueError: Table 'student_scores' already exists.

This happens because DataFrame.to_sql() by default raises a ValueError if the table already exists. To append data to an existing table, we can set the if_exists parameter to "append":

df_to_sql.to_sql("student_scores", con=con, if_exists="append")

However, when this command is run, we will see a new error:

OperationalError: (pymysql.err.OperationalError) (1054, "Unknown column 'index' in 'field list'")

This is because the index of the data frame will be written to the database by default. Since we don’t have an index column in our table, an error will be raised.

We can set the index parameter to False to skip the data frame index when writing to the database:

df_to_sql.to_sql("student_scores", con=con, if_exists="append", index=False)

This time everything works properly. Now if you check the table in the database, you will see six rows inserted.
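
For example, querying the table from the MySQL shell should show something like:

mysql> SELECT * FROM data.student_scores;
+------------+------------+-------+
| student_id | subject    | score |
+------------+------------+-------+
|          1 | Literature |    90 |
|          1 | Math       |    60 |
|          2 | Literature |    80 |
|          2 | Math       |    80 |
|          3 | Literature |    70 |
|          3 | Math       |    95 |
+------------+------------+-------+
6 rows in set (0.00 sec)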

Note that if the isolation_level is not set to be AUTOCOMMIT when creating the SQLAlchemy engine, you would need to run con.commit() explicitly to persist the changes to the database.
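
For example, a minimal sketch without AUTOCOMMIT would look like this:

engine = create_engine(db_url, pool_size=5, pool_recycle=3600)
con = engine.connect()

df_to_sql.to_sql("student_scores", con=con, if_exists="append", index=False)
con.commit()  # Persist the changes explicitly.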

Now let’s see what will happen when we run the above command again:

df_to_sql.to_sql("student_scores", con=con, if_exists="append", index=False)

This time a new error is raised about duplicate keys:

IntegrityError: (pymysql.err.IntegrityError) (1062, "Duplicate entry '1-Literature' for key 'student_scores.PRIMARY'")

The reason is that the above command appends the records to the table, and since they are already in the database, they cannot be inserted again because of the primary key constraint.

Unfortunately, DataFrame.to_sql() has no built-in ON DUPLICATE KEY UPDATE or upsert functionality. We can, however, set if_exists to "replace" to drop the table before inserting new values. Note that it drops the table rather than truncating it, which is pretty error-prone in my opinion.

When the table is dropped and recreated, the indexes will be dropped and the data types will be assigned automatically. Let’s see how it works:

df_to_sql.to_sql("student_scores", con=con, if_exists="replace", index=False)

This time the command works. Let’s check the data definition language (DDL) of the student_scores table:

SHOW CREATE TABLE `data`.`student_scores`;

CREATE TABLE `student_scores` (
  `student_id` bigint DEFAULT NULL,
  `subject` text,
  `score` bigint DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

As we can see, the indexes are dropped, including the primary key, and the data types are assigned automatically, which is not desired in most practical cases.

Therefore, it’s better to truncate the existing table before inserting new records (as sketched below), or to use ON DUPLICATE KEY UPDATE or an upsert to update existing ones, which is a bit out of the scope of this post and will be covered in future ones.
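
Here is a minimal sketch of the truncate-then-append approach, assuming the table still has its original definition:

from sqlalchemy import text

# TRUNCATE keeps the table definition and indexes, unlike if_exists="replace".
con.execute(text("TRUNCATE TABLE `data`.`student_scores`"))
df_to_sql.to_sql("student_scores", con=con, if_exists="append", index=False)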


Read data from SQL with pandas

While writing data to SQL with pandas is not so common, reading data from SQL with pandas is far more common, and it’s also much simpler.

We can use pandas.read_sql() to read data from a SQL database. pandas.read_sql() is a wrapper function for pandas.read_sql_table() and pandas.read_sql_query().

In most cases, you can just use pandas.read_sql(). However, when you need to specify a schema while reading by table name, you need to use pandas.read_sql_table().

Let’s first try to read by table name:

df_from_sql = pd.read_sql("data.student_scores", con=con)

This will raise an error:

ObjectNotExecutableError: Not an executable object: 'data.student_scores'

This is because we cannot specify a schema with pandas.read_sql(). We need to use pandas.read_sql_table() instead:

df_from_sql = pd.read_sql_table("student_scores", con, schema="data")

And this command will work.
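
We can verify the result by printing the data frame (the row order may vary):

print(df_from_sql)
#    student_id     subject  score
# 0           1  Literature     90
# 1           1        Math     60
# 2           2  Literature     80
# 3           2        Math     80
# 4           3  Literature     70
# 5           3        Math     95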

We can specify which columns to read when reading by table name:

df_from_sql = pd.read_sql_table(
    "student_scores", con, schema="data", columns=["student_id", "subject", "score"]
)

Besides reading by table name, we can also read with a plain SQL query. This time you may be surprised that it does not work as you would expect:

df_from_sql = pd.read_sql("SELECT * FROM data.student_scores", con=con)
df_from_sql = pd.read_sql_query("SELECT * FROM data.student_scores", con=con)

Both of these commands will fail with this error:

ObjectNotExecutableError: Not an executable object: 'SELECT * FROM data.student_scores'

This is because, with newer versions of SQLAlchemy, we can no longer pass a plain SQL query as a string; instead, we need to construct a TextClause:

from sqlalchemy import text

df_from_sql = pd.read_sql(text("SELECT * FROM data.student_scores"), con=con)
df_from_sql = pd.read_sql_query(text("SELECT * FROM data.student_scores"), con=con)

Now both commands will work and we can use pandas.read_sql() for simplicity.
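
Since we are constructing a TextClause anyway, we can also pass bind parameters safely instead of formatting values into the query string. Here is a minimal sketch, assuming we only want the Math scores:

df_math = pd.read_sql(
    text("SELECT * FROM data.student_scores WHERE subject = :subject"),
    con=con,
    params={"subject": "Math"},
)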


In this post, we have covered how to write data to and read data from a SQL database using pandas. A MySQL database is used because it’s common in practice. However, the pandas code will work for all SQL databases with minor adjustments because SQLAlchemy, a universal and versatile Python SQL toolkit, is used for the database connection.

We have covered the fundamentals in this post, which should be enough if you work with plain SQL queries. A more advanced post will come soon, covering how to work with SQLAlchemy Table and ORM models in pandas.

