How DuckDB Made Me Love SQL

DuckDB is a relational database management system (RDBMS) that is optimized for analytical, column-oriented workflows—things like merging tables together, aggregations, and more. The more formal term for this type of RDBMS is OLAP—Online Analytical Processing. It works “in-process”, which means that unlike most other RDBMS tools, it does not use a client-server model. It can run embedded in a Python or R session or run as a standalone command-line process.

Over the last few years, DuckDB has become one of my favorite data science tools. Reading, inspecting, understanding, merging, filtering, and otherwise manipulating tabular datasets is a huge part of my workflow and that’s exactly where DuckDB shines. Here, I explain why and how DuckDB became such a big part of my workflow. Hopefully it will motivate more people to check out this great tool!

The Background

My first serious use of statistical software for data analysis was around 2015. I used Stata to replicate a paper’s instrumental variable analysis for my “Identification Strategies in Political Economy” course at Humboldt. (It was a great course!) My instructor for the course later also held an R workshop, where I learned the very basics of R—which came in handy for my soon-to-begin doctoral studies at Penn State.

Here’s what happened afterwards during my PhD: I learned R at a deeper level in my first year PhD methods sequence (I would later come to call the R I learned at that stage “base-R,” as most people do). In the second year I learned how to use the tidyverse and started using it heavily for dataframe manipulation. In my third year in the program I started learning Python but used it more for collecting data, like by writing scrapers with BeautifulSoup and Selenium, and less for manipulating data. Later, I started learning pandas because I did not want to go back and forth between manipulating data and visualizing in R and doing other tasks in Python. I also occasionally used data.table—a very fast R package for data frame operations—when I needed speed. Lastly, I sometimes used SQLite, another light-weight, embedded RDBMS system to (1) store multiple data frames into a single file so I could move it easily between HPC (the University’s High Computing System) and my local computer, (2) to ensure that column types in the saved data are the exact ones that I want, which can be a problem when storing data as CSV, (3) to have the data easily openable by R and Python processes. SQLite is great, but it is not meant to be data analytics tool.

This meant that for operations like filtering the rows of a dataframe, removing or adding columns, grouping rows by values of a column and aggregating, merging two dataframes—operations I do all the time—I now knew 4-5 different syntax options across multiple languages. Inevitably, my workflow involved constantly trying to learn or remember the syntax of the new framework, confusing one with another, trying to decide mid-task between switching to the more familiar one for that aspect of the task or sticking it out until I got better at the new one. Newer frameworks do not seem to be stopping any time soon: for example, polars is nowadays a very popular alternative to pandas in Python. Existing frameworks do not stay constant either—their functions get deprecated and new ones are introduced. This xkcd summarizes the situation well.

Lastly, another source of friction in my work has been the need to work with larger datasets. Operations on larger datasets naturally tend to be slower and this is one reason for the existence of the competing frameworks mentioned above. Speed is the main selling point for frameworks like data.table and polars and rightfully so. As I alluded to, I also ended up moving my data to HPC systems in many cases, which I luckily had access to at both Penn State and UMass. The need to work with different packages for small vs big data or the need to move data and code to another, bigger computer—i.e. the HPC—is another hassle.

Enter the DuckDB era

Despite being an RDBMS, you will often see DuckDB compared to Python packages like pandas and polars; R packages like dplyr and data.table, the Julia package DataFrames.jl and other tabular data manipulation packages, rather than other database systems like PostgreSQL or MySQL. For me, it solves, or at least significantly alleviates, the issues I mentioned above, which makes DuckDB an absolutely indispensable tool.

The first problem DuckDB solves is the language and framework chaos that I described above. DuckDB has a standalone CLI (Command Line Interface) tool, a Python package, and an R package (and more). It allows me to write my data manipulation, wrangling, exploration code in SQL—even when I’m working in R or Python. SQL is a language that has been around since the 1970s. A phenomenon called the Lindy effect dictates that the life expectancy of a non-perishable item like a programming language is proportional to its age. The appeal of SQL to me is that it gives me peace of mind. In the context of the data manipulation tasks that I do all the time, I do not need to worry about which language or package to focus my attention on or major syntax changes happening from year to year. Instead of worrying about R vs. Python, pandas vs. tidyverse vs. base-R vs. polars, I can write SQL within Python or within R, depending on which tool I find myself working with for that project for other reasons. For a quick and dirty analysis, I do not even need Python or R, I can simply write in SQL directly and run it with the CLI.

Here’s how one task looks like across the (1) command line, (2) Python, (3) R using DuckDB. I took a dataset from a recent Tidy Tuesday challenge that contains 213 papal encyclicals since 1878, calculated the number of encyclicals per pope, and listed the results in descending order. Download data from here (actual source) or here (local copy).

CREATE TABLE encyclicals AS
SELECT *
FROM read_csv('papal_encyclicals.csv');

SELECT
    pope,
    count(*) AS n_encyclicals
FROM encyclicals
GROUP BY pope
ORDER BY n_encyclicals DESC
LIMIT 10;

┌──────────────┬───────────────┐
│     pope     │ n_encyclicals │
│   varchar    │     int64     │
├──────────────┼───────────────┤
│ Leo XIII     │            86 │
│ Pius XII     │            39 │
│ Pius XI      │            23 │
│ Pius X       │            16 │
│ John Paul II │            14 │
│ Benedict XV  │            12 │
│ John XXIII   │             8 │
│ Paul VI      │             7 │
│ Francis      │             4 │
│ Benedict XVI │             3 │
├──────────────┴───────────────┤
│ 10 rows            2 columns │
└──────────────────────────────┘

import duckdb
duckdb.sql("""
CREATE TABLE encyclicals AS
SELECT *
FROM read_csv('papal_encyclicals.csv');

SELECT
    pope,
    count(*) AS n_encyclicals
FROM encyclicals
GROUP BY pope
ORDER BY n_encyclicals DESC
LIMIT 10;
""")

┌──────────────┬───────────────┐
│     pope     │ n_encyclicals │
│   varchar    │     int64     │
├──────────────┼───────────────┤
│ Leo XIII     │            86 │
│ Pius XII     │            39 │
│ Pius XI      │            23 │
│ Pius X       │            16 │
│ John Paul II │            14 │
│ Benedict XV  │            12 │
│ John XXIII   │             8 │
│ Paul VI      │             7 │
│ Francis      │             4 │
│ Benedict XVI │             3 │
├──────────────┴───────────────┤
│ 10 rows            2 columns │
└──────────────────────────────┘

library(duckdb)
con <- dbConnect(duckdb())
dbGetQuery(con, "
CREATE TABLE encyclicals AS
SELECT *
FROM read_csv('papal_encyclicals.csv');

SELECT
    pope,
    count(*) AS n_encyclicals
FROM encyclicals
GROUP BY pope
ORDER BY n_encyclicals DESC
LIMIT 10;
")

           pope n_encyclicals
1      Leo XIII            86
2      Pius XII            39
3       Pius XI            23
4        Pius X            16
5  John Paul II            14
6   Benedict XV            12
7    John XXIII             8
8       Paul VI             7
9       Francis             4
10 Benedict XVI             3

(The DuckDB R client does not support the “duckbox” output style, which is why the printed output’s style is different than the CLI and Python, but the content is the same.)

A major part of DuckDB’s appeal is its speed. While specific benchmarks are going to vary by task and over time, DuckDB is in the same league as the fastest frameworks in data manipulation. It immediately eliminates the need to use a different tool due to the existing tool being slow. Just write the same SQL code and it will deliver excellent performance on very large datasets—another inconvenience that DuckDB eliminates.

The last issue, related to but distinct from the second point, is the issue of out-of-memory datasets. As I alluded to above, when my laptop’s memory is not enough to hold an entire dataset at once, my usual workflow is to go to my university’s High Performance Computing system and do my work there, where I can get much more memory, disk space, and cores. (Thank you UMass and before that Penn State!) However, DuckDB made me realize that I am sometimes too quick to jump to the HPC; that my laptop is a lot more powerful when used with the right tool. DuckDB will automatically spill larger-than-memory datasets to disk and the work can continue uninterrupted. So, it will not just be fast on datasets that fit memory; it can handle anything that fits into a single node (i.e. a single computer)—the limit becomes the hard disk space, which is much larger than what fits in memory.

Some Caveats

First, obviously my take here applies on “tabular” datasets, which are the most common datasets in the social sciences and applied statistics. By “tabular,” I mean data that is best represented in a table / data frame format, with observations in the rows and variables in the columns. Technically, you can represent an image in a data frame or table, recording pixel locations and densities in cells, but it would not be the most useful representation for downstream tasks.

Second, even for tabular datasets, my point is not to suggest that DuckDB / SQL should replace all of the tools I mentioned. I still use all of these—base R, tidyverse, pandas, data.table, polars—and there are many other tools out there that are being used by many people with good reason. And, even if I were to suggest that, SQL / DuckDB cannot do everything one can do with a full-fledged programming language like Python or R. And there are some things it can do but it would be impractical or clunky to do with SQL, which is why I really like that it can operate inside Python and R and complement them. DuckDB itself not trying to replace those tools: in fact, it has great interoperability with a lot of them. The example I showed used DuckDB to read in the data, but that is not required. DuckDB can find dataframes and dataframe-like objects in memory in an R or Python session and run queries against them directly. So, it is a perfect tool to use in combination with all the dataframe frameworks in Python and R.

Third, SQL is can be verbose and at times less ergonomic in complex operations. Typing SELECT * FROM df instead of just typing df can be annoying (DuckDB allows for just FROM df, which improves but does not solve the problem). That’s a tradeoff worth recognizing.

Lastly, I mentioned small vs larger datasets. DuckDB can work with really large datasets as long as they fit in a single node—you can see reports of running queries on 15TB of data and that it works fine. But if you have truly enormous datasets—e.g. in the petabytes—yes, DuckDB is not going to be the best tool for that. You need a distributed query system like Apache Spark, Snowflake, Clickhouse, BigQuery, and so on. Well, guess what language they want their queries to be in?

Conclusion

I find that, like with many other workflows, there is an 80:20 rule at play when dealing with tabular datasets: 80% of the data manipulation and exploration is done with a handful of small operations (select columns, create new columns from existing columns, filter out rows based on values of columns, group by values of a column and calculate summary values for each group, merge two tables on one or few columns) and it makes a lot of sense to use one unified, stable language that is (1) portable across R, Python, CLI (2) is very fast (3) can handle out-of-memory datasets at least for those operations that do 80% of the work.

This was definitely not a DuckDB tutorial, so here are some places to learn more:

Getting started: https://duckdb.org/docs/current/
DuckDB-flavored SQL: https://duckdb.org/docs/current/sql/introduction

Notes

I used Kimi K2.6 (via OpenCode) to correct grammar errors. The model also suggested I add the caveats about the non-ergonomic feel of SQL and that what I say primarily applies to tabular data, which I did. (Kimi K2.6 is a bit overkill for grammar check, but it happened to be a model I was experimenting with while writing this!)

Cover Photo by ZhiCheng Zhang from Pexels: https://www.pexels.com/photo/rubber-duck-on-railing-over-railway-24030286/