An Introduction to DuckDB: What is It and Why Should You Use It?

Explore DuckDB, the fast, easy-to-use analytics database for Python and R. Learn its key features, use cases, and how it optimizes data analysis tasks.

Jun 9, 2023 · 7 min read

In the realm of data science and analytics, extracting insights from data often involves a series of steps, typically conducted in Python using libraries like pandas. While powerful, pandas may face performance issues with large datasets and resource-intensive operations.

Aiming for a balance between robust functionality and efficiency, DuckDB emerges as an excellent alternative. As a high-speed, user-friendly analytics database, DuckDB is transforming data processing in Python and R.

This article will explore:

DuckDB's unique features and capabilities
Advantages of DuckDB over traditional data manipulation tools
Practical use cases demonstrating DuckDB's potential
Guidelines for working with DuckDB in Python and R

By the end of this article, you will understand why DuckDB is a tool worth considering for your data analysis tasks, regardless of the data size and complexity of the operations involved.

What is DuckDB?

DuckDB is a free, open-source, embedded, in-process, relational, OnLine Analytical Processing (OLAP) DataBase Management System (DBMS). That’s a lot of words and acronyms, so let’s break it down:

By in-process, we mean the DBMS features are running from within the application you’re trying to access from instead of an external process your application connects to
OLAP means the database is designed for data analysis. There are also OnLine Transaction Processing (OLTP) databases, which are designed to handle large transactional data.

The main difference between OLAP and OLTP is how data is stored. OLTP databases commonly store data by records while ensuring all data associated with a record is stored close together in memory. They are also optimized for reading and writing rows efficiently.

In contrast, OLAP databases are commonly oriented by columns, which means data is organized by columns. All data associated with a column is stored near one another in memory, and the database is optimised for reading and computing on the columns efficiently.

Since DuckDB is an OLAP database, any data stored is organized by columns. Additionally, DuckDB is optimized to perform complex queries on data (e.g., joins, aggregations, groupings, etc.) – check out the SQL Fundamentals skill track to learn more.

If you’re familiar with SQLite, the easiest way to conceptualize DuckDB is as its analytics-focused replica. This plays into why DuckDB is so popular — it leverages the simplicity of SQLite and the functionalities of Snowflake on your local computer. DuckDB fills the need for an embedded database solution for analytical processing.

Key Features of DuckDB

Let’s take a look at some of the key features of DuckDB.

Fast analytical queries

DuckDB is incredibly fast; this may come as a shock to anyone familiar with databases that handle OLAP workloads, like PostgreSQL. Traditionally, OLAP databases tend to have slow response times since they typically process more data.

However, DuckDB runs on a columnar-vectorized query engine, which helps to make efficient use of the CPU cache and speed up response times for analytical query workloads.

Supports SQL and integration with other programming languages

DuckDB enables users to run complex SQL queries and provides APIs for Java, C, C++, and more. It’s also deeply integrated into Python and R, enabling users to conduct efficient interactive data analysis; thus, you can interact with DuckDB from your preferred programming language. There’s also access to extra SQL keywords that make SQL queries easier to write, such as EXCLUDE, REPLACE, and ALL.

Note: DuckDB has no external dependencies, so you don’t have to worry about dependency issues.

Free & open-source

DuckDB is open-sourced and has several active contributors, which means developments and improvements can be implemented fast. It’s also free, although it may not remain free for good.

DuckDB Use Cases

The two most common use cases for DuckDB are interactive data analysis and edge computing.

Let’s learn more about each one in turn.

Interactive data analysis

Data analysis describes the series of steps involved with processing and modeling data. The purpose of data analysis is to discover useful information to inform conclusions and data-driven decisions.

Nowadays, companies employ data scientists and analysts who take on this responsibility using technologies such as Python and R programming languages to conduct their analyses.

DuckDB offers a better alternative to those who prefer to use SQL for their local development than SQLite, which has a tough time managing OLAP workloads due to its lack of functionality for data analytics.

Edge computing

Embeddable databases, like DuckDB, enable users to analyze data on edge. Edge computing is an emerging distributed computing paradigm, which refers to a range of networks and devices being brought to or closer to the location where it’s required.

Processing data closer to where it’s required enables improved response times and bandwidth being preserved.

Getting Started with DuckDB

Your environment determines the steps required to install DuckDB, but the full installation guide can be found in the DuckDB documentation.

For the most part, it’s pretty straightforward. You only require a few lines of code since DuckDB is an embedded solution. It also does not require any server, and it has no external dependencies.

Here’s an example of how we would install DuckDB using Python:

pip install duckdb==0.8.0

And in R:

install.packages("duckdb")

Now, let’s check out some code examples.

Working With DuckDB in Python

Once you’ve installed DuckDB, it’s pretty simple to get started. You simply import duckdb into your environment, then connect to an existing database or create a new one if required.

For example:

import duckdb 
connection = duckdb.connect()

Since no database file was passed as a parameter to the connect() method, a new database will be created.

However, the most straightforward way to get started executing SQL queries with DuckDB is by using the sql() method.

# Source: Basic API usage - https://duckdb.org/docs/api/python/overview.html
import duckdb
duckdb.sql('SELECT 42').show()

Executing this command will run an in-memory database stored globally in the Python module and return a relation, which is essentially a symbolic representation of the query.

Note, the query is not executed until the result is explicitly requested, as follows:

# Source: Execute SQL - https://duckdb.org/docs/guides/python/execute_sql.html
results = duckdb.sql('SELECT 42').fetchall()
print(results)

"""
[(42,)]
"""

We can also use the df() method to convert the data to a Pandas DataFrame:

results_df = duckdb.sql('SELECT 42').df()
print(results) 

"""
    42
0   42
"""

Or you can use one of the many data ingestion methods to read data into memory:

# Source: Python API - https://duckdb.org/docs/api/python/overview.html 
import duckdb
duckdb.read_csv('example.csv') # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')# read a Parquet file into a Relation
duckdb.read_json('example.json') # read a JSON file into a Relation

duckdb.sql('SELECT * FROM "example.csv"')     # directly query a CSV file

Disclaimer: If you use the sql() method without first connecting to an existing database, any data you store in the files will be lost after shutting down the program. You can prevent this by establishing a connection using the connect() method we discussed at the beginning of the section.

Conclusion and Further Resources

DuckDB is an OLAP database used by data professionals, such as data scientists and analysts, to analyze data in a fast and efficient manner. It leverages a SQL query execution engine capable of running complex queries on large datasets.

DuckDB also has integrations with many other programming languages, so it’s possible to access it from your favorite language.

To continue your learning, check out the SQL Basics Cheat Sheet and how DuckDB made SQL a first-class Citizen in DataLab.

Author

Kurtis Pykes

Topics

Data Science

blog

DuckDB makes SQL a first-class citizen on DataLab

In this blog post, we list out all recent improvements that make it seamless and efficient to query data with SQL, all without leaving the tool.

Filip Schouwenaars

blog

NoSQL Databases: What Every Data Scientist Needs to Know

Find out what NoSQL databases are used for, why data scientists use them, and a list of the best NoSQL databases available.

Zoumana Keita

12 min

tutorial

DuckDB for Data Engineers: Speed Up Your Data Pipelines 10x and More

DuckDB is a powerful analytical engine that lives on your laptop. You can use it to speed up data reading and processing and reduce your pipeline runtimes from minutes to seconds. Follow this hands-on guide to learn how.

Dario Radečić

27 min

tutorial

DuckDB Tutorial: Building AI Projects

This tutorial guides you through DuckDB's key features and practical applications, including building tables, performing data analysis, building an RAG application, and using an SQL query engine with LLM.

Abid Ali Awan

13 min

tutorial

An Introduction to Vector Databases For Machine Learning: A Hands-On Guide With Examples

Explore vector databases in ML with our guide. Learn to implement vector embeddings and practical applications.

Gary Alway

8 min

code-along

Introduction to DuckDB SQL

Mehdi, a data engineer and developer advocate, introduces you to DuckDB SQL. You'll learn about the use cases, the changes to SQL syntax to make data analysis easier (no more SELECT *!), and how to include DuckDB in your analytics workflow.

Mehdi Ouazza

See More See More