DuckDB is a high-performance, embeddable analytical database system that has gained massive popularity in the last few years. It lets you complete a surprising number of analytical tasks very quickly, because it has a state-of-the-art vectorized query engine and uses the local compute on your laptop to run queries. Data professionals frequently call it the “SQLite for analytics” for its simplicity and embeddable model.
For developers, DuckDB provides programmatic access to its entire codebase through its extension model. It supports various data scanners (e.g. Parquet, CSV, Arrow, JSON) and over 500 scalar and aggregate functions through both in-tree and out-of-tree extensions. Any developer can write their own extension to support specialized logic or functions for their needs. In this talk, we will look at how DuckDB extensions work and discuss best practices and considerations for building them, drawn from experience.
We will also see how far the extension model can be pushed: all the way to hybrid query execution, a query execution model that lets us run queries closer to where the data lives in order to achieve better query performance, reduce cost, and give users more flexibility to decide where their queries run. We will delve into the architecture of DuckDB’s query execution and how we are building a delightful hybrid execution experience by both leveraging the existing DuckDB query execution flow and contributing to the DuckDB codebase. By the end of the talk, you will have a better understanding of the power of DuckDB extensions, how query execution works, and how to build an extension for your own needs.
Interview:
What's the focus of your work these days?
My primary focus is on databases and query execution. I work at MotherDuck and we're building a serverless offering of DuckDB, which is an in-memory database invented by researchers at CWI, a Dutch institution. I've been working on this project for the last year, experimenting with new ideas about query execution in local and remote setups.
What's the motivation for your talk at QCon New York 2023?
My main motivation is to share the knowledge I've gained working with DuckDB and experimenting with it. I've been developing cool features with it, and any developer could do something similar because DuckDB is an open-source project. Half of the talk is focused on the extension model, which allows any programmer or engineer to extend DuckDB and build functionality on top of it. The other half is focused on query execution and planning, which is a complex use case of this DuckDB extension model. I want to share my learnings and get some feedback from the audience, especially if they have prior experience working on query execution or have played with DuckDB before.
How would you describe your main persona and target audience for this session?
The audience is most likely a software engineer who has been in the data analytics space for some time, maybe with a focus on databases or data analytics in general. It could also be a data engineer with strong experience or a background in developing applications on top of data warehouses like BigQuery, Snowflake, etc.
Is there anything specific that you'd like people to walk away with after watching your session?
First, I want them to be excited about DuckDB and come away wanting to develop something on top of it. Second, I want to share the query execution model we've been experimenting with, and if the audience has any feedback, that would be great.
Speaker
Stephanie Wang
Founding Engineer @MotherDuck
Stephanie is a Founding Engineer at MotherDuck, working on building a serverless DuckDB. Her main focus is databases and query execution. She previously worked on Google BigQuery, where she was the tech lead of the BigQuery developer tools team, building the CLI, client libraries and J/ODBC drivers. Prior to Google, Stephanie worked on building Fixed Income Sales and Trading applications at Morgan Stanley.