Enabling Remote Query Execution Through DuckDB Extensions

DuckDB is a high-performance, embeddable analytical database system that has gained massive popularity in the last few years. Its state-of-the-art vectorized query engine runs on the local compute of your laptop, so a surprising number of analytical tasks complete very quickly. Data professionals frequently call it the “SQLite for analytics” for its simplicity and embeddable model.

For developers, DuckDB provides programmatic access to its entire codebase through its extension model. It supports various data scanners (e.g. Parquet, CSV, Arrow, JSON) and over 500 scalar and aggregate functions through both in-tree and out-of-tree extensions. Any developer can write their own extension to support specialized logic or functions for their needs. In this talk, we will look at how DuckDB extensions work and discuss best practices and considerations for building them, drawn from experience.
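As a rough mental model of the idea that an extension contributes functions to the database when it is loaded, here is a minimal Python sketch. It is a toy, not DuckDB's actual extension API (real extensions are built against DuckDB's own codebase); `FunctionCatalog`, `load_text_utils`, and `reverse_words` are hypothetical names invented for illustration.

```python
from typing import Any, Callable, Dict


class FunctionCatalog:
    """Toy stand-in for a database's function catalog (hypothetical)."""

    def __init__(self) -> None:
        self._scalars: Dict[str, Callable[..., Any]] = {}

    def register_scalar(self, name: str, fn: Callable[..., Any]) -> None:
        # Refuse to silently overwrite a function another extension registered.
        if name in self._scalars:
            raise ValueError(f"function {name!r} is already registered")
        self._scalars[name] = fn

    def call(self, name: str, *args: Any) -> Any:
        # Look the function up by name, as a query engine would at bind time.
        return self._scalars[name](*args)


def load_text_utils(catalog: FunctionCatalog) -> None:
    """Hypothetical extension entry point: registers its functions on load."""
    catalog.register_scalar(
        "reverse_words", lambda s: " ".join(reversed(s.split()))
    )


catalog = FunctionCatalog()
load_text_utils(catalog)  # "loading" the extension makes its functions callable
result = catalog.call("reverse_words", "extend the duck")
```

The point of the sketch is only the shape of the contract: the host owns the catalog, and the extension's load hook is the single place where new capabilities are added to it.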

We will also see how far the extension model can be pushed: as far as hybrid query execution, a query execution model that runs queries closer to where the data lives in order to achieve better query performance, reduce cost, and give users more flexibility to decide where their queries run. We will delve into the architecture of DuckDB’s query execution and how we are building a delightful hybrid execution experience by both leveraging the existing DuckDB query execution flow and contributing to the DuckDB codebase. By the end of the talk, you will have a better understanding of the power of DuckDB extensions, how query execution works, and how to build an extension for your own needs.
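To make "running queries closer to where the data lives" concrete, here is a minimal Python sketch of one possible placement rule for a two-table join. It is an illustrative assumption, not the planner described in the talk: the `Table`, `JoinPlan`, and `plan_join` names, the row counts, and the "ship the smaller table" heuristic are all made up for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Site(Enum):
    LOCAL = "local"
    REMOTE = "remote"


@dataclass(frozen=True)
class Table:
    name: str
    site: Site      # where the data lives
    row_count: int  # rough size estimate (made-up numbers below)


@dataclass(frozen=True)
class JoinPlan:
    run_at: Site            # where the join itself executes
    shipped: Optional[str]  # table whose rows cross the network, if any


def plan_join(left: Table, right: Table) -> JoinPlan:
    """Pick an execution site for a join (toy heuristic, assumed here)."""
    if left.site == right.site:
        # Both inputs already live together: run there, move nothing.
        return JoinPlan(run_at=left.site, shipped=None)
    # Otherwise run next to the larger input and ship the smaller one.
    if left.row_count >= right.row_count:
        larger, smaller = left, right
    else:
        larger, smaller = right, left
    return JoinPlan(run_at=larger.site, shipped=smaller.name)


events = Table("events", Site.REMOTE, 1_000_000_000)
users = Table("users_sample", Site.LOCAL, 5_000)
plan = plan_join(users, events)
print(plan.run_at.value, plan.shipped)
```

A real planner would weigh much more than row counts (filters, projected columns, network cost, user preference), but the sketch shows why placement is a planning decision rather than a fixed property of the query.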

Interview:

What's the focus of your work these days?

My primary focus is on databases and query execution. I work at MotherDuck and we're building a serverless offering of DuckDB, which is an in-memory database invented by researchers at CWI, a Dutch institution. I've been working on this project for the last year, experimenting with new ideas about query execution in local and remote setups.

What's the motivation for your talk at QCon New York 2023?

My main motivation is to share the knowledge I've gained working with DuckDB and experimenting with it. I've been developing cool features with it, and any developer could do something similar because DuckDB is an open-source project. Half of my focus is on the extension model, which allows any programmer or engineer to extend DuckDB and build functionalities on top of it. The other half of the talk is focused on query execution and planning, which is a complex use case of this DuckDB extension model. I want to share my learnings and get some feedback from the audience, especially if they have prior experience working on query execution or have played with DuckDB before.

How would you describe your main persona and target audience for this session?

The audience is most likely a software engineer who has been in the data analytics space for some time, maybe with a focus on databases or data analytics in general. It could also be a data engineer with strong experience or a background in developing applications on top of data warehouses like BigQuery, Snowflake, etc.

Is there anything specific that you'd like people to walk away with after watching your session?

First, I want them to be excited about DuckDB and to come away wanting to develop something on top of it. Second, I want to share the query execution model we've been experimenting with, and if the audience has any feedback, that would be great.


Speaker

Stephanie Wang

Founding Engineer @MotherDuck

Stephanie is a Founding Engineer at MotherDuck, working on building a serverless DuckDB. Her main focus is database and query execution. She previously worked on Google BigQuery where she was the tech lead of the BigQuery developer tools team, building the CLI, client libraries and J/ODBC drivers. Prior to Google, Stephanie worked on building Fixed Income Sales and Trading applications at Morgan Stanley.

Date

Tuesday Jun 13 / 04:10PM EDT (50 minutes)

Location

Salon E

Topics

Architecture, Data Analytics, Database, Data Warehouse

From the same track

Session Streaming

Laying the Foundations for a Kappa Architecture - The Yellow Brick Road

Tuesday Jun 13 / 10:35AM EDT

In the ever-changing landscape of big data, focus is slowly moving away from batch and towards real-time analytics. Data Science workflows are evolving to adapt to this changing landscape.

Sherin Thomas

Staff Software Engineer @Chime

Session Serverless

The Rise of the Serverless Data Architectures

Tuesday Jun 13 / 01:40PM EDT

For a while, it looked like Serverless was just a convenient way to run stateless functions in the cloud. But in the last year we’ve seen the rapid rise in serverless data stores.

Gwen Shapira

Founder @Nile, PMC Member @Kafka

Session Stream Processing

Streaming from Apache Iceberg - Building Low-Latency and Cost-Effective Data Pipelines

Tuesday Jun 13 / 11:50AM EDT

Apache Flink is a very popular stream processing engine featuring sophisticated state management, event-time semantics, and exactly-once state consistency. For low latency processing, Flink jobs typically consume data from streaming sources like Apache Kafka.

Steven Wu

Software Engineer @Apple and Apache Iceberg PMC

Session Data Architecture

Building a Large Scale Real-Time Ad Events Processing System

Tuesday Jun 13 / 02:55PM EDT

Two years ago, we embarked on building DoorDash's ad platform from the ground up. Today, our platform handles over 2 trillion events every day and our advertising business has experienced significant growth in recent years, becoming a key area of focus for the company.

Chao Chu

Software Engineer @DoorDash

Session

Unconference: Modern Data Architecture & Engineering

Tuesday Jun 13 / 05:25PM EDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.

Ben Linders

Independent Consultant in Agile, Lean, Quality and Continuous Improvement