You are viewing content from a past/completed conference.
  
    
  
  
        
    
  
    
      
  
Streaming from Apache Iceberg - Building Low-Latency and Cost-Effective Data Pipelines
    
  
    
      
	
	
	
	
	
		
		
	
	
		
			
				
					
					                    Abstract
					
						Apache Flink is a very popular stream processing engine featuring sophisticated state management, event-time semantics, and exactly-once state consistency. For low-latency processing, Flink jobs typically consume data from streaming sources like Apache Kafka. Apache Iceberg is a widely adopted data lake technology supporting numerous features like snapshot isolation, transactional commits, and fast scan planning. While Iceberg was originally designed for batch, it can also be used as a streaming source in Flink. This not only lowers processing delays from hours or days to just minutes, but also significantly reduces infrastructure cost and operational burden.
In this talk, we will explain the design of the Flink Iceberg source that we contributed to the Apache Iceberg open source project. We will compare the Kafka and Iceberg sources for streaming reads and present performance evaluation results for the Iceberg streaming read. We will discuss how the Iceberg streaming source can power many common stream processing use cases, such as ETL and feature engineering. It enables users to build low-latency streaming pipelines chained by Iceberg that are cost-effective and easy to operate.
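
The abstract does not describe the source's internals. As a rough illustration of the general idea behind reading a snapshot-based table as a stream, the toy model below has a reader remember the last snapshot it consumed and return only the files appended by newer snapshots on each poll. `ToyTable`, `Snapshot`, and `IncrementalReader` are hypothetical names invented for this sketch; they are not part of the Iceberg or Flink APIs, and the real source handles many concerns (split assignment, checkpointing, watermarks) that are omitted here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    """One committed table version and the data files it appended."""
    snapshot_id: int
    added_files: List[str]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table: an append-only log of snapshots."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, files: List[str]) -> int:
        # Each commit produces a new snapshot with a monotonically increasing id.
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(snapshot_id, list(files)))
        return snapshot_id


class IncrementalReader:
    """Tracks the last consumed snapshot and returns only newer files."""

    def __init__(self, table: ToyTable, start_after: int = 0):
        self.table = table
        self.last_consumed = start_after

    def poll(self) -> List[str]:
        # Discover snapshots committed since the previous poll.
        new = [s for s in self.table.snapshots
               if s.snapshot_id > self.last_consumed]
        files = [f for s in new for f in s.added_files]
        if new:
            # Advance the position so the same files are not read twice.
            self.last_consumed = new[-1].snapshot_id
        return files


table = ToyTable()
table.commit(["a.parquet", "b.parquet"])
reader = IncrementalReader(table)
first = reader.poll()    # files from the first commit
table.commit(["c.parquet"])
second = reader.poll()   # only the newly committed file
third = reader.poll()    # nothing new since the last poll
```

In this model, end-to-end latency is bounded below by how often writers commit and how often the reader polls, which is consistent with the abstract's framing of delays measured in minutes rather than hours or days.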
					 
					
						
					
					
					Speaker
     
    
    
            Steven Wu
      Software Engineer @Apple and Apache Iceberg PMC
          
    Steven Wu is a software engineer at Apple. He works on the AIML data platform team, focusing on stream processing and data lake technologies. Previously, he worked at Netflix, where he helped build the real-time data infrastructure. He is passionate about building scalable distributed systems and empowering people with data.
    
    
       
 
 
				
			 
		 
	
			
			
				From the same track
				
					
    
        Session
        Streaming
        Laying the Foundations for a Kappa Architecture - The Yellow Brick Road
        Tuesday Jun 13 / 10:35AM EDT
        
            
            In the ever-changing landscape of big data, focus is slowly moving away from batch and towards real-time analytics. Data science workflows are evolving to adapt to this changing landscape.
      
        
        	
		 
		
			Sherin Thomas
			Staff Software Engineer @Chime
		 
	 
 
     
 
    
        Session
        Serverless
        The Rise of the Serverless Data Architectures
        Tuesday Jun 13 / 01:40PM EDT
        
            
            For a while, it looked like Serverless was just a convenient way to run stateless functions in the cloud. But in the last year we’ve seen a rapid rise in serverless data stores.
      
        
        	
		 
		
			Gwen Shapira
			Founder @Nile, PMC Member @Kafka
		 
	 
 
     
 
    
        Session
        Data Architecture
        Building a Large Scale Real-Time Ad Events Processing System
        Tuesday Jun 13 / 02:55PM EDT
        
            
            Two years ago, we embarked on building DoorDash's ad platform from the ground up. Today, our platform handles over 2 trillion events every day and our advertising business has experienced significant growth in recent years, becoming a key area of focus for the company.
      
        
        	
		 
		
			Chao Chu
			Software Engineer @DoorDash
		 
	 
 
     
 
    
        Session
        Architecture
        Enabling Remote Query Execution Through DuckDB Extensions
        Tuesday Jun 13 / 04:10PM EDT
        
            
            DuckDB is a high-performance, embeddable analytical database system that has gained massive popularity in the last few years.
      
        
        	
		 
		
			Stephanie Wang
			Founding Engineer @MotherDuck
		 
	 
 
     
 
    
        Session
        
        Unconference: Modern Data Architecture & Engineering
        Tuesday Jun 13 / 05:25PM EDT
        
            
            What is an unconference?
An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.
      
        
        	
		 
		
			Ben Linders
			Independent Consultant in Agile, Lean, Quality and Continuous Improvement
		 
	 
 