Scaling stream data pipelines
Extracting insights from continuously generated data requires a stream processor with powerful data analytics features, such as Apache Flink. A stream data pipeline built on Flink typically includes a storage component to ingest and serve the data. Pravega is a stream store that ingests and stores stream data permanently, making the data available for tail, catch-up, and historical reads. One important challenge for such pipelines is coping with variations in the workload: daily cycles and seasonal spikes may require the application's provisioning to adapt accordingly. Pravega has a feature called stream scaling, which enables the ingestion capacity of a stream to grow and shrink over time according to the workload. Such a feature is most useful when the downstream application can accommodate these changes and scale its own provisioning accordingly. In this presentation, we introduce stream scaling in Pravega and show how Flink jobs leverage this feature to rescale stateful jobs according to variations in the workload.
Till Rohrmann, Engineering Lead, data Artisans
Till is a PMC member of Apache Flink and a software engineer at data Artisans. His work focuses on enhancing Flink’s scalability as a distributed system. Till studied computer science at TU Berlin, TU Munich, and École Polytechnique, where he specialized in machine learning and massively parallel dataflow systems.
Flavio Junqueira, Engineering Lead, Pravega by DellEMC
Flavio Junqueira leads the Pravega team at DellEMC. He holds a PhD in computer science from the University of California, San Diego, and is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held a software engineer position with Confluent and research positions with Yahoo! Research and Microsoft Research. Flavio has contributed to a few important open-source projects. Most of his current contributions are to the Pravega open-source project; previously, he started and contributed to Apache projects such as Apache ZooKeeper and Apache BookKeeper. Flavio co-authored the O’Reilly book “ZooKeeper: Distributed Process Coordination”.