Inhalt des Dokuments
Adaptive Execution of Parallel Dataflows
Frameworks such as Apache Flink (flink.apache.org) provide high-level and simple programming abstractions for processing large datasets in parallel with clusters of compute nodes. Users develop programs using traditional database operations (e.g. for grouping and aggregating elements) as well as user-defined functions, which are supplied to second-order functions (e.g. the Map and Reduce operations from MapReduce). These programs do not encode much information about the underlying runtime. The same program can be run locally during development and be deployed to different production clusters. Flink also automatically chooses an optimized execution plan.
However, efficient processing highly depends on allocating a fitting set of resources (e.g. number of machines, machine types), partitioning data evenly across data-parallel paths (e.g. using ranges), and a good runtime configuration (e.g. ratio of memory used for buffers vs. by operators). Moreover, much of the information about the behavior of operations and the characteristics of the data are not available upfront. For example, it is difficult to foresee how CPU- or memory-intensive user-defined operations will be. Or how key values are distributed in a new dataset. For these reasons, estimating the runtime of parallel dataflow jobs as well as selecting resources that deliver the required performance and will be utilized well is very difficult for users.
The goal of theses in this area is to design and prototype mechanisms that make dataflow processing engines more efficient by monitoring actual conditions at runtime and adapting over time. Theses include designing, implementing, and evaluating prototypes integrated with Flink, which is implemented in Java and Scala. The evaluation will be done on one of our clusters.
Specific topics/tasks in this area include:
- monitoring of resource utilization over time and mapping usage to specific tasks
- identification of CPU-, memory- and I/O-bottlenecks
- better task scheduling based on heuristics or information from previous runs
- online migration of tasks to realize improved schedules at runtime
- adaptive runtime configuration based on gathered data and utilization statistics