TU Berlin

Lauritz Thamsen has finished his PhD

Lauritz Thamsen successfully completed his PhD on May 4, 2018. His thesis, "Dynamic Resource Allocation for Distributed Dataflows", was supervised by Prof. Dr. Odej Kao.

Abstract: Distributed dataflow systems enable users to process large datasets in parallel on clusters of commodity nodes. Users temporarily reserve resources for their batch processing jobs in shared clusters through containers. A container in this context is an abstraction of a specific amount of resources, typically a number of virtual cores and an amount of memory. For their production batch jobs, users often have specific runtime targets and need to allocate containers accordingly. However, estimating the performance of distributed dataflow jobs is inherently difficult because it depends on many factors, such as programs, datasets, systems, and resources. Additionally, there is significant performance variance in the execution of distributed dataflows in large shared commodity clusters. For these reasons, users often over-provision resources considerably to ensure the runtime targets of their production jobs are met. This behavior leads to unnecessarily low resource utilization and thereby generates needless costs.

This thesis presents novel methods for predicting the performance of distributed dataflow jobs and for allocating minimal sets of resources predicted to meet users’ runtime targets. The core question addressed by this thesis is how minimal resources can be allocated automatically for a given runtime target and a production batch job of a distributed dataflow framework. To this end, this thesis contributes (1) two models for capturing the scale-out behavior of distributed dataflow jobs, namely a simple parameterized model of distributed processing and a nonparametric model able to interpolate arbitrary scale-out behavior given dense training data, together with a method for automatically choosing between these two models; (2) different measures of the similarity between job executions and methods for selecting similar previous executions of a job as a basis for accurate performance prediction; and (3) a method for continuously monitoring a running job’s progress towards its runtime target and dynamically adjusting resource allocations based on per-stage runtime predictions. The overall solution we present in this thesis supports multiple distributed dataflow systems through the use of black-box models and can be deployed on a per-application basis in existing cluster setups.
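To make the idea of allocating minimal resources for a runtime target concrete, the following is a minimal sketch and not the thesis's actual method. It assumes a hypothetical two-parameter scale-out model, runtime(n) = a + b / n, fitted by ordinary least squares to previous runs, and then picks the smallest node count whose predicted runtime meets the target. The function names, the model form, and the measurements are illustrative assumptions; the models in the thesis may differ.

```python
from typing import List, Optional, Tuple


def fit_scale_out(nodes: List[int], runtimes: List[float]) -> Tuple[float, float]:
    """Fit runtime(n) = a + b / n by ordinary least squares.

    The model form is an assumption made for illustration only.
    Requires at least two distinct node counts.
    """
    xs = [1.0 / n for n in nodes]  # the model is linear in x = 1 / n
    mean_x = sum(xs) / len(xs)
    mean_y = sum(runtimes) / len(runtimes)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def minimal_nodes(a: float, b: float, target: float, max_nodes: int) -> Optional[int]:
    """Return the smallest node count predicted to meet the runtime target,
    or None if no count up to max_nodes is predicted to meet it."""
    for n in range(1, max_nodes + 1):
        if a + b / n <= target:
            return n
    return None


# Made-up runtimes (seconds) of earlier runs of one job at different scale-outs.
previous_nodes = [4, 8, 16, 32]
previous_runtimes = [310.0, 160.0, 85.0, 47.5]

a, b = fit_scale_out(previous_nodes, previous_runtimes)
chosen = minimal_nodes(a, b, target=65.0, max_nodes=60)
print(f"fitted a={a:.1f}, b={b:.1f}, minimal nodes for 65 s target: {chosen}")
```

A black-box fit like this needs only node counts and runtimes from earlier executions, which is why such an approach can, in principle, be applied across different dataflow systems.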

The methods presented in this thesis have been implemented in prototypes, experimentally evaluated on a commodity cluster using exemplary distributed dataflow jobs, and peer-reviewed for publication at renowned international conferences. For the experiments, we used jobs from the domains of search, relational processing, machine learning, and graph processing. We further used different datasets of these domains, ranging from 1 to 745.5 gigabytes, and up to 60 cluster nodes.

Slides of the defense talk (PDF, 4.3 MB).
