TU Berlin

MA: Adaptive Resource Management for Distributed Dataflows

Many organizations need to analyze increasingly large datasets such as Web data or data emitted by thousands of sensors. Even with the parallel capacities that single nodes provide today, the size of some datasets requires the resources of many nodes. Consequently, distributed systems have been developed to manage and process large datasets with clusters of commodity resources. A popular class of such distributed systems is distributed dataflow systems like Flink, Spark, and Google’s Dataflow. These systems provide high-level programming abstractions revolving around operators like Map, Reduce, and Join. They also provide comprehensive fault-tolerant distributed runtime environments. These runtime environments include effective data partitioning mechanisms, data-parallel operator implementations, task distribution and monitoring, as well as means for data transfer and communication among workers.

Arguably, distributed dataflow systems make it easier for users to develop data-parallel programs that make use of large sets of cluster resources. However, users still need to select adequate resources for their jobs and carefully configure the systems for efficient distributed processing. Yet even expert users often do not fully understand system and workload dynamics, since many factors determine the runtime behavior (e.g. programs, datasets, systems, system configuration, hardware architectures). In practice, users often over-provision significantly to make sure their jobs meet minimum performance requirements. However, even data-parallel processing systems like Spark do not scale without overhead, and scale-out behavior is often not straightforward. Moreover, the usefulness of additional compute resources is ultimately limited by data ingestion rates. Significant over-provisioning therefore leads to low resource utilization and thus to unnecessary costs and energy consumption.

Instead of having users essentially guess adequate sets of resources and system configurations, systems should relieve users of these tasks. The systems should automatically tune resource usage based on models of the workload and user-provided performance constraints. Such models can be learned from a cluster’s execution history, from dedicated profiling runs, or from a combination of both. Ultimately, the goal is to allow users to fully concentrate on their programs and to have systems make more informed resource management decisions.
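To make the idea of model-based resource selection concrete, here is a minimal sketch in Python. It is an illustration only, not the method of any particular system: it assumes a deliberately simple scale-out model, runtime(n) = a + b / n, where a is a non-parallelizable share and b is work that divides evenly across n nodes. The function names and the sample history are hypothetical. The model is fitted to past runs by ordinary least squares, and the smallest scale-out whose predicted runtime meets a user-provided target is selected.

```python
from typing import Optional, Sequence, Tuple

Model = Tuple[float, float]  # (a, b) in runtime(n) = a + b / n


def fit_scaleout_model(nodes: Sequence[int], runtimes: Sequence[float]) -> Model:
    """Fit runtime(n) = a + b / n by ordinary least squares.

    The model is linear in x = 1 / n, so simple linear regression suffices.
    Assumption: the history contains at least two distinct scale-outs.
    """
    xs = [1.0 / n for n in nodes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(runtimes) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_runtime(model: Model, n: int) -> float:
    """Predicted runtime in seconds on n nodes."""
    a, b = model
    return a + b / n


def smallest_scaleout(model: Model, target: float, max_nodes: int) -> Optional[int]:
    """Smallest node count whose predicted runtime meets the target, if any."""
    for n in range(1, max_nodes + 1):
        if predict_runtime(model, n) <= target:
            return n
    return None  # the target is not reachable within max_nodes


# Hypothetical execution history: (nodes used, runtime in seconds).
history_nodes = [2, 4, 8, 12]
history_runtimes = [700.0, 400.0, 250.0, 200.0]

model = fit_scaleout_model(history_nodes, history_runtimes)
choice = smallest_scaleout(model, target=320.0, max_nodes=64)
```

The constant term a is what limits the benefit of adding nodes: a target below a can never be met, however many nodes are allocated, which is one way over-provisioning wastes resources. A realistic model would also have to account for factors such as dataset size, scaling overheads that grow with the number of nodes, and system configuration.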

Concrete theses in this area may focus on the following topics: monitoring, modeling and runtime prediction, model training, resource allocation, runtime adjustments, and automatic system configuration. All theses will include designing a general method, implementing a prototype in the context of an existing open-source system, and experimentally evaluating the prototype with large test datasets on one of our clusters. If this sounds interesting to you, please come talk to me or send me an email, so we can discuss concrete topics with the scope of a master's thesis.

Contact

Lauritz Thamsen
+49 (30) 314-24539
Room TEL 1210