TU Berlin

Department of Telecommunication Systems

MA: Anomaly Detection in Cloud Infrastructures from Observational System Data

Designing a successful system for cloud maintenance and resilience (anomaly detection, root-cause analysis, and recovery) is challenging because cloud systems are complex. Resilience is the ability of a cloud platform to recover quickly and continue operating after a failure. Over the last two years, AIOps has emerged as a new category. AIOps refers to multi-layered technology platforms that automate and enhance IT operations by:

  • using analytics and machine learning to analyze big data collected from various IT operations tools and devices, in order to automatically spot and react to issues in real time. Advanced AIOps tools must be able to analyze huge amounts of observational data (such as that found in monitoring systems, job logs, etc.).

The observational data comprises traces, logs, and metrics. The three components produce different types of data. For example, tracing produces events (spans) that contain information about the execution path of a process, as well as the latency of the microservices involved. Data logging is the process of collecting and storing data over a period of time in order to analyze specific trends or to record the data-based events and actions of a system, network, or IT environment. It enables the tracking of all interactions through which data, files, or applications are stored, accessed, or modified on a storage device or application. Finally, monitoring data usually comes in the form of cross-layer metrics such as throughput, CPU usage, memory usage, disk usage, and network latency.
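To make the three data types more concrete, the following is a minimal sketch of how spans, log records, and metric samples might be represented and grouped into common time windows before any joint analysis. All class names, field names, and the helper `bucket_by_window` are hypothetical and chosen for illustration only; real tracing, logging, and monitoring systems each define their own schemas.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One tracing event: a single operation on a request's execution path."""
    trace_id: str
    service: str
    start_s: float      # start time in seconds (assumed common clock)
    duration_ms: float  # latency of this operation


@dataclass
class LogRecord:
    """One log line emitted by a service."""
    timestamp_s: float
    service: str
    level: str
    message: str


@dataclass
class MetricSample:
    """One monitoring measurement, e.g. CPU usage of a host."""
    timestamp_s: float
    host: str
    name: str
    value: float


def bucket_by_window(spans, logs, metrics, window_s=60.0):
    """Group all three sources into fixed-size time windows.

    Returns a dict mapping window index -> {"spans", "logs", "metrics"}.
    """
    windows = defaultdict(lambda: {"spans": [], "logs": [], "metrics": []})
    for span in spans:
        windows[int(span.start_s // window_s)]["spans"].append(span)
    for record in logs:
        windows[int(record.timestamp_s // window_s)]["logs"].append(record)
    for sample in metrics:
        windows[int(sample.timestamp_s // window_s)]["metrics"].append(sample)
    return dict(windows)
```

Fixed windows are only one possible alignment strategy; grouping by trace ID or by host would be equally plausible starting points.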

AIOps then applies analytics and machine learning (ML) to the combined data. The desired outcome is a stream of continuous insights that drive continuous improvement through automation.
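As a rough illustration of what "spotting an issue" in metric data can mean, the sketch below flags points in a single metric series whose value deviates strongly from the preceding window (a rolling z-score). This is a simple statistical baseline chosen for illustration, not the method used in any particular AIOps product or thesis; the function name `zscore_anomalies` and the default `window` and `threshold` values are assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations away from the mean of the `window` preceding values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: the z-score is undefined, so this sketch skips it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU usage samples with one obvious spike at index 6.
cpu_usage = [0.40, 0.42, 0.41, 0.39, 0.40, 0.41, 0.95, 0.40]
print(zscore_anomalies(cpu_usage))
```

A baseline like this only looks at one metric at a time; learning from traces, logs, and metrics jointly requires richer models.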

Concrete theses in this area may focus on the following topics: data generation and integration, learning joint representations from multiple sources of system data, anomaly detection, root-cause analysis, and explaining anomalies. All theses will entail designing a general method, implementing a prototype in the context of existing open-source systems, and experimentally evaluating the prototype with test data on one of our commodity clusters.


machine learning, TensorFlow/PyTorch/Keras, distributed systems

If this sounds interesting to you, please send me an email () with a little bit of background information on yourself, so we can quickly identify a fitting thesis topic together.

