Low Latency Storage for Stream Data
The proliferation of sensors, devices, and systems has spawned various Big Data applications and scenarios such as smart cities, transportation, manufacturing, or health care, in which many small data items continuously stream into the system. Even though modern parallel streaming engines such as Apache Flink, Spark, and Storm can process these data streams on the fly, billions of such data items must also be stored, accessed, and managed efficiently to allow later insights into the valuable data streams. However, HDFS, the de facto standard Big Data storage, was primarily designed for large static files and does not perform well when handling large volumes of small data items continuously streaming in from a variety of data sources.
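One common mitigation for the small-files problem described above is to buffer many small items and write them out as fewer, larger blocks. The following is a minimal sketch of that idea; the class and its names (`SmallFileBatcher`, `flushThresholdBytes`) are purely illustrative and not part of any existing prototype.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: buffer small stream records and flush them as one
// larger block, mitigating the per-file overhead HDFS incurs for tiny
// objects. All names here are hypothetical.
public class SmallFileBatcher {
    private final int flushThresholdBytes;
    private final List<byte[]> buffer = new ArrayList<>();
    private int bufferedBytes = 0;
    private int flushCount = 0;

    public SmallFileBatcher(int flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    // Adds one small record; returns true if this triggered a flush.
    public boolean add(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (bufferedBytes >= flushThresholdBytes) {
            flush();
            return true;
        }
        return false;
    }

    private void flush() {
        // In a real system this would write one consolidated file
        // (e.g. to HDFS) and record index entries so the contained
        // small items remain individually addressable.
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    public int getFlushCount() { return flushCount; }
}
```

A real design would additionally need the indexing and caching strategies mentioned above so that individual items inside a consolidated block can still be located quickly.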
The goal of theses in this area is to design and prototype mechanisms towards a low-latency storage that bridges batch and stream processing. To this end, new data partitioning, indexing, searching, and caching strategies for storing and accessing small files need to be designed, implemented, and evaluated. Another available thesis is to design and evaluate a streaming-based data ingestion framework that continuously ingests data into our storage. Since there are peak times when more data flows into the system (e.g., events during the week vs. the weekend), the prototype should increase or decrease resources dynamically as the rate of incoming data rises or falls. For this work, an existing prototype based on Apache Flink can be used and extended. The evaluation will be done on one of our clusters.
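The dynamic resource adjustment described above could, for example, be driven by a simple rate-based policy: pick a worker count that covers the observed ingest rate, clamped to configured bounds. The sketch below illustrates only this decision logic; the class name, thresholds, and per-worker capacity are hypothetical, and actually rescaling a running Flink job (e.g. via savepoint and restart with a new parallelism) is out of scope here.

```java
// Hypothetical rate-based scaling policy for a streaming ingestion job.
// Given an assumed sustainable throughput per worker, it computes how
// many workers are needed for the currently observed ingest rate.
public class RateBasedScaler {
    private final double recordsPerSecondPerWorker;
    private final int minWorkers;
    private final int maxWorkers;

    public RateBasedScaler(double recordsPerSecondPerWorker,
                           int minWorkers, int maxWorkers) {
        this.recordsPerSecondPerWorker = recordsPerSecondPerWorker;
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
    }

    // Worker count that covers the observed rate, clamped to [min, max].
    public int targetWorkers(double observedRecordsPerSecond) {
        int needed = (int) Math.ceil(
                observedRecordsPerSecond / recordsPerSecondPerWorker);
        return Math.max(minWorkers, Math.min(maxWorkers, needed));
    }
}
```

In practice the observed rate would be smoothed (e.g. over a sliding window) to avoid oscillating between scale-up and scale-down decisions during short bursts.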
Prerequisites: programming skills in Java/Scala, interest in distributed computing infrastructures and file systems (Hadoop HDFS) as well as parallel dataflow systems (e.g. Flink, Spark). The thesis can be written either in German or in English.