Architecture and Algorithm Design for Stream Processing of NLP Pipelines
Since the emergence of the internet, most content has been text-based. Texts are produced in huge volumes by social networks, blogs, research papers, and traditional journalism in the form of articles. In the case of news articles, groups as different as politicians, disaster relief organizations, the tourism industry, and the financial industry can no longer cope with the volume of data published. Automatic text analysis technologies therefore have to be developed to summarize the relevant data for the user. This topic area concentrates on the stream-based processing of German financial news texts and is carried out in cooperation with YUKKA Lab AG, which develops solutions that enable the real-time analysis of finance-related big data.
In cooperation with YUKKA Lab AG, we defined two thesis topics based on real-world problems:
- Streaming-based NLP algorithms. Real-world financial news data has to be learned from and processed incrementally, on the fly, for each incoming data point. Open-source libraries such as Stanford CoreNLP, DKPro (Darmstadt), and spaCy (Berlin), as well as SaaS solutions such as IBM Watson, are only a subset of the frameworks that could be used.
- Stream processing architecture for NLP algorithm pipelines. Apache frameworks such as Kafka, Spark, and Flink allow data streams to be routed through an analysis pipeline. The aim of this topic is to design and build an architecture that scales with the data streams generated by news feeds and integrates existing analysis components.
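To make the two topics more concrete, the following is a minimal sketch in plain Python of per-item, incremental processing chained as pipeline stages. It is an illustration under stated assumptions, not YUKKA Lab AG's method and not the API of Kafka, Spark, or Flink: the names `tokenize`, `OnlineDocumentFrequency`, and `run_pipeline` are hypothetical, the headlines are invented sample data, and document frequency merely stands in for whatever statistic or model a thesis would actually maintain online.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical tokenizer pattern: runs of word characters (Unicode-aware in Python 3).
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(texts: Iterable[str]) -> Iterator[List[str]]:
    """Stage 1: lower-case each incoming text and split it into word tokens."""
    for text in texts:
        yield TOKEN_PATTERN.findall(text.lower())


class OnlineDocumentFrequency:
    """Stage 2: document-frequency counts that are updated one document at a time."""

    def __init__(self) -> None:
        self.documents_seen = 0
        self.document_frequency: Counter = Counter()

    def update(self, tokens: List[str]) -> None:
        # Each term counts at most once per document; no earlier document is revisited.
        self.documents_seen += 1
        self.document_frequency.update(set(tokens))

    def top_terms(self, n: int) -> List[Tuple[str, int]]:
        return self.document_frequency.most_common(n)


def run_pipeline(texts: Iterable[str], model: OnlineDocumentFrequency) -> OnlineDocumentFrequency:
    """Pull texts through the stages lazily, so an unbounded feed can be consumed."""
    for tokens in tokenize(texts):
        model.update(tokens)
    return model


if __name__ == "__main__":
    # Invented sample headlines standing in for a news feed.
    feed = [
        "Die Bank erhöht die Zinsen",
        "Die Aktie der Bank fällt",
        "Zinsen steigen weiter",
    ]
    model = run_pipeline(feed, OnlineDocumentFrequency())
    print(model.documents_seen, model.top_terms(3))
```

Because each stage is a generator or a constant-time update, memory use depends on the vocabulary size rather than on the number of documents seen. In a distributed setting, the stages would presumably map onto operators of a stream processing framework, which is the subject of the second topic.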
YUKKA Lab AG provides students with access to financial experts, real-world financial big data, and its existing analysis infrastructure. For testing and evaluation purposes, TUB provides a 200-node cluster (each node: quad-core Xeon @ 3.3 GHz, 16 GB RAM, 3 TB RAID 0). Thesis language: German or English.
Prerequisites: good programming skills in Java; skills in Python and PHP are desirable. For topic 1: interest in algorithm design and online algorithms, and knowledge of basic NLP algorithms. For topic 2: interest in distributed systems and basic knowledge of Hadoop-based systems and processing.