CloudButton 2021 Big Data Challenge

What is the challenge?

The challenge consists of developing the entire life cycle of a Big Data application including data acquisition and storage, data preprocessing and indexing, and data queries and visualization.

The solution must use the Lithops Toolkit and run in the IBM Cloud.

Who is organizing this challenge?

This challenge is organized by the European Research project CloudButton. The project is developing novel Cloud technologies aiming to democratize Big Data applications in the Cloud. The CloudButton project has created the Lithops Toolkit that will be used to implement the Big Data challenge.

The challenge is organized by Universitat Rovira i Virgili, with the collaboration of IBM, RedHat and ATOS. The evaluation committee will include representatives from the four institutions.


Participation

Computer Engineering students in their final years and Master's students can participate in groups of up to three members.

Applications are now closed.

The deadline to submit your solutions is June 12th, and we will announce the winners in a public event by June 18th.

Training

As part of the Distributed Systems course, but open to external participants, we will provide training in:

Cloud programming with Lithops in the IBM Cloud

IBM Cloud Functions
Cloud Object Storage
Redis
Jupyter Lab

Data crawling and text dataset generation
Data preprocessing and sentiment analysis libraries in Python
Pandas, GeoPandas and data visualization in Python

Details about the proposed projects

Students are free to choose the topic and the data for their challenge, but we encourage them to select data in Catalan and/or Spanish.

In groups of two or three people, you will create a distributed system using Cloud technologies with three main functionalities: (i) create a new text dataset by extracting information from the Web (Web crawler, Twitter APIs, …) and store it in Cloud Object Storage; (ii) preprocess the text dataset to build structured data (CSV) that can be queried and analyzed later on; and (iii) create Python notebooks to demonstrate date-related queries, basic visualization, and sentiment analysis techniques over the data.

To build the system, you will leverage the Lithops.cloud toolkit developed in CloudButton. This toolkit makes it possible to launch processes in the Cloud over Cloud Functions and to store data in Cloud Object Storage. We will provide training and examples of how to use this toolkit.
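As a rough illustration of launching processes over Cloud Functions, the sketch below maps a crawling function over a list of URLs with the Lithops FunctionExecutor. The helper names (strip_tags, crawl, crawl_in_parallel) are hypothetical and chosen only for this example; it assumes Lithops is installed and already configured for your IBM Cloud account. Writing the results to Cloud Object Storage is left out.

```python
import re
import urllib.request


def strip_tags(html):
    """Very rough tag removal; enough for a sketch, not for production use."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def crawl(url):
    """Runs inside one cloud function: download a page and return its text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return {"url": url, "text": strip_tags(html)}


def crawl_in_parallel(urls):
    """Launch one serverless invocation per URL and collect the results."""
    import lithops  # imported lazily so the helpers above work without it

    fexec = lithops.FunctionExecutor()  # backend comes from your Lithops config
    fexec.map(crawl, urls)              # one function invocation per URL
    return fexec.get_result()           # blocks until all invocations finish
```

Calling crawl_in_parallel with a list of page URLs would return one dictionary per page. The regular-expression tag stripping is a placeholder; a real crawler would normally use a proper HTML parser.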

Stage 1 - Data crawler: Massively parallel functions that crawl data and store it in Cloud Object Storage. Obtain information from web pages or tweets and create a dataset of text data. Use the FaaS backend in Lithops to launch the crawling process over serverless functions.
Stage 2 - Data preprocessing: Produce structured data in CSV format, also stored in Cloud Object Storage. As columns in the CSV file we suggest date, geographic location, URL, and sentiment analysis.
Stage 3 - Python notebook: Demonstrate different queries over your data in Python notebooks. In particular, apply sentiment analysis to data from different dates and periods, and generate simple plots and maps. Enable search queries over the data.
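To make Stages 2 and 3 more concrete, here is a minimal sketch using only the Python standard library. It turns crawled records into CSV with the suggested columns and then answers a simple date-range query. The record layout, the function names, and the tiny word lexicon are assumptions made for this example; in a real project you would use a proper sentiment analysis library that supports Catalan or Spanish, and most likely Pandas for the queries.

```python
import csv
import io
from datetime import date

# Toy lexicon, purely illustrative; not a real sentiment model.
POSITIVE = {"bo", "bona", "bueno", "buena", "millora", "mejora"}
NEGATIVE = {"dolent", "malo", "mala", "crisi", "crisis"}

FIELDS = ["date", "location", "url", "sentiment"]


def toy_sentiment(text):
    """Positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def records_to_csv(records):
    """Stage 2: build CSV text from crawled records (dicts with a 'text' key)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow({
            "date": rec["date"],            # ISO format, e.g. 2021-03-01
            "location": rec["location"],
            "url": rec["url"],
            "sentiment": toy_sentiment(rec["text"]),
        })
    return buf.getvalue()


def mean_sentiment(csv_text, start, end):
    """Stage 3: average sentiment of rows dated between start and end."""
    rows = csv.DictReader(io.StringIO(csv_text))
    scores = [int(r["sentiment"]) for r in rows
              if start <= date.fromisoformat(r["date"]) <= end]
    return sum(scores) / len(scores) if scores else None


records = [
    {"date": "2021-03-01", "location": "Tarragona",
     "url": "https://example.org/a", "text": "La situació millora"},
    {"date": "2021-04-15", "location": "Reus",
     "url": "https://example.org/b", "text": "Una crisi mala"},
]
csv_text = records_to_csv(records)
march = mean_sentiment(csv_text, date(2021, 3, 1), date(2021, 3, 31))
```

In the actual challenge the CSV text would be written to Cloud Object Storage instead of kept in memory, and the notebook would load it from there before running queries and plots.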

Example projects

Students can propose their own ideas for data acquisition and analysis, and may align them with Open Data initiatives like Open Data Lab and Tarragona Open Data Lab. These initiatives are currently interested in datasets that help to understand the social and economic impact of the COVID pandemic in Tarragona/Catalonia.

Some potential ideas could be:

Crawl text from social networks in Spain/Catalonia and apply sentiment analysis to understand the impact of the COVID pandemic.
Crawl text from news web sites in Spain and apply sentiment analysis to understand the impact of the COVID pandemic.
Crawl comments about tourist locations in Catalonia (Airbnb, hotels) to analyze the impact of the pandemic over the following months as the situation improves.

You can also include available open datasets in different data formats in your challenge: