Storage Data Flow Management based on Machine Learning

SMML Project

Hybrid data storage systems, if utilized correctly, can be instrumental in meeting the increasing data storage and I/O demands of modern large-scale data analytics and HPC workloads. However, the complexity of data movement across the storage tiers and caches increases significantly, making it harder for applications to take advantage of the higher I/O performance offered by the system. The general objective of this project (SMML) is to automate data flow management for caching and storage tiering in hybrid data storage solutions using newly developed artificial intelligence algorithms, in order to achieve an optimal performance-to-cost ratio across storage media of different capacities. The proposed methodology combines, for the first time, mathematical modeling with streaming machine learning to guide the decisions of data storage systems. The expected outcome of the project will be instrumental in meeting the increasing data storage and I/O demands of modern large-scale data analytics and high-performance computing workloads, as it could offer sustainable high performance at a lower cost. This project is done in collaboration with the Data Algorithm Technology Center of the Huawei Russian Research Institute.

Distributed Tiered Storage for Cluster Computing

OctopusFS Architecture

Improvements in memory, storage devices, and network technologies are constantly exploited by distributed systems in order to meet the increasing data storage and I/O demands of modern large-scale data analytics. We present OctopusFS, a novel distributed file system that is aware of storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across both the storage tiers and the cluster nodes. A new data placement policy employs multi-objective optimization techniques for making intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. Moreover, machine learning is employed for tracking and predicting file access patterns, which are then used by data movement policies to decide when and which data to move up or down the storage tiers to increase system performance. This approach uses incremental learning along with XGBoost to dynamically refine the models with new file accesses and improve their prediction accuracy. At the same time, the storage media are explicitly exposed to users and applications, allowing them to choose the distribution, placement, and movement of replicas in the cluster based on their own performance and fault tolerance requirements.
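As a concrete (and heavily simplified) illustration of access-pattern-driven tier movement, the sketch below scores each file by an exponentially decayed access frequency and turns the score into a promote/stay/demote hint. The class name, thresholds, and half-life are invented for illustration; they are not OctopusFS's actual predictor, which refines XGBoost models incrementally.

```python
# Simplified stand-in for a learned file-access predictor: each file gets
# an exponentially decayed access score, and a file is promoted to a faster
# tier (or demoted to a slower one) when its score crosses a threshold.
# All constants below are illustrative, not taken from OctopusFS.

HALF_LIFE = 3600.0          # seconds; recent accesses dominate the score
PROMOTE_THRESHOLD = 3.0     # score above which a file moves up a tier
DEMOTE_THRESHOLD = 0.5      # score below which a file moves down a tier

class AccessTracker:
    def __init__(self):
        self.scores = {}    # path -> (score, last_update_timestamp)

    def record_access(self, path, now):
        score, last = self.scores.get(path, (0.0, now))
        decay = 0.5 ** ((now - last) / HALF_LIFE)   # halve score every HALF_LIFE
        self.scores[path] = (score * decay + 1.0, now)

    def score(self, path, now):
        score, last = self.scores.get(path, (0.0, now))
        return score * (0.5 ** ((now - last) / HALF_LIFE))

    def placement_hint(self, path, now):
        s = self.score(path, now)
        if s >= PROMOTE_THRESHOLD:
            return "promote"
        if s <= DEMOTE_THRESHOLD:
            return "demote"
        return "stay"
```

A data movement policy could periodically sweep the tracked files and issue tier moves for any file whose hint differs from its current tier, with the thresholds tuned to the capacity of each tier.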

Smart Cloud Caching for Data Intensive Applications

SMACC Project

As Cloud computing is gaining popularity among small and medium enterprises, Cloud storage solutions such as Amazon S3 are increasingly utilized for storing, maintaining, and serving application data. Despite the typical high-speed internet connections between applications and Cloud storage, there is still a huge performance gap compared to accessing data from direct-attached memory or even locally attached disks. SMACC is a novel Cloud caching service developed at CUT that can run on application compute nodes (e.g., on Amazon EC2) and cache frequently-used data residing on Amazon S3 into local memory and locally-attached disks (e.g., Amazon EBS) using new smart policies. SMACC also provides an HDFS-compatible API interface, which can be used by big data platforms such as Spark and Hadoop for processing data residing on Amazon S3, while caching data blocks on the various compute nodes for increased performance.

Scaling Transactional Databases with Strong Guarantees

Transaction Mgmt Diagram

Database replication is a common mechanism for scaling the performance and improving the availability of transactional databases, but past approaches have suffered from various issues, including limited scalability, performance-versus-consistency tradeoffs, and requirements for database or application modifications. Hihooi is a replication-based middleware solution that achieves workload scalability, strong consistency guarantees, and elasticity for existing transactional databases at a low cost. A novel replication algorithm enables Hihooi to propagate database modifications asynchronously to all replicas at high speed, while ensuring that all replicas remain consistent. At the same time, a fine-grained routing algorithm load balances incoming transactions across the available replicas in a consistent way. This project is done in collaboration with Dr. Michael Sirivianos from Cyprus University of Technology.
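To make the consistency/load-balancing interplay concrete, here is a minimal sketch of one well-known way to route reads consistently over asynchronous replicas. It is an assumption for illustration, not Hihooi's actual algorithm: the router tracks how far each replica has applied the primary's update stream (a log sequence number, LSN) and routes a read-only transaction to the least-loaded replica that has already applied everything the client session has seen, preserving read-your-writes consistency.

```python
# Illustrative consistent read router (not Hihooi's algorithm): replicas
# apply updates asynchronously, so a read is only eligible for a replica
# whose applied LSN has caught up to the session's last observed LSN.

class ConsistentRouter:
    def __init__(self, replicas):
        self.applied = {r: 0 for r in replicas}   # replica -> applied LSN
        self.load = {r: 0 for r in replicas}      # replica -> open reads

    def ack_apply(self, replica, lsn):
        # Replica reports progress applying the primary's update stream.
        self.applied[replica] = max(self.applied[replica], lsn)

    def route_read(self, session_lsn):
        # Only replicas that have applied everything the session depends on.
        eligible = [r for r, lsn in self.applied.items() if lsn >= session_lsn]
        if not eligible:
            return None  # fall back to the primary until a replica catches up
        choice = min(eligible, key=lambda r: self.load[r])
        self.load[choice] += 1
        return choice
```

Write transactions would still go to the primary, which assigns the LSNs that sessions carry forward.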

Computational Intelligence Approaches for Optimizing Seaside Operations in Smart Ports

CIBAP Project

The increasing number of ships and containers observed in recent decades creates several challenges for marine container terminals (MCTs), such as congestion, long waiting times before ships dock, delayed departures, and high service costs. The berth allocation problem (BAP) and the quay crane assignment problem (QCAP) are two of the most important optimization problems in container terminals at ports worldwide. The CIBAP project develops computational intelligence (CI) based methodologies for several BAP formulations in real-world environments with practical constraints. The first formulation considers the stand-alone BAP with the objective of reducing the total service cost. We then extend the study of BAP to multiple quays, which adds the additional dimension of assigning a preferred quay to each arriving ship, rather than just specifying the berthing position and time. Finally, the project investigates the multi-quay combined BAP and QCAP, and solves it using CI approaches. For all formulations, a mathematical model is developed and each problem is formulated as a mixed-integer linear programming (MILP) model. Since BAP (and its variations) is NP-hard, a metaheuristic approach, namely a cuckoo search algorithm (CSA), is proposed to solve it. To validate the performance of the proposed CSA-based method, we use two benchmark CI approaches, namely the genetic algorithm (GA) and particle swarm optimization (PSO). The comparative analysis and experimental results show that the CSA-based method outperforms the other CI-based methods, while achieving near-optimal results in affordable time for all considered scenarios.
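To illustrate the service-cost objective on a toy scale, the sketch below solves a single-berth instance by brute force: it searches every service order for the one minimizing total service cost, taken here as each ship's waiting time plus handling time. The arrival and handling times are made-up example values, and this exhaustive search stands in for the MILP/CSA machinery, which is needed precisely because such enumeration does not scale for the NP-hard problem.

```python
from itertools import permutations

# Toy single-berth BAP instance: ship -> (arrival_time, handling_time).
# Values are illustrative only.
ships = {
    "A": (0, 5),
    "B": (1, 2),
    "C": (2, 1),
}

def total_cost(order):
    """Total service cost of serving ships in the given order:
    sum over ships of (waiting time + handling time)."""
    t, cost = 0, 0
    for s in order:
        arrival, handling = ships[s]
        start = max(t, arrival)               # a ship cannot berth before arriving
        cost += (start - arrival) + handling  # waiting + handling
        t = start + handling                  # berth is busy until this finishes
    return cost

# Exhaustive search over all service orders (feasible only for tiny instances).
best = min(permutations(ships), key=total_cost)
```

Here the short jobs are served first, leaving the long-handling ship A for last; real instances add berthing positions, multiple quays, and crane assignments, which is where the MILP formulation and the cuckoo search come in.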

Real-time Aggression Detection on Social Media

Aggression Detection Project

The rise of online aggression on social media is evolving into a major point of concern. Several machine- and deep-learning approaches have been proposed recently for detecting various types of aggressive behavior. However, social media are fast-paced, generating an ever-increasing amount of content, while aggressive behavior evolves over time. We introduce the first practical, real-time framework (RADONS) for detecting aggression on Twitter by embracing the streaming machine-learning paradigm. The framework is designed to be adaptable (its ML classifiers are trained incrementally as they receive new annotated examples), scalable (it can process the entire Twitter Firehose with three machines), and generalizable (it can detect other abusive behaviors, such as sarcasm, racism, and sexism, in real time). This project is done in collaboration with Dr. Nicolas Kourtellis from Telefonica Research, Spain, and Dr. Despoina Chatzakou from the Centre for Research and Technology Hellas, Greece.
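The incremental-training idea can be sketched with a minimal streaming classifier. The example below is a multinomial Naive Bayes text classifier updated one annotated example at a time; it is a stand-in chosen for brevity, not one of RADONS's actual classifiers, and the tiny labeled examples are invented.

```python
import math
from collections import defaultdict

# Minimal streaming-ML sketch: the model is updated per annotated example
# (learn_one), so it keeps adapting as new labeled tweets arrive, instead
# of being retrained from scratch on a fixed batch.

class StreamingNB:
    def __init__(self):
        self.class_counts = defaultdict(int)                   # label -> #examples
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def learn_one(self, text, label):
        self.class_counts[label] += 1
        for w in text.lower().split():
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def predict_one(self, text):
        words = text.lower().split()
        best_label, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for label, count in self.class_counts.items():
            lp = math.log(count / total)                       # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                # Laplace smoothing handles words unseen for this class.
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label
```

In a deployed pipeline the same `learn_one`/`predict_one` loop would sit behind the stream consumer, with new annotations folded in as they become available.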

Intelligent Vessel Monitoring with AIS

Intelligent Vessel Monitoring with AIS

The Automatic Identification System (AIS) is an automatic tracking system used on ships and by Vessel Traffic Services (VTS) for monitoring vessel movements in real time. AIS signals are sent at regular intervals and contain encoded information about a vessel, including its unique identification, position coordinates, speed and course over ground, next port of destination, and more. The CUT-AIS Ship Tracking Intelligence Platform is a web-based platform that exploits AIS data signals to provide meaningful representations, graphs, and data analytics to the end user. The platform consumes data in real time from three sources: (i) a base station consisting of a VHF antenna, a receiver, and a Raspberry Pi installed on the premises of CUT; (ii) an AIS stream provided by the Cyprus Shipping Deputy Ministry; and (iii) base stations operated by Tototheo Maritime around the coast of Cyprus. AISafety is another web-based platform for monitoring and visualizing ship traffic in the general area of the Eastern Mediterranean Sea using AIS data. The key feature of the platform is generating real-time, valid warnings for ships entering and leaving various areas of interest, as well as for potential ship collisions.
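The geometry underlying collision warnings can be sketched from exactly the fields AIS provides (position, speed over ground, course over ground). The example below projects two vessels onto a local flat plane and computes their closest point of approach (CPA); it is an illustrative calculation, not AISafety's actual logic, and the warning thresholds are made-up values.

```python
import math

# CPA sketch from AIS fields: positions (lat/lon in degrees), speeds over
# ground (knots), and courses over ground (degrees clockwise from north).
# A small-area flat-plane approximation is used, which is reasonable for
# the short ranges at which collision warnings matter.

EARTH_NM = 60.0  # nautical miles per degree of latitude

def cpa(lat1, lon1, sog1, cog1, lat2, lon2, sog2, cog2):
    mid_lat = math.radians((lat1 + lat2) / 2)
    # Positions in nautical miles on a local tangent plane.
    x1, y1 = lon1 * EARTH_NM * math.cos(mid_lat), lat1 * EARTH_NM
    x2, y2 = lon2 * EARTH_NM * math.cos(mid_lat), lat2 * EARTH_NM
    # Velocity components in knots.
    vx1, vy1 = sog1 * math.sin(math.radians(cog1)), sog1 * math.cos(math.radians(cog1))
    vx2, vy2 = sog2 * math.sin(math.radians(cog2)), sog2 * math.cos(math.radians(cog2))
    dx, dy = x1 - x2, y1 - y2
    dvx, dvy = vx1 - vx2, vy1 - vy2
    dv2 = dvx * dvx + dvy * dvy
    # Time (hours) at which the distance is minimized; clamp to the future.
    t = 0.0 if dv2 == 0 else max(0.0, -(dx * dvx + dy * dvy) / dv2)
    dist = math.hypot(dx + dvx * t, dy + dvy * t)   # nautical miles at CPA
    return t, dist

def collision_warning(*args, dist_nm=0.5, horizon_h=0.5):
    # Warn when the vessels will pass within dist_nm within horizon_h hours.
    t, dist = cpa(*args)
    return t <= horizon_h and dist <= dist_nm
```

For example, two vessels six nautical miles apart and closing head-on at 10 knots each reach their CPA in 0.3 hours, which would trigger a warning under these thresholds.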

Environmental Quality Monitoring & Analysis

Air Quality Data Monitoring and Analysis

CUT Environmental Monitoring is an intelligence platform developed and maintained by DICL in collaboration with Dr. Michalis P. Michaelides from Cyprus University of Technology. It enables users to access, extract, and analyze air quality, water quality, and meteorological data. Data can be viewed or extracted through a table dashboard, where users can request the specific data they are interested in. These data can also be viewed through a user-friendly live map. Finally, the platform provides a set of graphs that present key statistics on the collected data. Overall, the CUT Environmental Monitoring platform enables end users to monitor parameters related to air and water pollution.

Data-Driven Tourist Destination Marketing

Tourist Destination Marketing Project

This project employs a machine-learning approach to tourist destination marketing campaigns through the analysis of tourists' reviews from TripAdvisor, in order to identify significant patterns in the data. The proposed methodology combines topic modelling using Structured Topic Analysis with sentiment polarity, information on culture, and the purchasing power of tourists for the development of Decision Trees (DTs) at different levels of granularity. The goal is to identify patterns in tourists' accommodation experiences and potential reasons for their satisfaction or dissatisfaction, which in turn can improve destination marketing and optimize a destination's profitability. This project is done in collaboration with Dr. Andreas Gregoriades from Cyprus University of Technology and Dr. Maria Pampaka from The University of Manchester, UK.
