DICL - SMML

Storage Data Flow Management based on Streaming Machine Learning

Hybrid data storage systems, if utilized correctly, can be instrumental in meeting the increasing data storage and I/O demands of modern large-scale data analytics and high-performance computing (HPC) workloads. However, moving data across storage tiers and caches becomes significantly more complex in such systems, making it harder for applications to take advantage of the higher I/O performance on offer. The general objective of this project is to automate data flow management for caching and storage tiering in hybrid data storage solutions, relying on newly developed artificial intelligence algorithms, in order to achieve an optimal performance-to-cost ratio across storage media of different capacities. The proposed methodology combines mathematical modeling with streaming machine learning, for the first time, to guide the decisions of data storage systems. For these workloads, the expected outcome could offer sustainable high performance at lower cost.
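
To make the idea of streaming machine learning more concrete, the following is a minimal sketch of a model that learns from one file-access record at a time instead of from a stored training set. It is not the project's actual model: the two features (recency and frequency), the label ("will this file be accessed again soon?"), the synthetic stream, and all class and function names are assumptions chosen purely for illustration.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (plain SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single example is (p - y) * x.
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


def run_stream(model, stream):
    """Test-then-train loop: predict each record first, then learn from it."""
    correct = 0
    total = 0
    for x, y in stream:
        predicted = 1 if model.predict_proba(x) >= 0.5 else 0
        correct += int(predicted == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


def synthetic_stream(n):
    """Hypothetical access records: ([recency, frequency], reaccessed_soon)."""
    for i in range(n):
        recency = (i * 53 % 100) / 100.0
        frequency = (i * 37 % 100) / 100.0
        yield [recency, frequency], 1 if frequency > 0.5 else 0


model = OnlineLogisticRegression(n_features=2, learning_rate=0.5)
accuracy = run_stream(model, synthetic_stream(2000))
print(f"test-then-train accuracy: {accuracy:.2f}")
```

In a tiering or caching context, a prediction like this could feed a decision such as whether to keep a file in a faster tier. How the project actually connects its models to such decisions is described in the publications listed below, not in this sketch.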

In parallel, the project develops DITIS, a new simulator for distributed multi-tiered data storage systems. DITIS simulates the end-to-end execution of file system requests through the different layers and storage nodes of the system, using numerous pluggable policies that control every aspect of the execution. It can be configured with different numbers of storage tiers, caches, nodes, and media devices (e.g., HDD, SSD, NVRAM, DRAM), where each medium can have its own performance characteristics that drive fine-grained performance cost models. DITIS' architecture is based on the Actor Model, in which key components exchange synchronous or asynchronous messages with one another, much like in a real distributed multi-threaded system. As a result, developers can use DITIS to narrow down the design space, evaluate design trade-offs, develop and test different setups and policies, and reduce prototyping effort, while end users can use it to better understand the system's behavior and identify the configuration that best satisfies their requirements.
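
The ideas in this description (actors exchanging messages, per-medium cost models, and pluggable policies) can be illustrated with a toy sketch. This is not DITIS code and does not reflect its API: every class name, message layout, and device figure below is hypothetical. The single-threaded mailbox models asynchronous messages only, and a cache miss is not charged a lookup cost.

```python
from collections import OrderedDict, deque
from dataclasses import dataclass


@dataclass
class Medium:
    """Hypothetical per-medium cost model: fixed latency plus transfer time."""
    name: str
    latency_ms: float
    bandwidth_mb_s: float

    def read_cost_ms(self, size_mb):
        return self.latency_ms + 1000.0 * size_mb / self.bandwidth_mb_s


class LRUPolicy:
    """Pluggable eviction policy: evict the least recently used file."""
    def __init__(self):
        self.order = OrderedDict()

    def touch(self, name):
        self.order.pop(name, None)
        self.order[name] = True

    def evict(self):
        name, _ = self.order.popitem(last=False)
        return name


class System:
    """Delivers queued messages to actors in FIFO order."""
    def __init__(self):
        self.mailbox = deque()

    def run(self):
        while self.mailbox:
            target, message = self.mailbox.popleft()
            target.receive(message)


class Actor:
    def __init__(self, system):
        self.system = system

    def send(self, target, message):
        self.system.mailbox.append((target, message))


class StorageActor(Actor):
    """Backing tier: always serves the request from its own medium."""
    def __init__(self, system, medium):
        super().__init__(system)
        self.medium = medium

    def receive(self, message):
        _, name, size_mb, elapsed_ms, reply_to = message
        cost = self.medium.read_cost_ms(size_mb)
        self.send(reply_to, ("done", name, size_mb, elapsed_ms + cost, None))


class CacheActor(Actor):
    """Cache tier (capacity >= 1 files): serves hits, forwards misses."""
    def __init__(self, system, medium, capacity, policy, backing):
        super().__init__(system)
        self.medium, self.capacity = medium, capacity
        self.policy, self.backing = policy, backing
        self.files = set()

    def receive(self, message):
        _, name, size_mb, elapsed_ms, reply_to = message
        if name in self.files:  # hit: answer from the cache medium
            self.policy.touch(name)
            cost = self.medium.read_cost_ms(size_mb)
            self.send(reply_to, ("done", name, size_mb, elapsed_ms + cost, None))
            return
        if len(self.files) >= self.capacity:  # miss: make room, then admit
            self.files.discard(self.policy.evict())
        self.files.add(name)
        self.policy.touch(name)
        self.send(self.backing, ("read", name, size_mb, elapsed_ms, reply_to))


class ClientActor(Actor):
    """Collects (file name, simulated latency) for completed reads."""
    def __init__(self, system):
        super().__init__(system)
        self.completed = []

    def receive(self, message):
        _, name, _, elapsed_ms, _ = message
        self.completed.append((name, elapsed_ms))


system = System()
client = ClientActor(system)
hdd = StorageActor(system, Medium("HDD", latency_ms=8.0, bandwidth_mb_s=150.0))
dram = Medium("DRAM", latency_ms=0.001, bandwidth_mb_s=10000.0)
cache = CacheActor(system, dram, capacity=2, policy=LRUPolicy(), backing=hdd)
for name in ["a", "b", "a", "c", "b"]:
    client.send(cache, ("read", name, 15.0, 0.0, client))
system.run()
print(client.completed)
```

Swapping `LRUPolicy` for another object with the same `touch` and `evict` methods changes the cache's behavior without touching the actors, which is the kind of separation a pluggable-policy design aims for.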

Relevant Publications

E. R. Lucas Filho, A. Efstathiou, L. Yang, K. Fu, J. Shen, and H. Herodotou. DITIS: An End-to-End System-Level Simulator and Optimizer for Distributed Tiered Storage. Springer Nature Computer Science (SNCS), Vol. 6, Article 746, 28 pages, August 2025.
E. R. Lucas Filho, G. Savva, L. Yang, K. Fu, J. Shen, and H. Herodotou. Employing Streaming Machine Learning for Modeling Workload Patterns in Multi-Tiered Data Storage Systems. Future Internet, Vol. 17, No. 4, Article 170, 37 pages, April 2025.
S. Vasileiadis, M. Paraskeva, G. Savva, A. Efstathiou, E. R. Lucas Filho, J. Shen, L. Yang, K. Fu, and H. Herodotou. Optimizing Distributed Tiered Data Storage Systems with DITIS. Proc. of the VLDB Endowment (PVLDB), Vol. 17, No. 12, pp. 4393-4396, August 2024.
E. R. Lucas Filho, L. Yang, K. Fu, and H. Herodotou. Streaming Machine Learning for Supporting Data Prefetching in Modern Data Storage Systems. In Proc. of the First Workshop on AI for Systems (AI4Sys '23), 6 pages, June 2023.
E. R. Lucas Filho, L. Odysseos, L. Yang, K. Fu, and H. Herodotou. DITIS: A Distributed Tiered Storage Simulator. Infocommunications Journal, Vol. XIV, No. 4, Article 3, pp. 18-25, December 2022.

Software Releases

DITIS UI - Distributed Tiered Storage Simulator UI, Apache License 2.0, April 2024

Funding

Huawei Technologies Inc., Nov 2021 - Apr 2023