Distributed Tiered Storage for Cluster Computing
Distributed systems continually exploit improvements in memory, storage devices, and network technologies to meet the growing data storage and I/O demands of modern large-scale data analytics. We present OctopusFS, a novel distributed file system that is aware of storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across both storage tiers and cluster nodes. A new data placement policy employs multi-objective optimization techniques to make intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. Moreover, machine learning is employed to track and predict file access patterns, which data movement policies then use to decide when and which data to move up or down the storage tiers to increase system performance. The approach combines incremental learning with XGBoost, refining the models as new file accesses arrive in order to improve their prediction performance. At the same time, the storage media are explicitly exposed to users and applications, allowing them to choose the distribution, placement, and movement of replicas in the cluster based on their own performance and fault tolerance requirements.
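To make the idea of multi-objective replica placement concrete, the following is a minimal illustrative sketch, not the OctopusFS policy itself. It assumes four normalized objectives (fault tolerance, data balancing, load balancing, throughput) and greedily picks, for each replica, the storage medium closest to the ideal point where every objective equals 1. The `Medium` fields, the `score` and `place_replicas` functions, and the distance-to-ideal formulation are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Medium:
    """A hypothetical storage medium attached to a cluster node."""
    node: str
    tier: str
    free_fraction: float    # remaining capacity / total capacity, in [0, 1]
    active_io: int          # current number of active I/O connections
    throughput_mbps: float  # sustained write throughput


def score(candidate, chosen, max_io, max_tp):
    """Euclidean distance to the ideal point (1, 1, 1, 1); smaller is better."""
    # Fault tolerance: penalize reusing a node that already holds a replica.
    fault = 0.0 if any(c.node == candidate.node for c in chosen) else 1.0
    # Data balancing: prefer media with more free space.
    data_balance = candidate.free_fraction
    # Load balancing: prefer media with fewer active connections.
    load_balance = 1.0 - candidate.active_io / max_io if max_io else 1.0
    # Throughput: prefer faster media, normalized by the fastest one.
    throughput = candidate.throughput_mbps / max_tp if max_tp else 0.0
    objectives = (fault, data_balance, load_balance, throughput)
    return sqrt(sum((1.0 - o) ** 2 for o in objectives))


def place_replicas(media, replicas):
    """Greedily select one medium per replica (assumed strategy)."""
    if not media:
        return []
    max_io = max(m.active_io for m in media)
    max_tp = max(m.throughput_mbps for m in media)
    chosen = []
    remaining = list(media)
    for _ in range(min(replicas, len(remaining))):
        best = min(remaining, key=lambda m: score(m, chosen, max_io, max_tp))
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Made-up cluster state, for illustration only.
    media = [
        Medium("node1", "MEMORY", 0.50, 2, 3000.0),
        Medium("node1", "SSD", 0.80, 1, 500.0),
        Medium("node2", "SSD", 0.70, 0, 500.0),
        Medium("node2", "HDD", 0.90, 4, 150.0),
        Medium("node3", "HDD", 0.95, 0, 150.0),
    ]
    for m in place_replicas(media, 3):
        print(m.node, m.tier)
```

In this sketch all objectives carry equal weight. A real policy would likely weight them differently and account for network topology (e.g., racks), which is omitted here.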