Whereas even a few years ago a terabyte was seen as a large amount of data, today an individual application can generate petabytes of data per second. The tremendous advances in low-cost, high-capacity magnetic hard disk drives (HDDs) and relatively new flash-based solid state drives (SSDs) have been among the key factors supporting big data and the various computing and storage services that our modern society deeply relies on. Datacenter owners all run mission-critical workloads and need to guarantee quality of service to their customers, and both depend heavily on their HDD- and SSD-based storage systems.
However, disk drives are reported to be the most commonly replaced hardware components. The annualized failure rate (AFR) of disk drives has been reported to reach 15%, with 2-4% common for enterprise-class drives and 8-9% for consumer-grade drives. A modern datacenter usually has tens to hundreds of thousands of disk drives installed. At such a scale, disk failures are common, with tens of instances every day, not to mention the larger number of logical failures that make disk drives inaccessible. It is reported that 78% of all hardware replacements in production datacenters were for hard drives. Storage downtime and data loss cost enterprises $1.7 trillion per year.
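To make the scale argument concrete, the following is a minimal sketch of the arithmetic, assuming the common definition of AFR as failures per drive-year of operation, expressed as a percentage. The function names and fleet sizes are illustrative assumptions and are not taken from the project's datasets.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR in percent, assuming AFR = failures per drive-year."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


def expected_daily_failures(num_drives: int, afr_percent: float) -> float:
    """Return the average number of drive failures per day for a fleet."""
    return num_drives * (afr_percent / 100.0) / 365.0


if __name__ == "__main__":
    # Illustrative fleet sizes and the 2-4% enterprise-class AFR range.
    for num_drives in (100_000, 500_000):
        for afr in (2.0, 4.0):
            per_day = expected_daily_failures(num_drives, afr)
            print(f"{num_drives} drives at {afr}% AFR: {per_day:.1f} failures/day")
```

Under these assumptions, a fleet of a few hundred thousand drives at a 2-4% AFR averages tens of failures per day.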
The objective of this project is to achieve a deep understanding of real-world storage reliability and to develop a cost-effective data and storage resource management system for reliability enhancement.
We propose Wizard, a novel architecture that explores disk performance signatures for automated and systematic management of storage resources, targeting reliability assurance in large-scale production storage environments. The development of Wizard is based on a deep and comprehensive characterization of runtime I/O workloads and disk health data collected from several leadership-class production datacenters and made available to this project. These datacenters run diverse application workloads (web services, e-commerce, and high performance computation) and have different storage architectures (consumer-grade HDDs, enterprise-class HDDs, SMR HDDs and SSDs). We treat disk health records as first-class data. By extensively exploring advanced machine learning technologies, we discover the categories and types of disk failures, quantify disk performance degradation processes with performance signatures, and forecast the occurrence time of future failures. In this way, Wizard manages heterogeneous disk devices under diverse storage workloads in a consistent and cost-effective manner.
Moreover, we incorporate proactive disk and data protection as the next natural step in the storage resource management architecture. Compared with reactive data recovery through disk rebuilds, a proactive approach migrates data from an unhealthy storage device to a healthy one before the disk fails, which dramatically reduces both the risk of data loss and the overhead of disk rebuilds. In addition to efficient data rescue, we propose a factor-aware resource scheduling approach in Wizard that extends disk lifetime by smartly distributing storage workloads and other resources among disk drives at different health stages. Wizard also provides a set of APIs that allow storage users and developers to customize data protection and disk health control for flexible, reliable storage management.
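The two ideas in this paragraph, proactive migration and health-aware workload distribution, can be illustrated with a minimal sketch. It is not Wizard's implementation or API: the `Drive` type, the health score in [0, 1], the threshold of 0.3, and both function names are hypothetical, chosen only to show the general shape of such a policy.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    health: float       # hypothetical score in [0, 1]; 1.0 means fully healthy
    used_gb: float
    capacity_gb: float

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def plan_migrations(drives, threshold=0.3):
    """Pair each drive below the health threshold with a healthy target.

    Returns a list of (source_name, target_name) tuples. The least healthy
    sources are handled first, and each goes to the healthiest drive that
    still has room for all of the source's data.
    """
    healthy = [d for d in drives if d.health >= threshold]
    free = {d.name: d.free_gb for d in healthy}
    plan = []
    at_risk = sorted((d for d in drives if d.health < threshold),
                     key=lambda d: d.health)
    for src in at_risk:
        candidates = [d for d in healthy if free[d.name] >= src.used_gb]
        if not candidates:
            continue  # no target with enough room; a real system needs a fallback
        target = max(candidates, key=lambda d: d.health)
        free[target.name] -= src.used_gb  # reserve space for this migration
        plan.append((src.name, target.name))
    return plan


def write_shares(drives, threshold=0.3):
    """Split new writes among healthy drives in proportion to health score."""
    healthy = [d for d in drives if d.health >= threshold]
    total = sum(d.health for d in healthy)
    if total == 0:
        return {}
    return {d.name: d.health / total for d in healthy}


if __name__ == "__main__":
    fleet = [
        Drive("a", 0.9, 100, 1000),
        Drive("b", 0.6, 100, 1000),
        Drive("c", 0.1, 500, 1000),
    ]
    print(plan_migrations(fleet))
    print(write_shares(fleet))
```

In this sketch the unhealthy drive receives no new writes and its data is scheduled to move before it fails, while the remaining drives share the load according to their health. A production policy would also have to account for migration bandwidth, redundancy layout, and workload type.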
- Sidi Lu, Bing Luo, Tirthak Patel, Yongtao Yao, Devesh Tiwari, and Weisong Shi, Making Disk Failure Predictions SMARTer!, accepted by the 18th USENIX Conference on File and Storage Technologies (FAST '20), 2020.
- Song Huang, Shuwen Liang, Song Fu, Weisong Shi, Devesh Tiwari, and Hsing-bung Chen, Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems, accepted by IEEE International Conference on Autonomic Computing (ICAC), 2019.
- Zhi Qiao, Song Fu, Hsing-bung Chen, and Brad Settlemyer, Characterizing and Modeling Reliability of Declustered RAID for HPC Storage Systems, accepted by IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019.
- Shuwen Liang, Zhi Qiao, Jacob Hochstetler, Song Huang, Song Fu, Weisong Shi, Devesh Tiwari, Hsing-bung Chen, Bradley Settlemyer, and David Montoya, Reliability Characterization of Solid State Drives in a Scalable Production Datacenter, in Proceedings of IEEE International Conference on Big Data (Big Data), 2018.
- Zhi Qiao, Shuwen Liang, Nandini Damera, Song Fu, Hsing-bung Chen, and Michael Lang, ACTOR: Active Cloud Storage with Energy-Efficient On-Drive Data Processing, in Proceedings of IEEE International Conference on Big Data (Big Data), 2018.
- Zhi Qiao, Jacob Hochstetler, Shuwen Liang, Song Fu, Hsing-bung Chen, and Bradley Settlemyer, Incorporate Proactive Data Protection in ZFS Towards Reliable Storage Systems, in Proceedings of IEEE International Conference on Big Data Intelligence and Computing (DataCom), 2018.
- Shuwen Liang, Zhi Qiao, Song Fu, and Weisong Shi, In-Depth Reliability Characterization of NAND Flash based Solid State Drives in High Performance Computing Systems, Extended abstract. In Proceedings of IEEE International Conference on Parallel Processing (ICPP), 2018.
- Song Huang, Song Fu, Weisong Shi, and Devesh D. Tiwari, Proactive Disk Failure Management and Data Protection for Highly Available Storage Systems, Extended abstract. In Proceedings of ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2017.
- Kumar, Devesh Tiwari, Saurabh Gupta, Tirthak Patel, Weisong Shi, Song Fu, Christian Engelmann, Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System, in Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2018), Luxembourg, June 25-28.
- Zhi Qiao, Jacob Hochstetler, Shuwen Liang, Song Fu, Hsing-bung HB Chen and Bradley Settlemyer, Enabling Proactive Data Protection in ZFS To Build Reliable Big Data Storage Systems, accepted by IEEE DataCom 2018, August 12-15, Athens, Greece.
- Shuwen Liang, Zhi Qiao, Song Fu, and Weisong Shi. In-Depth Reliability Characterization of NAND Flash based Solid State Drives in High Performance Computing Systems. Extended abstract, accepted by IEEE ICPP, August 13-16, 2018, Oregon, USA.
- Song Huang, Song Fu, Weisong Shi, and Devesh D. Tiwari. Proactive Disk Failure Management and Data Protection for Highly Available Storage Systems. Extended abstract. In Proceedings of ACM HPDC, June 26-30, 2017, Washington DC, USA.
- Biao Xu, Zujie Ren, Weisong Shi, Yongjian Ren, Feng Cao and Jiangbin Lin, iGen: A Realistic Request Generator for Cloud File Systems Benchmarking, in Proceedings of IEEE CLOUD 2016, June 27-July 2, 2016. San Francisco, USA.
- Hsing-Bung Chen and Song Fu. Improving Coding Performance and Energy Efficiency of Erasure Coding Process for Storage Systems. In Proceedings of IEEE CLOUD, June 27-July 2, 2016. San Francisco, USA.
- Song Huang, Song Fu, Quan Zhang and Weisong Shi, Characterizing Disk Failures with Quantified Disk Degradation Signatures: An Early Experience, in Proceedings of 2015 IEEE International Symposium on Workload Characterization (IISWC), Atlanta, GA. Oct 4-6, 2015.
- Qiang Guan and Song Fu, Autonomic Failure Identification and Diagnosis for Building Dependable Computing Systems, Proc. of ACM/IEEE Supercomputing Conference (SC'13), November 2013.
- Zujie Ren, Xianghua Xu, Jian Wan, Weisong Shi and Min Zhou, Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao, 2012 IEEE International Symposium on Workload Characterization (IISWC), November 4-6, 2012, San Diego, USA. Best Paper Award.
To aid the storage community in advancing field studies of disk reliability, we are releasing the source code and the disk dataset of the paper "Making Disk Failure Predictions SMARTer!" (published in FAST '20) for non-commercial use. Please feel free to contact Sidi Lu (email: firstname.lastname@example.org) if you have any concerns, questions, comments or suggestions :)