Archival Storage

Description

We have several active and past projects in archival storage, all of which contribute to building more efficient, reliable, and secure long-term storage systems. In addition, we maintain a wiki page with links to resources on archival storage systems.

  • Archival Workload Studies: The first study of publicly accessible long-term data repositories ever done, and the first study of tertiary storage in over 15 years
  • Logan: A management system to scalably grow, maintain, and evolve a heterogeneous archival storage system
  • Computation-Storage Trade-off: Using provenance to reduce storage overhead by storing initial inputs and selected intermediate results, and recomputing a dataset on demand
  • Pergamum: long-term evolvable storage built from intelligent network-attached bricks with both disk and NVRAM such as flash.
  • (Past Project) Deep Store: building more efficient archival storage using deduplication to take advantage of intra-file and inter-file redundancy.
  • (Past Project) POTSHARDS: long-term secure storage, which allows the secure preservation of data for decades without relying upon traditional encryption to prevent information leakage.

Workload Studies: Information on archival workloads is badly out of date. The most recent studies of archives were done on tertiary storage systems over 15 years ago. Beyond the age of that data, there are now many publicly available archives of historical, compliance, and scientific data whose usage and access patterns are poorly understood. We are working on obtaining and analyzing a variety of archival workloads to better understand their usage patterns, in order to aid in the design and verification of current and future archival storage architectures.

Logan: A system composed of heterogeneous devices offers unique opportunities for administrators to dictate when and where devices are integrated and utilized based upon their characteristics. Logan optimizes the growth of a system by choosing which devices to integrate into the system based on administratively defined policies. Similarly, it maintains and improves system state by allowing administrators to dictate at a high level when and where data should be migrated or rebuilt when a device fails or is decommissioned.
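
As a rough illustration of policy-driven device integration, the sketch below picks devices from a candidate pool using an administrator-supplied admission rule and scoring function. The Device fields, the choose_devices function, and the greedy strategy are all assumptions made for this example; they are not taken from the Logan design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; the fields are illustrative only."""
    name: str
    kind: str            # e.g. "disk" or "flash"
    capacity_gb: int
    idle_watts: float


def choose_devices(
    candidates: Iterable[Device],
    needed_gb: int,
    admit: Callable[[Device], bool],
    score: Callable[[Device], float],
) -> List[Device]:
    """Greedily integrate admitted devices, lowest score first,
    until the requested capacity is covered."""
    chosen: List[Device] = []
    total = 0
    for dev in sorted(filter(admit, candidates), key=score):
        if total >= needed_gb:
            break
        chosen.append(dev)
        total += dev.capacity_gb
    if total < needed_gb:
        raise ValueError("policy admits too little capacity")
    return chosen


pool = [
    Device("d1", "disk", 2000, 5.0),
    Device("f1", "flash", 500, 0.5),
    Device("d2", "disk", 1000, 4.0),
]
# Example policy: admit anything drawing at most 5 W when idle,
# and prefer the lowest idle power per gigabyte.
picked = choose_devices(
    pool,
    needed_gb=2500,
    admit=lambda d: d.idle_watts <= 5.0,
    score=lambda d: d.idle_watts / d.capacity_gb,
)
print([d.name for d in picked])
```

A greedy pass is used here only because it is easy to read; a real policy engine would likely weigh reliability, migration cost, and device age as well.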

Computation-Storage Trade-off: Computations often produce many intermediate or final results that are rarely used. Naively storing or discarding every result can be very expensive: frequently used results that were discarded must be computed repeatedly, while stored results that are never used waste storage. We examine storing the provenance (workflow) used to create a dataset, and choosing an optimal set of inputs and intermediate results to yield the best level of overhead and availability under a variety of constraints, such as the time to retrieve a result and the feasibility of re-computation.
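
The trade-off can be made concrete with a deliberately simple cost model. The sketch below is an assumption for illustration, not the project's model: it keeps a result when recomputing it would exceed a retrieval-time limit, or when a year of storage costs no more than the expected yearly cost of recomputing it. All names and prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical description of one intermediate or final result."""
    name: str
    size_gb: float
    recompute_hours: float
    reads_per_year: float


def should_store(
    r: Result,
    storage_cost_per_gb_year: float,
    compute_cost_per_hour: float,
    max_retrieve_hours: float,
) -> bool:
    """Return True if the result should be kept rather than recomputed."""
    # Constraint: a result that takes too long to regenerate must be stored.
    if r.recompute_hours > max_retrieve_hours:
        return True
    store_cost = r.size_gb * storage_cost_per_gb_year
    recompute_cost = r.reads_per_year * r.recompute_hours * compute_cost_per_hour
    return store_cost <= recompute_cost


hot = Result("hot", size_gb=10, recompute_hours=2, reads_per_year=50)
cold = Result("cold", size_gb=1000, recompute_hours=1, reads_per_year=0.1)
for r in (hot, cold):
    print(r.name, should_store(r, 0.10, 0.50, max_retrieve_hours=24))
```

This treats each result independently; in a real workflow, discarding one result changes the cost of recomputing everything downstream of it.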

Pergamum: This project's goal is to develop a long-term storage system that is evolvable and controls the major storage cost contributors: static, operational, and management. Pergamum consists of a fully distributed network of intelligent storage devices. Each node, called a tome, consists of a SATA hard drive, a low-power processor, NVRAM, and a standardized network interface. Reliability is provided through two levels of redundancy encoding: intra-tome redundancy handles latent sector errors, and inter-tome redundancy handles lost devices. By keeping most of the devices spun down, and by using commodity hardware, Pergamum provides cost efficiency on par with tape-based systems while providing superior random access performance. Further cost savings are realized through hierarchical consistency checking, staged rebuilds, and NVRAM-based metadata stores, all of which reduce disk spin-ups and so yield dramatic energy savings.
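
To illustrate the idea of inter-tome redundancy, the sketch below rebuilds one lost block from the surviving blocks and a parity block. Single XOR parity is used here only as a stand-in: the text above does not say which codes Pergamum uses, and the function names are hypothetical. Blocks are assumed to be of equal length.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(blocks: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Restore at most one missing block (marked None) using the parity block."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost block")
    if not missing:
        return list(blocks)
    survivors = [b for b in blocks if b is not None]
    restored = list(blocks)
    # XOR of all survivors and the parity yields the lost block.
    restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored


data = [b"abcd", b"efgh", b"ijkl"]      # one block per tome
parity = xor_blocks(data)               # held on a fourth tome
print(rebuild([data[0], None, data[2]], parity)[1])
```

The same XOR step could in principle protect sectors within a single tome, which is the role the text assigns to intra-tome redundancy.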

Deep Store: Disk-based deep storage is becoming practical because magnetic disks are rapidly becoming as inexpensive as magnetic tape and optical storage, the traditional storage media used for backup and archiving today. The Deep Store architecture uses inter-file (differential) and intra-file (sliding dictionary) data compression to increase storage density, and adds distribution and redundancy to improve request bandwidth and robustness. As a result, the expected media costs will be much lower than those of traditional backup and archival storage.
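
One simple way to exploit inter-file redundancy is a content-addressed chunk store, where identical chunks are kept only once. The sketch below is an assumption for illustration: fixed-size chunks and SHA-256 digests were chosen for brevity, and the ChunkStore class does not represent the Deep Store design.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    """Toy in-memory store that keeps each distinct chunk once."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}      # digest -> chunk data
        self.files: Dict[str, List[str]] = {}   # file name -> ordered digests

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # store only if new
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore(chunk_size=4)
store.put("a", b"AAAABBBBAAAA")
store.put("b", b"BBBBCCCC")
print(store.stored_bytes(), "bytes stored for 20 bytes of input")
```

Fixed-size chunking misses redundancy when data shifts by a few bytes; content-defined chunking and delta encoding address that at the cost of more complexity.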

POTSHARDS: This project aimed to securely preserve data by breaking it into pieces (shards) and spreading them across multiple archives, so that no individual archive can reconstruct the data or even know which shards it must steal from other archives to rebuild the data. However, a user who gathers all of the shards must be able to reconstruct the original data with no additional information (including encryption keys). We accomplish this using multiple levels of secret splitting and approximate pointers that limit the space that must be searched for related shards while requiring an attacker to obtain exponential numbers of shards that may not be identified in advance. Because it relies on secret splitting, this approach has information-theoretic security, unlike encryption, which might be broken by advances in algorithms or computer hardware. We believe that this approach will become common as the need to securely store data for decades becomes more pressing.
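
The core idea of secret splitting can be shown with a minimal XOR-based scheme in which all n shares are required to recover the data and any smaller set reveals nothing about it. This is a sketch under that assumption only; it does not include the multiple splitting levels or approximate pointers described above, and the function names are hypothetical.

```python
import secrets
from functools import reduce
from typing import List, Sequence


def split(secret: bytes, n: int) -> List[bytes]:
    """Split a secret into n shares; all n are needed to recover it."""
    if n < 2:
        raise ValueError("need at least two shares")
    # n - 1 shares are uniformly random; the last one makes the XOR
    # of all shares equal to the secret.
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(secret, *shares)
    )
    return shares + [last]


def combine(shares: Sequence[bytes]) -> bytes:
    """Recover the secret by XOR-ing every share together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*shares))


shards = split(b"archive me", 4)    # one shard per archive
assert combine(shards) == b"archive me"
```

No key is involved: recovery needs only the shares themselves, which matches the requirement that a user holding all shards can rebuild the data with no additional information.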

Status

Workload Studies: We have thus far obtained several archive access and update logs and are in the process of obtaining more. If you have workload information you wish to share, please contact a current graduate student or faculty member.

Logan: We have designed and initially validated the basic Logan architecture through simulation and are moving on to investigating several areas. First, scalability: the system is ultimately self-managing, which means several challenges must be addressed, such as scalable communications, group membership, and resource discovery. Second, layout heuristics: in a system with thousands of storage devices of varying types and characteristics, brute-force search is simply not an option when searching for devices to coordinate for reliability purposes. Furthermore, access to low-level traces for planning and provisioning incurs overhead, and such traces cannot be guaranteed to be available. Therefore, we must examine in detail the methods available for choosing devices, from techniques such as simulated annealing to basic heuristics based on power draw and feasible I/O.

Computation-Storage Trade-off: There are a variety of areas under investigation in this project. First, provenance content: identifying the information that must be stored within the provenance and workflows, as well as how to gather and represent it. This is less trivial than initially thought, as many issues must be accounted for, such as how deterministic a process is, scheduling constraints, and security issues. Second, useful metrics: a user debating this trade-off should receive concise and simple answers to questions such as "how long will this take to recompute under my defined conditions?" This is a difficult problem, as there may be many possible ways to schedule re-computations among many different processes. Third, data selection: what is the ideal set of data to store, and what should be discarded? Large workflows can contain literally thousands of processes and intermediate results, so the selection process must be as automated as possible.
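
A first-cut answer to the "how long will this take to recompute" question can be sketched as a critical-path calculation over the workflow graph. The model below is an assumption for illustration: it supposes that every process has a known runtime, that independent processes run fully in parallel, and that stored results are available instantly. The function and variable names are hypothetical.

```python
from typing import Dict, Iterable, Set


def recompute_time(
    target: str,
    deps: Dict[str, Iterable[str]],
    runtime: Dict[str, float],
    stored: Set[str],
) -> float:
    """Time to regenerate `target`, given which results are already stored."""
    memo: Dict[str, float] = {}

    def finish(node: str) -> float:
        if node in stored:
            return 0.0          # stored results need no recomputation
        if node not in memo:
            # A process starts once its slowest dependency is ready.
            wait = max((finish(d) for d in deps.get(node, ())), default=0.0)
            memo[node] = wait + runtime[node]
        return memo[node]

    return finish(target)


deps = {"c": ["a", "b"], "d": ["c"]}           # d depends on c; c on a and b
runtime = {"a": 2.0, "b": 5.0, "c": 1.0, "d": 3.0}
print(recompute_time("d", deps, runtime, stored=set()))
print(recompute_time("d", deps, runtime, stored={"b"}))
```

With limited compute resources the true figure lies between this critical-path estimate and the sum of all unstored runtimes, which is one reason the scheduling question is hard.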

Last modified 30 Sep 2011