Research Projects
Research Program Areas
- Object-based Storage Systems
- Storage Networking
- Storage Security
- Applications with Intelligent Storage
- Improving Energy Efficiency of Storage Subsystems
New project section
Project Title: Data Center Power Management
People:
Nagapramod Mandagere, Professor David Du
Description:
Project Objectives: The high-level goal of the project is to optimize
power usage in large data center environments. The main areas of focus
are provisioning of server and storage resources with cooling design in
mind, load balancing for data center power efficiency, failure planning
for thermal emergencies, and location-aware data layouts for distributed
storage.
Research Objectives:
- Thermal Aware Static Provisioning
– Mapping Workloads to Servers
- Dynamic Provisioning/Load Balancing
– Mapping new workloads to existing data center
- Layout Optimizations for Data Centers
– Designing efficient layouts with thermal modeling
- Failure Planning/problem diagnosis
– Hot spot detection & minimization
Papers:
GreenStor: Application aided Energy Aware Storage, MSST 07
Data Center Power Management (Survey), work in progress
Project Title: Data De-Duplication
People:
Guanlin Lu, Nohhyun Park, Chuanyi Liu, Professor David Du,
Professor David Lilja
Description:
Core Research Issues:
- Technology: algorithms, hashing, chunking, etc.
- Architecture: client side or server side, etc.
- File resemblance algorithm improvement
- Novel framework for delta encoding based de-dup
De-Duplication Research Projects:
- Semantic De-Duplication
- Hardware optimization for De-duplication
- Authentication and integrity validation
- Backup De-duplication vs. Filesystem De-duplication
Guanlin Lu
A traditional application of de-dup is to minimize the cost of data backup
by storing only modified data. Recently, however, interest has emerged in
de-duplicating filesystems and virtual machine images. This interest is
motivated by the potential of de-dup to enable fast filesystem recovery
and fast online migration of virtual machines.
From the de-dup perspective, there are two typical application
environments: (1) datasets in which each file is a pure bit stream and
no other information is available; and (2) datasets for which
descriptive information (e.g. name, type, owner, version, c/m time,
annotations) and format information (e.g. MIME, HTML, XML, JPEG, MPEG)
are available.
For (1), we are working on a new definition of inter-file similarity that
overcomes the deficiencies of the widely used shingle-set definition. We
are also proposing a new framework that lets delta-encoding-based de-dup
handle large datasets while avoiding the cost of N-to-N comparison.
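For reference, the widely used shingle-set resemblance mentioned above can be sketched as follows. This is the conventional baseline (Jaccard similarity over sets of fixed-width substrings), not the new definition under development; the shingle width and the function names are illustrative assumptions.

```python
def shingles(data: bytes, w: int = 4) -> set:
    """Return the set of all contiguous w-byte substrings (shingles) of data."""
    if len(data) < w:
        # Inputs shorter than one shingle are treated as a single shingle.
        return {data} if data else set()
    return {data[i:i + w] for i in range(len(data) - w + 1)}


def resemblance(a: bytes, b: bytes, w: int = 4) -> float:
    """Jaccard resemblance of two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty inputs are considered identical
    return len(sa & sb) / len(sa | sb)
```

Comparing every file against every other file with this measure is what leads to the N-to-N comparison cost noted above.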
For (2), we are focusing on how to use this descriptive and format
information to facilitate de-dup. We believe this application environment
is particularly important for filesystem and virtual machine de-dup,
since in both cases plenty of descriptive and format information is
available.
Nohhyun Park
One merit of deduplication compared to simple inter-file compression is
that it makes file/object-level backup and restore possible at minimal
cost. This could allow client-initiated backup/restore, real-time backup,
versioning filesystems, etc. at relatively low cost. However, a very
large table must be updated and searched at sufficiently high speed. The
difficulty is that there is no apparent locality to exploit, since chunks
are delimited by bit patterns rather than by any semantic meaning.
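To illustrate what it means for chunks to be delimited by bit patterns, here is a minimal sketch of content-defined chunking. It uses a simple gear-style rolling hash chosen for brevity; the table, mask, and size limits are assumptions for illustration and do not describe the chunker used in this project.

```python
import hashlib
import random

_rng = random.Random(0)  # fixed seed so the table is the same on every run
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, mask: int = 0x3F, min_size: int = 16, max_size: int = 256):
    """Cut data wherever the masked bits of a rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # rolling hash over recent bytes
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(data: bytes):
    """Chunk IDs are cryptographic hashes of the chunk contents."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because the resulting IDs are hash values, consecutive chunks of a file map to unrelated positions in the index, which is one way to see why the lookup table shows no apparent locality.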
Using a cryptographic hash as a chunk ID carries the potential for
collisions. Collisions can be ruled out completely only by a byte-level
or equivalent comparison. However, it may be possible to fold the cost of
the collision test into other operations, such as consistency checks or
error correction, instead of simply performing a byte-level comparison.
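The baseline approach, in which every hash match is confirmed by a byte-level comparison, can be sketched as follows. The class and its in-memory dictionary index are hypothetical stand-ins for a real chunk store, shown only to make the trade-off concrete.

```python
import hashlib


class ChunkStore:
    """Toy chunk index: chunk ID (SHA-256 hex digest) -> stored chunk bytes."""

    def __init__(self):
        self.index = {}
        self.duplicates = 0

    def put(self, chunk: bytes) -> str:
        cid = hashlib.sha256(chunk).hexdigest()
        existing = self.index.get(cid)
        if existing is None:
            self.index[cid] = chunk        # first time this content is seen
        elif existing == chunk:            # byte-level check confirms a duplicate
            self.duplicates += 1
        else:
            raise ValueError("hash collision: same ID, different content")
        return cid

    def get(self, cid: str) -> bytes:
        return self.index[cid]
```

The `existing == chunk` comparison is the cost that the paragraph above proposes to amortize into other operations.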
De-duplicating encrypted data is problematic since the entropy of n-bit
encrypted data is ideally n bits, which defeats variable-size chunking
and leaves a minimal number of duplicates to find. Although encrypting
the volume after deduplication is a possible option, a more secure and
efficient method may be worth looking into.
Data Deduplication Process
Project Title: Buffer Management Schemes for NAND Flash Solid State
Disk
People:
Biplob Debnath, Sunil Subramanya, Professor David Du,
Professor David Lilja, Professor Mohamed Mokbel
Description:
NAND flash-based Solid State Disk (SSD) systems have very fast read
performance but slower write performance. In certain cases, writes can be
as much as an order of magnitude slower than reads. In addition, flash
does not allow in-place updates: without a buffering scheme, every update
requires an "erase" and a "program" operation on the affected sectors.
The problem is made worse by the fact that an "erase" can only be
performed at a coarser granularity (a block-level operation) than a
"program" (a sector-level operation). Moreover, a block can be erased
only a limited number of times (e.g., 100K times). A good buffering
scheme can therefore yield a substantial performance gain for any
workload that involves write operations. The goal of this project is to
improve the lifetime and performance of SSDs using a new caching
mechanism that exploits the unique characteristics of flash.
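The benefit of buffering can be seen in a small model that groups dirty sectors by erase block and counts one erase per flushed block. This is a generic block-level write buffer with least-recently-written eviction, not the caching mechanism proposed in this project; the geometry and class name are assumptions.

```python
from collections import OrderedDict

SECTORS_PER_BLOCK = 64  # assumed flash geometry


class BlockWriteBuffer:
    """Buffers sector writes and flushes them one erase block at a time."""

    def __init__(self, capacity_sectors: int):
        self.capacity = capacity_sectors
        self.dirty = OrderedDict()  # block number -> set of dirty sectors
        self.buffered = 0           # distinct dirty sectors currently held
        self.erases = 0             # erase operations issued to flash

    def write(self, sector: int) -> None:
        block = sector // SECTORS_PER_BLOCK
        sectors = self.dirty.setdefault(block, set())
        self.dirty.move_to_end(block)  # mark block as most recently written
        if sector not in sectors:      # rewrites of a buffered sector are absorbed
            sectors.add(sector)
            self.buffered += 1
        while self.buffered > self.capacity:
            self._flush_oldest()

    def _flush_oldest(self) -> None:
        _block, sectors = self.dirty.popitem(last=False)
        self.buffered -= len(sectors)
        self.erases += 1  # one erase covers every buffered sector of the block

    def flush_all(self) -> None:
        while self.dirty:
            self._flush_oldest()
```

In this model, 640 writes that all fall within one block cost a single erase once flushed, whereas the unbuffered case described above would erase on every update.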
Proposed Write Buffer Cache Process

Project Title: DBMS+OSD
People:
Mohamed Khalefa, Professor Mohamed Mokbel
Description:
Over the last fifty years, storage interfaces (e.g., IDE and SCSI) have
abstracted the underlying storage devices as an array of logical blocks.
This abstraction creates a semantic gap between the DBMS and the storage
device. Recently, Object-based Storage Devices (OSD) have been introduced
to reduce this gap by providing a standard interface through which
applications can communicate performance requirements to storage devices.
An OSD can respond to these requirements by providing intelligent
object-level caching, pre-fetching, security, and data placement. The
DBMS+OSD project aims to tightly integrate relational DBMSs with
object-based storage devices. Given this radical change in storage
devices, the DBMS should be changed to make use of the capabilities of
the underlying storage.
Extracting Scientific Information Process
Object-based Storage Systems
Project Title: Design and Implementation of an Object-Based Storage
System
People:
David Du, Jon Weissman, Yongdae Kim, Dingshan He, etc.
Description:
This project will investigate the design and implementation trade-offs of object-based file systems in an asymmetric file system environment similar to Lustre. Such an environment is highly scalable in that various components, including clients, metadata-servers and object-based storage devices, can be scaled independently.
Project Title: OSD Reference Implementation
People:
David Du, David Lilja, Yingping Lu, Aravindan Raghuveer, Vishal Kher, Jaehoon Jeong, Changjin Hong, Sarah Sharafkandi, Kevin KleinOsowski
Description:
An Object Storage Device (OSD) exploits the increasing intelligence of storage devices and offers object-level access to its initiators. OSD can potentially improve data sharing, scalability, performance and management. T10 develops the OSD standard in cooperation with other industry groups. The latest version, OSD-2, defines the second generation of the OSD command set.
This project is the continuation of the Intel iSCSI (with OSD) reference implementation. It aims to continuously provide a public reference implementation of an iSCSI and OSD target that complies with the latest version of the T10 OSD specification. At the current stage, we have implemented all mandatory commands and attributes. In addition, we plan to provide support for OSD security features (security manager, policy manager and security validation in the target). Other optional features will be added gradually in the future.

Project Title: Object Placement for Parallel Tertiary Storage System
People:
Xianbo Zhang, Dingshan He, David Du and Yingping Lu —
DTC Intelligent Storage Consortium Univ. of Minnesota, Twin Cities
(xzhang,he,du,lu@cs.umn.edu)
Description:
In this project, we are investigating how to use multiple tape libraries to build a parallel tertiary storage system with high aggregate data bandwidth in order to reduce object cluster retrieval latency. In high-performance cluster environments and backup/restore environments, there are cases that require retrieving a cluster of very large objects from tape storage to disk. The object cluster retrieval time affects system utilization or system recovery speed. We propose a scheme that places objects across tape drives to minimize object retrieval latency, and we have built a multiple-tape-library simulator to compare our allocation scheme with two previous schemes, which minimize data seek time and tape switch time respectively. Simulation results show that our scheme outperforms both by striking a good trade-off between tape switch time, data seek time and data transfer time. The effects of various factors on overall object retrieval performance have also been studied.
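The trade-off between switch, seek and transfer time can be illustrated with a deliberately simple cost model. The greedy placement below is a generic longest-first heuristic, not the scheme proposed in this project, and all timing parameters are hypothetical values chosen for the example.

```python
def place_objects(sizes_mb, num_drives):
    """Greedy longest-first placement: each object goes to the least-loaded drive."""
    loads = [0.0] * num_drives
    placement = [[] for _ in range(num_drives)]
    for obj, size in sorted(enumerate(sizes_mb), key=lambda p: -p[1]):
        d = loads.index(min(loads))
        placement[d].append(obj)
        loads[d] += size
    return placement


def retrieval_time(placement, sizes_mb, switch_s=60.0, seek_s=30.0, rate_mb_s=100.0):
    """Drives work in parallel, so the slowest drive determines the latency.

    Simplification: each busy drive pays one tape switch and one seek.
    """
    times = []
    for objs in placement:
        if not objs:
            continue
        transfer = sum(sizes_mb[o] for o in objs) / rate_mb_s
        times.append(switch_s + seek_s + transfer)
    return max(times, default=0.0)
```

With four objects of 4000, 3000, 2000 and 1000 MB, this model gives 190 s on one drive and 140 s on two. The fixed switch and seek costs are paid in parallel, while the transfer time is halved.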
Project Title: High Performance Tape File System for Data Backup and
Archive
People:
Xianbo Zhang, David Du
DTC Intelligent Storage Consortium
Univ. of Minnesota, Twin Cities
(xzhang,du@cs.umn.edu)
Jim Hughes, Ravi Kavuri
StorageTek Inc.
(jim_hughes,ravi_kavuri@storagetek.com)
Description:
Modern tape technology features huge capacity per cartridge (close to 1 TB with compression), low cost per unit of storage, and a high streaming rate (>100 MB/s). Our research topic is taking advantage of these features while making tape as easy to use as possible. For off-site storage and disaster recovery, tape backup/archive is still a strong candidate, and even a must, for rapidly growing volumes of valuable data. We are building a tape file system that presents a tape to users as an ordinary storage device for writes. The system mounts a tape as a storage device and transparently interleaves multiple user data streams for maximum write performance. Request batches are intelligently scheduled to reduce their response time. Based on the prototype, we present system performance and analyze improvements for achieving higher write/read performance.
Project Title: Execution Environment for Active Objects
People:
Jinpyo Kim
Description:
Consider an expansion of the OSD model where objects not only contain data, metadata, and attributes, but they also contain methods, i.e., executable active functions associated with an object.
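As a rough illustration of this model, an active object can be pictured as data and attributes bundled with named methods that run where the object is stored. The class below is a hypothetical sketch; it does not reflect the OSD command set or the execution environment developed in this project.

```python
class ActiveObject:
    """An object that carries data, attributes, and executable methods."""

    def __init__(self, data: bytes, attributes=None):
        self.data = data
        self.attributes = dict(attributes or {})
        self.methods = {}  # method name -> callable taking the object first

    def register(self, name, func):
        """Associate an executable function with this object."""
        self.methods[name] = func

    def invoke(self, name, *args):
        """Run a registered method against the object; only the result is returned."""
        return self.methods[name](self, *args)


# Usage: the computation runs next to the data instead of shipping the data out.
obj = ActiveObject(b"hello world", {"type": "text"})
obj.register("word_count", lambda o: len(o.data.split()))
count = obj.invoke("word_count")
```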
Storage Networking
Project Title: SIMON Simulation and Modeling for SANs
People:
Yongdae Kim, David Du, Vishal Kher, Jaehoon Jeong and
Yingping Lu
Description:
With the ubiquity of TCP/IP networks and the increased popularity of network applications, Quality of Service (QoS) provisioning has become extremely important: real-time applications running on networks, such as streaming video and voice over IP, demand certain guarantees on network bandwidth and delay variance, while TCP/IP networks were not designed for this purpose.
OSD helps address QoS needs where storage systems are concerned. With the object storage model, QoS details become available to the storage system, which empowers it to take part in, and react to, QoS concerns.
Storage Security
Project Title: Reconsidering Security for Storage and Distributed File
Systems
People:
Yongdae Kim and Vishal Kher
* This project is jointly funded by NSF CNS-0448423.
Description:
Despite the tremendous improvement in networked storage technology and the vast improvement in cryptographic techniques, directly applicable state-of-the-art cryptographic techniques are rarely used in today's real-life storage systems and research. There is a wide gap between recent cryptographic advances and the cryptographic techniques actually used to secure storage systems. This project narrows that gap while continuing to identify and resolve new security problems. In particular, this research provides the following solutions:
1) Security for Minimally Trusted Storage Systems: Securing data on minimally trusted storage is particularly important because of frequent insider misuse and the current trend toward outsourced storage services. In this project, new cryptographic mechanisms are provided for temporal access control and secure file sharing.
2) A Secure and Practical Data Sharing System: Relying solely on either the system administrator or the data users to add a new user to a global file system invites misuse of privilege. This project resolves this "dilemma" by splitting the trust between system administrators and data owners so that they must cooperate to add users to the system.
3) Symmetric-key role-based access control: This project provides an efficient decentralized access control mechanism.
For all of these projects, complete cryptographic analysis as well as implementation on diverse storage platforms are considered. The results of this research will be disseminated to diverse audiences ranging from cryptographers to the storage industry.
Applications with Intelligent Storage
Project Title: Realizing Data Provenance from Models to Storage
People:
Abed E. Lawabni, Changjin Hong, David H.C. Du, and Ahmed H.
Tewfik
Description:
Scientific research is based on exchanging data and conclusions. Data collected by a research group, or conclusions reached by the group, build on prior data and results produced by the group and the entire community. They in turn contribute to other derivative innovations, corrections and data. The integrity of scientific knowledge (its accuracy and reproducibility), the rate at which any scientific community can extend it, and the time elapsed between a new discovery and its widespread use for the greater good of society, all depend on the ability to track the propagation and complex interdependencies of the underlying representations and embodiments of knowledge.
The goals of this project are to investigate and build a system architecture that will serve as a foundation for seamlessly tracking and managing scientific data objects across disjoint projects, and for integrating tools that automatically disseminate new knowledge (updates as well as new data objects), making it feasible for researchers to confirm the novelty of a conclusion or to identify faulty processing. More precisely, our primary focus in this project is to exploit the capabilities of OSD-based storage devices to provide a powerful framework for solving the data provenance problem.

Project Title: Deploying Intelligent Storage in a real-world setting:
Issues and Challenges
People:
UofM: Aravindan Raghuveer, Dingshan He, Yingping Lu, Jim Diehl, Abed Lawabni, Prof. David H.C. Du, Prof. Jon Weissman
Mayo Clinic: Dr. Chris Chute, Nathan Spillers, Dr. Piet de Groen, Matthew Weipert
Description:
In recent times, data centers have been facing a data explosion problem in which the complexity and amount of data are increasing exponentially. This leads to both data storage and data management issues. Mayo Clinic faces the same problem along two dimensions: handling patient records and storing experimental research data, and in some cases tying the two together. In this project we explore how intelligent storage can help tackle such issues.
In the intelligent storage paradigm, application-defined attributes of the data objects capture characteristics of the data, which the storage device then uses to store the data efficiently. We are currently building a prototype intelligent storage device to demonstrate such capabilities. We envision this device as a network-attached storage brick with a dual processor core. One processor would handle the regular activities of the storage brick, such as RAID management, and the other would handle the intelligence that we propose to build into the system. The intelligence of the device would be deployed on four fronts: data provenance based object placement, intelligent search capabilities, QoS provisioning, and object-level security. This storage brick would implement the SNIA T-10 OSD interface.
As a first step, we propose to deploy the storage brick to store data generated by a bioinformatics research methodology called microarray-based gene expression analysis. A typical microarray experiment generates various pieces of data with varying properties, ranging from unstructured data such as images to well-structured database tables (see Figure 1). In this phase, we explore whether the properties of the data can be exploited to store it more efficiently.
Figure 1: Data pieces generated at various phases of the microarray experiment
Project Title: Search and Indexing for Intelligent Storage
People:
Jim Diehl, David Du, and Jon Weissman
Description:
As the rate at which data is generated outpaces the rate at which data can be analyzed, it becomes more difficult to manage data efficiently. Correspondingly, the ability to quickly locate and retrieve relevant data from massive collections is becoming increasingly important. The purpose of this project is to explore ways in which intelligent storage can be used to improve data access by reducing query response time and data retrieval time.
Improving Energy Efficiency of Storage Subsystems
Project Title: Improving Energy Efficiency of Storage Subsystems
People:
Pramod Mandagere, David Du
Description:
The sheer size and volume of medical data and the performance requirements of integrated queries suggest that an online storage solution is required. This is particularly true since many queries may require access to older “archived” data. Tape-based archival solutions suffer from high latency and low throughput. An attractive option for large distributed sites such as the Mayo complex is to exploit large disk arrays. The decreasing cost and increasing capacity of commodity disks are rapidly changing the economics of online storage. Large disk arrays will also enable system scaling, an important property as the growth in medical data (clinical and research) is predicted to be enormous both in at-rest storage (TBs or more) and in delivered data (GBs/day or more). This enhanced performance comes at a price. Keeping huge disk arrays “spinning” has a hidden cost: energy. Industry surveys suggest that the cost of powering the nation’s data centers is growing at a rate of 25% every year [1]. Among the various components of a data center, storage is one of the biggest energy consumers, accounting for almost 27% of the total. To make matters worse, the demand for increasing performance has led to disks with higher power requirements; moreover, storage demands are growing by 60% annually according to an industry report [2]. Given the well-known growth in healthcare costs, a solution is needed that can mitigate the high cost of power yet keep data online.
Various studies of data access patterns in data centers suggest that on any given day less than 5% of the total data is accessed [2]. While we can't predict the future access rates for medical data once it is fully online, we expect it to exhibit similar behavior. Most energy conservation techniques rely on various optimizations to conserve energy, but this usually comes with a performance penalty. Another reason these techniques fail is that the data access pattern is very random. We believe that access patterns to medical data may be more predictable. Consider the simple example of a medical patient record retrieval system. For scheduled routine visits, the system can know which records will be needed at well-defined points in the future. Our goal is to exploit hidden patterns and well-known patterns inherent in data access and query history to increase energy efficiency. In this work we explore our solutions in the context of MAID (Massive Array of Idle Disks).
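The energy argument behind idle-disk schemes can be made concrete with a small model: a disk spins up on each access and spins down after a fixed idle timeout. This is a generic illustration rather than the policy studied in this work; the timeout and the power figures are hypothetical, and spin-up energy is ignored.

```python
def spinning_seconds(access_times, idle_timeout, horizon):
    """Total time a disk spins if it powers down idle_timeout seconds after
    its last access. Access times are assumed to lie within [0, horizon]."""
    total, on_since, on_until = 0.0, None, None
    for t in sorted(access_times):
        if on_until is not None and t <= on_until:
            on_until = t + idle_timeout       # still spinning: extend the window
        else:
            if on_until is not None:
                total += on_until - on_since  # close the previous spinning window
            on_since, on_until = t, t + idle_timeout
    if on_until is not None:
        total += min(on_until, horizon) - on_since
    return total


def energy_joules(spinning_s, horizon_s, active_w=10.0, standby_w=1.0):
    """Energy over the horizon, using assumed active and standby power draws."""
    return spinning_s * active_w + (horizon_s - spinning_s) * standby_w
```

For accesses at 0 s, 10 s and 1000 s with a 60 s timeout over one hour, the disk spins for 130 s and uses 4770 J in this model, compared with 36000 J if it spins continuously. When accesses are known in advance, as with scheduled visits, the spin-up can also be issued ahead of the request so that the saving does not add latency.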