Scaling Up: From Big Data to Big Storage – a Growing Need for Repository Storage

Many customers have contacted us following our announcement of 3TB SATA Repository Storage drives at AWS. For those who are not familiar with this type of storage, we wanted to expand on what Repository Storage is, what it requires, and share a customer use case that illustrates how repository storage differs from primary storage and archival storage.

Repository Storage

Our world is becoming increasingly digital, and data production and consumption are growing at an explosive rate. Technological advancements demand higher storage capacities, Big Data is evolving into a pertinent analytical tool for understanding customer needs and market trends, and employees and clients have come to expect infinite data longevity.

Much of the data we see today is active data. It is heavily read and written, and requires very good performance. This type of data is typically stored in SAN arrays or even Cloud Block Storage. On the opposite end of the performance spectrum lies archival data: data that no longer needs to be written and will scarcely be read. This type of data is typically moved off expensive rotating disks to a tape library for safekeeping. It is no longer active.

Repository data sits in the middle of this spectrum, between active data and archival data, and is usually written once and read seldom. The characteristic that primarily differentiates repository storage from archival storage is the need for data to be retrieved and accessed instantly, with no waiting time. Current cloud archival services and tape archives commonly impose retrieval waits of several minutes, if not hours.
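The three-tier spectrum described above can be sketched as a simple classification policy. This is an illustrative sketch only: the tier names, the 30-day and 365-day thresholds, and the `classify_tier` function are assumptions for the example, not part of any VPSA API.

```python
from datetime import datetime, timedelta

def classify_tier(last_read, last_write, now):
    """Suggest a storage tier from the age of the last read and write.
    Thresholds here are hypothetical; real policies vary per workload."""
    if now - last_write < timedelta(days=30):
        return "active"      # heavily read and written: SAN / cloud block storage
    if now - last_read < timedelta(days=365):
        return "repository"  # written once, read seldom, but must be instantly accessible
    return "archival"        # scarcely read: tape or cloud archival service

now = datetime(2014, 6, 1)
# Recently written data stays on the active tier
print(classify_tier(now - timedelta(days=2), now - timedelta(days=2), now))      # active
# Old writes but occasional reads land on the repository tier
print(classify_tier(now - timedelta(days=90), now - timedelta(days=400), now))   # repository
# Data untouched for over a year can move to the archival tier
print(classify_tier(now - timedelta(days=500), now - timedelta(days=500), now))  # archival
```

The key point the example captures is that the repository tier is defined by read behavior, not write behavior: data may be years past its last write yet still need instant reads.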

So when do instantaneous retrieval times matter for data that is seldom read? A photo uploaded to Facebook is a great illustration. Upon being posted, the photo receives many concurrent reads. After several weeks it will seldom be viewed, yet it must remain instantaneously accessible should any user want to see it. No Facebook user would wait hours or minutes (probably not even several seconds) for a photo to be retrieved from a storage system.

One of the challenges of repository data is that it can require very high capacity, which in most cases means lower-cost drives to support such large-scale volumes. For this reason, efficient storage solutions that can scale both smartly and economically are critical to allow businesses and services to keep growing. Storage virtualization (e.g., software-defined storage) and the cloud are new architectures that offer new performance and capex frameworks, providing these efficiencies alongside the dropping prices and growing capacities of hard drives.

Where does repository storage fit in my storage scheme?
Let’s take a look at the use case of one of the first customers to employ these drives through our VPSA service at AWS US-East. This case is a great example of using VPSA for multiple storage needs, including a write-once, read-seldom model of repository storage alongside high-performance storage for intensive image rendering.

The customer is a company providing mapping services, including high-quality aerial photography for engineering, land use, historical comparison, and environmental planning and studies. Their customers require detailed images, each about 1GB in size. Photographs are created uniquely per customer and project needs. Raw storage for images alone requires over 100TB of Repository Storage. In addition to the image repository, 10TB are required for high-performance, intensive rendering of thousands of images.

Repository Storage may not have the performance requirements of primary storage, but availability is no less crucial. One of the challenges this customer faced in the cloud was achieving true High Availability. An engineer in the field who needs an aerial photo in order to make a construction-related decision needs that file to be available on demand, no matter what. VPSA offers true High Availability, including shared storage clustering and Persistent Reservation.

A second cloud challenge VPSA solved was the 2TB limit on AWS EBS volumes. To efficiently manage the storage needed for large image files and data that will grow to several hundred TB, one needs a small number of large-capacity volumes; 2TB has become a very small volume in today’s digital world.
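A quick back-of-the-envelope calculation shows why the 2TB cap becomes a management burden at this scale. The 300TB dataset size (a stand-in for "several hundred TB") and the 20TB comparison volume size are assumptions for illustration, not product specifications.

```python
import math

dataset_tb = 300    # assumed future dataset size ("several hundred TB")
ebs_limit_tb = 2    # AWS EBS per-volume cap at the time of writing
large_vol_tb = 20   # hypothetical large-capacity volume size for comparison

# Number of volumes an administrator would have to provision and track
ebs_volumes = math.ceil(dataset_tb / ebs_limit_tb)
large_volumes = math.ceil(dataset_tb / large_vol_tb)

print(f"2TB volumes needed: {ebs_volumes}")    # 150 volumes to stripe and manage
print(f"20TB volumes needed: {large_volumes}") # 15 volumes
```

An order-of-magnitude difference in volume count translates directly into provisioning, monitoring, and striping overhead, which is the management problem large-capacity volumes avoid.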

Our Virtual Private Storage Arrays (VPSA™) support both primary and secondary storage – both “hot” and “cold” (repository or archival) data in a single architecture with Enterprise-class storage features. Customers can tune each private array from a price and performance standpoint with on-demand choice of dedicated hardware (multiple HDD and SSD choices, cores and memory).

VPSA 3TB drives at AWS compared to AWS EBS and S3
You can learn more about our new 3TB SATA Repository Storage drives and pricing, along with a feature and capability comparison to AWS EBS and AWS S3, on our announcement blog post. Repository Storage drives are initially available at the AWS US-East and US-West locations, with VPSA Enterprise storage-as-a-service also available at Dimension Data and various colocation facilities (Equinix, KVH Japan).

Promotional Free Trial 
For a free trial, please register on our website with promotional code: Repository3. 
To learn more about our solution, or if you are a service provider seeking to offer VPSA, please contact our sales team.
