ZFS Hot Spares

How to: Universal Hot-spare Management for ZFS-based Storage Pools

A standard best practice for preventing data loss due to disk failure is to designate one or more spare disks so that fault-tolerant arrays can auto-heal using a spare in the event of an HDD or SSD failure.  Universal hot-spare management takes that a step further and lets you repair any array in the system with a given hot-spare rather than having to designate specific spares for specific RAID groups (arrays).

Policy Driven Hot Spare Management

QuantaStor’s universal hot-spare management was designed to automatically or manually reconstruct/heal fault-tolerant arrays (RAID 1/10/5/50/6/60/7/70) when one or more disks fail.  Depending on the specific needs of a given configuration or storage pool, spare disk devices can be assigned to a specific pool or added to the “Universal” hot spare list, meaning that they can be used by any storage pool.

QuantaStor’s hot spare management system is also distributed and “grid aware” so that spares can be shared between multiple appliances if they are connected to one or more shared disk enclosures (JBODs).  The grid aware spare management system makes decisions about how and when to repair disk pools based on hot-spare management policies set by the IT administrator.

QuantaStor’s policy driven auto-heal system must tackle several challenges: repairing pools only with like devices (don’t use that slow HDD to repair an SSD pool!), detecting when an enclosure was turned off by differentiating between a power loss and a disk failure, allowing users to set policies and designate disks to pools as needed, and allowing hot spares to be shared and reserved within and across appliances for HA (High Availability) configurations.

With this policy driven hot-spare management, OSNEXUS has developed an advanced system that makes managing HBA (SAS Host Bus Adapter) connected spares for our ZFS-based Storage Pools as easy as managing spares in a hardware RAID controller (also manageable via QuantaStor).

Configuring Storage Pool Hot Spare Policies

The Hot Spare Policy Manager can be found in the Modify Storage Pool dialog box under Storage Pools. (Figure 1)

QuantaStor Hot Spare Policy Manager

Figure 1

The default hot spare policy is “Autoselect best match from assigned or universal spares.” Additional options include auto-select best match from pool assigned spares only, auto-select exact match from assigned or universal spares, auto-select exact match from pool assigned spares only and manual hot spare management only.
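The policy options above can be read as two switches: whether universal spares may be used, and whether the spare must match the failed disk exactly. The following is a minimal Python sketch of that idea, not QuantaStor’s implementation. The Spare class, the select_spare function, and the meaning given here to “best match” (same media type, at least as large, smallest that fits) and “exact match” (same media type and same size) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Spare:
    """Hypothetical description of a hot spare (not a QuantaStor type)."""
    name: str
    media: str                           # "HDD" or "SSD"
    size_gb: int
    assigned_pool: Optional[str] = None  # None means a universal spare


def select_spare(spares: List[Spare], pool: str, failed_media: str,
                 failed_size_gb: int, *, exact: bool = False,
                 assigned_only: bool = False) -> Optional[Spare]:
    """Pick a spare for `pool`, or return None if no spare qualifies."""
    candidates = []
    for spare in spares:
        if spare.assigned_pool not in (None, pool):
            continue  # pinned to a different pool
        if assigned_only and spare.assigned_pool != pool:
            continue  # policy forbids universal spares
        if spare.media != failed_media:
            continue  # never repair an SSD pool with an HDD, or vice versa
        if spare.size_gb < failed_size_gb:
            continue  # too small to replace the failed disk
        if exact and spare.size_gb != failed_size_gb:
            continue  # assumed meaning of "exact match"
        candidates.append(spare)
    if not candidates:
        return None
    # Assumed tie-break: prefer pool-assigned spares, then the smallest fit.
    return min(candidates, key=lambda s: (s.assigned_pool != pool, s.size_gb))


spares = [
    Spare("sdb", "HDD", 4000),
    Spare("sdc", "SSD", 960),
    Spare("sdd", "SSD", 800, assigned_pool="pool-a"),
]
print(select_spare(spares, "pool-a", "SSD", 800))
```

With manual hot spare management only, no automatic selection like this would run at all; the administrator chooses the replacement disk.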

Marking Hot-Spares

Physical disks can be marked or unmarked as hot spares under “Physical Disks” by clicking on the disk and then selecting “Mark as Hot Spare” or “Unmark as Hot Spare” from the dialog box. This applies to the “Universal” hot spare list. (Figure 2)

Figure 2

Pinning Spares to Specific Storage Pools

To add assigned disk spares to specific storage pools rather than to the universal spare list, click on the storage pool and select “Recover Storage Pool/Add Spares.” (Figure 3)

Adding Assigned Disk Spares to Specific Storage Pools
Figure 3

Select disks to be added to the storage pool using the “Add Hot-spares / Recover Storage Pool” dialog box. (Figure 4)

Add Hot-spares / Recover Storage Pool
Figure 4 

For more information about Hot Spare Management and Storage Pool Sizing see the OSNEXUS Solution Design Guide.

Disaster Recovery

QuantaStor v3.14 Released

The latest maintenance release of QuantaStor SDS (v3.14) was published on December 30th, 2014 and comes with several new features. Some highlights include:

  • Cascading replication of volumes and shares allows for replicating data in an unlimited chain-linked fashion from appliance to appliance to appliance.
  • Kernel upgrade to the Linux 3.13 kernel adds support for the latest 12Gb SAS/SATA HBA and RAID controllers as well as the latest 40GbE network interface cards.
  • Advanced universal hot-spare management to the ZFS-based Storage Pool type that’s enclosure aware and makes hot-spares universally shared within an appliance and across multiple appliances.

This is also the first release that has some initial Ceph support, but at this time we’re only working with partners via a pilot program around the new Ceph capabilities. For more information about the pilot program please contact us here and note that broad general availability (GA) of Ceph support is planned for late Q1 2015.

Below is the full list of changes. Linux kernel update instructions can be found on the OSNEXUS support site.

Change Log:

  • ISO DVD image: osn_quantastor_v3.14.0.6993.iso
  • MD5 Hash: osn_quantastor_v3.14.0.6993.md5
  • adds 3.13 Linux kernel and SCST driver stack upgrade
  • adds support for Micron PCIe SSD cards
  • adds universal hot-spare management system for ZFS based pools
  • adds support for FC session management and session iostats collection
  • adds disk search/filtering to Storage Pool Create/Grow dialogs in web interface
  • adds configurable replication schedule start offset to replication schedule create/modify dialogs
  • adds support for cascading replication schedules so that you can replicate volumes across appliances A->B->C->D->etc
  • adds wiki documentation for CopperEgg
  • adds significantly more stats/instruments to Librato Metrics integration
  • adds dual mode FC support where FC ports can now be in Target+Initiator mode
  • adds support for management API connection session management to CLI and REST API interfaces
  • adds storage volume instant rollback dialog to web management interface
  • adds sysstats to send logs report
  • adds swap device utilization monitoring and alerting on high swap utilization
  • adds support for unlimited users / removes user count limit license checks for all license editions
  • adds support for scale-out block storage via Ceph FS/RBDs (pilot program only)
  • fix for CLI host-modify command
  • fix for pool discovery reverting IO profile selection back to default at pool start
  • fix for web interface to hide ‘Delete Unit’ for units used for system/boot
  • fix for alert threshold slider setting in web interface ‘Alert Manager’ dialog
  • fix to accelerate pool start/stop operations for FC based systems
  • fix to disk/pool correlation logic
  • fix to allow IO profiles to have spaces and other special characters in the profile name
  • fix to FC ACL removal
  • fix to storage system link setup to use management network IPs
  • fix to remove replication association dialog to greatly simplify it
  • fix to CLI disk and pool operations to allow referencing disks by short names
  • fix for replication schedule create to fixup and validate storage system links
  • fix for replication schedule delta snapshot cleanup logic which ensures that the last delta between source and target is not removed
  • fix for stop replication to support terminating zfs based replication jobs
  • fix for pool freespace detection and alert management
  • fix license checks to support sum of vol, snap, cloud limits across all grid nodes
  • fix to create gluster volume to use round-robin brick allocation across grid nodes/appliances to ensure brick pairs do not land on the same node
  • fix to storage volume snapshot space utilization calculation
  • fix to iSCSI close session logic for when multiple sessions are created between the same pair of target/initiator IP addresses
  • fix to auto update user specific CHAP settings across all grid nodes when modified
  • fix to allow udev more time to generate block device links, resolves issue exposed during high load with replication
  • fix to IO fencing logic to reduce load and make it work better with udev
Storage Appliance Hardware

Deploying a High Availability Storage Cluster with GlusterFS

During the Paris OpenStack Summit earlier this month, Red Hat announced the latest version of GlusterFS, version 3.6.0, with new features including volume snapshots, erasure coding across GlusterFS volumes, improved SSL support, and rewritten automatic file replication code for improved performance.

Today, GlusterFS provides the speed, reliability and features such as snapshots, cloning, thin provisioning and massive scalability that can be expanded with RAM and solid state drives (SSDs) to accelerate throughput and IOPS performance.

As we’ve stated before, we believe that GlusterFS is becoming the de facto standard scale-out file storage platform for Big Data deployments as its file-based architecture is great for unstructured data ranging from documents and archives to media.

Online Upgrades, Mostly

When managing Big Data, the key feature is high availability. With multi-petabyte archives and potentially hundreds of client applications reading and writing files it’s typically very difficult to find a maintenance window where the storage can be offline for upgrades. But with cluster based solutions like GlusterFS you can upgrade hardware without imposing downtime on clients due to the replica based architecture of GlusterFS. Multiple replicas provide access to data even if one copy of the data on a given appliance node goes offline.

The trouble is that updating the GlusterFS software may require a coordinated upgrade across nodes, and with it a maintenance window. This is because new features can at times be very difficult to synchronize while old versions of the software are running on other nodes. In general, the GlusterFS team has done a great job with the more recent versions, but when looking at any storage deployment you’ll need to factor in a maintenance window, and if you can’t afford one, you’ll need to set up replication so that you have failover ability to a second storage cluster while the first one is being upgraded.

Boosting Efficiency with Erasure Coding

The downside to using replicas for high-availability is the dramatic drop in usable storage. With two copies of every file, only 50 percent of your storage is usable. And with three copies only 33 percent is usable. This means that if you have 10PB of files and you are going to maintain 2 copies of each file so that your solution is highly available, you will need to purchase 20PB of raw storage.

Erasure coding takes a different approach to delivering high-availability and fault-tolerance by using parity information so that your storage overhead can be as low as 10 percent in some cases. Therefore, instead of needing to buy 20PB of raw storage you will only need ~12PB.  For those familiar with RAID technology you can think of it as loosely similar to network RAID5. This is a new capability for GlusterFS and it’s critical for deployments that need to scale to 10s of petabytes as the cost in just raw hardware and power becomes a serious issue using the replica model.
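The arithmetic behind these figures can be checked with a short Python sketch. The function names are made up for this example, and the 10+2 fragment layout (10 data fragments plus 2 parity fragments) is an assumption chosen because it reproduces the ~12PB figure; the text above does not state which layout is meant.

```python
def raw_for_replicas(usable_pb: float, replicas: int) -> float:
    """Raw capacity needed when every file is stored `replicas` times."""
    return usable_pb * replicas


def raw_for_erasure_coding(usable_pb: float, data_fragments: int,
                           parity_fragments: int) -> float:
    """Raw capacity needed when data is split into `data_fragments` pieces
    and protected by `parity_fragments` additional parity pieces."""
    return usable_pb * (data_fragments + parity_fragments) / data_fragments


def usable_fraction(raw_pb: float, usable_pb: float) -> float:
    """Share of the purchased raw capacity that holds unique data."""
    return usable_pb / raw_pb


two_copies = raw_for_replicas(10, 2)               # 20 PB raw
three_copies = raw_for_replicas(10, 3)             # 30 PB raw
ec_10_plus_2 = raw_for_erasure_coding(10, 10, 2)   # 12 PB raw

print(two_copies, usable_fraction(two_copies, 10))
print(three_copies, usable_fraction(three_copies, 10))
print(ec_10_plus_2, usable_fraction(ec_10_plus_2, 10))
```

Under the same assumptions a 10+1 layout would need 11PB of raw storage for 10PB of files, which corresponds to the 10 percent overhead case.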

Making GlusterFS Easy to Manage

QuantaStor takes a holistic approach to GlusterFS integration by bringing management, monitoring, and NFS/CIFS services together so that deployments can be done faster and more easily, with point-click-provision simplicity.

Provisioning GlusterFS Volumes

Gluster Volumes are provisioned from the ‘Gluster Management’ tab in the QuantaStor web management interface. To make a new Gluster Volume simply right-click on the Gluster Volumes section or choose Create Gluster Volume from the tool bar (Figure 2).

To make a Gluster Volume highly available be sure to choose a replica count of two or three.  If you only need fault tolerance against disk failure, the storage pools already provide it and you can use a replica count of one, but if an appliance goes offline then its portion of the data will be inaccessible. With a replica count of two or three your data remains available even when a node is taken offline.

Figure 2

Auto Healing

When the appliance is turned back on it will automatically synchronize with the other nodes to bring itself up to the proper current state via auto-healing. GlusterFS does all the work for you by comparing the contents of the “bricks” and then synchronizing the appliance that was offline to bring it up to date.

High-Availability for Gluster Volumes

When using the native Gluster Client from a Linux server there are no additional steps required to make a volume highly available, as the client communicates with the server nodes to get the peer status information and routes around nodes that are offline. To see the commands to connect to your QuantaStor appliance via the native Gluster protocol, just right-click on the volume and choose ‘View Mount Command.’

When accessing your Gluster Volume via traditional protocols such as CIFS or NFS, additional steps are required to make the storage highly available because CIFS and NFS clients communicate with a single IP address.

If the appliance serving storage through an interface with that IP address is turned off, then the IP address must move to another node to ensure continued access to storage on that interface. QuantaStor natively provides this capability by allowing you to create virtual network interfaces for your Gluster Volumes that will float to another node automatically to maintain high-availability to your storage via CIFS/NFS in the event that an appliance is turned off.

OSNEXUS engineering is actively performing feature validation of GlusterFS 3.6 and the new erasure coding features. We look forward to releasing an updated version of QuantaStor in early 2015 with erasure coding support to take advantage of the jump in efficiency it provides.

For more in-depth technical information on Managing Scale-out GlusterFS Volumes see the OSNEXUS administrators’ guide.

GlusterFS High Availability

Planning a Disaster Recovery Strategy: Automated Backup Policies for Software Defined Storage

From a business process perspective, a “disaster” needs to encompass everything from application downtime or hardware failures to computer viruses and hackers that cause business disruptions with economic consequences that may be just as impactful as a fire or flood. The core pillars of any disaster recovery strategy must take into consideration the following:

  • Assessing business exposure to a wide range of business disruptions
  • Reviewing storage options for preparation and recovery
  • Setting recovery expectations and data policies that inform storage backup priority decisions
  • Establishing automated backup policies and a testing plan for vulnerabilities and downtime

In today’s heterogeneous storage environments, storage resources may be spread across on-premises data centers or cloud storage pools, ranging from proprietary storage systems and Windows servers to open storage systems running on Linux. For this reason a backup strategy needs to account for the various types of critical information that must be backed up, segmented by business function such as legal, marketing, finance, engineering or health and medical data.

Automating Backups with Software Defined Storage

Within QuantaStor you can create backup policies that will automatically back up CIFS/NFS shares on your network to your QuantaStor appliance.  Whether the share is on a 3rd party NAS filer, Linux, Windows, or other server presenting NFS or CIFS shares, QuantaStor Backup Policies make implementing a DR strategy easy.


To create a backup policy in a QuantaStor appliance right-click on the Network Share where you want the data to be backed up to and choose the ‘Create Backup Policy’ option (Figure 1).  From here you’ll select the CIFS/NFS share on your network to be backed up, and the times at which you want the backup jobs to run.

When the backup policy runs it will attach to the specified CIFS/NFS share on your network to access the data to be archived. When the backup starts, QuantaStor creates a “Backup Job” so you can track the progress of any given backup. Simply select the Backup Jobs tab in the center-pane of the web interface after you select the network share to which the backup policy is attached.

Create Backup Policy

Figure 1

Parallelized Backup for Big Data

Backup policies in QuantaStor also support heavy parallelism so that very large NAS appliances with 100m+ files can be easily scanned for changes. This feature was specifically designed for a life sciences company that had so many files (over 300m) that they could not scan the entire data set within their backup window using traditional backup products and techniques. By default, QuantaStor backup policies use parallelism (up to 64 concurrent “scan+copy” threads), which has a major impact on reducing the backup window for Big Data scenarios.
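To illustrate why a parallel scan shortens the backup window, here is a minimal Python sketch that walks a mounted share with a pool of up to 64 worker threads and collects files changed since a given time. It is not QuantaStor’s implementation: the function names, the choice to hand one top-level directory to each worker, and the use of modification time as the change test are assumptions made for this example, and the copy step is left out.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_tree(root, since_epoch):
    """Return files under `root` modified at or after `since_epoch`."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime >= since_epoch:
                    changed.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return changed


def parallel_scan(share_root, since_epoch, max_workers=64):
    """Scan each top-level directory of the share in its own worker."""
    subdirs, changed = [], []
    with os.scandir(share_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                if entry.stat().st_mtime >= since_epoch:
                    changed.append(entry.path)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(lambda d: scan_tree(d, since_epoch), subdirs):
            changed.extend(result)
    return sorted(changed)
```

Threads suit this job because the scan spends most of its time waiting on network file system calls rather than on the CPU. A share with only a few very large top-level directories would need a finer-grained split than this sketch uses.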

Sliding Windows

Backup policies back up everything by default, but you can also opt to back up only recently created and modified files using a ‘Sliding Window Backup’. When backing up data from Big Data archives with hundreds of millions or even billions of files it is sometimes useful or necessary to back up and maintain only a subset of the data. This is especially helpful for scenarios where there’s more data to be backed up than there is space available in your QuantaStor appliance.

For example, if you set the data retention period of your Backup Policy to 60 days then all files that have been created or modified within the previous 60 days will be retained. There’s also a purge rule that by default is set to remove files that are older than the retention period from the backup folder (Figure 2).

Backup Policies 3rd Party

Figure 2
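The retention and purge rules described above amount to comparing each file’s age against a cutoff. The following Python sketch shows that comparison as a dry run that only reports which files would be kept and which would be purged. It is an illustration rather than QuantaStor’s logic: the function name is made up, and it uses modification time only, whereas the policy described above also considers creation time.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def split_by_retention(backup_root, retention_days, now=None):
    """Return (keep, purge) lists of file paths under `backup_root`.

    A file is kept if it was modified within the last `retention_days`
    days; otherwise it is a purge candidate. Nothing is deleted here.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    keep, purge = [], []
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime >= cutoff:
                keep.append(path)
            else:
                purge.append(path)
    return sorted(keep), sorted(purge)
```

Calling split_by_retention("/path/to/backup", 60) on a backup folder (the path here is a placeholder) would list files older than 60 days as purge candidates, mirroring the example retention period above.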

Consolidating Backups

Backup policies can also be configured to automatically consolidate or aggregate storage backups from remote network shares into a single network share on one QuantaStor appliance, organized by department, to comply with mandated storage retention policies (Figure 3).

Aggregated Backup

Figure 3

Executing Failover

With your Backup Policy in place and running automatically throughout the day you can then use various techniques to fail over to a QuantaStor SDS appliance in the event of an outage of a NAS appliance. The easiest is to make a DNS change so that the NAS filer’s hostname (FQDN) resolves to the IP address of the QuantaStor appliance. This redirects clients to the appliance in a transparent manner. The other option is to reconfigure the clients manually to have them reconnect to their network shares using the IP address or hostname of the QuantaStor appliance where the backup copy resides, but that’s less efficient if you are supporting many clients.

For more technical information regarding automated backup policies and disaster recovery please visit the QuantaStor Administrators Guide on the OSNEXUS Wiki.

Disaster Recovery High Availability Software Defined Storage
CERN Atoms

From Smashing Atoms to Remote Replication: Using Ceph for Highly Available Scale-out Storage

The Large Hadron Collider (LHC), one of the most complex experimental facilities ever built, is an underground ring roughly 17 miles (27 kilometers) in circumference, crossing through parts of both Switzerland and France, where speeding particles travel around at 99.99 percent of the speed of light and smash into each other roughly 600 million times per second.

Scientists hope that the energy given off by the collisions will yield answers to questions such as the existence of extra dimensions in the universe or the nature of dark matter that appears to account for 27 percent of universal mass-energy.

Smashing atoms together generates not only lots of energy but also huge amounts of particle collision data at the rate of more than 25PB per year, according to the CERN IT department.

In the early years of the LHC, the CERN IT staff deployed a unique storage strategy to handle the massive amount of scientific data, successfully scaling their storage system to 100PB. However, according to Daniel van der Ster and Arne Wiebalck in an IOPscience journal article, “in recent years, innovations from companies such as Google, Yahoo, and Facebook have demonstrated that the Big Data problems seen by other communities are approaching, and often surpassing those of the LHC.”

In response, CERN IT decided to investigate and leverage new storage technologies and they landed on Ceph. Through rigorous testing, the CERN team found that Ceph gave the administrator fine-grained control over data distribution, replication strategies, consolidation of object and block storage, and very fast provisioning of boot-from-volume instances using thin provisioning.

Smashing Atoms to Day-to-Day Workloads

OSNEXUS sees Ceph as an ideal solution not just for high performance computing (HPC) applications like they have at CERN but also for Virtual Machine (VM) workloads in the enterprise and large datacenters. Given the strong OpenStack integration with Ceph and the uptick in Ceph adoption in recent quarters, Ceph is definitely emerging as a key storage technology for these use cases.

When you look at many proprietary scale-out solutions, most have done a good job dealing with millions or billions of files but VM workloads are different. Making block devices scale out with good performance requires the new approaches that Ceph employs to deliver block-level storage that stays available even in the event of a server outage.

“Ceph is a fantastic storage technology,” says Steve Umbehocker, CEO of OSNEXUS. “The teams at Red Hat and Inktank have made a major contribution to the open source world and our goal is to make it easier for enterprises and cloud providers to adopt Ceph by integrating it into our enterprise SDS platform.”

OSNEXUS is planning to release its first version of QuantaStor with integrated Ceph support next month.

Ceph QuantaStor