Monday, 15 April 2013

Quicker fixes are not a good idea (in most cases)

Every once in a while someone comes up with a forum question along the lines of

I have such and such problem with my storage. The data is still accessible, but the storage unit is for some reason in abnormal state. What should I do?

The typical (and proper) answer is
  1. back up data,
  2. test the backup,
  3. make sure the original problem, which caused whatever abnormality there was, is corrected,
  4. rebuild the storage from scratch,
  5. restore data
Often, this is not considered good enough advice: is there any quicker way to resolve the issue?

Actually, no. To make sure the abnormal state is properly repaired, one needs to identify all the undesired effects and changes to the data. This is plainly impossible in all but the simplest cases. So, with a "quick" repair there is always a risk of missing some important point and masking the problem instead of repairing it. The issue might then reappear later in some undesired way.
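Step 2 of the list above (test the backup) is the one most often skipped. Below is a minimal sketch of one possible check, assuming the backup is a plain file-by-file copy of a directory tree. The function names `file_digest` and `verify_backup` are made up for this illustration and do not come from any particular backup tool.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backup(source: Path, backup: Path) -> list[str]:
    """Return relative paths that are missing from the backup or differ in content."""
    problems = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = backup / rel
        # A file is a problem if it is absent from the backup or its content differs.
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every source file has an identical copy in the backup. This only shows that the copy matches the source as it is now; it says nothing about damage already present in the source, so opening a sample of the restored files is still worth the time.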
 

Wednesday, 3 April 2013

Intel RS2MB044 RAID controller

If you have an Intel RS2MB044 controller and get the message

Controller ID: 0 Controller encountered a fatal error and was reset

flash the latest firmware.

Next, if flashing the latest firmware fails with the message

Firmware Failed to FLASH flash. Stop!!!
FW error description:
The requested command cannot be completed as the image is corrupted.

then flash older firmware versions first. Updating from a very old firmware version to the latest one may require flashing several intermediate versions before the controller accepts the latest one.

Mk. II testbed storage system

Finally completed the build of the Mark II testbed storage array today, to replace the aging QNAP TS-639 Pro unit.

Intel RS2MB044 controller.
4x OCZ Solid 3 120 GB SSD, directly connected to the controller.
Intel RES2SV240 expander, connected by SFF-8088 to SFF-8087 cable.
16x 3TB WD Red hard drives connected to the expander.

The hard drives are configured as 14x 3TB RAID 6 and 2x hot spare, for a total of 36 decimal TB unformatted capacity.
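The 36 TB figure follows from RAID 6 using two drives' worth of space for parity. A small sketch of the arithmetic; the function names are illustrative only:

```python
def raid6_capacity_tb(total_drives: int, hot_spares: int, drive_tb: float) -> float:
    """Unformatted capacity of a single RAID 6 group, in decimal TB."""
    members = total_drives - hot_spares
    if members <= 2:
        raise ValueError("RAID 6 needs more than two member drives")
    # Two members' worth of space holds parity; the rest holds data.
    return (members - 2) * drive_tb


def decimal_tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


capacity = raid6_capacity_tb(total_drives=16, hot_spares=2, drive_tb=3.0)
print(capacity)  # 36.0
print(round(decimal_tb_to_tib(capacity), 2))
```

The second figure is what an operating system reporting in binary units would show for the same array, before any filesystem overhead.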

SSD cache to be configured later.

Wednesday, 27 March 2013

Mirrors vs. automatic backups

There was an incident recently which involved the near-loss of an important git repository. The setup was a distributed system of multiple servers, one being the designated master and the rest slaves. Slaves pulled data automatically, and apparently quite often. You can guess what happened next: the master copy got corrupted, and before anyone noticed, all the slaves had pulled the corrupt copy.

In the follow-up (here) they state that they have a backup system in place which is fundamentally different from RAID 1. That is obviously not so.

Any system with automatic replication is subject to the following failure mode: the master copy is damaged, and the damage is then automatically replicated to the slaves (mirrors). Automatic replication systems are designed around the assumption that all master failures are fail-stop: either the master fails mechanically and ceases to work completely, or the master can detect any and all cases of corruption in itself and shut itself down. The "grey area" cases, when the data is damaged but the master still works and pushes out that damaged data, are not accounted for.

In most of these systems (except the extreme case of fully duplicated hardware) there is a time window during which the synchronization can still be aborted if the corruption is detected in time. In a hardware RAID 1 with rotational hard drives this window is roughly 10 to 50 milliseconds long. In a weekly backup system, the average time window is half a week. Version-retention systems (which retain multiple previous versions of data) have a longer window of opportunity for recovery. However, an infinite window of opportunity requires infinite storage space.
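The window arithmetic can be sketched as follows. This is a simplified model, not a description of any real backup product: it assumes snapshots are taken at fixed intervals starting at hour 0, that only the most recent `retained` snapshots are kept, and that a snapshot taken at or after the moment of corruption is itself corrupt. Both function names are made up for the illustration.

```python
def average_window_h(interval_h: float, retained: int) -> float:
    """Average time between corruption and loss of the last clean snapshot.

    Corruption is assumed to strike at a uniformly random moment between two
    snapshots, so the first corrupt snapshot is on average interval/2 away;
    each extra retained version adds one full interval.
    """
    return (retained - 0.5) * interval_h


def clean_snapshot_survives(interval_h: float, retained: int,
                            corrupted_at_h: float, detected_at_h: float) -> bool:
    """True if, at detection time, at least one retained snapshot predates the corruption."""
    last_index = int(detected_at_h // interval_h)    # newest snapshot taken so far
    oldest_kept = max(0, last_index - retained + 1)  # oldest snapshot still retained
    return oldest_kept * interval_h < corrupted_at_h


WEEK_H = 7 * 24
print(average_window_h(WEEK_H, retained=1))   # 84.0 hours, i.e. half a week
print(clean_snapshot_survives(WEEK_H, 1, corrupted_at_h=100, detected_at_h=150))  # True
print(clean_snapshot_survives(WEEK_H, 1, corrupted_at_h=100, detected_at_h=170))  # False
print(clean_snapshot_survives(WEEK_H, 4, corrupted_at_h=100, detected_at_h=170))  # True
```

With a single retained copy, detecting the corruption two hours after the next weekly run is already too late; retaining four versions extends the window but does not make it unlimited.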

This is why manual backups should still be considered a valuable addition to automatic strategies. You look at the system and it seems good enough, so you make a copy of it somewhere out of the automatics' reach. Might come in handy one day.

Thursday, 28 February 2013

Why there will be no saved-state feature in ReclaiMe software


On average, ReclaiMe File Recovery delivers data in less than 48 hours: with NTFS you typically see the files within a couple of hours, while with HFS or ReFS you have to wait until the end of the scan, which takes about 12 hours for a 2 TB disk (SATA connection). ReclaiMe RAID Recovery scans drives in parallel, so you are limited by the slowest and largest drive; with a 4 TB drive in the set, expect something like 24 hours. So, within 48 hours you definitely get data with ReclaiMe software. If recovery takes more than 48 hours, it is possible that there is a failed drive (with bad sectors), and you should consider imaging the drive in question first.
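A back-of-the-envelope estimate based on the figures above, assuming scan time scales linearly with drive size and that all drives in the set read at about the same rate. This is only a rough sketch of the arithmetic, not how ReclaiMe itself computes anything, and the names are made up for the illustration.

```python
HOURS_PER_TB = 12 / 2  # from the figure above: about 12 hours for a 2 TB SATA disk


def estimated_scan_hours(drive_sizes_tb, hours_per_tb=HOURS_PER_TB):
    """Estimate a parallel scan: the largest drive in the set determines the total time."""
    return max(drive_sizes_tb) * hours_per_tb


print(estimated_scan_hours([2]))        # 12.0
print(estimated_scan_hours([2, 3, 4]))  # 24.0
```

A drive that reads noticeably slower than the others (for example, one with bad sectors) breaks the equal-rate assumption, and that is exactly the case where the scan runs past the 48-hour mark.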

If the PC running data recovery cannot work for 48 hours non-stop, you should repair the hardware first and only then proceed with data recovery. In any case, with such a PC there is a good chance that data recovery will produce either incorrect data or, even worse, something that looks like normal files with just "slightly" damaged content, due to subtle memory corruption or a similar problem.