[ome-devel] Archive integrity check during OMERO Insight import

Alex Herbert a.herbert at sussex.ac.uk
Thu Jul 5 14:23:07 BST 2012


Hi Chris,

Thanks for the reply. Here is the question that the integrity check is 
supposed to answer:

"How do I know my archive image file has been stored correctly?"

I realise that no integrity check is perfect, but I believe that the 
functionality would be welcomed within OMERO, if only as a configurable 
option that can be enabled/disabled.

At present the procedure used by our scientists is to archive the 
file into OMERO, export the archive and verify that the exported file 
can be correctly read by the vendor software. This is only a short step 
away from a direct binary comparison of the original and the export 
(which I think they would like to do if there were an easy way to do it 
with GUI software). Obviously this procedure suffers from the time 
overhead of manually checking all the images, along with the inevitable 
human error. The biggest problem is that people do not even bother to 
check their files. This is why I would like to improve confidence in 
the archive integrity.

To answer your questions:

  * When you say that you'd like an integrity check are you expecting a hash digest on the client, then one on the server, then a comparison?

Yes.
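To make that concrete, here is a minimal sketch (in Java, since that is what the Insight client is written in) of what I would expect on the client side. The serverDigest value is assumed to come from some server-side call that hashes the archived copy after upload; that call is hypothetical here and only illustrates the comparison:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class ArchiveVerifier {

        /** Compute a hex digest of a local file by streaming it through MessageDigest. */
        static String digest(String path, String algorithm)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            try (InputStream in = new DigestInputStream(new FileInputStream(path), md)) {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) != -1) {
                    // reading the stream drives the digest update
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }

        /** Compare the client-side digest with the digest reported by the server. */
        static boolean verify(String localPath, String serverDigest)
                throws IOException, NoSuchAlgorithmException {
            return digest(localPath, "SHA-1").equalsIgnoreCase(serverDigest);
        }
    }

The import could then be flagged as verified (or retried) depending on the result of verify().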

  * What sort of archiving performance burden would you be willing to accept? 2x slower? 4x? 10x?

2-4x

Currently the rate of import to OMERO is not network limited, so I presume it is CPU or I/O limited, but I do not know at which end (client or server) the limit occurs. If CPU limited, adding an integrity check will be noticeable; if I/O limited, the performance hit may not be as significant. A simple answer is that an import 2-4x slower than at present, but with no need for manual archive verification, would still increase our overall OMERO workflow speed.

  * How secure are you expecting the hash digest to be? Would a CRC be sufficient?

From my perspective the hash digest should be reasonably unique for the image. This would be enough to ensure the complete set of bytes has been transferred, with a low probability that the key could be reproduced from a partial or incorrect data transfer. This leads me towards a cryptographic hash function such as MD5 or SHA. If the speeds differ significantly then perhaps the digest algorithm could be made configurable to suit users' needs (paranoia).
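In Java the algorithm is just a string passed to MessageDigest.getInstance(), so making it configurable should be cheap. As a rough way to gauge the speed difference, a main method like the following could be added to the sketch above (the file path is hypothetical, and the timings are only indicative since disk caching will dominate):

    public static void main(String[] args) throws Exception {
        // Hypothetical local file path, purely for illustration.
        String path = "/data/example.lsm";
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256"}) {
            long start = System.nanoTime();
            String hex = digest(path, algorithm);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%-8s %s  (%d ms)%n", algorithm, hex, millis);
        }
    }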

  * Are you expecting the hash digest to be stored in the database? If so are you worried about collisions?

No. The hash value would be used to verify that the file was correctly transferred; I do not envision a must-have use case for storing the key. I am not worried about collisions in the DB, since collisions would be expected whenever a user archives the same file again anyway.

In the scenario we are discussing, the key point is the probability of two different files generating the same hash value. The Wikipedia page on the birthday attack (http://en.wikipedia.org/wiki/Birthday_attack) contains a table of random collision probabilities for different key sizes. For a 128-bit key it would take 2.6e10 hashes to reach a 1e-18 probability of a collision. On the assumptions that (1) the complete byte length of the file is transferred and (2) a different hash could be generated for a mistake in any single bit of the file, the length of an archived file that reaches this collision probability is 2.6e10 bits = 3.02 GB. That is, if a 3 GB file is imported and a single bit is wrong, there is a 1e-18 probability that the checksum will not detect it. Given that the same page notes 1e-18 to 1e-15 as the error rate of a hard disk, the hash value is acceptable as long as the collision probability for the largest file likely to be archived stays below this.
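For reference, the 2.6e10 figure follows from the standard birthday-bound approximation, where b = 128 is the digest size in bits and p = 1e-18 the target collision probability:

    p \approx \frac{n^2}{2^{b+1}}
    \quad\Rightarrow\quad
    n \approx \sqrt{p \cdot 2^{b+1}} = \sqrt{10^{-18} \cdot 2^{129}} \approx 2.6 \times 10^{10}

    2.6 \times 10^{10}\ \text{bits} \approx 3.3 \times 10^{9}\ \text{bytes} \approx 3\ \text{GB}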

These calculations depend on having a well-behaved hash function that outputs all hash values with equal probability. In any case, from the table it seems that your existing code for the SHA-1 160-bit digest would provide an acceptable collision probability for practically any file people would like to import.

I would be interested to hear the thoughts of other OMERO users on 
this functionality.

Regards,

Alex


On 04/07/12 08:53, Chris Allan wrote:
> Hi Alex,
>
> On 28 Jun 2012, at 15:00, Alex Herbert wrote:
>
>> Dear All,
>>
>> When importing and archiving images into OMERO is there an integrity check on the archived data? For example a comparison between an MD5 sum on the original data on either side of the connection. I assume that the overhead of such a check would be far less than the current work of translating the pixel data into the OMERO format and the time taken to transmit it via the network.
> The answer to your question is that we only do this (via a SHA1 sum) to the pixel data on import. There is provision to do it to the original files that are uploaded to the server via the "archive" process, but we don't, mostly for performance reasons. In fact there is even a column in the database for storing a SHA1 sum. Generating a reasonably secure cryptographic hash (such as MD5, SHA1 or SHA256) would reduce the upload performance several fold.
>
> Some questions so that we could adequately scope adding this functionality, likely in a configurable manner, to future OMERO versions:
>
>   * When you say that you'd like an integrity check are you expecting a hash digest on the client, then one on the server, then a comparison?
>
>   * What sort of archiving performance burden would you be willing to accept? 2x slower? 4x? 10x?
>
>   * How secure are you expecting the hash digest to be? Would a CRC be sufficient?
>
>   * Are you expecting the hash digest to be stored in the database? If so are you worried about collisions?
>
>> Currently we archive all images from our microscopes into OMERO. We would like to be sure that the archive has been successfully transferred before deleting the original from the local filesystem.
>>
>> I have had people ask me about the archive integrity and would like to find out the answer.
>>
>> Thanking you in advance.
>>
>> Alex
> -Chris



