[ome-users] Interpretation of special characters

Thu Jan 29 06:52:45 GMT 2015

Hi Melissa,

On 29.01.2015 01:29, Melissa Linkert wrote:
> Hi Mario,
> 
>> I have a slightly challenging question, at least it is so for me :-)
>> We have images where users can write a description into a free-text
>> field. Sometimes, users somehow manage to create special characters
>> in such fields, probably by copy-pasting from some Word document
>> into MetaXpress.
>>
>> In the following image, you can find such a case. The character that
>> the user seems to have entered should be decimal 150, hex 96. At least
>> this is what I get from a hex editor when checking the file. When I
>> inspect the image metadata with Matlab, the character is reported
>> as "0xFF96", which seems at least to make it possible to recover the
>> original value by using only the lower byte.
>>
>> With Bio-Formats, I did not manage to recover the original value.
>> Instead, the character is reported as decimal 65533, which I learned
>> to be a Unicode special for "character could not be parsed"(?) I tried
>> all sorts of ways of setting a correct locale, so far without success. Can
>> you please advise: is it even supported to handle such cases? Should
>> I be able to read the characters identical to the bytes in the file
>> with Bio-Formats? How can I set the correct locale for Matlab and/or
>> Java? I assume we have de_CH.something here...
>>
>> The image I used is here:
>>   http://data.marssoft.de/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif
>>
>> Rough instructions to reproduce:
>>   aFileName = '/tmp/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif';
>>   javaaddpath('.../bioformats_package.jar');
>>   vTiffParser = loci.formats.tiff.TiffParser(aFileName);
>>   vString = vTiffParser.getComment();
>>   uint32(vString.charAt(93))
>>   % this returns 65533 for me, not (dec) 150 / (hex) 96
> 
> I see the same behavior, which is more or less what I would expect in
> this case.  The problematic byte is indeed 0x96, which is neither valid
> ASCII nor a valid UTF-8 sequence by itself.
> 
> The TIFF specification indicates that image descriptions should always
> be valid ASCII strings.  Bio-Formats is slightly more lenient in that it
> always uses UTF-8 encoding for the image description - this is
> hard-coded, so changing locales will have no effect.
> 
> UTF-8 decoding in Java causes invalid bytes to be replaced with U+FFFD
> (decimal 65533), which is the standard Unicode replacement character.  It
> is expected that converting bytes to a Java String will be lossy if
> invalid bytes were present.  Matlab performs invalid byte replacement by
> escaping with 0xff, which is also valid but is just a different
> implementation choice from Java.
> 
> Decoding the image description as ISO-8859-1 instead of UTF-8 would preserve
> the original byte value in this case, but I would be a little hesitant to do
> this across the board in Bio-Formats.
> 
> As it stands, if you need the raw byte values for the image description,
> the offset and size of the byte array can be read something like this:
> 
> %%
> tiffParser = loci.formats.tiff.TiffParser(fileName);
> tiffParser.setDoCaching(false);
> ifd = tiffParser.getFirstIFD();
> ifdEntry = ifd.get(loci.formats.tiff.IFD.IMAGE_DESCRIPTION);
> byteArrayOffset = ifdEntry.getValueOffset();
> byteArraySize = ifdEntry.getValueCount();
> %%
> 
> and then the bytes can be read and encoded as desired.
> 
> If that isn't sufficient, I'd be open to other ideas - it's just a
> little tricky, as this byte is fundamentally not valid with respect to
> the TIFF specification.

I think that is actually a cool idea and I'm happy to give it a try.
I'll let you know how it goes!
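
For the archive, here is a minimal Java sketch of the two steps discussed
above: reading a byte range from the file, and decoding it with an explicit
charset. The class and method names (DescriptionBytes, readRange, decode) are
hypothetical and not part of Bio-Formats. The offset and count are assumed to
be the values returned by getValueOffset() and getValueCount() in the quoted
snippet. ISO-8859-1 is just one possible choice; I do not know yet which
encoding MetaXpress actually wrote.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of Bio-Formats.
final class DescriptionBytes {

    // Read `count` raw bytes starting at `offset`. Both values are assumed
    // to come from the IFD entry of the ImageDescription tag.
    static byte[] readRange(String fileName, long offset, int count) throws IOException {
        byte[] raw = new byte[count];
        try (RandomAccessFile in = new RandomAccessFile(fileName, "r")) {
            in.seek(offset);
            in.readFully(raw); // throws EOFException if the file is too short
        }
        return raw;
    }

    // Decode with an explicit charset. This String constructor replaces
    // malformed input with the charset's replacement string instead of throwing.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = {0x61, (byte) 0x96, 0x62}; // 'a', the problematic byte, 'b'
        System.out.println((int) decode(raw, "UTF-8").charAt(1));      // 65533
        System.out.println((int) decode(raw, "ISO-8859-1").charAt(1)); // 150
    }
}
```

One caveat: per the TIFF 6.0 specification, the count of an ASCII field
includes the terminating NUL byte, so the last byte would probably need to be
dropped before decoding.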

All the best,

    Mario

> Regards,
> -Melissa
> 
> On Wed, Jan 28, 2015 at 11:25:24AM +0100, Mario Emmenlauer wrote:
>>
>> Dear Bio-Formats developers,
>>
>> [...]
>>
>> All the best, and thanks for your great work,
>>
>>      Mario Emmenlauer
>>
>> -- 
>> Mario Emmenlauer BioDataAnalysis             Mobil: +49-(0)151-68108489
>> Balanstrasse 43                    mailto: mario.emmenlauer * unibas.ch
>> D-81669 München                          http://www.marioemmenlauer.de/
>>
>> _______________________________________________
>> ome-users mailing list
>> ome-users at lists.openmicroscopy.org.uk
>> http://lists.openmicroscopy.org.uk/mailman/listinfo/ome-users
> 

-- 
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting annoying in email?

Mario Emmenlauer BioDataAnalysis             Mobil: +49-(0)151-68108489
Balanstrasse 43                    mailto: mario.emmenlauer * unibas.ch
D-81669 München                          http://www.biodataanalysis.de/