[ome-users] Interpretation of special characters
Mario Emmenlauer
mario at emmenlauer.de
Thu Jan 29 06:52:45 GMT 2015
Hi Melissa,
On 29.01.2015 01:29, Melissa Linkert wrote:
> Hi Mario,
>
>> I have a slightly challenging question, at least it is so for me :-)
>> We have images where users can write a description into a free-text
>> field. Sometimes, users somehow manage to create special characters
>> in such fields, probably by copy-pasting from some word-document
>> into MetaXpress.
>>
>> In the following image, you can find such a case. The character that
>> the user seems to have entered should be decimal 150, hex 96. At least
>> this is what I get from a hex editor when checking the file. When I
>> inspect the image metadata with Matlab, the character is reported
>> as "0xFF96", which seems at least to make it possible to recover the
>> original value by using only the lower byte.
>>
>> With Bio-Formats, I did not manage to recover the original value.
>> Instead, the character is reported as decimal 65533, which I learned
>> to be a Unicode special for "character could not be parsed"(?) I tried
>> all sorts of setting a correct locale, so far without success. Can
>> you please advise: is it even supported to handle such cases? Should
>> I be able to read the characters identical to the bytes in the file
>> with Bio-Formats? How can I set the correct locale for Matlab and/or
>> Java? I assume we have de_CH.something here...
>>
>> The image I used is here:
>> http://data.marssoft.de/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif
>>
>> Rough instructions to reproduce:
>> aFileName = '/tmp/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif';
>> javaaddpath('.../bioformats_package.jar');
>> vTiffParser = loci.formats.tiff.TiffParser(aFileName);
>> vString = vTiffParser.getComment();
>> uint32(vString.charAt(93))
>> % this returns 65533 for me, not (dec) 150 / (hex) 96
>
> I see the same behavior, which is more or less what I would expect in
> this case. The problematic byte is indeed 0x96, which is not a valid
> UTF-8 sequence by itself.
>
> The TIFF specification indicates that image descriptions should always
> be valid ASCII strings. Bio-Formats is slightly more lenient in that it
> always uses UTF-8 encoding for the image description - this is
> hard-coded, so changing locales will have no effect.
>
> Decoding bytes as UTF-8 in Java causes invalid bytes to be replaced with
> 0xfffd (U+FFFD), which is the standard Unicode replacement character. It is
> expected that converting bytes to a Java String will be lossy if invalid
> bytes were present. Matlab performs invalid byte replacement by escaping
> with 0xff, which is equally valid; it is just a different implementation
> choice from Java's.
>
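
For reference, the replacement behavior described above can be reproduced
with plain Java, without Bio-Formats. This is only a minimal sketch: the
class name, the helper method and the three-byte sample are made up for
illustration and are not taken from the image or from the Bio-Formats code.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ReplacementDemo {
    // Decode the bytes as UTF-8 and return the UTF-16 code unit at 'index'.
    // Malformed input is replaced with U+FFFD by this String constructor.
    static int utf8CharAt(byte[] raw, int index) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.charAt(index);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a lone 0x96 between two ASCII letters.
        byte[] raw = { 'a', (byte) 0x96, 'b' };
        System.out.println(utf8CharAt(raw, 1)); // prints 65533 (0xFFFD)
    }
}
```

Once the String has been built, the original byte value is gone, so it
cannot be recovered from the String afterwards.
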
> Decoding the image description as ISO-8859-1 instead of UTF-8 would preserve
> the original byte value in this case, but I would be a little hesitant to do
> this across the board in Bio-Formats.
>
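
A small sketch of why ISO-8859-1 preserves the value, again in plain Java
with made-up class and method names: every byte 0x00-0xFF maps to the code
point with the same number, so decoding and re-encoding returns the original
bytes. Whether ISO-8859-1 is the encoding the user's text was really written
in is a separate question; if the text came from Word it may well be
Windows-1252, where 0x96 is an en dash (U+2013), but that is only a guess.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTripDemo {
    // Decode bytes as ISO-8859-1: byte value N becomes code point N.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encode back to bytes; for strings produced by decodeLatin1 this is lossless.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a lone 0x96 between two ASCII letters.
        byte[] raw = { 'a', (byte) 0x96, 'b' };
        String text = decodeLatin1(raw);
        System.out.println((int) text.charAt(1));                    // prints 150
        System.out.println(Arrays.equals(raw, encodeLatin1(text)));  // prints true
    }
}
```
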
> As it stands, if you need the raw byte values for the image description,
> the size and offset of the byte array can be read with something like this:
>
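> %%
> % Open the file and turn off IFD value caching
> tiffParser = loci.formats.tiff.TiffParser(fileName);
> tiffParser.setDoCaching(false);
> % Look up the ImageDescription entry in the first IFD
> ifd = tiffParser.getFirstIFD();
> ifdEntry = ifd.get(loci.formats.tiff.IFD.IMAGE_DESCRIPTION);
> % File offset and length of the raw description bytes
> byteArrayOffset = ifdEntry.getValueOffset();
> byteArraySize = ifdEntry.getValueCount();
> %%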
>
> and then the bytes can be read and encoded as desired.
>
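
The "read the bytes and decode as desired" step could look roughly like the
sketch below. It uses only the Java standard library; the class name, the
readRawBytes helper and the command-line arguments are hypothetical, and the
offset and count are assumed to be the byteArrayOffset and byteArraySize
values obtained above. Note that for a TIFF ASCII field the count normally
includes the terminating NUL byte.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RawDescriptionReader {
    // Read 'count' raw bytes starting at absolute file position 'offset'.
    static byte[] readRawBytes(String path, long offset, int count) throws IOException {
        byte[] raw = new byte[count];
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);
            file.readFully(raw); // throws EOFException if the file is too short
        }
        return raw;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: <file> <offset> <count>, values taken from the IFD entry.
        String path = args[0];
        long offset = Long.parseLong(args[1]);
        int count = Integer.parseInt(args[2]);

        byte[] raw = readRawBytes(path, offset, count);
        // Decode with a charset that keeps every byte value (one possible choice).
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(text);
    }
}
```
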
> If that isn't sufficient, I'd be open to other ideas - it's just a
> little tricky, as this byte is fundamentally not valid with respect to
> the TIFF specification.
I think that is actually a cool idea and I'm happy to give it a try.
I'll let you know how it goes!
All the best,
Mario
> Regards,
> -Melissa
>
> On Wed, Jan 28, 2015 at 11:25:24AM +0100, Mario Emmenlauer wrote:
>>
>> Dear Bio-Formats developers,
>>
>>
>> All the best, and thanks for your great work,
>>
>> Mario Emmenlauer
>>
>>
>>
>>
>>
>>
>> --
>> Mario Emmenlauer BioDataAnalysis Mobil: +49-(0)151-68108489
>> Balanstrasse 43 mailto: mario.emmenlauer * unibas.ch
>> D-81669 München http://www.marioemmenlauer.de/
>>
>
--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting annoying in email?
Mario Emmenlauer BioDataAnalysis Mobil: +49-(0)151-68108489
Balanstrasse 43 mailto: mario.emmenlauer * unibas.ch
D-81669 München http://www.biodataanalysis.de/