[ome-users] Interpretation of special characters

Thu Jan 29 00:29:57 GMT 2015

Hi Mario,

> I have a slightly challenging question, at least it is so for me :-)
> We have images where users can write a description into a free-text
> field. Sometimes, users somehow manage to create special characters
> in such fields, probably by copy-pasting from some word-document
> into MetaXpress.
> 
> In the following image, you can find such a case. The character that
> the user seems to have entered should be decimal 150, hex 96. At least
> this is what I get from a hex editor when checking the file. When I
> inspect the image meta data with Matlab, the character is reported
> as "0xFF96", which seems at least to make it possible to recover the
> original value by using only the lower byte.
> 
> With Bio-Formats, I did not manage to recover the original value.
> Instead, the character is reported as decimal 65533, which I learned
> to be an UTF-special for "character could not be parsed"(?) I tried
> all sorts of setting a correct locale, so far without success. Can
> you please advise: is it even supported to handle such cases? Should
> I be able to read the characters identical to the bytes in the file
> with Bio-Formats? How can I set the correct locale for Matlab and/or
> Java? I assume we have de_CH.something here...
>
> The image I used is here:
>   http://data.marssoft.de/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif
> 
> Rough instructions to reproduce:
>   aFileName = '/tmp/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif';
>   javaaddpath('.../bioformats_package.jar');
>   vTiffParser = loci.formats.tiff.TiffParser(aFileName);
>   vString = vTiffParser.getComment();
>   uint32(vString.charAt(93))
>   # this returns 65533 for me, not (dec) 150 / (hex) 96

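I see the same behavior, which is more or less what I would expect in
this case.  The problematic byte is indeed 0x96, which is not valid
ASCII (it is above 0x7f) and is not a valid UTF-8 sequence by itself
(it is a continuation byte with no lead byte before it).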

The TIFF specification indicates that image descriptions should always
be valid ASCII strings.  Bio-Formats is slightly more lenient in that it
always decodes the image description as UTF-8 - this is hard-coded, so
changing locales will have no effect.

Decoding bytes as UTF-8 in Java causes invalid bytes to be replaced with
0xfffd (decimal 65533), which is the standard Unicode replacement
character.  It is expected that converting bytes to a Java String will
be lossy if invalid bytes were present.  Matlab handles the invalid byte
by escaping it with 0xff (hence the 0xFF96 you see), which is also valid
but is just a different implementation choice from Java.
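
To show the replacement in isolation, here is a minimal Java sketch.  It
is not Bio-Formats code; the class and method names are made up for this
example.  It decodes the three bytes 'a', 0x96, 'b' as UTF-8 and prints
the code of the middle character:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Bio-Formats.
final class Utf8ReplacementDemo {
    // Decode raw bytes as UTF-8; malformed input is replaced, not rejected.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x96 is a lone continuation byte, so it is malformed in UTF-8.
        byte[] raw = {'a', (byte) 0x96, 'b'};
        String text = decodeUtf8(raw);
        System.out.println((int) text.charAt(1)); // prints 65533 (U+FFFD)
    }
}
```

Once the String has been built, the original 0x96 can no longer be
recovered from it.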

Decoding the image description as ISO-8859-1 instead of UTF-8 would
preserve the original byte value in this case, since ISO-8859-1 maps
every byte to the code point with the same value, but I would be a
little hesitant to do this across the board in Bio-Formats.
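
As a rough illustration of why ISO-8859-1 is lossless here, the
following sketch (again with made-up names, independent of Bio-Formats)
decodes the same three bytes as ISO-8859-1 and encodes them back:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Bio-Formats.
final class Latin1RoundTripDemo {
    // ISO-8859-1 maps each byte 0x00..0xff to the same code point.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encoding back yields the original bytes for such strings.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {'a', (byte) 0x96, 'b'};
        String text = decodeLatin1(raw);
        System.out.println((int) text.charAt(1));                   // prints 150
        System.out.println(Arrays.equals(raw, encodeLatin1(text))); // prints true
    }
}
```

Note that code point 150 in ISO-8859-1 is a control character, so the
byte value survives but the result is not necessarily the character the
user intended.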

As it stands, if you need the raw byte values for the image description,
the size and offset of the byte array can be read with something like
this:

%%
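% fileName is the path to the TIFF file
tiffParser = loci.formats.tiff.TiffParser(fileName);
% disable caching so that the IFD holds the entry (offset and count)
% for the description rather than the decoded string
tiffParser.setDoCaching(false);
ifd = tiffParser.getFirstIFD();
ifdEntry = ifd.get(loci.formats.tiff.IFD.IMAGE_DESCRIPTION);
% location and length of the raw description bytes in the file
byteArrayOffset = ifdEntry.getValueOffset();
byteArraySize = ifdEntry.getValueCount();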
%%

and then the bytes can be read and decoded as desired.
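
For that last step, a minimal sketch in plain Java might look like the
following.  The class and its methods are hypothetical helpers, not
Bio-Formats API.  It assumes that the offset is an absolute position in
the file and that the count is the number of bytes, including the NUL
terminator that TIFF ASCII values end with.  The main method only
exercises the helpers on a small temporary file rather than a real TIFF:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Bio-Formats.
final class RawDescriptionReader {
    // Read `count` bytes starting at absolute file position `offset`.
    static byte[] readBytes(String path, long offset, int count) throws IOException {
        byte[] raw = new byte[count];
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(offset);
            in.readFully(raw);
        }
        return raw;
    }

    // Decode with the caller's charset, dropping any trailing NUL bytes.
    static String decode(byte[] raw, Charset charset) {
        int end = raw.length;
        while (end > 0 && raw[end - 1] == 0) {
            end--;
        }
        return new String(raw, 0, end, charset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("description", ".bin");
        try {
            // Two filler bytes, then 'a', 0x96, 'b' and a NUL terminator.
            Files.write(tmp, new byte[] {'I', 'I', 'a', (byte) 0x96, 'b', 0});
            byte[] raw = readBytes(tmp.toString(), 2, 4);
            String text = decode(raw, StandardCharsets.ISO_8859_1);
            System.out.println((int) text.charAt(1)); // prints 150
        } finally {
            Files.delete(tmp);
        }
    }
}
```

With the values from the snippet above, the call would be something like
readBytes(fileName, byteArrayOffset, byteArraySize).  If the text really
was pasted from Word, Windows-1252 may be a better guess than
ISO-8859-1, since it maps 0x96 to an en dash (U+2013);
Charset.forName("windows-1252") could be passed to decode, assuming the
Java runtime provides that charset.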

If that isn't sufficient, I'd be open to other ideas - it's just a
little tricky, as this byte is fundamentally not valid with respect to
the TIFF specification.

Regards,
-Melissa

On Wed, Jan 28, 2015 at 11:25:24AM +0100, Mario Emmenlauer wrote:
> 
> Dear Bio-Formats developers,
> 
> All the best, and thanks for your great work,
> 
>      Mario Emmenlauer
> 
> -- 
> Mario Emmenlauer BioDataAnalysis             Mobil: +49-(0)151-68108489
> Balanstrasse 43                    mailto: mario.emmenlauer * unibas.ch
> D-81669 München                          http://www.marioemmenlauer.de/
> 