[ome-users] Interpretation of special characters
Melissa Linkert
melissa at glencoesoftware.com
Thu Jan 29 00:29:57 GMT 2015
Hi Mario,
> I have a slightly challenging question, at least it is so for me :-)
> We have images where users can write a description into a free-text
> field. Sometimes, users somehow manage to create special characters
> in such fields, probably by copy-pasting from some word-document
> into MetaXpress.
>
> In the following image, you can find such a case. The character that
> the user seems to have entered should be decimal 150, hex 96. At least
> this is what I get from a hex editor when checking the file. When I
> inspect the image meta data with Matlab, the character is reported
> as "0xFF96", which seems at least to make it possible to recover the
> original value by using only the lower byte.
>
> With Bio-Formats, I did not manage to recover the original value.
> Instead, the character is reported as decimal 65533, which I learned
> to be an UTF-special for "character could not be parsed"(?) I tried
> all sorts of setting a correct locale, so far without success. Can
> you please advise: is it even supported to handle such cases? Should
> I be able to read the characters identical to the bytes in the file
> with Bio-Formats? How can I set the correct locale for Matlab and/or
> Java? I assume we have de_CH.something here...
>
> The image I used is here:
> http://data.marssoft.de/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif
>
> Rough instructions to reproduce:
> aFileName = '/tmp/b0bCDac-T130_wB09_s25_z0_t1_cRFP_u001.tif';
> javaaddpath('.../bioformats_package.jar');
> vTiffParser = loci.formats.tiff.TiffParser(aFileName);
> vString = vTiffParser.getComment();
> uint32(vString.charAt(93))
> # this returns 65533 for me, not (dec) 150 / (hex) 96
I see the same behavior, which is more or less what I would expect in
this case. The problematic byte is indeed 0x96, which is neither ASCII
nor a valid UTF-8 sequence by itself.
The TIFF specification indicates that image descriptions should always
be valid ASCII strings. Bio-Formats is slightly more lenient in that it
always decodes the image description as UTF-8. This is hard-coded, so
changing locales will have no effect.
Decoding UTF-8 in Java causes invalid bytes to be replaced with U+FFFD
(decimal 65533), which is the standard Unicode replacement character.
Converting bytes to a Java String is expected to be lossy if invalid
bytes were present. Matlab instead escapes the invalid byte with 0xff
(hence 0xFF96), which is also valid; it is just a different
implementation choice from Java.
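To illustrate the replacement behavior, here is a minimal standalone sketch
that uses only the Java standard library. It does not touch Bio-Formats; the
class name ReplacementDemo and the three-byte sample are made up for the
example.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decode raw bytes as UTF-8. The String constructor silently replaces
    // malformed input with the replacement character U+FFFD.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a lone 0x96 between two ASCII letters.
        byte[] raw = { 'a', (byte) 0x96, 'b' };
        String s = decodeUtf8(raw);

        // The middle character is now U+FFFD, not 0x96.
        System.out.println((int) s.charAt(1)); // prints 65533

        // Re-encoding yields the UTF-8 form of U+FFFD (EF BF BD), so the
        // original byte value cannot be recovered from the String.
        byte[] back = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(back.length); // prints 5
    }
}
```

Once the String has been built, the information about which byte was invalid
is gone, which is why changing the locale afterwards does not help.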
Decoding the image description as ISO-8859-1 instead of UTF-8 would
preserve the original byte value in this case, but I would be a little
hesitant to do this across the board in Bio-Formats.
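As a rough sketch of why ISO-8859-1 preserves the value: that charset maps
every byte 0x00-0xFF to the code point with the same number, so decoding and
re-encoding round-trips. The class name Latin1RoundTrip and the sample bytes
are invented for the example; only standard library calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Decode raw bytes as ISO-8859-1; each byte becomes exactly one char
    // whose numeric value equals the unsigned byte value.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a lone 0x96 between two ASCII letters.
        byte[] raw = { 'a', (byte) 0x96, 'b' };
        String s = decodeLatin1(raw);

        // The byte value survives as the char value.
        System.out.println((int) s.charAt(1)); // prints 150

        // Encoding back to ISO-8859-1 reproduces the original bytes.
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(raw, back)); // prints true
    }
}
```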
As it stands, if you need the raw byte values for the image description,
the offset and size of the byte array can be read something like this:
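%%
% fileName is the path to the TIFF file
tiffParser = loci.formats.tiff.TiffParser(fileName);
% without caching, the IFD holds the entry record, not the decoded string
tiffParser.setDoCaching(false);
ifd = tiffParser.getFirstIFD();
ifdEntry = ifd.get(loci.formats.tiff.IFD.IMAGE_DESCRIPTION);
% position of the description in the file, and its length
byteArrayOffset = ifdEntry.getValueOffset();
byteArraySize = ifdEntry.getValueCount();
%%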
and then the bytes can be read and decoded as desired.
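For the last step, one possible sketch in plain Java is below. It assumes the
offset and count are the values obtained as byteArrayOffset and byteArraySize
above, and that the description is stored at that offset rather than inline in
the IFD entry (TIFF stores values of four bytes or fewer inline). The class
RawTagBytes and its readBytes method are hypothetical helpers, not part of
Bio-Formats.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawTagBytes {
    // Read exactly `count` bytes starting at `offset` in the given file.
    static byte[] readBytes(Path file, long offset, int count) throws IOException {
        byte[] raw = new byte[count];
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(offset);
            in.readFully(raw); // throws EOFException if the file is too short
        }
        return raw;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: RawTagBytes <file> <offset> <count>");
            return;
        }
        byte[] raw = readBytes(Paths.get(args[0]),
                Long.parseLong(args[1]), Integer.parseInt(args[2]));

        // Decode with a charset that keeps every byte value, then print
        // any character outside the ASCII range with its position.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) {
                System.out.println(i + ": " + (int) text.charAt(i));
            }
        }
    }
}
```

Any other charset can be substituted in the String constructor once the raw
bytes are in hand.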
If that isn't sufficient, I'd be open to other ideas - it's just a
little tricky, as this byte is fundamentally not valid with respect to
the TIFF specification.
Regards,
-Melissa
On Wed, Jan 28, 2015 at 11:25:24AM +0100, Mario Emmenlauer wrote:
>
> Dear Bio-Formats developers,
>
>
>
> All the best, and thanks for your great work,
>
> Mario Emmenlauer
>
>
>
>
>
>
> --
> Mario Emmenlauer BioDataAnalysis Mobil: +49-(0)151-68108489
> Balanstrasse 43 mailto: mario.emmenlauer * unibas.ch
> D-81669 München http://www.marioemmenlauer.de/
>