[ome-users] OME-TIFF: problem with the "micron" character (micrometer unit)

Wed Sep 6 13:57:10 BST 2017

On 01/09/17 20:30, Christoph Gohlke wrote:
> one issue is that the tiffcomment utility outputs XML that is not well
> formed. OME-XML should be UTF-8 encoded, but tiffcomment apparently
> encodes with latin1, iso-8859-1, or similar (Bioformats 5.6.0 on Windows
> 10).
> Try re-encoding the XML file (e.g. in Python3 Q&D):
>
> xml = open('comment.xml', 'rb').read()
> xml = xml.decode('iso-8859-1').encode('utf8')
> open('comment.xml', 'wb').write(xml)
>
> Another issue could be that the XML in the ome.tiff file is not encoded
> correctly. Open the ome.tiff file with a HEX editor. The lower case Mu
> letter should be stored in two bytes (C2 B5), not just one byte (B5).
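
The hex-editor check suggested above can also be scripted. This is only a rough sketch in Python 3: the function name first_invalid_utf8 is hypothetical, and it assumes the XML has already been read into memory as bytes (for example from the comment.xml file mentioned above).

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# A micro sign stored as UTF-8 (C2 B5) passes the check ...
good = 'Unit="µm"'.encode("utf-8")
# ... while the same text stored as Latin-1 (a lone B5) does not.
bad = 'Unit="µm"'.encode("iso-8859-1")

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

A result of None means the whole buffer decodes as UTF-8; otherwise the offset points at the first offending byte, which can then be inspected in a hex editor.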

The problem lies with the behaviour of Java on Windows.

tiffcomment uses System.out.println() to print the comment to standard
output, and this uses the platform's default encoding.  On Windows, this
is likely to be an old 8-bit codepage such as CP1252, so the output is
encoded in whatever codepage is in use rather than in UTF-8.  In CP1252
the micro sign is the single byte B5, which matches the symptom described
above.  Please see
https://stackoverflow.com/questions/24803733/default-character-encoding-for-java-console-output
for further details.
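
To illustrate the effect, here is a small Python 3 sketch. It is not the Java code path itself, and it assumes the codepage in use is CP1252; it only shows how the same two characters end up as different bytes depending on the encoding.

```python
text = "µm"  # U+00B5 MICRO SIGN followed by "m"

as_utf8 = text.encode("utf-8")     # what OME-XML is supposed to contain
as_cp1252 = text.encode("cp1252")  # what a CP1252 output stream emits

print(as_utf8.hex())    # c2b56d
print(as_cp1252.hex())  # b56d

# The CP1252 bytes are not valid UTF-8, so a strict XML parser rejects them.
try:
    as_cp1252.decode("utf-8")
    cp1252_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1252_is_valid_utf8 = False

print(cp1252_is_valid_utf8)
```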

You could try to force the use of UTF-8 by making this change to the
bf.bat script which is part of bftools:

diff --git a/tools/bf.bat b/tools/bf.bat
index 0c56b79388..6f3146e956 100644
--- a/tools/bf.bat
+++ b/tools/bf.bat
@@ -22,6 +22,14 @@ if "%BF_MAX_MEM%" == "" (
  )
  set BF_FLAGS=%BF_FLAGS% -Xmx%BF_MAX_MEM%

+rem Set the file encoding
+if "%BF_ENCODING%" == "" (
+  rem Set UTF-8 by default
+  set BF_ENCODING=UTF-8
+)
+set "BF_FLAGS=%BF_FLAGS% -Dfile.encoding=%BF_ENCODING%"
+
+
  rem Skip the update check if the NO_UPDATE_CHECK flag is set.
  if not "%NO_UPDATE_CHECK%" == "" (
    set BF_FLAGS=%BF_FLAGS% -Dbioformats_can_do_upgrade_check=false

We can't enable this by default, because file.encoding is not a setting
which users are officially supposed to override, but it may help in this
case.

An alternative solution would be to use a Unix platform such as Linux,
FreeBSD or macOS with a UTF-8 locale, where the output will always be
correctly encoded as UTF-8.

As a better long-term solution, we could reopen System.out with a UTF-8
encoding, or write raw bytes and pass everything through verbatim.
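
The idea is not specific to Java. As a rough analogy only (Python 3, with a hypothetical helper name write_utf8; this is not the change that would be made in tiffcomment), encoding explicitly and writing to the underlying byte stream avoids the platform default encoding altogether:

```python
import io
import sys


def write_utf8(text: str, stream=None) -> None:
    """Write text as UTF-8 bytes, bypassing the default text encoding."""
    if stream is None:
        stream = sys.stdout.buffer  # binary stream underneath sys.stdout
    stream.write(text.encode("utf-8"))
    stream.write(b"\n")
    stream.flush()


# Demonstration with an in-memory byte stream instead of the real console.
captured = io.BytesIO()
write_utf8("µm", captured)
```

Because the bytes are produced explicitly, the result no longer depends on the console codepage or locale.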


Kind regards,
Roger

--
Dr Roger Leigh -- Open Microscopy Environment
Wellcome Trust Centre for Gene Regulation and Expression,
College of Life Sciences, University of Dundee, Dow Street,
Dundee DD1 5EH Scotland UK   Tel: (01382) 386364