[ome-devel] on omero, pytables and genotypes

Wed Feb 2 16:21:04 GMT 2011

Hey Gianluigi,

On Feb 1, 2011, at 9:28 AM, Gianluigi Zanetti wrote:
> We are currently trying to move our genotype data description from a
> one-file per genotype to pytables, since it appears to be an efficient
> way to handle simple by-column and by-row operations.
> 
> Now, what we are trying to understand is how this can best fit inside
> Omero. Are you supporting an array omero.grid Column type in omero.grid.table? It should be
> able to handle medium size (i.e., up to 8M elements float) arrays.

The API defined in omero.grid.Table:

  http://hudson.openmicroscopy.org.uk/job/OMERO/javadoc/slice2html/omero/grid/Table.html#Table

is largely based on PyTables, as you can tell. To handle the remote transmission of the NumPy arrays, a list of omero.grid.Column:

  http://hudson.openmicroscopy.org.uk/job/OMERO/javadoc/slice2html/omero/grid/Column.html#Column

subclasses are returned. Having the return value be column-based saves us from having to define a new type for each set of rows, as I think you were doing with ProtocolBuffers, but it limits the complexity of the rows that can be expressed (see below).

I don't think 8M elements is much of an issue in general, but you will have to sensibly define your loading strategy, perhaps using getWhereList.
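
As a rough sketch of what I mean by a loading strategy: let the server evaluate the condition and return only row numbers, then pull those rows in batches. This assumes the getWhereList, getNumberOfRows and readCoordinates methods from the Table interface linked above. The FakeTable stand-in, the load_matching helper, the "confidence" column and the batch size are all made up so the snippet runs without a server; treat it as an illustration rather than tested client code.

```python
#!/usr/bin/env python
# Sketch only: select row numbers server-side, then read them in batches.

def load_matching(table, condition, batch_size=1000):
    """Yield (row_numbers, data) batches for rows matching `condition`.

    `table` is assumed to behave like an omero.grid.Table proxy.
    """
    n_rows = table.getNumberOfRows()
    # Only the matching row numbers come back here, not the row data.
    rows = table.getWhereList(condition, None, 0, n_rows, 1)
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield chunk, table.readCoordinates(chunk)


# --- Hypothetical in-memory stand-ins, NOT part of OMERO ---
class FakeColumn(object):
    def __init__(self, name, values):
        self.name = name
        self.values = values


class FakeData(object):
    def __init__(self, rowNumbers, columns):
        self.rowNumbers = rowNumbers
        self.columns = columns


class FakeTable(object):
    def __init__(self, confidence):
        self.confidence = list(confidence)

    def getNumberOfRows(self):
        return len(self.confidence)

    def getWhereList(self, condition, variables, start, stop, step):
        # A real server parses the condition; the stand-in hard-codes one.
        assert condition == "(confidence > 0.5)"
        return [i for i in range(start, stop, step)
                if self.confidence[i] > 0.5]

    def readCoordinates(self, rowNumbers):
        values = [self.confidence[i] for i in rowNumbers]
        return FakeData(rowNumbers, [FakeColumn("confidence", values)])


fake = FakeTable([0.1, 0.9, 0.7, 0.2, 0.8])
batches = list(load_matching(fake, "(confidence > 0.5)", batch_size=2))
```

With a real table you would pass the proxy in place of FakeTable and pick a batch size that keeps each response comfortably small.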

> Attached is a python file with a somewhat more detailed discussion of
> what we would like to do.

Thanks for the example. At the moment, you wouldn't be able to store the probs and confidence columns as defined in genotable.py inside of OMERO.tables, since there are no higher-order (array-valued) columns, though an argument can certainly be made for adding them.

Since N >> n_recs, a rotated table is one option for using OMERO right away: n_recs*N rows, with vid & act as denormalized values repeated in each row. I didn't go through the exercise of seeing what that does to your computations, though. Just FYI:

/tmp $ cat rot.py 
#!/usr/bin/env python
import numpy as np
import tables as tb
import time

N = 200  # Just a reasonable test size.

class GDOAffy6_0(tb.IsDescription):
    # One row per (record, element) pair; vid and act repeat on every row.
    vid        = tb.StringCol(itemsize=16)
    act        = tb.StringCol(itemsize=16)
    prob0      = tb.Float32Col()
    prob1      = tb.Float32Col()
    confidence = tb.Float32Col()

def create_gdos(fname, n_recs=2, p_A=0.3):
    fh = tb.openFile(fname, mode="w")
    fh.createGroup(fh.root, "GDOs")
    table = fh.createTable('/GDOs', 'affy6_0',
                           GDOAffy6_0,
                           "GDOs Affy6_0")
    sample = table.row
    probs = np.zeros((2, N), dtype=np.float32)
    for i in range(n_recs):
        probs[0, :] = np.random.normal(p_A**2, 0.01*p_A**2, N)
        probs[1, :] = np.random.normal((1.-p_A)**2, 0.01*(1.0-p_A)**2, N)
        # Draw the N confidence values once per record, not once per row.
        confidence = np.random.random(N)
        # Emit N rows for record i.
        for n in xrange(N):
            sample['vid'] = 'V9482948923%05d' % i
            sample['act'] = 'V9482948923%05d' % i
            sample['confidence'] = confidence[n]
            sample['prob0'] = probs[0][n]
            sample['prob1'] = probs[1][n]
            sample.append()
    table.flush()
    fh.close()

if __name__ == "__main__":
    create_gdos("rot.h5")

/tmp $ ptdump-2.6 -v rot.h5 
/ (RootGroup) ''
/GDOs (Group) ''
/GDOs/affy6_0 (Table(400,)) 'GDOs Affy6_0'
  description := {
  "act": StringCol(itemsize=16, shape=(), dflt='', pos=0),
  "confidence": Float32Col(shape=(), dflt=0.0, pos=1),
  "prob0": Float32Col(shape=(), dflt=0.0, pos=2),
  "prob1": Float32Col(shape=(), dflt=0.0, pos=3),
  "vid": StringCol(itemsize=16, shape=(), dflt='', pos=4)}
  byteorder := 'little'
  chunkshape := (1489,)

/tmp $ ls -ltrah *.h5
-rw-r--r--  1 moore  wheel   131K Feb  2 16:47 geno.h5
-rw-r--r--  1 moore  wheel    70K Feb  2 17:15 rot.h5
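
To give a feel for what the rotation does to the computations, here is a small NumPy-only sketch (no PyTables or OMERO involved). It assumes the original layout holds one (2, N) probs array and one length-N confidence array per record, which is what rot.py suggests; the rotate and unrotate helpers are made up for illustration. As long as the rows are written record by record, a reshape gets the per-record arrays back:

```python
import numpy as np

# Same scalar columns as the rotated table in rot.py.
ROW_DTYPE = np.dtype([
    ("vid", "S16"), ("act", "S16"),
    ("prob0", np.float32), ("prob1", np.float32),
    ("confidence", np.float32),
])

def rotate(vids, probs, confidence):
    """Flatten per-record arrays into one row per (record, element).

    probs has shape (n_recs, 2, N); confidence has shape (n_recs, N).
    """
    n_recs, _, n = probs.shape
    rows = np.zeros(n_recs * n, dtype=ROW_DTYPE)
    # Denormalized identifiers: each one is repeated N times.
    rows["vid"] = np.repeat(vids, n)
    rows["act"] = np.repeat(vids, n)  # rot.py uses the same string for both
    rows["prob0"] = probs[:, 0, :].ravel()
    rows["prob1"] = probs[:, 1, :].ravel()
    rows["confidence"] = confidence.ravel()
    return rows

def unrotate(rows, n):
    """Recover the (n_recs, 2, N) probs array from the rotated rows."""
    prob0 = rows["prob0"].reshape(-1, n)
    prob1 = rows["prob1"].reshape(-1, n)
    return np.stack([prob0, prob1], axis=1)

n_recs, N = 2, 200
rng = np.random.RandomState(0)
vids = np.array([("V9482948923%05d" % i).encode("ascii")
                 for i in range(n_recs)])
probs = rng.random_sample((n_recs, 2, N)).astype(np.float32)
confidence = rng.random_sample((n_recs, N)).astype(np.float32)

rows = rotate(vids, probs, confidence)
restored = unrotate(rows, N)
# A per-record reduction is a reshape followed by a reduction along axis 1.
mean_conf = rows["confidence"].reshape(-1, N).mean(axis=1)
```

So by-record operations stay cheap, provided the reader can rely on the row order; anything that filters or reorders rows first would need to group on vid instead.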

> An ugly alternative, more or less what we are already doing, is to have
> Omero keep track of DataObject(s) (and how they have been obtained) and
> then to have another layer of software that knows how to resolve an
> hdf5/pytable url to a record.
> --gianluigi

Yeah, it would be nice if we could have all our middleware in one spot. ;)

Cheers,
~Josh