[ome-devel] on omero, pytables and genotypes
Josh Moore
josh at glencoesoftware.com
Wed Feb 2 16:21:04 GMT 2011
Hey Gianluigi,
On Feb 1, 2011, at 9:28 AM, Gianluigi Zanetti wrote:
> We are currently trying to move our genotype data description from a
> one-file per genotype to pytables, since it appears to be an efficient
> way to handle simple by-column and by-row operations.
>
> Now, what we are trying to understand is how this can best fit inside
> Omero. Are you supporting an array Column type in omero.grid.Table? It
> should be able to handle medium-sized arrays (i.e., up to 8M float
> elements).
The API defined in omero.grid.Table:
http://hudson.openmicroscopy.org.uk/job/OMERO/javadoc/slice2html/omero/grid/Table.html#Table
is largely based on PyTables as you can tell. To handle the remote transmission of the NumPy arrays, a list of omero.grid.Column:
http://hudson.openmicroscopy.org.uk/job/OMERO/javadoc/slice2html/omero/grid/Column.html#Column
subclasses are returned. Having the return value be column-based saves us from having to define a new type for each set of rows, as I think you were doing with ProtocolBuffers, but it limits the complexity of the rows that can be expressed (see below).
I don't think 8M elements is much of an issue in general, but you will have to sensibly define your loading strategy, perhaps using getWhereList.
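One way to picture such a loading strategy is to fetch the matching row numbers first (for instance from a getWhereList query) and then read them in bounded batches instead of all at once. The following is only a minimal sketch under that assumption: the helper names, the default batch size, and the read_rows callable are hypothetical stand-ins and not part of the OMERO API.

```python
def batches(row_numbers, batch_size):
    """Yield successive slices of row_numbers, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(row_numbers), batch_size):
        yield row_numbers[start:start + batch_size]


def load_in_batches(read_rows, row_numbers, batch_size=100000):
    """Call read_rows(batch) once per batch and yield each result.

    read_rows is a hypothetical callable standing in for whatever
    remote read is actually used; it is not an OMERO function.
    """
    for batch in batches(row_numbers, batch_size):
        yield read_rows(batch)


if __name__ == "__main__":
    # Pretend these row numbers came back from a query.
    rows = list(range(10))
    for chunk in load_in_batches(lambda b: [r * 2 for r in b], rows, batch_size=4):
        print(chunk)
```

Keeping each request small bounds the size of any single remote call, at the cost of more round trips.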
> Attached is a python file with a somewhat more detailed discussion of
> what we would like to do.
Thanks for the example. At the moment, you wouldn't be able to store the probs and the confidence columns as defined in genotable.py inside of OMERO.tables, since there are no higher-order columns (columns whose cells hold whole arrays), though an argument can certainly be made for adding them.
Since N >> n_recs, a rotated table is one option for using OMERO right away: n_recs*N rows, with vid & act as denormalized values repeated in each row. I didn't go through the exercise of seeing what that does to your computations, though. Just FYI:
/tmp $ cat rot.py
#!/usr/bin/env python
import numpy as np
import tables as tb
import time

N=200 # Just a reasonable test size.

class GDOAffy6_0(tb.IsDescription):
    vid = tb.StringCol(itemsize=16)
    act = tb.StringCol(itemsize=16)
    prob0 = tb.Float32Col()
    prob1 = tb.Float32Col()
    confidence = tb.Float32Col()

def create_gdos(fname, n_recs=2, p_A=0.3):
    fh = tb.openFile(fname, mode="w")
    root = fh.root
    for gn in ("GDOs",):
        g = fh.createGroup(root, gn)
    table = fh.createTable('/GDOs', 'affy6_0',
                           GDOAffy6_0,
                           "GDOs Affy6_0")
    sample = table.row
    probs = np.zeros((2,N), dtype=np.float32)
    for i in range(n_recs):
        probs[0,:] = np.random.normal(p_A**2, 0.01*p_A**2, N)
        probs[1,:] = np.random.normal((1.-p_A)**2, 0.01*(1.0-p_A)**2, N)
        for n in xrange(N):
            sample['vid'] = 'V9482948923%05d' % i
            sample['act'] = 'V9482948923%05d' % i
            sample['confidence'] = np.random.random(N)[n]
            sample['prob0'] = probs[0][n]
            sample['prob1'] = probs[1][n]
            sample.append()
    table.flush()
    fh.close()

if __name__ == "__main__":
    create_gdos("rot.h5")
/tmp $ ptdump-2.6 -v rot.h5
/ (RootGroup) ''
/GDOs (Group) ''
/GDOs/affy6_0 (Table(400,)) 'GDOs Affy6_0'
  description := {
  "act": StringCol(itemsize=16, shape=(), dflt='', pos=0),
  "confidence": Float32Col(shape=(), dflt=0.0, pos=1),
  "prob0": Float32Col(shape=(), dflt=0.0, pos=2),
  "prob1": Float32Col(shape=(), dflt=0.0, pos=3),
  "vid": StringCol(itemsize=16, shape=(), dflt='', pos=4)}
  byteorder := 'little'
  chunkshape := (1489,)
/tmp $ ls -ltrah *.h5
-rw-r--r-- 1 moore wheel 131K Feb 2 16:47 geno.h5
-rw-r--r-- 1 moore wheel 70K Feb 2 17:15 rot.h5
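As for what the rotation might do to computations: if each record's N rows stay contiguous and in marker order, the flat rows can be reshaped back into per-record arrays fairly cheaply. Below is a rough NumPy-only sketch of that round trip. The function names and the in-memory structured array are assumptions for illustration (act is left out for brevity), and nothing here touches OMERO or PyTables.

```python
import numpy as np


def rotate(vids, probs, confidence):
    """Flatten per-record arrays into one row per (record, marker) pair.

    probs has shape (n_recs, 2, N); confidence has shape (n_recs, N).
    """
    n_recs, _, N = probs.shape
    dtype = [("vid", "S16"), ("prob0", "f4"),
             ("prob1", "f4"), ("confidence", "f4")]
    rows = np.zeros(n_recs * N, dtype=dtype)
    # The record id is denormalized: repeated once per marker.
    rows["vid"] = np.repeat(np.asarray(vids, dtype="S16"), N)
    rows["prob0"] = probs[:, 0, :].ravel()
    rows["prob1"] = probs[:, 1, :].ravel()
    rows["confidence"] = confidence.ravel()
    return rows


def unrotate(rows, N):
    """Rebuild (n_recs, 2, N) probs and (n_recs, N) confidence.

    Assumes each record's N rows are contiguous and in marker order.
    """
    n_recs = len(rows) // N
    probs = np.stack([rows["prob0"].reshape(n_recs, N),
                      rows["prob1"].reshape(n_recs, N)], axis=1)
    confidence = rows["confidence"].reshape(n_recs, N)
    vids = rows["vid"].reshape(n_recs, N)[:, 0]
    return vids, probs, confidence
```

If the row order is not guaranteed, an explicit marker-index column would be needed to sort on before reshaping.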
> An ugly alternative, more or less what we are already doing, is to have
> Omero keep track of DataObject(s) (and how they have been obtained) and
> then to have another layer of software that knows how to resolve an
> hdf5/pytable url to a record.
> --gianluigi
Yeah, it would be nice if we could have all our middleware in one spot. ;)
Cheers,
~Josh