[ome-devel] OMERO.features: Development of a new API for storing image features

Ivan E. Cao-Berg icaoberg at andrew.cmu.edu
Tue Aug 27 19:26:57 BST 2013


hi! everything sounds great. i have a couple of questions and comments.

> On Tue, Aug 27, 2013 at 12:12 PM, Coletta, Christopher (NIH/NIA/IRP) [E]
> <christopher.coletta at nih.gov> wrote:
>
> Simon and I were talking last week about the OMERO.Features API, and I
> wanted to share with you some insights that fell out of that discussion.
>
> Let us make the following design assumptions:
>
> 0. We continue to use the existing OMERO image organizational structure of
> Project->Dataset->Image->ROI

> 1. Whole images can be considered ROIs for the purpose of calculating
> features.

yes, but it is important that we keep a clear distinction between field
and roi feature sets. that being said, what does this mean in practice?
once we build a model for a table or class, or select a data structure,
will it be used interchangeably for roi and field feature sets? would
whatever column or structure holds the roi ids then be set to null when
a row belongs to a field feature set?

> 2. ROI Preprocessing options must be an essential part of the feature
> storage framework, since a single ROI with different preprocessing can
> result in different feature values.

totally agree. then the question i ask is, how will this affect the
community as a whole? i haven't used cellprofiler, i have only used
knime. can we guarantee numerical accuracy across any system given a
particular version of the software? (not rhetorical, i have no idea).

will knime produce the same feature values on win32 as on macosx64?
and if the answer is no, then how much effort will be put into
communicating this, if any?
how easy will it be to exchange that information between systems?
do we want to share a feature table/vector, or do we want to share a
process that people can run on their own system?

even though these issues seem unimportant, i think they matter. if people
are going to be publishing data online along with their research articles,
reproducibility of certain calculations, say feature values, is very
important. how can we guarantee people can reproduce those results?

> 3. Feature storage backend that allows for fast query (i.e., API call to
> produce a feature matrix given a list of ROIs & preproc opts & feature
> names should be as close to an O(1) operation as possible, preferably on a
> single, large easily-queryable easily-sliceable data structure, as opposed
> to querying multiple files or doing multiple table joins.)

i think it is possible to make a hash table using the "super id" we use in
pyslid as the hash key. essentially the key is

<image id>.<pixel index>.<channel index>.<zslice index>.<timepoint index>.<resolution>

we figured this contained enough information to make a unique identifier
for every feature vector in the database.
unfortunately, even though searching in a hash table is O(1), vertical
slicing is not. then again, i am not an expert in this area. any feedback
on this?
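
a minimal sketch of the hash table idea above, assuming a plain in-memory
python dict stands in for the real store. the names SuperId, put, row and
column are hypothetical and are not part of pyslid. the key is kept as a
tuple rather than a dotted string so that a fractional resolution cannot
be confused with a separator. it shows why a row lookup is O(1) on average
while a vertical slice has to visit every stored vector.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout following the "super id" described above:
# (image id, pixel index, channel index, z-slice index, timepoint index, resolution)
SuperId = Tuple[int, int, int, int, int, float]


def put(store: Dict[SuperId, List[float]], key: SuperId, vector: List[float]) -> None:
    """Store one feature vector under its super id."""
    store[key] = list(vector)


def row(store: Dict[SuperId, List[float]], key: SuperId) -> List[float]:
    """Horizontal slice: one hash lookup, O(1) on average."""
    return store[key]


def column(store: Dict[SuperId, List[float]], index: int) -> List[float]:
    """Vertical slice: touches every stored vector, so O(n) in the number of keys."""
    return [vector[index] for vector in store.values()]


store: Dict[SuperId, List[float]] = {}
put(store, (1, 0, 0, 0, 0, 0.5), [1.0, 2.0])
put(store, (2, 0, 0, 0, 0, 0.5), [3.0, 4.0])
second_feature = column(store, 1)
```
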

other things i would like to point out
1) in terms of feature calculation it is essential that we keep track of
the resolution at which features were calculated
2) we should have a clear method that just links features to a database.
some people will not want to recalculate features if they have already
done it. some feature sets are computationally expensive
3) [super low priority but i think it is important] the possibility of
importing/exporting images where the features are attached to the metadata
(i briefly mentioned this to melissa in paris and she told me it "should"
be possible, at least theoretically).

Iván E. Cao-Berg
Senior Research Programmer
Ray and Stephanie Lane Center for Computational Biology
School of Computer Science
Carnegie Mellon University

>
> Here are some ideas we discussed:
>
> Each node in the image organizational structure will be a valid target to
> have features associated to it. The feature-containing data structure
> associated to each node should be hierarchical in nature, reflecting the
> structure of its children nodes and containing some or all of its
> children's features in a useful hierarchical table format. PyTables or
> something like it seems well-suited for this task. As Vebjorn Ljosa
> mentioned previously in this thread, the ability to efficiently make
> vertical and horizontal feature slices across the dataset is critical. And
> as Simon Li said previously, redundantly storing features across nodes
> will help to speed up these operations.
>
> Images can belong to multiple projects. It's possible to have multiple
> pre-sliced feature stores which are associated to projects/datasets that
> have been optimized for certain classification problems, i.e., contain a
> specific subset of features.
>
> Newly calculated features will always be saved to the ROI object store,
> and then upper level organizational nodes can have an Update call to
> integrate any feature updates into its own feature store. Stored feature
> vectors (or matrices in the case of projects/datasets) must be clearly
> labeled as to what feature set and preprocessing options were used to
> generate them, perhaps using an associative array where the feature
> vector/matrix is the "value" and the "key" is a machine parsable human
> readable string such as "WND-CHARM large feature set version 2.0,
> Haemotoxylin channel deconvolved from H&E stained brightfield RGB image,
> deconvolution params( X=foo, Y=bar, Z=baz )". Or maybe a lookup table
> where all known feature sets/preproc options are assigned an index, which
> would facilitate the packing of the features into a 3D matrix where the z
> index corresponds to the preproc index. Whatever we can do to make the
> query faster.
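
A minimal sketch of the lookup-table idea above, assuming nested Python
lists stand in for the 3D matrix. The class PreprocIndex, the helper
empty_cube and the label strings are hypothetical illustrations, not an
existing OMERO structure. Each feature-set/preprocessing label gets a
stable integer, and that integer is used as the z index of the cube.

```python
from typing import Dict, List, Optional


class PreprocIndex:
    """Assign a stable integer index to each feature-set/preprocessing label."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}

    def index_of(self, label: str) -> int:
        # A label seen for the first time gets the next free index.
        if label not in self._index:
            self._index[label] = len(self._index)
        return self._index[label]

    def __len__(self) -> int:
        return len(self._index)


def empty_cube(n_preproc: int, n_rois: int, n_features: int) -> List[List[List[Optional[float]]]]:
    """Build cube[z][roi][feature], where z is the preprocessing index."""
    return [[[None] * n_features for _ in range(n_rois)] for _ in range(n_preproc)]


labels = PreprocIndex()
z_raw = labels.index_of("feature-set-A v1, raw image")
z_dec = labels.index_of("feature-set-A v1, deconvolved channel")
cube = empty_cube(len(labels), n_rois=2, n_features=3)
cube[z_dec][0][2] = 0.75  # feature 2 of ROI 0 under the second preprocessing
```
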
>
> Some possible feature retrieval use-cases:
> 1a. Return a matrix of features for all ROIS for a given image, one row
> per ROI.
> 1b. Return a matrix of features for all ROIS for all images in a dataset,
> one row per image, features for individual ROIs concatenated into a single
> row.
> 2a. Return a matrix of features for all images in this dataset, where the
> features for each image come from the ROI located at X,Y[,Z]
> 2b. Return a matrix of features for all images in this dataset, where the
> features for each image come from the ROI labeled number 37 out of 96 for
> that image.
>
> Thanks for reading,
> Chris
>
> On Aug 1, 2013, at 1:42 PM, Simon Li wrote:
>
> Hi all
>
> Thanks for your great comments! Based on everyone's contributions so far I
> think there's scope for two, or maybe even three, APIs...
>
>
> 1. Feature storage
>
> It sounds like if we described an ROI in a suitable manner this would
> encompass many of the requirements brought up by Lee and Chris. For
> example, a whole-image feature is just a ROI covering the whole image,
> features calculated from multiple ROIs could be linked to a parent ROI
> consisting of the union of the smaller ROIs, including over time.
>
> As you probably know the OME along with many other groups is involved in
> creating a cross-platform ROI specification [1] which could solve some of
> the difficulties in linking a feature to the original data, perhaps this
> is a good use case for that work?
>
> In terms of the implementation this lends itself to a tabular format such
> as HDF5, though the optimal layout needs investigation. For instance
> OMERO.tables (which uses HDF5 as the backend) supports array-columns,
> where each column in the table is effectively a feature-set, and each row
> therefore consists of multiple feature-sets. Alternatively if the
> feature-set is almost always the same then splitting the feature-set up
> like this could reduce performance.
>
>
> 2. Feature calculation
>
> My understanding of PySLID is that although it includes only a small
> number of algorithms for single or dual channel images it's designed to
> support a much wider range of feature calculation algorithms. Hopefully
> Ivan from CMU can elaborate.
>
> Coming up with a way to record a whole workflow to support reproducible
> research is going to be a very big undertaking, though a standardised ROI
> and feature storage specification will be a big step. Is there any way, at
> least in the short term, we could take advantage of say some of the
> components in CellProfiler or KNIME to record our analysis pipeline?
>
> For large screens feature calculation will obviously have to be done
> offline, perhaps with the aid of a cluster. However in past meetings Chris
> Coletta has suggested a role for near real-time feature calculation, for
> instance as part of online classification of new images based on a
> previously trained classifier. If you're dealing with small datasets then
> it'd be nice if we could return results to people in a reasonable time.
>
>
> 3. Feature retrieval
>
> Chris' suggestion of requesting features in the form of ([feature names],
> [object ids]) looks sensible. Vebjorn brings up a very good point about
> retrieving both row and column slices efficiently. When I spoke with Lee a
> few months ago in Dundee he said feature sets are often frozen after
> calculation, so one option would be to have a post-feature-calculation
> task to optimise the storage format, if necessary duplicating the data so
> both row and column operations are fast. Effectively we'd have multiple
> implementations of the same API, transparent to the client.
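
A minimal sketch of the post-calculation optimisation described above,
assuming the feature set is frozen and small enough to hold in memory. The
class FrozenFeatureStore and its method names are hypothetical. The data
is deliberately duplicated in row-major and column-major form so that both
kinds of slice are a single list lookup.

```python
from typing import Dict, List, Sequence


class FrozenFeatureStore:
    """Keep two copies of a frozen feature matrix, one per access pattern."""

    def __init__(self, object_ids: Sequence[int], feature_names: Sequence[str],
                 matrix: Sequence[Sequence[float]]) -> None:
        self._row_of: Dict[int, int] = {oid: i for i, oid in enumerate(object_ids)}
        self._col_of: Dict[str, int] = {name: j for j, name in enumerate(feature_names)}
        # Row-major copy: one list per object.
        self._rows: List[List[float]] = [list(r) for r in matrix]
        # Column-major copy: one list per feature (the transpose of the rows).
        self._cols: List[List[float]] = [list(c) for c in zip(*matrix)]

    def features_for(self, object_id: int) -> List[float]:
        """Row slice: all features of one object."""
        return self._rows[self._row_of[object_id]]

    def values_of(self, feature_name: str) -> List[float]:
        """Column slice: one feature across all objects."""
        return self._cols[self._col_of[feature_name]]


frozen = FrozenFeatureStore([24, 67], ["area", "perimeter"], [[1.0, 2.0], [3.0, 4.0]])
```

The client only sees features_for and values_of, so the duplicated layout
stays hidden behind the same interface.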
>
>
> In the interests of getting things moving I think concentrating on feature
> storage/retrieval first might be better than trying to do everything at
> once. This would at least allow inter-operability of algorithms from
> different groups. At present PySLID is written in Python and is designed
> to be used directly by the feature calculation algorithm. In contrast
> something like the OMERO.tables service is hidden behind Ice, which
> automatically gives us cross-language support at the expense of increased
> complexity. My inclination is to stick with a Python module for now, but
> with the aim of converting it to an Ice service as soon as we've got a
> working design. However I'm happy to hear other opinions.
>
> Best wishes
>
> Simon
>
> [1] http://www.scijava.org/roi-model/
>
>
> On 22 Jul 2013, at 14:23, Lee Kamentsky
> <leek at broadinstitute.org>
> wrote:
>
> I'm forwarding this thread onto our internal imaging platform list - we
> have several researchers here whose job is analyzing the sorts of feature
> sets that would be an output of the spec. If you all could read the whole
> thread and contribute, I think it would be helpful. Otherwise, I think
> Chris represents much of our perspective well, so nothing more to say.
>
> --Lee
>
>
> On Fri, Jul 19, 2013 at 8:02 PM, Coletta, Christopher (NIH/NIA/IRP) [E]
> <christopher.coletta at nih.gov>
> wrote:
> Hey Jason et al., I didn't forget about ya!
>
> By the way, cheers to Simon for soliciting feedback for the API design, to
> Ivan Cao-Berg and the Murphy group for putting forward pyslic/pyslid, and
> to Lee Kamentsky for his CellProfiler insights.
>
> Let's talk about the part of the API that deals specifically with retrieval
> of features from OMERO (the "feature query interface"). Most supervised
> learning classifiers/clustering algorithms entail measuring distances
> between images in high-dimensional feature space. Ideally there'd be an
> OMERO.features API call that would construct the feature space, i.e.,
> return a 2D matrix where the rows are the points in space corresponding to
> image or ROI, and the columns are the individual features which are the
> dimensions in feature space.
>
> At first it seems pretty trivial to construct the feature matrix. Just
> specify a list of ids for images or ROIs, get back their corresponding
> feature vectors, stack them into a matrix, and you're done, right?
>
> Not quite. First, there are the many ways an image/ROI can be preprocessed
> as Lee K. mentioned, including highlighting structures of interest,
> segmentation, cropping out into ROIs, transformation, rotation,
> normalization, etc. After all the preprocessing, you end up with what I
> call the "sample," which simply means the pixels/voxels which are the
> substrate upon which the feature algorithms operate. There are countless
> ways to sample an image, and each sample will have its own corresponding
> feature vector. You may want to build your feature space by mixing and
> matching between these feature vectors. The simplest example is with
> multi-channel images: running WND-CHARM on an RGB image can result in the
> entire battery of feature algorithms being run on all three channels, and
> three separate feature vectors are generated. And it's perfectly
> acceptable to construct a single classifier feature space using features
> from all three channels. It may be useful to keep the
> CellProfiler/Pyslic/WND-CHARM feature sets in their own containers, but
> still provide the option to mix and match, or add new batteries of
> algorithms like Lee K. mentioned. The features used and sampling can vary
> by image modality and can vary from experiment to experiment.
>
> The hierarchical nature of feature data is what made Simon choose HDF5
> files stored on a per-image basis as the back-end in the first generation
> of OMERO-WND-CHARM. But for datasets that consist of thousands of
> images/ROIs, this solution might not scale well, which is why Simon was
> interested in a NoSQL database for feature storage where the schema about
> what can be stored is not strict.
>
> A generalized API to construct this 2D feature matrix should allow for
> three types of user inputs:
> 1. Image/ROI ids indicating what will be represented in feature space
> (i.e., rows)
> 2. List of the features or feature families that are relevant (i.e.,
> columns)
> 3. Specification of how to pack the features into the rows, i.e., should
> each image/ROI get its own row, or should one row contain features from
> multiple images/ROIs/samples.
>
> An example to illustrate #3: We're developing a classifier to diagnose
> bacterial and viral pneumonia. We have chest X-ray images, each of which
> has been segmented into 12 regions based on anatomy. Experimental design
> may dictate that the 12 ROIs be considered as 12 individual points in
> feature space (more rows, fewer columns), or they may count as a single
> point for the purpose of classification (more columns, fewer rows).
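
A minimal sketch of the two packing choices in the example above, assuming
the per-ROI feature vectors are already available as plain lists grouped by
image id. The function pack_rows and its per_roi flag are hypothetical
names used only for illustration.

```python
from typing import Dict, List


def pack_rows(features_by_image: Dict[int, List[List[float]]], per_roi: bool) -> List[List[float]]:
    """Build a feature matrix with either one row per ROI or one row per image."""
    rows: List[List[float]] = []
    for image_id in sorted(features_by_image):
        roi_vectors = features_by_image[image_id]
        if per_roi:
            # Each ROI is its own point in feature space (more rows, fewer columns).
            rows.extend(list(vector) for vector in roi_vectors)
        else:
            # All ROIs of an image are concatenated into one point
            # (more columns, fewer rows).
            rows.append([value for vector in roi_vectors for value in vector])
    return rows


features = {
    10: [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    11: [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
}
tall = pack_rows(features, per_roi=True)    # 6 rows, 2 columns
wide = pack_rows(features, per_roi=False)   # 2 rows, 6 columns
```
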
>
> A simple feature query API call would be similar to the SQL query "SELECT
> FeatureA, FeatureB, FeatureC FROM FeatureTable WHERE image IN (list of
> ids)". The user would provide a list of image/ROI ids, and a list of
> features in the form of some human-readable, machine-parseable structure
> which would contain all the information for how that feature was
> calculated, including all preprocessing and channel information.
> You could say that every individual feature could be uniquely identified
> for a given image/ROI/sample by its own feature "street address". In
> WND-CHARM we accomplish this using strings that have nested parentheses and
> brackets in the form "<Algorithm> ( <Transform> (<Channel> ) ) [ Feature
> Index ]". The API call might look something like this:
>
> roi_id_list = [ 24, 67, 89, 103 ]
> desired_feature_list = [
>      'Zernike Coefficients (Fourier (Wavelet (Red))) [52]',
>      'Gini Coefficient (Wavelet (Fourier (Green))) [0]',
>      'Chebyshev Coefficients (Wavelet (Blue)) [19]',
>      'Radon Coefficients (Green) [12]' ]
>
> feature_matrix = GenerateFeatureSpace( roi_id_list, desired_feature_list )
>
> A 4x4 NumPy matrix (four ROIs by four features) with the corresponding
> features would then be returned into feature_matrix. The feature street
> addresses don't have to be strings, they can be some map/dict/class where
> the components of the feature address are more structured. Or each feature
> street address could be composed of a bunch of tags.
>
> It's possible at query time the feature for the given image/ROI/sample
> hasn't even been calculated yet. The API would need to satisfy the request
> either from features stored in the database or calculated on the fly.
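
A minimal sketch of satisfying a request from stored values where possible
and calculating the rest on the fly, assuming an in-memory dict as the
store and a caller-supplied compute function. The class
LazyFeatureService is a hypothetical illustration and does not reproduce
the BISQUE Feature Service API.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class LazyFeatureService:
    """Return stored feature values, computing and saving any that are missing."""

    def __init__(self, compute: Callable[[int, str], float]) -> None:
        self._compute = compute
        self._stored: Dict[Tuple[int, str], float] = {}
        self.computed_count = 0  # how many values had to be calculated

    def get(self, roi_id: int, feature_address: str) -> float:
        key = (roi_id, feature_address)
        if key not in self._stored:
            # Not in the store yet: calculate once, then keep the result.
            self._stored[key] = self._compute(roi_id, feature_address)
            self.computed_count += 1
        return self._stored[key]

    def matrix(self, roi_ids: Sequence[int], addresses: Sequence[str]) -> List[List[float]]:
        """One row per ROI, one column per requested feature."""
        return [[self.get(roi, addr) for addr in addresses] for roi in roi_ids]


# Stand-in calculation used only to make the sketch runnable.
service = LazyFeatureService(lambda roi_id, address: float(roi_id * len(address)))
first = service.matrix([24, 67], ["abc", "de"])
second = service.matrix([24, 67], ["abc", "de"])  # served entirely from the store
```
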
> BISQUE has functionality that works like this called the Feature Service.
>
> Sorry for the long email. I'm excited to work with you all to move the
> ball forward on this project!
> Chris C.
>
> On Jul 19, 2013, at 5:35 PM, Jason Swedlow wrote:
>
> Hi All
>
> A quick plea not to drop this thread. Input from Ivan and Chris C. would
> be most welcome. These applications are very important and getting this
> API nailed down-- at least for a first draft-- would be hugely helpful.
>
> Cheers,
>
> Jason
>
>
> Jason Swedlow, PhD, FRSE
> Centre for Gene Regulation & Expression
> Open Microscopy Environment
> University of Dundee
> http://openmicroscopy.org
>
>
>
>
> Lee Kamentsky
> <leek at broadinstitute.org>
> wrote:
>
> Hi all,
> I think it's great that Bob Murphy's group has implemented pyslic and
> pyslid in an open-source framework like OMERO. It looks like a substantial
> body of work. I'm wondering what needs to be done to make it a
> general-purpose framework however, especially looking at it from the
> perspective of our group's experience with CellProfiler. Also, Simon,
> thanks for moving this forward.
>
> My reading of the pyslic code is that it supports a nuclear stain and a
> protein stain and calculates a standard set of per-image and per-object
> features (although I haven't quite figured out the storage mechanism for
> the object features). This is adequate for a large class of experiments
> involving two-color fluorescently-labeled samples and it's likely the
> methods are robust, but our experience has been that experimental
> protocols can be more varied (multiple protein stains, brightfield images)
> and the biological questions can require additional image preprocessing to
> highlight the structures of interest, often requiring tuning parameters
> specific to the structure scale. Because of this, I think that the
> framework needs a modular architecture that supports development of new
> algorithms by computational researchers and configuration by the end users
> and it needs to extend beyond a curated code-base to allow for innovation.
> Personally, I'm really pleased that the framework is in Python because it
> aligns well with our group, but perhaps this is limiting for the ImageJ
> community and perhaps some portion of CellProfiler's bridge between Python
> and ImageJ could be adapted to supply the connection.
>
> I think that we do need a platform for innovation and the keys to that are
> interoperability, standards, and a model of the analysis that is flexible
> enough to describe our community's experiments and that captures the
> analysis protocol in a reproducible manner. I'm going to outline my
> perspective on the model here, drawing on our group's experience with
> CellProfiler, and try to keep it brief. I see the components of the model
> being:
>
> * Fields of view - N dimensional spaces (X, Y, T, Z, spectral)
> representing an imaging site
> * Images - acquired image data on a field of view (with acquisition
> metadata) or similar produced by algorithms such as filters or
> morphological operations.
> * Segmentations - defining multiple regions of interest on the fields of
> view or on (hyper)planes of the fields of view
> * Relationships between segmented regions - links between segmented
> regions either within segmentations or across them. Examples might be
> time-lapse cell tracking, associations between nuclear and cellular
> segmentations or groupings of organelle segmentations within a cell.
> * Measurements - data computed on the images, segmentations and
> relationships within a field of view. My take on this is that a
> measurement produces a numeric feature value per image or per segmentation
> region, but perhaps that's too narrow.
> * Protocol - a description of how to perform the analysis. I think the key
> elements are a link to the OMERO screen and a list of the parameterized
> algorithms to be performed. The screen provides image inputs to the
> algorithms which are the available image acquisition channels and the
> algorithms themselves provide images, segmentations, relationships and
> measurements which can serve as inputs to other algorithms in the
> protocol. Algorithms will often be parameterizable by the user and these
> parameters should be captured by the protocol. Ideally, the protocol
> should capture the versions of the algorithms using a mechanism such as a
> Git hash. In CellProfiler, we have algorithms that produce an aggregated
> image based on samples from many fields of view, for instance an estimate
> of differences in signal magnitude across the field of view caused by
> non-uniform illumination - algorithms might have stacks of images as
> inputs and these stacks might span individual fields of view.
>
> As far as the actual mechanics, I see OMERO or similar using the protocol
> as a dependency graph, fetching the algorithms using some
> community-standard mechanism (maven? pip?), providing inputs as specified
> by the protocol and harvesting the outputs for the database and for
> dependent algorithms. I have some detailed concerns about algorithm
> input/output introspection and discovery, but ImageJ 2.0's plugin
> introspection protocol (@parameter) is a good starting point (thanks
> ImageJ 2.0).
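
A minimal sketch of running a protocol as a dependency graph, assuming
each step is a plain Python function and using the standard library's
graphlib (Python 3.9+) for ordering. The function run_protocol, the step
names and the toy calculations are hypothetical and only illustrate how
outputs of one algorithm become inputs of the next.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[List[str], Callable[..., Any]]  # (dependency names, function)


def run_protocol(steps: Dict[str, Step], inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run every step after its dependencies and collect all outputs by name."""
    graph = {name: set(deps) for name, (deps, _) in steps.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # provided as an input, e.g. an acquired image channel
        deps, func = steps[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


# Toy protocol: correct an "image", threshold it, then measure the mask.
protocol: Dict[str, Step] = {
    "corrected": (["raw"], lambda raw: [value - 1 for value in raw]),
    "mask": (["corrected"], lambda corrected: [value > 0 for value in corrected]),
    "area": (["mask"], lambda mask: sum(mask)),
}
outputs = run_protocol(protocol, {"raw": [1, 2, 3]})
```
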
>
> OK - somewhat CellProfiler-centric perhaps, but the nice thing about OMERO
> is that it is a relational database and the protocol is the thing itself -
> not a description of the experiment, but a mineable map of how each number
> is produced especially if the protocol pieces are described relationally
> in the database. I think the above is an ambitious undertaking, but look
> at the result! Researchers can trade protocols which produce robust and
> comparable values (not just "nuclear area", but the nuclear area after
> illumination correction and segmentation using Otsu thresholding and a
> seeded watershed of HeLa cells stained with DAPI). Developers can publish
> their method in OMERO and possibly OMERO itself can generate citations
> based on a protocol, leading to better accreditation of our work. And
> OMERO itself becomes a sustainable platform for analysis with a
> well-defined interoperable API for image processing.
>
> Hope this all gives things a positive lift, thx for reading this far,
> --Lee
>
> On Fri, Jul 5, 2013 at 10:03 AM, Simon Li
> <s.p.li at dundee.ac.uk>
> wrote:
> Hi everyone
>
> It was great to see so many people interested in OMERO.searcher and
> WND-CHARM at the Paris meeting, both those who were interested in
> installing it on their own systems and also those of you who were
> interested in developing other analysis algorithms for use with OMERO.
>
> One of the main points that came up was that OMERO should provide a single
> API for storing and calculating image features. Robert Murphy's group at
> CMU have already developed PySLID [http://github.com/icaoberg/pyslid], a
> Python module for calculating and storing features used with
> OMERO.searcher, so I'd like to propose we bring this into the
> openmicroscopy GitHub organisation, and rename it to OMERO.features (other
> suggestions are welcome).
> Then there's the much bigger task of modifying the module to cater for
> everyone's requirements. I can see several potential issues, including how
> we handle multiple channels, z-slices, timepoints, ROIs, etc since
> features can be calculated for these individually or as a whole.
>
> If anyone has any thoughts or comments on what they'd like to see it'd be
> great if you could share them with the rest of this list, or if you prefer
> on our forums.
>
> Best wishes
>
> Simon
>
>
>
> The University of Dundee is a registered Scottish Charity, No: SC015096
>
> _______________________________________________
> ome-devel mailing list
> ome-devel at lists.openmicroscopy.org.uk
> http://lists.openmicroscopy.org.uk/mailman/listinfo/ome-devel
>
> Christopher Coletta
> Computer Scientist
> Image Informatics and Computational Biology Unit
> Laboratory of Genetics
> National Institute on Aging
> Biomedical Research Center
> 251 Bayview Boulevard, Room 10B125
> Baltimore, MD 21224
> Desk: 410-558-8170
> Cell: 617-943-9745
>
