[ome-devel] OMERO.features: Development of a new API for storing image features

Lee Kamentsky leek at broadinstitute.org
Tue Aug 27 17:43:06 BST 2013


This looks great, thanks. I was wondering if you could explain the ROI
preprocessing options. I'm also hoping that there is some way to relate a
feature to the mechanics of how it was created (in CellProfiler's case, the
software version, pipeline file, and OMERO dataset).


On Tue, Aug 27, 2013 at 12:12 PM, Coletta, Christopher (NIH/NIA/IRP) [E] <
christopher.coletta at nih.gov> wrote:

>
> Simon and I were talking last week about the OMERO.Features API, and I
> wanted to share with you some insights that fell out of that discussion.
>
> Let us make the following design assumptions:
>
> 0. We continue to use the existing OMERO image organizational structure of
> Project->Dataset->Image->ROI
> 1. Whole images can be considered ROIs for the purpose of calculating
> features.
> 2. ROI Preprocessing options must be an essential part of the feature
> storage framework, since a single ROI with different preprocessing can
> result in different feature values.
> 3. The feature storage backend must allow for fast queries (i.e., an API
> call to produce a feature matrix given a list of ROIs, preproc opts, and
> feature names should be as close to an O(1) operation as possible,
> preferably over a single, large, easily-queryable, easily-sliceable data
> structure, as opposed to querying multiple files or doing multiple table
> joins).
>
> Here are some ideas we discussed:
>
> Each node in the image organizational structure will be a valid target to
> have features associated with it. The feature-containing data structure
> associated with each node should be hierarchical in nature, reflecting the
> structure of its children nodes and containing some or all of its
> children's features in a useful hierarchical table format. PyTables or
> something like it seems well-suited for this task. As Vebjorn Ljosa
> mentioned previously in this thread, the ability to efficiently make
> vertical and horizontal feature slices across the dataset is critical. And
> as Simon Li said previously, redundantly storing features across nodes will
> help to speed up these operations.
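>
> For illustration, here's a minimal PyTables sketch of one possible
> hierarchical layout (the group names and the fixed feature-vector length
> are hypothetical, not a settled design):
>
> import numpy as np
> import tables
>
> N_FEATURES = 1025  # hypothetical fixed feature-vector length
>
> class FeatureRow(tables.IsDescription):
>     roi_id = tables.Int64Col()
>     features = tables.Float64Col(shape=(N_FEATURES,))
>
> # Groups mirror Project->Dataset->Image; parent nodes could hold
> # redundant copies of their children's tables for fast slicing.
> h5 = tables.open_file("features.h5", mode="w")
> project = h5.create_group("/", "project_1")
> dataset = h5.create_group(project, "dataset_7")
> tbl = h5.create_table(dataset, "roi_features", FeatureRow)
>
> row = tbl.row
> row["roi_id"] = 24
> row["features"] = np.random.rand(N_FEATURES)
> row.append()
> tbl.flush()
>
> # Horizontal slice: feature 52 across every ROI in the dataset.
> col52 = tbl.read(field="features")[:, 52]
> h5.close()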
>
> Images can belong to multiple projects. It's possible to have multiple
> pre-sliced feature stores, associated with projects/datasets, that have
> been optimized for certain classification problems, i.e., that contain a
> specific subset of features.
>
> Newly calculated features will always be saved to the ROI object store,
> and then upper-level organizational nodes can have an Update call to
> integrate any feature updates into their own feature stores. Stored feature
> vectors (or matrices in the case of projects/datasets) must be clearly
> labeled as to what feature set and preprocessing options were used to
> generate them, perhaps using an associative array where the feature
> vector/matrix is the "value" and the "key" is a machine-parsable,
> human-readable string such as "WND-CHARM large feature set version 2.0,
> Haematoxylin channel deconvolved from H&E stained brightfield RGB image,
> deconvolution params( X=foo, Y=bar, Z=baz )". Or maybe a lookup table where
> all known feature sets/preproc options are assigned an index, which would
> facilitate the packing of the features into a 3D matrix where the z index
> corresponds to the preproc index. Whatever we can do to make the query
> faster.
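>
> As a sketch of the lookup-table idea (all names and dimensions made up):
>
> import numpy as np
>
> # Registry assigning each feature-set/preproc combination a z-index.
> preproc_index = {
>     "WND-CHARM large v2.0 / raw RGB": 0,
>     "WND-CHARM large v2.0 / deconvolved Haematoxylin": 1,
> }
>
> n_rois, n_features = 96, 2919  # made-up dimensions
> cube = np.zeros((len(preproc_index), n_rois, n_features))
>
> # A query is then a constant-time slice rather than a table join:
> z = preproc_index["WND-CHARM large v2.0 / deconvolved Haematoxylin"]
> matrix = cube[z]  # shape (n_rois, n_features)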
>
> Some possible feature retrieval use-cases:
> 1a. Return a matrix of features for all ROIs for a given image, one row
> per ROI.
> 1b. Return a matrix of features for all ROIs for all images in a dataset,
> one row per image, features for individual ROIs concatenated into a single
> row.
> 2a. Return a matrix of features for all images in this dataset, where the
> features for each image come from the ROI located at X,Y[,Z]
> 2b. Return a matrix of features for all images in this dataset, where the
> features for each image come from the ROI labeled number 37 out of 96 for
> that image.
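>
> To make these concrete, hypothetical Python signatures might look like
> the following (the names are mine, purely illustrative):
>
> def roi_features(image_id, feature_names):
>     """Use-case 1a: one row per ROI of the given image."""
>
> def dataset_features_concat(dataset_id, feature_names):
>     """Use-case 1b: one row per image, with per-ROI features
>     concatenated along the row."""
>
> def dataset_features_at(dataset_id, feature_names, x, y, z=None):
>     """Use-case 2a: per image, features of the ROI at X,Y[,Z]."""
>
> def dataset_features_by_label(dataset_id, feature_names, roi_label):
>     """Use-case 2b: per image, features of the ROI with the given
>     label (e.g. number 37 out of 96)."""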
>
> Thanks for reading,
> Chris
>
> On Aug 1, 2013, at 1:42 PM, Simon Li wrote:
>
> Hi all
>
> Thanks for your great comments! Based on everyone's contributions so far I
> think there's scope for two, or maybe even three, APIs...
>
>
> 1. Feature storage
>
> It sounds like if we described an ROI in a suitable manner this would
> encompass many of the requirements brought up by Lee and Chris. For
> example, a whole-image feature is just an ROI covering the whole image,
> and features calculated from multiple ROIs could be linked to a parent ROI
> consisting of the union of the smaller ROIs, including over time.
>
> As you probably know, the OME, along with many other groups, is involved in
> creating a cross-platform ROI specification [1] which could solve some of
> the difficulties in linking a feature to the original data; perhaps this is
> a good use case for that work?
>
> In terms of the implementation this lends itself to a tabular format such
> as HDF5, though the optimal layout needs investigation. For instance
> OMERO.tables (which uses HDF5 as the backend) supports array-columns, where
> each column in the table is effectively a feature-set, and each row
> therefore consists of multiple feature-sets. On the other hand, if the
> feature-set is almost always the same, then splitting the feature-set up
> like this could reduce performance.
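>
> As a rough sketch of the array-column layout using the OMERO.tables Python
> API (the feature-set names and sizes are invented, and "conn" is assumed
> to be an existing BlitzGateway connection; please check the calls against
> your OMERO version):
>
> import omero.grid
>
> # One array-column per feature-set; each row then carries several sets.
> cols = [
>     omero.grid.LongColumn("roi_id", "ROI identifier", []),
>     omero.grid.DoubleArrayColumn("wndchrm", "WND-CHRM features", 2919, []),
>     omero.grid.DoubleArrayColumn("slf", "SLF features", 174, []),
> ]
> table = conn.c.sf.sharedResources().newTable(1, "features.h5")
> table.initialize(cols)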
>
>
> 2. Feature calculation
>
> My understanding of PySLID is that although it includes only a small
> number of algorithms for single- or dual-channel images, it's designed to
> support a much wider range of feature calculation algorithms. Hopefully
> Ivan from CMU can elaborate.
>
> Coming up with a way to record a whole workflow to support reproducible
> research is going to be a very big undertaking, though a standardised ROI
> and feature storage specification will be a big step. Is there any way, at
> least in the short term, we could take advantage of say some of the
> components in CellProfiler or KNIME to record our analysis pipeline?
>
> For large screens feature calculation will obviously have to be done
> offline, perhaps with the aid of a cluster. However, in past meetings Chris
> Coletta has suggested a role for near real-time feature calculation, for
> instance as part of online classification of new images based on a
> previously trained classifier. If you're dealing with small datasets then
> it'd be nice if we could return results to people in a reasonable time.
>
>
> 3. Feature retrieval
>
> Chris' suggestion of requesting features in the form of ([feature names],
> [object ids]) looks sensible. Vebjorn brings up a very good point about
> retrieving both row and column slices efficiently. When I spoke with Lee a
> few months ago in Dundee he said feature sets are often frozen after
> calculation, so one option would be to have a post-feature-calculation task
> to optimise the storage format, if necessary duplicating the data so both
> row and column operations are fast. Effectively we'd have multiple
> implementations of the same API, transparent to the client.
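>
> A crude sketch of that frozen, duplicated layout using h5py (file and
> dataset names hypothetical):
>
> import h5py
> import numpy as np
>
> features = np.random.rand(10000, 2919)  # made-up frozen feature matrix
>
> # Store both orientations so row slices (one ROI, all features) and
> # column slices (one feature, all ROIs) are each contiguous reads.
> with h5py.File("frozen_features.h5", "w") as f:
>     f.create_dataset("by_roi", data=features)
>     f.create_dataset("by_feature", data=features.T)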
>
>
> In the interests of getting things moving I think concentrating on feature
> storage/retrieval first might be better than trying to do everything at
> once. This would at least allow interoperability of algorithms from
> different groups. At present PySLID is written in Python and is designed to
> be used directly by the feature calculation algorithm. In contrast
> something like the OMERO.tables service is hidden behind Ice, which
> automatically gives us cross-language support at the expense of increased
> complexity. My inclination is to stick with a Python module for now, but
> with the aim of converting it to an Ice service as soon as we've got a
> working design. However I'm happy to hear other opinions.
>
> Best wishes
>
> Simon
>
> [1] http://www.scijava.org/roi-model/
>
>
> On 22 Jul 2013, at 14:23, Lee Kamentsky <leek at broadinstitute.org> wrote:
>
> I'm forwarding this thread on to our internal imaging platform list - we
> have several researchers here whose job is analyzing the sorts of feature
> sets that would be an output of the spec. If you all could read the whole
> thread and contribute, I think it would be helpful. Otherwise, I think
> Chris represents much of our perspective well, so nothing more to say.
>
> --Lee
>
>
> On Fri, Jul 19, 2013 at 8:02 PM, Coletta, Christopher (NIH/NIA/IRP) [E] <
> christopher.coletta at nih.gov> wrote:
> Hey Jason et al., I didn't forget about ya!
>
> By the way, cheers to Simon for soliciting feedback for the API design, to
> Ivan Cao-Berg and the Murphy group for putting forward pyslic/pyslid, and
> to Lee Kamentsky for his CellProfiler insights.
>
> Let's talk about the part of the API that deals specifically with retrieval
> of features from OMERO (the "feature query interface"). Most supervised
> learning classifiers/clustering algorithms entail measuring distances
> between images in high-dimensional feature space. Ideally there'd be an
> OMERO.features API call that would construct the feature space, i.e.,
> return a 2D matrix where the rows are the points in space corresponding to
> images or ROIs, and the columns are the individual features, which are the
> dimensions in feature space.
>
> At first it seems pretty trivial to construct the feature matrix. Just
> specify a list of ids for images or ROIs, get back their corresponding
> feature vectors, stack them into a matrix, and you're done, right?
>
> Not quite. First, there are many ways an image/ROI can be preprocessed
> as Lee K. mentioned, including highlighting structures of interest,
> segmentation, cropping into ROIs, transformation, rotation,
> normalization, etc. After all the preprocessing, you end up with what I
> call the "sample," which simply means the pixels/voxels which are the
> substrate upon which the feature algorithms operate. There are countless
> ways to sample an image, and each sample will have its own corresponding
> feature vector. You may want to build your feature space by mixing and
> matching between these feature vectors. The simplest example is with
> multi-channel images: running WND-CHARM on an RGB image can result in the
> entire battery of feature algorithms being run on all three channels, and
> three separate feature vectors are generated. And it's perfectly acceptable
> to construct a single classifier feature space using features from all
> three channels. It may be useful to keep the CellProfiler/Pyslic/WND-CHARM
> feature sets in their own containers, but still provide the
> option to mix and match, or add new batteries of algorithms like Lee K.
> mentioned. The features used and sampling can vary by image modality and
> can vary from experiment to experiment.
>
> The hierarchical nature of feature data is what made Simon choose HDF5
> files stored on a per-image basis as the back-end in the first generation
> of OMERO-WND-CHARM. But for datasets that consist of thousands of
> images/ROIs, this solution might not scale well, which is why Simon was
> interested in a NoSQL database for feature storage, where the schema for
> what can be stored is not strict.
>
> A generalized API to construct this 2D feature matrix should allow for
> three types of user inputs:
> 1. Image/ROI ids indicating what will be represented in feature space
> (i.e., rows)
> 2. List of the features or feature families that are relevant (i.e.,
> columns)
> 3. Specification of how to pack the features into the rows, i.e., should
> each image/ROI get its own row, or should one row contain features from
> multiple images/ROIs/samples.
>
> An example to illustrate #3: We're developing a classifier to diagnose
> bacterial and viral pneumonia. We have chest X-ray images, each of which
> has been segmented into 12 regions based on anatomy. Experimental design
> may dictate that the 12 ROIs be considered as 12 individual points in
> feature space (more rows, fewer columns), or that they count as a single
> point for the purpose of classification (more columns, fewer rows).
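>
> In numpy terms, the two packings are just reshapes (shapes invented for
> illustration):
>
> import numpy as np
>
> n_images, n_regions, n_features = 50, 12, 200
> per_roi = np.random.rand(n_images, n_regions, n_features)
>
> # Option 1: each anatomical region is its own point in feature space.
> rows_per_region = per_roi.reshape(n_images * n_regions, n_features)
>
> # Option 2: one point per X-ray, region features concatenated.
> rows_per_image = per_roi.reshape(n_images, n_regions * n_features)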
>
> A simple feature query API call would be similar to the SQL query "SELECT
> FeatureA, FeatureB, FeatureC FROM FeatureTable WHERE image = (list of
> ids)". The user would provide a list of image/ROI ids, and a list of
> features in the form of some human-readable, machine-parseable structure
> which would contain all the information for how that feature was
> calculated, including all preprocessing and channel information.
> You could say that every individual feature could be uniquely identified
> for a given image/ROI/sample by its own feature "street address". In
> WND-CHRM we accomplish this using strings that have nested parentheses and
> brackets in the form "<Algorithm> ( <Transform> ( <Channel> ) ) [ Feature
> Index ]". The API call might look something like this:
>
> roi_id_list = [ 24, 67, 89, 103 ]
> desired_feature_list = [ \
>      'Zernike Coefficients (Fourier (Wavelet (Red))) [52]',
>      'Gini Coefficient (Wavelet (Fourier (Green))) [0]',
>      'Chebyshev Coefficients (Wavelet (Blue)) [19]' ,
>      'Radon Coefficients (Green) [12]' ]
>
> feature_matrix = GenerateFeatureSpace( roi_id_list, desired_feature_list )
>
> A 4x4 Numpy matrix with the corresponding features would then be returned
> into feature_matrix. The feature street addresses don't have to be strings;
> they could be some map/dict/class where the components of the
> feature address are more structured. Or each feature street address could
> be composed of a bunch of tags.
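>
> For instance, a quick hypothetical parser for the string form above:
>
> import re
>
> def parse_address(addr):
>     """Split 'Algorithm (T1 (T2 (Channel))) [idx]' into its parts."""
>     m = re.match(r"([^(]+)\((.*)\)\s*\[(\d+)\]\s*$", addr)
>     chain = [p.strip(" )") for p in m.group(2).split("(")]
>     return {"algorithm": m.group(1).strip(),
>             "transforms": chain[:-1],  # outermost transform first
>             "channel": chain[-1],
>             "index": int(m.group(3))}
>
> parse_address('Zernike Coefficients (Fourier (Wavelet (Red))) [52]')
> # -> {'algorithm': 'Zernike Coefficients',
> #     'transforms': ['Fourier', 'Wavelet'],
> #     'channel': 'Red', 'index': 52}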
>
> It's possible that at query time the feature for the given image/ROI/sample
> hasn't even been calculated yet. The API would need to satisfy the request
> either from features stored in the database or calculated on the fly.
> BISQUE has functionality that works like this, called the Feature Service.
>
> Sorry for the long email. I'm excited to work with you all to move the
> ball forward on this project!
> Chris C.
>
> On Jul 19, 2013, at 5:35 PM, Jason Swedlow wrote:
>
> Hi All
>
> A quick plea not to drop this thread. Input from Ivan and Chris C. would
> be most welcome. These applications are very important and getting this API
> nailed down -- at least for a first draft -- would be hugely helpful.
>
> Cheers,
>
> Jason
>
>
> Jason Swedlow, PhD, FRSE
> Centre for Gene Regulation & Expression
> Open Microscopy Environment
> University of Dundee
> http://openmicroscopy.org
>
>
>
>
> Lee Kamentsky <leek at broadinstitute.org> wrote:
>
> Hi all,
> I think it's great that Bob Murphy's group has implemented pyslic and
> pyslid in an open-source framework like OMERO. It looks like a substantial
> body of work. I'm wondering what needs to be done to make it a
> general-purpose framework, however, especially looking at it from the
> perspective of our group's experience with CellProfiler. Also, Simon,
> thanks for moving this forward.
>
> My reading of the pyslic code is that it supports a nuclear stain and a
> protein stain and calculates a standard set of per-image and per-object
> features (although I haven't quite figured out the storage mechanism for
> the object features). This is adequate for a large class of experiments
> involving two-color fluorescently-labeled samples and it's likely the
> methods are robust, but our experience has been that experimental protocols
> can be more varied (multiple protein stains, brightfield images) and the
> biological questions can require additional image preprocessing to
> highlight the structures of interest, often requiring tuning parameters
> specific to the structure scale. Because of this, I think that the
> framework needs a modular architecture that supports development of new
> algorithms by computational researchers and configuration by the end users
> and it needs to extend beyond a curated code-base to allow for innovation.
> Personally, I'm really pleased that the framework is in Python
> because it aligns well with our group, but perhaps this is
> limiting for the ImageJ community and perhaps some portion of
> CellProfiler's bridge between Python and ImageJ could be adapted to supply
> the connection.
>
> I think that we do need a platform for innovation and the keys to that are
> interoperability, standards, and a model of the analysis that is flexible
> enough to describe our community's experiments and that captures the
> analysis protocol in a reproducible manner. I'm going to outline my
> perspective on the model here, drawing on our group's experience with
> CellProfiler, and try to keep it brief. I see the components of the model
> being:
>
> * Fields of view - N-dimensional spaces (X, Y, T, Z, spectral)
> representing an imaging site
> * Images - acquired image data on a field of view (with acquisition
> metadata) or similar data produced by algorithms such as filters or
> morphological operations.
> * Segmentations - definitions of multiple regions of interest on the fields
> view or on (hyper)planes of the fields of view
> * Relationships between segmented regions - links between segmented
> regions either within segmentations or across them. Examples might be
> time-lapse cell tracking, associations between nuclear and cellular
> segmentations or groupings of organelle segmentations within a cell.
> * Measurements - data computed on the images, segmentations and
> relationships within a field of view. My take on this is that a measurement
> produces a numeric feature value per image or per segmentation region, but
> perhaps that's too narrow.
> * Protocol - a description of how to perform the analysis. I think the key
> elements are a link to the OMERO screen and a list of the parameterized
> algorithms to be performed. The screen provides image inputs to the
> algorithms which are the available image acquisition channels and the
> algorithms themselves provide images, segmentations, relationships and
> measurements which can serve as inputs to other algorithms in the protocol.
> Algorithms will often be parameterizable by the user and these parameters
> should be captured by the protocol. Ideally, the protocol should capture
> the versions of the algorithms using a mechanism such as a Git hash. In
> CellProfiler, we have algorithms that produce an aggregated image based on
> samples from many fields of view, for instance an estimate of differences
> in signal magnitude across the field of view caused by non-uniform
> illumination - algorithms might have stacks of images as inputs and these
> stacks might span individual fields of view.
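>
> A bare-bones sketch of those components as Python types (all names mine,
> purely illustrative):
>
> from dataclasses import dataclass, field
> from typing import Dict, List
>
> @dataclass
> class FieldOfView:           # an N-dimensional imaging site
>     shape: Dict[str, int]    # e.g. {"X": 512, "Y": 512, "T": 30}
>
> @dataclass
> class Segmentation:          # labelled regions on a field of view
>     fov: FieldOfView
>     region_ids: List[int]
>
> @dataclass
> class Measurement:           # one numeric value per image or region
>     name: str
>     values: Dict[int, float]  # region id -> value
>
> @dataclass
> class Step:                  # one parameterized algorithm in a protocol
>     algorithm: str
>     version: str             # e.g. a git hash
>     params: Dict[str, str]
>     inputs: List[str]        # names of upstream outputs
>
> @dataclass
> class Protocol:
>     screen_id: int
>     steps: List[Step] = field(default_factory=list)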
>
> As far as the actual mechanics, I see OMERO or similar using the protocol
> as a dependency graph, fetching the algorithms using some
> community-standard mechanism (maven? pip?), providing inputs as specified
> by the protocol and harvesting the outputs for the database and for
> dependent algorithms. I have some detailed concerns about algorithm
> input/output introspection and discovery, but ImageJ 2.0's plugin
> introspection protocol (@parameter) is a good starting point (thanks ImageJ
> 2.0).
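>
> As a toy sketch of treating the protocol as a dependency graph (the step
> names are invented; graphlib is in the Python standard library from 3.9):
>
> from graphlib import TopologicalSorter
>
> # step -> the upstream steps whose outputs it consumes
> deps = {
>     "illumination_correction": set(),
>     "segmentation": {"illumination_correction"},
>     "nuclear_area": {"segmentation"},
> }
> for step in TopologicalSorter(deps).static_order():
>     print("run", step)  # fetch the algorithm, feed it upstream outputs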
>
> OK - somewhat CellProfiler-centric perhaps, but the nice thing about OMERO
> is that it is a relational database and the protocol is the thing itself -
> not a description of the experiment, but a mineable map of how each number
> is produced, especially if the protocol pieces are described relationally in
> the database. I think the above is an ambitious undertaking, but look at
> the result! Researchers can trade protocols which produce robust and
> comparable values (not just "nuclear area", but the nuclear area after
> illumination correction and segmentation using Otsu thresholding and a
> seeded watershed of HeLa cells stained with DAPI). Developers can publish
> their method in OMERO and possibly OMERO itself can generate citations
> based on a protocol, leading to better accreditation of our work. And OMERO
> itself becomes a sustainable platform for analysis with a well-defined
> interoperable API for image processing.
>
> Hope this all gives things a positive lift, thx for reading this far,
> --Lee
>
> On Fri, Jul 5, 2013 at 10:03 AM, Simon Li <s.p.li at dundee.ac.uk> wrote:
> Hi everyone
>
> It was great to see so many people interested in OMERO.searcher and
> WND-CHRM at the Paris meeting, both those who were interested in installing
> it on their own systems and also those of you who were interested in
> developing other analysis algorithms for use with OMERO.
>
> One of the main points that came up was that OMERO should provide a single
> API for storing and calculating image features. Robert Murphy's group at
> CMU have already developed PySLID [http://github.com/icaoberg/pyslid], a
> python module for calculating and storing features used with
> OMERO.searcher, so I'd like to propose we bring this into the
> openmicroscopy GitHub organisation, and rename it to OMERO.features (other
> suggestions are welcome).
> Then there's the much bigger task of modifying the module to cater for
> everyone's requirements. I can see several potential issues, including how
> we handle multiple channels, z-slices, timepoints, ROIs, etc., since
> features can be calculated for these individually or as a whole.
>
> If anyone has any thoughts or comments on what they'd like to see, it'd be
> great if you could share them with the rest of this list or, if you prefer,
> on our forums.
>
> Best wishes
>
> Simon
>
>
>
> Christopher Coletta
> Computer Scientist
> Image Informatics and Computational Biology Unit
> Laboratory of Genetics
> National Institute on Aging
> Biomedical Research Center
> 251 Bayview Boulevard, Room 10B125
> Baltimore, MD 21224
> Desk: 410-558-8170
> Cell: 617-943-9745
>
> _______________________________________________
> ome-devel mailing list
> ome-devel at lists.openmicroscopy.org.uk
> http://lists.openmicroscopy.org.uk/mailman/listinfo/ome-devel
>