[ome-devel] The Zen of OME: philosophical questions on data models and functionality

Harry Hochheiser hsh at nih.gov
Mon Dec 5 14:57:42 GMT 2005


A recent private discussion of use cases, user needs, and related  
issues has reached the point where it seemed appropriate to move it  
onto this list. I'm going to try to summarize a few points that
were raised and provide some new comments. Others, please let me know
if I've mischaracterized things.

Three main topics have been kicked around: appropriate strategies for  
data modeling, the handling of binary data, and the need for lower  
barriers to entry with OME, particularly with respect to analysis.

As I write this note, I realize that these topics all reflect the  
ongoing tension between idealized design and the dirty muck that is
reality.  OME was designed to have a fairly "pure" view of the world:  
even though the data model is extensible, the applications, data  
types, and analysis tools would be very well-defined and constructed  
to adhere to OME rules that would make things work together nicely.  
Unfortunately, the world is rarely so accommodating, and even less so  
in research. Thus, using OME presents some practical challenges,  
particularly for users who have not "drunk the kool-aid" and been  
completely convinced of the appropriateness of the OME approach.

The first manifestation of this tension that I'll mention is the
question of data modeling. OME's extensible data model allows users  
to create new semantic types, which can contain structured data as  
inputs to or outputs from analysis. Run-time import tools can be used  
to create new types when needed, making things in theory very flexible.

However, the ST model is, in some other ways, not so flexible. Once a  
type has been added, it can't be changed easily. Data can be added by  
defining new STs that refer to the original, but fields aren't easily  
removed.  In some sense, this means that STs are great for defining  
models that are already clearly understood and fairly mature. If the  
data modeling needs are not well understood, defining STs may be
premature.
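
To make the "add, but don't change" constraint concrete, here is a
minimal Python sketch. It is not OME code and does not use any OME API:
the TypeRegistry class, the field names ("Hours", "Age", "Method") and
the "DevelopmentalAgeSource" type are all hypothetical, chosen only to
illustrate extending a type by defining a second one that refers to it.

```python
# Hypothetical primitive field types for this toy model.
PRIMITIVES = {"string", "integer", "float"}


class TypeRegistry:
    """Toy registry: types can be added but never altered afterwards."""

    def __init__(self):
        self._types = {}

    def define(self, name, fields):
        # Once a type exists, its definition is frozen.
        if name in self._types:
            raise ValueError(f"{name} is already defined and cannot be changed")
        # A field is either a primitive or a reference to an existing type.
        for field_name, field_type in fields.items():
            if field_type not in PRIMITIVES and field_type not in self._types:
                raise ValueError(f"{field_name}: unknown type {field_type}")
        self._types[name] = dict(fields)

    def fields(self, name):
        return dict(self._types[name])


registry = TypeRegistry()
registry.define("DevelopmentalAge", {"Hours": "float"})

# Adding data later: a second type that points back at the first,
# rather than a change to the original definition.
registry.define(
    "DevelopmentalAgeSource",
    {"Age": "DevelopmentalAge", "Method": "string"},
)
```

In this sketch a second define() call for "DevelopmentalAge" raises an
error, which is the inflexibility described above: a field chosen too
early stays around.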

The category group/category model is a more lightweight approach.  
Having Category Groups and Categories defined as STs, a user can  
create a new CategoryGroup (instance of a CategoryGroup ST) and  
associate one or  more Category instances with that group. These  
Category instances can then be associated with images, with the  
presumed interpretation that categories in a category group are  
mutually exclusive.

This approach has the advantage of being drop-dead simple and, as it  
does not require any ST definition, it does not lead to any
"cluttering" of the data space with STs that might need to be
deprecated. New Categories are also easy to add at any time.  
(Certainly, this would also be true if a user defined one ST to act  
as a  "CategoryGroup" and another ST to define instances of that  
group, but this approach is less lightweight).
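
As a rough illustration of the category group/category idea, here is a
small Python sketch. It is only a model of the concept, not the OME
implementation: the class layout, the classify() method, and the
"Stage" group with its labels are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Category:
    """One label within a group; it carries a name and nothing else."""
    name: str


@dataclass
class CategoryGroup:
    """A named set of categories, treated as mutually exclusive."""
    name: str
    categories: dict = field(default_factory=dict)   # category name -> Category
    assignments: dict = field(default_factory=dict)  # image id -> Category

    def add_category(self, name):
        # New categories can be added at any time; no type definition needed.
        return self.categories.setdefault(name, Category(name))

    def classify(self, image_id, name):
        # Mutual exclusion: an image holds at most one category per group,
        # so a new classification replaces any earlier one.
        if name not in self.categories:
            raise KeyError(f"{name!r} is not a category in group {self.name!r}")
        self.assignments[image_id] = self.categories[name]


stage = CategoryGroup("Stage")
for label in ("Interphase", "Metaphase"):
    stage.add_category(label)

stage.classify(101, "Interphase")
stage.classify(101, "Metaphase")  # replaces the earlier classification
```

Note that Category has a name and no other fields, so there is nowhere
to hang extra information on it.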

There are, however, a couple of problems with this approach. Unlike  
more general STs, categories cannot have additional information
stored with them. Using categories computationally also runs the risk
of making things a bit less clear and rigorous. For example, if I have a
module that outputs an ST which is "DevelopmentalAge", I can see
from the module description what it's doing. However, if I have an  
output which is a "Category", I don't really know what's going on  
unless I look a bit more deeply.


So, where does this leave the poor developer or user? STs are good,  
but require fully defined, well-understood uses. Category Groups are
appealing and simple, but limited.

My take is that users should start with Category Groups and  
Categories where they work, and move to STs when they can: i.e., when  
data models are clearly understood.

How to make this distinction? Who knows? Data modeling is very hard.
Folks in the library community still have trouble defining seemingly  
simple concepts like "author". Despite all of the efforts of the  
semantic web community to rigorously model data on the web, it's  
arguable that the informal, "folksonomies" approach - which is very  
similar to category groups and categories - is having a larger  
impact.  These problems are not unique to OME, and we're not going to
be able to solve them.

Binary data raises similar questions: OME's approach is to
explain and document each bit of data, making it as easy to interpret  
as possible. Binary data is by definition opaque, requiring external  
knowledge to interpret it correctly. The preference would be,  
whenever possible,  to avoid the use of arbitrary binary data - I'm  
guessing that this argument is not too controversial. However, there  
might be times when the benefits of adding such data are significant.

My hunch is that the decision about such things should be based on a  
cost-benefit analysis involving factors such as the benefits gained,
the cost of implementation, the cost to the data model, and other  
questions. If there's a significant benefit, and things can be done  
easily without leading to problems with the data model, then sure,  
why not? I'd say that this is particularly true if we can do it via  
custom modules and STs that need not be part of "core" OME.

On the other hand, saying that OME shouldn't handle binary data   
because it somehow offends aesthetics or some sense of philosophical  
purity seems to me to be the wrong approach. This, however, is a  
straw man - I don't hear anybody making this argument.

As far as lowering barriers to entry, it's clear that this needs to  
happen. One thing that would help me would be a concise list of  
needs: what are the top items (ideally, prioritized) that would get
some new folks to jump in?  I think it might also help for us to come  
up with some scenarios that describe how people might
combine OME with their current work practices.

For example, users with ever-changing home-grown analyses might
simply want to define STs as needed to store data and then import  
external analysis results. If we could describe how to do this, and  
perhaps provide appropriate tools, that might provide them with a  
clear enough view of how to proceed.
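
One way to picture that scenario is the Python sketch below. It is an
illustration only and makes several assumptions: the results arrive as
a CSV table with an "ImageID" column, the ad hoc type is called
"NucleusStats", and import_results() is a hypothetical helper, not an
existing OME tool. It produces plain records and does not touch an OME
database.

```python
import csv
import io


def import_results(csv_text, type_name, fields):
    """Turn rows of an external results table into typed records.

    `fields` maps each column name to a converter such as int or float.
    The table is assumed to have an "ImageID" column.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"type": type_name, "image_id": int(row["ImageID"])}
        for column, convert in fields.items():
            # A missing column or bad value fails here, before anything is stored.
            record[column] = convert(row[column])
        records.append(record)
    return records


# Hypothetical output from a home-grown analysis.
sample = "ImageID,NucleusCount,MeanIntensity\n101,12,0.83\n102,7,0.41\n"

records = import_results(
    sample,
    "NucleusStats",
    {"NucleusCount": int, "MeanIntensity": float},
)
```

The field map plays the role of the ST definition here: when the
analysis changes, the user changes the map (or names a new type) and
re-imports.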

thoughts?


-harry



