""" .. The goal of this example is to show how Markers and SNPMarkersSet are defined. We will assume that, as explained in FIXME, we already have a set of individuals in omero and that they are enrolled in 'TEST01'. For each individual, we have a sample of DNA in a well of a work plate, and we assume that we have run a series of TaqMan genotyping assays that associates to each well a collection of genotyping values. This is, of course, a simplistic model but it should be enough to illustrate the procedure. **Note:** DO NOT run this examples against a production database. Also, what will be shown here is supposed to be didactically clear, not efficient. See the implementation of the importer tools for more efficient solutions. First, as usual, we open the connection to the KnowledgeBase """ from bl.vl.kb import KnowledgeBase as KB import numpy as np import os OME_HOST = os.getenv('OME_HOST', 'localhost') OME_USER = os.getenv('OME_USER', 'test') OME_PASSWD = os.getenv('OME_PASSWD', 'test') kb = KB(driver='omero')(OME_HOST, OME_USER, OME_PASSWD) """ .. The following is the collection of (fake) markers used in the TaqMan assays. """ taq_man_markers = [ ('A001', 'xrs122652', 'TCACTTCTTCAAAGCT[A/G]AGCTACAAGCATTATT'), ('A002', 'xrs741592', 'GGAAGGAAGAAATAAA[C/G]CAGCACTATGTCTGGC'), ('A003', 'xrs807079', 'CCGACCTAGTAGGCAA[A/G]TAGACACTGAGGCTGA'), ('A004', 'xrs567736', 'AGGTCTATGTTAATAC[A/G]GAATCAGTTTCTCACC'), ('A005', 'xrs4693427', 'AGATTACCATGCAGGA[A/T]CTGTTCTGAGATTAGC'), ('A006', 'xrs4757019', 'TCTACCTCTGTGACTA[C/G]AAGTGTTCTTTTATTT'), ('A007', 'xrs7958813', 'AAGGCAATACTGTTCA[C/T]ATTGTATGGAAAGAAG') ] """ .. The first element of a marker defining tuple is its label, the second is the dbSNP db label, if available, while the third is the marker mask. .. todo:: put a reference to reference documentation Now we will load the markers set definition into Omero.biobank. **Note:** We are considering an ideal case where none of the markers is already in the db. """ study = kb.get_study('TEST01') action = kb.create_an_action(study, doc='importing markers') source, context, release = 'foobar', 'fooctx', 'foorel' lvs = kb.create_markers(source, context, release, taq_man_markers, action) """ .. where lvs is a list of (label, vid) tuples. We can assume that the markers above have been alligned against a reference genome, say fake18, and save in omero.biobank the alignment information. We will fake the alignment results as follows. """ aligns = [(t[1], 1, i*1000, True, 'A' if (i%2)== 0 else 'B', 1) for i, v in enumerate(lvs)] ref_genome = 'fake19' kb.save_snp_markers_alignments(ref_genome, aligns, action) """ .. We can now create an actual markers set. """ taq_man_set = [ (t[1], i, False) for i, t in enumerate(lvs)] label, maker, model, release = 'FakeTaqSet01', 'CRS4', 'TaqManSet', '23/09/2011' mset = kb.create_snp_markers_set(label, maker, model, release, taq_man_set, action) """ ... The function below will provide us with the data needed to fake the results of a genotyping assays on the given SNPMarkersSet. As discussed in XXXX, we need a numpy array of shape '''(2, N)''', where '''N''' is the number of markers, that represents the probability of being homozygous on allele A and allele B, and a numpy array of shape '''(N,)''' that represent the relative confidence on the measurements. .. todo:: write a general discussion on the floating point approach to genotyping data storage. """ def make_fake_data(mset): n = len(mset) probs = 0.5 * np.cast[np.float32](np.random.random((2, n))) confs = np.cast[np.float32](np.random.random(n)) return probs, confs """ .. Now that we have the SNPMarkersSet, we can proceed with the definition of the datasets. We will consider a minimalistic model, where we will not keep track of the Tube, PlateWell, ... chain that goes from the individual to the experiment, but we directly link the dataset to the individuals. To register this dependency and how the dependency was established, each DataSample object is linked with an action to the the relevant Individual instance. Mainly for historical reasons, the action is though as an arrow that starts from the 'newer' object, the GenotypeDataSample instance in this case, and has as a 'target' the object on which the process operated. It is, essentially, a representation of the 'inverse' of the creation process. .. todo:: uhmmm, not sure that the Action explanation above is crystal clear :-). """ mset.load_markers() data_sample_by_id = {} family = [] for i, ind in enumerate(kb.get_individuals(study)): family.append(ind) action = kb.create_an_action(study, target=ind, doc='fake dataset') conf = {'label' : 'taq-%03d' % i, 'status' : kb.DataSampleStatus.USABLE, 'action' : action, 'snpMarkersSet' : mset} data_sample = kb.factory.create(kb.GenotypeDataSample, conf).save() probs, conf = make_fake_data(mset) do = kb.add_gdo_data_object(action, data_sample, probs, conf) data_sample_by_id[ind.id] = data_sample """ .. Note how we first create a DataSample object (GenotypeDataSample) which basically keeps track of the fact that there exists a genotyping data set defined on a given snp markers set, and then we provide an actual data object that describes the physical object that contains the real data. The idea is that there could be many instances, data equivalent, that link to the same DataSample, e.g., on different file systems, in different formats and so on. """ """ .. As an example, we will now write out the information we have just saved as a plink pedfile. """ from bl.vl.genotype.io import PedWriter pw = PedWriter(mset, base_path="./foo") pw.write_map() family_label = study.label pw.write_family(family_label, family, data_sample_by_id) pw.close()