
Validate & register flow cytometry data

!lamin init --storage ./test-flow --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 13:52:12)
✅ saved: Storage(id='H7Pxu11w', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow', type='local', updated_at=2023-08-28 13:52:12, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-flow
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"  # globally set species
✅ loaded instance: testuser1/test-flow (lamindb 0.51.0)
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 13:52:14, bionty_source_id='7Wa0', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0 readfcs==1.1.6
✅ saved: Transform(id='OWuTtS4SAponz8', name='Validate & register flow cytometry data', short_name='flow', version='0', type=notebook, updated_at=2023-08-28 13:52:14, created_by_id='DzTjkKse')
✅ saved: Run(id='SswraaPvvTqVHpKZo6iA', run_at=2023-08-28 13:52:14, transform_id='OWuTtS4SAponz8', created_by_id='DzTjkKse')

Transform

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

We start with a flow cytometry file from Alpert19:

ln.dev.datasets.file_fcs_alpert19(
    populate_registries=True,  # pre-populate registries to simulate a used instance
)


PosixPath('Alpert19.fcs')

Use readfcs to read the fcs file into memory:

adata = readfcs.read("Alpert19.fcs")
adata
AnnData object with n_obs × n_vars = 166537 × 40
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnR'
    uns: 'meta'

Validate

First, let’s validate the features in .var.

We’ll use the CellMarker reference to link features:

lb.CellMarker.validate(adata.var.index, "name");
💡 using global setting species = human
✅ 27 terms (67.50%) are validated for name
❗ 13 terms (32.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead, CD19, CD4, IgD, CD11b, CD14, CCR6, CCR7, PD-1

We see that many features aren’t validated. Let’s standardize the identifiers:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
💡 using global setting species = human
💡 standardized 35/40 terms
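
The interplay of validating and standardizing can be pictured with a toy registry. The sketch below is not lamindb's implementation: `TOY_REGISTRY`, `validate`, and `standardize` are hypothetical stand-ins, and the case-insensitive matching on names and synonyms is an assumption made purely for illustration.

```python
# Toy registry: canonical name -> known synonyms (hypothetical subset).
TOY_REGISTRY = {
    "PD1": ["PID1", "PD-1", "PD 1"],
    "Cd14": [],
    "Ccr7": [],
}


def validate(values):
    """Return one boolean per value: True if it is a canonical name."""
    return [value in TOY_REGISTRY for value in values]


def standardize(values):
    """Map synonyms and case variants to canonical names; keep the rest."""
    lookup = {}
    for name, synonyms in TOY_REGISTRY.items():
        for alias in [name, *synonyms]:
            lookup[alias.lower()] = name
    return [lookup.get(value.lower(), value) for value in values]


raw = ["PD-1", "CD14", "CCR7", "Time"]
print(validate(raw))           # no raw identifier is a canonical name
standardized = standardize(raw)
print(standardized)
print(validate(standardized))  # "Time" still fails: it is not a marker
```

Identifiers that are merely spelled differently pass validation after standardization, while entries such as "Time" stay non-validated because no registry record describes them.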

Now things look much better, but 5 features still fail validation against CellMarker and look more like metadata.

validated = lb.CellMarker.validate(adata.var.index, "name")
💡 using global setting species = human
✅ 35 terms (87.50%) are validated for name
❗ 5 terms (12.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead

Hence, let’s curate the AnnData a bit more.

Let’s move metadata (non-validated cell markers) into adata.obs:

adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()
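
The effect of splitting columns with a boolean mask can be sketched in plain Python. This is an illustration only: `split_columns` and the measurement values are hypothetical, and a dictionary of columns stands in for the AnnData matrix.

```python
def split_columns(table, mask):
    """Split a column-oriented table by a boolean mask.

    Returns (kept, moved): columns whose mask entry is True, and the rest.
    """
    if len(table) != len(mask):
        raise ValueError("mask needs one entry per column")
    kept, moved = {}, {}
    for (name, column), is_valid in zip(table.items(), mask):
        (kept if is_valid else moved)[name] = column
    return kept, moved


# Two cells (rows) measured in four channels (columns); values are made up.
measurements = {
    "CD3": [0.1, 2.3],
    "Time": [1.0, 2.0],
    "Cd4": [4.2, 0.0],
    "Dead": [0.0, 1.0],
}
validated_mask = [True, False, True, False]

markers, per_cell_metadata = split_columns(measurements, validated_mask)
print(list(markers))            # columns that stay in the marker panel
print(list(per_cell_metadata))  # columns that become per-cell metadata
```

Each moved column keeps one value per cell, which is why it fits into a per-observation table.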

Now we have a clean panel of 35 cell markers:

lb.CellMarker.validate(adata.var.index, "name");
💡 using global setting species = human
✅ 35 terms (100.00%) are validated for name

Next, let’s register the metadata features we moved to .obs:

# Feature.from_df creates feature records with the type auto-populated
features = ln.Feature.from_df(adata.obs)
ln.add(features)

In addition, we’d like to link this file with external features:

ln.Feature.validate("assay", "name")
lb.ExperimentalFactor.validate("FACS", "name");
✅ 1 term (100.00%) is validated for name
❗ 1 term (100.00%) is not validated for name: FACS

Since the term “FACS” isn’t validated, let’s search the ontology for it and register the match:

lb.ExperimentalFactor.bionty().search("FACS").head(2)
name                                 ontology_id  definition                                         synonyms                                           parents        molecule  instrument  measurement  __ratio__
fluorescence-activated cell sorting  EFO:0009108  A Flow Cytometry Assay That Provides A Method ...  FACS|FAC sorting                                   []             None      None        None         100.000000
acute chest syndrome                 EFO:0007129  A Vaso-Occlusive Crisis Of The Pulmonary Vascu...  ACS|Acute Chest Syndrome|acute chest syndrome|...  [EFO:0003818]  None      None        None         85.714286
facs = lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108")
facs.save()
✅ created 1 ExperimentalFactor record from Bionty matching ontology_id: EFO:0009108
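
The `__ratio__` column in the search result ranks candidate terms by string similarity. A minimal sketch of that idea follows, assuming a case-insensitive comparison against names and synonyms; `difflib` from the standard library stands in for whatever matcher the library uses, and `TERMS`, `similarity`, and `search` are hypothetical names.

```python
from difflib import SequenceMatcher

# Hypothetical two-term excerpt of an ontology: name -> synonyms.
TERMS = {
    "fluorescence-activated cell sorting": ["FACS", "FAC sorting"],
    "acute chest syndrome": ["ACS", "Acute Chest Syndrome"],
}


def similarity(a, b):
    """Case-insensitive similarity in percent (100 means identical)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query):
    """Rank terms by their best-matching name or synonym, best first."""
    scored = []
    for name, synonyms in TERMS.items():
        best = max(similarity(query, text) for text in [name, *synonyms])
        scored.append((name, best))
    return sorted(scored, key=lambda item: item[1], reverse=True)


results = search("FACS")
for name, score in results:
    print(f"{name}: {score:.2f}")
```

An exact synonym hit scores 100, while a near miss such as "ACS" scores lower, so the intended term ends up first. Picking the record by its ontology ID afterwards, as done above, avoids relying on the ranking alone.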

Adding a new modality:

modality = ln.Modality(name="protein", description="readouts of protein abundance")
modality.save()

Register

file = ln.File.from_anndata(adata, description="Alpert19", var_ref=lb.CellMarker.name)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/LhWKPCF6x4twXk0dR1x9.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    35 terms (100.00%) are validated for name
💡    using global setting species = human
✅    linked: FeatureSet(id='om9elFwRNETTAVa7DnUX', n=35, type='float', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    5 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='1xPw4lpxAKxmsIynZ7XY', n=5, registry='core.Feature', hash='Vg9TGnGYfiWe_xegabUX', modality_id='woWqSfKt', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'LhWKPCF6x4twXk0dR1x9' at '.lamindb/LhWKPCF6x4twXk0dR1x9.h5ad'
file.add_labels(facs, "assay")
file.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='OsHMgu4GieqPEzY535pG', n=1, registry='core.Feature', hash='ylt2e3mmGjTejCKrQxza', updated_at=2023-08-28 13:52:19, modality_id='woWqSfKt', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='OsHMgu4GieqPEzY535pG', n=1, registry='core.Feature', hash='ylt2e3mmGjTejCKrQxza', updated_at=2023-08-28 13:52:19, modality_id='woWqSfKt', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='U5vIJj85XYwRU1gkMKFl', n=2, registry='core.Feature', hash='h2fOf0lfmzWz0gCeP9OP', updated_at=2023-08-28 13:52:19, modality_id='woWqSfKt', created_by_id='DzTjkKse')
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
file.features
'var': FeatureSet(id='om9elFwRNETTAVa7DnUX', n=35, type='float', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', updated_at=2023-08-28 13:52:19, modality_id='OX1girsy', created_by_id='DzTjkKse')
'obs': FeatureSet(id='1xPw4lpxAKxmsIynZ7XY', n=5, registry='core.Feature', hash='Vg9TGnGYfiWe_xegabUX', updated_at=2023-08-28 13:52:18, modality_id='woWqSfKt', created_by_id='DzTjkKse')
'external': FeatureSet(id='U5vIJj85XYwRU1gkMKFl', n=2, registry='core.Feature', hash='h2fOf0lfmzWz0gCeP9OP', updated_at=2023-08-28 13:52:19, modality_id='woWqSfKt', created_by_id='DzTjkKse')

Check a few validated cell markers in .var:

file.features["var"].df().head(10)
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
lRZYuH929QDw  CD85j                         None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
CR7DAHxybgyi  CD38                          CD38         952           B4E006        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
fpPkjlGv15C9  Ccr6                          CCR6         1235          P51684        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse

Validate and register a second dataset

Let’s validate and register another flow file:

adata2 = readfcs.read(ln.dev.datasets.file_fcs())

We’d like to track all features in .var, so we register them:

adata2.var.index = lb.CellMarker.bionty().standardize(adata2.var.index)
💡 using global setting species = human
💡 standardized 14/16 terms
markers = lb.CellMarker.from_values(adata2.var.index, "name")
ln.save(markers)
💡 using global setting species = human
✅ loaded 10 CellMarker records matching name: CD3, CD28, CD8, Cd4, CD57, Cd14, Cd19, CD27, Ccr7, CD127
✅ created 4 CellMarker records from Bionty matching name: CCR5, CD45RO, Ki67, SSC-A
❗ did not create CellMarker records for 2 non-validated names: FSC-A, FSC-H

Standardize synonyms so that as many features as possible pass validation (FSC-A and FSC-H remain non-validated):

adata2.var.index = lb.CellMarker.standardize(adata2.var.index)
💡 using global setting species = human
💡 standardized 14/16 terms
lb.CellMarker.validate(adata2.var.index, "name");
💡 using global setting species = human
✅ 14 terms (87.50%) are validated for name
❗ 2 terms (12.50%) are not validated for name: FSC-A, FSC-H

file2 = ln.File.from_anndata(
    adata2, description="My fcs file", var_ref=lb.CellMarker.name
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/4p3ouvLoPTJXfcq6dEA5.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    14 terms (87.50%) are validated for name
❗    2 terms (12.50%) are not validated for name: FSC-A, FSC-H
💡    using global setting species = human
✅    linked: FeatureSet(id='FA0HDOn6RahAVIjs50iH', n=14, type='float', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', created_by_id='DzTjkKse')
file2.save()
✅ saved 1 feature set for slot: 'var'
✅ storing file '4p3ouvLoPTJXfcq6dEA5' at '.lamindb/4p3ouvLoPTJXfcq6dEA5.h5ad'
file2.add_labels(facs, "assay")
file2.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='BC2pHSmrRsm3Dr1i1KAM', n=1, registry='core.Feature', hash='ylt2e3mmGjTejCKrQxza', updated_at=2023-08-28 13:52:21, modality_id='woWqSfKt', created_by_id='DzTjkKse')
✅ loaded: FeatureSet(id='U5vIJj85XYwRU1gkMKFl', n=2, registry='core.Feature', hash='h2fOf0lfmzWz0gCeP9OP', updated_at=2023-08-28 13:52:19, modality_id='woWqSfKt', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='U5vIJj85XYwRU1gkMKFl', n=2, registry='core.Feature', hash='h2fOf0lfmzWz0gCeP9OP', updated_at=2023-08-28 13:52:21, modality_id='woWqSfKt', created_by_id='DzTjkKse')
var_feature_set = file2.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
file2.features
'var': FeatureSet(id='FA0HDOn6RahAVIjs50iH', n=14, type='float', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', updated_at=2023-08-28 13:52:21, modality_id='OX1girsy', created_by_id='DzTjkKse')
'external': FeatureSet(id='U5vIJj85XYwRU1gkMKFl', n=2, registry='core.Feature', hash='h2fOf0lfmzWz0gCeP9OP', updated_at=2023-08-28 13:52:21, modality_id='woWqSfKt', created_by_id='DzTjkKse')
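
In the log above, the 'external' feature set of the second file is loaded rather than created, and it carries the same hash as for the first file. One way such reuse could work is to fingerprint a feature set by its members, independent of order. The sketch below only illustrates that idea: how lamindb actually computes its hashes is not shown in this notebook, and `feature_set_hash`, `get_or_create`, and the use of SHA-256 are assumptions.

```python
import hashlib

# Hypothetical in-memory stand-in for a feature set registry: hash -> members.
_registry = {}


def feature_set_hash(feature_names):
    """Order-independent fingerprint of a collection of feature names."""
    canonical = "\n".join(sorted(set(feature_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:20]


def get_or_create(feature_names):
    """Return (hash, created); reuse the entry if the members are known."""
    key = feature_set_hash(feature_names)
    created = key not in _registry
    if created:
        _registry[key] = frozenset(feature_names)
    return key, created


first = get_or_create(["assay", "species"])
second = get_or_create(["species", "assay"])  # same members, other order
third = get_or_create(["assay"])
print(first, second, third)
```

With this scheme, two files annotated with the same features point at one shared feature set record instead of two duplicates.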
file2.view_lineage()
https://d33wubrfki0l68.cloudfront.net/449786ec05c70e6ce4bb0bf0cf827666b8624872/1f87d/_images/3c4e90337c6f12c9cd5d5916e798da3ed2772e899b80168d14d6b56842e249b3.svg

Query by cell markers

Which datasets have CD14 in the flow panel?

cell_markers = lb.CellMarker.lookup()
cell_markers.cd14
CellMarker(id='roEbL8zuLC5k', name='Cd14', synonyms='', gene_symbol='CD14', ncbi_gene_id='4695', uniprotkb_id='O43678', updated_at=2023-08-28 13:52:16, species_id='uHJU', bionty_source_id='i6sZ', created_by_id='DzTjkKse')
panels_with_cd14 = ln.FeatureSet.filter(cell_markers=cell_markers.cd14).all()
ln.File.filter(feature_sets__in=panels_with_cd14).df()
id                    storage_id  key   suffix  accessor  description  version  initial_version_id  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
4p3ouvLoPTJXfcq6dEA5  H7Pxu11w    None  .h5ad   AnnData   My fcs file  None     None                6876232   Cf4Fhfw_RDMtKd5amM6Gtw  md5        OWuTtS4SAponz8  SswraaPvvTqVHpKZo6iA  2023-08-28 13:52:21  DzTjkKse
LhWKPCF6x4twXk0dR1x9  H7Pxu11w    None  .h5ad   AnnData   Alpert19     None     None                33367624  14w5ElNsR_MqdiJtvnS1aw  md5        OWuTtS4SAponz8  SswraaPvvTqVHpKZo6iA  2023-08-28 13:52:18  DzTjkKse
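
The query above runs in two steps: first find the feature sets that contain the marker, then find the files linked to any of those feature sets. A plain-Python sketch of that lookup follows. The dictionaries are hypothetical stand-ins for the registry tables, the shortened IDs and panel excerpts are illustrative, and "unrelated file" does not exist in this notebook.

```python
# Hypothetical link table: feature set ID -> member feature names (excerpts).
feature_set_members = {
    "om9e": {"Cd14", "CD3", "CD20"},  # panel of the first file
    "FA0H": {"Cd14", "CD3", "CCR5"},  # panel of the second file
    "U5vI": {"assay", "species"},     # external features, no markers
}

# Hypothetical link table: file description -> linked feature set IDs.
file_feature_sets = {
    "Alpert19": {"om9e", "U5vI"},
    "My fcs file": {"FA0H", "U5vI"},
    "unrelated file": {"U5vI"},
}


def files_with_marker(marker):
    """Return the files whose feature sets contain the given marker."""
    # Step 1: feature sets that contain the marker.
    panels = {
        set_id
        for set_id, members in feature_set_members.items()
        if marker in members
    }
    # Step 2: files linked to at least one of those feature sets.
    return sorted(
        name for name, set_ids in file_feature_sets.items() if set_ids & panels
    )


print(files_with_marker("Cd14"))
print(files_with_marker("CCR5"))
```

Files that only share the external feature set are not returned, because that set contains no cell markers.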

Shared cell markers between two files:

files = ln.File.filter(feature_sets__in=panels_with_cd14, species__name="human").list()
file1, file2 = files[0], files[1]
file1_markers = file1.features["var"]
file2_markers = file2.features["var"]

shared_markers = file1_markers & file2_markers
shared_markers.list("name")
['CD8', 'CD57', 'CD127', 'Cd19', 'Cd4', 'CD28', 'CD3', 'Cd14', 'CD27', 'Ccr7']
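
The `&` operator above behaves like a set intersection on the two marker panels. The same result can be reproduced with plain Python sets. The names are copied from the registry table and log output on this page; treating the 35 records created first as the Alpert19 panel is an assumption.

```python
# Assumed Alpert19 panel: the 35 marker names registered first.
alpert19_panel = {
    "CD20", "PD1", "CD85j", "CD38", "Ccr6", "CD161", "Cd14", "DNA2", "CD27",
    "Ccr7", "HLADR", "CD86", "DNA1", "CD8", "CD94", "CD56", "CD16", "CD127",
    "CD45RA", "Cd19", "CD123", "CD3", "Igd", "CD28", "CXCR5", "TCRgd",
    "CD11B", "Cd4", "CXCR3", "CD25", "CD24", "CD57", "CD11c", "ICOS", "CD33",
}

# Panel of the second file: the 14 names that were loaded or created.
second_panel = {
    "CD3", "CD28", "CD8", "Cd4", "CD57", "Cd14", "Cd19", "CD27", "Ccr7",
    "CD127", "CCR5", "CD45RO", "Ki67", "SSC-A",
}

shared = alpert19_panel & second_panel          # markers in both panels
only_in_second = second_panel - alpert19_panel  # markers new in the second

print(sorted(shared))
print(sorted(only_in_second))
```

The set difference lists the markers that were newly created for the second file.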

Flow marker registry

Check out your CellMarker registry:

lb.CellMarker.filter().df()
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
lRZYuH929QDw  CD85j                         None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
CR7DAHxybgyi  CD38                          CD38         952           B4E006        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
fpPkjlGv15C9  Ccr6                          CCR6         1235          P51684        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
k0zGbSgZEX3q  HLADR   HLA‐DR|HLA-DR|HLA DR  None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
L0m6f7FPiDeg  CD86                          CD86         942           A8K632        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
YA5Ezh6SAy10  DNA1                          None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
ttBc0Fs01sYk  CD8                           CD8A         925           P01732        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
h4rkCALR5WfU  CD56                          NCAM1        4684          P13591        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
bspnQ0igku6c  CD16                          FCGR3A       2215          O75015        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
a624IeIqbchl  CD45RA                        None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
8OhpfB7wwV32  Cd19                          CD19         930           P15391        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
n40112OuX7Cq  CD123                         IL3RA        3563          P26951        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
a4hvNp34IYP0  CD3                           None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
0evamYEdmaoY  Igd                           None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
CLFUvJpioHoA  CD28                          CD28         940           B4E0L1        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
4uiPHmCPV5i1  CXCR5                         CXCR5        643           A0N0R2        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
HEK41hvaIazP  Cd4                           CD4          920           B4DT49        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
agQD0dEzuoNA  CXCR3                         CXCR3        2833          P49682        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
50v4SaR2m5zQ  CD25                          IL2RA        3559          P01589        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
gEfe8qTsIHl0  CD24                          CD24         100133941     B6EC88        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
Nb2sscq9cBcB  CD57                          B3GAT1       27087         Q9P2W7        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
L0WKZ3fufq0J  CD11c                         ITGAX        3687          P20702        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
0vAls2cmLKWq  ICOS                          ICOS         29851         Q53QY6        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
c3dZKHFOdllB  CD33                          CD33         945           P20138        uHJU        i6sZ              2023-08-28 13:52:16  DzTjkKse
XvpJ6oL3SG7w  CD45RO                        None         None          None          uHJU        i6sZ              2023-08-28 13:52:20  DzTjkKse
UMsp5g0fgMwY  CCR5                          CCR5         1234          P51681        uHJU        i6sZ              2023-08-28 13:52:20  DzTjkKse
VZBURNy04vBi  SSC-A   SSC A|SSCA            None         None          None          uHJU        i6sZ              2023-08-28 13:52:20  DzTjkKse
Qa4ozz9tyesQ  Ki67    Ki-67|KI 67           None         None          None          uHJU        i6sZ              2023-08-28 13:52:20  DzTjkKse
# a few tests
assert set(shared_markers.list("name")) == set(
    [
        "Ccr7",
        "CD3",
        "Cd14",
        "Cd19",
        "CD127",
        "CD27",
        "CD28",
        "CD8",
        "Cd4",
        "CD57",
    ]
)
ln.File.filter(feature_sets__in=panels_with_cd14).exists()
True
# clean up test instance
!lamin delete --force test-flow
!rm -r test-flow
💡 deleting instance testuser1/test-flow
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-flow.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow