The bench problem that wouldn’t go away

There’s a thing that happens in a tissue culture lab that I’ve never quite gotten used to, even after years of doing it. You pull a sample from a flask or a bioreactor, you load it onto a counting chamber, you put your eye to the microscope, and then you spend the next several minutes squinting at a grid and clicking a tally counter. If you’re being honest about it, the count you write down is half measurement, half judgement call. Were those two cells one cell that was dividing? Was that thing on the line in or out? You decide, you write a number, you move on.

Most of the time it doesn’t matter, because the decision you’re making doesn’t need much precision to begin with. But sometimes it does matter: when you’re deciding whether a flask is ready to passage, when you’re trying to compare two conditions, when you’re tracking a line over many weeks and you’d really like the noise floor of the measurement to be lower than the signal you’re chasing. And in those cases the gap between “what I just did with my eye and a clicker” and “what I would have measured with the flow cytometer or imaging cytometer my lab doesn’t have” gets uncomfortable.

I’ve been thinking about this problem on and off for about a year. The shape of the obvious solution has been there the whole time: take a picture, run a segmentation model, count the labels, done. The Cellpose project from the labs at HHMI Janelia has been doing exactly this in Python for years now, and the Cellpose-SAM release earlier this year is genuinely remarkable — it generalises across cell types, imaging modalities, and lighting conditions in a way that even six months ago I’d have said wasn’t possible without per-experiment fine-tuning. Watching it segment images you’ve taken yourself feels a bit like magic the first few times.

So why hasn’t this already become how everyone counts? Friction. The friction between the Python environment that runs the model and the bench where the image is taken. The friction between the laptop in the office and the microscope in the basement. The friction between “I have an image” and “I have a number I can put in the lab notebook,” when those two states are separated by half an hour of context switching, file shuffling, and copy-pasting. I have watched genuinely capable scientists abandon image-based workflows not because the tools didn’t work but because the tools didn’t fit into the actual rhythm of bench work, which is fast, gloved, and doesn’t have an Anaconda prompt open.

A few months ago I started building the Android app I’d been wanting for myself. The premise is small and stubborn: phone in hand, image in three taps, count on the screen in a few seconds, all of it done at the bench, never leave the room.

The hardest decision was the boring one: where does the model live

You can run a vision model in three places: in the cloud on a GPU you don’t own, in the cloud on a GPU you do own, or on the device. The Cellpose-SAM weights are roughly half a gigabyte and the architecture is a vision transformer; that’s a non-starter for a phone. So if I wanted Cellpose-SAM specifically, I was committed to a cloud round-trip. The public Hugging Face Space the Cellpose team maintains gives you free GPU minutes via Zero-GPU, with rate limits, and you can duplicate that Space into your own account for higher limits at your own cost. Both of those are real options.

But the friction I’m trying to eliminate is mostly network friction. The basement microscope rooms I have in mind have notoriously bad WiFi, and even when the connection works, the round-trip — upload, queue, GPU run, download — is the kind of thing that makes people give up after the first two attempts. Cloud-only was going to feel like every other Python pipeline they’d already abandoned, just with a slightly nicer UI.

So I split the difference. The default mode runs the smaller Cellpose 3 cyto3 model directly on the device. Cyto3 is a U-Net with about 25 million parameters, and after exporting it to ONNX and quantising the weights to FP16 you end up with a model file that fits in 26 MB. The Android app downloads that file once on first launch from a Hugging Face Hub repo I’m hosting, verifies the SHA-256, caches it, and never touches the network again unless the user explicitly asks for cloud mode. ONNX Runtime’s Android build can hand supported operators to the phone’s Neural Networks API (NNAPI) once that execution provider is enabled in the session options, which routes the heavy lifting onto whatever NPU, GPU, or DSP the device has. On a Pixel-7-class phone the inference itself is well under a second.
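The download-verify-cache step is simple enough to sketch. What follows is a minimal illustration in Python (the language of the reference implementation), not the app's Kotlin code; the function names are hypothetical, and the pinned digest is assumed to ship with the app.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory flat even for a multi-megabyte model file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cached_model_valid(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and its digest matches the pinned value."""
    if not path.is_file():
        return False
    return sha256_of_file(path) == expected_sha256.lower()
```

A caller would run `is_cached_model_valid` before opening an inference session and re-download only when it returns `False`, which covers both a missing file and a truncated or corrupted one.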

The opt-in cloud mode hands the work over to Cellpose-SAM on Hugging Face for the harder images. Same UI, same parameters, same metrics, same export. You pick the mode per run with a small chip near the Run button. The app falls back to local mode automatically if you go offline.

This split has a second nice property: parameter sweeps become free. Cellpose’s segmentation has four knobs — image resize, iteration count, flow threshold, and cell-probability threshold — and figuring out the right values for a particular cell type and imaging setup is iterative. With a cloud backend, ten attempts at tuning means ten round-trips and ten chunks of someone’s GPU quota. With a local backend, ten attempts is a coffee.
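To make the sweep idea concrete, here is a toy sketch in Python. It is not Cellpose's mask computation: `count_blobs` is a hypothetical stand-in that thresholds a made-up probability map and counts 4-connected regions. It only shows how a count can move as one knob changes.

```python
from collections import deque


def count_blobs(prob, threshold):
    """Count 4-connected regions of pixels whose value exceeds threshold."""
    height, width = len(prob), len(prob[0])
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x] or prob[y][x] <= threshold:
                continue
            blobs += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and not seen[ny][nx] and prob[ny][nx] > threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs


# Made-up map: two bright blobs joined by a dim bridge, plus one faint blob.
TOY_PROB = [
    [0.9, 0.9, 0.4, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.4, 0.9, 0.9, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3],
]


def sweep(prob, thresholds):
    """Run the stand-in counter once per threshold value."""
    return {t: count_blobs(prob, t) for t in thresholds}


print(sweep(TOY_PROB, [0.2, 0.35, 0.5]))  # {0.2: 2, 0.35: 1, 0.5: 2}
```

The faint blob drops out at 0.35 and the bridge breaks at 0.5, so the count goes 2, 1, 2 across three settings. On a local backend each of those attempts costs a fraction of a second rather than a round-trip.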

The model isn’t the clever bit; the post-processing is

Most of the engineering hours I’ve put into this haven’t been on the neural network. The ONNX export was a one-afternoon job — torch.onnx.export with dynamic axes for the height and width dimensions, a quick FP16 conversion using onnxconverter-common, a numerical equivalence check against the PyTorch reference (max absolute difference under 5e-3 for FP16, well within tolerance), upload to Hub, done. ONNX Runtime is mature and the Android integration is, if not exactly fun, at least well-trodden.
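The equivalence check has a simple shape: compare two sets of numbers element-wise and fail if the largest gap exceeds the tolerance. The sketch below uses only the Python standard library. The values are made up, and rounding them through half precision with `struct` stands in for the real comparison, which would use PyTorch and ONNX Runtime outputs on the same input.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def check_equivalent(reference, candidate, tol=5e-3):
    """Raise if any element differs from the reference by tol or more."""
    worst = max_abs_diff(reference, candidate)
    if worst >= tol:
        raise AssertionError(f"max abs diff {worst:.2e} exceeds {tol:.0e}")
    return worst


# Stand-in values; in the real check these would be network outputs.
reference = [0.0, 0.1, -0.7312, 1.5, 3.14159, -6.25]
candidate = [to_fp16(v) for v in reference]
worst = check_equivalent(reference, candidate)
```

Half precision keeps 10 mantissa bits, so the rounding error grows with magnitude. A fixed absolute tolerance like 5e-3 is therefore only meaningful for outputs of modest size, which is worth confirming for the tensors actually being compared.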

The actually interesting part is what Cellpose does after the network. The network’s output isn’t a segmentation mask. It’s two things: a per-pixel “cellprob” map that says how confident the model is that this pixel belongs to any cell, and a per-pixel flow field — basically a vector arrow at every pixel that points toward the centre of whichever cell that pixel belongs to. To turn those two outputs into actual instance labels (cell #1, cell #2, cell #3…), you have to do gradient ascent on the flow field, in parallel for every foreground pixel, until each pixel has converged to its cell’s centre. Then pixels that converge to the same centre belong to the same cell. The reference Python implementation in cellpose/dynamics.py is around 100 lines of NumPy and SciPy.

In Kotlin, with no NumPy, on a phone, the same logic looks roughly like this:

for each iteration in 1..200:
    for each foreground pixel (y, x):
        dy, dx = bilinear_lookup(flow_field, y, x)
        y, x = clamp(y + dy, x + dx)
    early-stop if movement is below threshold

cluster pixels by final (y, x) bucket → integer label image
filter clusters smaller than min_size
compact labels to 1..n
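For readers who want to run the idea, here is a small pure-Python sketch of the same loop on a synthetic flow field. It is neither the reference implementation nor the Kotlin port. The names, the toy flow (every pixel steps halfway toward the nearer of two centres), the `floor(p + 0.5)` bucketing rule, and `min_size=2` are all assumptions for illustration.

```python
import math


def bilinear(field, y, x):
    """Sample a 2-D grid at fractional (y, x), clamping to the borders."""
    height, width = len(field), len(field[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = field[y0][x0] * (1 - wx) + field[y0][x1] * wx
    bottom = field[y1][x0] * (1 - wx) + field[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def follow_flows(flow_y, flow_x, foreground, n_iter=200, tol=1e-3):
    """Move every foreground pixel along the flow until steps fall below tol."""
    height, width = len(flow_y), len(flow_y[0])
    positions = {(y, x): (float(y), float(x))
                 for y in range(height) for x in range(width) if foreground[y][x]}
    for _ in range(n_iter):
        largest_step = 0.0
        for key, (py, px) in list(positions.items()):
            ny = min(max(py + bilinear(flow_y, py, px), 0.0), height - 1.0)
            nx = min(max(px + bilinear(flow_x, py, px), 0.0), width - 1.0)
            largest_step = max(largest_step, abs(ny - py), abs(nx - px))
            positions[key] = (ny, nx)
        if largest_step < tol:  # early stop once nothing is moving
            break
    return positions


def label_by_endpoint(positions, shape, min_size=2):
    """Group pixels by rounded end position, drop tiny groups, label 1..n."""
    height, width = shape
    buckets = {}
    for pixel, (py, px) in positions.items():
        # floor(p + 0.5) is an explicit tie rule; Python's round() rounds half to even.
        bucket = (math.floor(py + 0.5), math.floor(px + 0.5))
        buckets.setdefault(bucket, []).append(pixel)
    labels = [[0] * width for _ in range(height)]
    next_label = 1
    for _, pixels in sorted(buckets.items()):
        if len(pixels) < min_size:
            continue
        for y, x in pixels:
            labels[y][x] = next_label
        next_label += 1
    return labels


# Synthetic 3x8 flow: columns 0-3 are pulled to (1, 1), columns 4-7 to (1, 6).
H, W = 3, 8
flow_y = [[0.5 * (1 - y) for _ in range(W)] for y in range(H)]
flow_x = [[0.5 * ((1 if x < 4 else 6) - x) for x in range(W)] for _ in range(H)]
foreground = [[True] * W for _ in range(H)]

labels = label_by_endpoint(follow_flows(flow_y, flow_x, foreground), (H, W))
for row in labels:
    print(row)  # each row prints [1, 1, 1, 1, 2, 2, 2, 2]
```

The dictionary of tuples is exactly the kind of per-pixel allocation a phone implementation has to avoid. It is used here only because it keeps the sketch short.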

That’s a few hundred lines of Kotlin once you’re done with the bookkeeping, and most of those lines are about not allocating in the inner loop. The first version I wrote allocated a tiny Pair<Float, Float> object per bilinear lookup and ran at about a tenth the speed it should have; switching to two FloatArray parameters that get filled in place was a 10× win on its own. Profiling on a real device with Android Studio’s CPU profiler, then chasing the green-to-red boxes, then profiling again, has been a meditative exercise.

The verification is what actually keeps me up. There’s a real risk in porting numerical code from one language to another — you can write something that looks like the original, runs without errors, produces vaguely sensible output, and is actually subtly wrong. Cellpose has been used in enough labs that I’m not going to ship something that disagrees with the reference. So I generated test fixtures from the reference Python implementation: a handful of synthetic images, the network’s flow output for each one, and the corresponding labels the Python code produces from those flows. The Kotlin pipeline runs against the same flow inputs and has to produce labels with IoU ≥ 0.99 against the Python reference. Below that threshold I treat it as a bug, not a “rounding difference.” Last week I spent three days driving the IoU from 0.91 to 0.97 to 0.998-something, mostly by being more careful about how the bilinear interpolation handles the borders of the image and how the final clustering step buckets pixels whose final positions are exactly between two integer coordinates.
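One plausible way to phrase that acceptance test, sketched in Python: match each reference cell to the candidate label it overlaps most, compute intersection over union, and require every cell to clear the threshold. The matching rule and the function names are assumptions; the post does not spell out the exact IoU definition used.

```python
from collections import Counter


def per_cell_iou(reference, candidate):
    """Map each reference label to the IoU of its best-overlapping candidate label."""
    ref_area, cand_area, overlap = Counter(), Counter(), Counter()
    for ref_row, cand_row in zip(reference, candidate):
        for r, c in zip(ref_row, cand_row):
            if r:
                ref_area[r] += 1
            if c:
                cand_area[c] += 1
            if r and c:
                overlap[(r, c)] += 1
    best = {label: 0.0 for label in ref_area}
    for (r, c), inter in overlap.items():
        union = ref_area[r] + cand_area[c] - inter
        best[r] = max(best[r], inter / union)
    return best


def matches_reference(reference, candidate, min_iou=0.99):
    """True if both images have the same cell count and every cell clears min_iou."""
    scores = per_cell_iou(reference, candidate)
    candidate_labels = {c for row in candidate for c in row if c}
    if len(scores) != len(candidate_labels):
        return False
    return all(score >= min_iou for score in scores.values())


reference = [[1, 1, 0, 2],
             [1, 1, 0, 2]]
renumbered = [[2, 2, 0, 1],
              [2, 2, 0, 1]]   # same cells, different label ids
one_pixel_off = [[1, 1, 0, 2],
                 [1, 0, 0, 2]]

print(matches_reference(reference, renumbered))      # True
print(matches_reference(reference, one_pixel_off))   # False
print(per_cell_iou(reference, one_pixel_off))        # {1: 0.75, 2: 1.0}
```

Matching by overlap rather than by label id matters because two correct implementations can number the same cells in a different order.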

Trade-offs I made and the ones I deliberately didn’t

A few engineering choices worth being upfront about, because someone reading this and thinking about a similar project will hit the same forks in the road.

I quantised to FP16 rather than INT8. INT8 would have made the model file smaller (~13 MB instead of 26 MB) and the inference faster on devices with INT8 NPU paths, but cell counting is genuinely sensitive to small numerical changes — a cell at the cellprob threshold disappears or splits, and your count moves by one. INT8 quantisation needs careful calibration on representative images to be safe, and that’s a separate research project. FP16 is essentially lossless for this network. INT8 is a v2 conversation.

I wrote the post-processor in pure Kotlin rather than C++ via the NDK. NDK would be 2–3× faster, but it doubles the build complexity, requires shipping .so files for each ABI, and makes debugging worse. For a 1024×1024 image with around 500 cells, the pure Kotlin version takes 2–4 seconds on a Pixel-7-class device. That’s well inside the budget I set for myself. If usage grows and people want bigger images or more cells, NDK is also a v2 conversation.

I’m hosting the cyto3 ONNX file on my own Hugging Face Hub account rather than bundling it inside the APK. This keeps the APK small (under 45 MB including ONNX Runtime and all of Compose), lets me push model updates without forcing a full app update through the Play Store review pipeline, and gives the user the option to skip the download if they’re network-constrained. The cost is one extra step on first launch, which I think is fair.

I deliberately did not implement on-device Cellpose-SAM. I’m not aware of a maintained TFLite or ExecuTorch port of Cellpose-SAM, and porting a vision transformer of that scale myself would be a four-week side quest with uncertain returns. For this particular bench workflow, the hybrid design (local by default, cloud by opt-in) keeps the advantages of both pure-cloud and pure-local, and it sidesteps the on-device-SAM problem entirely.

The licensing chapter, which surprised me

This is the part of the project I went into most cautiously. Cellpose-SAM’s training data is licensed CC-BY-NC, which is the kind of thing that, read pessimistically, could rule out a public app store listing entirely. I sent a polite email to the upstream authors — Marius Pachitariu and Carsen Stringer at Janelia — explaining what I wanted to build, that the app would be free, ad-free, open-source under MIT, and that I’d attribute their work prominently. I asked specifically about both the cloud client and the cyto3 ONNX redistribution.

Instead of the long silence and cautious response I was expecting, I got the kindest reply in two sentences: yes, attribution and licence propagation are the only things they require. Janelia’s IP director followed up within a few hours to clarify that the upstream chain — back to Meta’s original SAM weights — is fully permissive and allows commercial applications. The CC-BY-NC notice on the upstream README, it turned out, applies to the training data, not to the resulting weights.

It was a small reminder of why open scientific software is a different kind of thing from open commercial software. People build these tools because they want them used, and when you ask honestly and offer attribution generously, they make it easy. I’ve been around the open-source world long enough to know this isn’t always how it goes. When it does, it’s worth saying out loud. I’m including the full attribution and licence-propagation language in the in-app credits, the GitHub README, and the Hugging Face model card.

What’s next

The app is in the build phase now and a few weeks of polish away from a closed-test release. I’ll write more as the milestones land — the model conversion in detail, the on-device pipeline with profiling numbers across a range of devices, the metrics and exports I’m computing on the segmentation output, the bench testing once it’s in real hands, the messy reality of submitting an Android app to the Play Store as an individual developer for the first time. None of these are problems I’d characterise as solved; they are problems I am working on, in the open, and that, more than any particular result, is what I wanted to start writing about.

If your bench has a count problem, I’d love to hear what shape it takes. The thing about building tools for yourself is that you find out very quickly which parts of the design were generalisable and which were just your own specific frustrations. Mine, it turns out, are pretty common — but I expect there are corners I haven’t thought about, and I’d like to.

By Kemal
