I Built a Single-Cell RNA-seq Pipeline That Writes Its Own Report

Single-cell RNA sequencing is one of those techniques that generates a huge amount of data and then demands a huge amount of effort to interpret it. You run the sequencer, get a matrix of 20,000 genes by several thousand cells, and then spend the next several days (or weeks) figuring out what’s in there. I wanted to change that, at least for the specific problem I keep coming back to: understanding the immune landscape inside solid tumours.


The result is TIL-Scope (github.com/lynchaos/tilscope), an open-source Python tool that takes a tumour microenvironment (TME) scRNA-seq dataset and, in one command, produces a self-contained HTML report with automated QC, clustering, cell type annotation, a T-cell exhaustion score, and a written summary you can actually share.


What problem does it solve?

Every time I get a new TME dataset, the analysis is roughly the same: filter out low-quality cells, normalise counts, find highly variable genes, reduce dimensionality, cluster, annotate clusters against canonical markers, and calculate an exhaustion score for the CD8⁺ T-cell compartment. The first time you build this pipeline it takes a week. The fifth time it’s still two days of copy-paste debugging.
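To make the annotation and scoring steps concrete, here is a minimal sketch in plain Python. It is not TIL-Scope's actual code: the marker panels, the exhaustion gene list, and the function names `signature_score` and `annotate_cluster` are assumptions chosen for illustration, and the score is a simple mean over signature genes rather than whatever method the tool uses.

```python
from statistics import mean

# Illustrative marker panels (an assumption, not TIL-Scope's real panels).
CANONICAL_MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
    "NK cell": ["NKG7", "GNLY"],
    "Macrophage": ["CD68", "CD14"],
}

# Commonly cited exhaustion-associated genes (illustrative signature).
EXHAUSTION_GENES = ["PDCD1", "CTLA4", "LAG3", "HAVCR2", "TIGIT"]


def signature_score(expression, genes):
    """Mean expression over the signature genes present in `expression`.

    `expression` maps gene symbol -> normalised expression value.
    Returns 0.0 when none of the signature genes were measured.
    """
    present = [expression[g] for g in genes if g in expression]
    return mean(present) if present else 0.0


def annotate_cluster(cluster_means):
    """Pick the cell type whose marker panel scores highest for a cluster."""
    scores = {
        label: signature_score(cluster_means, genes)
        for label, genes in CANONICAL_MARKERS.items()
    }
    return max(scores, key=scores.get)


# Hypothetical mean expression values for one cluster.
cluster = {"CD3D": 2.5, "CD3E": 2.0, "MS4A1": 0.1, "CD68": 0.0,
           "PDCD1": 1.2, "LAG3": 0.8}
label = annotate_cluster(cluster)
exhaustion = signature_score(cluster, EXHAUSTION_GENES)
```

In a real pipeline the same scoring would run on a normalised expression matrix, and the exhaustion score would be restricted to the CD8⁺ T-cell compartment.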

TIL-Scope packages that entire workflow into a single installable tool. One command:

tilscope run --input my_tme.h5ad --out report.html

You get a report with all the figures, the QC thresholds that were chosen (and *why*: they come from MAD-based outlier detection, not hard-coded cutoffs), the cluster annotations, the exhaustion scores, and a written narrative section.
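For readers who haven't met MAD-based filtering: the idea is to flag cells whose QC metric sits more than a few median absolute deviations from the median, so the threshold adapts to each dataset. A minimal sketch follows, assuming a cutoff of 3 MADs and a plain list of per-cell gene counts; the function names and the cutoff are illustrative and not taken from TIL-Scope.

```python
from statistics import median


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at `n_mads` MADs around the median."""
    med = median(values)
    # Median absolute deviation: the median distance from the median.
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad


def flag_outliers(values, n_mads=3.0):
    """Return one boolean per value: True if it falls outside the bounds."""
    lower, upper = mad_bounds(values, n_mads)
    return [v < lower or v > upper for v in values]


# Hypothetical genes-detected-per-cell values: one near-empty droplet (150)
# and one likely doublet (9800) among otherwise typical cells.
genes_per_cell = [2100, 2300, 1900, 2200, 2050, 150, 2400, 9800]
bounds = mad_bounds(genes_per_cell)
flags = flag_outliers(genes_per_cell)
```

Here the median is 2150 and the MAD is 200, so the bounds are 1550 and 2750 and only the two extreme cells are flagged. Because both statistics are medians, those extreme cells barely move the thresholds, which is the main argument for MAD over mean and standard deviation.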

The LLM part

The narrative is the bit I’m most interested in. Computational biologists spend a lot of time translating numbers into sentences for collaborators. I built narrative.py to do that automatically: it takes the structured analysis results (cluster sizes, marker gene enrichments, exhaustion score distributions) as a JSON payload, sends it to Claude with a tight system prompt that forbids hallucinating numbers, and gets back a two-to-three-paragraph executive summary.

The design principle I cared about here was graceful degradation. If you don’t have an API key, or if the API call fails, the tool falls back to a deterministic template-based summary built from the same data. It never breaks. The report always contains a narrative, and the only way to tell whether it was LLM-generated or templated is to check the narrative_source field in the JSON.
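The fallback pattern can be sketched in a few lines. This is an illustration and not the contents of narrative.py: the `llm_call` parameter, the shape of `results`, and the "llm" / "template" values for narrative_source are all assumptions. The LLM client is passed in as a plain callable so the sketch runs without any API key.

```python
import json


def template_summary(results):
    """Deterministic fallback built only from numbers in `results`."""
    clusters = results["clusters"]
    largest = max(clusters, key=lambda c: c["n_cells"])
    return (
        f"The dataset contains {results['n_cells']} cells in "
        f"{len(clusters)} clusters. The largest cluster is "
        f"{largest['label']} with {largest['n_cells']} cells."
    )


def build_narrative(results, llm_call=None):
    """Try the LLM first; fall back to the template on any failure.

    `llm_call` is a hypothetical callable that takes the JSON payload as a
    string and returns summary text. Pass None when no API key is set.
    """
    if llm_call is not None:
        try:
            text = llm_call(json.dumps(results))
            if text and text.strip():
                return {"narrative": text.strip(), "narrative_source": "llm"}
        except Exception:
            pass  # Any API error falls through to the template below.
    return {
        "narrative": template_summary(results),
        "narrative_source": "template",
    }


def failing_call(payload):
    """Stand-in for an API client whose request times out."""
    raise TimeoutError("simulated API failure")


# Hypothetical analysis results.
results = {
    "n_cells": 1200,
    "clusters": [
        {"label": "T cell", "n_cells": 700},
        {"label": "Macrophage", "n_cells": 500},
    ],
}
report = build_narrative(results, llm_call=failing_call)
```

Because the template reads from the same `results` payload, the two paths can never disagree on the numbers, only on the wording.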

It runs on real data too

The demo uses synthetic TME data (useful for CI: no external downloads, byte-identical output every run), but I also included a download helper for the Tirosh et al. 2016 melanoma dataset from GEO. It’s 4,645 single cells from 19 metastatic melanoma tumours, a well-characterised benchmark with T cells, B cells, NK cells, macrophages, and malignant cells all present. The script downloads the raw GEO file and converts it to h5ad, and you’re running the real pipeline in about 30 seconds.

python scripts/download_tirosh2016.py

tilscope run --input data/tirosh2016_melanoma.h5ad --out report_real.html

Is it production-ready?

No, and I’m not pretending it is. It’s a compact, end-to-end demonstration of what AI-first computational biology tooling could look like: the kind of thing a wet-lab scientist can run unattended and a reviewer can read in two minutes. The annotation is unsupervised (it never sees ground truth), and on the synthetic demo it recovers the planted populations with an adjusted Rand index around 0.96, which is good enough to trust for exploration.
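If the adjusted Rand index (ARI) is unfamiliar: it compares two labelings of the same cells by counting the pairs they group together, corrected for chance, so 1.0 means identical partitions and values near 0 mean no better than random. The sketch below is a plain-Python implementation of the standard formula for illustration; it is not how TIL-Scope computes the figure above, and in practice a library routine such as scikit-learn's `adjusted_rand_score` would be the usual choice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # No pairs to disagree on.

    # Pairs of items that share a label in both, in truth, and in prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = in_true * in_pred / comb(n, 2)  # Agreement expected by chance.
    max_index = (in_true + in_pred) / 2
    if max_index == expected:
        return 1.0  # Degenerate case, e.g. both labelings are one cluster.
    return (both - expected) / (max_index - expected)


# Cluster names do not matter, only the grouping does.
perfect = adjusted_rand_index(["T", "T", "B", "B"], [1, 1, 0, 0])
crossed = adjusted_rand_index(["T", "T", "B", "B"], [0, 1, 0, 1])
```

The first call returns 1.0 because the grouping is identical despite different names; the second returns -0.5 because the predicted clusters cut straight across the true ones.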

The code is MIT-licensed, CI runs across Python 3.10 – 3.12, and the annotation logic is transparent enough to audit. If you’re doing immuno-oncology scRNA-seq and want a quick, reproducible first pass on a new dataset, give it a try.

https://github.com/lynchaos/tilscope