Building ML Tools Scientists Will Actually Use

The Gap Between Models and Tools

I’ve seen a lot of impressive ML models in biopharma that never get used. Not because the science is wrong, but because the tool doesn’t fit into anyone’s workflow. The model might be published in Nature Methods with beautiful receiver operating characteristic curves, but if a discovery scientist can’t access it without filing an IT ticket or if it requires command-line expertise, it sits unused.

This is the reality of building ML tools for scientific users: technical correctness is necessary but not sufficient. The tool needs to be accessible, interpretable, and trustworthy. It needs to answer the questions scientists actually have, not the questions we think they should have. And it needs to integrate into existing workflows rather than requiring people to change how they work.

When I built the antibody developability predictor, I spent as much time thinking about these usability questions as I did training the model. The result is a tool that scientists can actually use without a data science degree. Here’s what I learned.

Start with the User’s Question

The first mistake I almost made was building a general-purpose protein property predictor. My initial plan was to calculate every possible sequence feature and let users decide what mattered. But when I talked to actual antibody engineers, they didn’t want a Swiss Army knife of features. They wanted answers to specific questions: Will this express well? Will it aggregate? Is it stable enough for formulation?

So instead of building a feature calculator, I built a developability classifier. The output isn’t a list of numbers. It’s a verdict: high, medium, or low developability. That’s what decision-makers need. They’re choosing between candidate molecules. A binary or ternary decision is more useful than a spreadsheet of features.

But here’s the important part: I still provide the features. The overall prediction is front and center, but if you want to dig into why the model made that prediction, the features are available. This layered approach works because it serves two audiences. The antibody engineer who just needs a quick answer gets one. The process development scientist who wants to understand the details can access them.

This design principle (answer the main question immediately, provide details on demand) applies to most scientific ML tools. Scientists are busy. They don’t want to interpret raw model outputs. But they also don’t want black boxes. The solution is layered transparency: simple top-level output with drill-down capability.
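
To make the layering concrete, here is a minimal sketch of a result object that leads with the verdict and keeps the features one call away. It is illustrative rather than the tool’s actual code: the class name DevelopabilityReport, its fields, and the feature values are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DevelopabilityReport:
    """Hypothetical result object: verdict first, details on demand."""
    verdict: str                      # "high", "medium", or "low"
    confidence: float                 # probability of the predicted class
    features: dict[str, float] = field(default_factory=dict)

    def summary(self) -> str:
        # Top layer: the one-line answer a decision-maker needs.
        return f"{self.verdict.capitalize()} developability ({self.confidence:.0%} confidence)"

    def details(self) -> list[str]:
        # Second layer: the underlying features, sorted by name.
        return [f"{name}: {value:.2f}" for name, value in sorted(self.features.items())]


report = DevelopabilityReport(
    verdict="high",
    confidence=0.87,
    features={"gravy": -0.41, "net_charge": 2.0},  # made-up values
)
print(report.summary())
for line in report.details():
    print("  " + line)
```

The point of the split is that a UI can render summary() by default and only call details() when the user asks for more.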

Make It Interpretable

Trust is the single biggest barrier to ML adoption in regulated industries like biopharma. Scientists are trained to ask “why?” A model that says “this antibody will aggregate” without explanation is useless in a regulated environment. You can’t put “the neural network said so” in a regulatory filing. You need mechanistic understanding.

This is why I chose tree-based models (Random Forest, XGBoost) over deep learning. These models are easier to interrogate: they provide feature importance rankings, and their individual predictions can be attributed to specific features. When the model predicts high aggregation risk, I can show that the decision was driven by high hydrophobicity in CDR2, three aromatic residues clustered in a seven-amino-acid window, and a GRAVY score above the threshold. That’s something a scientist can evaluate.
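
Part of what makes these features easy to evaluate is that they are simple to compute by hand. The sketch below shows how two of them could be calculated: the GRAVY score (the mean Kyte-Doolittle hydropathy per residue) and the largest number of aromatic residues in any seven-residue window. The function names are assumptions, treating only F, W, and Y as aromatic is my choice for this example, and the code assumes a non-empty sequence of standard amino acids. It is not the tool’s actual feature code.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # assumption: histidine is not counted as aromatic here


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def max_aromatics_in_window(seq: str, window: int = 7) -> int:
    """Largest count of aromatic residues in any `window`-long stretch."""
    if len(seq) <= window:
        return sum(aa in AROMATIC for aa in seq)
    return max(
        sum(aa in AROMATIC for aa in seq[i:i + window])
        for i in range(len(seq) - window + 1)
    )


toy = "GSFYAWGS"  # toy peptide, not a real antibody region
print(f"GRAVY: {gravy(toy):.2f}")
print(f"Max aromatics in a 7-residue window: {max_aromatics_in_window(toy)}")
```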

I also implemented SHAP (SHapley Additive exPlanations) values, which show not just which features are generally important, but which features influenced this specific prediction. For any given antibody, I can generate a waterfall plot showing: “The base rate would predict medium developability, but this molecule has unusually low aggregation propensity (+15% toward high) and favorable charge distribution (+8% toward high), so the final prediction is high developability.”
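
For readers who want to see what is behind a waterfall plot, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature’s marginal contribution over every feature ordering. Everything here is an assumption for illustration: toy_score is a made-up function, not the developability model, and the brute-force loop is only practical for a handful of features. The shap library uses much faster algorithms for tree models, so treat this as an explanation of the idea rather than how the tool does it.

```python
from itertools import permutations


def shapley_values(predict, instance: dict, baseline: dict) -> dict:
    """Exact Shapley values: average marginal contribution of each feature
    over all feature orderings, relative to a baseline input."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start from the reference input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its real value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(x: dict) -> float:
    # Made-up scoring function with an interaction term; NOT the real model.
    return (0.5 - 0.2 * x["aggregation"] + 0.1 * x["charge"]
            - 0.1 * x["aggregation"] * x["hydrophobicity"])


baseline = {"aggregation": 1.0, "charge": 0.0, "hydrophobicity": 1.0}
molecule = {"aggregation": 0.0, "charge": 1.0, "hydrophobicity": 1.0}

contributions = shapley_values(toy_score, molecule, baseline)
base = toy_score(baseline)

# Text version of a waterfall plot: base value, then one step per feature.
print(f"base value: {base:+.2f}")
for name, phi in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"  {name:>15}: {phi:+.2f}")
print(f"prediction: {base + sum(contributions.values()):+.2f}")
```

The property that makes the waterfall readable is that the contributions add up exactly to the gap between the baseline score and this molecule’s score.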

This level of interpretability isn’t just nice to have. It’s essential for adoption. Scientists need to be able to sanity-check the model’s logic. If the model predicts low developability for a molecule that looks fine to an experienced engineer, that engineer needs to see why. Maybe the model caught something subtle. Or maybe it’s a false alarm and the model’s reasoning doesn’t hold up. Either way, interpretability enables that conversation.

Integrate into Existing Workflows

The developability predictor is a web app. You paste in sequences, hit a button, get results. This might seem obvious, but it’s a deliberate design choice. I could have built this as a Python package that requires local installation, or as an API that requires coding knowledge. But those approaches create adoption barriers.

Web apps meet users where they are. Scientists are comfortable with web interfaces. They use them for BLAST, for PDB structure viewers, for ordering reagents. A web app doesn’t require IT approval to install software. It doesn’t require learning a command-line interface. It works on any computer with a browser.

The flip side is that web apps have limitations. They can’t easily integrate into computational pipelines. If a company wants to screen thousands of antibody sequences, they’re not going to manually paste each one into a web form. For that use case, an API or downloadable tool makes sense. But for the initial use case (individual scientists evaluating a handful of candidates), the web app is the right level of accessibility.

I also made sure the output was easy to export. You can download the prediction results as a CSV file. You can screenshot the visualizations. You can copy the SHAP plot and paste it into a presentation. These are small details, but they matter. If the tool doesn’t output data in formats people already use, they’ll spend time reformatting instead of analyzing.
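
As a rough illustration of the CSV export, the sketch below serializes prediction rows with Python’s standard csv module. The column names and the results_to_csv helper are assumptions for this example, not the tool’s actual export schema.

```python
import csv
import io


def results_to_csv(rows: list[dict]) -> str:
    """Serialize prediction rows to CSV text that opens in any spreadsheet."""
    fieldnames = ["name", "prediction", "p_low", "p_medium", "p_high"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


rows = [
    {"name": "candidate_1", "prediction": "high",
     "p_low": 0.12, "p_medium": 0.15, "p_high": 0.73},  # made-up values
]
csv_text = results_to_csv(rows)
print(csv_text)
```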

Handle Edge Cases Gracefully

One of the first things I had to handle was invalid input. What happens if someone pastes in a DNA sequence instead of a protein sequence? What if there are typos or special characters? What if the sequence is too short?

In early testing, these errors crashed the model or produced nonsense predictions. The fix was adding input validation. The tool now checks that sequences contain only valid amino acids, that they’re long enough to be plausible antibody variable regions, and that heavy and light chains are provided together. If validation fails, the user gets a clear error message explaining what went wrong.
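
The checks described above could look something like the sketch below. It is a simplified version under stated assumptions: the 90-residue minimum is a placeholder cutoff, the DNA check is a crude heuristic (a sequence made up only of A, C, G, T, U, or N), and the function names are hypothetical.

```python
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_VARIABLE_REGION_LENGTH = 90  # assumed cutoff; tune to your own data


def validate_chain(seq: str, label: str) -> list[str]:
    """Return human-readable problems with one chain (empty list means OK)."""
    cleaned = "".join(seq.split()).upper()  # drop whitespace and line breaks
    if not cleaned:
        return [f"{label} chain is missing. Please provide both heavy and light chains."]
    # Heuristic: only nucleotide letters means this is probably DNA/RNA.
    if set(cleaned) <= set("ACGTUN"):
        return [f"{label} chain looks like a DNA/RNA sequence. "
                "Please paste the translated protein sequence."]
    problems = []
    invalid = sorted(set(cleaned) - VALID_AMINO_ACIDS)
    if invalid:
        problems.append(f"{label} chain contains invalid characters: {', '.join(invalid)}.")
    if len(cleaned) < MIN_VARIABLE_REGION_LENGTH:
        problems.append(
            f"{label} chain has {len(cleaned)} residues; expected at least "
            f"{MIN_VARIABLE_REGION_LENGTH} for a variable region."
        )
    return problems


def validate_pair(heavy: str, light: str) -> list[str]:
    """Validate both chains and collect every problem found."""
    return validate_chain(heavy, "Heavy") + validate_chain(light, "Light")


# A DNA paste in the heavy-chain box and an empty light-chain box.
for problem in validate_pair("ATGGCC" * 20, ""):
    print(problem)
```

Returning a list of messages rather than raising on the first failure lets the interface show the user everything that needs fixing in one pass.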

This might seem basic, but it’s the difference between a research prototype and a production tool. Research code can assume perfect input. Production tools need to handle reality. And reality includes typos, confusion about file formats, and users who aren’t sure exactly what the tool expects.

I also had to handle ambiguous cases. Some antibodies have unusual features (non-canonical cysteines, rare amino acid substitutions) that the model wasn’t trained on. In these cases, the model still makes a prediction, but it also flags uncertainty. The user sees: “Prediction: Medium developability. Note: This sequence contains non-standard features. Confidence may be lower than usual.” That transparency helps users calibrate how much to trust the output.
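
One simple way to produce that kind of note is to run a few rule-based checks alongside the model and attach any findings to the output. The sketch below flags a single unusual feature, a cysteine count other than the two conserved cysteines that form the intradomain disulfide in a typical variable domain. The function names and the choice of check are assumptions for illustration; the real tool’s rules may differ.

```python
def unusual_feature_notes(seq: str, expected_cysteines: int = 2) -> list[str]:
    """Return notes about features the model may not have seen in training."""
    notes = []
    n_cys = seq.upper().count("C")
    if n_cys != expected_cysteines:
        notes.append(f"{n_cys} cysteines found (expected {expected_cysteines})")
    return notes


def format_prediction(verdict: str, notes: list[str]) -> str:
    """Build the user-facing message, adding a caveat when notes exist."""
    message = f"Prediction: {verdict.capitalize()} developability."
    if notes:
        message += (" Note: This sequence contains non-standard features ("
                    + "; ".join(notes) + "). Confidence may be lower than usual.")
    return message


toy_domain = "ACDCC"  # toy string with three cysteines, not a real domain
print(format_prediction("medium", unusual_feature_notes(toy_domain)))
```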

Build Trust Through Validation Examples

The tool includes a set of example antibodies (Trastuzumab, Rituximab, Adalimumab) that users can try with one click. This serves two purposes. First, it makes the tool easy to explore. New users can see what a prediction looks like without having to find their own sequences. Second, it builds trust. These examples are well-known molecules with well-understood developability profiles. Users can verify that the model’s predictions align with reality.

This is a strategy I borrowed from molecular modeling tools. When scientists try a new docking program or protein structure predictor, they first test it on known examples. If the tool can correctly predict the structure of a protein they already have the crystal structure for, that builds confidence. If it gets the known example wrong, they don’t trust it for new predictions.

The same principle applies here. By providing validated examples, I’m letting users kick the tires. If they see that the tool correctly identifies Trastuzumab as highly developable (which it does) and correctly flags problematic features in more challenging molecules (which it does), they’re more likely to trust it for their own candidates.

Communicate Uncertainty

Every ML model has uncertainty, but most scientific ML tools hide it. The prediction is presented as definitive. This is a mistake, especially for decisions with real consequences. If a company deprioritizes a molecule because the model predicted low developability, and that molecule would have actually been fine, that’s a costly error.

So I made uncertainty explicit. The tool reports confidence scores. “High developability, 87% confidence” is different from “High developability, 52% confidence.” The first is a strong signal. The second means the model puts nearly half of its probability on the other two classes, so it should be treated cautiously.

I also provide probability distributions. Instead of just saying “the prediction is high,” the tool shows: 12% chance low, 15% chance medium, 73% chance high. This helps users understand the strength of the evidence. A 73% prediction might be worth acting on. A 40% prediction means the model is uncertain, and you should rely more on other information.
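
Here is a minimal sketch of how class probabilities could be turned into that kind of message. The describe_prediction helper and the 60% caution threshold are assumptions for this example; the right threshold depends on how well calibrated the model is.

```python
def describe_prediction(probabilities: dict[str, float],
                        caution_below: float = 0.6) -> str:
    """Turn class probabilities into a verdict plus the full breakdown."""
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"Probabilities must sum to 1, got {total:.3f}")
    label = max(probabilities, key=probabilities.get)  # most likely class
    confidence = probabilities[label]
    breakdown = ", ".join(f"{p:.0%} {name}" for name, p in probabilities.items())
    text = f"{label.capitalize()} developability, {confidence:.0%} confidence ({breakdown})"
    if confidence < caution_below:
        # Assumed threshold: below it, tell the user to lean on other evidence.
        text += ". Low confidence: weigh other evidence before acting."
    return text


print(describe_prediction({"low": 0.12, "medium": 0.15, "high": 0.73}))
print(describe_prediction({"low": 0.25, "medium": 0.35, "high": 0.40}))
```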

Communicating uncertainty this way requires some user education. Many scientists aren’t trained in probabilistic thinking and want definitive answers. But in my experience, once you explain that the percentages reflect model confidence (roughly, how strongly the patterns in the training data point to each class for this molecule), people get it. And they appreciate the honesty more than a false sense of certainty.

Iterate Based on Feedback

The current version of the tool is version 0.3. Versions 0.1 and 0.2 had different feature sets, different UIs, and different outputs. They evolved based on feedback from scientists who tried them.

Version 0.1 just reported the developability score with no explanation. Users hated it. “Why did it predict this? What should I change?” So I added feature breakdowns.

Version 0.2 showed all 47 features in a giant table. Users found it overwhelming. “There’s too much information. I don’t know what to focus on.” So I simplified to show only the top 10 most influential features by default, with an option to expand to the full list.
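
The default view boils down to ranking features by the size of their influence and truncating the list. A minimal sketch, assuming per-feature contributions are already available as a dictionary (the top_features name and the values are made up for this example):

```python
def top_features(contributions: dict[str, float], k: int = 10) -> list[tuple[str, float]]:
    """Return the k features with the largest absolute contribution."""
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


contribs = {"gravy": 0.15, "net_charge": -0.08, "length": 0.01}  # made-up values
for name, value in top_features(contribs, k=2):
    print(f"{name}: {value:+.2f}")
```

Ranking by absolute value matters here: a feature that pushes the prediction strongly toward low developability deserves the same visibility as one that pushes it toward high.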

The SHAP visualizations came from a request. A process scientist said, “I get that hydrophobicity is generally important, but how much did it matter for this specific molecule?” That was a great question, and SHAP values answered it perfectly.

This iterative approach is essential for building useful tools. No matter how much you think about usability upfront, you won’t get it right the first time. You need real users to tell you what’s confusing, what’s missing, and what’s extraneous. Then you iterate.

Make the Code Open

I made the tool open source on GitHub. This wasn’t just philosophical; it was pragmatic. Pharmaceutical companies are (rightly) cautious about using black-box external tools for critical decisions. If the code is closed and proprietary, they have no way to verify it or adapt it to their needs.

By making the code open, I enable transparency. Anyone can read the feature engineering logic, inspect the model training process, and verify that I’m doing what I claim to do. This builds trust in a way that no amount of benchmarking can.

It also enables customization. A company might want to retrain the model on their internal data or modify the feature set to reflect their specific process. With open code, they can do that. They can fork the repository, make their changes, and have a tool that’s tailored to their context.

The downside of open source is that you lose control. Someone could take the code, strip out the attribution, and present it as their own work. Someone could modify it incorrectly and produce bad predictions. But in practice, these risks are smaller than the benefits. Open source signals that you’re confident in your work and willing to let it be scrutinized. In science, that’s valuable.

The Adoption Question

At the end of the day, the measure of success for an ML tool isn’t accuracy on a held-out test set. It’s whether people use it. A model with 95% accuracy that no one uses is less valuable than a model with 75% accuracy that gets integrated into decision workflows.

I don’t know yet whether this tool will achieve broad adoption. It’s early. But the design principles I’ve followed (answer clear questions, provide interpretability, integrate into workflows, communicate uncertainty) are the right ones. They’re the principles that separate research prototypes from production tools.

If there’s one thing I’ve learned, it’s that ML in biopharma isn’t just a technical challenge. It’s a human challenge. You’re asking scientists to change how they work, to trust a model they don’t fully understand, and to incorporate probabilistic thinking into processes that have historically been deterministic. That’s a big ask. The only way it works is if the tool meets them halfway: make it accessible, make it trustworthy, and make it solve a problem they actually have.

The antibody developability predictor does that. It’s not perfect, but it’s useful. And in the end, useful is what matters.

This is part of a series on applying machine learning to antibody developability. For the tool and code, see the GitHub repository. For technical details, read the full project writeup.