How I Built a Machine Learning Tool to Predict Drug Manufacturing Failures

A bioprocess engineer’s journey into machine learning and why the pharmaceutical industry desperately needs this bridge

When I tell people I work in bioprocess engineering, I usually get blank stares. When I explain that I help manufacture proteins in giant tanks for therapeutic use, the response is often: “Oh, like brewing beer?”

Not quite. But close enough. What I don’t usually mention is that I’ve been teaching myself machine learning on nights and weekends. Not because it’s trendy, but because I keep seeing the same expensive problem over and over again in drug development, and I think AI can help solve it.

The $50 Million Problem Nobody Talks About

Here’s something most people outside pharma don’t know: discovering a promising new drug is the easy part.

Seriously.

The hard part? Actually making it.

I’ve seen it happen more times than I’d like to admit. A discovery team spends years designing a brilliant therapeutic antibody. It binds perfectly to its target. It’s highly selective. It works beautifully in mice. Everyone’s excited.

Then it lands on my desk for process development.

And that’s when we discover the molecule is a nightmare to manufacture: Expression yields are terrible (we get 0.5 g/L instead of the 5 g/L we need). It aggregates during purification like it has a personal vendetta against staying soluble. The formulation team can’t get it stable enough for storage. Scale-up? Forget about it.

The timeline extends by 18 months. The budget balloons by $30-50 million. Sometimes the program gets scrapped entirely.

All because we didn’t know the molecule would be difficult to make until we tried to make it.

Why This Keeps Happening

There’s a fundamental disconnect in drug development.

Discovery scientists optimize for binding affinity (how tightly it sticks to the target), selectivity (no off-target effects), efficacy (it works in disease models), and safety (it doesn’t harm healthy tissue).

All critical. All necessary.

But there’s one thing that often gets overlooked until it’s too late: Can we actually manufacture this at scale?

It’s not that discovery scientists don’t care. They absolutely do. It’s just that by the time we discover manufacturing problems, we’ve already invested millions. The incentive is to push forward, not go back to the drawing board.

We need earlier warning signals.

The “Aha” Moment

Last year, I was troubleshooting yet another low-expressing antibody. Looking at the sequence, I noticed something: almost all the “difficult” molecules we’d worked on had similar characteristics.

Lots of hydrophobic patches in the CDRs (the binding regions). High numbers of deamidation-prone sites. Unusual charge distributions.

I started keeping a list. Pattern matching, basically.

Then I thought: If I can see these patterns, could a machine learning algorithm?

And more importantly: Could we predict these problems before we waste months in the lab?

That question became an obsession.

Building the Solution (On My Personal Time)

I want to be very clear about something: I built this project entirely on my own time, using my personal laptop, with only publicly available data. Nothing proprietary. Nothing from my employer. This is a portfolio project, pure and simple.

But it’s a portfolio project that solves a real problem.

The Approach

1. Gathering Data

I collected sequences and developability data for over 100 therapeutic antibodies from public sources: DrugBank (approved drugs), published papers with expression data, and structural antibody databases.

For each antibody, I labeled it based on reported manufacturing characteristics: expression level (high/medium/low), aggregation propensity, and known stability issues.

2. Engineering Features

This is where my bioprocess background became crucial. I didn’t just throw sequences at a neural network and hope for the best. I calculated 40+ meaningful features based on what actually matters in manufacturing:

Physicochemical properties: Molecular weight and charge (affects purification), hydrophobicity profiles (predicts aggregation), isoelectric point (impacts formulation), and instability index (thermal stability indicator).

Aggregation risk indicators: Hydrophobic patches (I scan the sequence with a sliding window to find clusters of hydrophobic amino acids; these are aggregation hotspots), surface charge distribution, and unusual amino acid motifs.

Manufacturing liability markers: N-glycosylation sites (can cause heterogeneity), deamidation hotspots (NG, NS motifs that cause instability), oxidation-prone residues (methionine, tryptophan), and isomerization sites.

Every feature I included has a biological reason to exist. This isn’t black box AI; it’s AI guided by domain expertise.

3. Training the Model

I trained two models: Random Forest (as a baseline) and XGBoost (gradient boosting).

The results? XGBoost reached 75% accuracy on the 20 held-out test antibodies, with Random Forest close behind at 72%.

That might not sound impressive if you’re used to computer vision hitting 99% accuracy. But in biology, with limited data, predicting a complex phenotype from sequence alone? That’s actually pretty good.

More importantly: it’s useful. A 75% accurate screening tool can save months of wasted experimental work.

What I Learned

The most predictive features weren’t what I expected:

1. Hydrophobic patches in CDRs (importance: 18%). The single strongest predictor of aggregation. Patches longer than 7 residues with high hydrophobicity are red flags.

2. Deamidation site count (14%). NG and NS motifs in variable regions. Directly correlates with shelf-life problems.

3. Isoelectric point (12%). Extreme values (< 6 or > 9) cause viscosity issues. Sweet spot appears to be 7.0-8.5.

4. N-glycosylation sites (11%). Extra glycosylation sites can cause product heterogeneity.

5. Charge asymmetry between chains (9%). Large differences in pI between heavy and light chains. Linked to poor expression (probably pairing issues).

Some features I thought would matter didn’t: Overall sequence length was a weak predictor. Total cysteine count only matters if unpaired. Molecular weight was surprisingly unimportant.

The model is telling us something real about the biology.

The Demo: Making It Real

I built an interactive Streamlit app where you can paste in antibody sequences and get instant predictions.

Type in a heavy chain and light chain sequence and get a developability score in under a second.

I tested it on known therapeutic antibodies: Trastuzumab (Herceptin) was predicted “High” developability (correct: it’s manufactured at multi-ton scale). Rituximab was “High” (also a manufacturing success story). Some experimental antibodies from papers were “Low” (matched reported expression problems).

It works.

For Drug Discovery

Imagine running this during lead optimization: “Let’s mutate position 52 to increase binding affinity” and the AI responds “That creates a deamidation site and drops developability from 85% to 58%. Consider F or Y instead?”

Real-time feedback during design. Not after $10M has been spent.
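One small piece of that feedback loop can be sketched without any model at all: checking whether a proposed point mutation introduces a new NG or NS deamidation motif. The function names and the 1-based position convention below are illustrative assumptions, not code from the project, and the sketch does not reproduce the developability score itself.

```python
def deamidation_motif_starts(seq):
    """Return 0-based indices of every N that is followed by G or S."""
    seq = seq.upper()
    return {i for i in range(len(seq) - 1) if seq[i] == "N" and seq[i + 1] in "GS"}


def new_deamidation_sites(seq, position, new_residue):
    """List 1-based positions of NG/NS motifs created by a point mutation.

    `position` is 1-based, matching how residues are usually discussed.
    Only motifs absent from the original sequence are reported.
    """
    if not 1 <= position <= len(seq):
        raise ValueError("position is outside the sequence")
    idx = position - 1
    mutated = seq[:idx] + new_residue.upper() + seq[idx + 1:]
    created = deamidation_motif_starts(mutated) - deamidation_motif_starts(seq)
    return sorted(i + 1 for i in created)


# Hypothetical 5-residue fragment, used only to show the call pattern.
fragment = "ACDGK"
print(new_deamidation_sites(fragment, 3, "N"))  # D3N puts an N in front of G
print(new_deamidation_sites(fragment, 3, "F"))  # D3F creates no new motif
```

A check like this could run on every proposed substitution before the heavier model is called, so the cheapest warnings arrive first.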

For Process Development

We could prioritize candidates with better manufacturability. Or at least go into process development with eyes open about which molecules will need extra attention.

The business case: About 30% of antibody candidates fail due to manufacturability. Each failure costs $10-50M and 12-24 months. AI screening costs approximately $0 per molecule and takes less than 1 second.

For the Industry

We need to stop treating discovery and manufacturing as separate silos that occasionally throw things over the wall to each other.

Discovery shouldn’t just ask “Does it bind?”

They should ask “Does it bind and can we make it?”

AI can help bridge that gap.

The Honest Limitations

Let me be clear about what this model can’t do: It can’t predict exact expression titers, account for different cell lines or culture conditions, replace experimental validation, or predict immunogenicity or in vivo behavior.

What it can do: Rank candidates by likely manufacturability, flag high-risk sequences early, guide rational sequence engineering, and start conversations between discovery and manufacturing.

It’s a screening tool, not a crystal ball.

What I’d Do Next (If This Were My Full-Time Job)

Short-term improvements:

Better data: Partner with CDMOs to get real manufacturing data. Build company-specific models with proprietary sequences. Prospective validation on molecules in development.

Structural features: Integrate AlphaFold predictions. Calculate surface hydrophobicity from predicted structures. Add B-factor predictions for flexibility analysis.

Advanced ML: Transfer learning from protein language models (ESM-2, ProteinBERT). Multi-task learning (predict expression + aggregation + stability simultaneously). Attention mechanisms to identify which sequence regions drive predictions.

Long-term vision:

I want to see this kind of tool built into antibody design software.

Real-time developability feedback while you’re designing molecules. Not as a checkbox at the end, but as part of the creative process.

Discovery and manufacturing working together from day one.

The Bigger Picture: Why I Did This

I didn’t build this just to learn machine learning (though I did learn a lot).

I built it because I’m tired of seeing good programs fail for preventable reasons.

I’m tired of the “discovery throws it over the wall to manufacturing” model.

I’m tired of watching timelines slip and budgets balloon because we didn’t ask the manufacturability question early enough.

We have the data. We have the algorithms. We have the domain knowledge.

What we need is more people willing to bridge the gap between disciplines.

A Personal Note

I work in precision fermentation (making proteins for food applications). This project is about therapeutic antibodies, a completely different application area. I built it on my own time, with my own equipment, using only public data.

Why mention this?

Because I think the best solutions come from people who can see across disciplinary boundaries. Bioprocess engineering plus machine learning. Manufacturing reality plus AI capability.

That’s the bridge we need.

Want to Try It?

The project is open source. You can try the demo at antibody-developability.streamlit.app, or browse the code and read the technical details at github.com/lynchaos/antibody-developability-ml.

If you’re working in antibody discovery, process development, or CMC, I’d love to hear your thoughts. What would make this more useful? What am I missing?

Final Thoughts

Drug development is hard enough without manufacturing surprises derailing programs late in development.

We can do better.

AI won’t replace the brilliant scientists doing discovery or the skilled engineers doing process development. But it can give us better tools to work together.

It can help us ask the right questions earlier.

It can help us design molecules that are both effective and manufacturable.

And maybe, just maybe, it can help more medicines make it to patients who need them.

That’s worth a few nights and weekends of coding.

Bridging the gap between discovery and manufacturing in biologics development. This project was built entirely during personal time using public data for educational and portfolio purposes.

Technical Appendix (For the Nerds)

Stack: Python 3.9+, BioPython for sequence analysis, XGBoost and scikit-learn for ML, Streamlit for web app, Plotly for interactive visualizations.

Dataset: 80 training antibodies and 20 held-out test antibodies (stratified split). Sources: DrugBank, published literature, SAbDab.

Performance: XGBoost accuracy 75%; Random Forest accuracy 72%; cross-validation 73% ± 4%; inference time less than 1 second per antibody.
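With a test set this small, it helps to see how wide the uncertainty around 75% is. The sketch below computes a Wilson score interval, assuming that 75% of 20 antibodies means 15 correct predictions. The choice of interval and the helper name are assumptions for illustration, not part of the original analysis.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


# 15 correct out of 20 is inferred from the reported 75% accuracy.
low, high = wilson_interval(15, 20)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 53% to 89%, which fits the framing of the model as a screening tool rather than a precise predictor.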

Top 5 Features by Importance:

1. HC hydrophobic patch score (0.18)

2. Deamidation site count (0.14)

3. Isoelectric point (0.12)

4. N-glycosylation sites (0.11)

5. Charge asymmetry (0.09)

Feature Engineering Highlights:

Aggregation scoring uses a sliding window approach. The algorithm iterates through the sequence with a 7-residue window, calculates the GRAVY score for each window, and returns the maximum hydrophobicity found multiplied by 10.
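A minimal sketch of that sliding-window calculation, assuming the per-window GRAVY score is the mean Kyte-Doolittle hydropathy of the residues in the window. The function names are illustrative, and the repository code may handle edge cases (short sequences, non-standard residues) differently.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_gravy(window):
    """Mean hydropathy of one window; raises KeyError on non-standard residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in window) / len(window)


def hydrophobic_patch_score(seq, window=7, scale=10.0):
    """Maximum windowed GRAVY across the sequence, multiplied by `scale`."""
    seq = seq.upper()
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    best = max(
        window_gravy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    )
    return best * scale


# Hypothetical toy sequence: a hydrophobic stretch flanked by charged residues.
print(hydrophobic_patch_score("DDDDIIIIIIIDDDD"))
```

Taking the maximum rather than the average keeps a single strongly hydrophobic stretch from being diluted by an otherwise well-behaved sequence.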

PTM site identification uses regex patterns. N-glycosylation sites follow the N-X-S/T motif (where X is not proline). Deamidation hotspots are identified by counting NG and NS motifs in the sequence.
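The two motif rules can be sketched with Python's `re` module. The pattern and function names below are assumptions for illustration. A lookahead is used so overlapping motifs are all found, which may or may not match how the repository counts them.

```python
import re

# N-X-S/T sequon where X is any residue except proline.
# The zero-width lookahead lets overlapping sequons each be reported.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST])")

# Asparagine followed by glycine or serine (NG / NS deamidation hotspots).
DEAMIDATION = re.compile(r"N(?=[GS])")


def n_glycosylation_sites(seq):
    """Return 0-based start positions of N-X-S/T sequons."""
    return [m.start() for m in N_GLYCOSYLATION.finditer(seq.upper())]


def deamidation_site_count(seq):
    """Count NG and NS motifs in the sequence."""
    return len(DEAMIDATION.findall(seq.upper()))


# Hypothetical toy sequence: NGS and NST are sequons, NPS is excluded by the proline rule.
toy = "ANGSANPSANST"
print(n_glycosylation_sites(toy))
print(deamidation_site_count(toy))
```

Note that a single asparagine can count toward both features, as in NGS, where the N starts a sequon and is also followed by glycine.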

All code available on GitHub with detailed documentation.

https://github.com/lynchaos/antibody-developability-ml.git

https://antibody-developability.streamlit.app

Have questions? Feedback? Job opportunities? Let’s talk!

Drop a comment below or reach out directly. I’m always happy to discuss biologics manufacturing, ML for proteins, or how we can bridge discovery and development more effectively.

By Kemal
