The Features That Matter
After training the developability model on 100+ therapeutic antibodies, I looked at the feature importance rankings. The model had learned to weight certain properties more heavily than others when making predictions. Some of these were obvious. Some were surprising. All of them tell us something about what actually makes antibodies difficult to manufacture.
The single most predictive feature was aggregation propensity in the complementarity-determining regions (CDRs). Specifically, the model calculated a “maximum window hydrophobicity” score by sliding a seven-amino-acid window across the sequence and finding the most hydrophobic stretch. High values here correlated strongly with poor developability. This makes biological sense. Hydrophobic patches on the surface of a protein are sticky. They want to interact with other hydrophobic patches. When you have millions of antibody molecules in solution at high concentration (which is what manufacturing requires), hydrophobic patches cause self-association. Self-association leads to aggregation. Aggregation is one of the most common reasons antibodies fail during process development.
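A minimal sketch of that sliding-window calculation, assuming the Kyte-Doolittle hydropathy scale and a mean over each window. The scale, the averaging, and the function name are illustrative choices, not necessarily what the model uses internally.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_window_hydrophobicity(seq, window=7):
    """Return (score, start) for the most hydrophobic window.

    score is the mean hydropathy of the window; start is its 0-based index.
    """
    seq = seq.upper()
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    values = [KYTE_DOOLITTLE[aa] for aa in seq]

    running = sum(values[:window])  # sum over the first window
    best_score, best_start = running / window, 0
    for start in range(1, len(seq) - window + 1):
        # Slide by one residue: add the new one, drop the old one.
        running += values[start + window - 1] - values[start - 1]
        mean = running / window
        if mean > best_score:
            best_score, best_start = mean, start
    return best_score, best_start
```

Calling `max_window_hydrophobicity(cdr_sequence)` on each CDR and keeping the largest score gives one number per antibody. Returning the start index as well makes it easy to see which stretch is driving the score.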
The second most important feature was the number of deamidation-prone sites. Deamidation is a chemical degradation pathway where asparagine (N) converts to aspartic acid (D) or isoaspartic acid. This happens spontaneously over time, especially at NG and NS sequences. The problem with deamidation is that it creates product heterogeneity. You start with one molecule, and over weeks or months in storage, you end up with a mixture of deamidated variants. Regulatory agencies care about this because it makes it harder to demonstrate product consistency. Process development teams care about it because it limits shelf life and complicates formulation.
The third critical feature was overall charge. But the relationship wasn’t linear. The model learned that extreme charge (either very positive or very negative) correlated with poor developability. Highly charged molecules can have low expression yields in CHO cells, possibly because they interact with cellular machinery in problematic ways. They also require carefully designed formulation buffers to prevent aggregation driven by electrostatic interactions.
These top three features (hydrophobic patches, deamidation sites, extreme charge) accounted for roughly 60% of the model’s predictive power. The remaining 40% was distributed across dozens of other features: glycosylation sites, oxidation-prone methionines and tryptophans, CDR length, framework region composition, and various calculated stability indices.
Why Aggregation Dominates
Let me dig deeper into why aggregation is such a dominant concern. When pharmaceutical companies manufacture therapeutic antibodies, they’re not working with dilute solutions. Commercial processes push titers as high as possible to maximize productivity. Modern fed-batch CHO processes routinely achieve 5 to 10 grams per liter, and downstream concentration steps push the protein concentration higher still.
At those concentrations, molecules that have any tendency to stick together will aggregate. The aggregates can be reversible (dissociate when diluted) or irreversible (covalently linked or so tightly bound they never come apart). Either way, aggregates are a problem. They reduce product yield. They can trigger immune responses in patients. They’re difficult to remove during purification. They fail quality control tests.
The model learned to look for sequence features that predict aggregation risk. Hydrophobic patches are the most obvious, but there are others. Aromatic residues (phenylalanine, tryptophan, tyrosine) can stack together via pi-pi interactions. Certain dipeptide sequences are known to form beta-sheet structures that promote fibril formation. The model incorporates these signals through calculated indices such as an “aggregation propensity score” based on the principles behind the TANGO algorithm.
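One simple way to turn the aromatic-residue signal into a number is the fraction of F, W, and Y in a region. This is a rough sketch with a hypothetical function name. It is not the TANGO-style score, which is considerably more involved.

```python
AROMATIC = frozenset("FWY")  # phenylalanine, tryptophan, tyrosine


def aromatic_fraction(region):
    """Fraction of residues in `region` that are aromatic (0.0 if empty)."""
    region = region.upper()
    if not region:
        return 0.0
    return sum(aa in AROMATIC for aa in region) / len(region)
```

Computing this per CDR rather than over the whole chain keeps the feature focused on the exposed binding loops, where the stacking interactions are most likely to matter.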
What makes this interesting from a machine learning perspective is that the model discovered these patterns from data, not from first principles. I didn’t explicitly program in “aromatic residues are risky.” The model inferred it by seeing that antibodies with high aromatic content in their CDRs tended to have lower developability scores in the training data. This is exactly what you want from a data-driven approach—it captures empirical patterns that might be hard to codify as explicit rules.
The Deamidation Problem
Deamidation is subtler than aggregation but equally important for long-term product stability. The model learned to count NG and NS motifs, which are the primary sites where deamidation occurs. But the context matters. Deamidation happens faster when the asparagine is in a flexible loop (like CDRs often are) compared to a rigid structured region. It also happens faster at higher temperatures and certain pH ranges.
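Counting the motifs is straightforward. The sketch below, with a hypothetical function name, reports where each NG or NS motif sits so the hits can be checked against CDR boundaries. It says nothing about loop flexibility, temperature, or pH, which would need separate features.

```python
import re


def deamidation_sites(seq, motifs=("NG", "NS")):
    """Return a sorted list of (position, motif) hits, 0-based positions."""
    seq = seq.upper()
    hits = []
    for motif in motifs:
        # A lookahead finds every occurrence, even back-to-back ones.
        for match in re.finditer(f"(?={re.escape(motif)})", seq):
            hits.append((match.start(), motif))
    return sorted(hits)
```

`len(deamidation_sites(seq))` gives the raw count, and filtering the positions by region gives per-CDR counts.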
One of the antibodies in my training set had four NG sites in the heavy chain CDRs. The model flagged it as medium developability despite other favorable properties. When I dug into the literature, I found that this antibody required extensive formulation optimization to achieve acceptable shelf life. The deamidation sites were the culprit. The company ultimately succeeded in developing it, but the timeline was longer than average and required significant process chemistry work.
This is exactly the kind of insight a developability predictor should provide. Not “this molecule is impossible,” but “this molecule will require extra work to stabilize.” That information is valuable at the discovery stage when teams are choosing between multiple candidates. If two antibodies have similar binding affinity and specificity, but one has four NG sites and the other has zero, the choice is clear.
Charge and Expression
The charge story was more complex than I initially expected. I assumed highly positive molecules would have problems (they do), but I didn’t anticipate that the relationship would be non-monotonic. The model learned that moderate positive charge (isoelectric point around 7 to 9) was optimal. Very low pI (acidic antibodies) and very high pI (basic antibodies) both correlated with lower developability.
Why? For acidic antibodies, one issue is that they can interact with negatively charged host cell proteins and nucleic acids during production, reducing expression efficiency. For highly basic antibodies, the problem is different. They bind to negatively charged surfaces (like chromatography resins) very tightly, which makes purification challenging. They also tend to aggregate at neutral pH due to electrostatic attraction to negatively charged impurities.
There’s also a biophysical phenomenon called “salting out” where highly charged proteins precipitate at high salt concentrations. Since downstream processing often involves high-salt buffers (for example, in hydrophobic interaction chromatography or ion-exchange elution), molecules that salt out easily are problematic.
The model captured this by including both the isoelectric point and the net charge at pH 7 as separate features. This allowed it to learn that the optimal zone is pI between 7.5 and 9.0, which is where most successful therapeutic antibodies cluster.
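Both charge features can be estimated from sequence with the Henderson-Hasselbalch equation. The sketch below uses one approximate set of pKa values (an assumption; published tables differ slightly) and treats the input as a single chain with free termini and free cysteines, which is a simplification for a disulfide-bonded antibody. Function names are illustrative.

```python
# Approximate pKa values (assumed; other published sets differ slightly).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6


def net_charge(seq, ph=7.0):
    """Estimated net charge of a single chain at the given pH."""
    seq = seq.upper()
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))   # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))  # C-terminus
    for aa in seq:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """pH where net charge crosses zero, found by bisection.

    Net charge falls monotonically as pH rises, so bisection converges.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def in_preferred_pi_zone(seq, low=7.5, high=9.0):
    """True if the estimated pI falls in the 7.5 to 9.0 zone."""
    return low <= isoelectric_point(seq) <= high
```

Keeping pI and net charge at pH 7 as separate outputs mirrors the two features described above. A tree-based or otherwise non-linear model can then carve out the middle zone on its own.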
The Role of Glycosylation
Every IgG antibody has a conserved N-glycosylation site at asparagine 297 in the Fc region (using EU numbering). This glycosylation is critical for antibody function: it’s required for Fc receptor binding and complement activation. The model learned to expect this site and didn’t penalize molecules for having it.
But additional glycosylation sites, especially in the variable regions, are a red flag. Non-canonical glycosylation introduces heterogeneity (the glycan structures can vary), complicates analytics (you have to characterize more glycoforms), and can affect binding. The model learned that molecules with more than one N-glycosylation site (the NXS/T motif where X is not proline) had lower developability scores.
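The N-X-S/T sequon is easy to scan for with a regular expression. This is a minimal sketch with a hypothetical function name. It only finds candidate motifs and makes no prediction about whether a site is actually occupied.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead lets overlapping sequons (e.g. "NNTS") both be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glyc_sequons(seq):
    """Return 0-based positions of the asparagine in each N-X-S/T sequon."""
    return [match.start() for match in SEQUON.finditer(seq.upper())]
```

Run on a variable region alone, any hit is a non-canonical site worth a closer look. Run on a full heavy chain, the conserved Fc site should appear once, and anything beyond that is the red flag.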
Interestingly, the model also learned to distinguish between glycosylation sites that are likely to be occupied versus those that are in contexts where glycosylation might not occur. This came from the training data—some antibodies have potential N-glycosylation motifs that are buried or in unfavorable structural contexts and don’t actually get glycosylated. The model picked up on sequence patterns around these sites (local hydrophobicity, secondary structure predictions) that correlate with actual glycosylation occupancy.
What About Oxidation?
Methionine and tryptophan are susceptible to oxidation, which is another form of chemical degradation. Oxidation can occur during manufacturing (especially during purification steps) and during storage. Oxidized antibodies can have reduced binding affinity or altered pharmacokinetics.
The model counted methionine and tryptophan residues and incorporated these into a “potential oxidation sites” feature. However, this feature was less predictive than I expected. Oxidation turned out to be less of a developability killer compared to aggregation or deamidation. This might be because oxidation can often be managed through process design (avoiding oxidative conditions) and formulation (adding antioxidants), whereas aggregation and deamidation are more intrinsic to the molecule.
That said, the model did learn that tryptophans in CDRs were more problematic than methionines elsewhere. Tryptophan oxidation can affect binding since the CDRs are the binding interface. This nuance came through in the feature importance—the model weighted CDR tryptophan content more heavily than overall tryptophan content.
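A sketch of how the overall and CDR-specific counts could be computed side by side. The function name and the `cdr_spans` argument are hypothetical, and the CDR boundaries are assumed to come from whatever numbering or annotation scheme is already in use.

```python
def oxidation_features(seq, cdr_spans):
    """Count Met and Trp overall and within the CDRs.

    cdr_spans: list of (start, end) pairs, 0-based and half-open,
    marking where each CDR sits in `seq` (supplied by the caller).
    """
    seq = seq.upper()
    cdr_residues = "".join(seq[start:end] for start, end in cdr_spans)
    return {
        "met_total": seq.count("M"),
        "trp_total": seq.count("W"),
        "met_cdr": cdr_residues.count("M"),
        "trp_cdr": cdr_residues.count("W"),
    }
```

Exposing `trp_cdr` as its own feature is what lets a model weight CDR tryptophans more heavily than tryptophans elsewhere.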
Structural Context: What the Model Doesn’t See
All of these features are sequence-based. The model doesn’t have access to the three-dimensional structure. This is both a limitation and a practical choice. Predicting or modeling 3D structure adds computational cost and complexity. For a first-pass developability screen, sequence features provide good signal.
But there are situations where structure matters and sequence alone is misleading. A hydrophobic patch that looks alarming in sequence might be completely buried in the core of the folded protein and never exposed to solvent. Conversely, two residues that are far apart in sequence might be adjacent in 3D space and create a surface patch that’s problematic.
Incorporating structural predictions is on my roadmap. With tools like AlphaFold now widely available, it’s feasible to predict antibody structures and calculate structure-based features like solvent-accessible surface area of hydrophobic patches, surface charge distribution, and conformational stability. This would improve the model, but it also makes the predictions slower and requires more computational infrastructure.
For now, the sequence-based approach works. The model correctly identifies most problematic molecules, and the false negative rate is acceptable for a screening tool. It’s not perfect, but it’s useful.
Industry Implications
These data-driven insights have practical implications for how discovery teams design antibodies. If the model is correct that hydrophobic CDR patches, deamidation sites, and extreme charge are the main risk factors, then we should actively design away from those properties.
Some of this is already standard practice. Experienced antibody engineers know to avoid NG motifs in CDRs. They know to check for hydrophobic patches. But having quantitative thresholds is useful. The model learned, for example, that aggregation propensity scores above 0.3 (on the scale I calculated) correlated strongly with manufacturing problems. That’s a specific number discovery teams can design to.
There’s also potential to integrate this kind of analysis directly into antibody design workflows. Imagine a tool where scientists input a candidate sequence and get instant developability feedback. “Warning: three NG sites detected in heavy chain CDR2. Consider redesigning.” Or: “Favorable profile detected. Proceed to lead optimization.” This kind of real-time guidance could save months or years by steering teams away from problematic molecules early.
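As a rough illustration of what that feedback loop might look like, here is a sketch that checks only NG motifs per labeled region. The function name, the region labels, and the zero-tolerance default are all hypothetical. A real tool would combine many more checks before calling a profile favorable.

```python
def developability_warnings(regions, max_ng_per_region=0):
    """Return warning messages for regions with too many NG motifs.

    regions: mapping of region label -> amino-acid sequence,
             e.g. {"heavy chain CDR2": "..."} (labels chosen by the caller).
    """
    warnings = []
    for name, seq in regions.items():
        n_ng = seq.upper().count("NG")
        if n_ng > max_ng_per_region:
            warnings.append(
                f"Warning: {n_ng} NG site(s) detected in {name}. "
                "Consider redesigning."
            )
    return warnings or ["No NG motifs flagged in the supplied regions."]
```

Returning plain strings keeps the sketch easy to drop into a notebook or a web form. Structured results (region, motif, position) would be a better fit for anything that feeds another program.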
The other implication is for platform processes. Pharmaceutical companies typically develop standardized manufacturing platforms (cell lines, media, purification schemes) that work well for a broad range of antibodies. The model’s insights suggest which molecular properties make antibodies compatible with platform processes. Molecules with moderate hydrophobicity, moderate charge, and minimal deamidation sites are “platform-friendly.” Molecules outside those ranges will require process customization.
What the Model Got Wrong
No model is perfect, and mine is no exception. There were molecules in the test set that the model misclassified. One antibody with a favorable sequence profile (low aggregation propensity, no deamidation hotspots, moderate charge) was flagged as high developability but actually had expression problems in CHO cells. The issue turned out to be related to a specific VH/VL pairing that caused misfolding. The model didn’t catch this because it analyzes heavy and light chains separately and doesn’t model the interaction between them.
Another misclassification involved a molecule with high predicted developability that ended up having severe aggregation during formulation. The model was right that the sequence looked good, but it didn’t account for the fact that this particular antibody was being formulated at extremely high concentration (>150 mg/mL for subcutaneous administration). At that concentration, even mildly sticky molecules aggregate. The model was trained on standard manufacturing conditions, not high-concentration formulations.
These errors are informative. They tell me where the model’s assumptions break down and what additional features or training data are needed. They also reinforce that this is a screening tool, not a replacement for actual process development. You still need to test molecules experimentally. But the model helps prioritize which molecules to test.
Looking Forward
The patterns I’ve described—aggregation, deamidation, charge—are just the beginning. There’s a long tail of less common but still important features: unusual amino acids, rare PTMs, framework region variations, glycosylation micro-heterogeneity. As I collect more data, the model will learn to recognize these edge cases.
I’m also interested in moving beyond classification (high/medium/low) to regression (predict actual titer, actual aggregation percentage, actual shelf life). That requires more granular training data, which is hard to get from public sources. But some companies are starting to share this kind of data through precompetitive consortia, and that could enable the next generation of models.
For now, the current model does what it’s designed to do: flag sequences that are likely to cause problems and prioritize those that look manufacturable. The features it learned align with what experienced process scientists already know, which is reassuring. But it codifies that knowledge in a scalable, quantitative way. That’s the promise of ML in biopharma—not replacing expertise, but scaling it.
This is part of a series on applying machine learning to antibody developability. For the tool and code, see the GitHub repository.
- Case Study: Predicting Trastuzumab Developability https://kemal.yaylali.uk/case-study-predicting-trastuzumab-developability/