Mohammed AlQuraishi announced OpenFold last week on Twitter. [See also GitHub, openfold.io, BusinessWire] For biology research and biotech, this is huge: the first publicly available reproduction of AlphaFold2, the revolutionary protein-folding AI from DeepMind/Alphabet. OpenFold was trained from scratch, by people who don’t work at Alphabet, in about 100,000 A100 chip-hours (a chip-hour costs about $3, and the model reaches about 90% of its final accuracy in 3,000 A100-hours). The release includes a PyTorch implementation of the trainer and multiple sequence alignments for 400,000 proteins. Compared to AlphaFold2, OpenFold runs on proteins that are 1.7x larger (because it’s more memory efficient), runs twice as fast on short proteins, and is slightly more accurate.
Why does OpenFold matter? First, these innovations are a pretty big deal in themselves. It handles proteins of up to 4,600 residues, compared with a maximum of 2,700 for AlphaFold2. This is a bit like having a 5-foot step ladder instead of a 3-foot step ladder. And some scientists will be very excited to have the new large corpus of multiple sequence alignments, which give OpenFold’s best attempt at aligning related proteins from different organisms.
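As a rough check on the figures quoted above, here is a minimal Python sketch of the arithmetic. The constant and function names are made up for illustration, and the $3 per chip-hour rate is the post's own ballpark; real cloud pricing varies.

```python
# Figures quoted in the post. The $3 rate is a rough estimate, not a price quote.
FULL_TRAINING_A100_HOURS = 100_000
EARLY_A100_HOURS = 3_000          # hours to reach roughly 90% of accuracy
USD_PER_A100_HOUR = 3.0

OPENFOLD_MAX_RESIDUES = 4_600
ALPHAFOLD2_MAX_RESIDUES = 2_700


def training_cost_usd(a100_hours: float, usd_per_hour: float = USD_PER_A100_HOUR) -> float:
    """Compute cost as chip-hours times the hourly rate."""
    return a100_hours * usd_per_hour


full_cost = training_cost_usd(FULL_TRAINING_A100_HOURS)
early_cost = training_cost_usd(EARLY_A100_HOURS)
compute_fraction = EARLY_A100_HOURS / FULL_TRAINING_A100_HOURS
length_ratio = OPENFOLD_MAX_RESIDUES / ALPHAFOLD2_MAX_RESIDUES

print(f"Full training run:        ${full_cost:,.0f}")
print(f"Run to ~90% accuracy:     ${early_cost:,.0f}")
print(f"Share of full compute:    {compute_fraction:.0%}")
print(f"Max protein length ratio: {length_ratio:.2f}x")
```

On these assumptions, a full training run costs about $300,000, while getting most of the way there costs about $9,000, or 3% of the compute. The length ratio comes out to about 1.70, matching the "1.7x larger" claim.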
But there are three bigger reasons OpenFold matters.
Democratizing cutting-edge AI
OpenFold and AlphaFold2 belong to the same class of models as GPT-3, DALL-E 2, and many other cutting-edge AI systems you’ve heard about lately (here's the paper that started it all). Those models have all been trained by organizations with the deepest pockets in the world: DeepMind/Alphabet, Facebook, or OpenAI/Microsoft. OpenFold changes that. To quote openfold.io: “The launching group includes the AlQuraishi lab at Columbia University, Arzeda, Cyrus Biotechnology and Outpace Bio.” AlQuraishi’s tweet thanks a longer list of people including NVIDIA Health, and if NVIDIA is substantially involved, it’s the only participant whose pockets are at all deep.
That matters, because it means we’re going to see a lot more AI innovation in this area (TechCrunch). As more people can help drive this technology, we’ll get more and better discoveries. It especially matters that they’re different groups of people. AlphaFold2 was written by a small group of ML researchers, chemists, and biologists, on a single team, in a research organization within a for-profit tech company. Professor AlQuraishi is a different kind of guy entirely: a leading academic researcher working at the intersection of molecular and systems biology and ML. The people at Arzeda, Cyrus, and Outpace are yet a third group. These biotech startups combine AI with high-throughput screening (biology experiments done in tiny dishes with robot pipettes) to design proteins and discover drugs. Expect them to use their discoveries and platforms in future deals with big pharmaceutical companies and other biotechs.
It seems a good guess that, following the launch of OpenFold, more people from these groups and others will contribute more ideas to how the OpenFold/AlphaFold2 class of models might work, and test them out. We can hope that another three biotechs and another three university labs will get involved. And then another ten. It seems likely that OpenFold will be retrained from scratch again and again.
Strengthening medicine as a science
At the deepest level, AlphaFold2 and OpenFold matter because they allow scientists to understand the human body – and how drugs work on it – in much greater detail.
You probably know that medical understanding is extremely superficial. Until about 100 years ago, most of what we knew consisted of simple things, like the fact that if part of your body had a spreading problem, we could cut it off. We also knew how a lot of drugs worked, but again only at the simplest possible level: we knew what their final effect was (he’s cured!), and perhaps one intermediate step.
In the 1950s, we began to figure out what DNA, RNA, and proteins were, and how they worked. Starting about 20 years ago, the human genome was sequenced, as were the genes of other organisms and the proteins that make up our bodies. In many cases, and with much effort, scientists also figured out how proteins fold into their working structures. Then, in December 2020 and July 2021, AlphaFold2 suddenly brought that problem much closer to a solution, with accurate software predictions of how many proteins fold.
There is, however, much more to learn. These systems form very long sequences of cause and effect. Each of the roughly 20,000 genes in your DNA is unlocked by a multiple-lock-and-key system called a promoter. The unlocked gene produces RNA, which produces proteins, which fold into particular shapes, which then undergo a series of interactions with other proteins to make our bodies work, including assembling into complex structures like muscle, heart, and brain.
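The first links in that chain (gene to RNA to protein sequence) are simple enough to sketch in a few lines of Python. This is a toy illustration only: the gene sequence is made up, the codon table holds just a handful of entries from the standard genetic code, and transcription is simplified to swapping T for U on the coding strand. The later links, folding and protein-protein interaction, are the hard part that AlphaFold2 and OpenFold address, and nothing here models them.

```python
# A few entries from the standard genetic code (not the full 64-codon table).
CODON_TABLE = {
    "AUG": "M",   # methionine, also the usual start codon
    "GCC": "A",   # alanine
    "UGG": "W",   # tryptophan
    "AAA": "K",   # lysine
    "UAA": None,  # stop codons carry no amino acid
    "UAG": None,
    "UGA": None,
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time until a stop codon appears."""
    residues = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this toy table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon: the protein chain ends here
            break
        residues.append(amino_acid)
    return "".join(residues)


toy_gene = "ATGGCCTGGAAATAA"  # hypothetical sequence, chosen to fit the toy table
mrna = transcribe(toy_gene)
protein = translate(mrna)
print(mrna)     # AUGGCCUGGAAAUAA
print(protein)  # MAWK
```

The output is a string of one-letter amino acid codes. A structure predictor takes a sequence like this (for real proteins, hundreds or thousands of residues long) and predicts the three-dimensional shape it folds into.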
These protein interaction sequences depend crucially on local factors: what other proteins and minerals are present in this part of the cell (or this part of the extracellular area)? This proteome can be thought of as a protein mixture or cocktail. It varies from tissue to tissue, from cell to cell, and from place to place within the cell, which is a highly structured environment. Proteins are of seven major types (antibodies, contractile proteins, enzymes, hormonal proteins, structural proteins, storage proteins, and transport proteins), and as the presence of enzymes and hormones suggests, an awful lot of them are there to react with each other so that the resulting products can interact with yet more proteins. An awful lot of them are there to be assembled into large, complex structures like muscles, livers, and kidneys. Many of the structures are not static. Over time, cells within them grow old, die, and are replaced with newly built cells. And remember those promoters that enabled protein production? The keys for those are proteins.
Transport and storage proteins extend the causal sequences by moving proteins around the body: from place to place within cells, to cell membranes, in and out of cells, into the bloodstream, out of the bloodstream in a different part of the body, and through the blood vessel walls into other tissues. What do the proteins do there? Interact with other proteins. Most bodily functions we can name are examples: using oxygen, using food, building muscle, thinking.
Finally, proteins build up into larger systems. Important systems include all the organs and tissue classes. Especially important are circulation, the brain, the immune system, aging, and cancer. Proteins are triggers for important actions these systems take (good and bad), like neural firing, inflammation, killing viruses, muscle loss, and tumor growth.
Curing people
Putting it all together, why do AlphaFold2 and OpenFold matter? Because they predict protein folding, and thus how proteins interact with other proteins and minerals, and build up into larger structures, and thus show us how the body functions and malfunctions specifically. They give us a better scientific understanding of function and disease. And this makes it a lot more likely that drug developers like Arzeda, Cyrus, and Outpace, and thousands of others, can tell whether their drug candidates will work, and whether they’ll have side effects, using science and calculation, before they begin years of expensive clinical trials.
Drug companies and biotechs benefit from having this much stronger base of science behind them. This benefit is the ongoing story of medicine in recent decades. Without the basic science, we wouldn’t have the Covid vaccines.
Biotech startups are also running additional processes in parallel, using machine learning to predict the best answers to their drug discovery questions. How will each of 10,000 candidate drugs interact with the target proteins in the body? Will the drug candidate get past the immune system and the cell membrane to get to the target protein? What else will it interact with, and what side effects will that cause? Biotechs are building machine learning systems and combining them with high-throughput screening to answer these questions and many others on the long road to effective drugs. OpenFold is the next step forward in this process. One small step for Mohammed AlQuraishi, Arzeda, Cyrus, and Outpace – a giant leap for biology and medicine as a whole.