I work in bioinformatics and this is the kind of thing I keep trying to communicate to people in the field. Yes, these AI tools (like AlphaFold) are amazing, but if there’s a significant gap in their training data, the AI is going to have that gap too. Most of the structures in the Protein Data Bank were solved via X-ray crystallography, which isn’t great for studying highly flexible or disordered proteins.
Yes. My understanding (minimally informed, from a single class) is that it sort of depends on the problem too. Perhaps, in looking at all the data on proteins, the neural network might notice a pattern in protein folding that is applicable to the tweaked problem. Of course, there is no guarantee that such a generally applicable rule exists. And even if it does, the net might not discover it before overtraining occurs.
It sounds like your memory from that class is pretty good, and you’re right, it depends on what we’re trying to solve. The problem in this case is protein folding, so if a neural network spotted a pattern, that’s what we want. Figuring out the generalisable “rules” (i.e. why proteins fold a certain way) isn’t what we’re trying to do with these tools (yet); we’re just on the pattern-finding side. That’s why the developments from AlphaFold are so incredible, but it’s still limited.
It feels like my job for the next few years is going to be “professional killjoy”, because I get people’s excitement, but we can’t properly use these tools if we don’t acknowledge their limitations. If we did acknowledge them, the tools would actually become more powerful, because we could develop new and different tools, or go gather experimental data to validate some of the generated structures (or to round out the training data).
I don’t know if this would count as overtraining, because it has so far performed amazingly on structures that are similar to the training data but not in the training data. The problem is we don’t have much training data for the tricky parts. That’s fine, it just means it won’t help us learn much about those areas, but headlines like “AlphaFold predicts the structures of all human proteins” are so misleading.
Ah, but where do you find the training set of all of the human-written good commit messages? 😃
Came to say this. Take my upvote.
It’s articles like this that do my head in https://www.scientificamerican.com/article/one-of-the-biggest-problems-in-biology-has-finally-been-solved/