------------------------------
CRITERIA SATISFIED BY THE WORK

(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.

(F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered.

(H) The result holds its own or wins a regulated competition involving human contestants (in the form of either live human players or human-written computer programs).

------------------------------------------
STATEMENT OF HUMAN-COMPETITIVENESS (E, F & H)

Attempts to understand the mechanism of protein folding were pioneered by Christian Anfinsen almost forty years ago. From the very beginning, the main limitation of this research has been the inaccuracy of molecular dynamics simulations. Although the level of detail in protein structure models has increased along with the growth in computer performance, molecular simulation at the atomic level is even nowadays possible only with the use of large distributed systems such as Folding@Home, the most powerful distributed computing system on Earth, which operates at performance levels reaching 5 petaFLOPS. Once the community realised that simulation was not yet practical, the scientific effort shifted towards prediction, where simplified models and expert-designed statistical potentials are used. Recent progress in the field of protein structure prediction was achieved by using machine learning techniques to solve prediction sub-problems, e.g. solvent accessibility or contact number prediction, which provide better building blocks for the statistical potentials. The formulation of the energy function, however, has remained unchanged: it is still a linear combination of potentials, as in folding, even though the statistical potentials do not represent physical energy.
In this work we use selected statistical potentials used by I-TASSER, the best predictor of the last three editions of the CASP experiment, and we challenge the human-made energy function used there. By using genetic programming we allow a free combination of the energy terms to be evolved, and we compare its quality against the I-TASSER approach, which is a weighted sum of terms where the weights are chosen by a human expert using a non-linear numerical optimisation method as a decision support tool. The quality of the evolved energy functions is found to be better, and therefore we believe the work satisfies criteria E, F and H.

----------------------------------
WHY THIS WORK IS WORTH CONSIDERING

The work discussed here deals with the important issue of energy function design. It proposes a novel use of an automated method to discover the best combination of the energy terms, instead of the simple weighted sum with hand-picked coefficients used in the state-of-the-art predictors. The results indicate that the new approach is more appropriate and leads to higher quality energy functions. The formulation of the energy function is a key element of a protein structure predictor because it drives the search for native-like structures, so a better function also means a higher quality of prediction. Good-quality structural models are very important in protein research because, with the advances in DNA sequencing techniques, the gap between the number of known protein sequences and the number of known structures has been growing; currently only 0.2% of sequences have a solved structure. We therefore think this work should be considered the best, not only because it presents an interesting human-competitive improvement to the solution of a long-standing problem, but also because of the importance and the long-term effects in protein science that the improvement in prediction quality could bring.
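
The contrast drawn above between a weighted sum and a free combination of energy terms can be illustrated with a minimal Python sketch. It is not the implementation used in the work: the term names (t1, t2, t3), their values, the weights and the operator set (add, sub, mul) are all hypothetical, chosen only to show that a weighted sum is one special case among the expression trees that genetic programming can explore.

```python
import operator

# Hypothetical values of three energy terms for one candidate structure.
TERMS = {"t1": 2.0, "t2": -1.5, "t3": 0.5}

# Hypothetical hand-picked weights for the linear energy function.
WEIGHTS = {"t1": 1.0, "t2": 2.0, "t3": 4.0}

# Assumed operator set; a real GP system may well use a different one.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def weighted_sum(terms, weights):
    """Linear energy: each term is multiplied by its weight and summed."""
    return sum(weights[name] * value for name, value in terms.items())


def evaluate(node, terms):
    """Evaluate an expression tree given as nested (op, left, right) tuples."""
    if isinstance(node, str):            # leaf: name of an energy term
        return terms[node]
    if isinstance(node, (int, float)):   # leaf: numeric constant
        return float(node)
    op, left, right = node
    return OPS[op](evaluate(left, terms), evaluate(right, terms))


# The weighted sum written as a tree: 1.0*t1 + (2.0*t2 + 4.0*t3).
LINEAR_TREE = ("add", ("mul", 1.0, "t1"),
               ("add", ("mul", 2.0, "t2"), ("mul", 4.0, "t3")))

# A tree that no choice of weights can express: t1*t2 + t3.
FREE_TREE = ("add", ("mul", "t1", "t2"), "t3")

if __name__ == "__main__":
    print(weighted_sum(TERMS, WEIGHTS))   # linear energy
    print(evaluate(LINEAR_TREE, TERMS))   # same value, expressed as a tree
    print(evaluate(FREE_TREE, TERMS))     # a non-linear combination
```

In this representation an evolutionary search would modify the trees themselves, whereas the weighted-sum approach only changes the numbers in WEIGHTS.
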
This is the first time an automated approach has been proposed, after many years of research in the protein structure prediction field involving a large community of experimenters gathered around the CASP experiments, which have been held regularly since 1994. Given that the results are competitive with the approach used in the state-of-the-art I-TASSER predictor, we find them extremely encouraging. Despite all the gradual improvements made in predictors over recent years, and despite the vast amount of research dedicated to the optimisation of structures, the changes in the design of the energy function have so far been limited to the introduction of new potentials. However, it is the energy function that defines the search landscape where the best structure is to be found, and its smoothness is essential for efficient prediction. Without such functions, the only resort is a random walk over a rugged landscape, which requires vast resources, as in Folding@Home. The design of energy functions for protein structure prediction is also a new and truly difficult challenge for GP. With that in mind, we have made the input data used in our experiments available online (with detailed annotations) for everyone who would like to take on this challenge, and we would like to encourage the community to engage in solving this interesting problem:

http://www.infobiotics.org/gpchallange/

------------
PUBLICATIONS

P. Widera, J.M. Garibaldi, N. Krasnogor,
"GP challenge: evolving the energy function for protein structure prediction",
Genetic Programming and Evolvable Machines 11(1), p.61-88, 2010
DOI: 10.1007/s10710-009-9087-0
publisher's link: http://dx.doi.org/10.1007/s10710-009-9087-0

P. Widera,
"Automated design of energy functions for protein structure prediction by means of genetic programming and improved structure similarity assessment",
PhD Thesis, University of Nottingham, UK, 2010

P. Widera, J.M. Garibaldi, N. Krasnogor,
"Evolutionary design of the energy function for protein structure prediction",
IEEE Congress on Evolutionary Computation, p.1305-1312, Trondheim, Norway, May 2009
DOI: 10.1109/CEC.2009.4983095
publisher's link: http://dx.doi.org/10.1109/CEC.2009.4983095

---------
ABSTRACTS

1) "GP challenge: evolving the energy function for protein structure prediction"

One of the key elements in protein structure prediction is the ability to distinguish between good and bad candidate structures. This distinction is made by estimation of the structure energy. The energy function used in the best state-of-the-art automatic predictors competing in the most recent CASP (Critical Assessment of Techniques for Protein Structure Prediction) experiment is defined as a weighted sum of a set of energy terms designed by experts. We hypothesised that combining these terms more freely will improve the prediction quality. To test this hypothesis, we designed a genetic programming algorithm to evolve the protein energy function. We compared the predictive power of the best evolved function and a linear combination of energy terms featuring weights optimised by the Nelder-Mead algorithm. The GP based optimisation outperformed the optimised linear function. We have made the data used in our experiments publicly available in order to encourage others to further investigate this challenging problem by using GP and other methods, and to attempt to improve on the results presented here.

2) "Automated design of energy functions for protein structure prediction by means of genetic programming and improved structure similarity assessment"

The process of protein structure prediction is a crucial part of understanding the function of the building blocks of life. It is based on the approximation of a protein free energy that is used to guide the search through the space of protein structures towards the thermodynamic equilibrium of the native state.
A function that gives a good approximation of the protein free energy should be able to estimate the structural distance of the evaluated candidate structure to the protein native state. This correlation between the energy and the similarity to the native is the key to high quality predictions. State-of-the-art protein structure prediction methods use very simple techniques to design such energy functions. The individual components of the energy functions are created by human experts with the use of statistical analysis of common structural patterns that occur in the known native structures. The energy function itself is then defined as a simple weighted sum of these components. Exact values of the weights are set in the process of maximisation of the correlation between the energy and the similarity to the native measured by a root mean square deviation between coordinates of the protein backbone. In this dissertation I argue that this process is oversimplified and could be improved on at least two levels. Firstly, a more complex functional combination of the energy components might be able to reflect the similarity more accurately and thus improve the prediction quality. Secondly, a more robust similarity measure that combines different notions of the protein structural similarity might provide a much more realistic baseline for the energy function optimisation. To test these two hypotheses I have proposed a novel approach to the design of energy functions for protein structure prediction using a genetic programming algorithm to evolve the energy functions and a structural similarity consensus to provide a reference similarity measure. The best evolved energy functions were found to reflect the similarity to the native better than the optimised weighted sum of terms, thereby opening a new interesting area of research for the machine learning techniques.
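
The weight-setting objective described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate structures (DECOYS) and their values are invented, Pearson correlation is used as one possible correlation measure, and the actual search over weights (Nelder-Mead in the comparison reported in the first abstract) is left out, so only the quantity being optimised is shown.

```python
import math

# Invented data: each candidate structure has two energy term values and
# an RMSD to the native structure. None of these numbers come from the work.
DECOYS = [
    {"terms": (1.0, 3.0), "rmsd": 2.0},
    {"terms": (2.0, 1.0), "rmsd": 4.0},
    {"terms": (3.0, 2.0), "rmsd": 6.0},
    {"terms": (4.0, 0.0), "rmsd": 8.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def energy(term_values, weights):
    """Weighted sum of energy terms for a single candidate structure."""
    return sum(w * t for w, t in zip(weights, term_values))


def objective(weights, decoys):
    """Correlation between energy and RMSD over all candidate structures.

    A higher value means that low energy goes together with low RMSD,
    i.e. the energy ranks candidates by their distance to the native.
    """
    energies = [energy(d["terms"], weights) for d in decoys]
    rmsds = [d["rmsd"] for d in decoys]
    return pearson(energies, rmsds)


if __name__ == "__main__":
    # Two hypothetical weight vectors: the first follows RMSD closely,
    # the second does not.
    print(objective((1.0, 0.0), DECOYS))
    print(objective((0.0, 1.0), DECOYS))
```

A weight optimiser would call objective repeatedly with different weight vectors and keep the one with the highest correlation; an evolved function would be scored in the same way, with energy replaced by the evolved expression.
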

3) "Evolutionary design of the energy function for protein structure prediction"

Automatic protein structure predictors use the notion of energy to guide the search towards good candidate structures. The energy functions used by the state-of-the-art predictors are defined as a linear combination of several energy terms designed by human experts. We hypothesised that the energy based guidance could be more accurate if the terms were combined more freely. To test this hypothesis, we designed a genetic programming algorithm to evolve the protein energy function. Using several different fitness functions we examined the potential of the evolutionary approach on a set of candidate structures generated during the protein structure prediction process. Although our algorithms were able to improve over the random walk, the fitness of the best individuals was far from the optimum. We discuss the shortcomings of our initial algorithm design and the possible directions for further research.

----------------------------
AUTHORS' CONTACT INFORMATION

Natalio Krasnogor (corresponding author),
School of Computer Science
University of Nottingham
Jubilee Campus, Wollaton Road
Nottingham, NG8 1BB, UK
e-mail: nxk@cs.nott.ac.uk
phone: +44 115 8467592

Paweł Widera,
School of Computer Science
University of Nottingham
Jubilee Campus, Wollaton Road
Nottingham, NG8 1BB, UK
e-mail: plw@cs.nott.ac.uk
phone: +44 115 9514234

Jonathan Garibaldi,
School of Computer Science
University of Nottingham
Jubilee Campus, Wollaton Road
Nottingham, NG8 1BB, UK
e-mail: jmg@cs.nott.ac.uk
phone: +44 115 9514216

The prize money, if any, is to be divided equally among the co-authors.