KP SYMBOLIC REGRESSION HUMIES Entry for the 2014 "Humies" Awards for Human-Competitive Results -------------------------------------------------------------------------- (1) PAPER TITLE -------------------------------------------------------------------------- Kaizen Programming ------------------------------------------------------------------------- (2) AUTHORS ------------------------------------------------------------------------- Vinícius Veloso de Melo 330, Talim, Vila Nair São José dos Campos, SP, Brazil CEP vinicius.melo@unifesp.br ------------------------------------------------------------------------- (3) CORRESPONDING AUTHOR ------------------------------------------------------------------------- Vinícius Veloso de Melo ------------------------------------------------------------------------- (4) PAPER ABSTRACT ------------------------------------------------------------------------- This paper presents Kaizen Programming, an evolutionary tool based on the concepts of Continuous Improvement from Kaizen Japanese methodology. One may see Kaizen Programming as a new paradigm since, as opposed to classical evolutionary algorithms where individuals are complete solutions, in Kaizen Programming each expert proposes an idea to solve part of the problem, thus a solution is composed of all ideas together. Consequently, evolution becomes a collaborative approach instead of an egocentric one. An idea's quality (analog to an individual's fitness) is not how good it fits the data, but a measurement of its contribution to the solution, which improves the knowledge about the problem. Differently from evolutionary algorithms that simply perform trial-and-error search, one can determine, exactly, parts of the solution that should be removed or improved. That property results in the reduction in bloat, number of function evaluations, and computing time. Even more important, the Kaizen Programming tool, proposed to solve symbolic regression problems, builds the solutions as linear regression models - not linear in the variables, but linear in the parameters, thus all properties and characteristics of such statistical tool are valid. Experiments on benchmark functions proposed in the literature show that Kaizen Programming easily outperforms Genetic Programming and other methods, providing high quality solutions for both training and testing sets while requiring a small number of function evaluations. ------------------------------------------------------------------------- (5) HUMAN COMPETITIVE CRITERIA ------------------------------------------------------------------------- A list containing one or more of the eight letters (A, B, C, D, E, F, G, or H) that correspond to the criteria (see above) that the author claims that the work satisfies, (B) The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal. (D) The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created. (F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered. (H) The result holds its own or wins a regulated competition involving human contestants (in the form of either live human players or human-written computer programs). ------------------------------------------------------------------------- (6) WHY CRITERIA ARE SATISFIED ------------------------------------------------------------------------- GP-inspired methods search for high-quality solutions in a competitive manner, thus it is the survival of the fittest. On the other hand, KP combines partial solutions of different qualities to achive the final high-quality solutions, thus it is a cooperation. This way, the user can reduce the overfitting of a KP solution by removing the low-quality partial solutions. That kind of adjustment of the final solution is not possible for GP-inspired methods. KP, used in this work to solve symbolic regression, is better than other methods for several reasons: (1) Usability and interpretability: (a) KP improves the manual modeling of a linear regression by automatically building the explanatory variables, thus being able to generate complex expressions unthinkable by humans; (b) The linear regression model, created by KP, that employs several explanatory variables can be easier to interpret/understand than a single huge expression generated by GP; (c) Differently from tools that use predefined templates for explanatory variables, KP builds them on the fly; (d) Over-fitting can be treated by removing the less important ideas (expressions) like a pruning mechanism; (e) The user can evaluate an individual idea (partial solution) to analyze its behavior / distribution. (2) Search quality: KP results significantly surpassed the published methods by three distinct measures (success rate, mean fitness, number of nodes evaluations): a) The previous best average success rate was 46.6%, whereas KP reached 100%; b) KP achieved better mean best fitness values than the others methods in the comparison; c) The previous best number of nodes evaluation was 1 million, whereas KP required an average of 30,000. Thus, the computing time can be considerably smaller because of the required number of nodes evaluation. ------------------------------------------------------------------------- (7) FULL PAPER CITATION ------------------------------------------------------------------------- de Melo, V. V. (2014), "Kaizen Programming," In: Proceedings of the 2014 Genetic and Evolutionary Computation Conference (GECCO), Vancouver, Canada, 2014. (accepted paper, to appear). ------------------------------------------------------------------------- (8) STATEMENT ON PRIZE MONEY ------------------------------------------------------------------------- Prize money, if any, is to be paid to the corresponding author, Vinícius Veloso de Melo. ------------------------------------------------------------------------- (9) STATEMENT ON WHY THIS ENTRY IS THE BEST ------------------------------------------------------------------------- KP is a new methodology for evolving solutions for computational problems. The strategy of composing a complete solution based on partial solutions allows the calculation of the importance of these partial solutions. Therefore, the proposed method can easily identify which partial solutions can be discarded. Moreover, the importance of a solution is not necessarily how good it solves part of the problem. In fact, the importance is how much it contributes to improve the complete solution. For instance, two high-fitness partial solutions (if analyzed separately) compete against each other, and probably will not be included in the same final complete solution. On the other hand, KP can generate a low-fitness partial solution to be used in conjunction with a high-fitness partial solution, resulting in an optimal complete solution. Therefore, when compared to how humans search for solutions, KP works more similarly than GP and GP-inspired methods. However, KP can be much more efficient than humans in the search process. Given that KP was used in this work to generate linear regression models (linear in the parameters, not in the variables) for symbolic regression problems, that many statistical tools can be used to analyze/improve the linear model, and that that it achieved 100% success in the benchmarks (the best results in the literature) makes KP not only human competitive across many domains but opens the possibility for a new research area.