Measuring Bloat, Overfitting and Functional Complexity in Genetic Programming



Recent contributions clearly show that eliminating bloat in a genetic programming system does not necessarily eliminate overfitting and vice-versa. This fact seems to contradict a common agreement of many researchers known as the minimum description length principle, which states that the best model is the one that minimizes the amount of information needed to encode it. Another common agreement is that overfitting should be, in some sense, related to the functional complexity of the model. The goal of this paper is to define three measures to respectively quantify bloat, overfitting and functional complexity of solutions and show their suitability on a set of test problems including a simple bidimensional symbolic regression test function and two real-life multidimensional regression problems. The experimental results are encouraging and should pave the way to further investigation. Advantages and drawbacks of the proposed measures are discussed, and ways to improve them are suggested. In the future, these measures should be useful to study and better understand the relationship between bloat, overfitting and functional complexity of solutions.


Genetic Programming, Bloat, Overfitting, Complexity


Genetic Programming

Related Project

EnviGP - Improving Genetic Programming for the Environment and Other Applications


2010 Genetic and Evolutionary Computation Conference (GECCO 2010), July 2010

Cited by

Year 2012 : 4 citations

 Bassett JK, Kamath U, De Jong, KA (2012). A new methodology for the GP theory toolbox. In Proc Genetic and Evolutionary Computation Conference (GECCO 2012), 719-726.

 Kamath U, Bassett JK, De Jong, KA (2012). Using Quantitative Genetics and Phenotypic Traits in Genetic Programming.

 Dabhi VK, Chaudhary S (2012). A Survey on Techniques of Improving Generalization Ability of Genetic Programming Solutions. arXiv preprint arXiv:1211.1119.

 Nguyen TH, Nguyen XH, McKay B, Nguyen QU (2012). Where should we stop? an investigation on early stopping for GP learning. Proceedings of the 9th international conference on Simulated Evolution and Learning (SEAL-2012), 391-399.

Year 2011 : 7 citations

 Kronberger G, Kommenda M, Affenzeller M (2011). Overfitting detection and adaptive covariant parsimony pressure for symbolic regression. In Proc Genetic and Evolutionary Computation Conference (GECCO 2011), 631–638.

 Trujillo L, Martinez Y, Galvan-Lopez E, Legrand P (2011). Predicting problem difficulty for genetic programming applied to data classification. In Proc Genetic and Evolutionary Computation Conference (GECCO 2011), 1355–1362.

 Trujillo L, Martinez Y, Melin P (2011). Estimating Classifier Performance with Genetic Programming. In Proc European Conference on Genetic Programming (EuroGP 2011), 274-285.

 Fitzgerald J, Ryan C (2011). Validation Sets for Evolutionary Curtailment with Improved Generalisation. In Convergence and Hybrid Information Technology, 282-289.

 Nguyen QU (2011). Examining Semantic Diversity and Semantic Locality of Operators in Genetic Programming. PhD Thesis. School of Computer Science and Informatics, University College Dublin.

 Hien NT, Hoai NX, McKay B (2011). A Study on Genetic Programming with Layered Learning and Incremental Sampling. In IEEE Congress on Evolutionary Computation (CEC 2011), 1179 – 1185.

 Azad RMA, Ryan C (2011). Variance based selection to improve test set performance in genetic programming. In Genetic and Evolutionary Computation Conference (GECCO 2011), 1315-1322.