We present an empirical analysis of the effects of incorporating novelty-based fitness (phenotypic behavioral diversity) into Genetic Programming with respect to training, test and generalization performance. Three novelty-based approaches are considered: novelty comparison against a finite archive of behavioral archetypes, novelty comparison against all previously seen behaviors, and a simple linear combination of the first method with a standard fitness measure. Performance is evaluated on the Santa Fe Trail, a well known GP benchmark selected for its deceptiveness and established generalization test procedures. Results are compared to a standard quality-based fitness function (count of food eaten). Ultimately, the quality style objective provided better overall performance, however, solutions identified under novelty based fitness functions generally provided much better test performance than their corresponding training performance. This is interpreted as representing a requirement for layered learning/ symbiosis when assuming novelty based fitness functions in order to more quickly achieve the integration of diverse behaviors into a single cohesive strategy.