FIBR: BeeSpace
An Interactive Environment for Analyzing
Nature and Nurture in Societal Roles

Bruce Schatz (Information Science), Gene Robinson (Entomology), ChengXiang Zhai (Computer Science), Sandra Rodriguez-Zas (Animal Science), Bertram Bruce (Education) University of Illinois at Urbana-Champaign
Susan Fahrbach (Biology), Wake Forest University, Winston-Salem, North Carolina

Abstract

Honeybees are individuals within a society. They are individually complex enough that their behavior takes on different roles at different times in their lives. But the timing of the roles is not determined solely by nature (genetics). The timing of, and even the occurrence of, societal roles is also determined by nurture (environment). Honeybees are therefore an excellent model for studying how nature and nurture combine to produce different roles and different behaviors across the individuals within a society.
Our goal is simple but ambitious. We will functionally analyze all the roles of a honeybee within her society. It is possible today, due to choice of an appropriate model and to feasible technology on the biology and on the informatics ends. We will create a master list of roles including genetic and environmental variations. This list will be based on extensive research into social insects, but generically specified for potential comparisons to higher organisms.
Next, we will develop databases completely covering all of the necessary features for functional analysis of these roles. These databases will be generated using wet-lab technologies (biology) for genetics, including expression products for normal behaviors, and using dry-lab technologies (informatics) for environment, including natural history of normal behaviors. Statistical techniques will be used to identify differentiating genes for each behavior on the genetics end (from microarray expressions) and to identify differentiating phrases for each behavior on the environment end (from biological literature).
Finally, we will develop a comprehensive interactive software environment for functional analysis of our model organism. For any role, corresponding functional phrases from textual sources for genetics (such as gene descriptions from model organisms) and for environment (such as scientific articles from biological literature) can be located via interactive selection through multiple sources.


Understanding Nature and Nurture in Societal Roles
Individuals function in society; they assume different roles as they live and die. Understanding societal roles requires observing individuals in the environments they interact in and describing the underlying mechanisms. We propose a new approach to understanding social behavior, using new technologies in biology and informatics that integrate molecular description with information and concepts from behavior and evolution.
Now, after a half-century of breakthroughs in learning how to decode and catalog genes, tools are finally at hand to subject genes and behavior to rational analysis. The success of molecular biology makes possible a new form of behavioral analysis, using the genome to investigate the relationship between genes and behavior. The first wave of genome sequencing focused on molecular function, for organisms selected for genetic tractability in the laboratory.
To understand social behavior, we need to move beyond the laboratory to more natural environments. We need to investigate biological processes for organisms showing normal behavior as they interact with their environment, while striving to achieve the same depth of description in molecular function attained with the model genetic animals. Moving from laboratory conditions to natural everyday conditions is necessary to understand social behavior.
Such an approach is only just possible today because of advances in technology. We propose a hero experiment today that will show what will be possible for everyday biology tomorrow: in the foreseeable future, five to ten years out and after the end of the project, molecular analysis will be routine in all biology labs. In the future, genome sequencing and expression profiling will be performed routinely for individual organisms in particular experimental situations. But how will biologists relate this molecular information to normal behavior, and understand it from the diverse perspectives that inform behavior?
We propose to develop BeeSpace, an interactive environment to support functional analysis of normal behavior at the molecular level. On the biology side, we will develop brain gene expression profiles for all normal behaviors of the honey bee. On the informatics side, we will use all functional sources from databases and literature to integrate molecular description with information and concepts from behavior and evolution. When biotechnology enables routine sequencing and expression analysis, bioinformatics must enable routine functional analysis to interpret the experimental data from a broad integrative perspective.
The process of explanation and the system of informatics thus developed will be applicable to many other models and organisms across biology. This is because the tools will use the full range of available functional information, both in the scientific literature from the many special communities and in the gene descriptions from the few model organisms. The BeeSpace will be the model for the BioSpace, universal infrastructure to navigate all biological knowledge.


The Model Organism: Apis mellifera, the Honeybee
Honeybees are individuals within a society. They are individually complex enough that their behavior takes on different roles at different times in their lives. But the timing of the roles is not determined solely by nature (genetics). The timing of, and even the occurrence of, societal roles is also determined by nurture (environment). They are therefore an excellent model for studying how nature and nurture combine to produce different roles and different behaviors across the individuals within a society.
Unlike simpler organisms, honeybees exist within a society, and each responds to the society's needs as well as its own. For example, a female honeybee serves as a nurse within the hive when young, feeding the offspring, and as a forager outside the hive when mature, gathering the food. The change from nurse to forager is a significant lifestyle change. This change typically occurs around two weeks of age, with some individual variation. But conditions within the hive, particularly the amount of food supply (honey in the cells), significantly affect the number of foragers and the timing of their transformations. If there is not enough food, precocious foragers emerge earlier than normal. That is, the timing of the nurse to forager transition depends on nature and on nurture, on the genetics of the individual organism and on the food supply of the societal environment.
The effectiveness of the honeybee as a model is that it is just the right size to perform a complete functional analysis at the present level of technology, both biological and informational. The unity of nature and nurture, genetics and environment, can be captured and functionally analyzed by using the brain as a window into the environment. The genome is sequenced, but so are those of other organisms, smaller and larger, simpler and more complex. What matters is that having the sequence means that the expression products of all of the genes can be captured, thus capturing the genetics of any behavior transitions. In particular, the expression products of all of the genes in the brain can be captured, thus capturing the environments of any behavior transitions.
The efficiency of the honeybee as a model is that both the genetics and the environment can be controlled, but within natural rather than within artificial conditions. That is, nature itself can become the laboratory, varying conditions to observe behaviors while maintaining only normal conditions that commonly occur in the wild. The genetic control is due to the logical structure within the society, where all individuals are the progeny of a single queen. The environmental control is due to the physical structure within the society, where all individuals base their behaviors within the hive. Thus genetics can be controlled by varying the colony (features of the queen) while environment can be controlled by varying the base (features of the hive).
To functionally analyze all the roles of a honeybee within its society, all the genetic variation and all the environmental variation must be captured. Another unique feature of the honeybee is the enormous detail of knowledge about its natural history. The behaviors of social insects in general, and of the honeybee in particular, are known at a level of detail unmatched by any other organism. The behavioral observations are made on normal honeybees during normal behavior in the wild, thus recording average functioning rather than extreme functioning.
The special relationship of honeybees to humans during several thousand years of beekeeping, but without any special domestication to change normal behavior, is absolutely unique. The bee is also at the cusp of complexity, where closely related families, such as flies and wasps, are essentially similar but not eusocial. Our choice as a model organism for functional analysis of normal behavior at the molecular level is Apis mellifera, the Western honey bee. Honey bees live in societies that rival our own in complexity, internal cohesion, and success in dealing with the myriad challenges posed by social life, including those related to communication, aging, social dysfunction and infectious disease.
This organism was one of the five chosen by the National Human Genome Research Institute for genome sequencing in the first competition held after the completion of the genomes of the model organisms of classical genetics and the human. Our biology lead coPI Robinson was the lead author of the proposal for the honey bee genome and coordinates the project together with sequencers at the Human Genome Sequencing Center at Baylor College of Medicine. http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/HoneyBee_Genome.pdf
We believe that the honey bee is the appropriate model for this project [Robinson,2002b]. (1) Because the bee is an insect, its behavior is highly stereotyped and rigorously assayable; complete behavioral maturation occurs within a relatively short lifespan (4-6 weeks). (2) Owing to a long and rich association with humans, comprehensive knowledge of bee behavior [Winston,1987] provides a firm foundation upon which to build analyses that integrate molecular biology, neuroscience, ecology, sociobiology, and evolutionary biology. (3) Methods of raising and manipulating bees are well established, due to the ancient and close association between bees and humans for honey production. These techniques enable control of both genetic and environmental parameters. (4) Bees live in large colonies that are maintained economically, making it relatively easy to obtain robust sample sizes. (5) Bees exhibit the haplodiploid form of sex determination. Using standard technology, it is relatively easy to collect large numbers of individuals for analyses that share as much as 75% of their genome with each other. (6) Bees live in tightly structured societies in which an individual's physiological and behavioral status is highly dependent upon communication with other society members. (7) Bees display a pattern of behavioral maturation that is "vertebrate-like" in richness and complexity, proceeding from hive tasks such as nursing to the cognitively demanding task of foraging. This behavioral development is controlled by neural and endocrine mechanisms similar to those of vertebrates.
The honey bee A. mellifera has just the right complexity to successfully demonstrate that normal behavior can be molecularly analyzed. The nematode worm C. elegans was the key model that opened the gates for genome-enabled analysis of metabolic function. We believe that the honey bee is the key model for genome-enabled analysis of social behavior.
Our informatics lead PI Schatz was the PI of the flagship project in the NSF National Collaboratory Program that built the Worm Community System (WCS), which helped push the interactive analysis of C. elegans in the pre-genome, pre-web era. He subsequently served as PI of the flagship project in the NSF Digital Libraries Program [Schatz,1999] that built the Interspace Prototype, which developed the distributed software technologies necessary to support interactive functional analysis in the post-genome post-web era [Schatz,1997]. BeeSpace for bees in the Interspace era might be considered the semantic version of WCS for worms in the Internet era. In the Interspace, the infrastructure supports interactive analysis [Schatz,2002].


The Societal Roles: Social Insects into Generic Roles
The extensive natural history literature deals with the honey bee both at the colony level, where all the bees in the entire society are considered at once, and at the level of individual bees. Both perspectives must be employed to understand social behavior, but an emphasis on molecular analyses of the individual is needed to lay the foundation for a comprehensive picture.
Honey bees are complex social animals and individuals take on different roles throughout their lifetimes. The females remain in the hive when they are young to feed the babies (nurses), but leave the hive when they are mature to search for food (foragers). Complex social behavior occurs in response to environmental conditions, such as regulating the temperature of the hive by closer swarming or changing the sources of the food by dance languages.
While the experimental model is an insect, we will choose behaviors that are potentially applicable to higher organisms, including humans. Previous efforts at sociobiology have tried to use detailed observations of insects to make predictions about humans [Wilson,1971,1975]. The current efforts at sociogenomics [Robinson,2002a] have the advantage of pitching the comparisons at the level of molecular analysis, where the arguments are more objective.
Fortunately, the fundamental features of social behavior are well classified for our model of social insects, by E. O. Wilson and others [Wilson,1976; Oster and Wilson,1978; Seeley,1982,1985; Robinson,1987]. There are about thirty behaviors, in the general categories of behavior: care of young; construction and maintenance of the home; food acquisition; defense; and communication. We will also sample bees engaged in reproduction but this is an activity little engaged in by worker bees. Behaviors in other categories, in contrast, form the basic fabric of all societies, and are performed by worker bees with exquisite coordination and great success.

[TABLE of SOCIETAL ROLES]


Background: Expressions differentiating Behaviors
CoPI Robinson's lab has developed the first cDNA microarray for honey bees [Whitfield,2002] and recently reported that brain gene expression profiles can be used to distinguish between social roles for individual bees [Whitfield,2003].
Given that generating 100 gene expression profiles from bees took about 1 person-year for a similar but smaller experiment [Whitfield,2003], the proposed experiment will require about 10 person-years for the foundation data set for the BeeSpace project.
Explain the Striking Example of Precocious Foraging, which will be utilized throughout.

Estimates of statistical power based on previous analyses indicate that the proposed experiment can detect small differences in sequence expression between similar behaviors. For each behavior, data from 3 hives, 10 bees per hive, 2 arrays per bee, and 2 spots per array yield a total of 120 observations for a ratio analysis and 240 observations for an absolute analysis. For 120 observations, after adjusting the degrees of freedom for the estimation of fixed effects (e.g. hives), approximately 100 degrees of freedom (DF) per behavior are available. Assuming that, in the worst scenario, half of the ratio observations must be discarded, 50 degrees of freedom would be available. Both the 100 DF and the 50 DF scenarios are evaluated for power. Standard errors (SE) were assumed to be equal to (0.2 units) or larger than (0.4 or 0.6 units) those observed in previous bee expression studies conducted in the same lab. Table 1 evaluates three levels of experiment-wise false positive rate, allowing for multiple testing adjustments from least to most stringent (α = 1E-4, 1E-7, 1E-10), and different magnitudes of expression difference (Diff.) or standardized difference (Std. Diff. = Diff. / SE) between behaviors.

Table 1. Power to detect a difference between two behaviors

                        DF = 100                      DF = 50
Std. Diff.  SE   Diff.  α=1E-4    α=1E-7    α=1E-10   α=1E-4    α=1E-7    α=1E-10
0.17        0.6  0.1    1.26E-2   1.12E-4   6.38E-7   1.01E-3   2.67E-6   5.46E-9
0.25        0.4  0.1    7.86E-2   2.05E-3   2.77E-5   3.60E-3   1.57E-5   4.62E-8
0.33        0.6  0.2    2.78E-1   2.01E-2   6.41E-4   1.11E-2   7.98E-5   3.41E-7
0.5         0.2  0.1    8.56E-1   3.39E-1   5.51E-2   6.87E-2   1.34E-3   1.23E-5
0.67        0.6  0.4    9.97E-1   8.89E-1   5.10E-1   2.47E-1   1.30E-2   2.59E-4
1           0.2  0.2    1.00E+0   1.00E+0   9.99E-1   8.20E-1   2.47E-1   2.38E-2
1.5         0.4  0.6    1.00E+0   1.00E+0   1.00E+0   1.00E+0   9.48E-1   5.97E-1
1.67        0.6  1      1.00E+0   1.00E+0   1.00E+0   1.00E+0   9.92E-1   8.37E-1
2           0.2  0.4    1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0   9.93E-1
2.5         0.4  1      1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0
3           0.2  0.6    1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0
5           0.2  1      1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0   1.00E+0

Table 1 shows that at the most stringent criterion (α = 1E-10) and for a conservative number of observations (100 degrees of freedom), even a one-unit log change between behaviors can be detected with a power of 99%. This is equivalent to a two-fold change in expression units. We can even detect a standardized difference of two-thirds of a unit (0.67) with a power of 89% under a more reasonable significance criterion (α = 1E-7) for the proposed design and statistical methodology. These power computations are empirically confirmed by results from the comparison of gene expression between nurses and foragers conducted in co-PI Robinson's lab (Whitfield et al., 2003). Using the same bee population and microarray techniques that are being proposed here, Whitfield et al. (2003) were able to detect more than 30 differentially expressed sequences (P < 1E-7) that had less than a two-fold difference (Figure A).
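The relationship between the columns of Table 1 and the fold-change statement can be checked with a short calculation. This is a minimal sketch: the (Diff., SE) pairs are copied from the first rows of Table 1, the function names are illustrative, and the use of base-2 logarithms is an assumption inferred from the statement that a one-unit log change equals a two-fold change.

```python
# (Diff., SE) pairs copied from the first six rows of Table 1.
rows = [(0.1, 0.6), (0.1, 0.4), (0.2, 0.6), (0.1, 0.2), (0.4, 0.6), (0.2, 0.2)]


def standardized_difference(diff, se):
    """Expression difference expressed in units of its standard error."""
    return diff / se


def fold_change(log2_diff):
    """Convert a difference on the log2 scale into a fold change (assumes base 2)."""
    return 2.0 ** log2_diff


for diff, se in rows:
    std = standardized_difference(diff, se)
    print(f"Diff.={diff:.1f}  SE={se:.1f}  Std. Diff.={std:.2f}")

print("1.0 log2 units ->", fold_change(1.0), "fold")
print("0.5 log2 units ->", round(fold_change(0.5), 2), "fold")
```

The printed standardized differences reproduce the first column of Table 1, which is how the Std. Diff. = Diff. / SE reading of the table was confirmed.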

 

Figure A. Fold difference and significance of expression between forager (OF) and nurse (YN) bees (Whitfield et al., 2003)

The Biological Experiment: Capturing the Organism
Our hero experiment on the biology side will attempt to provide a molecular signature of all the roles performed by worker bees in their society. The behavior of an individual within society will be analyzed at the molecular level. To accomplish this, we will enumerate all behaviors, then generate brain gene expression profiles for a range of individuals captured in the very act of their performing a normal activity. The small size of the bee enables flash freezing in mid-flight.
We propose a comprehensive project, which aims to analyze virtually all social behavior, with sensible definitions. Since current microarrays record mRNA transcripts, physically generated within minutes, behaviors representing social roles that last for days will be emphasized.
We will generate about 1000 brain gene expression profiles, for 1000 individual bees. This is computed by 30 behaviors * 10 bees/behavior * 3 colonies, to account for individual variation.
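The sample-size arithmetic in this proposal can be checked in a few lines. The sketch below uses only figures stated in the text (30 behaviors, 10 bees per behavior per colony, 3 colonies, 2 arrays per bee, 2 spots per array, and about 100 profiles per person-year). The variable names are illustrative, and the factor of two for the absolute analysis assumes one observation per dye channel, which the text does not state explicitly.

```python
# Design figures stated in the proposal.
BEHAVIORS = 30
BEES_PER_BEHAVIOR_PER_COLONY = 10
COLONIES = 3
ARRAYS_PER_BEE = 2
SPOTS_PER_ARRAY = 2
PROFILES_PER_PERSON_YEAR = 100  # from the earlier, smaller experiment

# Total expression profiles (one per bee) and the labor they imply.
profiles = BEHAVIORS * BEES_PER_BEHAVIOR_PER_COLONY * COLONIES
person_years = profiles / PROFILES_PER_PERSON_YEAR

# Observations available per behavior for the power analysis.
ratio_observations = (
    COLONIES * BEES_PER_BEHAVIOR_PER_COLONY * ARRAYS_PER_BEE * SPOTS_PER_ARRAY
)
# Assumption: an absolute analysis uses both dye channels of each spot.
absolute_observations = 2 * ratio_observations

print("profiles:", profiles)
print("person-years:", person_years)
print("ratio observations per behavior:", ratio_observations)
print("absolute observations per behavior:", absolute_observations)
```

The exact product is 900 profiles and 9 person-years, which the text rounds to about 1000 profiles and about 10 person-years.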


Analyzing the Biological Experiment: Statistical Differentiation
The large number of experimental, technical and biological sources of variation present in microarray studies requires complex models that account for known correlations among expression levels from genes pertaining to the same pathway or family. Goals accomplished through the statistical analysis of the gene expression data will be: 1) the detection of sequences that exhibit differential expression between behaviors, 2) the identification of behaviors with related gene expression patterns among sequences, 3) the identification of sequences with correlated expression patterns across behaviors, 4) the construction of mathematical functions that can help predict behaviors based on the most informative sequence patterns.
To accomplish any of the goals, the numerical fluorescence intensity values of the sequences will first be normalized to remove systematic sources of variation (e.g. differential labeling) and to render the data compatible with the assumptions of the following stages. Data filtering will be implemented if objective and visual inspection indicates spurious measurements. Alternative data transformation and normalization approaches, including LOESS adjustments, will be evaluated (Cui and Churchill, 2003).
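To make the idea of removing a systematic labeling bias concrete, the sketch below applies global median centering to log2 ratios. This is a deliberately simpler stand-in for the LOESS adjustment named above, not the method the project will use; the intensity values and function names are hypothetical.

```python
import math
from statistics import median


def log_ratios(red, green):
    """Per-spot log2 ratio of the two dye channels."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def median_center(values):
    """Subtract the array-wide median so the typical spot has log ratio 0."""
    center = median(values)
    return [v - center for v in values]


# Hypothetical intensities: the red channel is about twice as bright
# everywhere (a labeling bias); only the third spot differs beyond that.
red = [1200.0, 800.0, 1600.0, 400.0, 3200.0]
green = [600.0, 400.0, 400.0, 200.0, 1600.0]

raw = log_ratios(red, green)
normalized = median_center(raw)
print("raw log2 ratios:       ", raw)
print("median-centered ratios:", normalized)
```

Median centering removes a constant bias only. An intensity-dependent method such as LOESS would instead fit the bias as a smooth function of spot intensity.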
For the first goal, genes differentially expressed among the 30 behaviors will be identified using linear models (Jin et al., 2002) and nonparametric approaches. The multiple approaches have different strengths and will provide complementary views of the expression patterns. Linear models will be implemented using frequentist (least-squares, likelihood) and Bayesian frameworks. Whereas the likelihood and least-squares estimates of the relative expression of the sequences are based solely on the data, the Bayesian approach will allow the incorporation of prior information on sequence function or pathway. The statistical significance (p-value) of the variation in gene expression across behaviors will be adjusted for the multiple comparisons among the 30 behaviors and for the multiple testing across the more than 6000 sequences. Significance adjustments with varying degrees of stringency (e.g. Bonferroni, false discovery rate) and different assumptions will be evaluated (Reiner, 2003).
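The two families of adjustment named above differ considerably in stringency, which a small example makes visible. The sketch below implements the Bonferroni correction and the Benjamini-Hochberg step-up procedure for the false discovery rate. Choosing Benjamini-Hochberg as the false discovery rate method is an assumption, since the text does not name a specific procedure, and the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four sequences.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
for p, b, f in zip(raw, bonf, bh):
    print(f"raw={p:.3f}  Bonferroni={b:.3f}  BH={f:.3f}")
```

With more than 6000 sequences the gap between the two adjustments widens, which is one reason a range of stringencies is worth evaluating.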
To accomplish the second and third goals, hierarchical and disjoint clustering approaches will be used. In each case, different measures of the similarity between groups of behaviors or sequences and different grouping methods will be evaluated because of their potential impact on the final clusters. In addition, principal component analysis will be used to identify a smaller number of features that can completely characterize most of the variation of the 30 behaviors and thousands of sequences. A biological interpretation will be assigned to the main principal components whenever possible. Linear and quadratic discriminant approaches will be used to detect the set of sequences that best characterizes each behavior or group of behaviors. In addition, mathematical functions of the sequence expressions that discriminate behaviors will be constructed and cross-validated. The results from all the statistical analyses will be combined with biological knowledge of the sequences and their function and of the phenotypes to facilitate the interpretation of the results.
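Cross-validating a discriminating function can be illustrated with a very small classifier. The sketch below uses a nearest-centroid rule with leave-one-out cross-validation. The nearest-centroid rule is a stand-in chosen for brevity: it coincides with a linear discriminant only under the simplifying assumption of equal, spherical within-class covariance and equal priors. The expression values and labels are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training centroid is closest."""
    by_class = {}
    for features, label in zip(train_x, train_y):
        by_class.setdefault(label, []).append(features)
    centroids = {label: centroid(rows) for label, rows in by_class.items()}
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def leave_one_out_accuracy(xs, ys):
    """Hold out each bee in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical log expression of two sequences in six bees.
xs = [[0.2, 1.1], [0.1, 0.9], [0.3, 1.0], [1.0, 0.1], [0.9, 0.2], [1.1, 0.0]]
ys = ["nurse", "nurse", "nurse", "forager", "forager", "forager"]
accuracy = leave_one_out_accuracy(xs, ys)
print("leave-one-out accuracy:", accuracy)
```

Holding out each bee before fitting matters because sequences selected and scored on the same bees would give an optimistic estimate of how well behaviors can be predicted.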
The different uses of the estimates and inferences resulting from the data analysis tolerate different levels of false positive results. For example, whereas biological verification of the results is costly and can be accomplished only for the most informative sequences, the informatics component will benefit from considering not only the most informative sequences but also the less significant but related sequences. Hence, a range of experiment-wise type I error rates will be considered.
Hierarchical mixed effects models will be used to account for sources of uncertainty:
y = Xβ + Zu + e
Here, y is the vector of intensities of the ith gene across the n behaviors; β is a vector of fixed effects (explanatory variables, including array and dye swap) associated with the recorded intensities of that gene; X is an incidence matrix relating the fixed effects to the intensities; u is a vector of random effects (e.g. bees) associated with each gene; Z is an incidence matrix of appropriate order relating the random effects to the intensities; and e is the vector of residuals. All the genes with significant expression differences detected with this model will be studied with a similar model in which the random variables are the sequence effects, u_i ~ NIID(0, Σ) and e_i ~ N(0, σ² I_{n_i}), where NIID denotes values that are normal, independent, and identically distributed. The positive definite matrix Σ contains the variances and covariances of gene effects on parameters, and σ² is the residual variance. The estimated variance-covariance matrix will make it possible to identify groups of sequences with correlated expression patterns arising from similar function or a shared signaling pathway. The assumption of constant residual variance will be tested and, if it is violated, a heteroscedastic model will be evaluated instead. The unknown parameters (β, Σ, σ²) will be estimated under least-squares, likelihood (under particular assumptions, equivalent to least-squares), and Bayesian frameworks.
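One consequence of the mixed model is worth writing out, because it is what links an estimated covariance matrix to correlated expression patterns. As a sketch, take β to be the fixed effects, G the covariance matrix of the random effects u, and σ² the residual variance, and assume that u and e are independent. The symbol G is introduced here only for illustration. The marginal mean and covariance of the intensities are then:

```latex
\begin{aligned}
y &= X\beta + Zu + e, \qquad u \sim N(0,\, G), \quad e \sim N(0,\, \sigma^{2} I), \\
\mathrm{E}[y] &= X\beta, \\
\mathrm{Var}(y) &= Z\, G\, Z^{\top} + \sigma^{2} I .
\end{aligned}
```

The off-diagonal entries of the first covariance term are what allow observations sharing a random effect to be correlated. Under a heteroscedastic alternative, the second term would be replaced by a diagonal matrix with unequal entries.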
In a likelihood context, the estimates result from maximizing the Gaussian likelihood function. Uniform priors for the fixed effects and Jeffreys priors for the variance components will be used in the Bayesian approach. Whereas the likelihood approach provides point estimates of differences among behaviors, the Bayesian independence-chain algorithmic implementation will provide posterior density estimates of the contrast between behaviors. Co-PI Rodriguez-Zas has successfully applied different normalization approaches and linear models in the likelihood and Bayesian frameworks to the analysis of gene expression in C. elegans and B. taurus (Rodriguez-Zas, 2002; Rodriguez-Zas et al., 2003; Zou et al., 2003; Loor et al., 2003a,b; Loor et al., 2004a,b; Zou et al., 2004; Clough et al., 2004).
Clustering (hierarchical and k-means) approaches will be used to complement the previously described parametric approaches and will be applied to genes that show significant variation in expression across conditions (Dudoit and Fridlyand, 2003). These approaches will provide collections of genes that exhibit similar expression across ages or genotypes. Different similarity (e.g. Pearson correlation) and distance (e.g. Euclidean) measurements will be considered. Likewise, several clustering methods (e.g. maximum, minimum, centroid, average, Ward) will be used, since the final clustering can vary substantially among methods. In the k-means approach, the user must specify the desired number of partitions; hence, different values will be evaluated. Co-PI Rodriguez-Zas has conducted clustering analysis of gene expression data (Rodriguez-Zas et al., 2003).
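Because k-means requires the number of partitions in advance, one simple way to compare inputs is to run it for several values of k and examine the within-cluster sum of squares. The sketch below is a minimal pure-Python k-means using squared Euclidean distance. The evenly spaced initialization is an assumption made to keep the example deterministic, and the expression profiles are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm with deterministic, evenly spaced initial centroids."""
    n = len(points)
    centroids = [list(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(max_iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Total squared distance of each profile to its assigned centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical profiles of six genes across three behaviors.
profiles = [
    [1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [1.1, 0.9, 1.0],
    [-1.0, -0.9, -1.1], [-0.9, -1.1, -1.0], [-1.1, -1.0, -0.9],
]

wss = {}
for k in (1, 2, 3):
    labels, centroids = k_means(profiles, k)
    wss[k] = within_cluster_ss(profiles, labels, centroids)
    print(f"k={k}  labels={labels}  within-cluster SS={wss[k]:.3f}")
```

A sharp drop in the within-cluster sum of squares followed by only small gains is a common, informal signal for choosing k. Replacing squared Euclidean distance with a correlation-based dissimilarity would correspond to the Pearson option mentioned above.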
A discriminant analysis using stepwise selection will be used to find the subset of sequences that best describes the differences among the classes of behaviors. A p-value < 0.05 will be required for a sequence to be accepted as a discriminatory sequence and to be kept once other sequences have also been accepted as discriminatory. Starting with no variables in the model, only the sequence that contributes most to the discriminatory power among behavior groups (based on Wilks' lambda) is accepted. In each step of the process, all sequences are evaluated for their discriminatory power based on the p-values to be accepted and kept conditional on the others, until the stepwise selection process stops. Co-PI Rodriguez-Zas has conducted discriminant analysis on biological data (Yeater et al., 2004).
All approaches will be implemented with a combination of available statistical software (SAS, S-plus) and complemented with novel programming routines (Matlab, C++) developed to address the particularities of the large and rich data set of this project.

Utilizing the Biological Experiment: Anatomical Localization
The new complete gene expression profile database will anchor the second biological experiment, creation of a bee brain atlas of gene expression. Expression profiles do not directly identify behavior genes, but provide associations that can be explored using other methods. Our goal is to use patterns of gene expression to reveal underlying neural circuitry. We will modify current procedures for in situ hybridization of bee brains, pioneered by coPI Fahrbach, to obtain the data for an interactive graphical atlas. This gene expression atlas will link the microarray data to more than a century of descriptive neuroanatomy published on the honey bee brain [Fahrbach, 2003]. Functional correspondences between insect brain centers and vertebrate brain regions, e.g. mushroom body versus hippocampus [Strausfeld,1998] or central complex versus cerebellum, will then support navigation into this new database for vertebrate biologists.
There is excitement in the neuroscience community about gene expression maps of the brain. It is widely recognized that we need to employ genomics to understand how neurons form circuits and signaling systems that orchestrate complex and flexible behavior. Anatomy and physiology alone are not sufficient for the complexity of brain and behavior. Two major projects are preparing expression maps for the entire mouse genome, the Trans-NIH Molecular Brain Neuroanatomy (http://trans.nih.gov/bmap/resources/resources.htm) and the Allen Brain Atlas of the new Allen Institute of Brain Science (http://www.brainatlas.org). These projects will be driven by pressure to develop automated high-throughput technologies, required due to the large number of neurons in the mouse brain and the large number of genes in the mouse genome.
An estimate for a representative mouse brain indicated 75 million neurons [Williams,2000]; estimates of the number of genes encoded by the mouse genome range from 30,000 to 35,000. The corresponding figures for the honey bee brain are 750,000 neurons (100-fold fewer) and 13,000 genes (roughly one third as many, based on the estimated number of genes in the fruit fly genome).
We plan to develop a gene expression atlas for the honey bee, since once again the bee is just the right scale to show now what will be possible in the future. The technology for a brain and genome of this size is already within reach of a FIBR-sized grant. Furthermore, because we will use the microarray data as a direct screen for neuronal genes of behavioral relevance, the number of genes important enough to map onto the physical brain will be significantly smaller, possibly no more than several hundred. We see the marriage of the behavioral filter to in situ hybridization as the primary importance of the brain mapping aspect of our project. At present, patterns of expression in the adult insect brain are known for only a small number of genes [Ben-Shahar,2002; Kamikouchi,2000; Kucharski,1998; Kurshan,2003].

Background: Literatures differentiating Functions
BeeSpace is an interactive analysis environment enabling functional analysis from all sources relevant to honey bees. The sources include textual databases, such as scientific literature and gene descriptions, and experimental databases, such as genome sequences and expression products. The environment is a "space" in that all items are conceptually represented and can be interactively navigated [Schatz,2002b]. The BeeSpace for normal behavior of the bee is the model for the BioSpace for normal behavior of all organisms. [Schatz,2002a]
The Figure illustrates the range of functional sources to be integrated to build the BeeSpace. The functional sources are rich but partial, with good explanations but major gaps. There is a natural path from expression profiles to functional sources towards some biological perspective. Expressions can be related to data sources, which can be related to text sources. Towards Genes, the genetic descriptions from the model organisms such as Drosophila can be navigated. Towards Behaviors, the natural history literature on the honey bee can be navigated, placed in ecological and evolutionary context. Towards Sequences, the genomic annotations of biological literature can be navigated. Towards Regions, the neuro-anatomy literature can be navigated.
By navigating within BeeSpace, a biologist can simultaneously apply different perspectives for functional analysis, from molecular ecology (Genes to Behaviors from left to right in Figure) to cellular neuroscience (Sequences to Regions from top to bottom in Figure). The sources to interconnect to form BeeSpace can be found in scientific literature (MEDLINE, BIOSIS, AGRICOLA), trade literature (beekeeping journals), gene descriptions (FLYBASE, ACEDB, MGI), and biological databases from NCBI and EMBL for sequences and expressions.


BEESPACE: INTEGRATING FUNCTIONAL SOURCES

Honey bees are among the best studied animals. But molecular studies are relatively sparse, and proceed by comparison with laboratory animals, particularly Drosophila melanogaster, also a winged insect but not a social one. Functional analysis of normal behavior for natural animals must build on the genetic analysis available for laboratory animals. Here, the correspondence of honey bees to fruit flies implies that the extensive genetics and genome information for Drosophila can be utilized. The gene description database within FlyBase provides a rich source of functional information, and the genome sequence database sometimes enables functional annotation with no additional steps. Our BeeSpace project has a close collaboration with the FlyBase PI, William Gelbart at Harvard University, to expand FlyBase beyond flies to encompass bees, including comparative genomics [see letter of collaboration].


Analyzing the Informatics Environment: Concept Extraction
Our unique strategy is to use literature analysis to boost the functional analysis of genome sequences, for an organism without extensive genetic information available. PI Schatz has pioneered large-scale semantic analysis of biomedical literature. His hero experiment five years ago on supercomputers parsed all articles in MEDLINE for conceptual phrases and computed the relationships between these phrases within community collections [Bennett,1999; Chung,1999].
For this project, today's better software and faster computers will enable routine extraction of biological concepts. The literature analysis can begin with the bee literature, which is typically sized for a scientific community, on the order of 25,000 articles. These articles can be gathered from the bibliographic databases mentioned above. A conceptual navigation can then be done from the bee literature through several other specialty literatures to locate functional descriptions of related behaviors. This process will be generic, similar for any community literature [Houston,2000; Chen,1997].
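As a minimal sketch of the phrase relationships mentioned above, the fragment below counts how often two conceptual phrases occur in the same article. It assumes the phrases have already been extracted by a parser; the function name `cooccurrence` and the sample phrases are hypothetical illustrations, not the project's actual software or data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how often two phrases appear in the same article.

    `docs` is an iterable of phrase lists, one list per article.
    Returns a Counter keyed by alphabetically ordered phrase pairs.
    """
    pair_counts = Counter()
    for phrases in docs:
        # A sorted set ensures each unordered pair is counted once per article.
        for a, b in combinations(sorted(set(phrases)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical phrases, as if already extracted from three articles.
articles = [
    ["foraging behavior", "juvenile hormone", "mushroom bodies"],
    ["foraging behavior", "juvenile hormone"],
    ["brood care", "juvenile hormone"],
]
counts = cooccurrence(articles)
```

Raw counts of this kind would typically be normalized before use, so that very common phrases do not dominate the relationships within a collection.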
This literature navigation will be used in the many cases where an immediate function is not known. The expression experiments generate a computed gene based on a particular cDNA segment. In some cases, perhaps one third of the Apis sequence, these "genes" are the same segments as for Drosophila. In these cases, the function can be found immediately by locating the corresponding gene description from the electronic Red Book within FlyBase. In all other cases, including those generally harder and more interesting, some functional phrase must be located from the literature by a multi-step process.
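The two cases above can be sketched as a lookup with a fallback. This is an illustration only: the function `annotate`, the identifiers, and the description text are all hypothetical, and the sketch assumes the matching of bee segments to fly genes has already been computed elsewhere.

```python
def annotate(bee_gene, matches, fly_descriptions):
    """Return (function, source) for a computed bee gene.

    matches: maps a bee gene id to a fly gene id, only where a match exists.
    fly_descriptions: maps a fly gene id to its functional description text.
    """
    fly_gene = matches.get(bee_gene)
    if fly_gene is not None and fly_gene in fly_descriptions:
        # Direct case: reuse the description of the matching fly gene.
        return fly_descriptions[fly_gene], "fly gene description"
    # All other cases: the function must come from literature navigation.
    return None, "literature navigation needed"


# Hypothetical identifiers and placeholder text, for illustration only.
matches = {"bee_cdna_001": "fly_gene_A"}
fly_descriptions = {"fly_gene_A": "placeholder description of fly_gene_A"}

direct = annotate("bee_cdna_001", matches, fly_descriptions)
fallback = annotate("bee_cdna_002", matches, fly_descriptions)
```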
The bee community literature can be partitioned into the major behaviors for which expression data has been collected from normal activities. The bee literature can be partitioned into thirty clusters, where each cluster contains the articles describing a particular societal role from the master list. The articles contain the available functional description of the behavior, but in terms of organism function, not molecular function. A biologist must then navigate conceptually through the biological literature from a selected bee behavior cluster to another cluster in another community literature, which has related links to gene descriptions that are presumably similar to the genes being expressed in bees. For example, a biologist might navigate from foraging in bees to foraging in flies to genes in flies, as a functional explanation for expressions in bees.
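The foraging example can be sketched as two link tables followed in sequence. The function `navigate`, the link tables, and the gene identifiers are hypothetical; how such links would actually be computed is not specified here.

```python
def navigate(start, cluster_links, gene_links):
    """Follow links from a starting cluster to related clusters, then to genes.

    cluster_links: maps a (community, role) cluster to related clusters.
    gene_links: maps a cluster to the gene descriptions linked to it.
    Returns a dict from each related cluster to its linked genes.
    """
    results = {}
    for related in cluster_links.get(start, []):
        genes = gene_links.get(related, [])
        if genes:
            results[related] = list(genes)
    return results


# Hypothetical links: foraging in bees -> foraging in flies -> genes in flies.
cluster_links = {("bee", "foraging"): [("fly", "foraging")]}
gene_links = {("fly", "foraging"): ["fly_gene_A", "fly_gene_B"]}

paths = navigate(("bee", "foraging"), cluster_links, gene_links)
```

A cluster with no recorded links simply yields an empty result, which marks the point where a biologist would need to explore another community literature.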
More sophisticated navigations can be used to locate underlying mechanisms from similar roles in other organisms and from similar situations in other environments. Community repositories will be generated for a wide variety of related organisms, by generating a subcollection of the scientific literature for each organism partitioned by its societal roles.
In addition to these mechanisms from model organisms from the bibliographic databases, environment effects on honeybees and related species will be mined by using the reference books. The environment effects on honeybee behavior are recorded in the extensive natural history literature. We will obtain complete electronic versions of a wide sampling of the standard reference books to cover the natural history. These books will be partitioned into short functional descriptions for semantic indexing, perhaps at the paragraph level, to mimic the literature citations from the bibliographic databases. These books, perhaps 50 from Harvard University and Cornell University Presses on entomology and social behavior, will cover a comprehensive history of the honeybee and related species.
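A minimal sketch of paragraph-level partitioning follows, assuming each book is available as plain text with blank lines between paragraphs. The function name `partition_book`, the `min_words` threshold, and the sample text are illustrative assumptions.

```python
import re


def partition_book(title, text, min_words=5):
    """Split book text into paragraph-sized units for indexing.

    Blocks shorter than `min_words` (such as chapter headings) are skipped.
    """
    units = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse line breaks and extra spaces inside the paragraph.
        paragraph = " ".join(block.split())
        if len(paragraph.split()) >= min_words:
            units.append({"source": title, "index": len(units), "text": paragraph})
    return units


# Hypothetical sample text standing in for a scanned reference book.
sample = """Chapter 1

Worker bees change tasks as they
age, moving from the hive to the field.

Foragers collect nectar and pollen outside the hive."""

units = partition_book("Hypothetical Book", sample)
```

Each unit carries its source title and position, so an indexed paragraph can be traced back to its book in the same way a citation points to its article.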

The expertise of coPI Zhai will be used to develop an effective and efficient natural language parser for concept navigation [Zhai,1997].
Also discuss term-term and cluster-cluster concept switching.

Developing the BeeSpace: Interactive Analysis Environment
The value of interactive discovery in an analysis environment was first demonstrated in the PI's Worm Community System ten years ago for C. elegans, supported by NSF BIO [Shoman,1995] (http://www.canis.uiuc.edu/projects/wcs). At that time, databases were sparse and the links were forged manually. Today, the databases are far more complete and the links can be forged largely automatically. Thus, development and deployment of a general technology for analysis environments would be of enormous scientific importance.
The analysis environment embodied in the BeeSpace enables biologists to use their own special knowledge to annotate experiments. On the data side, we will make effective use of the just generated complete sequence for Apis and compare the genes within this sequence to those within the previously generated sequence for Drosophila. On the text side, we will make effective use of our extensive analysis of the scientific literature and compare the extracted functions to those described within the gene databases for the model organisms.
The analysis environment is interactive since individual sources can be simultaneously compared. A biologist can find similar clusters within a source, as well as similar clusters across sources. By interactively examining similarity variations in multiple sources and following relationship links to play sources off against each other, complicated patterns can be discovered.
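One common way to compare clusters across sources is cosine similarity over term counts. The sketch below uses that technique as an assumption; the similarity measure, the function names, and the sample term counts are all hypothetical choices for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, clusters):
    """Return the name of the cluster whose terms best match the query."""
    return max(clusters, key=lambda name: cosine(query, clusters[name]))


# Hypothetical term counts for clusters in two different sources.
literature = {
    "foraging": Counter({"nectar": 3, "flight": 2, "pollen": 2}),
    "brood care": Counter({"larvae": 3, "feeding": 2}),
}
gene_descriptions = {
    "cluster_1": Counter({"flight": 2, "nectar": 1}),
    "cluster_2": Counter({"larvae": 1, "cuticle": 2}),
}

best = most_similar(literature["foraging"], gene_descriptions)
```

The same function serves for comparisons within a source and across sources, since both reduce to comparing term-count vectors.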
Most of the technologies developed for BeeSpace would be applicable to other biological problems, so the software also serves as a model for the future BioSpace to navigate all biological knowledge. The software performing this mapping is generic across all biology, with little specialized to the honey bee. Most of the functional descriptions are contained in scientific literature, which is also generic across all biology. Thus, a computational model for functional analysis is being developed, with the honey bee as the biological model serving as the driver.

Education and Training Plan (undergrads and K-12)
Our premise is that students learn science best when they are engaged in authentic scientific inquiry, making use of the tools, methods, and ideas of current science [Dewey,1933; Donovan,1999; Driver,1985; Krajcik,1994; Minstrell,2000]. This premise also emphasizes the importance of community, whether that be the learning community of a classroom, that of a neighborhood, or the larger scientific community [Bruce,2002]. Our approach involves K-12 students, undergraduates, and graduate students as both beneficiaries of and participants in the research. It places special emphasis on traditionally underserved populations, including women, minorities, and students in low-resource communities.
We have developed curricula, workshops, websites, and other tools to support learning in a variety of life science areas. For example, the NCSA Biology Workbench http://peptide.ncsa.uiuc.edu/ is widely recognized as a significant bioinformatics resource because it provides a suite of interactive tools that draw on a host of biology databases and allow users to compare molecular sequences using high performance computing facilities, then visualize and manipulate molecular structures. (The first version of the Biology Workbench was developed using the first Web-based version of PI Schatz's Worm Community System as the underlying network database engine [Jamison,1996].) Education lead B.C. Bruce has worked for the last five years with Biology Workbench to develop inquiry-based approaches to learning bioinformatics in this web-based analysis environment [Bruce,2003; Thakkar,2000].

Undergraduates: The new course being developed by Fahrbach at Wake Forest.

High School: The University of Illinois runs the University Laboratory High School, a grades 7-12 school on campus. We will establish curriculum units for Inquiry using BeeSpace with the biology teacher there (David Stone), who has won a national award for teaching innovation from the Entomological Society of America.

Middle School: Summer workshop taught by the undergraduates above. Science kits to recruit minority students in Champaign County. We plan to use the improved Inquiry system to map the analysis paradigm down to the level where middle school students can cross-correlate sources, using computers and bees as the lure.

Finally, we will hold Bee Workshops to enable participants to observe behavior in the wild and try to predict its mechanisms using our BeeSpace environment. These workshops will build on longtime outreach strengths in the Department of Entomology. CoPI Robinson has hosted a science-oriented BeeKeeping Workshop for adults for nearly a decade. The Department, through its chair May Berenbaum, has hosted the Insect Fear Film Festival for two decades, an internationally famous festival with attached educational activities for children.


Sharing: Deploying the BeeSpace
Bruce and his colleagues have developed the Inquiry Page http://inquiry.uiuc.edu project to foster a growing bioinformatics education community. The Inquiry Page supports incorporation of the Workbench into day-to-day educational activities, in a way that encourages an inquiry-based approach to teaching and learning. All units posted to the Inquiry Page can be searched by keyword, and units can be viewed by the public. Additionally, Inquiry project participants can give feedback on others' pages, and develop and post their own Inquiry Units.
Bruce also leads education and outreach activities for a new NSF S&T Center at UIUC on Advanced Materials for Water Purification. For this NSF Center, the Inquiry system again supports interactive discovery, beyond traditional online documentation or curricular materials.
The Inquiry Page has recently been extended to Community Inquiry Labs (CIL). These are a means to engage in research and practice related to learning with people from all walks of life. A community inquiry lab is a place where members of a community come together to develop shared capacity and work on common problems. "Community" emphasizes support for collaborative activity and for creating knowledge connected to people's values, history, and experiences. "Inquiry" emphasizes open-ended, democratic, participatory engagement. "Laboratory" suggests resources to bring theory and action together in an experimental manner. CILs provide an easy to use, web-based infrastructure for communication and collaboration.

We plan to develop a BeeSpace environment for learning by making use of the Inquiry Page and the Community Inquiry Labs tools. We will test the general applicability of our BeeSpace environment by supporting a range of test users. The first wave will be the few labs that do molecular genetics for the honey bee (about 10). The second wave will be labs that directly work on related genes but in other organisms such as worms and voles (about 10). These will form our experimental users during the period supported by a FIBR grant. See sample letters of collaboration as attached.

The Inquiry system will be our initial vehicle for training graduate students to use the system. Note that the paradigm of analysis differs from existing information systems for biologists, but is close to that within the Inquiry system. During the grant period, we will produce an improved version of the Inquiry Page and the Community Inquiry Labs tools by incorporating some of the BeeSpace research technology. This version will be used in the second half of the project to extend the system to related communities. Again, the primary users will be students and postdocs in these labs.

We will hold an annual workshop for the users of the BeeSpace environment, using project funds to invite members from each bee lab and related lab to learn features and share problems. This type of workshop was used with great success in PI Schatz's previous large NSF projects, which built working research systems that influenced a generation of students. To further ensure that we build a new system with the widest possible influence in the scientific community, we will create an Advisory Committee of biologists and informaticians with international reputations, who will make powerful contacts for our development and deployment.

TABLE of BEE LABS and Related Organisms (User Sites)

Management: Organization and Timelines

Task            Year 1      Year 2        Year 3      Year 4      Year 5

Expressions     Samples     Half          Completed
Localizations   Samples     Quarter       Half        Completed
Statistics      Samples     Quarter       Quarter     Half        Completed
Collections     Databases   Literatures   Books       Re-index
Indexing        Extractor   Indexer       Switcher    Tuning      Completed
BeeSpace        Setup       Databases     First       Redo        Second