FIBR: BeeSpace
An Interactive Environment for Analyzing
Nature and Nurture in Societal Roles
Bruce Schatz (Information Science), Gene Robinson (Entomology),
ChengXiang Zhai (Computer Science), Sandra Rodriguez-Zas (Animal
Science), Bertram Bruce (Education), University of Illinois at
Urbana-Champaign
Susan Fahrbach (Biology), Wake Forest University, Winston-Salem,
North Carolina
Abstract
Honeybees are individuals within a society. They are individually
complex enough so that their behavior takes on different roles
at different times in their lives. But the timing of the roles
is not determined solely by nature (genetics). The timing of,
and even the occurrence of, societal roles is also determined
by nurture (environment). Honeybees represent an excellent model
to study how nature and nurture combine to produce
different roles and different behaviors in the variation of individuals
within a society.
Our goal is simple but ambitious. We will functionally analyze
all the roles of a honeybee within her society. This is possible
today because of the choice of an appropriate model and because feasible
technology exists on both the biology and the informatics ends. We will create a
master list of roles including genetic and environmental variations.
This list will be based on extensive research into social insects,
but generically specified for potential comparisons to higher
organisms.
Next, we will develop databases completely covering all of the
necessary features for functional analysis of these roles. These
databases will be generated using wet-lab technologies (biology)
for genetics, including expression products for normal behaviors,
and using dry-lab technologies (informatics) for environment,
including natural history of normal behaviors. Statistical techniques
will be used to identify differentiating genes for each behavior
on the genetics end (from microarray expressions) and to identify
differentiating phrases for each behavior on the environment end
(from biological literature).
Finally, we will develop a comprehensive interactive software
environment for functional analysis of our model organism. For
any role, corresponding functional phrases from textual sources
for genetics (such as gene descriptions from model organisms)
and for environment (such as scientific articles from biological
literature) can be located via interactive selection through multiple
sources.
Understanding Nature and Nurture in Societal Roles
Individuals function in society; they assume different roles
as they live and die. Understanding societal roles requires
observing individuals in the environments they interact in and
describing the underlying mechanisms. We propose a new approach
to understanding social behavior, using new technologies in biology
and informatics that integrate molecular description with information
and concepts from behavior and evolution.
Now, after a half-century of breakthroughs in learning how to
decode and catalog genes, tools are finally at hand to subject
genes and behavior to rational analysis. The success of molecular
biology makes possible a new form of behavioral analysis, using
the genome to investigate the relationship between genes and behavior.
The first wave of genome sequencing focused on molecular function,
for organisms selected for genetic tractability in the laboratory.
To understand social behavior, we need to move beyond the laboratory
to more natural environments. We need to investigate biological
processes for organisms showing normal behavior as they interact
with their environment, while striving to achieve the same depth
of description in molecular function attained with the model genetic
animals. Moving from laboratory conditions to natural
everyday conditions is necessary to understand social behavior.
Such an approach is only just possible today because of advances
in technology. We propose a hero experiment today that will show
what will be possible for everyday biology five to ten years from now,
after the end of the project, when molecular analysis will be routine
in all biology labs.
In the future, genome sequencing and expression profiling will
be performed routinely for individual organisms in particular
experimental situations. But how will the biologists relate
this molecular information to normal behavior, and understand
it from the diverse perspectives that inform behavior?
We propose to develop BeeSpace, an interactive environment
to support functional analysis of normal behavior at the molecular
level. On the biology side, we will develop brain gene expression
profiles for all normal behaviors of the honey bee. On the informatics
side, we will use all functional sources from databases and literature
to integrate molecular description with information and concepts
from behavior and evolution. When biotechnology enables routine
sequencing and expression analysis, bioinformatics must enable
routine functional analysis to interpret the experimental data
from a broad integrative perspective.
The process of explanation and the system of informatics thus
developed will be applicable to many other models and organisms
across biology. This is because the tools will use the full range
of available functional information in the scientific literature
from the many special communities and in the gene descriptions
from the few model organisms. The BeeSpace will be the model
for the BioSpace, a universal infrastructure to navigate all biological
knowledge.
The Model Organism: Apis mellifera, the Honeybee
Honeybees are individuals within a society. They are
individually complex enough so that their behavior takes on different
roles at different times in their lives. But the timing of the
roles is not determined solely by nature (genetics). The timing
of, and even the occurrence of, societal roles is also determined
by nurture (environment). They thus represent an excellent model
to study how nature and nurture combine to produce
different roles and different behaviors in the variation of individuals
within a society.
Unlike simpler organisms, honeybees exist within a society, and
each responds to the society's needs as well as her own. For example,
the life cycle of a female honeybee is to serve the role of nurse
within the hive when young, feeding the offspring, and to serve
the role of forager outside the hive when mature, gathering the
food. The change from nurse to forager is a significant lifestyle
change. This change typically occurs around two weeks of age,
with some individual variation. But conditions within the hive,
particularly the amount of food supply (honey in the cells), significantly
affect the number of foragers and the timing of their transformations.
If there is not enough food, precocious foragers emerge earlier
than normal. That is, the timing of the nurse to forager transition
depends on nature and on nurture, on the genetics of the individual
organism and on the food supply of the societal environment.
The effectiveness of the honeybee as a model is that it is just
the right size to perform a complete functional analysis at the
present level of technology, both biological and informational.
The unity of nature and nurture, genetics and environment can
be captured and functionally analyzed by using the brain
as a window into the environment. The genome is sequenced, as are
the genomes of other organisms smaller and larger, simpler and more
complex. But the sequence being available means that the expression
products of all of the genes can be captured, thus capturing the genetics of any
behavior transitions. In particular, the expression products
of all of the genes in the brain can be captured, thus capturing
the environments of any behavior transitions.
The efficiency of the honeybee as a model is that both the genetics
and the environment can be controlled, but within natural rather
than within artificial conditions. That is, nature itself can
become the laboratory, varying conditions to observe behaviors
while maintaining only normal conditions that commonly occur in
the wild. The genetic control is due to the logical structure
within the society, where all individuals are the progeny of a
single queen. The environmental control is due to the physical
structure within the society, where all individuals base their
behaviors within the hive. Thus genetics can be controlled by
varying the colony (features of the queen) while environment can
be controlled by varying the base (features of the hive).
To functionally analyze all the roles of a honeybee within its
society, all the genetic variation and all the environmental variation
must be captured. Another unique feature of the honeybee is the
enormous detail of knowledge about its natural history. The behaviors
of social insects in general and the honeybee in particular are
known at a level of detail unmatched by any other organism. The
behavioral observations are done on normal honeybees during normal
behavior in the wild, thus recording average functioning rather
than extreme functioning.
The special relationship of honeybees to humans during several
thousand years of beekeeping, but without any special domestication
to change normal behavior, is absolutely unique. The bee is also
at the cusp of complexity, where closely related families are
not essentially similar, such as flies and wasps, but not eusocial.
Our choice as a model organism for functional analysis of normal
behavior at the molecular level is Apis mellifera, the
Western honey bee. Honey bees live in societies that rival our
own in complexity, internal cohesion, and success in dealing with
the myriad challenges posed by social life, including those related
to communication, aging, social dysfunction and infectious disease.
This organism was one of the five chosen by the National Human
Genome Research Institute for genome sequencing in the first competition
held after the completion of the genomes of the model organisms
of classical genetics and the human. Our biology lead coPI Robinson
was the lead author of the proposal for the honey bee genome and
coordinates the project together with sequencers at the Human
Genome Sequencing Center at Baylor College of Medicine. http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/HoneyBee_Genome.pdf
We believe that the honey bee is the appropriate model for this
project [Robinson,2002b]. (1) Because the bee is an insect, its
behavior is highly stereotyped and rigorously assayable; complete
behavioral maturation occurs within a relatively short lifespan
(4-6 weeks). (2) Owing to a long and rich association with humans,
comprehensive knowledge of bee behavior [Winston,1987] provides
a firm foundation upon which to build analyses that integrate
molecular biology, neuroscience, ecology, sociobiology, and evolutionary
biology. (3) Methods of raising and manipulating bees are well
established, due to the ancient and close association between
bees and humans for honey production. These techniques enable
control of both genetic and environmental parameters. (4) Bees
live in large colonies that are maintained economically, making
it relatively easy to obtain robust sample sizes. (5) Bees exhibit
the haplodiploid form of sex determination. Using standard technology,
it is relatively easy to collect large numbers of individuals
for analyses that share as much as 75% of their genome with each
other. (6) Bees live in tightly structured societies in which
an individual's physiological and behavioral status is highly
dependent upon communication with other society members. (7)
Bees display a pattern of behavioral maturation that is "vertebrate-like"
in richness and complexity, proceeding from hive tasks such as
nursing to the cognitively demanding task of foraging. This behavioral
development is controlled by neural and endocrine mechanisms similar
to those of vertebrates.
The honey bee A. mellifera has just the right complexity
to successfully demonstrate that normal behavior can be molecularly
analyzed. The nematode worm C. elegans was the key model
that opened the gates for genome-enabled analysis of metabolic
function. We believe that the honey bee is the key model for
genome-enabled analysis of social behavior.
Our informatics lead PI Schatz was the PI of the flagship project
in the NSF National Collaboratory Program that built the Worm
Community System (WCS), which helped push the interactive analysis
of C. elegans in the pre-genome, pre-web era. He subsequently
served as PI of the flagship project in the NSF Digital Libraries
Program [Schatz,1999] that built the Interspace Prototype, which
developed the distributed software technologies necessary to support
interactive functional analysis in the post-genome post-web era
[Schatz,1997]. BeeSpace for bees in the Interspace era might
be considered the semantic version of WCS for worms in the Internet
era. In the Interspace, the infrastructure supports interactive
analysis [Schatz,2002].
The Societal Roles: Social Insects into Generic Roles
The extensive natural history literature deals with the honey
bee both at the colony level, where all the bees in the entire
society are considered at once, and at the level of individual
bees. Both perspectives must be employed to understand social
behavior, but an emphasis on molecular analyses of the individual
is needed to lay the foundation for a comprehensive picture.
Honey bees are complex social animals and individuals take on
different roles throughout their lifetimes. The females remain
in the hive when they are young to feed the babies (nurses), but
leave the hive when they are mature to search for food (foragers).
Complex social behavior occurs in response to environmental
conditions, such as regulating the temperature of the hive by
closer swarming or changing the sources of the food by dance languages.
While the experimental model is an insect, we will choose behaviors
that are potentially applicable to higher organisms, including
humans. Previous efforts at sociobiology have tried to use detailed
observations of insects to make predictions about humans [Wilson,1971,1975].
The current efforts at sociogenomics [Robinson,2002a] have the
advantage of pitching the comparisons at the level of molecular
analysis, where the arguments are more objective.
Fortunately, the fundamental features of social behavior are well
classified for our model of social insects, by E. O. Wilson and
others [Wilson,1976; Oster and Wilson,1978; Seeley,1982,1985;
Robinson,1987]. There are about thirty behaviors, in the general
categories of behavior: care of young; construction and maintenance
of the home; food acquisition; defense; and communication. We
will also sample bees engaged in reproduction but this is an activity
little engaged in by worker bees. Behaviors in other categories,
in contrast, form the basic fabric of all societies, and are performed
by worker bees with exquisite coordination and great success.
[TABLE of SOCIETAL ROLES]
Background: Expressions differentiating Behaviors
CoPI Robinson's lab has developed the first cDNA microarray
for honey bees [Whitfield,2002] and recently reported that brain
gene expression profiles can be used to distinguish between social
roles for individual bees [Whitfield,2003].
Given that generating 100 gene expression profiles from bees took
about 1 person-year for a similar but smaller experiment [Whitfield,2003],
the proposed experiment will require about 10 person-years for
the foundation data set for the BeeSpace project.
Explain the Striking Example of Precocious Foraging, which will
be utilized throughout.
Estimates of statistical power based on previous analyses indicate that the proposed experiment can detect small expression differences between similar behaviors. For each behavior, data from 3 hives, 10 bees per hive, 2 arrays per bee, and 2 spots per array yield a total of 120 observations in a ratio analysis and 240 observations in an absolute analysis. For 120 observations, after adjusting the degrees of freedom for the estimation of fixed effects (e.g. hives), approximately 100 degrees of freedom (DF) per behavior are available. Assuming that, in the worst scenario, half of the ratio observations must be discarded, 50 degrees of freedom would be available. Both the 100 and the 50 DF scenarios were evaluated for power. Standard errors (SE) were assumed to be equal to (0.2 units) or larger than (0.4 or 0.6 units) those observed in previous bee expression studies conducted in the same lab. Table 1 evaluates three levels of experiment-wise false positive rate, allowing for multiple testing adjustments from least to most stringent (a = 1E-4, 1E-7, 1E-10), and different magnitudes of expression difference (Diff.) or standardized difference (Std. Diff. = Diff. / SE) between behaviors.
Table 1. Power to detect difference between two behaviors
                        DF=100                           DF=50
Std. Diff.  SE   Diff.  a=1.00E-4  a=1.00E-7  a=1.00E-10 a=1.00E-4  a=1.00E-7  a=1.00E-10
0.17        0.6  0.1    1.26E-2    1.12E-4    6.38E-7    1.01E-3    2.67E-6    5.46E-9
0.25        0.4  0.1    7.86E-2    2.05E-3    2.77E-5    3.60E-3    1.57E-5    4.62E-8
0.33        0.6  0.2    2.78E-1    2.01E-2    6.41E-4    1.11E-2    7.98E-5    3.41E-7
0.5         0.2  0.1    8.56E-1    3.39E-1    5.51E-2    6.87E-2    1.34E-3    1.23E-5
0.67        0.6  0.4    9.97E-1    8.89E-1    5.10E-1    2.47E-1    1.30E-2    2.59E-4
1           0.2  0.2    1.00E+0    1.00E+0    9.99E-1    8.20E-1    2.47E-1    2.38E-2
1.5         0.4  0.6    1.00E+0    1.00E+0    1.00E+0    1.00E+0    9.48E-1    5.97E-1
1.67        0.6  1      1.00E+0    1.00E+0    1.00E+0    1.00E+0    9.92E-1    8.37E-1
2           0.2  0.4    1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0    9.93E-1
2.5         0.4  1      1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0
3           0.2  0.6    1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0
5           0.2  1      1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0    1.00E+0
Table 1 shows that at the most stringent criterion (a = 1E-10) and for a conservative number of observations (100 degrees of freedom), even a one unit log change between behaviors can be detected with a power of 99%. This is equivalent to a two-fold change in expression units. We can even detect half a unit fold change with a power of 89% under a more reasonable significance criterion (a = 1E-7) for the proposed design and statistical methodology. These power computations are empirically confirmed by results from the comparison of gene expression between nurses and foragers conducted in co-PI Robinson's lab (Whitfield et al., 2003). Using the same bee population and microarray techniques that are being proposed here, Whitfield et al. (2003) were able to detect more than 30 differentially expressed sequences (P < 1E-7) that had less than a two-fold difference (Figure A).
Figure A. Fold difference and significance of expression between forager (OF) and nurse (YN) bees (Whitfield et al., 2003)
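The general shape of such a power calculation can be sketched in a few lines of Python. This is only an illustration under a normal approximation: the function name approx_power and the noncentrality values are our assumptions, the exact method behind Table 1 is not stated here, and the sketch is not expected to reproduce the table entries.

```python
from statistics import NormalDist


def approx_power(ncp: float, alpha: float) -> float:
    """Two-sided power of a z-test under a normal approximation.

    ncp   : expected value of the test statistic (true difference divided
            by the standard error of the estimated difference); how this
            maps onto Std. Diff. and DF in Table 1 is an assumption.
    alpha : two-sided significance level (the "a" of Table 1).
    """
    std_normal = NormalDist()
    # Upper alpha/2 critical value, computed from the lower tail for accuracy.
    z_crit = -std_normal.inv_cdf(alpha / 2.0)
    # Probability that the statistic falls beyond either critical value.
    return std_normal.cdf(ncp - z_crit) + std_normal.cdf(-ncp - z_crit)


if __name__ == "__main__":
    for alpha in (1e-4, 1e-7, 1e-10):
        row = [f"{approx_power(ncp, alpha):.3f}" for ncp in (2.0, 5.0, 10.0)]
        print(f"a={alpha:.0e}: power at ncp 2, 5, 10 = {row}")
```

The sketch shows the qualitative pattern of Table 1: power falls sharply as the significance level becomes more stringent and rises quickly once the standardized signal exceeds the critical value.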
The Biological Experiment: Capturing the Organism
Our hero experiment on the biology side will attempt to provide
a molecular signature of all the roles performed by worker bees
in their society. The behavior of an individual within society
will be analyzed at the molecular level. To accomplish this,
we will enumerate all behaviors, then generate brain gene expression
profiles for a range of individuals captured in the very act of
their performing a normal activity. The small size of the bee
enables flash freezing in mid-flight.
We propose a comprehensive project, which aims to analyze virtually
all social behavior, with sensible definitions. Since
current microarrays record mRNA transcripts, physically generated
within minutes, behaviors representing social roles that last
for days will be emphasized.
We will generate about 1000 brain gene expression profiles, for
1000 individual bees. This is computed by 30 behaviors * 10 bees/behavior
* 3 colonies, to account for individual variation.
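The sizing arithmetic of the experiment can be checked with a short script. The figures come from this proposal (30 behaviors, 10 bees per behavior, 3 colonies; 2 arrays per bee and 2 spots per array; about 100 profiles per person-year); the function and variable names are ours and purely illustrative.

```python
def planned_profiles(behaviors=30, bees_per_behavior=10, colonies=3):
    """Number of individual brain expression profiles in the design."""
    return behaviors * bees_per_behavior * colonies


def observations_per_behavior(hives=3, bees_per_hive=10,
                              arrays_per_bee=2, spots_per_array=2):
    """Ratio-analysis observations available for one behavior."""
    return hives * bees_per_hive * arrays_per_bee * spots_per_array


profiles = planned_profiles()                 # 30 * 10 * 3
ratio_obs = observations_per_behavior()       # 3 * 10 * 2 * 2
absolute_obs = 2 * ratio_obs                  # two channels per spot
person_years = profiles / 100                 # ~100 profiles per person-year

print(profiles, ratio_obs, absolute_obs, person_years)
```

The exact product is 900 profiles and 9 person-years, which the text rounds to about 1000 profiles and about 10 person-years.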
Analyzing the Biological Experiment: Statistical Differentiation
The large number of experimental, technical and biological
sources of variation present in microarray studies requires complex
models that account for known correlations among expression levels
from genes pertaining to the same pathway or family. Goals accomplished
through the statistical analysis of the gene expression data will
be: 1) the detection of sequences that exhibit differential expression
between behaviors, 2) the identification of behaviors with related
gene expression patterns among sequences, 3) the identification
of sequences with correlated expression patterns across behaviors,
4) the construction of mathematical functions that can help predict
behaviors based on the most informative sequence patterns.
To accomplish any of the goals, the numerical fluorescence
intensity values of the sequences will first be normalized to remove
systematic sources of variation (e.g. differential labeling) and
to render the data compatible with the assumptions of the following
stages. Data filtering will be implemented if objective and visual
inspection indicates spurious measurements. Alternative data transformation
and normalization approaches, including LOESS adjustments, will
be evaluated (Cui and Churchill, 2003).
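As a minimal sketch of what normalization does, the following Python function converts two-channel intensities to log2 ratios and centers them on the array median. Global median centering is a deliberately simpler stand-in for the LOESS adjustment named above, not the method we will necessarily adopt; the function name and the toy intensities are hypothetical.

```python
import math
from statistics import median


def median_center_log_ratios(red, green):
    """Return log2(red/green) per spot, shifted so the array median is 0.

    Assumes strictly positive intensities. A constant dye bias multiplies
    every ratio by the same factor, so it becomes a constant shift on the
    log scale and is removed by subtracting the median.
    """
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    center = median(log_ratios)
    return [value - center for value in log_ratios]


# Toy array with three spots: every red signal is inflated, one spot more so.
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
print(median_center_log_ratios(red, green))
```

A LOESS adjustment generalizes this idea by letting the correction vary with overall spot intensity instead of using a single constant per array.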
For the first goal, the identification of genes differentially
expressed among the 30 behaviors will be conducted using linear
models (Jin et al., 2002) and nonparametric approaches. The multiple
approaches have different strengths and will provide a complementary
view of the expression patterns. Linear models will be implemented
using frequentist (least-squares, likelihood), and Bayesian frameworks.
Whereas the likelihood and least-squares estimates of the relative
expression of the sequences are based solely on the data, the
Bayesian approach will allow the incorporation of prior information
on the sequence function or pathway. The statistical significance
(p-value) of the variation of the gene expression across behaviors
will be adjusted for the multiple comparisons among the 30 behaviors
and for the multiple testing across the more than 6000 sequences.
Significance adjustment with varying degrees of stringency (e.g.
Bonferroni, false discovery rate) and different assumptions will
be evaluated (Reiner, 2003).
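The two adjustments named above differ considerably in stringency. A minimal Python sketch, assuming a plain list of raw p-values and using the standard Bonferroni and Benjamini-Hochberg step-up definitions (the function names and the toy p-values are ours):

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (false discovery rate) adjusted p-values."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With more than 6000 sequences, the Bonferroni factor alone is above 6000, which is why less stringent criteria such as the false discovery rate are also worth evaluating.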
To accomplish the second and third goals, hierarchical and disjoint
clustering approaches will be used. In each case, different measures
of the similarity between groups of behaviors or sequences and
different grouping methods will be evaluated, because of their
potential impact on the final clusters. In addition, principal
component analysis will be used to identify a smaller number of
features that can completely characterize most of the
variation of the 30 behaviors and thousands of sequences. A biological
interpretation will be assigned to the main principal components
whenever possible. Linear and quadratic discriminant approaches
will be used to detect the set of sequences that best characterizes
each behavior or group of behaviors. In addition, mathematical
functions of the sequence expressions that discriminate behaviors
will be constructed and cross-validated. The results from all
the statistical analyses will be combined with biological knowledge
of the sequences and their function and of the phenotypes to facilitate
the interpretation of the results.
The different uses of the estimates and inferences resulting from
the data analysis admit different levels of false positive results.
For example, whereas biological verification of the results
is costly and can be accomplished only for the most informative sequences,
the informatics component will benefit from considering
not only the most informative sequences but also the less
significant but related sequences. Hence, a range of experiment-wise
type I error rates will be considered.
Hierarchical mixed effects models will be used to account for
sources of uncertainty:
y = Xb + Z u + e
Here, y is the vector of intensities of the ith gene
across n behaviors; b is a vector of fixed effects
(including array and dye swap) associated with the n_i recorded
intensities per gene; X is an incidence matrix relating
the fixed effects to the intensities; u is a vector of random effects
(e.g. bees) associated with each gene; Z is an incidence
matrix of appropriate order relating the random effects to the
intensities; and e is the vector of residuals. All
the genes with significant expression differences detected
with this model will be studied with a similar model in which the
random variables are the sequence effects, with u_i ~ NIID(0, S)
and e_i ~ N(0, s^2 I_{n_i}), where NIID
denotes values that are normal, independent, and identically distributed
and I_{n_i} is the identity matrix of order n_i.
The positive definite matrix S contains the variances
and covariances of gene effects on parameters, and s^2 is the residual
variance. The estimated variance-covariance matrix will make it possible
to identify groups of sequences with correlated expression patterns
through similar function or shared signal pathway. The assumption
of constant residual variance will be tested and, if violated, a
heteroscedastic model will be evaluated instead. The unknown parameters
(b, S, s^2) will be estimated under least-squares,
likelihood (under particular assumptions, equivalent to least-squares),
and Bayesian frameworks.
In a likelihood context, the estimates result from maximizing
the Gaussian likelihood function. Uniform priors for the fixed
effects and Jeffreys priors for the variance components will
be used in the Bayesian approach. Whereas the likelihood approach
provides point estimates of differences among behaviors, the Bayesian
independence-chain algorithmic implementation will provide posterior
density estimates of the contrasts between behaviors. Co-PI Rodriguez-Zas
has successfully applied different normalization approaches and
linear models in the likelihood and Bayesian frameworks to the
analysis of gene expression in C. elegans and B. taurus (Rodriguez-Zas
2002; Rodriguez-Zas et al., 2003; Zou et al., 2003, Loor et al.,
2003a, b; Loor et al., 2004a,b; Zou et al., 2004; Clough et al.,
2004).
Clustering (hierarchical and k-means) approaches will be used
to complement the previously described parametric approaches and
will be applied to genes that exhibit significant variation of
expression across conditions (Dudoit and Fridlyand, 2003). These
approaches will provide collections of genes that exhibit similar
expression across ages or genotypes. Different similarity (e.g.,
Pearson correlation) and distance (e.g., Euclidean) measurements
will be considered. Likewise, several clustering methods (e.g.,
maximum, minimum, centroid, average, Ward) will be used, since
the final clustering can vary substantially among methods. In
the k-means approach, the user must indicate the desired number
of partitions; hence different inputs will be evaluated. Co-PI
Rodriguez-Zas has conducted clustering analysis of gene expression
data (Rodriguez-Zas et al., 2003).
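The choice of measure matters because correlation-based and Euclidean distances can disagree about which genes are neighbors. A small Python sketch with three hypothetical expression profiles (all names and values are toy assumptions, not project data):

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


def euclidean_distance(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


gene_a = [1.0, 2.0, 3.0, 4.0]      # rising profile
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar magnitude, opposite trend

for name, other in (("b", gene_b), ("c", gene_c)):
    print(name, pearson_distance(gene_a, other), euclidean_distance(gene_a, other))
```

Under the Pearson measure gene_a groups with gene_b (same pattern across conditions), while under the Euclidean measure it groups with gene_c (similar absolute level), so the two measures would seed different clusters.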
A discriminant analysis using stepwise selection was used to find
the subset of sequences that best describes the differences among
the classes of tissues. A p-value < 0.05 was required for a
sequence to be accepted as discriminatory and to be kept
once other sequences were also accepted as discriminatory. Starting
with no variables in the model, only the sequence that contributes
most to the discriminatory power among tissue groups (based on
Wilks' lambda) is accepted. In each step of the process, all sequences
are evaluated for their discriminatory power, based on the p-values
to be accepted and kept conditional on the others, until the stepwise
selection process stops. Co-PI Rodriguez-Zas has conducted discriminant
analysis on biological data (Yeater et al., 2004).
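The first step of such a forward selection can be sketched as follows. For a single sequence, Wilks' lambda reduces to the within-class sum of squares divided by the total sum of squares, so the sequence with the smallest value separates the classes best. This is a simplified illustration only: it omits the p-value thresholds and the conditional re-evaluation of later steps, and the function names, class labels, and toy values are hypothetical.

```python
def wilks_lambda(values, labels):
    """Univariate Wilks' lambda: within-class SS / total SS (0 = perfect).

    Assumes the values are not all identical (total SS must be nonzero).
    """
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    within_ss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)
    return within_ss / total_ss


def best_first_sequence(expression, labels):
    """Name of the sequence with the smallest univariate Wilks' lambda."""
    return min(expression, key=lambda name: wilks_lambda(expression[name], labels))


labels = ["nurse"] * 3 + ["forager"] * 3
expression = {
    "seq_A": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],  # differs between classes
    "seq_B": [2.0, 1.0, 3.0, 2.1, 0.9, 3.1],  # varies within classes only
}
print(best_first_sequence(expression, labels))
```

Later steps of a full stepwise procedure would test each remaining sequence conditional on those already in the model, which requires the multivariate form of the statistic.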
All approaches will be implemented with a combination of available
statistical software (SAS, S-plus) and complemented with novel
programming routines (Matlab, C++) developed to address the particularities
of the large and rich data set of this project.
Utilizing the Biological Experiment: Anatomical Localization
The new complete gene expression profile database will anchor
the second biological experiment, creation of a bee brain atlas
of gene expression. Expression profiles do not directly identify
behavior genes, but provide associations that can be explored
using other methods. Our goal is to use patterns of gene expression
to reveal underlying neural circuitry. We will modify current
procedures for in situ hybridization of bee brains, pioneered
by coPI Fahrbach, to obtain the data for an interactive graphical
atlas. This gene expression atlas will link the microarray data
to more than a century of descriptive neuroanatomy published on
the honey bee brain [Fahrbach, 2003]. Functional correspondences
between insect brain centers and vertebrate brain regions, e.g.
mushroom body versus hippocampus [Strausfeld,1998] or central
complex versus cerebellum, will then support navigation into this
new database for vertebrate biologists.
There is excitement in the neuroscience community about gene expression
maps of the brain. It is widely recognized that we need to employ
genomics to understand how neurons form circuits and signaling
systems that orchestrate complex and flexible behavior. Anatomy
and physiology alone are not sufficient for the complexity of
brain and behavior. Two major projects are preparing expression
maps for the entire mouse genome, the Trans-NIH Molecular Brain
Neuroanatomy (http://trans.nih.gov/bmap/resources/resources.htm)
and the Allen Brain Atlas of the new Allen Institute of Brain
Science (http://www.brainatlas.org). These projects will be driven
by pressure to develop automated high throughput technologies,
required due to the large number of neurons in the mouse brain
and the large number of genes in the mouse genome.
An estimate for a representative mouse brain indicated 75 million
neurons [Williams,2000]; estimates of the number of genes
encoded by the mouse genome range from 30,000 to 35,000. The corresponding
figures for the honey bee brain are 750,000 neurons (100-fold
fewer) and 13,000 genes (roughly one third as many, based on the
estimated number of genes in the fruit fly genome).
We plan to develop a gene expression atlas for the honey bee,
since once again the bee is just the right scale to show now what
will be possible in the future. The technology for a brain and genome
of this size is already feasible within a FIBR-sized grant.
Furthermore, because we will use the microarray data as a direct
screen for neuronal genes of behavioral relevance, the number
of genes important enough to map onto the physical brain will
be significantly smaller, possibly no more than several hundred.
We see the marriage of the behavioral filter to in situ
hybridization as the primary importance of the brain mapping aspect
of our project. At present, patterns of expression in the adult
insect brain are known for only a small number of genes. [Ben-Shahar,2002;
Kamikouchi,2000; Kucharski,1998; Kurshan,2003]
Background: Literatures differentiating Functions
BeeSpace is an interactive analysis environment enabling functional
analysis from all sources relevant to honey bees. The sources
include textual databases, such as scientific literature and gene
descriptions, and experimental databases, such as genome sequences
and expression products. The environment is a "space"
in that all items are conceptually represented and can be interactively
navigated [Schatz,2002b]. The BeeSpace for normal behavior of
the bee is the model for the BioSpace for normal behavior of all
organisms. [Schatz,2002a]
The Figure illustrates the range of functional sources to be integrated
to build the BeeSpace. The functional sources are rich but partial,
with good explanations but major gaps. There is a natural path
from expression profiles to functional sources towards some biological
perspective. Expressions can be related to data sources, which
can be related to text sources. Towards Genes, the genetic descriptions
from the model organisms such as Drosophila can be navigated.
Towards Behaviors, the natural history literature on the honey
bee can be navigated, placed in ecological and evolutionary context.
Towards Sequences, the genomic annotations of biological literature
can be navigated. Towards Regions, the neuro-anatomy literature
can be navigated.
By navigating within BeeSpace, a biologist can simultaneously
apply different perspectives for functional analysis, from molecular
ecology (Genes to Behaviors from left to right in Figure) to cellular
neuroscience (Sequences to Regions from top to bottom in Figure).
The sources to interconnect to form BeeSpace can be found in
scientific literature (MEDLINE, BIOSIS, AGRICOLA), trade literature
(beekeeping journals), gene descriptions (FLYBASE, ACEDB, MGI),
and biological databases from NCBI and EMBL for sequences and
expressions.
BEESPACE: INTEGRATING FUNCTIONAL SOURCES
Honey bees are among the best-studied animals. But molecular studies are relatively sparse, and proceed by comparison with laboratory animals, particularly Drosophila melanogaster, also a winged insect but not a social one. Functional analysis of normal behavior for natural animals must build on the genetic analysis available for laboratory animals; here, the correspondence of honey bees to fruit flies implies that the extensive genetics and genome information for Drosophila can be utilized. The gene description database within FlyBase provides a rich source of functional information, and the genome sequence database sometimes enables functional annotation with no additional steps. Our BeeSpace project has a close collaboration with the FlyBase PI, William Gelbart at Harvard University, to expand FlyBase beyond flies to encompass bees, including comparative genomics [see letter of collaboration].
Analyzing the Informatics Environment: Concept Extraction
Our unique strategy is to use literature analysis to boost
the functional analysis of genome sequences, for an organism without
extensive genetic information available. PI Schatz has pioneered
large-scale semantic analysis of biomedical literature. His hero
experiment five years ago on supercomputers parsed all articles
in MEDLINE for conceptual phrases and computed the relationships
between these phrases within community collections [Bennett,1999;
Chung,1999].
For this project, today's better software and faster computers
will enable routine extraction of biological concepts. The literature
analysis can begin with the bee literature, which is typically
sized for a scientific community, on the order of 25,000 articles.
These articles can be gathered from the bibliographic databases
mentioned above. A conceptual navigation can then be done from
the bee literature through several other specialty literatures
to locate functional descriptions of related behaviors. This
process will be generic, similar for any community literature
[Houston,2000; Chen,1997].
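As a rough illustration of the concept extraction described above (parsing
articles for conceptual phrases and computing relationships between phrases
within a collection), the following Python sketch treats maximal runs of
non-stop-words as candidate phrases and counts how often two phrases occur
in the same article. This is a minimal sketch under stated assumptions, not
the extractor used in the cited experiments: the stop-word list, the function
names, and the run-of-words heuristic are all hypothetical stand-ins for a
real noun-phrase parser.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real extractor would use a noun-phrase parser.
STOP = {"the", "of", "in", "and", "a", "to", "is", "for", "with", "on"}

def extract_phrases(text):
    """Return candidate phrases: maximal runs of consecutive non-stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases, run = set(), []
    for word in words + ["the"]:  # sentinel stop word flushes the final run
        if word in STOP:
            if run:
                phrases.add(" ".join(run))
            run = []
        else:
            run.append(word)
    return phrases

def cooccurrence(docs):
    """Count, for each pair of phrases, the documents containing both."""
    pairs = Counter()
    for doc in docs:
        for a, b in combinations(sorted(extract_phrases(doc)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [
    "Foraging behavior in the honey bee",
    "Foraging behavior of the fruit fly",
]
pairs = cooccurrence(docs)
```

The co-occurrence counts are the simplest possible relationship measure; a
production system would likely weight them by phrase frequency.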
This literature navigation will be used in the many cases where
an immediate function is not known. The expression experiments
generate a computed gene based on a particular cDNA segment.
In some cases, perhaps 1/3 of the Apis sequence, these
"genes" are the same segments as for Drosophila.
In these cases, the function can be found immediately by locating
the corresponding gene description from the electronic Red Book
within FlyBase. In all other cases, including those generally
harder and more interesting, some functional phrase must be located
from the literature by a multi-step process.
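The two-path decision described above can be sketched in Python as follows.
This is only an illustration: the mapping from bee segments to Drosophila
genes, the description table, and all identifiers are hypothetical
placeholders, not actual FlyBase records or an actual FlyBase interface.

```python
def annotate(segment_id, orthologs, fly_descriptions):
    """Return (function, route) for one computed bee gene.

    orthologs        : bee segment id -> Drosophila gene id (hypothetical map)
    fly_descriptions : Drosophila gene id -> functional description text
    """
    fly_gene = orthologs.get(segment_id)
    if fly_gene is not None and fly_gene in fly_descriptions:
        # Direct case: the segment matches a described Drosophila gene.
        return fly_descriptions[fly_gene], "direct"
    # All other cases: a functional phrase must be found by
    # multi-step navigation through the literature.
    return None, "literature"

# Placeholder data for illustration only.
orthologs = {"bee_seg_1": "fly_gene_A"}
fly_descriptions = {"fly_gene_A": "placeholder functional description"}

direct = annotate("bee_seg_1", orthologs, fly_descriptions)
fallback = annotate("bee_seg_2", orthologs, fly_descriptions)
```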
The bee community literature can be partitioned into the major
behaviors for which expression data has been collected from normal
activities. The bee literature can be partitioned into thirty
clusters, where each cluster contains the articles describing
a particular societal role from the master list. The articles
contain the available functional description of the behavior,
but in terms of organism function not molecular function. A biologist
must then navigate conceptually through the biological literature
from a selected bee behavior cluster to another cluster in another
community literature, which has related links to gene descriptions
that are presumably similar to the genes being expressed in bees.
For example, a biologist might navigate from foraging in bees to
foraging in flies to genes in flies, yielding a functional explanation
for expressions in bees.
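The navigation from a bee behavior cluster to a cluster in another community
literature, and from there to linked gene descriptions, could be sketched as
below. The sketch assumes each cluster is summarized by a set of concept
terms and picks the target cluster with the largest Jaccard overlap; the
overlap measure, the cluster names, the term sets, and the gene identifiers
are all hypothetical and chosen only to make the example run.

```python
def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def navigate(bee_terms, fly_clusters, fly_gene_links):
    """Pick the fly cluster closest to the bee cluster; return it and its genes.

    fly_clusters   : cluster name -> set of concept terms
    fly_gene_links : cluster name -> list of linked gene description ids
    """
    best = max(fly_clusters, key=lambda name: jaccard(bee_terms, fly_clusters[name]))
    return best, fly_gene_links.get(best, [])

# Placeholder clusters for illustration only.
bee_foraging = {"foraging", "food", "search"}
fly_clusters = {
    "foraging": {"foraging", "food", "larva"},
    "courtship": {"courtship", "song", "mating"},
}
fly_gene_links = {"foraging": ["gene_A"]}

cluster, genes = navigate(bee_foraging, fly_clusters, fly_gene_links)
```

In practice the biologist, not an argmax, would choose among candidate
clusters; the ranking only suggests where to look first.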
More sophisticated navigations can be used to locate underlying
mechanisms from similar roles in other organisms and from similar
situations in other environments. Community repositories will
be generated for a wide variety of related organisms, by generating
a subcollection of the scientific literature for each organism
partitioned by its societal roles.
In addition to these mechanisms from model organisms from the
bibliographic databases, environment effects on honeybees and
related species will be mined by using the reference books. The
environment effects on honeybee behavior are recorded in the extensive
natural history literature. We will obtain complete electronic
versions of a wide sampling of the standard reference books to
cover the natural history. These books will be partitioned into
short functional descriptions for semantic indexing, perhaps
at a paragraph level, to mimic the literature citations from the
bibliographic databases. These books, perhaps 50 from Harvard
University and Cornell University Presses on entomology and social
behavior, will cover a comprehensive history of the honeybee and
related species.
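Partitioning a reference book into paragraph-level units for indexing could
look like the following Python sketch. It assumes the electronic text
separates paragraphs with blank lines; the function name and the unit
identifier scheme are hypothetical.

```python
import re

def paragraphs(book_text, book_id):
    """Split book text on blank lines into (unit_id, text) records."""
    units = []
    for chunk in re.split(r"\n\s*\n", book_text):
        text = " ".join(chunk.split())  # collapse line wraps inside a paragraph
        if text:
            units.append((f"{book_id}:{len(units) + 1}", text))
    return units

sample = "First paragraph,\nwrapped.\n\n\nSecond paragraph.\n"
units = paragraphs(sample, "book1")
```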
The expertise of coPI Zhai will be used to develop an effective
and efficient natural language parser for concept navigation [Zhai,1997].
Concept navigation will also include term-term and cluster-cluster concept switching.
Developing the BeeSpace: Interactive Analysis Environment
The value of interactive discovery in an analysis environment
was first demonstrated in the PI's Worm Community System ten years
ago for C. elegans, supported by NSF BIO [Shoman,1995]
(http://www.canis.uiuc.edu/projects/wcs). At that time, databases
were sparse and the links were forged manually. Today, the databases
are far more complete and the links can be forged largely automatically.
Thus, development and deployment of a general technology for
analysis environments would be of enormous scientific importance.
The analysis environment embodied in the BeeSpace enables biologists
to use their own special knowledge to annotate experiments. On
the data side, we will make effective use of the just-generated
complete sequence for Apis and compare the genes within
this sequence to those within the previously generated sequence
for Drosophila. On the text side, we will make effective
use of our extensive analysis of the scientific literature and
compare the extracted functions to those described within the
gene databases for the model organisms.
The analysis environment is interactive since individual sources
can be simultaneously compared. A biologist can find similar
clusters within a source, as well as similar clusters across sources.
By interactively examining similarity variations in multiple
sources and following relationship links to play sources off against
each other, complicated patterns can be discovered.
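Finding similar clusters within a source and across sources can be
illustrated with a small ranking routine. This sketch assumes each cluster is
represented by a term-frequency vector and uses cosine similarity, which is
one common choice and not necessarily the measure BeeSpace will adopt; the
source names, cluster names, and counts are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def similar_clusters(query, sources, top=3):
    """Rank (score, source, cluster) triples over every cluster in every source."""
    scored = [(cosine(query, vec), source, name)
              for source, clusters in sources.items()
              for name, vec in clusters.items()]
    return sorted(scored, reverse=True)[:top]

# Placeholder vectors for illustration only.
query = Counter({"foraging": 2, "nectar": 1})
sources = {
    "literature": {
        "c1": Counter({"foraging": 1, "nectar": 1}),
        "c2": Counter({"guarding": 1}),
    },
    "gene_descriptions": {
        "g1": Counter({"foraging": 1}),
    },
}
ranked = similar_clusters(query, sources)
```

Because the ranking runs over all sources at once, a literature cluster and a
gene-description cluster can appear side by side in the result.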
Most of the technologies developed for BeeSpace would be applicable
to other biological problems, so the software also serves as a
model for the future BioSpace to navigate all biological knowledge.
The software performing this mapping is generic across all biology,
with little specialized to the honey bee. Most of the functional
descriptions are contained in scientific literature, which is
also generic across all biology. Thus, a computational model
for functional analysis is being developed, with the honey bee
as the biological model serving as the driver.
Education and Training Plan (undergrads and K-12)
Our premise is that students learn science best when they
are engaged in authentic scientific inquiry, making use of the
tools, methods, and ideas of current science [Dewey,1933; Donovan,1999;
Driver,1985; Krajcik,1994; Minstrell,2000]. This premise also emphasizes
the importance of community, whether that be the learning community
of a classroom, that of a neighborhood, or the larger scientific
community [Bruce,2002]. Our approach involves K-12 students,
undergraduates, and graduate students as both beneficiaries and
participants in the research. It places special emphasis on traditionally underserved
populations, including women, minorities, and students in low-resource
communities.
We have developed curricula, workshops, websites, and other tools
to support learning in a variety of life science areas. For example,
the NCSA Biology Workbench (http://peptide.ncsa.uiuc.edu/) is
widely recognized as a significant bioinformatics resource because
it provides a suite of interactive tools that draw on a host of
biology databases and allow users to compare molecular sequences
using high performance computing facilities, then visualize and
manipulate molecular structures. (The first version of the Biology
Workbench was developed using the first Web-based version of PI
Schatz's Worm Community System as the underlying network database
engine [Jamison,1996].) Education lead B.C. Bruce has worked
for the last five years with Biology Workbench to develop inquiry-based
approaches to learning bioinformatics in this web-based analysis
environment [Bruce,2003; Thakkar,2000].
Undergraduates: BeeSpace will be incorporated into the new course being developed by Fahrbach at Wake Forest.
High School: The University of Illinois runs the University Laboratory High School, a grades 7-12 school on campus. We will establish curriculum units for Inquiry using BeeSpace with the biology teacher there (David Stone), who has won a national award for teaching innovation from the Entomological Society of America.
Middle School: A summer workshop will be taught by the undergraduates above, and science kits will be used to recruit minority students in Champaign County. We plan to use the improved Inquiry system to map the analysis paradigm down to the level where middle school students can cross-correlate sources, using computers and bees as the lure.
Finally, we will hold Bee Workshops to enable participants to observe behavior in the wild and try to predict its mechanisms using our BeeSpace environment. These workshops will build on longtime outreach strengths in the Department of Entomology. CoPI Robinson has hosted a science-oriented BeeKeeping Workshop for adults for nearly a decade. The Department, through its chair May Berenbaum, has hosted the Insect Fear Film Festival for two decades, an internationally famous festival with attached educational activities for children.
Sharing: Deploying the BeeSpace
Bruce and his colleagues have developed the Inquiry Page project
(http://inquiry.uiuc.edu) to foster a growing bioinformatics education community.
The Inquiry Page supports incorporation of the Workbench into
day-to-day educational activities, in a way that encourages an
inquiry-based approach to teaching and learning. All units posted
to the Inquiry Page can be searched by keyword, and units can
be viewed by the public. Additionally, Inquiry project participants
can give feedback on others' pages, and develop and post their
own Inquiry Units.
Bruce also leads education and outreach activities for a new NSF
S&T Center at UIUC on Advanced Materials for Water Purification.
For this NSF Center, the Inquiry system again supports interactive
discovery, beyond traditional online documentation or curricular
materials.
The Inquiry Page has recently been extended to Community Inquiry
Labs (CIL). These are a means to engage in research and practice
related to learning with people from all walks of life. A community
inquiry lab is a place where members of a community come together
to develop shared capacity and work on common problems. "Community"
emphasizes support for collaborative activity and for creating
knowledge connected to people's values, history, and experiences.
"Inquiry" emphasizes open-ended, democratic, participatory
engagement. "Laboratory" suggests resources to bring
theory and action together in an experimental manner. CILs provide
an easy-to-use, web-based infrastructure for communication and
collaboration.
We plan to develop a BeeSpace environment for learning by making use of the Inquiry Page and the Community Inquiry Labs tools. We will test the general applicability of our BeeSpace environment by supporting a range of test users. The first wave will be the few labs that do molecular genetics for the honey bee (about 10). The second wave will be labs that directly work on related genes but in other organisms such as worms and voles (about 10). These will form our experimental users during the period supported by a FIBR grant. See sample letters of collaboration as attached.
The Inquiry system will be our initial vehicle for training graduate students to use the system. Note that the paradigm of analysis differs from that of existing information systems for biologists, but is close to that within the Inquiry system. During the grant period, we will produce an improved version of the Inquiry Page and the Community Inquiry Labs tools by incorporating some of the BeeSpace research technology. This will be used in the second half of the project to extend the system to related communities. Again, the primary users will be students and postdocs in these labs.
We will hold an annual workshop for the users of the BeeSpace environment, using project funds to invite members from each bee lab and related lab to learn features and share problems. This type of workshop was used with great success in PI Schatz's previous large NSF projects that built working research systems that influenced a generation of students. To further ensure that we build a new system with the widest possible influence in the scientific community, we will create an Advisory Committee of biologists and informaticians with international reputations, who will make powerful contacts for our development and deployment.
TABLE of BEE LABS and Related Organisms (User Sites)
Management: Organization and Timelines
Task            Year 1      Year 2        Year 3      Year 4      Year 5
Expressions     Samples     Half          Completed
Localizations   Samples     Quarter       Half        Completed
Statistics      Samples     Quarter       Quarter     Half        Completed
Collections     Databases   Literatures   Books       Re-index
Indexing        Extractor   Indexer       Switcher    Tuning      Completed
BeeSpace        Setup       Databases     First       Redo        Second