This document describes the tables that make up the schema. Tables are grouped into categories, and the purpose of each table is explained.

Schema diagram

The schema diagrams are automatically generated as PNG images with Graphviz, and show the links between columns of each table. Here follows the overall schema diagram, while the individual diagrams of each category are available below, together with the table descriptions.

Table list

Here is the list of all the tables found in a typical database, grouped by categories.

Execution

Job tracking

Logging and monitoring

beekeeper
role
worker

accu
analysis_data
job
job_file
semaphore

log_message
worker_resource_usage

Pipeline structure

Resources

analysis_base
analysis_ctrl_rule
analysis_stats
dataflow_rule
dataflow_target
hive_meta
pipeline_wide_parameters

resource_class
resource_description

Execution


Tables to track the agents operating the eHive system

../_images/graphviz-c7326a0c2c87549f177d648936f991d88d708b09.svg

beekeeper


Each row in this table corresponds to a beekeeper process that is running or has run on this pipeline.

Column	Type	Default value	Description	Index
beekeeper_id	INTEGER	not set	unique ID for this beekeeper	primary key
meadow_host	VARCHAR(255)	not set	hostname of machine where beekeeper started	key: beekeeper_signature_idx
meadow_user	VARCHAR(255)	not set	username under which this beekeeper ran or is running	key: beekeeper_signature_idx
process_id	VARCHAR(255)	not set	process_id of the beekeeper	key: beekeeper_signature_idx
is_blocked	SMALLINT	0	beekeeper will not submit workers if this flag is set
cause_of_death	ENUM: ANALYSIS_FAILED DISAPPEARED JOB_FAILED LOOP_LIMIT NO_WORK TASK_FAILED	NULL	reason this beekeeper exited
sleep_minutes	REAL	NULL	sleep interval in minutes
analyses_pattern	TEXT	NULL	restricting analyses_pattern, if given
loop_limit	INTEGER	NULL	loop limit if given
loop_until	ENUM: ANALYSIS_FAILURE FOREVER JOB_FAILURE NO_WORK	not set	beekeeper’s policy for responding to possibly loop-ending events
options	TEXT	NULL	all options passed to the beekeeper
meadow_signatures	TEXT	NULL	signatures for all meadows this beekeeper can submit to

role


Entries of this table correspond to Role objects of the API. When a Worker specialises, it acquires a Role, which is a temporary link between the Worker and a resource-compatible Analysis.

Column	Type	Default value	Description	Index
role_id	INTEGER	not set	unique ID of the Role	primary key
worker_id	INTEGER	not set	the specialised Worker	key: worker
analysis_id	INTEGER	not set	the Analysis into which the Worker specialised	key: analysis
when_started	TIMESTAMP	CURRENT_TIMESTAMP	when this Role started
when_finished	TIMESTAMP	NULL	when this Role finished. NULL may either indicate it is still running or was killed by an external force.
attempted_jobs	INTEGER	0	counter of the number of attempts
done_jobs	INTEGER	0	counter of the number of successful attempts

worker


Entries of this table correspond to Worker objects of the API. Workers are created by inserting into this table so that there is only one instance of a Worker object in the database. As Workers live and do work, they update this table, and when they die they update again.

Column	Type	Default value	Description	Index
worker_id	INTEGER	not set	unique ID of the Worker	primary key
meadow_type	VARCHAR(255)	not set	type of the Meadow it is running on	key: meadow_process
meadow_name	VARCHAR(255)	not set	name of the Meadow it is running on (for “LOCAL” meadows it is the same as meadow_host)	key: meadow_process
meadow_host	VARCHAR(255)	NULL	execution host name
meadow_user	VARCHAR(255)	NULL	scheduling/execution user name (within the Meadow)
process_id	VARCHAR(255)	not set	identifies the Worker process on the Meadow (for “LOCAL” is the OS PID)	key: meadow_process
resource_class_id	INTEGER	NULL	links to Worker’s resource class
work_done	INTEGER	0	how many jobs the Worker has completed successfully
status	VARCHAR(255)	‘READY’	current status of the Worker
beekeeper_id	INTEGER	NULL	beekeeper that created this worker
when_submitted	TIMESTAMP	CURRENT_TIMESTAMP	when the Worker was submitted by a Beekeeper
when_born	TIMESTAMP	NULL	when the Worker process was started
when_checked_in	TIMESTAMP	NULL	when the Worker last checked into the database
when_seen	TIMESTAMP	NULL	when the Worker was last seen by the Meadow
when_died	TIMESTAMP	NULL	if defined, when the Worker died (or its premature death was first detected by GC)
cause_of_death	VARCHAR(255)	NULL	if defined, why did the Worker exit (or why it was killed)
temp_directory_name	VARCHAR(255)	NULL	a filesystem directory where this Worker can store temporary data
log_dir	VARCHAR(255)	NULL	if defined, a filesystem directory where this Worker’s output is logged

Job tracking


Tables to list all the jobs of each analysis, and their dependencies

../_images/graphviz-af2ce79fb6dd47e9e8853fc7b83cede8ff08fb57.svg

accu


Accumulator for funneled dataflow.

Column	Type	Default value	Description	Index
sending_job_id	INTEGER	NULL	semaphoring job in the “box”	key: accu_sending_idx
receiving_semaphore_id	INTEGER	not set	semaphore just outside the “box”	key: accu_receiving_idx
struct_name	VARCHAR(255)	not set	name of the structured parameter
key_signature	VARCHAR(255)	not set	locates the part of the structured parameter
value	MEDIUMTEXT	NULL	value of the part

analysis_data


A generic blob-storage hash. Currently the only legitimate use of this table is “overflow” of job.input_ids: when they grow longer than 254 characters the real data is stored in analysis_data instead, and the input_id contains the corresponding analysis_data_id.

Column	Type	Default value	Description	Index
analysis_data_id	INTEGER	not set	primary id	primary key
md5sum	CHAR(32)	not set	checksum over the data to quickly detect (potential) collisions	key: md5sum
data	MEDIUMTEXT	not set	text blob which holds the data

job


The job table is the heart of this system. It is the kiosk or blackboard where workers find things to do and then post work for other works to do. These jobs are created prior to work being done, are claimed by workers, are updated as the work is done, with a final update on completion.

Column	Type	Default value	Description	Index
job_id	INTEGER	not set	autoincrement id	primary key
prev_job_id	INTEGER	NULL	previous job which created this one
analysis_id	INTEGER	not set	the analysis_id needed to accomplish this job.	unique key: input_id_stacks_analysis key: analysis_status_retry
input_id	CHAR(255)	not set	input data passed into Analysis:RunnableDB to control the work	unique key: input_id_stacks_analysis
param_id_stack	CHAR(64)	‘’	a CSV of job_ids whose input_ids contribute to the stack of local variables for the job	unique key: input_id_stacks_analysis
accu_id_stack	CHAR(64)	‘’	a CSV of job_ids whose accu’s contribute to the stack of local variables for the job	unique key: input_id_stacks_analysis
role_id	INTEGER	NULL	links to the Role that claimed this job (NULL means it has never been claimed)	key: role_status
status	ENUM: SEMAPHORED READY CLAIMED COMPILATION PRE_CLEANUP FETCH_INPUT RUN WRITE_OUTPUT POST_HEALTHCHECK POST_CLEANUP DONE FAILED PASSED_ON	‘READY’	state the job is in	key: analysis_status_retry key: role_status
retry_count	INTEGER	0	number times job had to be reset when worker failed to run it	key: analysis_status_retry
when_completed	TIMESTAMP	NULL	when the job was completed
runtime_msec	INTEGER	NULL	how long did it take to execute the job (or until the moment it failed)
query_count	INTEGER	NULL	how many SQL queries were run during this job
controlled_semaphore_id	INTEGER	NULL	the dbID of the semaphore that is controlled by this job (and whose counter it will decrement by 1 upon successful completion)

job_file


For testing/debugging purposes both STDOUT and STDERR streams of each Job can be redirected into a separate log file. This table holds filesystem paths to one or both of those files. There is max one entry per job_id and retry.

Column	Type	Default value	Description	Index
job_id	INTEGER	not set	foreign key	primary key
retry	INTEGER	not set	copy of retry_count of job as it was run	primary key
role_id	INTEGER	not set	links to the Role that claimed this job	key: role
stdout_file	VARCHAR(255)	NULL	path to the job’s STDOUT log
stderr_file	VARCHAR(255)	NULL	path to the job’s STDERR log

semaphore


The semaphore table is our primary inter-job dependency relationship. Any job may control up to one semaphore, but the semaphore can be controlled by many jobs. This includes remote jobs, so the semaphore keeps two counters - one for local blockers, one for remote ones. As soon as both counters reach zero, the semaphore unblocks one dependent job - either a local one, or through a chain of dependent remote semaphores.

Column	Type	Default value	Description	Index
semaphore_id	INTEGER	not set	autoincrement id	primary key
local_jobs_counter	INTEGER	0	the number of local jobs that control this semaphore
remote_jobs_counter	INTEGER	0	the number of remote semaphores that control this one
dependent_job_id	INTEGER	NULL	either NULL or points at a local job to be unblocked when the semaphore opens (exclusive with dependent_semaphore_url)	unique key: unique_dependent_job_id
dependent_semaphore_url	VARCHAR(255)	NULL	either NULL or points at a remote semaphore to be decremented when this semaphore opens (exclusive with dependent_job_id)

Logging and monitoring


Tables to store messages and feedback from the execution of the pipeline

../_images/graphviz-f754ebbca2df17f8ed99431559d34111546189ac.svg

log_message


When a Job or a job-less Worker (job_id=NULL) throws a “die” message for any reason, the message is recorded in this table. It may or may not indicate that the job was unsuccessful via is_error flag. Also $self->warning(”…”) messages are recorded with is_error=0.

Column	Type	Default value	Description	Index
log_message_id	INTEGER	not set	an autoincremented primary id of the message	primary key
job_id	INTEGER	NULL	the id of the job that threw the message (or NULL if it was outside of a message)	key: job_id
role_id	INTEGER	NULL	the “current” role
worker_id	INTEGER	NULL	the “current” worker	key: worker_id
beekeeper_id	INTEGER	NULL	beekeeper that generated this message	key: beekeeper_id
when_logged	TIMESTAMP	CURRENT_TIMESTAMP	when the message was thrown
retry	INTEGER	NULL	retry_count of the job when the message was thrown (or NULL if no job)
status	VARCHAR(255)	‘UNKNOWN’	of the job or worker when the message was thrown
msg	MEDIUMTEXT	NULL	string that contains the message
message_class	ENUM: INFO PIPELINE_CAUTION PIPELINE_ERROR WORKER_CAUTION WORKER_ERROR	‘INFO’	type of message

worker_resource_usage


A table with post-mortem resource usage statistics of a Worker. This table is not automatically populated: you first need to run load_resource_usage.pl. Note that some meadows (like LOCAL) do not support post-mortem inspection of resource usage

Column	Type	Default value	Description	Index
worker_id	INTEGER	not set	links to the worker table	primary key
exit_status	VARCHAR(255)	NULL	meadow-dependent, in case of LSF it’s usually “done” (normal) or “exit” (abnormal)
mem_megs	FLOAT	NULL	how much memory the Worker process used
swap_megs	FLOAT	NULL	how much swap the Worker process used
pending_sec	FLOAT	NULL	time spent by the process in the queue before it became a Worker
cpu_sec	FLOAT	NULL	cpu time (in seconds) used by the Worker process. It is often lower than the walltime because of time spent in I/O waits, but it can also be higher if the process is multi-threaded
lifespan_sec	FLOAT	NULL	walltime (in seconds) used by the Worker process. It is often higher than the sum of its jobs’ “runtime_msec” because of the overhead from the Worker itself
exception_status	VARCHAR(255)	NULL	meadow-specific flags, in case of LSF it can be “underrun”, “overrun” or “idle”

Pipeline structure


Tables to store the structure of the pipeline (i.e. the analyses and the dataflow-rules between them)

../_images/graphviz-4c645927de1cc3dea1edf3b2c3a79b8631b2e13d.svg

analysis_base


Each Analysis is a node of the pipeline diagram. It acts both as a “class” to which Jobs belong (and inherit from it certain properties) and as a “container” for them (Jobs of an Analysis can be blocking all Jobs of another Analysis).

Column	Type	Default value	Description	Index
analysis_id	INTEGER	not set	a unique ID that is also a foreign key to most of the other tables	primary key
logic_name	VARCHAR(255)	not set	the name of the Analysis object	unique key: logic_name_idx
module	VARCHAR(255)	not set	the name of the module / package that runs this Analysis
language	VARCHAR(255)	NULL	the language of the module, if not Perl
parameters	MEDIUMTEXT	NULL	a stringified hash of parameters common to all jobs of the Analysis
resource_class_id	INTEGER	not set	link to the resource_class table
failed_job_tolerance	INTEGER	0	% of tolerated failed Jobs
max_retry_count	INTEGER	NULL	how many times a job of this Analysis will be retried (unless there is no point)
can_be_empty	SMALLINT	0	if TRUE, this Analysis will not be blocking if/while it doesn’t have any jobs
priority	SMALLINT	0	an Analysis with higher priority will be more likely chosen on Worker’s specialisation
meadow_type	VARCHAR(255)	NULL	if defined, forces this Analysis to be run only on the given Meadow
analysis_capacity	INTEGER	NULL	if defined, limits the number of Workers of this particular Analysis that are allowed to run in parallel
hive_capacity	INTEGER	NULL	a reciprocal limiter on the number of Workers running at the same time (dependent on Workers of other Analyses)
batch_size	INTEGER	1	how many jobs are claimed in one claiming operation before Worker starts executing them
comment	TEXT	not set	human-readable textual description of the analysis
tags	TEXT	not set	machine-readable list of tags useful for grouping analyses together

analysis_ctrl_rule


These rules define a higher level of control. These rules are used to turn whole analysis nodes on/off (READY/BLOCKED). If any of the condition_analyses are not “DONE” the ctrled_analysis is set to BLOCKED. When all conditions become “DONE” then ctrled_analysis is set to READY The workers switch the analysis.status to “WORKING” and “DONE”. But any moment if a condition goes false, the analysis is reset to BLOCKED.

Column	Type	Default value	Description	Index
analysis_ctrl_rule_id	INTEGER	not set	internal ID	primary key
condition_analysis_url	VARCHAR(255)	‘’	foreign key to net distributed analysis reference	unique: key
ctrled_analysis_id	INTEGER	not set	foreign key to analysis table analysis_id	unique: key

analysis_stats


Parallel table to analysis_base which provides high level statistics on the state of an analysis and it’s jobs. Used to provide a fast overview, and to provide final approval of “DONE” which is used by the blocking rules to determine when to unblock other analyses. Also provides

Column	Type	Default value	Description	Index
analysis_id	INTEGER	not set	foreign-keyed to the corresponding analysis_base entry	primary key
status	ENUM: BLOCKED LOADING SYNCHING EMPTY READY WORKING ALL_CLAIMED DONE FAILED	‘EMPTY’	cached state of the Analysis
total_job_count	INTEGER	0	total number of Jobs of this Analysis
semaphored_job_count	INTEGER	0	number of Jobs of this Analysis that are in SEMAPHORED state
ready_job_count	INTEGER	0	number of Jobs of this Analysis that are in READY state
done_job_count	INTEGER	0	number of Jobs of this Analysis that are in DONE state
failed_job_count	INTEGER	0	number of Jobs of this Analysis that are in FAILED state
num_running_workers	INTEGER	0	number of running Workers of this Analysis
avg_msec_per_job	INTEGER	NULL	weighted average
avg_input_msec_per_job	INTEGER	NULL	weighted average
avg_run_msec_per_job	INTEGER	NULL	weighted average
avg_output_msec_per_job	INTEGER	NULL	weighted average
when_updated	TIMESTAMP	NULL	when this entry was last updated
sync_lock	SMALLINT	0	a binary lock flag to prevent simultaneous updates
is_excluded	SMALLINT	0	set to exclude analysis from beekeeper scheduling

dataflow_rule


Each entry of this table defines a starting point for dataflow (via from_analysis_id and branch_code) to which point a group of dataflow_target entries can be linked. This grouping is used in two ways: (1) dataflow_target entries that link into the same dataflow_rule share the same from_analysis, branch_code and funnel_dataflow_rule (2) to define the conditions for DEFAULT or ELSE case (via excluding all conditions explicitly listed in the group)

Column	Type	Default value	Description	Index
dataflow_rule_id	INTEGER	not set	internal ID	primary key
from_analysis_id	INTEGER	not set	foreign key to analysis table analysis_id
branch_code	INTEGER	1	branch_code of the fan
funnel_dataflow_rule_id	INTEGER	NULL	dataflow_rule_id of the semaphored funnel (is NULL by default, which means dataflow is not semaphored)

dataflow_target


This table links specific conditions with the target object (Analysis/Table/Accu) and optional input_id_template.

Column	Type	Default value	Description	Index
dataflow_target_id	INTEGER	not set	internal ID	primary key
source_dataflow_rule_id	INTEGER	not set	foreign key to the dataflow_rule object that defines grouping (see description of dataflow_rule table)	unique: key
on_condition	VARCHAR(255)	NULL	param-substitutable string evaluated at the moment of dataflow event that defines whether or not this case produces any dataflow; NULL means DEFAULT or ELSE	unique: key
input_id_template	MEDIUMTEXT	NULL	a template for generating a new input_id (not necessarily a hashref) in this dataflow; if undefined is kept original	unique: key
extend_param_stack	SMALLINT	0	the boolean value defines whether the newly created jobs will inherit both the parameters and the accu of the prev_job
to_analysis_url	VARCHAR(255)	‘’	the URL of the dataflow target object (Analysis/Table/Accu)	unique: key

hive_meta


This table keeps several important hive-specific pipeline-wide key-value pairs such as hive_sql_schema_version, hive_use_triggers and hive_pipeline_name.

Column	Type	Default value	Description	Index
meta_key	VARCHAR(255)	not set	the key of key-value pairs (primary key)	primary key
meta_value	MEDIUMTEXT	NULL	the value of key-value pairs

pipeline_wide_parameters


This table contains a simple hash between pipeline_wide_parameter names and their values. The same data used to live in “meta” table until both the schema and the API were finally separated from Ensembl Core.

Column	Type	Default value	Description	Index
param_name	VARCHAR(255)	not set	the key of key-value pairs (primary key)	key: value_idx primary key
param_value	MEDIUMTEXT	NULL	the value of key-value pairs

Resources


Tables used to describe the resources needed by each analysis

../_images/graphviz-4aad1e3c3dc0797a855506698457ab43a23ed99a.svg

resource_class


Maps between resource_class numeric IDs and unique names.

Column	Type	Default value	Description	Index
resource_class_id	INTEGER	not set	unique ID of the ResourceClass	primary key
name	VARCHAR(255)	not set	unique name of the ResourceClass	unique: key

resource_description


Maps (ResourceClass, MeadowType) pair to Meadow-specific resource lines.

Column	Type	Default value	Description	Index
resource_class_id	INTEGER	not set	foreign-keyed to the ResourceClass entry	primary key
meadow_type	VARCHAR(255)	not set	if the Worker is about to be executed on the given Meadow…	primary key
submission_cmd_args	VARCHAR(255)	‘’	… these are the resource arguments (queue, memory,…) to give to the submission command
worker_cmd_args	VARCHAR(255)	‘’	… and these are the arguments that are given to the worker command being submitted

Schema diagram

Table list

Execution

beekeeper

role

worker

Job tracking

accu

analysis_data

job

job_file

semaphore

Logging and monitoring

log_message

worker_resource_usage

Pipeline structure

analysis_base

analysis_ctrl_rule

analysis_stats

dataflow_rule

dataflow_target

hive_meta

pipeline_wide_parameters

Resources

resource_class

resource_description