This document describes the tables that make up the eHive schema. Tables are grouped into categories, and the purpose of each table is explained.

Schema diagram

The schema diagrams are automatically generated with Graphviz, and show the links between the columns of each table. The overall schema diagram follows, while the individual diagrams for each category appear below, together with the table descriptions.

../_images/graphviz-9263c573227c18c9db5c72853c7675dceb8190bc.svg

Table list

Here is the list of all the tables found in a typical database, grouped by category.

Execution
Job tracking
Logging and monitoring
Pipeline structure
Resources

Execution

Tables to track the agents operating the eHive system
../_images/graphviz-bc6b2a7e454a62348fff0554a5b200c5de32544c.svg

beekeeper

Each row in this table corresponds to a beekeeper process that is running or has run on this pipeline.

Column Type Default value Description Index
beekeeper_id INTEGER not set unique ID for this beekeeper primary key
meadow_host VARCHAR(255) not set hostname of machine where beekeeper started key: beekeeper_signature_idx
meadow_user VARCHAR(255) not set username under which this beekeeper ran or is running key: beekeeper_signature_idx
process_id VARCHAR(255) not set process_id of the beekeeper key: beekeeper_signature_idx
is_blocked SMALLINT 0 beekeeper will not submit workers if this flag is set  
cause_of_death ENUM(ANALYSIS_FAILED, DISAPPEARED, JOB_FAILED, LOOP_LIMIT, NO_WORK, TASK_FAILED) NULL reason this beekeeper exited  
sleep_minutes REAL NULL sleep interval in minutes  
analyses_pattern TEXT NULL the analyses_pattern this beekeeper is restricted to, if given  
loop_limit INTEGER NULL loop limit if given  
loop_until ENUM(ANALYSIS_FAILURE, FOREVER, JOB_FAILURE, NO_WORK) not set beekeeper’s policy for responding to possibly loop-ending events  
options TEXT NULL all options passed to the beekeeper  
meadow_signatures TEXT NULL signatures for all meadows this beekeeper can submit to  
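
Since cause_of_death is only filled in when a beekeeper exits, the beekeepers that are still (nominally) running can be listed directly from this table. A minimal query sketch, assuming a MySQL-style SQL dialect (illustrative only, not part of the eHive API):

    -- Beekeepers that have not recorded an exit reason yet
    SELECT beekeeper_id, meadow_host, meadow_user, process_id, options
      FROM beekeeper
     WHERE cause_of_death IS NULL;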

role

Entries of this table correspond to Role objects of the API. When a Worker specialises, it acquires a Role, which is a temporary link between the Worker and a resource-compatible Analysis.

Column Type Default value Description Index
role_id INTEGER not set unique ID of the Role primary key
worker_id INTEGER not set the specialised Worker key: worker
analysis_id INTEGER not set the Analysis into which the Worker specialised key: analysis
when_started TIMESTAMP CURRENT_TIMESTAMP when this Role started  
when_finished TIMESTAMP NULL when this Role finished. NULL may indicate either that it is still running or that it was killed by an external force.  
attempted_jobs INTEGER 0 counter of the number of attempts  
done_jobs INTEGER 0 counter of the number of successful attempts  
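
Because each Role accumulates per-link counters, a per-analysis success summary can be obtained by aggregating over this table. A sketch in standard SQL (illustrative only):

    -- Attempted vs. successful job runs per analysis, summed over all Roles
    SELECT analysis_id,
           COUNT(*)            AS n_roles,
           SUM(attempted_jobs) AS attempted,
           SUM(done_jobs)      AS done
      FROM role
     GROUP BY analysis_id;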

worker

Entries of this table correspond to Worker objects of the API. Workers are created by inserting a row into this table, so that there is only one instance of each Worker object in the database. Workers update their row as they live and do work, and update it one final time when they die.

Column Type Default value Description Index
worker_id INTEGER not set unique ID of the Worker primary key
meadow_type VARCHAR(255) not set type of the Meadow it is running on key: meadow_process
meadow_name VARCHAR(255) not set name of the Meadow it is running on (for “LOCAL” meadows it is the same as meadow_host) key: meadow_process
meadow_host VARCHAR(255) NULL execution host name  
meadow_user VARCHAR(255) NULL scheduling/execution user name (within the Meadow)  
process_id VARCHAR(255) not set identifies the Worker process on the Meadow (for “LOCAL” meadows it is the OS PID) key: meadow_process
resource_class_id INTEGER NULL links to Worker’s resource class  
work_done INTEGER 0 how many jobs the Worker has completed successfully  
status VARCHAR(255) ‘READY’ current status of the Worker  
beekeeper_id INTEGER NULL beekeeper that created this worker  
when_submitted TIMESTAMP CURRENT_TIMESTAMP when the Worker was submitted by a Beekeeper  
when_born TIMESTAMP NULL when the Worker process was started  
when_checked_in TIMESTAMP NULL when the Worker last checked into the database  
when_seen TIMESTAMP NULL when the Worker was last seen by the Meadow  
when_died TIMESTAMP NULL if defined, when the Worker died (or its premature death was first detected by GC)  
cause_of_death VARCHAR(255) NULL if defined, why did the Worker exit (or why it was killed)  
log_dir VARCHAR(255) NULL if defined, a filesystem directory where this Worker’s output is logged  
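
Since when_died stays NULL while a Worker is alive, the current worker population can be queried directly. A sketch (illustrative only):

    -- Workers that have not registered a death yet, most recent check-in first
    SELECT worker_id, meadow_type, meadow_name, process_id, status
      FROM worker
     WHERE when_died IS NULL
     ORDER BY when_checked_in DESC;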

Job tracking

Tables to list all the jobs of each analysis, and their dependencies
../_images/graphviz-af2ce79fb6dd47e9e8853fc7b83cede8ff08fb57.svg

accu

Accumulator for funneled dataflow.

Column Type Default value Description Index
sending_job_id INTEGER NULL semaphoring job in the “box” key: accu_sending_idx
receiving_semaphore_id INTEGER not set semaphore just outside the “box” key: accu_receiving_idx
struct_name VARCHAR(255) not set name of the structured parameter  
key_signature VARCHAR(255) not set locates the part of the structured parameter  
value MEDIUMTEXT NULL value of the part  

analysis_data

A generic blob-storage hash. Currently the only legitimate use of this table is the “overflow” of job.input_ids: when they grow longer than 254 characters, the real data is stored in analysis_data instead, and the input_id contains the corresponding analysis_data_id.

Column Type Default value Description Index
analysis_data_id INTEGER not set primary id primary key
md5sum CHAR(32) not set checksum over the data to quickly detect (potential) collisions key: md5sum
data MEDIUMTEXT not set text blob which holds the data  
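
The md5sum column exists precisely so that collision checks stay cheap; a sketch of such a check in standard SQL (illustrative only):

    -- Flag any checksum shared by more than one stored blob
    SELECT md5sum, COUNT(*) AS n_entries
      FROM analysis_data
     GROUP BY md5sum
    HAVING COUNT(*) > 1;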

job

The job table is the heart of this system. It is the kiosk or blackboard where workers find things to do and then post work for other workers to do. Jobs are created prior to work being done, are claimed by workers, and are updated as the work is done, with a final update on completion.

Column Type Default value Description Index
job_id INTEGER not set autoincrement id primary key
prev_job_id INTEGER NULL previous job which created this one  
analysis_id INTEGER not set the analysis_id needed to accomplish this job. unique key: input_id_stacks_analysis key: analysis_status_retry
input_id CHAR(255) not set input data passed into Analysis:RunnableDB to control the work unique key: input_id_stacks_analysis
param_id_stack CHAR(64) ‘’ a CSV of job_ids whose input_ids contribute to the stack of local variables for the job unique key: input_id_stacks_analysis
accu_id_stack CHAR(64) ‘’ a CSV of job_ids whose accu’s contribute to the stack of local variables for the job unique key: input_id_stacks_analysis
role_id INTEGER NULL links to the Role that claimed this job (NULL means it has never been claimed) key: role_status
status ENUM(SEMAPHORED, READY, CLAIMED, COMPILATION, PRE_CLEANUP, FETCH_INPUT, RUN, WRITE_OUTPUT, POST_HEALTHCHECK, POST_CLEANUP, DONE, FAILED, PASSED_ON) ‘READY’ state the job is in key: analysis_status_retry key: role_status
retry_count INTEGER 0 number of times the job had to be reset when a worker failed to run it key: analysis_status_retry
when_completed TIMESTAMP NULL when the job was completed  
runtime_msec INTEGER NULL how long it took to execute the job (or how long it ran until the moment it failed)  
query_count INTEGER NULL how many SQL queries were run during this job  
controlled_semaphore_id INTEGER NULL the dbID of the semaphore that is controlled by this job (and whose counter it will decrement by 1 upon successful completion)  
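
A typical monitoring query joins job to analysis_base and counts jobs per analysis and state, as in this sketch (standard SQL, illustrative only):

    -- Job counts per analysis and status
    SELECT a.logic_name, j.status, COUNT(*) AS n_jobs
      FROM job j
      JOIN analysis_base a USING (analysis_id)
     GROUP BY a.logic_name, j.status
     ORDER BY a.logic_name;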

job_file

For testing/debugging purposes, the STDOUT and STDERR streams of each Job can be redirected into separate log files. This table holds the filesystem paths to one or both of those files. There is at most one entry per (job_id, retry) pair.

Column Type Default value Description Index
job_id INTEGER not set foreign key primary key
retry INTEGER not set copy of retry_count of job as it was run primary key
role_id INTEGER not set links to the Role that claimed this job key: role
stdout_file VARCHAR(255) NULL path to the job’s STDOUT log  
stderr_file VARCHAR(255) NULL path to the job’s STDERR log  
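
To locate the error output of failing jobs, this table can be joined back to job, as in this sketch (illustrative only):

    -- STDERR log paths recorded for jobs currently in FAILED state
    SELECT j.job_id, jf.retry, jf.stderr_file
      FROM job j
      JOIN job_file jf ON jf.job_id = j.job_id
     WHERE j.status = 'FAILED';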

semaphore

The semaphore table is our primary inter-job dependency relationship. Any job may control at most one semaphore, but a semaphore can be controlled by many jobs. This includes remote jobs, so the semaphore keeps two counters: one for local blockers and one for remote ones. As soon as both counters reach zero, the semaphore unblocks one dependent job, either a local one, or a remote one through a chain of dependent remote semaphores.

Column Type Default value Description Index
semaphore_id INTEGER not set autoincrement id primary key
local_jobs_counter INTEGER 0 the number of local jobs that control this semaphore  
remote_jobs_counter INTEGER 0 the number of remote semaphores that control this one  
dependent_job_id INTEGER NULL either NULL or points at a local job to be unblocked when the semaphore opens (exclusive with dependent_semaphore_url) unique key: unique_dependent_job_id
dependent_semaphore_url VARCHAR(255) NULL either NULL or points at a remote semaphore to be decremented when this semaphore opens (exclusive with dependent_job_id)  
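
Because a semaphore only opens once both counters reach zero, the semaphores that are still holding something back are those with a positive sum. A sketch (illustrative only):

    -- Semaphores still blocking their dependent job or remote semaphore
    SELECT semaphore_id, local_jobs_counter, remote_jobs_counter,
           dependent_job_id, dependent_semaphore_url
      FROM semaphore
     WHERE local_jobs_counter + remote_jobs_counter > 0;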

Logging and monitoring

Tables to store messages and feedback from the execution of the pipeline
../_images/graphviz-d8407b89f95d2cd7ad556422e3ad5f03e58932f2.svg

analysis_stats_monitor

A regular timestamped snapshot of the analysis_stats table.

Column Type Default value Description Index
when_logged TIMESTAMP CURRENT_TIMESTAMP when this snapshot was taken  
analysis_id INTEGER not set foreign-keyed to the corresponding analysis_base entry  
status ENUM(BLOCKED, LOADING, SYNCHING, EMPTY, READY, WORKING, ALL_CLAIMED, DONE, FAILED) ‘EMPTY’ cached state of the Analysis  
total_job_count INTEGER 0 total number of Jobs of this Analysis  
semaphored_job_count INTEGER 0 number of Jobs of this Analysis that are in SEMAPHORED state  
ready_job_count INTEGER 0 number of Jobs of this Analysis that are in READY state  
done_job_count INTEGER 0 number of Jobs of this Analysis that are in DONE state  
failed_job_count INTEGER 0 number of Jobs of this Analysis that are in FAILED state  
num_running_workers INTEGER 0 number of running Workers of this Analysis  
avg_msec_per_job INTEGER NULL weighted average  
avg_input_msec_per_job INTEGER NULL weighted average  
avg_run_msec_per_job INTEGER NULL weighted average  
avg_output_msec_per_job INTEGER NULL weighted average  
when_updated TIMESTAMP NULL when this entry was last updated  
sync_lock SMALLINT 0 a binary lock flag to prevent simultaneous updates  
is_excluded SMALLINT not set set to exclude analysis from beekeeper scheduling  

log_message

When a Job or a job-less Worker (job_id=NULL) throws a “die” message for any reason, the message is recorded in this table. Whether or not it indicates that the job was unsuccessful is captured by the message_class column. $self->warning(”…”) messages are also recorded here, with a non-error message_class.

Column Type Default value Description Index
log_message_id INTEGER not set an autoincremented primary id of the message primary key
job_id INTEGER NULL the id of the job that threw the message (or NULL if the message was thrown outside of a job) key: job_id
role_id INTEGER NULL the “current” role  
worker_id INTEGER NULL the “current” worker key: worker_id
beekeeper_id INTEGER NULL beekeeper that generated this message key: beekeeper_id
when_logged TIMESTAMP CURRENT_TIMESTAMP when the message was thrown  
retry INTEGER NULL retry_count of the job when the message was thrown (or NULL if no job)  
status VARCHAR(255) ‘UNKNOWN’ status of the job or worker when the message was thrown  
msg MEDIUMTEXT NULL string that contains the message  
message_class ENUM(INFO, PIPELINE_CAUTION, PIPELINE_ERROR, WORKER_CAUTION, WORKER_ERROR) ‘INFO’ type of message  
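
A common triage query filters on message_class to surface the errors first. A sketch, assuming a MySQL-style dialect (illustrative only):

    -- Most recent error messages, with their job/worker context
    SELECT when_logged, job_id, worker_id, message_class, msg
      FROM log_message
     WHERE message_class IN ('PIPELINE_ERROR', 'WORKER_ERROR')
     ORDER BY when_logged DESC
     LIMIT 20;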

worker_resource_usage

A table with post-mortem resource usage statistics of a Worker. This table is not automatically populated: you first need to run load_resource_usage.pl. Note that some meadows (like LOCAL) do not support post-mortem inspection of resource usage.

Column Type Default value Description Index
worker_id INTEGER not set links to the worker table primary key
exit_status VARCHAR(255) NULL meadow-dependent, in case of LSF it’s usually “done” (normal) or “exit” (abnormal)  
mem_megs FLOAT NULL how much memory the Worker process used  
swap_megs FLOAT NULL how much swap the Worker process used  
pending_sec FLOAT NULL time spent by the process in the queue before it became a Worker  
cpu_sec FLOAT NULL cpu time (in seconds) used by the Worker process. It is often lower than the walltime because of time spent in I/O waits, but it can also be higher if the process is multi-threaded  
lifespan_sec FLOAT NULL walltime (in seconds) used by the Worker process. It is often higher than the sum of its jobs’ “runtime_msec” because of the overhead from the Worker itself  
exception_status VARCHAR(255) NULL meadow-specific flags, in case of LSF it can be “underrun”, “overrun” or “idle”  
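
Once populated, this table is typically read together with worker and resource_class to compare actual consumption across resource classes. A sketch (illustrative only):

    -- Peak memory and swap per resource class
    SELECT rc.name AS resource_class,
           MAX(u.mem_megs)  AS max_mem_megs,
           MAX(u.swap_megs) AS max_swap_megs
      FROM worker_resource_usage u
      JOIN worker w          USING (worker_id)
      JOIN resource_class rc USING (resource_class_id)
     GROUP BY rc.name;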

Pipeline structure

Tables to store the structure of the pipeline (i.e. the analyses and the dataflow-rules between them)
../_images/graphviz-124115d62e005ee86decf82a8abeb0028b362a66.svg

analysis_base

Each Analysis is a node of the pipeline diagram. It acts both as a “class” to which Jobs belong (and from which they inherit certain properties) and as a “container” for them (Jobs of an Analysis can block all Jobs of another Analysis).

Column Type Default value Description Index
analysis_id INTEGER not set a unique ID that is also a foreign key to most of the other tables primary key
logic_name VARCHAR(255) not set the name of the Analysis object unique key: logic_name_idx
module VARCHAR(255) not set the name of the module / package that runs this Analysis  
language VARCHAR(255) NULL the language of the module, if not Perl  
parameters MEDIUMTEXT NULL a stringified hash of parameters common to all jobs of the Analysis  
resource_class_id INTEGER not set link to the resource_class table  
failed_job_tolerance INTEGER 0 % of tolerated failed Jobs  
max_retry_count INTEGER NULL how many times a job of this Analysis will be retried (unless there is no point)  
can_be_empty SMALLINT 0 if TRUE, this Analysis will not be blocking if/while it doesn’t have any jobs  
priority SMALLINT 0 an Analysis with higher priority will be more likely chosen on Worker’s specialisation  
meadow_type VARCHAR(255) NULL if defined, forces this Analysis to be run only on the given Meadow  
analysis_capacity INTEGER NULL if defined, limits the number of Workers of this particular Analysis that are allowed to run in parallel  
hive_capacity INTEGER NULL a reciprocal limiter on the number of Workers running at the same time (dependent on Workers of other Analyses)  
batch_size INTEGER 1 how many jobs are claimed in one claiming operation before Worker starts executing them  
comment TEXT not set human-readable textual description of the analysis  
tags TEXT not set machine-readable list of tags useful for grouping analyses together  

analysis_ctrl_rule

These rules define a higher level of control: they are used to turn whole Analysis nodes on/off (READY/BLOCKED). If any of the condition analyses are not “DONE”, the ctrled_analysis is set to BLOCKED. When all conditions become “DONE”, the ctrled_analysis is set to READY, and the workers then switch the analysis status to “WORKING” and eventually “DONE”. If at any moment a condition stops holding, the analysis is reset to BLOCKED.

Column Type Default value Description Index
analysis_ctrl_rule_id INTEGER not set internal ID primary key
condition_analysis_url VARCHAR(255) ‘’ foreign key to net distributed analysis reference unique key
ctrled_analysis_id INTEGER not set foreign key to analysis table analysis_id unique key

analysis_stats

Parallel table to analysis_base which provides high-level statistics on the state of an analysis and its jobs. Used to provide a fast overview, and to provide final approval of “DONE”, which is used by the blocking rules to determine when to unblock other analyses.

Column Type Default value Description Index
analysis_id INTEGER not set foreign-keyed to the corresponding analysis_base entry primary key
status ENUM(BLOCKED, LOADING, SYNCHING, EMPTY, READY, WORKING, ALL_CLAIMED, DONE, FAILED) ‘EMPTY’ cached state of the Analysis  
total_job_count INTEGER 0 total number of Jobs of this Analysis  
semaphored_job_count INTEGER 0 number of Jobs of this Analysis that are in SEMAPHORED state  
ready_job_count INTEGER 0 number of Jobs of this Analysis that are in READY state  
done_job_count INTEGER 0 number of Jobs of this Analysis that are in DONE state  
failed_job_count INTEGER 0 number of Jobs of this Analysis that are in FAILED state  
num_running_workers INTEGER 0 number of running Workers of this Analysis  
avg_msec_per_job INTEGER NULL weighted average  
avg_input_msec_per_job INTEGER NULL weighted average  
avg_run_msec_per_job INTEGER NULL weighted average  
avg_output_msec_per_job INTEGER NULL weighted average  
when_updated TIMESTAMP NULL when this entry was last updated  
sync_lock SMALLINT 0 a binary lock flag to prevent simultaneous updates  
is_excluded SMALLINT 0 set to exclude analysis from beekeeper scheduling  
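
Joined with analysis_base for the names, this table gives a one-query progress overview of the whole pipeline. A sketch (illustrative only):

    -- High-level progress per analysis
    SELECT a.logic_name, s.status,
           s.total_job_count, s.ready_job_count,
           s.done_job_count, s.failed_job_count,
           s.num_running_workers
      FROM analysis_stats s
      JOIN analysis_base a USING (analysis_id)
     ORDER BY a.logic_name;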

dataflow_rule

Each entry of this table defines a starting point for dataflow (via from_analysis_id and branch_code) to which a group of dataflow_target entries can be linked. This grouping is used in two ways: (1) dataflow_target entries that link into the same dataflow_rule share the same from_analysis, branch_code and funnel_dataflow_rule; (2) the group defines the conditions for the DEFAULT or ELSE case (by excluding all the conditions explicitly listed in the group).

Column Type Default value Description Index
dataflow_rule_id INTEGER not set internal ID primary key
from_analysis_id INTEGER not set foreign key to analysis table analysis_id  
branch_code INTEGER 1 branch_code of the fan  
funnel_dataflow_rule_id INTEGER NULL dataflow_rule_id of the semaphored funnel (is NULL by default, which means dataflow is not semaphored)  

dataflow_target

This table links specific conditions with the target object (Analysis/Table/Accu) and optional input_id_template.

Column Type Default value Description Index
dataflow_target_id INTEGER not set internal ID primary key
source_dataflow_rule_id INTEGER not set foreign key to the dataflow_rule object that defines the grouping (see the description of the dataflow_rule table) unique key
on_condition VARCHAR(255) NULL param-substitutable string evaluated at the moment of the dataflow event, defining whether or not this case produces any dataflow; NULL means DEFAULT or ELSE unique key
input_id_template MEDIUMTEXT NULL a template for generating a new input_id (not necessarily a hashref) in this dataflow; if undefined, the original is kept unique key
extend_param_stack SMALLINT 0 the boolean value defines whether the newly created jobs will inherit both the parameters and the accu of the prev_job  
to_analysis_url VARCHAR(255) ‘’ the URL of the dataflow target object (Analysis/Table/Accu) unique key
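
The wiring of the pipeline can be reconstructed by joining the two dataflow tables back to analysis_base. A sketch in standard SQL (illustrative only):

    -- One row per dataflow edge: source analysis, branch, condition, target
    SELECT a.logic_name AS from_analysis,
           r.branch_code,
           t.on_condition,
           t.to_analysis_url
      FROM dataflow_rule r
      JOIN dataflow_target t ON t.source_dataflow_rule_id = r.dataflow_rule_id
      JOIN analysis_base a ON a.analysis_id = r.from_analysis_id
     ORDER BY a.logic_name, r.branch_code;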

hive_meta

This table keeps several important hive-specific pipeline-wide key-value pairs such as hive_sql_schema_version, hive_use_triggers and hive_pipeline_name.

Column Type Default value Description Index
meta_key VARCHAR(255) not set the key of key-value pairs (primary key) primary key
meta_value MEDIUMTEXT NULL the value of key-value pairs  

pipeline_wide_parameters

This table contains a simple hash between pipeline_wide_parameter names and their values. The same data used to live in the “meta” table until both the schema and the API were finally separated from Ensembl Core.

Column Type Default value Description Index
param_name VARCHAR(255) not set the key of key-value pairs (primary key) key: value_idx primary key
param_value MEDIUMTEXT NULL the value of key-value pairs  

Resources

Tables used to describe the resources needed by each analysis
../_images/graphviz-4aad1e3c3dc0797a855506698457ab43a23ed99a.svg

resource_class

Maps between resource_class numeric IDs and unique names.

Column Type Default value Description Index
resource_class_id INTEGER not set unique ID of the ResourceClass primary key
name VARCHAR(255) not set unique name of the ResourceClass unique key

resource_description

Maps (ResourceClass, MeadowType) pair to Meadow-specific resource lines.

Column Type Default value Description Index
resource_class_id INTEGER not set foreign-keyed to the ResourceClass entry primary key
meadow_type VARCHAR(255) not set if the Worker is about to be executed on the given Meadow… primary key
submission_cmd_args VARCHAR(255) ‘’ … these are the resource arguments (queue, memory,…) to give to the submission command  
worker_cmd_args VARCHAR(255) ‘’ … and these are the arguments that are given to the worker command being submitted
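
The effective resource line of each analysis follows from joining this category's tables with analysis_base. A sketch (illustrative only):

    -- Submission arguments each analysis will be given, per meadow
    SELECT a.logic_name, rc.name AS resource_class,
           rd.meadow_type, rd.submission_cmd_args
      FROM analysis_base a
      JOIN resource_class rc       USING (resource_class_id)
      JOIN resource_description rd USING (resource_class_id)
     ORDER BY a.logic_name, rd.meadow_type;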