In this section, we will take you through the basics of setting up a pipeline to run one of the example workflows included with eHive, followed by executing it through to completion.
If eHive hasn’t been installed on your system, follow the instructions in Installation and setup to obtain the code and set up your environment. In particular, confirm that your $PATH includes the eHive scripts directory. There is no need to add the eHive modules directory to your $PERL5LIB, as the eHive scripts automatically look for required eHive modules and add them to the Perl search path.
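As a quick sanity check, you can confirm that the eHive scripts directory is on your $PATH. The sketch below assumes eHive was cloned to $HOME/ehive; the path is illustrative, so adjust it to your own checkout location:

```shell
# Illustrative only: adjust EHIVE_ROOT to wherever you cloned eHive
EHIVE_ROOT="$HOME/ehive"
PATH="$EHIVE_ROOT/scripts:$PATH"

# Confirm the scripts directory is now on the search path
case ":$PATH:" in
    *":$EHIVE_ROOT/scripts:"*) echo "eHive scripts directory is on PATH" ;;
    *)                         echo "eHive scripts directory is NOT on PATH" ;;
esac
```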
Some pipelines may have other dependencies beyond eHive (e.g. the Ensembl Core API, BioPerl, etc.). Make sure you have installed them and configured your environment (PATH and PERL5LIB). init_pipeline will try to compile all the Analysis modules, which ensures that most of the dependencies are installed, but some missing dependencies will only be detected at runtime.
You should have a MySQL or PostgreSQL database with CREATE, CREATE ROUTINE, SELECT, INSERT, and UPDATE privileges available. Alternatively, you should have SQLite available on your system.
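For MySQL, the privileges listed above could be granted with a statement along the following lines. The user name, host, and database-name pattern are placeholders, not anything prescribed by eHive:

```shell
# Build a GRANT statement covering the privileges eHive needs.
# 'ehive_user' and the database-name pattern are placeholders.
GRANT_SQL="GRANT CREATE, CREATE ROUTINE, SELECT, INSERT, UPDATE ON \`ehive_pipelines_%\`.* TO 'ehive_user'@'%'"
echo "$GRANT_SQL"

# You would run this as a MySQL administrator, e.g.:
#   mysql -h server -u admin -p -e "$GRANT_SQL"
```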
Each eHive pipeline is a potentially complex computational process. Whether it runs locally, on the farm, or on multiple compute resources, this process is centred around a database where individual Jobs of the pipeline are created, claimed by independent Workers and later recorded as done or failed.
Executing a pipeline involves the following steps:
- Using the init_pipeline script to create an instance pipeline database from a “PipeConfig” file,
- (optionally) Using the seed_pipeline script to add Jobs to be run,
- Running the beekeeper script that will look after the pipeline and maintain a population of Worker processes on the compute resource that will take and perform all the Jobs of the pipeline,
- (optionally) Monitoring the state of the running pipeline.
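Put together, a complete run can look like the sketch below. The PipeConfig name and database URL are illustrative, and the commands are guarded so the sketch is inert on a machine where eHive is not installed:

```shell
# Names below are illustrative; substitute your own PipeConfig and database URL
PIPECONFIG='Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf'
PIPELINE_URL='sqlite:///long_mult_hive_db'

if command -v init_pipeline.pl >/dev/null 2>&1; then
    # 1. create the pipeline database from the PipeConfig
    init_pipeline.pl "$PIPECONFIG" -pipeline_url "$PIPELINE_URL"

    # 2. (optional) seed extra Jobs with the seed_pipeline script

    # 3. loop the Beekeeper until all Jobs are done
    beekeeper.pl -url "$PIPELINE_URL" -loop
fi
```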
We’ll start by initialising a pipeline to run one of the example workflows included with eHive, the “long-multiplication pipeline.” This workflow is simple to set up and run, with no dependencies on e.g. external data files.
When we initialise a pipeline, we are setting up an eHive database. This database is then used by the Beekeeper and by Worker processes to coordinate all the work that they need to do. In particular, initialising the pipeline means:
- Creating tables in the database. The table schema is the same in any eHive pipeline – as long as the eHive version is the same. The schema can change between eHive versions (but we provide patch scripts to update your schema should you need to upgrade). The table schema is defined in files in the eHive distribution – you should not edit or change these files.
- Populating some of those tables with data describing the structure of your workflow, along with initial parameters for running it. It’s the data in the tables that defines how a particular pipeline runs, not the structure. This information is loaded from a file known as a PipeConfig file.
- A PipeConfig file is a Perl module conforming to a particular interface (Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf). By convention, these files are given names ending in “_conf.pm”. They must be located somewhere that Perl can find them by class name.
- In general, the eHive database corresponds to a particular run of a pipeline, whereas the PipeConfig file contains the structure for all runs of a pipeline. To make an analogy with software objects, you can think of the PipeConfig file as something like a class, with the database being an instance of that class.
Initialising a pipeline is accomplished by running the init_pipeline.pl script. This script requires a minimum of two arguments to work:
- The classname of the PipeConfig you’re initialising,
- The name of the database to be initialised. This is usually passed in the form of a URL (e.g. sqlite:///sqlite_filename), given via the -pipeline_url option.
There are other options to init_pipeline.pl that will be covered later in this manual; you can see a list of them with init_pipeline.pl -h. One option you should be aware of is -hive_force_init 1. Normally, if the database already has data in it, the init_pipeline.pl command will exit, leaving the database untouched, and print a warning message. This is a safety feature to prevent inadvertently overwriting a database with potentially many days of work in it. If -hive_force_init 1 is set, however, the database will be reinitialised from scratch and any data in it erased, so use this option wisely!
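For instance, forcing reinitialisation over an existing database would look like the sketch below. Because the command is destructive, it is printed here rather than executed:

```shell
# WARNING: -hive_force_init 1 wipes any existing data in the target database.
# The command is printed rather than run, since it is destructive:
CMD="init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf \
    -pipeline_url 'sqlite:///long_mult_hive_db' -hive_force_init 1"
echo "$CMD"
```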
Let’s run an actual init_pipeline.pl on the command line. We’re going to initialise an eHive database for the “long-multiplication pipeline,” which is defined in Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf:
```shell
# The following command creates a new SQLite database called 'long_mult_hive_db',
# then sets up the tables and data eHive needs for the long-multiplication pipeline
init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf \
    -pipeline_url 'sqlite:///long_mult_hive_db'

# Alternatively, you could initialise a MySQL database for this eHive pipeline
# by running a command like this
init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf \
    -pipeline_url 'mysql://[username]:[password]@[server]:[port]/long_mult_hive_db'
```
After running init_pipeline.pl, you should see a list of useful commands printed to the terminal. If something went wrong, you may see an error message instead. Some common error messages are:
- ERROR 1007 (HY000) at line 1: Can't create database 'long_mult_hive_db'; database exists (or errors looking like Error: near line [line number]: table [table name] already exists) – means the database you’re trying to initialise already exists. Choose a different database name, or run with -hive_force_init 1.
- ERROR 1044 (42000) at line 1: Access denied for user [username] to database – means the user given in the URL doesn’t have enough privileges to create a database and load it with data.
- Can't locate object method "new" via package... – usually means the package name in the Perl file doesn’t match the filename.
This step is optional. Some of these tools may not be available, depending on the software installation in your environment.
eHive is distributed with a number of tools that let you examine the structure of a pipeline, along with its current state and the progress made while working through it. For example,
tweak_pipeline.pl can query pipeline parameters as well as set them, while GuiHive allows you to visualise pipelines in a web browser. Two scripts are included that produce diagrams illustrating a pipeline’s structure and the current progress of work through it:
If a GuiHive server is available and running in your compute environment, open a web browser and connect to that GuiHive server. Enter your pipeline URL into the URL: field and click connect. (If you are using a SQLite database, the webserver running GuiHive will need to have access to the filesystem where your SQLite database resides, and you will need to give the full path to the database file.)
You can use generate_graph.pl and visualize_jobs.pl to generate Analysis-level and Job-level diagrams of your pipeline (for a more thorough explanation of these diagrams, see the Long Multiplication pipeline walkthrough).
generate_graph.pl requires a pipeline url or a PipeConfig classname as an argument. You can specify an output file in a variety of graphics formats, or if no output file is specified, an ascii-art diagram will be generated.
visualize_jobs.pl requires a pipeline url and an output filename to be passed as arguments. Both of these scripts require a working graphviz installation. Some usage examples:
```shell
# generate an Analysis diagram for the pipeline in sqlite:///long_mult_hive_db
# and store it as long_mult_diagram.png
generate_graph.pl -url sqlite:///long_mult_hive_db -output long_mult_diagram.png

# generate an Analysis diagram for the pipeline defined in
# Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf
# and display it as ascii-art in the terminal
generate_graph.pl -pipeconfig Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf

# generate a Job-level diagram for the pipeline in sqlite:///long_mult_hive_db
# and store it as long_mult_job_diagram.svg
visualize_jobs.pl -url sqlite:///long_mult_hive_db -output long_mult_job_diagram.svg
```
Pipelines are typically run using the beekeeper.pl script. This is a lightweight script designed to run continuously in a loop for as long as your pipeline is running. It checks the pipeline’s current status, creates Worker processes as needed to perform the pipeline’s actual work, then goes to sleep for a period of time (one minute by default). After each loop, it prints information on the pipeline’s current progress and status. As an aside, beekeeper.pl can perform a number of pipeline maintenance tasks in addition to its looping function; these are covered elsewhere in the manual.
- The Beekeeper needs to know which eHive database stores the pipeline. This is passed with the -url parameter.
- To run the Beekeeper in loop mode, where it monitors the pipeline (this is the typical use case mentioned above), pass it the -loop switch. When looping, you can change the sleep time with the -sleep flag, passing it a sleep time in minutes (e.g. -sleep 0.5 to shorten the sleep time to 30 seconds).
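Combining these options, a Beekeeper invocation with a shortened sleep time could look like the sketch below. The database URL matches the one used elsewhere in this chapter, and the call is guarded so the sketch is inert on a machine without eHive:

```shell
# Shorten the sleep between Beekeeper loops to 30 seconds (0.5 minutes)
SLEEP_MINUTES=0.5
if command -v beekeeper.pl >/dev/null 2>&1; then
    beekeeper.pl -url 'sqlite:///long_mult_hive_db' -loop -sleep "$SLEEP_MINUTES"
fi
```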
Let’s run the Beekeeper in loop mode, keeping the default one minute sleep time to provide time to examine the pipeline status messages.
You may notice this is one of the “useful commands” listed after running init_pipeline.pl, so you could just copy and paste it to the command line.
For this “long multiplication pipeline” the Beekeeper should loop three or four times before stopping and returning you to the command prompt. The exact number of loops will depend on your particular system.
```shell
# Here is the Beekeeper command pointing to the SQLite database initialised in
# the previous step. Substitute the database url as needed to point to the
# database you initialised
beekeeper.pl -url 'sqlite:///long_mult_hive_db' -loop
```
One last note about the Beekeeper: as it loops it creates and starts
Workers, but once created these Workers continue their lifespan
independently. It’s analogous to a pump filling a stream. If you
kill the Beekeeper, you stop the pump, but the water is still flowing,
i.e. the Workers are not killed but still running. To actually kill
the Workers, you have to use the specific commands of your grid engine
(e.g. bkill for Platform LSF).
The Beekeeper’s output can appear dense and a bit cryptic. However, it is organised into logical sections, with some parts useful for monitoring the health of your pipeline, with other parts more useful for advanced techniques such as pipeline optimisation. Let’s deconstruct the output from a typical Beekeeper loop:
- Each loop begins with a “Beekeeper : loop #N =================” line.
- There will be a couple of lines starting with “GarbageCollector:” - advanced users may find the information here useful for performance tuning or troubleshooting.
- There will then be a table showing work that is pending or in progress. This section is the most important to pay attention to in day-to-day eHive operation. These lines show progress being made through the pipeline, and can also provide an early warning sign of trouble. This table has the following columns:
- The Analysis name and Analysis ID number,
- The status of the Analysis (typically LOADING, READY, or ALL_CLAIMED; possibly FAILED). Analyses that are done are not shown in this table.
- A Job summary, showing the number of [r]eady, [s]emaphored, [i]n-progress, and [d]one Jobs in the Analysis,
- Average runtime for Jobs in the Analysis,
- Number of Workers working on this Analysis,
- Hive_capacity and analysis_capacity settings for this Analysis,
- The last time the Beekeeper performed an internal-bookkeeping synchronisation on this Analysis.
- There will then be a summary of progress through the pipeline.
- The next several lines show the Beekeeper’s plan to create new Workers for the pipeline. This can be useful for debugging.
- Finally, the Beekeeper will announce it is going to sleep.