ensembl-hive  2.6
Bio::EnsEMBL::Hive::RunnableDB::FastaFactory Class Reference

Public Member Functions

public param_defaults ()
 
public fetch_input ()
 
public run ()
 
public write_output ()
 
public post_cleanup ()
 
Public Member Functions inherited from Bio::EnsEMBL::Hive::Process
public new ()
 
public life_cycle ()
 
public say_with_header ()
 
public enter_status ()
 
public warning ()
 
public param_defaults ()
 
public fetch_input ()
 
public run ()
 
public write_output ()
 
public Bio::EnsEMBL::Hive::Worker worker ()
 
public execute_writes ()
 
public Bio::EnsEMBL::Hive::DBSQL::DBAdaptor db ()
 
public Bio::EnsEMBL::Hive::DBSQL::DBConnection dbc ()
 
public Bio::EnsEMBL::Hive::DBSQL::DBConnection data_dbc ()
 
public run_system_command ()
 
public Bio::EnsEMBL::Hive::AnalysisJob input_job ()
 
public input_id ()
 
public param ()
 
public param_required ()
 
public param_exists ()
 
public param_is_defined ()
 
public param_substitute ()
 
public dataflow_output_id ()
 
public dataflow_output_ids_from_json ()
 
public throw ()
 
public complete_early ()
 
public Int debug ()
 
public worker_temp_directory ()
 
public cleanup_worker_temp_directory ()
 

Detailed Description

Synopsis

standaloneJob.pl Bio::EnsEMBL::Hive::RunnableDB::FastaFactory --inputfile reference.fasta --max_chunk_length 600000

standaloneJob.pl Bio::EnsEMBL::Hive::RunnableDB::FastaFactory \
    --inputfile reference.fasta \
    --max_chunk_length 700000 \
    --output_prefix ref_chunk \
    --flow_into "{ 2 => ['mysql://ensadmin:${ENSADMIN_PSW}@127.0.0.1/lg4_split_fasta/analysis?logic_name=blast']}"

Description

    This is a Bioinformatics-specific "Factory" Runnable that splits a given Fasta file into smaller chunks
    and dataflows one job per chunk. Note that:
        - the files are created in the current directory unless param('output_dir') is set.
        - the Runnable does not split individual sequences; it only groups them so that no output file
          is longer than param('max_chunk_length').
        - Thanks to BioPerl's versatility, the Runnable can in fact read many formats. Tune param('input_format') to do so.
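
    The grouping rule in the second note can be sketched as follows. This is a minimal illustration in Python, not the
    module's Perl code. The function name group_into_chunks and the (name, length) tuples are hypothetical, and the sketch
    assumes that a sequence longer than the limit ends up in a chunk of its own, since sequences are never split.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int]  # (sequence name, sequence length); hypothetical shape


def group_into_chunks(records: Iterable[Record], max_chunk_length: int) -> List[List[Record]]:
    """Group whole sequences so that each chunk's total length stays within the limit.

    Assumption: a single sequence longer than max_chunk_length forms its own chunk.
    """
    chunks: List[List[Record]] = []
    current: List[Record] = []
    current_length = 0
    for name, length in records:
        # Close the current chunk if this sequence would push it over the limit.
        if current and current_length + length > max_chunk_length:
            chunks.append(current)
            current, current_length = [], 0
        current.append((name, length))
        current_length += length
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [("seq1", 300), ("seq2", 400), ("seq3", 500), ("seq4", 1200)]
    for number, chunk in enumerate(group_into_chunks(demo, 1000), start=1):
        print(number, [name for name, _ in chunk])
```

    With a limit of 1000, seq1 and seq2 share a chunk, seq3 starts a second one, and the oversized seq4 is kept whole
    in a third.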

    The following parameters are supported:

        param('inputfile');         # The original Fasta file: 'inputfile' => 'my_sequences.fasta'

        param('max_chunk_length');  # Maximum total length of sequences in a chunk: 'max_chunk_length' => '200000'

        param('max_chunk_size');    # Defines the maximum allowed number of sequences to be included in each output file.

        param('seq_filter');        # Can be used to exclude sequences from output files. e.g. '^TF' would exclude all sequences starting with TF.

        param('output_prefix');     # A common prefix for output files: 'output_prefix' => 'my_special_chunk_'

        param('output_suffix');     # A common suffix for output files: 'output_suffix' => '.nt'

        param('hash_directories');  # Boolean (defaults to 0): whether the output files should be put in different ("hashed") directories

        param('input_format');      # The format of the input file (defaults to "fasta")

        param('output_format');     # The format of the output file (defaults to the same as param('input_format'))

        param('output_dir');        # Where to create the chunks (defaults to the current directory)
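
    How 'seq_filter', 'max_chunk_size', 'output_prefix' and 'output_suffix' might interact can be sketched in Python.
    The helper names are hypothetical. The sketch assumes that the filter is a regular expression matched against the
    sequence identifier and that a chunk file is named prefix + running chunk number + suffix; neither detail is stated
    above, so check the module source before relying on them.

```python
import re
from typing import Iterable, Iterator, List


def filter_names(names: Iterable[str], seq_filter: str) -> Iterator[str]:
    """Yield only the names that do NOT match seq_filter (assumed to be a regex on the identifier)."""
    pattern = re.compile(seq_filter)
    for name in names:
        if not pattern.search(name):
            yield name


def split_by_count(names: Iterable[str], max_chunk_size: int) -> List[List[str]]:
    """Start a new chunk once the current one already holds max_chunk_size sequences."""
    chunks: List[List[str]] = []
    for name in names:
        if not chunks or len(chunks[-1]) >= max_chunk_size:
            chunks.append([])
        chunks[-1].append(name)
    return chunks


def chunk_file_name(output_prefix: str, chunk_number: int, output_suffix: str) -> str:
    """Assumed naming scheme: prefix + running chunk number + suffix."""
    return f"{output_prefix}{chunk_number}{output_suffix}"


if __name__ == "__main__":
    names = ["TF_001", "AB_002", "TFX_003", "CD_004", "EF_005"]
    kept = filter_names(names, "^TF")  # drops identifiers starting with TF
    for number, chunk in enumerate(split_by_count(kept, 2), start=1):
        print(chunk_file_name("my_special_chunk_", number, ".nt"), chunk)
```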

Definition at line 53 of file FastaFactory.pm.

Member Function Documentation

◆ fetch_input()

public Bio::EnsEMBL::Hive::RunnableDB::FastaFactory::fetch_input ( )
    Description : Implements fetch_input() interface method of Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
                    Here we only check the existence of 'inputfile' parameter and try to parse it (all other parameters have defaults).
 

◆ param_defaults()

public Bio::EnsEMBL::Hive::RunnableDB::FastaFactory::param_defaults ( )
    Description : Implements param_defaults() interface method of Bio::EnsEMBL::Hive::Process that defines module defaults for parameters.
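
    The effect of module defaults can be sketched in Python as a dictionary overlay. This is a simplification: the
    names MODULE_DEFAULTS and effective_params are hypothetical, only the defaults stated in the Detailed Description
    are listed, and eHive's real parameter resolution has more layers than the two shown here.

```python
from typing import Any, Dict

# Defaults taken from the Detailed Description; parameters with no documented default are left out.
MODULE_DEFAULTS: Dict[str, Any] = {
    "hash_directories": 0,
    "input_format": "fasta",
}


def effective_params(job_params: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay job parameters on the module defaults; job values win."""
    params = {**MODULE_DEFAULTS, **job_params}
    # output_format falls back to whatever input_format resolved to.
    params.setdefault("output_format", params["input_format"])
    return params


if __name__ == "__main__":
    print(effective_params({"inputfile": "reference.fasta", "max_chunk_length": 600000}))
    print(effective_params({"inputfile": "reference.gb", "input_format": "genbank"}))
```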
 

◆ post_cleanup()

public Bio::EnsEMBL::Hive::RunnableDB::FastaFactory::post_cleanup ( )
    Description : Closes the file handle opened in fetch_input(), even if the job fails or write_output() never runs.
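
    The guarantee described here resembles a try/finally pattern. The Python sketch below illustrates the idea only:
    SketchRunnable and life_cycle are hypothetical stand-ins, not the eHive Process class or its life_cycle() method,
    and an in-memory io.StringIO replaces the real input file.

```python
import io
from typing import Optional, TextIO


class SketchRunnable:
    """Hypothetical stand-in for a Runnable; not the eHive Process class."""

    def __init__(self, text: str, fail_in_write_output: bool = False) -> None:
        self.text = text
        self.fail_in_write_output = fail_in_write_output
        self.handle: Optional[TextIO] = None

    def fetch_input(self) -> None:
        # Open the input here; io.StringIO stands in for a real file.
        self.handle = io.StringIO(self.text)

    def write_output(self) -> None:
        if self.fail_in_write_output:
            raise RuntimeError("simulated failure")
        assert self.handle is not None
        self.handle.read()

    def post_cleanup(self) -> None:
        # Close the handle if fetch_input() got as far as opening it.
        if self.handle is not None:
            self.handle.close()


def life_cycle(runnable: SketchRunnable) -> None:
    try:
        runnable.fetch_input()
        runnable.write_output()
    finally:
        # Runs whether or not an earlier stage raised.
        runnable.post_cleanup()
```

    Calling life_cycle() on an instance created with fail_in_write_output=True still leaves its handle closed, while
    the RuntimeError propagates to the caller.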
 

◆ run()

public Bio::EnsEMBL::Hive::RunnableDB::FastaFactory::run ( )
    Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).
                    Because the data is streamed for efficiency, this method does nothing; all functionality is in write_output().
 

◆ write_output()

public Bio::EnsEMBL::Hive::RunnableDB::FastaFactory::write_output ( )
    Description : Implements write_output() interface method of Bio::EnsEMBL::Hive::Process that is used to deal with the job's output after execution.
                    The main bulk of this Runnable's functionality is here.
                    Iterates through all sequences in input_seqio, splits them into separate files ("chunks") using a cut-off length and dataflows one job per chunk.
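
    The streaming approach can be sketched in Python: sequences are read one at a time, appended to the current chunk
    file, and a record is emitted for every chunk that is closed. This is not the module's Perl code. The names
    read_fasta and write_chunks, the default prefix and suffix, and the keys of the returned dictionaries are all
    hypothetical. In a pipeline, each returned record would be dataflowed as one job (the Synopsis wires branch 2 for this).

```python
import io
import os
import tempfile
from typing import Any, Dict, Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one at a time from a FASTA stream."""
    header: Optional[str] = None
    parts: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def write_chunks(handle: TextIO, max_chunk_length: int, output_dir: str,
                 output_prefix: str = "chunk_", output_suffix: str = ".fasta") -> List[Dict[str, Any]]:
    """Write chunk files while streaming the input; return one record per chunk."""
    output_ids: List[Dict[str, Any]] = []
    out: Optional[TextIO] = None
    chunk: Dict[str, Any] = {}
    try:
        for header, seq in read_fasta(handle):
            # Close the open chunk if this sequence would push it over the limit.
            if out is not None and chunk["chunk_length"] + len(seq) > max_chunk_length:
                out.close()
                out = None
                output_ids.append(chunk)
            if out is None:
                number = len(output_ids) + 1
                name = os.path.join(output_dir, f"{output_prefix}{number}{output_suffix}")
                chunk = {"chunk_number": number, "chunk_name": name,
                         "chunk_length": 0, "chunk_size": 0}
                out = open(name, "w")
            out.write(f">{header}\n{seq}\n")
            chunk["chunk_length"] += len(seq)
            chunk["chunk_size"] += 1
        if out is not None:
            output_ids.append(chunk)
    finally:
        if out is not None:
            out.close()
    return output_ids


if __name__ == "__main__":
    fasta = io.StringIO(">a\nACGT\nAC\n>b\nACGTACGT\n>c\nAC\n")
    with tempfile.TemporaryDirectory() as tmp:
        for record in write_chunks(fasta, 14, tmp):
            print(record)
```

    Only one sequence and one open output file are held at a time, which is the reason for doing the work in a single
    pass rather than collecting all sequences first.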
 

The documentation for this class was generated from the following file:
FastaFactory.pm