3 min read

A Genbank –> GTF converter derived from BioPerl

BioPerl has a Genbank –> GFF converter (the script is bp_genbank2gff3.pl), however, this converter generated GFF/GTF file could not work in Bioconductor when using the function maketxdbfromGFF to construct a transcript database.

Currently, I have developed such a converter that may extract information from a NCBI Genbank file and write them to GFF version 2, or GTF format. This script is derived from BioPerl, so you need firstly get the module installed in your system. To check whether BioPerl is ready, just refer to this post: http://bio-spring.top/check-bioperl/ .

Here is the synopsis of this script.

SYNOPSIS

NCBI_genbank2gtf.pl [options] filename

Options:
–help -h display this message
–input -i NCBI GenBank file
–out -o filename of GTF output

Examples:

perl NCBI_genbank2gtf.pl -i seq.gb -o seq.gtf
perl NCBI_genbank2gtf.pl seq.gb > seq.gtf

The codes is as follows:
#!/usr/bin/perl

=pod

=head1 NAME

NCBI_genbank2gtf.pl – Genbank -> Bioconductor friendly GTF

=head1 SYNOPSIS

NCBI_genbank2gtf.pl [options] filename

Options:
–help -h display this message
–input -i NCBI GenBank file
–out -o filename of GTF output

Examples:

perl NCBI_genbank2gtf.pl -i seq.gb -o seq.gtf
perl NCBI_genbank2gtf.pl seq.gb > seq.gtf

=head1 DESCRIPTION

Use this program to generate bioconductor friendly GTF files from NCBI GenBank.
The file may be used in *maketxdbfromGFF* function and should work well.

=head1 AUTHOR

Chun-Hui, Gao (gaoch@thelifesciencecentury.com)

Copyright (c) 2015 www.thelifesciencecentury.com.

=cut

use strict;
use Data::Dumper;
use Pod::Usage;
use Bio::SeqIO;
use Getopt::Long;

my ($help, $genbank_input, $gtf_output);

my $ok= GetOptions( ‘help|h’ => \$help,
‘input|i=s’ => \$genbank_input,
‘out|o=s’ => \$gtf_output );
pod2usage(2) if $help || !$ok;

$genbank_input = shift @ARGV unless ($genbank_input );

open *IN, “< $genbank_input " or die pod2usage(2); my $out; if ($gtf_output) { open $out, “> $gtf_output” or die “Can’t open file $gtf_output:$@\n”;
}
else {
$out = *STDOUT;
}

my %seen;
my $in = Bio::SeqIO->new(-fh => \*IN, -format => “genbank”);
while ( my $seq = $in->next_seq() ) {
my $seq_acc = $seq->accession_number;

# abort if there are no features
warn “$seq_acc has no features, skipping\n” and next
if !$seq->all_SeqFeatures;

# construct a GFF header
# add: get source_type from attributes of source feature? chromosome=X tag
# also combine 1st ft line here with source ft from $seq ..
my $header = gff_header($seq);
print $out $header;
warn “# working on $seq_acc\n”;

for my $feature ( $seq->get_all_SeqFeatures ) {

my $method = $feature->primary_tag;
$seen{$method} ++;
$feature = maptags2gff($feature);
print $out $feature->gff_string(), “\n”;
if ($method eq ‘CDS’){
$feature->primary_tag(“exon”);
$feature->add_tag_value(“exon_number”, 1);
$feature->add_tag_value(“exon_id”, $feature->get_tag_values(“transcript_id”));
$feature = maptags2gff($feature);
print $out $feature->gff_string(), “\n”;
}

}

}

warn “# Count of features in $genbank_input:\n”;
map {warn sprintf("# %20s:%6d\n”, $_, $seen{$_});} keys %seen;

sub gff_header {
my ($seq) = @_;
my $head = sprintf("##gff-version 2\n# sequence-region: %s (1..%d)\n", $seq->accession_number, $seq->length);
$head .= sprintf( “# %s\n”, $seq->desc);
$head .= “# converted by NCBI_genbank2gtf.pl\n”;

#~ print Dumper $acc, $desc, $end;
return $head;

}

sub maptags2gff {
my $f = shift;
my $method = $f->primary_tag;
my %TAG_MAP;
if ($method eq ‘gene’ || $method eq ‘misc_RNA’ || $method eq ‘rRNA’ || $method eq ‘tRNA’){
$f->add_tag_value(‘transcript_id’,$f->get_tag_values(‘locus_tag’));
%TAG_MAP = ( gene => ‘gene_name’,
transcript_id => ‘transcript_id’,
locus_tag => ‘gene_id’ );
}
elsif ($method eq ‘CDS’) {
$f->add_tag_value(‘transcript_id’,$f->get_tag_values(‘locus_tag’));
%TAG_MAP = ( locus_tag => ‘gene_id’,
protein_id => ‘protein_id’,
transcript_id => ‘transcript_id’,
gene => ‘gene_name’,
product => ‘product’);

}
elsif ($method eq ‘exon’) {

%TAG_MAP = ( gene_id => ‘gene_id’,
protein_id => ‘protein_id’,
transcript_id => ‘transcript_id’,
exon_number => ‘exon_number’,
exon_id => ‘exon_id’,
gene_name => ‘gene_name’);

}
elsif ($method eq ‘source’) {
%TAG_MAP = ( organism => ‘organism’ );

}
else {

}

my @all_tags = $f->get_all_tags();
#~ print Dumper \@all_tags;
for my $tag (@all_tags){

if (exists $TAG_MAP{$tag}){
my $newtag= $TAG_MAP{$tag};
my @v= $f->get_tag_values($tag);
$f->remove_tag($tag);
$f->add_tag_value($newtag,@v);
}
else {
$f->remove_tag($tag);
}
}

return $f;
}

If you would like to access the full document, run:
perldoc NCBI_genbank2gtf.pl

Feedback is welcome.

Here is also a GenBank -> BED converter: