您好,登錄后才能下訂單哦!
這篇文章主要為大家展示了“perl如何分離NR NT庫”,內容簡而易懂,條理清晰,希望能夠幫助大家解決疑惑,下面讓小編帶領大家一起研究并學習一下“perl如何分離NR NT庫”這篇文章吧。
分離NR NT庫,快速blast本地比對同源注釋基因
我們需要對自己的基因做注釋,需要blast同源比對NCBI當中的NR NT庫;通常做無參轉錄組,會組織出10幾萬的unigene ,如果比對全庫的話,就太浪費時間了,我們可以根據NCBI的分類數據庫將數據庫分開,可下載如下文件,然后利用下面的perl腳本就可以把NR或者NT庫分開成小庫:
wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
perl腳本如下,由于腳本會把分類文件全部讀入內存,所有腳本內存消耗較大,建議確保100G以上的內存空間:
#The gi_taxid_nucl.dmp is about 160 MB and contains two columns: the nucleotide's gi and taxid. #The gi_taxid_prot.dmp is about 17 MB and contains two columns: the protein's gi and taxid. #Divisions file (division.dmp): # division id -- taxonomy database division id # division cde -- GenBank division code (three characters) # division name -- e.g. BCT, PLN, VRT, MAM, PRI... # comments #0 | BCT | Bacteria | | #1 | INV | Invertebrates | | #2 | MAM | Mammals | | #3 | PHG | Phages | | #4 | PLN | Plants and Fungi | | #5 | PRI | Primates | | #6 | ROD | Rodents | | #7 | SYN | Synthetic and Chimeric | | #8 | UNA | Unassigned | No species nodes should inherit this division assignment | #9 | VRL | Viruses | | #10 | VRT | Vertebrates | | #11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment | #nodes.dmp file consists of taxonomy nodes. The description for each node includes the following #fields: # tax_id -- node id in GenBank taxonomy database # parent tax_id -- parent node id in GenBank taxonomy database # rank -- rank of this node (superkingdom, kingdom, ...) # embl code -- locus-name prefix; not unique # division id -- see division.dmp file # inherited div flag (1 or 0) -- 1 if node inherits division from parent # genetic code id -- see gencode.dmp file # inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent # mitochondrial genetic code id -- see gencode.dmp file # inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent # GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage # hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet # comments -- free-text comments and citations die "perl $0 <division.dmp> <nodes.dmp> <gi_taxid_n/p.dmp> <nr.fa/nt.fa> <od>" unless(@ARGV==5); use Math::BigFloat; use Bio::SeqIO; use Bio::Seq; use Data::Dumper; use PerlIO::gzip; use FileHandle; use Cwd qw(abs_path getcwd); if ($ARGV[3]=~/gz$/){ open $Fa, "<:gzip", "$ARGV[3]" or die "$!"; }else{ open $Fa, "$ARGV[3]" or die "$!"; } my$od=$ARGV[4]; $od||=getcwd; $od=abs_path($od); unless(-d $od){ mkdir $od;} my $in = Bio::SeqIO->new(-fh => $Fa , -format => 'Fasta'); my %division2name=(); open IN,"$ARGV[0]" or die "$!"; while(<IN>){ chomp; my($taxid,$name)=split(/\s+\|\s+/); $division2name{$taxid}=$name; } close(IN); my%fout=(); for my$i (keys %division2name){ my$FO=FileHandle->new("> $od/$division2name{$i}.fa"); my $out = Bio::SeqIO->new(-fh => $FO , -format => 'Fasta'); $fout{$i}=$out; } print "$ARGV[0] readed\n"; #print Dumper(\%fout); #print Dumper(\%division2name); open IN,"$ARGV[1]" or die "$!"; my%taxid2division=(); while(<IN>){ chomp; my@tmp=split(/\s+\|\s+/); $taxid2division{$tmp[0]}=$tmp[4]; #last if $.>100; } close(IN); print "$ARGV[1] readed\n"; my %gi2taxid=(); open IN,"$ARGV[2]" or die "$!"; while(<IN>){ chomp; my@tmp=split(/\s+/); $gi2taxid{$tmp[0]}=$tmp[1]; #last if $.>100; } close(IN); print "$ARGV[2] readed\n"; #print Dumper(\%gi2taxid); while( my $seq = $in->next_seq()){ my $id=$seq->id; my ($gi)=($id=~/gi\|(\d+)\|ref\|/); if(exists($gi2taxid{$gi}) and exists($taxid2division{$gi2taxid{$gi}})){ $fout{$taxid2division{$gi2taxid{$gi}}}->write_seq($seq); }else{ print "unknown gi:$gi\n"; } } for my $i (keys %fout){ $fout{$i}->close(); }
以上是“perl如何分離NR NT庫”這篇文章的所有內容,感謝各位的閱讀!相信大家都有了一定的了解,希望分享的內容對大家有所幫助,如果還想學習更多知識,歡迎關注億速云行業資訊頻道!
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。