從序列中提取基因間區域。有時候,閱讀基因之間的區域會更有趣....
以下可直接執行的腳本讀取一個 genbank 檔案,這通常是基因組或染色體檔案。它使用 CDS 特徵來找出 ORF 的 5' 端和 3' 端。是的,ORF 不完全是基因的同義詞,但這是我們處理的方式。此外,如果您也對 RNA 編碼基因感興趣,您可能希望將 CDS 特徵替換為「基因」特徵。「intergene_length」變數是待分析的基因間區域的最小長度閾值,預設值為 1。程式的輸出檔案的字尾為「_ign.fasta」。程式會根據 genbank 檔案的註解輸出 + 鏈或反向互補鏈。輸出為 FASTA 格式,標頭包含基因間區域的坐標、唯一 ID 以及序列是否來自 + 鏈或 - 鏈。
#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os
# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path, intergene_length=1):
seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
cds_list_plus = []
cds_list_minus = []
intergenic_records = []
# Loop over the genome file, get the CDS features on each of the strands
for feature in seq_record.features:
if feature.type == "CDS":
mystart = feature.location._start.position
myend = feature.location._end.position
if feature.strand == -1:
cds_list_minus.append((mystart, myend, -1))
elif feature.strand == 1:
cds_list_plus.append((mystart, myend, 1))
else:
sys.stderr.write(
"No strand indicated %d-%d. Assuming +\n" % (mystart, myend)
)
cds_list_plus.append((mystart, myend, 1))
for i, pospair in enumerate(cds_list_plus[1:]):
# Compare current start position to previous end position
last_end = cds_list_plus[i][1]
this_start = pospair[0]
strand = pospair[2]
if this_start - last_end >= intergene_length:
intergene_seq = seq_record.seq[last_end:this_start]
strand_string = "+"
intergenic_records.append(
SeqRecord(
intergene_seq,
id="%s-ign-%d" % (seq_record.name, i),
description="%s %d-%d %s"
% (seq_record.name, last_end + 1, this_start, strand_string),
)
)
for i, pospair in enumerate(cds_list_minus[1:]):
last_end = cds_list_minus[i][1]
this_start = pospair[0]
strand = pospair[2]
if this_start - last_end >= intergene_length:
intergene_seq = seq_record.seq[last_end:this_start]
strand_string = "-"
intergenic_records.append(
SeqRecord(
intergene_seq,
id="%s-ign-%d" % (seq_record.name, i),
description="%s %d-%d %s"
% (seq_record.name, last_end + 1, this_start, strand_string),
)
)
outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
SeqIO.write(intergenic_records, open(outpath, "w"), "fasta")
if __name__ == "__main__":
if len(sys.argv) == 2:
get_interregions(sys.argv[1])
elif len(sys.argv) == 3:
get_interregions(sys.argv[1], int(sys.argv[2]))
else:
print("Usage: get_intergenic.py gb_file [intergenic_length]")
sys.exit(0)
有關程式碼的完整說明,請參閱此處:bytesizebio.net
./get\_intergene mygenbankfile.gb 1
在 genbank 檔案上執行。基因間序列將以 FASTA 格式出現在 mygenbank_ign.fasta 檔案中。