
History and replacement of Bio.Alphabet

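Introduction

This page is intended to help people who are updating existing code that uses Biopython to cope with the removal of the Bio.Alphabet module in Biopython 1.78 (September 2020).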

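The objects in Bio.Alphabet had two main uses:

Recording the molecule type (DNA, RNA or protein) of a sequence,
Declaring the expected characters in a sequence, alignment, motif, etc.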

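Motivation

The intended use of the alphabet objects was never clearly defined, and the twenty-year-old design had drawbacks. In particular, the AlphabetEncoder classes (used to add information about gaps or stop symbols) were overly complex, so that even determining the molecule type was difficult. Obtaining a consensus of several alphabet objects (for example when adding sequences together) was also complicated. And while you could give a sequence a strict alphabet, such as unambiguous IUPAC DNA, nothing enforced that only the letters A, C, G and T were actually used.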

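Code changes

Since there was no clear proposal for how to improve or replace the existing system, it was agreed to remove it. In general, you can simply remove any explicit use of Bio.Alphabet from your code.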

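Seq changes

Seq objects no longer have an .alphabet attribute, and Seq operations are no longer type checked, for example when adding a protein to DNA. Start by removing the alphabet argument: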

# Old style
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq

my_dna = Seq("ACGTTT", generic_dna)

# New style
from Bio.Seq import Seq

my_dna = Seq("ACGTTT")
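
Because Seq operations are no longer type checked, code that relied on that check may want to track the molecule type itself. The following is only a minimal sketch of one way to do so. The helper name checked_concat and its arguments are hypothetical and not part of Biopython. It relies only on the + operator, so it should work with plain strings as well as Seq objects.

```python
def checked_concat(left, left_type, right, right_type):
    """Concatenate two sequences only if their molecule types match."""
    if left_type != right_type:
        raise TypeError(
            f"cannot add {right_type} sequence to {left_type} sequence"
        )
    return left + right


# Plain strings are used here so the sketch runs without Biopython.
print(checked_concat("ACGT", "DNA", "TTGA", "DNA"))  # ACGTTTGA
```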

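See below for where the alphabet was used to set the molecule type for output file formats.

Another use of the alphabet was to declare the gap character, which defaults to - in the various Biopython sequence and alignment parsers. If you are using a different character, you now need to pass it explicitly to the .replace() method of the Seq object: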

# Old style
from Bio.Alphabet import generic_dna, Gapped
from Bio.Seq import Seq

my_dna = Seq("ACGT=TT", Gapped(generic_dna, "="))
print(my_dna.ungap())

# New style
from Bio.Seq import Seq

my_dna = Seq("ACGT=TT")
print(my_dna.replace("=", ""))
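
If a sequence may contain more than one kind of gap character, the same .replace() approach can be applied once per character. This is a sketch under that assumption. The helper name strip_gaps is hypothetical and not a Biopython function. It only needs a str-like .replace() method, so it is shown here on plain strings.

```python
def strip_gaps(sequence, gap_chars="-"):
    """Return sequence with every character in gap_chars removed.

    Works on any object with a str-like .replace() method.
    """
    for gap in gap_chars:
        sequence = sequence.replace(gap, "")
    return sequence


print(strip_gaps("ACGT=TT", "="))    # ACGTTT
print(strip_gaps("AC-GT.TT", "-."))  # ACGTTT
```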

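SeqRecord changes

Some sequence file formats require the molecule type when writing a file. Previously this was recorded using a Bio.Alphabet object as the .alphabet attribute of the Seq object. It is now recorded as a molecule type string in the annotations dictionary of the SeqRecord object.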

# Old style
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("ATGCGTGCAT", generic_dna)
record = SeqRecord(seq, id="test")
SeqIO.write(record, "test_write.gb", "genbank")

# New style
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("ATGCGTGCAT")
record = SeqRecord(seq, id="test", annotations={"molecule_type": "DNA"})
SeqIO.write(record, "test_write.gb", "genbank")

# Compatible with both pre- and post Biopython 1.78:
try:
    from Bio.Alphabet import generic_dna
except ImportError:
    generic_dna = None
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

if generic_dna is not None:
    # Older Biopython: pass the alphabet (newer versions refuse a second argument)
    seq = Seq("ATGCGTGCAT", generic_dna)
else:
    seq = Seq("ATGCGTGCAT")
record = SeqRecord(seq, id="test", annotations={"molecule_type": "DNA"})
SeqIO.write(record, "test_write.gb", "genbank")

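The Bio.SeqIO parsing functions used to accept an optional alphabet argument, which set this value when it could not be determined from the file format. This is no longer possible: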

# Old style
from Bio.Alphabet import generic_dna
from Bio import SeqIO

# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta", generic_dna)
rec_start = record[:20]
SeqIO.write(rec_start, "start_only.xml", "seqxml")

In an example like this, you must explicitly set the molecule type in the record's annotations before writing the record:

# New style
from Bio import SeqIO

# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta")
rec_start = record[:20]
rec_start.annotations["molecule_type"] = "DNA"
SeqIO.write(rec_start, "start_only.xml", "seqxml")
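
For files with many records, the same annotation can be applied lazily while iterating. The following is a minimal sketch. The generator name with_molecule_type is hypothetical, and the stand-in objects below only imitate the annotations attribute of a SeqRecord so that the example runs without any sequence files.

```python
from types import SimpleNamespace


def with_molecule_type(records, molecule_type):
    """Yield each record after setting annotations["molecule_type"]."""
    for record in records:
        record.annotations["molecule_type"] = molecule_type
        yield record


# Stand-in objects; in real code the iterable would come from SeqIO.parse(...).
fake_records = [
    SimpleNamespace(id="r1", annotations={}),
    SimpleNamespace(id="r2", annotations={}),
]
tagged = list(with_molecule_type(fake_records, "DNA"))
print([r.annotations["molecule_type"] for r in tagged])  # ['DNA', 'DNA']
```

With Biopython, the generator could wrap the iterator returned by SeqIO.parse and be passed straight to SeqIO.write, which accepts an iterable of records.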

Similarly, the optional alphabet argument of the Bio.SeqIO.convert function has been replaced with an optional molecule type argument:

# Old style
from Bio.Alphabet import generic_dna
from Bio import SeqIO

SeqIO.convert("example.fasta", "fasta", "example.xml", "seqxml", generic_dna)

# New style
from Bio import SeqIO

SeqIO.convert("example.fasta", "fasta", "example.xml", "seqxml", "DNA")

Here is one way to write a backward compatible version:

# Compatible with both pre- and post Biopython 1.78:
try:
    from Bio.Alphabet import generic_dna
except ImportError:
    generic_dna = "DNA"
from Bio import SeqIO

SeqIO.convert("example.fasta", "fasta", "example.xml", "seqxml", generic_dna)

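Other changes

Code that used to take an alphabet to specify the list of expected letters or symbols now typically takes the valid characters as a string (for example in Bio.motifs).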
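
The same idea, valid characters held in a plain string, can be used in your own code to check the letters of a sequence, which the old alphabets never enforced. This is only an illustrative sketch. The function name unexpected_letters and the default "ACGT" are assumptions, not part of Biopython. The input is passed through str(), so a string or a Seq object should both work.

```python
def unexpected_letters(sequence, valid="ACGT"):
    """Return the sorted letters of sequence that are not in valid."""
    return sorted(set(str(sequence)) - set(valid))


print(unexpected_letters("ACGTTT"))    # []
print(unexpected_letters("ACGTNN-T"))  # ['-', 'N']
```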