食譜

表示來自 ACE 檔案中存檔的重疊群比對。

問題

有時能夠將作為基因組或 EST 組裝一部分產生的重疊群表示為比對 (例如，在混合樣本的序列中搜尋潛在的 SNP，或能夠以更易於檢視的方式寫出重疊群) 會很有用。對於使用 ACE 檔案格式的組裝，我們可以利用 Biopython 的 ACE 處理功能，將構成重疊群的讀取序列添加到通用比對中。

解決方案

讓我們用 Biopython 測試框架中使用的 ACE 檔案中的重疊群來表示：Tests/Ace/contig1.ace 作為範例

from Bio.Sequencing import Ace
from Bio.Align import MultipleSeqAlignment


def cut_ends(read, start, end):
    """Replace residues on either end of a sequence with gaps.

    In this case we want to cut out the sections of each read which the assembler has
    decided are not good enough to include in the contig and replace them with gaps
    so the other information about positions in the read is maintained
    """
    return (start - 1) * "-" + read[start - 1 : end] + (len(read) - end) * "-"


def pad_read(read, start, conlength):
    """Pad out either end of a read so it fits into an alignment.

    The start argument is the position of the first base of the reads sequence in
    the contig it is part of. If the start value is negative (or 0 since ACE
    files count from 1, not 0) we need to take some sequence off the start
    otherwise each end is padded to the length of the consensus with gaps.
    """
    if start < 1:
        seq = read[-1 * start + 1 :]
    else:
        seq = (start - 1) * "-" + read
    seq = seq + (conlength - len(seq)) * "-"
    return seq


# We will use the Ace parser to read individual contigs from file. Be aware
# that using this iterator can miss WA, CT, RT and WR tags (which can be
# anywhere in the file, e.g. the end). Read the file specification here:
# http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt
# If you need these tags you'll need to use  Ace.read() (and lots of RAM).

ace_gen = Ace.parse(open("contig1.ace", "r"))
contig = next(ace_gen)  # Looking only at first contig
align = MultipleSeqAlignment([])

# Now we have started our alignment we can add sequences to it,
# we will loop through contig's reads and get quality clipping from
# .reads[readnumber].qa and the position of each read in the contig
# .af[readnumber].padded_start and use the functions above to cut and
# pad the sequences before they are added

for readn in range(len(contig.reads)):
    clipst = contig.reads[readn].qa.qual_clipping_start
    clipe = contig.reads[readn].qa.qual_clipping_end
    start = contig.af[readn].padded_start
    seq = cut_ends(contig.reads[readn].rd.sequence, clipst, clipe)
    seq = pad_read(seq, start, len(contig.sequence))
    align.add_sequence("read%i" % (readn + 1), seq)

當你印出比對或其中的序列時

>>> print(align)
Alignment with 2 rows and 856 columns
--------------------------------------------...--- read1
------GGATTGCCCTagtaacGGCGAGTGAAGCGGCAACAGCT...--- read2

>>> for read in align:
...     print(read.seq[80:159])
...
tt*gtagagggaTGCTTCTGGGTAGCGACCGGTCTAAGTTCCTCGGAACAGGACGTCATAGAGGGTGAGAATCCCGTAT
TTTGTAGAGG*ATGCTTCTGGGTAGCGACCGGTCTAAGTTCCTCGGAACAGGACGTCATAGAGGGTGAGAATCCCGTAT

你現在也可以使用 AlignIO 將比對寫入任何你想要的格式。

討論

詳細資訊在上面的註解中給出，概括來說，ACE 重疊群是用 Ace.parse() 讀取的，然後啟動通用比對，接著將重疊群的讀取序列添加到新的比對中。