Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/11685716/

Date: 2020-09-18 02:53:26  Source: igfitidea

How to extract chains from a PDB file?

Tags: python, bash, bioinformatics, biopython

Asked by user1545114

I would like to extract chains from PDB files. I have a file named pdb.txt which contains PDB IDs as shown below. The first four characters are the PDB ID and the last character is the chain ID.

1B68A 
1BZ4B
4FUTA

I would like to 1) read the file line by line, 2) download the atomic coordinates of each chain from the corresponding PDB files, and 3) save the output to a folder.

I used the following script to extract chains. But this code prints only A chains from pdb files.

for i in 1B68 1BZ4 4FUT
do 
wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="$i -O $i.pdb
grep  ATOM $i.pdb | grep 'A' > $i\_A.pdb
done

Answered by David Cain

The following BioPython code should suit your needs well.

It uses PDB.Select to select only the desired chains (in your case, one chain) and PDBIO() to write a structure containing just those chains.

import os
from Bio import PDB


class ChainSplitter:
    def __init__(self, out_dir=None):
        """ Create parsing and writing objects, specify output directory. """
        self.parser = PDB.PDBParser()
        self.writer = PDB.PDBIO()
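        if out_dir is None:
            out_dir = os.path.join(os.getcwd(), "chain_PDBs")
        # Create the output directory if needed; PDBIO.save() will not create it
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        self.out_dir = out_dir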

    def make_pdb(self, pdb_path, chain_letters, overwrite=False, struct=None):
        """ Create a new PDB file containing only the specified chains.

        Returns the path to the created file.

        :param pdb_path: full path to the crystal structure
        :param chain_letters: iterable of chain characters (case insensitive)
        :param overwrite: write over the output file if it exists
        """
        chain_letters = [chain.upper() for chain in chain_letters]

        # Input/output files
        (pdb_dir, pdb_fn) = os.path.split(pdb_path)
        pdb_id = pdb_fn[3:7]
        out_name = "pdb%s_%s.ent" % (pdb_id, "".join(chain_letters))
        out_path = os.path.join(self.out_dir, out_name)
        print("OUT PATH:", out_path)
        plural = "s" if (len(chain_letters) > 1) else ""  # for printing

        # Skip PDB generation if the file already exists
        if (not overwrite) and (os.path.isfile(out_path)):
            print("Chain%s %s of '%s' already extracted to '%s'." %
                    (plural, ", ".join(chain_letters), pdb_id, out_name))
            return out_path

        print("Extracting chain%s %s from %s..." % (plural,
                ", ".join(chain_letters), pdb_fn))

        # Get structure, write new file with only given chains
        if struct is None:
            struct = self.parser.get_structure(pdb_id, pdb_path)
        self.writer.set_structure(struct)
        self.writer.save(out_path, select=SelectChains(chain_letters))

        return out_path


class SelectChains(PDB.Select):
    """ Only accept the specified chains when saving. """
    def __init__(self, chain_letters):
        self.chain_letters = chain_letters

    def accept_chain(self, chain):
        return (chain.get_id() in self.chain_letters)


if __name__ == "__main__":
    """ Parses PDB id's desired chains, and creates new PDB structures. """
    import sys
    if len(sys.argv) != 2:
        print("Usage: $ python %s 'pdb.txt'" % __file__)
        sys.exit()

    pdb_textfn = sys.argv[1]

    pdbList = PDB.PDBList()
    splitter = ChainSplitter("/home/steve/chain_pdbs")  # Change me.

    with open(pdb_textfn) as pdb_textfile:
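        for line in pdb_textfile:
            line = line.strip()
            if len(line) < 5:
                continue  # skip blank or malformed lines
            pdb_id = line[:4].lower()
            chain = line[4]
            # Ask for the legacy PDB format explicitly: newer Biopython
            # releases default to mmCIF, which PDBParser cannot read.
            pdb_fn = pdbList.retrieve_pdb_file(pdb_id, file_format="pdb")
            splitter.make_pdb(pdb_fn, chain)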
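
The driver loop above handles one ID per line. If the list names several chains of the same structure, it may be convenient to group them first so that each structure is fetched and parsed once. The following is only a minimal sketch of that idea: `parse_entries` is a hypothetical helper (not part of the answer's code or of Biopython), and it assumes every entry is a 4-character PDB ID followed by a 1-character chain ID.

```python
def parse_entries(lines):
    """Group chain IDs by PDB ID from lines such as '1B68A'.

    Hypothetical helper for illustration. Assumes each non-blank entry is
    a 4-character PDB ID followed by a single chain character.
    """
    chains_by_id = {}
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        if len(entry) != 5:
            raise ValueError("expected 5 characters, got %r" % entry)
        pdb_id = entry[:4].lower()  # lower-cased, as in the loop above
        chain = entry[4]
        chains = chains_by_id.setdefault(pdb_id, [])
        if chain not in chains:
            chains.append(chain)
    return chains_by_id


if __name__ == "__main__":
    sample = ["1B68A \n", "1BZ4B\n", "4FUTA\n", "\n", "1B68B\n"]
    for pdb_id, chains in parse_entries(sample).items():
        print(pdb_id, "".join(chains))
```

Each list of chains could then be passed as `chain_letters` to `make_pdb`, which accepts an iterable of chain characters. Note that this writes all listed chains of a structure into a single output file, not one file per chain.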


One final note: don't write your own parser for PDB files. The format specification is ugly (really ugly), and the number of faulty PDB files out there is staggering. Use a tool like BioPython that will handle parsing for you!

Furthermore, instead of using wget, you should use tools that interact with the PDB database for you. They take into account FTP connection limits, the changing nature of the PDB database, and more. I should know - I updated Bio.PDBList to account for changes in the database. =)

Answered by Carlos

It is probably a little late to answer this question, but I will give my opinion. Biopython has some really handy features that would help you achieve such a thing easily. You could write a custom selection class and then, in a for loop over the chains you want, use it to save each chain from the original PDB file.

    from Bio.PDB import Select, PDBIO
    from Bio.PDB.PDBParser import PDBParser

    class ChainSelect(Select):
        def __init__(self, chain):
            self.chain = chain

        def accept_chain(self, chain):
            if chain.get_id() == self.chain:
                return 1
            else:          
                return 0

    chains = ['A','B','C']
    pdb_file = 'input.pdb'  # placeholder: path to your own PDB file
    p = PDBParser(PERMISSIVE=1)
    structure = p.get_structure(pdb_file, pdb_file)

    for chain in chains:
        pdb_chain_file = 'pdb_file_chain_{}.pdb'.format(chain)                                 
        io_w_no_h = PDBIO()               
        io_w_no_h.set_structure(structure)
        io_w_no_h.save(pdb_chain_file, ChainSelect(chain))

Answered by Theodros Zelleke

Let's say you have the following file, pdb_structures:

1B68A 
1BZ4B
4FUTA

Then put your code in load_pdb.sh:

while read name
do
    chain=${name:4:1}
    name=${name:0:4}
    wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="$name -O $name.pdb
    # Keep ATOM records whose chain identifier matches; the PDB format stores it in column 22
    awk -v chain=$chain '$0~/^ATOM/ && substr($0,22,1)==chain {print}' $name.pdb > $name\_$chain.pdb
    # rm $name.pdb
done

Uncomment the rm line if you don't need the original PDB files.
Execute:

cat pdb_structures | ./load_pdb.sh
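
For comparison, the column-based filter that the awk command performs can be sketched in plain Python. This is an illustration, not a replacement for a real parser: `extract_chain` is a hypothetical helper, and it assumes well-formed fixed-width records in which the chain identifier sits in column 22 of each ATOM line. Malformed files, which the first answer warns about, are not handled.

```python
def extract_chain(lines, chain_id):
    """Yield ATOM records whose chain identifier matches chain_id.

    Hypothetical helper for illustration. Assumes fixed-width PDB records
    with the chain identifier in column 22.
    """
    for line in lines:
        # Column 22 is index 21 when counting from zero.
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
            yield line


if __name__ == "__main__":
    # Made-up records for demonstration only.
    sample = [
        "HEADER    EXAMPLE\n",
        "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N\n",
        "ATOM      2  CA  MET B   1      26.266  25.413   2.842  1.00 10.38           C\n",
        "HETATM    3  O   HOH A 101      10.000  10.000  10.000  1.00 20.00           O\n",
    ]
    for record in extract_chain(sample, "A"):
        print(record, end="")
```

Like the awk version, this keeps only ATOM records, so HETATM entries such as waters and ligands are dropped even when they belong to the requested chain.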