Simple dictionary in C++

Question

Asked by y2k

Moving some code from Python to C++.

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }

Thinking maps might be overkill? What would you use?

Answer 1

Accepted answer by jogojapan

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works because you are dealing with two symmetric pairs. The conditional tells the A/T pair apart from the G/C pair: in ASCII, 'G' and 'C' both have the second-least-significant bit set, while 'A' and 'T' both have it clear. The remaining arithmetic performs the symmetric mapping. It relies on the fact that a = (a + b) - b is true for any a, b, so subtracting the input from the sum of a pair yields the other member of that pair.
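
To make the constants concrete: 'A' (0x41) plus 'T' (0x54) is 0x95, and 'G' (0x47) plus 'C' (0x43) is 0x8a. The sketch below restates the function with those sums spelled out and checks them at compile time. It assumes an ASCII-compatible character set and input that is always one of the four letters; the name complement_base is only an illustrative rename, not part of the original answer.

```cpp
// Hypothetical rename of the function above. Assumes ASCII and that
// the input is always one of 'A', 'T', 'G', 'C'.
constexpr char complement_base(char in)
{
    // Bit 1 (value 2) is set in 'G' and 'C', clear in 'A' and 'T'.
    return static_cast<char>((in & 2) ? 0x8a - in   // 'G' + 'C' == 0x8a
                                      : 0x95 - in); // 'A' + 'T' == 0x95
}

// Compile-time checks of the facts the trick depends on.
static_assert('A' + 'T' == 0x95, "requires ASCII");
static_assert('G' + 'C' == 0x8a, "requires ASCII");
static_assert(complement_base('A') == 'T', "A pairs with T");
static_assert(complement_base('T') == 'A', "T pairs with A");
static_assert(complement_base('G') == 'C', "G pairs with C");
static_assert(complement_base('C') == 'G', "C pairs with G");
```

Any other input produces a meaningless character, so this form only makes sense when the data has already been validated.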

Answer 2

Answer by StackedCrooked

You can use the following syntax:

#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};

Answer 3

Answer by jogojapan

While using a std::map is fine, and a 256-entry char table would be fine too, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong typing:

// First, we define base pairs. Because regular enums
// pollute the global namespace, I'm using "enum class".
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table.
// Now everything can be so much easier.
// (enum class enumerators must be qualified with BasePair::.)
BasePair Complimentary[4] = {
    BasePair::T, // Complement of A
    BasePair::A, // Complement of T
    BasePair::G, // Complement of C
    BasePair::C, // Complement of G
};

Usage becomes simple:

int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}

If this is too much for you, you can define some helpers to get human-readable ASCII characters and to get the base pair complement, so you're not doing (int) casts all the time:

BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}

It's clean, it's simple, and it's efficient.

Now, suddenly, you don't have a 256-byte table. You're also not storing characters (1 byte each), so if you're writing this to a file, you can write 2 bits per base pair instead of 1 byte (8 bits) per base pair. I had to work with bioinformatics files that stored data as 1 character each. The benefit is that it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Moving, storing, and using it was a nightmare. Of course, 250 MB is being generous when accounting for even worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.
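
The 2-bits-per-base idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the file format the answer refers to: the helper names encode_base, pack_bases and unpack_base are hypothetical, the numbering matches the enum above (A = 0, T = 1, C = 2, G = 3), and the first base of each group of four is placed in the lowest two bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: map a letter to its 2-bit code
// (A = 0, T = 1, C = 2, G = 3, as in the enum above).
inline std::uint8_t encode_base(char c)
{
    switch (c) {
        case 'A': return 0;
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  throw std::invalid_argument("not a base");
    }
}

// Pack four bases into each byte, first base in the lowest two bits.
inline std::vector<std::uint8_t> pack_bases(const std::string& seq)
{
    std::vector<std::uint8_t> out((seq.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const unsigned shift = static_cast<unsigned>(i % 4) * 2;
        out[i / 4] |= static_cast<std::uint8_t>(encode_base(seq[i]) << shift);
    }
    return out;
}

// Read base number i back out of the packed bytes.
inline char unpack_base(const std::vector<std::uint8_t>& packed, std::size_t i)
{
    static const char names[4] = { 'A', 'T', 'C', 'G' };
    const unsigned shift = static_cast<unsigned>(i % 4) * 2;
    return names[(packed[i / 4] >> shift) & 3];
}
```

Because the last byte is padded with zero bits, the sequence length has to be stored separately if the packed data is written to a file.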

Answer 4

Answer by Benjamin Lindley

Until I was really concerned about performance, I would use a function that takes a base and returns its match:

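#include <stdexcept>

char base_pair(char base)
{
    switch(base) {
        case 'T': return 'A';
        case 'A': return 'T';
        case 'G': return 'C';
        case 'C': return 'G';
        default: // handle error; throwing is one possible choice
            throw std::invalid_argument("not a base");
    }
}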

If I were concerned about performance, I would define a base as one fourth of a byte (2 bits). 0 would represent A, 1 would represent G, 2 would represent C, and 3 would represent T. Then I would pack 4 bases into a byte, and to get their pairs, I would simply take the bitwise complement.
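
A minimal sketch of why the complement works with this numbering: A is 00 and T is 11, G is 01 and C is 10, so each base is the bitwise inverse of its pair. Inverting a whole byte therefore converts four packed bases at once. The helper name complement_packed is hypothetical.

```cpp
#include <cstdint>

// Encoding from the paragraph above: A = 0, G = 1, C = 2, T = 3.
// Each base and its pair are 2-bit inverses:
// A (00) <-> T (11), G (01) <-> C (10).
// complement_packed is a hypothetical helper name.
constexpr std::uint8_t complement_packed(std::uint8_t four_bases)
{
    // Inverting all eight bits complements four packed bases at once.
    return static_cast<std::uint8_t>(~four_bases);
}

// 0x1B is 00 01 10 11 in binary, i.e. A, G, C, T reading from the
// high bits down; its complement 0xE4 is 11 10 01 00, i.e. T, C, G, A.
static_assert(complement_packed(0x1B) == 0xE4, "A,G,C,T -> T,C,G,A");
```

The same operation applies unchanged to wider integer types, which lets eight, sixteen, or thirty-two bases be paired per instruction.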

Answer 5

Answer by perreal

A lookup table built from a char array:

char map[256] = { 0 };
map['T'] = 'A'; 
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */

Answer 6

Answer by congusbongus

Here's the map solution:

#include <iostream>
#include <map>

typedef std::map<char, char> BasePairMap;

int main()
{
    BasePairMap m;
    m['A'] = 'T';
    m['T'] = 'A';
    m['C'] = 'G';
    m['G'] = 'C';

    std::cout << "A:" << m['A'] << std::endl;
    std::cout << "T:" << m['T'] << std::endl;
    std::cout << "C:" << m['C'] << std::endl;
    std::cout << "G:" << m['G'] << std::endl;

    return 0;
}

Answer 7

Answer by Kerri Chandler

This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

#include <cstdlib>   // std::abort
#include <iostream>

enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
// Indexed by ASCII code; -1 marks characters that are not bases.
// Stored as int because C++ will not convert -1 to Base implicitly.
static const int base[85] = 
  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
    -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1,  T };

Base
base2 (const char b)
{
  switch (b)
    {
    case 'A': return A;
    case 'C': return C;
    case 'T': return T;
    case 'G': return G;
    default: std::abort ();
    }
}

int
main ()
{
  // Loop over int rather than Base: C++ has no ++ for enums.
  for (int b = A; b <= G; b++)
    {
      std::cout << name[b] << ":" 
                << name[pair[b]] << std::endl;
    }
  // Same output, going through the base[] table.
  for (int b = A; b <= G; b++)
    {
      std::cout << name[base[name[b]]] << ":" 
                << name[pair[base[name[b]]]] << std::endl;
    }
  // Same output, going through the base2() switch.
  for (int b = A; b <= G; b++)
    {
      std::cout << name[base2(name[b])] << ":" 
                << name[pair[base2(name[b])]] << std::endl;
    }
}

base[] is a fast lookup from an ASCII char to a Base (i.e. an int between 0 and 3 inclusive), though it is a bit ugly. A good optimizing compiler should be able to make base2() just as fast, but I'm not sure whether any do.

Answer 8

Answer by Tony Delroy

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

Maybe:

static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
    // use p[1]

;-)
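
For readers wondering why a six-character string is enough: it lists each base immediately followed by its pair, with the entries overlapped, and strchr always returns the first match. A self-contained sketch of that lookup might look like the following; the wrapper name lookup_pair and the choice to return '\0' for non-bases are assumptions made here for illustration.

```cpp
#include <cstring>

// "ATAGCG" overlaps four two-character entries:
//   index 0-1 "AT" (A -> T), index 1-2 "TA" (T -> A),
//   index 3-4 "GC" (G -> C), index 4-5 "CG" (C -> G).
// strchr returns the FIRST occurrence, so the character after the
// match is always the pair of the character searched for.
// lookup_pair is a hypothetical helper name; it returns '\0' for
// anything that is not a base.
inline char lookup_pair(char c)
{
    static const char basepairs[] = "ATAGCG";
    if (c == '\0') return '\0';  // strchr would match the terminator
    const char* p = std::strchr(basepairs, c);
    return p ? p[1] : '\0';
}
```

This is a linear search over at most six characters, so it trades a little speed for a very small table.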
