Simple dictionary in C++

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/15151480/

Time: 2020-08-27 19:05:43  Source: igfitidea

c++ map dictionary

Asked by y2k

Moving some code from Python to C++.

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }

Thinking maps might be overkill? What would you use?

Accepted answer by jogojapan

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works because you are dealing with two symmetric pairs. The conditional tells the A/T pair apart from the G/C pair ('G' and 'C' happen to have the second-least-significant bit in common, while 'A' and 'T' both have it clear). The remaining arithmetic performs the symmetric mapping. It relies on the fact that a = (a + b) - b is true for any a, b: subtracting one member of a pair from the pair's sum leaves the other member.
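
To see where the two constants come from: in ASCII, 'A' + 'T' is 0x41 + 0x54 = 0x95 and 'G' + 'C' is 0x47 + 0x43 = 0x8a. The sketch below spells the sums out instead of hard-coding them. It assumes ASCII and input restricted to the four letters, and the name complement_base is made up for this illustration.

```cpp
// Sketch only: assumes ASCII and that `in` is one of 'A', 'T', 'G', 'C'.
constexpr char complement_base(char in)
{
    // Bit 1 (value 2) is set in 'G' (0x47) and 'C' (0x43)
    // and clear in 'A' (0x41) and 'T' (0x54).
    return static_cast<char>((in & 2) ? ('G' + 'C') - in    // 0x8a - in
                                      : ('A' + 'T') - in);  // 0x95 - in
}

// Compile-time checks of all four mappings.
static_assert(complement_base('A') == 'T', "A pairs with T");
static_assert(complement_base('T') == 'A', "T pairs with A");
static_assert(complement_base('G') == 'C', "G pairs with C");
static_assert(complement_base('C') == 'G', "C pairs with G");
```

Any other input produces a meaningless character, so this only makes sense after the data has been validated.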

Answer by StackedCrooked

You can use the following syntax:

#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};
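
Applied to the base pairs from the question, a lookup might look like the sketch below. It uses find rather than operator[], because operator[] inserts a default entry for a missing key and cannot be called on a const map. The names base_pairs and pair_of are illustrative, and returning '\0' for unknown input is an assumption.

```cpp
#include <map>

// The Python dict from the question as a std::map (C++11 initializer list).
const std::map<char, char> base_pairs = {
    { 'T', 'A' }, { 'A', 'T' }, { 'G', 'C' }, { 'C', 'G' }
};

// Returns the paired base, or '\0' if `base` is not a key in the map.
char pair_of(char base)
{
    auto it = base_pairs.find(base);
    return it != base_pairs.end() ? it->second : '\0';
}
```

For example, pair_of('T') yields 'A', and pair_of('X') yields '\0' without modifying the map.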

Answer by jogojapan

While using a std::map is fine, and a 256-entry char table would be fine too, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong typing:

// First, we define base pairs. Because regular enums
// pollute the global namespace, I'm using "enum class".
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table.
// Now, everything can be so much easier.
// (enum class enumerators must be qualified with BasePair::)
BasePair Complimentary[4] = {
    BasePair::T, // Complement of A
    BasePair::A, // Complement of T
    BasePair::G, // Complement of C
    BasePair::C, // Complement of G
};

Usage becomes simple:

int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}

If this is too much for you, you can define some helpers to get human-readable ASCII characters and to get the base pair complement, so you're not doing (int) casts all the time:

BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}

It's clean, it's simple, and it's efficient.

Now, suddenly, you don't have a 256-byte table. You're also not storing characters (1 byte each), so if you're writing this to a file, you can write 2 bits per base pair instead of 1 byte (8 bits) per base pair. I had to work with bioinformatics files that stored each base pair as 1 character. The benefit was that they were human-readable. The con was that what should have been a 250 MB file ended up taking 1 GB of space. Moving, storing, and using them was a nightmare. Of course, 250 MB is being generous when accounting for even worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.
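
The 2-bits-per-base-pair idea could be sketched roughly as follows. This is not part of the original answer: pack and unpack are hypothetical helper names, and putting the first base in the lowest two bits of each byte is an arbitrary choice made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same enumerator order as in the answer: A = 0, T = 1, C = 2, G = 3.
enum class BasePair { A, T, C, G };

// Pack four bases per byte, first base in the lowest two bits.
std::vector<std::uint8_t> pack(const std::vector<BasePair>& bases)
{
    std::vector<std::uint8_t> out((bases.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < bases.size(); ++i) {
        const unsigned value = static_cast<unsigned>(bases[i]);
        out[i / 4] |= static_cast<std::uint8_t>(value << ((i % 4) * 2));
    }
    return out;
}

// Read base number `i` back out of the packed bytes.
BasePair unpack(const std::vector<std::uint8_t>& packed, std::size_t i)
{
    return static_cast<BasePair>((packed[i / 4] >> ((i % 4) * 2)) & 3u);
}
```

The number of bases has to be stored separately, because the last byte may be only partly used and its padding bits are indistinguishable from A.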

Answer by Benjamin Lindley

Until I was really concerned about performance, I would use a function that takes a base and returns its match:

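#include <stdexcept>

char base_pair(char base)
{
    switch(base) {
        case 'T': return 'A';
        case 'A': return 'T';
        case 'G': return 'C';
        case 'C': return 'G';
        // Throwing is just one possible way to handle the error.
        default: throw std::invalid_argument("not a base");
    }
}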

If I were concerned about performance, I would define a base as one fourth of a byte (2 bits). 0 would represent A, 1 would represent G, 2 would represent C, and 3 would represent T. Then I would pack 4 bases into a byte, and to get their pairs, I would simply take the bitwise complement.
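
With that encoding, A (00) and T (11) are bitwise complements of each other, and so are G (01) and C (10), which is why flipping every bit of a packed byte pairs all four bases at once. A minimal sketch, where pack4 and pair_all are hypothetical names and the bit order within the byte is an assumption:

```cpp
#include <cstdint>

// Encoding from the paragraph above.
constexpr unsigned A = 0, G = 1, C = 2, T = 3;

// Pack four 2-bit bases into one byte; b0 ends up in the lowest two bits.
constexpr std::uint8_t pack4(unsigned b0, unsigned b1, unsigned b2, unsigned b3)
{
    return static_cast<std::uint8_t>(b0 | (b1 << 2) | (b2 << 4) | (b3 << 6));
}

// Pair all four bases at once by flipping every bit of the byte.
constexpr std::uint8_t pair_all(std::uint8_t packed)
{
    return static_cast<std::uint8_t>(~packed);
}

// A G C T pairs to T C G A.
static_assert(pair_all(pack4(A, G, C, T)) == pack4(T, C, G, A),
              "bitwise complement pairs every base");
```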

Answer by perreal

A table built from a char array:

char map[256] = { 0 };
map['T'] = 'A'; 
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */
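
The assignments above have to live inside a function. One way to wrap the table so it is built once and read safely is sketched below, assuming 8-bit char; make_pair_table and pair_of are names invented for this example, and 0 is used to mean "no pair".

```cpp
#include <array>

// Build the 256-entry table; entries for all other characters stay 0.
std::array<char, 256> make_pair_table()
{
    std::array<char, 256> table{};  // value-initialized: all zeros
    table['T'] = 'A';
    table['A'] = 'T';
    table['C'] = 'G';
    table['G'] = 'C';
    return table;
}

// Returns the paired base, or 0 if `base` is not one of A, T, C, G.
char pair_of(char base)
{
    static const std::array<char, 256> table = make_pair_table();
    // Go through unsigned char so a negative char cannot index out of range.
    return table[static_cast<unsigned char>(base)];
}
```

The cast matters on platforms where plain char is signed: without it, a byte above 0x7F would become a negative index.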

Answer by congusbongus

Here's the map solution:

#include <iostream>
#include <map>

typedef std::map<char, char> BasePairMap;

int main()
{
    BasePairMap m;
    m['A'] = 'T';
    m['T'] = 'A';
    m['C'] = 'G';
    m['G'] = 'C';

    std::cout << "A:" << m['A'] << std::endl;
    std::cout << "T:" << m['T'] << std::endl;
    std::cout << "C:" << m['C'] << std::endl;
    std::cout << "G:" << m['G'] << std::endl;

    return 0;
}

Answer by Kerri Chandler

This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

#include <cstdlib>   // std::abort
#include <iostream>

enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
// ASCII code -> Base value; -1 marks characters that are not bases.
// Declared as int because C++ does not implicitly convert -1 to an enum.
static const int base[85] = 
  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
    -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1,  T };

Base
base2 (const char b)
{
  switch (b)
    {
    case 'A': return A;
    case 'C': return C;
    case 'T': return T;
    case 'G': return G;
    default: std::abort ();
    }
}

int
main ()
{
  // The loops use int because C++ does not define ++ for plain enums.
  // Direct lookup through the pair and name arrays.
  for (int b = A; b <= G; b++)
    {
      std::cout << name[b] << ":" 
                << name[pair[b]] << std::endl;
    }
  // Round trip through the base[] table (char -> Base).
  for (int b = A; b <= G; b++)
    {
      std::cout << name[base[name[b]]] << ":" 
                << name[pair[base[name[b]]]] << std::endl;
    }
  // Round trip through the base2() switch (char -> Base).
  for (int b = A; b <= G; b++)
    {
      std::cout << name[base2(name[b])] << ":" 
                << name[pair[base2(name[b])]] << std::endl;
    }
  return 0;
}

base[] is a fast, if slightly ugly, lookup from an ASCII char to a Base (i.e. an int between 0 and 3 inclusive). A good optimizing compiler should be able to handle base2(), but I'm not sure if any do.

Answer by Tony Delroy

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

Maybe:

static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
    // use p[1]
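
The string works because strchr returns the first occurrence: 'A' is found at index 0 (followed by 'T'), 'T' at index 1 (followed by 'A'), 'G' at index 3 (followed by 'C'), and 'C' at index 4 (followed by 'G'). Wrapped into a function it might look like the sketch below; the name pair_of and the '\0' return value for unknown input are assumptions.

```cpp
#include <cstring>

// Returns the paired base, or '\0' if `c` is not one of A, T, G, C.
char pair_of(char c)
{
    static const char basepairs[] = "ATAGCG";
    // strchr also matches the terminating '\0', and reading p[1] there
    // would run past the end of the array, so reject it up front.
    if (c == '\0')
        return '\0';
    const char* p = std::strchr(basepairs, c);
    return p ? p[1] : '\0';
}
```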

;-)
