C++: How to convert std::string to lower case?

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/313970/

Time: 2020-08-27 14:34:34  Source: igfitidea

How to convert std::string to lower case?

Tags: c++, string, c++-standard-library, tolower

Asked by Konrad

I want to convert a std::string to lowercase. I am aware of the function tolower(); however, in the past I have had issues with this function, and it is hardly ideal anyway, as use with a std::string would require iterating over each character.

Is there an alternative which works 100% of the time?

Answered by Stefan Mai

Adapted from Not So Frequently Asked Questions:

#include <algorithm>
#include <cctype>
#include <string>

std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });

You're really not going to get away without iterating through each character. There's no way to know whether the character is lowercase or uppercase otherwise.
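
One detail in the snippet above is easy to miss: the lambda takes its argument as unsigned char. std::tolower is only defined for values representable as unsigned char (or EOF), so a plain char holding a negative value has to be converted first. A minimal sketch that wraps the same std::transform call in a reusable function follows; the name lowercase_copy is an assumption for illustration and is not part of the original answer.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (name chosen for illustration): returns a lowercased
// copy of `s`, converting one byte at a time with std::tolower.
std::string lowercase_copy(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        // Taking the argument as unsigned char avoids passing a negative
        // value to std::tolower, which would be undefined behavior.
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

Calling lowercase_copy("Abc") yields "abc"; the argument is taken by value, so the caller's string is left untouched.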

If you really hate tolower(), here's a specialized ASCII-only alternative that I don't recommend you use:

char asciitolower(char in) {
    if (in <= 'Z' && in >= 'A')
        return in - ('Z' - 'z');
    return in;
}

std::transform(data.begin(), data.end(), data.begin(), asciitolower);

Be aware that tolower() can only do a per-single-byte-character substitution, which is ill-fitting for many scripts, especially if using a multi-byte encoding like UTF-8.
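
To make that limitation concrete, here is a small sketch. It assumes the program runs in the default "C" locale and that the string holds UTF-8 bytes, written as explicit escapes so the example does not depend on the source file encoding. The helper name bytewise_lower is made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lowercases a string byte by byte with std::tolower.
std::string bytewise_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// "ÄB" encoded as UTF-8: U+00C4 is the two bytes 0xC3 0x84, then ASCII 'B'.
const std::string upper_utf8 = "\xC3\x84" "B";

// In the default "C" locale only the ASCII 'B' changes; the two bytes of
// 'Ä' pass through untouched, so the result is "Äb" rather than "äb".
const std::string lowered = bytewise_lower(upper_utf8);
```

With a different global locale and a single-byte encoding the outcome may differ, but for UTF-8 no per-byte mapping can produce the correct two-byte sequence for 'ä'.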

Answered by Rob

Boost provides a string algorithm for this:

#include <boost/algorithm/string.hpp>

std::string str = "HELLO, WORLD!";
boost::algorithm::to_lower(str); // modifies str

Or, for non-in-place:

#include <boost/algorithm/string.hpp>

const std::string str = "HELLO, WORLD!";
const std::string lower_str = boost::algorithm::to_lower_copy(str);

Answered by DevSolar

tl;dr

Use the ICU library. If you don't, your conversion routine will silently break on cases you are probably not even aware exist.

First you have to answer a question: What is the encoding of your std::string? Is it ISO-8859-1? Or perhaps ISO-8859-8? Or Windows Codepage 1252? Does whatever you're using to convert upper-to-lowercase know that? (Or does it fail miserably for characters over 0x7f?)

If you are using UTF-8 (the only sane choice among the 8-bit encodings) with std::string as container, you are already deceiving yourself into believing that you are still in control of things, because you are storing a multibyte character sequence in a container that is not aware of the multibyte concept. Even something as simple as .substr() is a ticking timebomb. (Because splitting a multibyte sequence will result in an invalid (sub-) string.)
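
A short sketch of the .substr() problem, assuming the string holds UTF-8 bytes (written as escapes so the example does not depend on the source file encoding). The variable names are illustrative only.

```cpp
#include <string>

// "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
const std::string e_acute = "\xC3\xA9";

// std::string counts bytes, not characters: size() reports 2 for one letter.
const std::string::size_type byte_count = e_acute.size();

// Taking the "first character" by index cuts the sequence in half and leaves
// a lone lead byte (0xC3), which is not valid UTF-8 on its own.
const std::string broken = e_acute.substr(0, 1);
```

Nothing in std::string reports an error here; the invalid sequence only surfaces later, when something tries to decode or display it.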

And as soon as you try something like std::toupper( 'ß' ), in any encoding, you are in deep trouble. (Because it's simply not possible to do this "right" with the standard library, which can only deliver one result character, not the "SS" needed here.) [1] Another example would be std::tolower( 'I' ), which should yield different results depending on the locale. In Germany, 'i' would be correct; in Turkey, 'ı' (LATIN SMALL LETTER DOTLESS I) is the expected result (which, again, is more than one byte in UTF-8 encoding). Yet another example is the Greek Sigma, uppercase 'Σ', lowercase 'σ'... except at the end of a word, where it is 'ς'.

So, any case conversion that works on a character at a time, or worse, a byte at a time, is broken by design.

Then there is the point that the standard library, for what it is capable of doing, depends on which locales are supported on the machine your software is running on... and what do you do if the locale you need isn't?

So what you are really looking for is a string class that is capable of dealing with all this correctly, and that is not any of the std::basic_string<> variants.

(C++11 note: std::u16string and std::u32string are better, but still not perfect. C++20 brought std::u8string, but all these do is specify the encoding. In many other respects they still remain ignorant of Unicode mechanics, like normalization, collation, ...)

While Boost looks nice, API-wise, Boost.Locale is basically a wrapper around ICU. If Boost is compiled with ICU support... if it isn't, Boost.Locale is limited to the locale support compiled for the standard library.

And believe me, getting Boost to compile with ICU can be a real pain sometimes. (There are no pre-compiled binaries for Windows, so you'd have to supply them together with your application, and that opens a whole new can of worms...)

So personally I would recommend getting full Unicode support straight from the horse's mouth and using the ICU library directly:

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/locid.h>

#include <iostream>

int main()
{
    /*                          "Odysseus" */
    char const * someString = u8"ΟΔΥΣΣΕΥΣ";
    icu::UnicodeString someUString( someString, "UTF-8" );
    // Setting the locale explicitly here for completeness.
    // Usually you would use the user-specified system locale,
    // which *does* make a difference (see ı vs. i above).
    std::cout << someUString.toLower( "el_GR" ) << "\n";
    std::cout << someUString.toUpper( "el_GR" ) << "\n";
    return 0;
}

Compile (with G++ in this example):

g++ -Wall example.cpp -licuuc -licuio

This gives:

οδυσσευς
ΟΔΥΣΣΕΥΣ

Note the Σ<->σ conversion in the middle of the word, and the Σ<->ς conversion at the end of the word. No <algorithm>-based solution can give you that.

[1] In 2017, the Council for German Orthography ruled that "ẞ" U+1E9E LATIN CAPITAL LETTER SHARP S could be used officially, as an option beside the traditional "SS" conversion to avoid ambiguity e.g. in passports (where names are capitalized). My beautiful go-to example, made obsolete by committee decision...

Answered by incises

Using the range-based for loop of C++11, simpler code would be:

#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

int main ()
{
  std::locale loc;
  std::string str="Test String.\n";

  for (auto elem : str)
    std::cout << std::tolower(elem, loc);
}
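
Note that the loop above only prints the lowercased characters; str itself is left unchanged. A minimal sketch of a variant that stores the result, iterating by reference and using the same two-argument std::tolower from <locale>, could look like this. The function name lower_in_place is an assumption for illustration.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercases `str` in place using the locale-aware
// std::tolower overload from <locale>.
void lower_in_place(std::string& str, const std::locale& loc = std::locale())
{
    // Iterate by reference so each assignment writes back into the string.
    for (auto& elem : str)
        elem = std::tolower(elem, loc);
}
```

A default-constructed std::locale is a copy of the current global locale, so the behavior follows whatever std::locale::global was last set to.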

Answered by Patrick Ohly

If the string contains UTF-8 characters outside of the ASCII range, then boost::algorithm::to_lower will not convert those. Better use boost::locale::to_lower when UTF-8 is involved. See http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html

Answered by user2218467

This is a follow-up to Stefan Mai's response: if you'd like to place the result of the conversion in another string, you need to pre-allocate its storage space prior to calling std::transform. Since STL stores transformed characters at the destination iterator (incrementing it at each iteration of the loop), the destination string will not be automatically resized, and you risk memory stomping.

#include <string>
#include <algorithm>
#include <cctype>
#include <iostream>

int main (int argc, char* argv[])
{
  std::string sourceString = "Abc";
  std::string destinationString;

  // Allocate the destination space
  destinationString.resize(sourceString.size());

  // Convert the source string to lower case
  // storing the result in destination string
  std::transform(sourceString.begin(),
                 sourceString.end(),
                 destinationString.begin(),
                 // unsigned char avoids undefined behavior for negative char values
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  // Output the result of the conversion
  std::cout << sourceString
            << " -> "
            << destinationString
            << std::endl;
}
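
A possible alternative to resizing the destination up front is to let the output iterator grow the string. The sketch below uses std::back_inserter, which calls push_back on the destination for every element written; the reserve call is optional and only avoids repeated reallocations. The function name lowered_copy is an assumption for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper: builds a lowercased copy without pre-sizing the
// destination string.
std::string lowered_copy(const std::string& source)
{
    std::string destination;
    destination.reserve(source.size());  // optional: one allocation up front

    // std::back_inserter appends each converted character via push_back,
    // so the destination grows as needed and nothing is overwritten.
    std::transform(source.begin(), source.end(),
                   std::back_inserter(destination),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return destination;
}
```

Either approach avoids writing past the end of the destination; which one reads better is largely a matter of taste.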

Answered by Gilson PJ

Another approach, using a range-based for loop with a reference variable:

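#include <cctype>
#include <iostream>
#include <string>

int main()
{
    std::string test = "Hello World";
    for (auto& c : test)
    {
        // Cast to unsigned char: std::tolower is undefined for negative values.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    std::cout << test << std::endl;
}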

Answered by Atul Rokade

The simplest way to convert a string to lowercase without bothering about the std namespace is as follows:

1: string with/without spaces

#include <algorithm>
#include <cctype>   // ::tolower
#include <iostream>
#include <string>
using namespace std;
int main(){
    string str;
    getline(cin,str);
//------------function to convert string into lowercase---------------
    transform(str.begin(), str.end(), str.begin(), ::tolower);
//--------------------------------------------------------------------
    cout<<str;
    return 0;
}

2: string without spaces

#include <algorithm>
#include <cctype>   // ::tolower
#include <iostream>
#include <string>
using namespace std;
int main(){
    string str;
    cin>>str;
//------------function to convert string into lowercase---------------
    transform(str.begin(), str.end(), str.begin(), ::tolower);
//--------------------------------------------------------------------
    cout<<str;
    return 0;
}

Answered by Etherealone

As far as I can see, Boost libraries are really bad performance-wise. I have tested their unordered_map against the STL's and it was on average 3 times slower (best case 2 times, worst case 10 times). This algorithm also looks too slow.

The difference is so big that I am sure whatever additions you need to make to tolower to make it equal to Boost "for your needs" will be way faster than Boost.

I have done these tests on an Amazon EC2, therefore performance varied during the test but you still get the idea.

./test
Elapsed time: 12365milliseconds
Elapsed time: 1640milliseconds
./test
Elapsed time: 26978milliseconds
Elapsed time: 1646milliseconds
./test
Elapsed time: 6957milliseconds
Elapsed time: 1634milliseconds
./test
Elapsed time: 23177milliseconds
Elapsed time: 2421milliseconds
./test
Elapsed time: 17342milliseconds
Elapsed time: 14132milliseconds
./test
Elapsed time: 7355milliseconds
Elapsed time: 1645milliseconds

-O2 made it like this:

./test
Elapsed time: 3769milliseconds
Elapsed time: 565milliseconds
./test
Elapsed time: 3815milliseconds
Elapsed time: 565milliseconds
./test
Elapsed time: 3643milliseconds
Elapsed time: 566milliseconds
./test
Elapsed time: 22018milliseconds
Elapsed time: 566milliseconds
./test
Elapsed time: 3845milliseconds
Elapsed time: 569milliseconds

Source:

string str;
bench.start();
for(long long i=0;i<1000000;i++)
{
    str="DSFZKMdskfdsjfsdfJDASFNSDJFXCKVdnjsafnjsdfjdnjasnJDNASFDJDSFSDNJjdsanjfsdnfjJNFSDJFSD";
    boost::algorithm::to_lower(str);
}
bench.end();

bench.start();
for(long long i=0;i<1000000;i++)
{
    str="DSFZKMdskfdsjfsdfJDASFNSDJFXCKVdnjsafnjsdfjdnjasnJDNASFDJDSFSDNJjdsanjfsdnfjJNFSDJFSD";
    for(unsigned short loop=0;loop < str.size();loop++)
    {
        str[loop]=tolower(str[loop]);
    }
}
bench.end();
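
The bench object is not shown in the answer. As a rough sketch of what such a timer could look like, here is a hypothetical Bench class built on std::chrono::steady_clock. The class name and the elapsed_ms member are assumptions; only the start()/end() calls and the output format come from the listing above.

```cpp
#include <chrono>
#include <iostream>

// Hypothetical stand-in for the `bench` object used above.
class Bench
{
public:
    // Record the starting time point.
    void start() { begin_ = std::chrono::steady_clock::now(); }

    // Compute the elapsed time since start() and print it in the same
    // format as the output shown above.
    void end()
    {
        const auto stop = std::chrono::steady_clock::now();
        elapsed_ms_ = std::chrono::duration_cast<std::chrono::milliseconds>(
                          stop - begin_).count();
        std::cout << "Elapsed time: " << elapsed_ms_ << "milliseconds\n";
    }

    // Last measured duration in milliseconds.
    long long elapsed_ms() const { return elapsed_ms_; }

private:
    std::chrono::steady_clock::time_point begin_{};
    long long elapsed_ms_ = 0;
};
```

steady_clock is used here because it is monotonic, so the measured interval is not affected by system clock adjustments during the run.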

I guess I should do the tests on a dedicated machine, but I will be using this EC2 instance, so I do not really need to test it on my own machine.

Answered by Sameer

std::ctype::tolower() from the standard C++ Localization library will correctly do this for you. Here is an example extracted from the tolower reference page:

#include <locale>
#include <iostream>
#include <string>   // std::wstring

int main () {
  std::locale::global(std::locale("en_US.utf8"));
  std::wcout.imbue(std::locale());
  std::wcout << "In US English UTF-8 locale:\n";
  auto& f = std::use_facet<std::ctype<wchar_t>>(std::locale());
  std::wstring str = L"HELLo, wORLD!";
  std::wcout << "Lowercase form of the string '" << str << "' is ";
  f.tolower(&str[0], &str[0] + str.size());
  std::wcout << "'" << str << "'\n";
}
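
The example above relies on a locale named en_US.utf8 being installed; the std::locale constructor throws std::runtime_error when the name is not recognized. As a hedged sketch of the same range-based std::ctype::tolower call that does not depend on installed locales, the version below uses std::ctype<char> with the classic "C" locale, which only covers ASCII letters. The function name classic_lower is an assumption for illustration.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercases `str` in place with the ctype facet of
// the classic "C" locale (ASCII letters only).
void classic_lower(std::string& str)
{
    if (str.empty())
        return;  // nothing to convert

    const auto& facet =
        std::use_facet<std::ctype<char>>(std::locale::classic());

    // The two-pointer overload converts the whole range [begin, end) in place.
    facet.tolower(&str[0], &str[0] + str.size());
}
```

This trades the wider character coverage of the wchar_t example for portability, so it is no better than plain tolower() for text outside ASCII.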