Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/45815723/

Time: 2020-09-14 04:17:38  Source: igfitidea

Convert column values to lower case only if they are string

Tags: python, string, pandas, dataframe

Asked by user4896331

I'm having real trouble converting a column into lowercase. It's not as simple as just using:

df['my_col'] = df['my_col'].str.lower()

because I'm iterating over a lot of dataframes, and some of them (but not all) have both strings and integers in the column of interest. This causes the lower function, if applied like above, to throw an exception:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Rather than forcing the type to be a string, I'd like to assess whether values are strings and then - if they are - convert them to lowercase, and - if they are not strings - leave them as they are. I thought this would work:

df = df.apply(lambda x: x.lower() if(isinstance(x, str)) else x)

But it doesn't work... probably because I'm overlooking something obvious, but I can't see what it is!

My data looks something like this:

                          OS  Count
0          Microsoft Windows      3
1                   Mac OS X      4
2                      Linux    234
3    Don't have a preference      0
4  I prefer Windows and Unix      3
5                       Unix      2
6                        VMS      1
7         DOS or ZX Spectrum      2

Answered by ysearka

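The test in your lambda function isn't quite right, though you weren't far off. DataFrame.apply passes each column to the lambda as a Series, so isinstance(x, str) is never true. Test the column's dtype instead: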

df.apply(lambda x: x.str.lower() if(x.dtype == 'object') else x)

With the data frame and output:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'OS': 'Microsoft Windows', 'Count': 3},
...         {'OS': 'Mac OS X', 'Count': 4},
...         {'OS': 'Linux', 'Count': 234},
...         {'OS': 'Dont have a preference', 'Count': 0},
...         {'OS': 'I prefer Windows and Unix', 'Count': 3},
...         {'OS': 'Unix', 'Count': 2},
...         {'OS': 'VMS', 'Count': 1},
...         {'OS': 'DOS or ZX Spectrum', 'Count': 2},
...     ]
... )
>>> df = df.apply(lambda x: x.str.lower() if x.dtype=='object' else x)
>>> print(df)
                          OS  Count
0          microsoft windows      3
1                   mac os x      4
2                      linux    234
3     dont have a preference      0
4  i prefer windows and unix      3
5                       unix      2
6                        vms      1
7         dos or zx spectrum      2
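
The dtype test works per column, so in a column that really mixes strings and integers (as described in the question) the integers would come back as NaN. A minimal sketch of an element-wise alternative using Series.map follows. The sample values and the helper name lower_if_str are made up for illustration, not taken from the answer:

```python
import pandas as pd

# Hypothetical mixed column: strings and integers side by side (object dtype).
df = pd.DataFrame({"my_col": ["Microsoft Windows", 3, "Linux", 234]})

# DataFrame.apply passes each column as a Series, never a single string.
seen_types = df.apply(lambda x: type(x).__name__)


def lower_if_str(value):
    """Lowercase strings; return anything else unchanged."""
    return value.lower() if isinstance(value, str) else value


# Series.map calls the function once per element.
df["my_col"] = df["my_col"].map(lower_if_str)
print(df["my_col"].tolist())
```

This runs Python code once per value, so it is likely slower than the .str accessor on large columns, but the non-string values are left untouched.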

Answered by cs95

What is the type of these columns to begin with? object? If so, you should just convert them:

df['my_col'] = df.my_col.astype(str).str.lower()

MVCE:

In [1120]: df
Out[1120]: 
   Col1
0   VIM
1   Foo
2  test
3     1
4     2
5     3
6   4.5
7   OSX

In [1121]: df.astype(str).Col1.str.lower()
Out[1121]: 
0     vim
1     foo
2    test
3       1
4       2
5       3
6     4.5
7     osx
Name: Col1, dtype: object

In [1118]: df.astype(str).Col1.str.lower().dtype
Out[1118]: dtype('O')

If you want to do arithmetic on these rows, you probably shouldn't be mixing strs and numeric types.

However, if that is indeed your case, you may typecast to numeric using pd.to_numeric(..., errors='coerce'):

In [1123]: pd.to_numeric(df.Col1, errors='coerce')
Out[1123]: 
0    NaN
1    NaN
2    NaN
3    1.0
4    2.0
5    3.0
6    4.5
7    NaN
Name: Col1, dtype: float64

You can work with the NaNs, but notice that the dtype is now float64.
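
As a small sketch of what working with the NaNs can look like, assuming a made-up mixed column: reductions such as sum() skip NaN by default, and dropna() keeps only the values that parsed as numbers.

```python
import pandas as pd

# Hypothetical mixed column, similar to the Col1 example above.
col = pd.Series(["VIM", "Foo", 1, 2, 4.5, "OSX"])

# Non-numeric entries become NaN; the result has dtype float64.
nums = pd.to_numeric(col, errors="coerce")

total = nums.sum()            # NaN values are skipped by default
numeric_only = nums.dropna()  # keep only the rows that parsed as numbers
print(total, numeric_only.tolist())
```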

Answered by Narahari B M

Building on the two answers above, I think this is a slightly safer way:

Note the astype(str)

df_lower=df.apply(lambda x: x.astype(str).str.lower() if(x.dtype == 'object') else x)

If your string column happens to hold actual numbers in some rows, calling .str.lower() without astype(str) converts those rows to NaN. Adding astype(str) might be a bit slower, but it won't convert the number-only rows to NaN; they are kept as strings instead.
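
A minimal sketch of the difference, using a made-up three-row column:

```python
import pandas as pd

# Hypothetical object column where one row holds an actual integer.
col = pd.Series(["Foo", 1, "BAR"], dtype=object)

without_cast = col.str.lower()           # the integer row becomes NaN
with_cast = col.astype(str).str.lower()  # the integer row becomes the string "1"

print(without_cast.tolist())
print(with_cast.tolist())
```

Note the trade-off: with astype(str) the integer is no longer an integer, so this only suits columns that are meant to be text.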

Answered by Felipe S. S. Schneider

This also works and is very readable:

for column in df.select_dtypes("object").columns:
    df[column] = df[column].str.lower()

A possible drawback might be the for loop over a subset of columns.
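
If the explicit loop is a concern, one possible variant assigns all selected columns in a single statement. This is only a sketch: apply still visits each column internally, so the change is mainly stylistic. The sample frame is made up, and dtype=object is set explicitly so that select_dtypes("object") picks the column up.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df = pd.DataFrame({
    "OS": pd.Series(["Unix", "VMS"], dtype=object),
    "Count": [2, 1],
})

# Select the object columns once, then lowercase them in one assignment.
object_cols = df.select_dtypes("object").columns
df[object_cols] = df[object_cols].apply(lambda s: s.str.lower())
print(df)
```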
