Notice: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/33788913/

Time: 2020-08-19 13:59:50  Source: igfitidea

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

python pandas dataframe

Asked by deadcode

I'm reading a CSV file into a DataFrame in Python 2.7. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged.

Here is what I'm doing:

def remove_whitespace( x ):
    if isinstance( x, basestring ):
        return x.strip()
    else:
        return x

my_data = my_data.applymap( remove_whitespace )

Is there a better or more Pandas-idiomatic way to do this?

Is there a more efficient way (perhaps by doing things column wise)?

I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.

Answered by jakevdp

You could use pandas' Series.str.strip() method to do this quickly for each string-like column:

>>> data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
>>> data
      values
0     ABC   
1        DEF
2      GHI  

>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object

Answered by Warren Weckesser

When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.

For example, here's "data.csv":

In [19]: !cat data.csv
1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   
2.5, eee, fff  ,       ggg, 20 ,     YYY

(The first line ends with three spaces after XXX, while the second line ends at the last Y.)

The following uses pandas.read_csv() to read the file, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)

In [20]: import pandas as pd

In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')

In [22]: df
Out[22]: 
     0    1    2    3   4    5
0  1.5  aaa  bbb  ddd  10  XXX
1  2.5  eee  fff  ggg  20  YYY
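
To try the regular-expression delimiter without a file on disk, something like the following sketch may help. It assumes the same two rows as data.csv above, fed through io.StringIO (my substitution for the file), and it passes the pattern as sep, for which delimiter is an alias in read_csv.

```python
import io

import pandas as pd

# Same rows as data.csv above, held in memory instead of on disk
# (an assumption made so the sketch runs anywhere).
raw = (
    "1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   \n"
    "2.5, eee, fff  ,       ggg, 20 ,     YYY\n"
)

# A regular-expression separator requires the "python" parsing engine.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r" *, *", engine="python")

print(df[1].tolist())  # string columns between commas come back without padding
print(df.dtypes)       # numeric columns are still parsed as numbers
```

One caveat worth checking against the pandas documentation: regular-expression separators are described there as prone to ignoring quoted data, so a quoted field that itself contains a comma may be split.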

Answered by S. Herron

The "data['values'].str.strip()" answer above did not work for me, but I found a simple workaround. I am sure there is a better way to do this. The str.strip() function works on a Series and returns a new one rather than modifying the DataFrame in place. Thus, I took the DataFrame column as a Series, stripped the whitespace, and assigned the result back to the DataFrame. Below is the example code.

import pandas as pd
data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
print('-----')
print(data)

# str.strip() returns a new Series; data itself is unchanged here
data['values'].str.strip()
print('-----')
print(data)

# assign the stripped Series back to the column to update the DataFrame
new = data['values'].str.strip()
data['values'] = new
print('-----')
print(new)

Answered by Blake

Here is a column-wise solution with pandas apply:

import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)

This will convert values in object type columns to strings, so take care with mixed-type columns. For example, if your column holds zip codes with 20001 and ' 21110 ', you will end up with '20001' and '21110'.
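
The zip-code caveat can be reproduced with a small frame. The sketch below repeats strip_obj from above and applies it to a hypothetical frame (the column names zip and count are mine) whose zip column mixes an int, a padded string, and a missing value.

```python
import numpy as np
import pandas as pd

def strip_obj(col):
    # Same logic as the answer: cast object columns to str, strip, restore NaN.
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

# Hypothetical data: an int, a padded string, and a missing value in one column.
df = pd.DataFrame({'zip': pd.Series([20001, ' 21110 ', np.nan], dtype=object),
                   'count': [1, 2, 3]})

out = df.apply(strip_obj, axis=0)
print(out['zip'].tolist())   # the int 20001 is now the string '20001'
print(out['count'].dtype)    # non-object columns pass through unchanged
```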

Answered by Adam Owczarczyk

Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:

data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
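
One thing to watch with this one-liner: on an object column that mixes strings with other values, the .str accessor returns NaN for the non-string elements. The sketch below shows that behavior on a hypothetical frame (the column names code and n are mine) and one possible way around it, an element-wise isinstance check inside object columns. The helper name strip_keep_others is also hypothetical.

```python
import pandas as pd

# Hypothetical frame: "code" mixes a str and an int, so its dtype is object.
df = pd.DataFrame({"code": pd.Series([" 21110 ", 20001], dtype=object),
                   "n": [1, 2]})

# The one-liner: .str.strip() yields NaN for the non-string element.
naive = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

def strip_keep_others(col):
    # Only touch object columns, and within them only actual str values.
    if col.dtype != object:
        return col
    return col.map(lambda v: v.strip() if isinstance(v, str) else v)

safe = df.apply(strip_keep_others)

print(naive["code"].tolist())  # the int has been replaced by NaN
print(safe["code"].tolist())   # the int is still an int
```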

Answered by FunnyChef

I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.

import pandas as pd

def remove_whitespace(x):
    try:
        # remove all whitespace, both inside and around the string
        x = "".join(x.split())
    except (AttributeError, TypeError):
        # x is not a string (e.g. a number or NaN); leave it unchanged
        pass
    return x

# Apply remove_whitespace to a single column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)


# Apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)
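
Note that "".join(x.split()) is not the same operation as x.strip(): it also removes whitespace inside the string. A short plain-Python comparison on a made-up value, in case only the outer padding should go:

```python
s = "  New   York  "

outer_only = s.strip()             # "New   York": interior spacing kept
all_removed = "".join(s.split())   # "NewYork": every whitespace run removed
collapsed = " ".join(s.split())    # "New York": runs collapsed to one space

print(repr(outer_only), repr(all_removed), repr(collapsed))
```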

Answered by Michael Silverstein

We want to:

  1. Apply our function to each element in our dataframe - use applymap.

  2. Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns of mixed datatypes as object (an object column may contain int and/or str).

  3. Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).

Therefore, I've found the following to be the easiest:

df.applymap(lambda x: x.strip() if type(x)==str else x)
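
In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which does the same element-wise job. Below is a minimal sketch that works with either name. It uses isinstance rather than type(x)==str so that str subclasses are stripped too (a small substitution of mine), and the frame and the strip_cell helper are hypothetical.

```python
import pandas as pd

def strip_cell(x):
    # Strip only strings; every other value is returned untouched.
    return x.strip() if isinstance(x, str) else x

# Hypothetical frame with a mixed-type column and a numeric column.
df = pd.DataFrame({"a": pd.Series(["  x ", 1], dtype=object),
                   "b": [1.5, 2.5]})

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas.
elementwise = df.map if hasattr(df, "map") else df.applymap
stripped = elementwise(strip_cell)

print(stripped["a"].tolist())
```

Either method returns a new DataFrame, so the result has to be assigned (here to stripped) for the change to be kept.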

Answered by Saul Frank

This worked for me - applies it to the whole dataframe:

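import pandas as pd

def panda_strip(x):
    # strip each string element of the column; leave other values as they are
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    # keep the original index so rows still line up after apply
    return pd.Series(r, index=x.index)

df = df.apply(panda_strip)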