Combine two columns of text in a dataframe in pandas/python

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/19377969/

Time: 2020-08-19 13:36:31  Source: igfitidea

Combine two columns of text in dataframe in pandas/python

python, pandas, numpy, dataframe

Asked by user2866103

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter = q2 into 2000q2.

Can anyone help with that?

Answer by silvado

If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns is not string-typed, convert it (them) first:

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this! A missing value in either column makes the concatenated result missing for that row.
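
A minimal sketch of the NaN pitfall, using a made-up two-row frame (the data and the variable names `with_nan` and `filled` are illustrative, not from the original answer). Filling with an empty string is only one possible workaround; whether it is the right fill value depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no quarter.
df = pd.DataFrame({"Year": ["2014", "2015"], "quarter": ["q1", np.nan]})

# String + NaN is NaN, so the second period comes out missing.
with_nan = df["Year"] + df["quarter"]

# Filling the gap first keeps the row, here with an empty string.
filled = df["Year"] + df["quarter"].fillna("")

print(with_nan.isna().tolist())
print(filled.tolist())
```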

If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where "-" is the separator.
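
As a runnable illustration of the agg approach, here is a small sketch with three string columns. The `month` column and the sample values are assumptions made up for the example; the answer itself leaves the extra columns unspecified.

```python
import pandas as pd

# Hypothetical sample data with three string columns.
df = pd.DataFrame({
    "Year": ["2014", "2015"],
    "quarter": ["q1", "q2"],
    "month": ["01", "04"],
})

# Each row is handed to '-'.join, which glues its values together.
df["period"] = df[["Year", "quarter", "month"]].agg("-".join, axis=1)
print(df["period"].tolist())
```

Because str.join only accepts strings, every selected column has to be string-typed already (or converted with astype(str) first).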

Answer by Russ

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

This yields the following dataframe:

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).

You can find more information about the apply() method in the pandas documentation.

Answer by Anton Protopopov

Although the @silvado answer is good, it will be faster if you change df.map(str) to df.astype(str):

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop

Answer by LeoRochael

The cat() method of the .str accessor works really well for this:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() even allows you to add a separator. For example, suppose you only have integers for year and quarter; you can do this:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

Joining multiple columns is just a matter of passing either a list of Series or a dataframe containing all but the first column as a parameter to str.cat(), invoked on the first column (a Series):

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string; otherwise the combined value will be NaN for every row that contains a null.
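
The effect of na_rep can be seen in a short sketch. The sample frame and the names `default` and `with_rep` are made up for illustration, and the empty string is just one possible replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no quarter.
df = pd.DataFrame({"Year": ["2014", "2015"], "Quarter": ["q1", np.nan]})

# Without na_rep, a null in any input makes that row's result null.
default = df["Year"].str.cat(df["Quarter"])

# With na_rep, nulls are replaced by the given string before joining.
with_rep = df["Year"].str.cat(df["Quarter"], na_rep="")

print(default.isna().tolist())
print(with_rep.tolist())
```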

Answer by Bill Gale

This time, a lambda function is used with str.format().

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print(df)
# Look up each row's values by column label and format them into one string.
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x['Year'], x['Quarter']), axis=1)
print(df)

   Year Quarter
0  2014      q1
1  2015      q2
   Year Quarter YearQuarter
0  2014      q1      2014q1
1  2015      q2      2015q2

This allows you to work with non-strings and reformat values as needed.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print(df.dtypes)
print(df)

# The format string inserts the literal 'q' between the year and the integer quarter.
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x['Year'], x['Quarter']), axis=1)
print(df)

Year       object
Quarter     int64
dtype: object
   Year Quarter
0  2014       1
1  2015       2
   Year Quarter YearQuarter
0  2014       1      2014q1
1  2015       2      2015q2

Answer by MaxU

Small data sets (< 150 rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150 rows)

df['Year'].astype(str) + df['quarter']


UPDATE: Timing graph, Pandas 0.23.4

[Figure: timing graph]

Let's test it on a 200K-row DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
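
To rerun a comparison like this outside IPython, something along these lines should work. It is only a sketch: the frame size, the selection of candidates, and the names `base`, `candidates` and `results` are choices made for this example, and the numbers it prints will differ from the ones above depending on hardware and pandas version.

```python
import timeit

import pandas as pd

# Same toy frame as above, repeated; the size is an arbitrary choice.
base = pd.DataFrame({"Year": ["2014", "2015"], "quarter": ["q1", "q2"]})
df = pd.concat([base] * 10**4, ignore_index=True)

candidates = {
    "astype + plus": lambda: df["Year"].astype(str) + df["quarter"],
    "map + plus": lambda: df["Year"].map(str) + df["quarter"],
    "str.cat": lambda: df["Year"].str.cat(df["quarter"]),
}

# Compute each result once so the outputs can be compared before timing.
results = {name: func() for name, func in candidates.items()}

for name, func in candidates.items():
    # Best of 3 repeats, 3 calls each, reported as time per call.
    seconds = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{name}: {seconds * 1e3:.1f} ms per loop")
```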

Answer contribution by @anton-vbr

Answer by Pedro M Duarte

Here is an implementation that I find very versatile:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog

Answer by BMW

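from functools import reduce

import numpy as np


def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    # Convert every array to a NumPy string dtype without mutating the input.
    arrays = [np.asarray(arr).astype(str) for arr in x]
    # np.char.add is the public name for np.core.defchararray.add.
    return reduce(np.char.add, arrays)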

For example:

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

   Year quarter  period
0  2000      q1  2000q1
1  2000      q2  2000q2
2  2000      q3  2000q3
3  2000      q4  2000q4

Answer by VickyK

Once your data is in a dataframe, this command should solve your problem:

df['period'] = df[['Year', 'quarter']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

Answer by Ted Petrou

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
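
A variant that stays entirely in NumPy might look like the sketch below. This is not the exact expression timed above: it converts both columns to NumPy string arrays and joins them with np.char.add, and the sample frame and the names `year` and `quarter` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with an integer Year column.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Pull out the underlying arrays and convert both to a NumPy string dtype.
year = df["Year"].to_numpy().astype(str)
quarter = df["quarter"].to_numpy().astype(str)

# Element-wise string concatenation on the NumPy arrays.
df["period"] = np.char.add(year, quarter)
print(df["period"].tolist())
```

Note that converting with astype(str) at the NumPy level turns missing values into ordinary strings, so this is best suited to columns without nulls.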