Combine two columns of text in dataframe in pandas/python
Original question: http://stackoverflow.com/questions/19377969/
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Asked by user2866103
I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter = q2 into 2000q2.
Can anyone help with that?
Answer by silvado
If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"]
If one (or both) of the columns is not string typed, convert it (them) first:
df["period"] = df["Year"].astype(str) + df["quarter"]
Beware of NaNs when doing this!
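To illustrate the NaN warning, here is a minimal sketch. The sample data and the variable names naive and filled are made up for illustration. A missing value in either column makes the concatenated result missing for that row, and filling with an empty string first is one possible workaround.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no quarter.
df = pd.DataFrame({"Year": [2000, 2001, 2002],
                   "quarter": ["q1", np.nan, "q3"]})

# Plain concatenation: the row with a missing quarter becomes NaN.
naive = df["Year"].astype(str) + df["quarter"]

# One possible workaround: replace missing values with "" before adding.
filled = df["Year"].astype(str) + df["quarter"].fillna("")

print(naive.isna().tolist())
print(filled.tolist())
```

Whether an empty string is the right replacement depends on your data; dropping or flagging such rows may be more appropriate.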
If you need to join multiple string columns, you can use agg:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
Where "-" is the separator.
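A small runnable sketch of the agg approach with three concrete columns. The region column and its values are assumptions added for illustration. The astype(str) call is there because str.join only accepts strings.

```python
import pandas as pd

# Hypothetical data; "region" is an extra column added for the example.
df = pd.DataFrame({"Year": [2000, 2001],
                   "quarter": ["q1", "q2"],
                   "region": ["EU", "US"]})

cols = ["Year", "quarter", "region"]

# Convert to strings first, then join each row's values with "-".
df["period"] = df[cols].astype(str).agg("-".join, axis=1)

print(df["period"].tolist())
```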
Answer by Russ
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
This yields the following dataframe:
Year quarter period
0 2014 q1 2014q1
1 2015 q2 2015q2
This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
For more information about the apply() method, see the pandas documentation.
Answer by Anton Protopopov
Although @silvado's answer is good, it will be faster if you change df.map(str) to df.astype(str):
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop
In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop
Answer by LeoRochael
The cat() method of the .str accessor works really well for this:
>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"],
... ["2015", "q3"]],
... columns=('Year', 'Quarter'))
>>> print(df)
Year Quarter
0 2014 q1
1 2015 q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
Year Quarter Period
0 2014 q1 2014q1
1 2015 q3 2015q3
cat() even allows you to add a separator. For example, if you only have integers for year and quarter, you can do this:
>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
... [2015, 3]],
... columns=('Year', 'Quarter'))
>>> print(df)
Year Quarter
0 2014 1
1 2015 3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
Year Quarter Period
0 2014 1 2014q1
1 2015 3 2015q3
Joining multiple columns is just a matter of passing either a list of Series or a DataFrame containing all but the first column as a parameter to str.cat(), invoked on the first column (a Series):
>>> df = pd.DataFrame(
... [['USA', 'Nevada', 'Las Vegas'],
... ['Brazil', 'Pernambuco', 'Recife']],
... columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
   Country       State       City                   AllTogether
0      USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1   Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife
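The example above passes a DataFrame to str.cat(). As a sketch of the other option mentioned, a list of Series, the same join can be written like this. The variable name combined is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["USA", "Nevada", "Las Vegas"],
     ["Brazil", "Pernambuco", "Recife"]],
    columns=["Country", "State", "City"],
)

# Pass the remaining columns as a list of Series instead of a DataFrame.
combined = df["Country"].str.cat([df["State"], df["City"]], sep=" - ")

print(combined.tolist())
```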
Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string; otherwise the combined value will be NaN for every row that contains a null.
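A minimal sketch of the na_rep behaviour, using made-up data with one missing quarter. The names without_rep and with_rep are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing Quarter.
df = pd.DataFrame({"Year": ["2014", "2015", "2016"],
                   "Quarter": ["q1", np.nan, "q3"]})

# Without na_rep, a missing input makes the whole combined value missing.
without_rep = df["Year"].str.cat(df["Quarter"])

# With na_rep, the missing value is replaced by the given string first.
with_rep = df["Year"].str.cat(df["Quarter"], na_rep="")

print(without_rep.isna().tolist())
print(with_rep.tolist())
```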
Answer by Bill Gale
This time, use a lambda function with str.format().
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print(df)
# x is one row (a Series); look values up by column label
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x['Year'], x['Quarter']), axis=1)
print(df)
Quarter Year
0 q1 2014
1 q2 2015
Quarter Year YearQuarter
0 q1 2014 2014q1
1 q2 2015 2015q2
This allows you to work with non-strings and reformat values as needed.
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print(df.dtypes)
print(df)
# format() converts the integer quarter to text, so no astype(str) is needed
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x['Year'], x['Quarter']), axis=1)
print(df)
Quarter int64
Year object
dtype: object
Quarter Year
0 1 2014
1 2 2015
Quarter Year YearQuarter
0 1 2014 2014q1
1 2 2015 2015q2
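As a further sketch of what "reformat values as needed" can mean, a format specification can pad the quarter with a leading zero. The zero-padded layout is an assumption chosen for illustration, not something the question asks for.

```python
import pandas as pd

df = pd.DataFrame({"Year": ["2014", "2015"], "Quarter": [1, 2]})

# {:02d} pads the integer quarter to two digits; int() guards against
# NumPy integer scalars that may appear in the row.
df["YearQuarter"] = df[["Year", "Quarter"]].apply(
    lambda row: "{}q{:02d}".format(row["Year"], int(row["Quarter"])),
    axis=1,
)

print(df["YearQuarter"].tolist())
```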
Answer by MaxU
Small data sets (< 150 rows):
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
Or, slightly slower but more compact:
df.Year.str.cat(df.quarter)
Larger data sets (> 150 rows):
df['Year'].astype(str) + df['quarter']
UPDATE: timing graph for Pandas 0.23.4 (graph not reproduced here)
Let's test it on a 200K-row DataFrame:
In [250]: df
Out[250]:
Year quarter
0 2014 q1
1 2015 q2
In [251]: df = pd.concat([df] * 10**5)
In [252]: df.shape
Out[252]: (200000, 2)
UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop
In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop
In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop
In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop
In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop
In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop
Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop
In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop
In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop
In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
Answer contribution by @anton-vbr
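The timings above depend on the pandas version and the machine, so it may be worth re-measuring locally. The following is a rough sketch using the standard timeit module outside IPython. The row count, the number of runs and the names candidates, results and timings are arbitrary choices for illustration.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})
df = pd.concat([df] * 1000, ignore_index=True)  # 2000 rows; adjust as needed

# Three of the approaches compared in this answer.
candidates = {
    "astype(str) + column": lambda: df["Year"].astype(str) + df["quarter"],
    "map(str) + column": lambda: df["Year"].map(str) + df["quarter"],
    "str.cat": lambda: df["Year"].astype(str).str.cat(df["quarter"]),
}

# Check that every approach produces the same values before timing them.
results = {name: fn().tolist() for name, fn in candidates.items()}

# Time each approach; number=5 keeps the run short.
timings = {name: timeit.timeit(fn, number=5) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s for 5 runs")
```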
Answer by Pedro M Duarte
Here is an implementation that I find very versatile:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
...: [1, 'fox', 'jumps', 'over'],
...: [2, 'the', 'lazy', 'dog']],
...: columns=['c0', 'c1', 'c2', 'c3'])
In [3]: def str_join(df, sep, *cols):
...: from functools import reduce
...: return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
...: [df[col] for col in cols])
...:
In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')
In [5]: df
Out[5]:
c0 c1 c2 c3 cat
0 0 the quick brown 0-the-quick-brown
1 1 fox jumps over 1-fox-jumps-over
2 2 the lazy dog 2-the-lazy-dog
Answer by BMW
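from functools import reduce
import numpy as np

def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.
    Args:
        x: iterable of np.array.
    Returns: np.array.
    """
    # Cast every array to a NumPy string dtype so np.char.add accepts it
    arrays = [np.asarray(arr).astype(str) for arr in x]
    # np.char.add is the public name for np.core.defchararray.add
    return reduce(np.char.add, arrays)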
For example:
data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])
df
Year quarter period
0 2000 q1 2000q1
1 2000 q2 2000q2
2 2000 q3 2000q3
3 2000 q4 2000q4
Answer by VickyK
Since your data is already in a dataframe, this command should solve your problem:
df['period'] = df[['Year', 'quarter']].apply(lambda x: ' '.join(x.astype(str)), axis=1)
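Note that the command above joins with a single space, so it produces values like "2000 q2". A short sketch of the difference, with made-up data and the hypothetical names spaced and joined:

```python
import pandas as pd

df = pd.DataFrame({"Year": [2000, 2000], "quarter": ["q1", "q2"]})

# Joining with " " keeps a space between the two parts.
spaced = df[["Year", "quarter"]].apply(
    lambda x: " ".join(x.astype(str)), axis=1)

# Joining with "" gives the 2000q2 form asked for in the question.
joined = df[["Year", "quarter"]].apply(
    lambda x: "".join(x.astype(str)), axis=1)

print(spaced.tolist())
print(joined.tolist())
```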
Answer by Ted Petrou
As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.
%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
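A related sketch that keeps both operands as NumPy arrays and concatenates them with np.char.add. This is a variant of the idea above, not the exact expression that was timed, and no speed-up is claimed for it; the sample data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Year": [2000, 2000, 2001],
                   "quarter": ["q1", "q2", "q1"]})

# Convert both columns to NumPy string arrays.
year = df["Year"].to_numpy().astype(str)
quarter = df["quarter"].to_numpy().astype(str)

# Element-wise string concatenation on the two arrays.
df["period"] = np.char.add(year, quarter)

print(df["period"].tolist())
```

Missing values would be converted to the text "nan" by astype(str) here, so handle them before converting if your data can contain them.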