Python: Concatenate all columns in a pandas dataframe
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/42786804/
Concatenate all columns in a pandas dataframe
Asked by Clock Slave
I have multiple pandas dataframes which may have different numbers of columns; the number of columns typically varies from 50 to 100. I need to create a final column that is simply all the columns concatenated. Basically, the string in the first row of that column should be the concatenation of the strings in the first row of all the columns. I wrote the loop below, but I feel there might be a better, more efficient way to do this. Any ideas on how to do this?
num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]
Answer by jezrael
Solution with sum, but the output is float, so converting to int and then str is necessary:
df['new'] = df.sum(axis=1).astype(int).astype(str)
Another solution uses apply with join, but it is the slowest:
df['new'] = df.apply(''.join, axis=1)
Last, a very fast numpy solution: convert to a numpy array and then sum:
df['new'] = df.values.sum(axis=1)
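For orientation, a minimal sketch (my addition, not part of the original answer) of what the numpy and apply variants produce on a small all-string frame:
import pandas as pd
# hypothetical all-string frame, same shape as the one used in the timings below
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
# numpy route: summing an object array concatenates the strings element-wise
df['new'] = df[['A', 'B', 'C']].values.sum(axis=1)
# apply route: ''.join over each row (slower, but explicit)
df['new_apply'] = df[['A', 'B', 'C']].apply(''.join, axis=1)
print(df)
#    A  B  C  new new_apply
# 0  1  4  7  147       147
# 1  2  5  8  258       258
# 2  3  6  9  369       369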
Timings:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
cols = list('ABC')
#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop
In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop
In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop
In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop
EDIT: If the dtypes of some columns are not object (i.e. not strings), cast them with DataFrame.astype:
df['new'] = df.astype(str).values.sum(axis=1)
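For illustration, a minimal sketch (my addition, with a made-up integer column) of the cast in action:
import pandas as pd
# hypothetical frame where column A holds integers rather than strings
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
# cast everything to str first, then concatenate row-wise via the numpy array
df['new'] = df.astype(str).values.sum(axis=1)
print(df['new'].tolist())  # ['147', '258', '369']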
Answer by blacksite
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Gives us:
df
Out[6]:
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
To select a given set of columns:
df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')
df
Out[8]:
A B C concat
0 1 4 7 14
1 2 5 8 25
2 3 6 9 36
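The column list can also live in a variable, and str.join accepts any separator; a small sketch of that (my addition, not from the original answer):
cols = ['A', 'C']  # hypothetical subset of columns
df['concat_ac'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('-')
# df['concat_ac'] is now '1-7', '2-8', '3-9'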
However, I've noticed that this approach can sometimes result in NaNs being populated where they shouldn't be, so here's another way:
>>> from functools import reduce
>>> cols = ['A', 'B', 'C']
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Although it should be noted that this approach is a lot slower:
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop
Versus
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop
Answer by bodily11
I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.
For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Here is my modification for Python 3.6.3:
df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
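To make the difference concrete, here is a hedged sketch (my addition) with a made-up numeric column; in most pandas versions .str.join returns NaN for rows whose list contains non-string items, while the map(str, ...) variant coerces everything first:
import pandas as pd
# hypothetical frame where column A is numeric, so each row list mixes int and str
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['4', '5', '6']})
print(pd.Series(df.fillna('').values.tolist()).str.join(''))
# typically all NaN, because the row lists are not pure strings
print(pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str, x))))
# 0    14
# 1    25
# 2    36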
Answer by nick_montpetit
The solutions given above that use numpy arrays have worked great for me.
However, one thing to be careful about is the indexing when you get the numpy.ndarray from df.values, since the axis labels are removed from df.values.
So to take one of the solutions offered above (the one that I use most often) as an example:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
This portion:
df.fillna('').values
does not preserve the indices of the original DataFrame. That is not a problem when the DataFrame has the common 0, 1, 2, ... row indexing scheme, but this solution will not work when the DataFrame is indexed in any other way. You can fix this by adding an index= argument to pd.Series():
df['concat'] = pd.Series(df.fillna('').values.tolist(),
                         index=df.index).str.join('')
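To make the alignment issue concrete, a small sketch (my addition) with a made-up string index; without index= the assignment yields NaN because the labels 0, 1 do not match 'row1', 'row2':
import pandas as pd
# hypothetical frame whose rows are labelled with strings instead of 0, 1, ...
df = pd.DataFrame({'A': ['1', '2'], 'B': ['4', '5']}, index=['row1', 'row2'])
without_index = pd.Series(df.fillna('').values.tolist()).str.join('')
with_index = pd.Series(df.fillna('').values.tolist(), index=df.index).str.join('')
df['bad'] = without_index   # all NaN: labels 0 and 1 do not align with the frame
df['good'] = with_index     # '14', '25'
print(df)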
I always add the index= argument just to be safe, even when I'm sure the DataFrame is row-indexed as 0, 1, 2, ...