Python: Concatenate all columns in a pandas DataFrame

Warning: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/42786804/


Concatenate all columns in a pandas dataframe

Tags: python, pandas

Asked by Clock Slave

I have multiple pandas dataframes which may have different numbers of columns; the number of columns typically varies from 50 to 100. I need to create a final column that is simply all the columns concatenated. Basically, the string in the first row of the new column should be the concatenation of the strings in the first row of all the columns. I wrote the loop below, but I feel there might be a better, more efficient way to do this. Any ideas on how to do this?

num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()

# Start with an empty string, then append each column in turn
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]
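As a side note, the loop assumes every column already holds strings; if some columns are numeric, + performs addition rather than concatenation. A minimal dtype-safe variant of the same loop, using a hypothetical mixed-dtype frame:

import pandas as pd

# Hypothetical frame where column 'B' is numeric
df = pd.DataFrame({'A': ['1', '2'], 'B': [3, 4]})

col_names = df.columns.tolist()
df['merged'] = ""
for col in col_names:
    # astype(str) makes + concatenate instead of doing arithmetic
    df['merged'] = df['merged'] + df[col].astype(str)
print(df['merged'].tolist())  # ['13', '24']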

Answered by jezrael

Solution with sum, but the output is float, so a conversion to int and then to str is necessary:

df['new'] = df.sum(axis=1).astype(int).astype(str)
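With all-string columns, the row-wise sum concatenates the values; on some pandas versions the result for numeric-looking strings can come back as float, hence the int/str round trip. A quick check on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'A': ['1', '2'], 'B': ['4', '5'], 'C': ['7', '8']})
df['new'] = df.sum(axis=1).astype(int).astype(str)
print(df['new'].tolist())  # ['147', '258']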

Another solution with apply and the function join, but it is the slowest:

df['new'] = df.apply(''.join, axis=1)

Last, a very fast numpy solution: convert to a numpy array and then sum:

df['new'] = df.values.sum(axis=1)
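This relies on every column being object (string) dtype, so that summing the underlying object array concatenates the strings. A quick check on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'A': ['1', '2'], 'B': ['4', '5'], 'C': ['7', '8']})

# .values is an object ndarray, so sum uses Python's + (string concatenation)
df['new'] = df.values.sum(axis=1)
print(df['new'].tolist())  # ['147', '258']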

Timings:

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df = pd.concat([df]*10000).reset_index(drop=True)
#[30000 rows x 3 columns]
#print (df)

cols = list('ABC')

#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop

In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop

In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop

In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop

EDIT: If the dtypes of some columns are not object (i.e., strings), cast them with DataFrame.astype:

df['new'] = df.astype(str).values.sum(axis=1)
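For example, with a hypothetical mixed-dtype frame, the cast makes the row-wise sum concatenate rather than add numerically:

import pandas as pd

# 'A' is int64, 'B' is object (strings)
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df['new'] = df.astype(str).values.sum(axis=1)
print(df['new'].tolist())  # ['1x', '2y', '3z']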

Answered by blacksite

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Gives us:

df
Out[6]: 
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

To select a given set of columns:

df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')

df
Out[8]: 
   A  B  C concat
0  1  4  7     14
1  2  5  8     25
2  3  6  9     36

However, I've noticed that this approach can sometimes result in NaNs being populated where they shouldn't be, so here's another way:

>>> from functools import reduce
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

Although it should be noted that this approach is a lot slower:

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop

Versus

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop

Answered by bodily11

I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.

For clarity: LunchBox commented that it failed on Python 3.7.0, and it also failed for me on Python 3.6.3. Here is the original answer by blacksite:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Here is my modification for Python 3.6.3:

df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
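The difference is that ''.join requires every element to already be a string, whereas map(str, x) converts each value first. A quick check on a hypothetical frame with a numeric column, the case where the original .str.join('') version can fail:

import pandas as pd

# 'A' is numeric, so the row lists passed to join contain ints
df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str, x)))
print(df['concat'].tolist())  # ['1x', '2y']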

Answered by nick_montpetit

The solutions given above that use numpy arrays have worked great for me.

However, one thing to be careful about is the indexing when you get the numpy.ndarray from df.values, since the axis labels are removed from df.values.

So to take one of the solutions offered above (the one that I use most often) as an example:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

This portion:

df.fillna('').values

does not preserve the indices of the original DataFrame. Not a problem when the DataFrame has the common 0, 1, 2, ... row indexing scheme, but this solution will not work when the DataFrame is indexed in any other way. You can fix this by adding an index= argument to pd.Series():

df['concat'] = pd.Series(df.fillna('').values.tolist(), 
                         index=df.index).str.join('')
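A quick demonstration of the failure mode and the fix, using a hypothetical frame with a non-default index:

import pandas as pd

df = pd.DataFrame({'A': ['1', '2'], 'B': ['3', '4']}, index=['x', 'y'])
cols = ['A', 'B']

# Without index=, the new Series is indexed 0, 1, so label alignment
# during assignment fills the column with NaN
df['bad'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')

# With index=df.index the labels match and the result lands correctly
df['good'] = pd.Series(df[cols].fillna('').values.tolist(),
                       index=df.index).str.join('')
print(df)
#    A  B  bad good
# x  1  3  NaN   13
# y  2  4  NaN   24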

I always add the index= argument just to be safe, even when I'm sure the DataFrame is row-indexed as 0, 1, 2, ...
