Convert 2D numpy.ndarray to pandas.DataFrame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/24336518/
Asked by y2p
I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is shown in the code below:
from pandas import DataFrame

cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    # .loc replaces .ix, which was removed in pandas 1.0
    id1 = cache1.loc[idx].id1
    for idx2, val in enumerate(i):
        id2 = cache2.loc[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())
I am mapping the indices of the outer array and the inner array to the indices of two DataFrames to get certain IDs.
cache1 and cache2 are pandas.DataFrame objects. Each has ~100k rows.
This takes a really long time, on the order of a few hours to complete. Is there some way I can speed it up?
Answered by CT Zhu
I suspect your ndarr, if expressed as a 2D np.array, always has the shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. Also, the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:
In [30]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(ndarr).ravel(),
                  index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],
                                                   names=['idx1', 'idx2']),
                  columns=['val'])
In [33]:
print(df.reset_index())
       idx1      idx2  val
0   ABC1234   3276827  4.3
1   ABC1234  98567498  5.6
2   ABC1234  38472837  6.7
3  NCMN7838   3276827  3.2
4  NCMN7838  98567498  4.5
5  NCMN7838  38472837  2.1
[6 rows x 3 columns]
Actually, I also think that keeping the MultiIndex may be a better idea.
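As a rough illustration of why keeping the MultiIndex can be convenient, the sketch below rebuilds the frame from the question's sample data and then selects by ID and pivots back to a 2D layout. The variable names rows_for_abc, single, and wide are made up for this example and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
cache1 = pd.DataFrame({"id1": ["ABC1234", "NCMN7838"]})
cache2 = pd.DataFrame({"id2": [3276827, 98567498, 38472837]})
ndarr = np.array([[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]])

# Same construction as in the answer: one row per (id1, id2) pair.
index = pd.MultiIndex.from_product(
    [cache1["id1"].values, cache2["id2"].values], names=["idx1", "idx2"]
)
df = pd.DataFrame({"val": ndarr.ravel()}, index=index)

# Select every value that belongs to one outer ID.
rows_for_abc = df.loc["ABC1234"]

# Look up the value for a single (idx1, idx2) pair.
single = df.loc[("NCMN7838", 98567498), "val"]

# Pivot back to a wide table with one column per idx2.
wide = df["val"].unstack("idx2")
```

Calling df.reset_index() at any point still yields the flat three-column table, so keeping the MultiIndex does not rule out the flat form.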
Answered by DSM
Something like this should work:
import numpy as np
import pandas as pd

ndarr = np.asarray(ndarr)  # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
# Row and column position of every element, flattened in the same order as ravel()
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values
which gives
>>> fast_df
   value       id1       id2
0    4.3   ABC1234   3276827
1    5.6   ABC1234  98567498
2    6.7   ABC1234       NaN
3    3.2  NCMN7838   3276827
4    4.5  NCMN7838  98567498
5    2.1  NCMN7838       NaN
And then if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0].
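For comparison, here is a minimal self-contained sketch of a related vectorized approach that uses np.repeat and np.tile instead of np.indices, and filters with value > 0 to mirror the `if val > 0` check in the question. It assumes the rows of ndarr line up positionally with cache1 and the columns with cache2. The names long_df and positive_df are hypothetical, and one sample value is set to 0 here (unlike the question's data) so the filter has something to drop.

```python
import numpy as np
import pandas as pd

cache1 = pd.DataFrame({"id1": ["ABC1234", "NCMN7838"]})
cache2 = pd.DataFrame({"id2": [3276827, 98567498, 38472837]})
# Assumption for illustration: the 5.6 from the question is replaced by 0.0.
ndarr = np.array([[4.3, 0.0, 6.7], [3.2, 4.5, 2.1]])

n_rows, n_cols = ndarr.shape
long_df = pd.DataFrame({
    # Each id1 is repeated once per column of ndarr.
    "id1": np.repeat(cache1["id1"].to_numpy()[:n_rows], n_cols),
    # The id2 sequence is repeated once per row of ndarr.
    "id2": np.tile(cache2["id2"].to_numpy()[:n_cols], n_rows),
    # ravel() flattens row by row, matching the order of the two ID columns.
    "value": ndarr.ravel(),
})

# Keep only strictly positive values, as the loop in the question does.
positive_df = long_df[long_df["value"] > 0].reset_index(drop=True)
```

Note that `!= 0` and `> 0` differ when the array contains negative values, so which filter to use depends on what the data can hold.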

