Convert a pandas DataFrame to a dict and drop NaNs
Original question: http://stackoverflow.com/questions/26033301/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Asked by der_die_das_jojo
I have a pandas DataFrame with NaNs in it, like this:
import pandas as pd
import numpy as np
raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)
>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN
Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
But pandas' to_dict method gives me a result like this:
>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}
So how can I make a dict out of the DataFrame and get rid of the NaNs?
Answer by Peter Mularien
There are many ways you could accomplish this. I spent some time evaluating performance on a not-so-large (70k) dataframe. Although @der_die_das_jojo's answer is functional, it's also pretty slow.
The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.
On my test dataframe (df):
Above method:
%time [ v.dropna().to_dict() for k,v in df.iterrows() ]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s
Another slow method:
%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='records')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s
Fastest method I could find:
%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s
This output is a row-oriented list of dictionaries, one per row. You may need to make adjustments if you want the column-oriented form from the question.
I'd be very interested if anyone finds an even faster answer to this question.
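To make the difference between the row-oriented and column-oriented shapes concrete, here is a minimal sketch on the small DataFrame from the question. The names `row_oriented` and `column_oriented` are illustrative only, and `orient='records'` is used as the spelled-out name of the row-wise orientation.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}})

# Row-oriented: one small dict per row, with NaN entries filtered out.
# The index labels (1, 2, 3) are not part of this form.
row_oriented = [
    {k: v for k, v in row.items() if pd.notnull(v)}
    for row in data.to_dict(orient='records')
]

# Column-oriented (the shape asked for in the question): one dict per column,
# keyed by the index labels that survive dropna().
column_oriented = {col: data[col].dropna().to_dict() for col in data.columns}

print(row_oriented)
print(column_oriented)
```

Which form is preferable depends on how the result is consumed: the row-oriented list loses the index labels, while the column-oriented dict keeps them as keys.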
Answer by der_die_das_jojo
Write a function inspired by pandas' to_dict:
import pandas as pd
import numpy as np

def to_dict_dropna(data):
    # Drop the NaNs column by column, then convert each column to a dict
    return dict((k, v.dropna().to_dict()) for k, v in data.items())

raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)
result = to_dict_dropna(data)
As a result you get what you want:
>>> result
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
Answer by jezrael
The first graph covers generating dictionaries per column, so the output is a few very long dictionaries; the number of dicts depends on the number of columns.
I tested multiple methods with perfplot. For larger DataFrames, the fastest method is to loop over each column and remove missing values or Nones with Series.dropna, or with Series.notna in boolean indexing.
For smaller DataFrames, the fastest is a dictionary comprehension that tests for missing values with the NaN != NaN trick and also tests for Nones.
import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)

def comp_notnull(df1):
    return {k1: {k: v for k, v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    return {k1: {k: v for k, v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k, v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k, v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
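The NaN != NaN trick used in comp_NaNnotNaN_None relies on NaN being the only float that does not compare equal to itself. A minimal sketch of why the extra None test is needed follows; the sample list `values` is made up for illustration.

```python
import numpy as np
import pandas as pd

values = [1.0, float('nan'), np.nan, None, 'text']

# NaN is not equal to itself, so `v == v` is False only for the NaN entries.
self_equal = [v == v for v in values]

# None compares equal to itself, so the trick needs the extra `is not None` test.
kept_by_trick = [v for v in values if v == v and v is not None]

# pd.notnull treats both NaN and None as missing in a single call.
kept_by_notnull = [v for v in values if pd.notnull(v)]

print(self_equal)
print(kept_by_trick)
print(kept_by_notnull)
```

Both filters keep the same elements here; the comparison-based version simply avoids a function call per value.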
Another situation is generating dictionaries per row, which gives a list of a huge number of small dictionaries. There, the fastest is a list comprehension that filters out NaNs and Nones:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)

def comp_notnull1(df1):
    return [{k: v for k, v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='records')]

def comp_NaNnotNaN_None1(df1):
    return [{k: v for k, v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='records')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k, v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k, v in df1.T.items()]

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
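Because the benchmark runs with equality_check=False, it does not itself verify that the four row-wise variants return the same result. A minimal sketch of such a check on a small frame might look like this; the seed, the frame size, and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame built the same way as make_df, just with a fixed size.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.choice([1, 2, np.nan], size=(20, 5)), columns=list('ABCDE'))

by_notnull = [{k: v for k, v in m.items() if pd.notnull(v)}
              for m in df1.to_dict(orient='records')]
by_self_eq = [{k: v for k, v in m.items() if v == v and v is not None}
              for m in df1.to_dict(orient='records')]
by_dropna = [v.dropna().to_dict() for _, v in df1.T.items()]
by_bool_ix = [v[v.notna()].to_dict() for _, v in df1.T.items()]

# All four approaches should produce the same list of per-row dicts.
assert by_notnull == by_self_eq == by_dropna == by_bool_ix
print(by_notnull[:3])
```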
Answer by kederrac
You can define your own mapping class that gets rid of the NaNs:
class NotNanDict(dict):
    @staticmethod
    def is_nan(v):
        if isinstance(v, dict):
            return False
        return np.isnan(v)

    def __new__(self, a):
        return {k: v for k, v in a if not self.is_nan(v)}

data.to_dict(into=NotNanDict)
Output:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
Timing (using the benchmark from @jezrael's answer): [graph not included]
To boost the speed you can use numba:
from numba import jit

@jit
def dropna(arr):
    # i + 1 rebuilds the index labels, assuming the 1-based index of the example data
    return [(i + 1, n) for i, n in enumerate(arr) if not np.isnan(n)]

class NotNanDict(dict):
    def __new__(self, a):
        return {k: dict(dropna(v.to_numpy())) for k, v in a}

data.to_dict(orient='series', into=NotNanDict)
Output:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
Timing (using the benchmark from @jezrael's answer): [graph not included]
Answer by McLovvin
You can use a dict comprehension and loop over the columns:
{col:df[col].dropna().to_dict() for col in df}
Answer by Shibiraj
Try the code below:
import numpy as np
import pandas as pd
raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)
{col: data[col].dropna().to_dict() for col in data}
Output:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
Answer by villoro
There are a lot of ways of solving this, and the fastest method changes depending on the number of rows. Since performance matters here, I assume the number of rows is large.
import pandas as pd
import numpy as np
# Create a dataframe with random data
df = pd.DataFrame(np.random.randint(10, size=[1_000_000, 2]), columns=["A", "B"])
# Add some NaNs
df.loc[df["A"]==1, "B"] = np.nan
The fastest solution I found simply uses the dropna method and a dict comprehension:
%time {col: df[col].dropna().to_dict() for col in df.columns}
CPU times: user 528 ms, sys: 87.2 ms, total: 615 ms
Wall time: 615 ms
This is about 10 times faster than one of the proposed solutions:
%time [{k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 5.49 s, sys: 205 ms, total: 5.7 s
Wall time: 5.69 s
It is also about 2 times faster than other options like:
%time {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df.to_dict().items()}
CPU times: user 900 ms, sys: 133 ms, total: 1.03 s
Wall time: 1.03 s
The idea is to always try to use pandas or numpy built-in functions, since they are faster than plain Python.
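The %time lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The helper names `column_dropna` and `row_notnull`, the reduced row count, and the number of repetitions are assumptions made so the sketch runs quickly; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 rows used above so the sketch finishes quickly.
df = pd.DataFrame(np.random.randint(10, size=[50_000, 2]), columns=["A", "B"])
# Cast to float first so that assigning NaN does not rely on implicit upcasting.
df["B"] = df["B"].astype(float)
df.loc[df["A"] == 1, "B"] = np.nan

def column_dropna(frame):
    # Column-wise: let pandas drop the NaNs before converting.
    return {col: frame[col].dropna().to_dict() for col in frame.columns}

def row_notnull(frame):
    # Row-wise: convert first, then filter value by value in Python.
    return [{k: v for k, v in m.items() if pd.notnull(v)}
            for m in frame.to_dict(orient="records")]

for func in (column_dropna, row_notnull):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__}: {seconds:.3f} s per call")
```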
Answer by chrisckwong821
Improving on the answer at https://stackoverflow.com/a/46098323.
On a ~300K dataframe with 2 all-NaN columns, that answer gives:
%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 8.63 s, sys: 137 ms, total: 8.77 s
Wall time: 8.79 s
With a tiny twist:
%time [ {k:v for k,v in m.items()} for m in df.dropna(axis=1).to_dict(orient='records')]
CPU times: user 4.37 s, sys: 109 ms, total: 4.48 s
Wall time: 4.49 s
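The idea is to always drop the NaNs first, to avoid unnecessary iteration over NaN values. In the first answer, the NaNs are converted into dict entries before being dropped, which can be optimized. Note that dropna(axis=1) removes every column that contains any NaN, so this gives the same result as the original only when the NaNs are confined to entirely-NaN columns, as in this test.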
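The effect of dropping columns up front can be seen on a small frame. The sketch below uses a made-up three-column DataFrame and contrasts the default `how='any'` behaviour of `dropna(axis=1)` with `how='all'`, which removes only entirely-NaN columns and is combined here with the per-value filter for the NaNs that remain.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [np.nan, 44.0, np.nan],    # partly NaN
    'C': [np.nan, np.nan, np.nan],  # entirely NaN
})

# Default how='any': every column containing a NaN is removed, including B.
any_dropped = df.dropna(axis=1).to_dict(orient='records')

# how='all': only the entirely-NaN column C is removed up front;
# the remaining NaNs in B still need the per-value filter.
all_dropped = [
    {k: v for k, v in m.items() if pd.notnull(v)}
    for m in df.dropna(axis=1, how='all').to_dict(orient='records')
]

print(any_dropped)
print(all_dropped)
```

With `how='all'` the pre-drop still saves work on the empty columns without discarding the valid value in B.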
Answer by John Haberstroh
I wrote a function to solve this problem without reimplementing to_dict, and without calling it more than once. The approach is to recursively trim out the "leaves" with nan/None value.
import math
import numbers

def trim_nan_leaf(tree):
    """For a tree of dict-like and list-like containers, prune None and NaN leaves.
    Particularly applicable for json-like dictionary objects
    """
    # tree may be a dictionary, iterable, or other (element)
    # * Do not recursively iterate if string
    # * element is the base case
    # * Only remove nan and None leaves
    def valid_leaf(leaf):
        if leaf is None:
            return False
        if isinstance(leaf, numbers.Number):
            if not math.isnan(leaf):
                # int64 minimum: the integer that a missing datetime (NaT) is stored as
                return leaf != -9223372036854775808
            return False
        return True

    # Attempt dictionary
    try:
        return {k: trim_nan_leaf(tree[k]) for k in tree.keys() if valid_leaf(tree[k])}
    except AttributeError:
        # Execute base case on string for simplicity...
        if isinstance(tree, str):
            return tree
        # Attempt iterator
        try:
            # Avoid infinite recursion for self-referential objects (like one-length strings!)
            if tree[0] == tree:
                return tree
            return [trim_nan_leaf(leaf) for leaf in tree if valid_leaf(leaf)]
        # TypeError occurs when neither [] nor iteration is available;
        # IndexError occurs for an empty sequence
        except (TypeError, IndexError):
            # Base Case
            return tree
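As a usage illustration of the recursive-pruning idea, here is a condensed sketch applied to the DataFrame from the question. It is a simplified stand-in rather than the function above: the name `prune_missing` is hypothetical, and it only handles plain dicts, lists, None, and float NaN.

```python
import math

import numpy as np
import pandas as pd

def prune_missing(node):
    """Recursively drop None/NaN leaves from nested dicts and lists (simplified sketch)."""
    def is_missing(leaf):
        return leaf is None or (isinstance(leaf, float) and math.isnan(leaf))

    if isinstance(node, dict):
        return {k: prune_missing(v) for k, v in node.items() if not is_missing(v)}
    if isinstance(node, list):
        return [prune_missing(v) for v in node if not is_missing(v)]
    # Anything else is a leaf and is returned unchanged.
    return node

data = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}})

# A single to_dict() call, followed by pruning of the nested result.
pruned = prune_missing(data.to_dict())
print(pruned)
```

Since the pruning works on the nested containers rather than on the DataFrame, the same helper also applies to `to_dict(orient='records')` output or to any json-like structure.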


