Convert Pandas DataFrame to dict and dropna

Question

Asked by der_die_das_jojo

I have a pandas DataFrame with NaNs in it, like this:

import pandas as pd
import numpy as np
raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)
>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN

Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

But using pandas to_dict function gives me a result like this:

>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}

So how can I make a dict out of the DataFrame and get rid of the NaNs?

Answer 1

Answered by Peter Mularien

There are many ways you could accomplish this. I spent some time evaluating performance on a not-so-large (70k) dataframe. Although @der_die_das_jojo's answer is functional, it's also pretty slow.

The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.

On my test dataframe (df):

Above method:

%time [ v.dropna().to_dict() for k,v in df.iterrows() ]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s

Another slow method:

%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='rows')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s

Fastest method I could find:

%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s

The format of this output is a row-oriented dictionary. You may need to make adjustments if you want the column-oriented form in the question.
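
If the column-oriented form from the question is needed, one possible adjustment is to apply the same pd.notnull filter to the nested dict returned by to_dict() instead of to the list of records. The following is a minimal sketch on the question's sample data, not a benchmarked solution; the variable names rows and cols are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}})

# Row-oriented: one dict per row, with NaN entries filtered out
rows = [{k: v for k, v in m.items() if pd.notnull(v)}
        for m in df.to_dict(orient='records')]

# Column-oriented: the same filter applied to the nested dict from to_dict()
cols = {col: {idx: v for idx, v in vals.items() if pd.notnull(v)}
        for col, vals in df.to_dict().items()}

print(rows)
print(cols)
```

The column-oriented variant keeps the index labels as inner keys, which the row-oriented list does not.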

I'd be very interested if anyone finds an even faster answer to this question.

Answer 2

Answered by der_die_das_jojo

Write a function inspired by to_dict from pandas:

import pandas as pd
import numpy as np

def to_dict_dropna(data):
    # Build one dict per column, dropping NaN entries before converting
    return {k: v.dropna().to_dict() for k, v in data.items()}

raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)

# Named "result" so the built-in dict is not shadowed
result=to_dict_dropna(data)

As a result, you get what you want:

>>> result
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Answer 3

Answered by jezrael

The first graph covers generating dictionaries per column, so the output is a few very long dictionaries; the number of dicts depends on the number of columns.

I tested multiple methods with perfplot. In larger DataFrames, the fastest method is to loop over each column and remove missing values or Nones with Series.dropna, or with Series.notna in boolean indexing.

In smaller DataFrames, the fastest is a dictionary comprehension that tests for missing values with the NaN != NaN trick and also tests for Nones.

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)

def comp_notnull(df1):
    return {k1: {k:v for k,v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    return {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k,v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k,v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
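
The NaN != NaN trick used in comp_NaNnotNaN_None relies on the IEEE 754 rule that NaN compares unequal to everything, including itself, so v == v is False only for NaN. A small sketch of the idea follows; the helper name is_present is hypothetical and not part of pandas. It assumes the missing markers are float NaN or None; pd.NA would not fit this pattern, because the result of comparing it cannot be used as a boolean.

```python
nan = float('nan')

def is_present(v):
    # v == v is False only for NaN; None needs its own identity check
    return v == v and v is not None

values = {1: nan, 2: 44.0, 3: None, 4: 'text'}

# Keep only the entries whose value is neither NaN nor None
kept = {k: v for k, v in values.items() if is_present(v)}
print(kept)
```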

Another situation is generating dictionaries per row, which gives a list with a huge number of small dictionaries. In that case, the fastest is a list comprehension that filters out NaNs and Nones:

np.random.seed(2020)
import perfplot


def comp_notnull1(df1):
    return [{k:v for k,v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='records')]

def comp_NaNnotNaN_None1(df1):
    return [{k:v for k,v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='records')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k,v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k,v in df1.T.items()]


def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')

Answer 4

Answered by kederrac

You can have your own mapping class in which you get rid of the NaNs:

class NotNanDict(dict):

    @staticmethod
    def is_nan(v):
        if isinstance(v, dict):
            return False
        return np.isnan(v)

    def __new__(self, a):
        return {k: v for k, v in a if not self.is_nan(v)} 

data.to_dict(into=NotNanDict)

Output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Timing (using the benchmark from @jezrael's answer): [plot not preserved]

To boost the speed, you can use numba:

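import numpy as np
from numba import jit

@jit
def dropna(arr):
    # Pair each non-NaN value with its label; i + 1 assumes the index is 1, 2, 3, ... as in the question
    return [(i + 1, n) for i, n in enumerate(arr) if not np.isnan(n)]


class NotNanDict(dict):

    def __new__(self, a):
        return {k: dict(dropna(v.to_numpy())) for k, v in a}

data.to_dict(orient='series', into=NotNanDict)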

Output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Timing (using the benchmark from @jezrael's answer): [plot not preserved]

Answer 5

Answered by McLovvin

You can use a dict comprehension and loop over the columns:

{col:df[col].dropna().to_dict() for col in df}

Answer 6

Answered by Shibiraj

Try the code below:

import numpy as np
import pandas as pd
raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)
{col: data[col].dropna().to_dict() for col in data}

Output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Answer 7

Answered by villoro

There are a lot of ways of solving this. The fastest method changes depending on the number of rows. Since performance is relevant, I assume that the number of rows is large.

import pandas as pd
import numpy as np

# Create a dataframe with random data
df = pd.DataFrame(np.random.randint(10, size=[1_000_000, 2]), columns=["A", "B"])

# Add some NaNs
df.loc[df["A"]==1, "B"] = np.nan

The fastest solution I found is simply to use the dropna method and a dict comprehension:

%time {col: df[col].dropna().to_dict() for col in df.columns}

CPU times: user 528 ms, sys: 87.2 ms, total: 615 ms
Wall time: 615 ms

This is about 10 times faster than one of the other proposed solutions:

%time [{k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]

CPU times: user 5.49 s, sys: 205 ms, total: 5.7 s
Wall time: 5.69 s

It is also almost 2 times faster than other options such as:

%time {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df.to_dict().items()}

CPU times: user 900 ms, sys: 133 ms, total: 1.03 s
Wall time: 1.03 s

The idea is to always try to use pandas or numpy built-in functions, since they are faster than regular Python.
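
For readers without IPython's %time, the comparison can be reproduced roughly with the standard-library timeit module. This is a sketch on a smaller, 100,000-row frame (an assumption made to keep it quick); the function names are illustrative, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the answer above, so the sketch runs quickly
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(100_000, 2)), columns=["A", "B"])
df["B"] = df["B"].astype(float)  # float column so it can hold NaN
df.loc[df["A"] == 1, "B"] = np.nan

def with_dropna():
    # pandas drops the NaNs before the conversion to dict
    return {col: df[col].dropna().to_dict() for col in df.columns}

def with_python_filter():
    # plain-Python filter applied after the conversion to dict
    return {k1: {k: v for k, v in v1.items() if v == v and v is not None}
            for k1, v1 in df.to_dict().items()}

# Both approaches should produce the same nested dict
assert with_dropna() == with_python_filter()

for fn in (with_dropna, with_python_filter):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds:.3f} s per call")
```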

Answer 8

Answered by chrisckwong821

Improving on the answer at https://stackoverflow.com/a/46098323

With a ~300K dataframe that has 2 entirely-NaN columns, that answer gives:

%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 8.63 s, sys: 137 ms, total: 8.77 s
Wall time: 8.79 s

With a tiny twist:

%time [ {k:v for k,v in m.items()} for m in df.dropna(axis=1).to_dict(orient='records')]
CPU times: user 4.37 s, sys: 109 ms, total: 4.48 s
Wall time: 4.49 s

The idea is to always drop NaN first, to avoid unnecessary iteration over NaN values. In the first answer, NaN is converted into the dict before being dropped, which can be optimized. Note that dropna(axis=1) removes every column containing at least one NaN, so it matches the first answer's output only when the NaNs are confined to entirely-NaN columns.
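
The difference between dropping columns with any NaN and dropping only entirely-NaN columns can be seen on a tiny frame. The sketch below uses a made-up three-column DataFrame; with how='all', partially filled columns are kept, so the pd.notnull filter is still needed afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [np.nan, 5.0], "C": [np.nan, np.nan]})

# Default how='any': every column containing a NaN is dropped (B and C)
any_dropped = df.dropna(axis=1)

# how='all': only columns that are entirely NaN are dropped (C)
all_dropped = df.dropna(axis=1, how="all")

# NaNs remaining in partially filled columns still have to be filtered per row
records = [{k: v for k, v in m.items() if pd.notnull(v)}
           for m in all_dropped.to_dict(orient="records")]

print(list(any_dropped.columns), list(all_dropped.columns))
print(records)
```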

Answer 9

Answered by John Haberstroh

I wrote a function to solve this problem without reimplementing to_dict, and without calling it more than once. The approach is to recursively trim out the "leaves" with nan/None value.

import math
import numbers

def trim_nan_leaf(tree):
    """For a tree of dict-like and list-like containers, prune None and NaN leaves.

    Particularly applicable for json-like dictionary objects
    """
    # tree may be a dictionary, iterable, or other (element)
    # * Do not recursively iterate if string
    # * element is the base case
    # * Only remove nan and None leaves

    def valid_leaf(leaf):
        if leaf is None:
            return False
        if isinstance(leaf, numbers.Number):
            if not math.isnan(leaf):
                # int64 minimum is the integer value pandas uses for NaT
                return leaf != -9223372036854775808
            return False
        return True

    # Attempt dictionary
    try:
        return {k: trim_nan_leaf(tree[k]) for k in tree.keys() if valid_leaf(tree[k])}
    except AttributeError:
        # Execute base case on string for simplicity...
        if isinstance(tree, str):
            return tree
        # Attempt iterator
        try:
            # Avoid infinite recursion for self-referential objects (like one-length strings!)
            if tree[0] == tree:
                return tree
            return [trim_nan_leaf(leaf) for leaf in tree if valid_leaf(leaf)]
        # TypeError occurs when neither [] nor iteration is available;
        # IndexError occurs for an empty sequence
        except (TypeError, IndexError):
            # Base Case
            return tree
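
For the common case of plain dicts and lists, the same pruning idea can be sketched more compactly. This is a simplified illustration, not the function above: the names prune and is_missing are hypothetical, only float NaN and None are treated as missing, and the NaT sentinel check is omitted.

```python
import math

def is_missing(value):
    # Treat None and float NaN as missing; everything else is kept
    return value is None or (isinstance(value, float) and math.isnan(value))

def prune(tree):
    # Recurse into dicts and lists, dropping missing leaves along the way
    if isinstance(tree, dict):
        return {k: prune(v) for k, v in tree.items() if not is_missing(v)}
    if isinstance(tree, list):
        return [prune(v) for v in tree if not is_missing(v)]
    return tree

nested = {"A": {1: 2, 2: 3, 3: 4},
          "B": {1: float("nan"), 2: 44.0, 3: float("nan")},
          "C": [None, "x", float("nan")]}
print(prune(nested))
```

Applied to the output of a single data.to_dict() call, a function like this would remove the NaN entries in one pass over the nested dict.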
