Hierarchical data: efficiently build a list of every descendant for each node

Note: the content below is taken from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow, citing the original URL and author information.
Original URL: http://stackoverflow.com/questions/46722740/
Asked by taky2
I have a two-column data set depicting multiple child-parent relationships that form a large tree. I would like to use this to build an updated list of every descendant for each node.
Original Input:
child parent
1 2010 1000
7 2100 1000
5 2110 1000
3 3000 2110
2 3011 2010
4 3033 2100
0 3102 2010
6 3111 2110
Graphical depiction of relationships:
Expected output:
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
Originally I decided to use a recursive solution with DataFrames. It works as intended, but Pandas is awfully inefficient. My research has led me to believe that an implementation using NumPy arrays (or other simple data structures) would be much faster on large data sets (tens of thousands of records).
Solution using data frames:
import pandas as pd

df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }, columns=['child', 'parent']
)


def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):
        list_of_children = list()
        list_of_children.append(df[df['parent'] == parent_id]['child'].values)

        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))

        # flatten list
        list_of_children = [item for sublist in list_of_children for item in sublist]

        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)

    return new_df
Because using pandas DataFrames in this way is very inefficient on large data sets, I need to improve the performance of this operation. My understanding is that this can be done by using more efficient data structures better suited for looping and recursion. I want to perform this same operation in the most efficient way possible.

Specifically, I'm asking for optimization of speed.
Accepted answer by EFT
This is a method using numpy to iterate down the tree a generation at a time.
Code:
import numpy as np
import pandas as pd  # only used to return a dataframe


def list_ancestors(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]

    Return pandas dataframe of all descendant/ancestor node pairs

    Ex:
        df = pd.DataFrame({'child':  [200, 201, 300, 301, 302, 400],
                           'parent': [100, 100, 200, 200, 201, 300]})

        df
           child  parent
        0    200     100
        1    201     100
        2    300     200
        3    301     200
        4    302     201
        5    400     300

        list_ancestors(df.values)

        returns

            descendant  ancestor
        0          200       100
        1          201       100
        2          300       200
        3          300       100
        4          301       200
        5          301       100
        6          302       201
        7          302       100
        8          400       300
        9          400       200
        10         400       100
    """
    ancestors = []
    for ar in trace_nodes(edges):
        ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1] - 1),
                               ar[:, 1:].flatten()])
    return pd.DataFrame(np.concatenate(ancestors),
                        columns=['descendant', 'ancestor'])
def trace_nodes(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]

    Yield numpy array with cross-section of tree and associated
    ancestor nodes

    Ex:
        df = pd.DataFrame({'child':  [200, 201, 300, 301, 302, 400],
                           'parent': [100, 100, 200, 200, 201, 300]})

        df
           child  parent
        0    200     100
        1    201     100
        2    300     200
        3    301     200
        4    302     201
        5    400     300

        trace_nodes(df.values)

        yields

        array([[200, 100],
               [201, 100]])

        array([[300, 200, 100],
               [301, 200, 100],
               [302, 201, 100]])

        array([[400, 300, 200, 100]])
    """
    # first generation: edges whose parent never appears as a child (i.e. the root's children)
    mask = np.in1d(edges[:, 1], edges[:, 0])
    gen_branches = edges[~mask]
    edges = edges[mask]
    yield gen_branches
    while edges.size != 0:
        # next generation: edges whose parent's ancestry is already known
        mask = np.in1d(edges[:, 1], edges[:, 0])
        next_gen = edges[~mask]
        # attach each new child to the full ancestor chain of its parent
        gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches)
        edges = edges[mask]
        yield gen_branches
def numpy_col_inner_many_to_one_join(ar1, ar2):
    """
    Take two 2-d numpy arrays ar1 and ar2,
    with no duplicate values in first column of ar2

    Return inner join of ar1 and ar2 on
    last column of ar1, first column of ar2

    Ex:
        ar1 = np.array([[1,  2,  3],
                        [4,  5,  3],
                        [6,  7,  8],
                        [9, 10, 11]])

        ar2 = np.array([[ 1,  2],
                        [ 3,  4],
                        [ 5,  6],
                        [ 7,  8],
                        [ 9, 10],
                        [11, 12]])

        numpy_col_inner_many_to_one_join(ar1, ar2)

        returns

        array([[ 1,  2,  3,  4],
               [ 4,  5,  3,  4],
               [ 9, 10, 11, 12]])
    """
    # keep only the rows on each side that actually participate in the join
    ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])]
    ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])]
    # count how many ar1 rows join to each key (bincount is a fast path for non-negative ints)
    if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0:
        bins = np.bincount(ar1[:, -1])
        counts = bins[bins.nonzero()[0]]
    else:
        counts = np.unique(ar1[:, -1], False, False, True)[1]
    # sort both sides by the join key, then repeat each ar2 row to match its ar1 rows
    left = ar1[ar1[:, -1].argsort()]
    right = ar2[ar2[:, 0].argsort()]
    return np.concatenate([left[:, :-1],
                           right[np.repeat(np.arange(right.shape[0]),
                                           counts)]], 1)
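As a quick usage check (a sketch assuming the df from the question is in scope), the function can be called directly on the two-column edge list; the resulting pairs should match the expected output from the question, although possibly in a different row order:

result = list_ancestors(df.values)
print(result.sort_values(['ancestor', 'descendant']).reset_index(drop=True))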
Timing Comparison:
Test cases 1 & 2 provided by @taky2, test cases 3 & 4 comparing performance on tall and wide tree structures respectively – most use cases are likely somewhere in the middle.
df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }
)

df2 = pd.DataFrame(
    {
        'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001,
                  3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311],
        'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102,
                   2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033]
    }
)

df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)],
                         np.c_[np.arange(501, 1001), np.arange(500)]],
                   columns=['child', 'parent'])

df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)],
                         np.c_[np.arange(1001, 11001),
                               np.repeat(np.arange(1, 101), 100)]],
                   columns=['child', 'parent'])
%timeit get_ancestry_dataframe_flat(df)
10 loops, best of 3: 53.4 ms per loop
%timeit add_children_of_children(df)
1000 loops, best of 3: 1.13 ms per loop
%timeit all_descendants_nx(df)
1000 loops, best of 3: 675 μs per loop
%timeit list_ancestors(df.values)
1000 loops, best of 3: 391 μs per loop
%timeit get_ancestry_dataframe_flat(df2)
10 loops, best of 3: 168 ms per loop
%timeit add_children_of_children(df2)
1000 loops, best of 3: 1.8 ms per loop
%timeit all_descendants_nx(df2)
1000 loops, best of 3: 1.06 ms per loop
%timeit list_ancestors(df2.values)
1000 loops, best of 3: 933 μs per loop
%timeit add_children_of_children(df3)
10 loops, best of 3: 156 ms per loop
%timeit all_descendants_nx(df3)
1 loop, best of 3: 952 ms per loop
%timeit list_ancestors(df3.values)
10 loops, best of 3: 104 ms per loop
%timeit add_children_of_children(df4)
1 loop, best of 3: 503 ms per loop
%timeit all_descendants_nx(df4)
1 loop, best of 3: 238 ms per loop
%timeit list_ancestors(df4.values)
100 loops, best of 3: 2.96 ms per loop
Notes:

- get_ancestry_dataframe_flat not timed on cases 3 & 4 due to time and memory concerns.
- add_children_of_children modified to identify the root node internally, but allowed to assume a unique root; the first line root_node = (set(dataframe.parent) - set(dataframe.child)).pop() was added.
- all_descendants_nx modified to accept a dataframe as an argument, instead of pulling from an external namespace.
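For clarity, a rough sketch of what those two benchmark modifications might look like (the *_timed names are illustrative and not from the original answer; add_children_of_children and all_descendants_nx refer to the functions defined in the answers below):

import networkx as nx
import pandas as pd

def add_children_of_children_timed(dataframe):
    # identify the (assumed unique) root internally, as described in the notes
    root_node = (set(dataframe.parent) - set(dataframe.child)).pop()
    return add_children_of_children(dataframe, root_node)

def all_descendants_nx_timed(dataframe):
    # same logic as all_descendants_nx below, but taking the dataframe as an argument
    DiG = nx.from_pandas_edgelist(dataframe, 'parent', 'child', create_using=nx.DiGraph())
    return pd.DataFrame.from_records(
        [(n1, n2) for n1 in DiG.nodes() for n2 in nx.ancestors(DiG, n1)],
        columns=['descendant', 'ancestor'])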
Example demonstrating proper behavior:
np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])\
                                       .reset_index(drop=True) ==\
       list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])\
                                 .reset_index(drop=True))

Out[20]: True
Answer by Stephen Rauch
Here is a method which builds a dict to allow easier navigation of the tree. It then walks the tree once, adding the children to their grandparents and above, and finally adds the new data to the dataframe.
Code:
def add_children_of_children(dataframe, root_node):
    # build a dict of lists to allow easy tree descent
    tree = {}
    for idx, (child, parent) in dataframe.iterrows():
        tree.setdefault(parent, []).append(child)

    data = []

    def descend_tree(parent):
        # get list of children of this parent
        children = tree[parent]

        # reverse order so that we can modify the list while looping
        for child in reversed(children):
            if child in tree:
                # descend tree and find children which need to be added
                lower_children = descend_tree(child)

                # add children from below to parent at this level
                data.extend([(c, parent) for c in lower_children])

                # return lower children to parents above
                children.extend(lower_children)
        return children

    descend_tree(root_node)
    return dataframe.append(
        pd.DataFrame(data, columns=dataframe.columns))
Timings:
There are three test methods in the test code; the times are in seconds from a timeit run:
- 0.073 - add_children_of_children() from above.
- 0.153 - add_children_of_children() with the output sorted.
- 3.385 - original get_ancestry_dataframe_flat() pandas implementation.
So a native data structure approach is considerably faster than the original implementation.
Test Code:
import pandas as pd

df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }, columns=['child', 'parent']
)


def method1():
    # the root node is the node which is not a child
    root = set(df.parent) - set(df.child)
    assert len(root) == 1, "Number of roots != 1 '{}'".format(root)
    return add_children_of_children(df, root.pop())


def method2():
    dataframe = method1()
    names = ['ancestor', 'descendant']
    rename = {o: n for o, n in zip(dataframe.columns, reversed(names))}
    return dataframe.rename(columns=rename) \
        .sort_values(names).reset_index(drop=True)


def method3():
    return get_ancestry_dataframe_flat(df)


def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):
        list_of_children = list()
        list_of_children.append(
            df[df['parent'] == parent_id]['child'].values)

        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))

        # flatten list
        list_of_children = [
            item for sublist in list_of_children for item in sublist]

        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)

    return new_df


print(method2())
print(method3())

from timeit import timeit
print(timeit(method1, number=50))
print(timeit(method2, number=50))
print(timeit(method3, number=50))
Test Results:
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
0.0737142168563
0.153700592966
3.38558308083
Answer by Kevin
A solution using networkx; there may be a more efficient method in the documentation, but this nested loop does the trick.
import pandas as pd
from timeit import timeit

df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }, columns=['child', 'parent']
)
In networkx 2.0, use from_pandas_edgelist to create a directed graph:
import networkx as nx
DiG = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())
Simply iterate over the nodes and the ancestors of each node.
for n1 in DiG.nodes():
    for n2 in nx.ancestors(DiG, n1):
        print(n1, n2)
3000 1000
3000 2110
3011 1000
3011 2010
2100 1000
2110 1000
3111 1000
3111 2110
3033 1000
3033 2100
2010 1000
3102 1000
3102 2010
Wrapped into a function:
def all_descendants_nx():
    DiG = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())
    return pd.DataFrame.from_records([(n1, n2) for n1 in DiG.nodes() for n2 in nx.ancestors(DiG, n1)],
                                     columns=['descendant', 'ancestor'])


print(timeit(all_descendants_nx, number=50))  # to compare to Stephen's nice answer
0.05033063516020775
all_descendants_nx()
descendant ancestor
0 3000 1000
1 3000 2110
2 3011 1000
3 3011 2010
4 2100 1000
5 2110 1000
6 3111 1000
7 3111 2110
8 3033 1000
9 3033 2100
10 2010 1000
11 3102 1000
12 3102 2010
Answer by Vaishali
Here is one way, using isin() and map:
df_new = df.append(df[df['parent'].isin(df['child'].values.tolist())])\
           .reset_index(drop=True)

df_new.loc[df_new.duplicated(), 'parent'] = df_new.loc[df_new.duplicated(), 'parent']\
                                                  .map(df.set_index('child')['parent'])

df_new = df_new.sort_values('parent').reset_index(drop=True)
df_new.columns = ['descendant', 'ancestor']
You get
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
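Note that the append/map step above climbs exactly one generation, which is enough for the three-level example tree. As a purely illustrative sketch (not part of the original answer; the name all_pairs_by_mapping and its details are assumptions), the same idea can be repeated until no new ancestors are found, which would also cover deeper trees:

import pandas as pd

def all_pairs_by_mapping(df):
    # direct child -> parent lookup built from the original edge list
    parent_of = df.set_index('child')['parent']
    pairs = df.rename(columns={'child': 'descendant', 'parent': 'ancestor'})
    frontier = pairs
    while True:
        # climb one generation: replace each ancestor by its own parent, if it has one
        climbed = frontier.assign(ancestor=frontier['ancestor'].map(parent_of)).dropna()
        if climbed.empty:
            break
        climbed = climbed.astype({'ancestor': int})
        pairs = pd.concat([pairs, climbed], ignore_index=True)
        frontier = climbed
    return pairs.drop_duplicates().sort_values(['ancestor', 'descendant']).reset_index(drop=True)

For the example df this yields the same 13 descendant/ancestor pairs as the expected output above.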