Simple way to Dask concatenate (horizontal, axis=1, columns)

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/46911220/

Date: 2020-09-14 04:41:07  Source: igfitidea

Tags: python, pandas, dask

Asked by Tom Hemmes

Action: Reading two CSV files (data.csv and label.csv) into a single dataframe.

df = dd.read_csv(data_files, delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(label_files, delimiter=' ', header=None, names=['label'])

Problem: Concatenating columns requires known divisions. However, setting an index will sort the data, which I explicitly do not want, because the row order of the two files is what matches them to each other.


df = dd.concat([df, df_label], axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-e6c2e1bdde55> in <module>()
----> 1 df = dd.concat([df, df_label], axis=1)

/uhome/hemmest/.local/lib/python3.5/site-packages/dask/dataframe/multi.py in concat(dfs, axis, join, interleave_partitions)
    573             return concat_unindexed_dataframes(dfs)
    574         else:
--> 575             raise ValueError('Unable to concatenate DataFrame with unknown '
    576                              'division specifying axis=1')
    577     else:

ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1

Tried: Adding an 'id' column

df['id'] = pd.Series(range(len(df)))

However, the dataframe is so long that the resulting Series is larger than memory.


Question: Apparently Dask knows that both dataframes have the same length:

In [15]:
df.index.compute()
Out[15]:
Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            1120910, 1120911, 1120912, 1120913, 1120914, 1120915, 1120916,
            1120917, 1120918, 1120919],
           dtype='int64', length=280994776)
In [16]:
df_label.index.compute()
Out[16]:
Int64Index([1, 5, 5, 2, 2, 2, 2, 2, 2, 2,
            ...
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
           dtype='int64', length=280994776)

How can I exploit this knowledge to simply concatenate them?

Accepted answer by Tom Hemmes

The solution (from the comments by @Primer):

  • repartition both dataframes and reset their indexes
  • use assign instead of concat

The final code:

import dask.dataframe as dd

# Read the point data and the labels; the row order of the two files is what matches them.
df = dd.read_csv(['data/untermaederbrunnen_station1_xyz_intensity_rgb.txt'], delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(['data/untermaederbrunnen_station1_xyz_intensity_rgb.labels'], header=None, names=['label'])
# Optional sanity check (triggers computation):
# len(df), len(df_label), df_label.label.isnull().sum().compute()

# Give both dataframes the same number of partitions and a fresh index.
df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_label = df_label.repartition(npartitions=200)
df_label = df_label.reset_index(drop=True)

# Attach the labels as a new column instead of concatenating.
df = df.assign(label=df_label.label)
df.head()

Answer by architectonic

I had the same problem and solved it by making sure that both dataframes have the same number of partitions (since we know already that both have the same length):


df = df.repartition(npartitions=200)
df_label = df_label.repartition(npartitions=200)
df = dd.concat([df, df_label], axis=1)