Original URL: http://stackoverflow.com/questions/20428355/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
Appending Column to Frame of HDF File in Pandas
Asked by lstyls
I am working with a large dataset in CSV format. I am trying to process the data column-by-column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is manageable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.
I am able to create a new HDF file and a new frame with the first column:
import pandas

hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)
But after that, I get a ValueError when trying to append a new column to the frame:
feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)
Stack trace and error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data
I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.
Answered by Jeff
The complete docs are here, and some cookbook strategies are here.
PyTables is row-oriented, so you can only append rows. Read the CSV chunk-by-chunk, then append each entire frame as you go, something like this:
import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()
You must be a tad careful, as it is possible for the resulting frame, when read chunk-by-chunk, to end up with different dtypes: e.g. you have an integer-like column that doesn't have missing values until, say, the 2nd chunk. The first chunk would have that column as int64, while the second as float64. You may need to force dtypes with the dtype keyword to read_csv; see here.
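For instance, here is a minimal sketch of pinning the dtypes up front so every chunk matches the table structure already in the store (the column names and dtypes in the dict are hypothetical):

import pandas as pd

# Force dtypes at read time, so a column that happens to have no missing
# values in the first chunk cannot flip from int64 to float64 in a later one.
dtypes = {'srch_id': 'int64', 'price': 'float64'}  # hypothetical columns

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000, dtype=dtypes):
    store.append('df', chunk)
store.close()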
Here is a similar question as well.
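Once the frame is stored in table format, individual columns can be pulled back without loading the whole frame, which matches the asker's feature-by-feature plan. A sketch, assuming the 'df' key from the snippet above and the hypothetical srch_id column:

import pandas as pd

store = pd.HDFStore('file.h5', mode='r')
# Table-format stores support column selection on read, so only the
# requested column is brought into memory.
feature = store.select('df', columns=['srch_id'])
store.close()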

