Notice: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/34666860/

Date: 2020-09-14 00:28:14  Source: igfitidea

Converting pandas.DataFrame to bytes

Tags: python, numpy, pandas, type-conversion, dataframe

Asked by Paul Jtheitroademan

I need to convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point). Here is a simple set of data:

import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

and df looks something like this:

    a            b                  c
0   10  18446744073709551615    1.324000e+10
1   15  230498234019            3.141590e+00
2   20  32094812309             2.341341e+02

The DataFrame knows the type of each column (see df.dtypes), so I'd like to do something like this:

data_to_pack = [tuple(record) for _, record in df.iterrows()]
data_array = np.array(data_to_pack, dtype=zip(df.columns, df.dtypes))
data_bytes = data_array.tostring()

This typically works fine, but in this case (due to the maximum value stored in df['b'][0]) the second line above, which converts the array of tuples to an np.array with a given set of types, causes the following error:

OverflowError: Python int too large to convert to C long

The error results (I believe) from the first line, which extracts each record as a Series with a single data type (defaults to float64); the representation chosen in float64 for the maximum uint64 value is not directly convertible back to uint64.
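
The suspicion above can be checked with a small sketch. It rebuilds the same data from a dict of typed arrays (a choice made here for brevity, not the construction used in the question) and inspects the first row that iterrows() yields. The variable names first_row, b_as_int, and fits_in_uint64 are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as in the question, built column by column so each dtype is kept.
df = pd.DataFrame({
    "a": np.array([10, 15, 20], dtype="u1"),
    "b": np.array([np.iinfo("u8").max, 230498234019, 32094812309], dtype="u8"),
    "c": np.array([1.324e10, 3.14159, 234.1341], dtype="f8"),
})

# iterrows() yields each row as a Series, and a Series has a single dtype,
# so the mixed uint8/uint64/float64 row is upcast to one common type.
_, first_row = next(df.iterrows())
row_dtype = first_row.dtype

# The uint64 maximum (2**64 - 1) has no exact float64 representation;
# converting the stored float back to a Python int shows what it became.
b_as_int = int(first_row["b"])
fits_in_uint64 = b_as_int <= np.iinfo("u8").max

print(row_dtype, b_as_int, fits_in_uint64)
```

If the row comes back as float64 and the value no longer fits in uint64, the per-row tuple approach has already lost the information before np.array is ever called.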

1) Since the DataFrame already knows the types of each column, is there a way to get around creating a row of tuples for input into the typed numpy.array constructor? Or is there a better way than outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from a DataFrame to a byte string representing the data, using the type information for each column?

Accepted answer by ali_m

You can use df.to_records() to convert your dataframe to a numpy recarray, then call .tostring() to convert this to a string of bytes:

rec = df.to_records(index=False)

print(repr(rec))
# rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
#  (20, 32094812309, 234.1341)], 
#           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

s = rec.tostring()
rec2 = np.fromstring(s, rec.dtype)

print(np.all(rec2 == rec))
# True
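
Newer NumPy releases deprecate ndarray.tostring() and the binary mode of np.fromstring() in favour of ndarray.tobytes() and np.frombuffer(). The following is a minimal sketch of the same round trip with those calls, assuming a reasonably recent NumPy and pandas; the dict-based DataFrame construction and the names data_bytes and restored are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same data as in the question, with one explicit dtype per column.
df = pd.DataFrame({
    "a": np.array([10, 15, 20], dtype="u1"),
    "b": np.array([np.iinfo("u8").max, 230498234019, 32094812309], dtype="u8"),
    "c": np.array([1.324e10, 3.14159, 234.1341], dtype="f8"),
})

# to_records() builds a record array whose fields carry each column's dtype,
# so no intermediate row tuples are needed.
rec = df.to_records(index=False)

# tobytes() serialises the records; frombuffer() reinterprets the bytes
# using the same record dtype (the result is a read-only view of the bytes).
data_bytes = rec.tobytes()
restored = np.frombuffer(data_bytes, dtype=rec.dtype)

# Check each field against the original column.
round_trip_ok = all(
    np.array_equal(restored[name], df[name].to_numpy()) for name in df.columns
)
print(len(data_bytes), round_trip_ok)
```

Note that the bytes alone do not describe their own layout: whoever decodes them needs rec.dtype (or an equivalent description of field names, types, and byte order) to interpret them.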