Pandas read_fwf not loading the entire contents of a file

Question

Asked by eroma934

I have a rather large fixed-width file (~30M rows, 4 GB). When I tried to create a DataFrame using pandas read_fwf(), it loaded only a portion of the file. Has anyone had a similar issue with this parser not reading the entire contents of a file?

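import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3, 7, 9, 11, 51, 51]

# read_fwf must be called through the pandas namespace, and column names must be strings.
df = pd.read_fwf(file_name, widths=fwidths,
                 names=["col0", "col1", "col2", "col3", "col4", "col5"])
print(df.shape)  # fewer than 30M rows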

If I naively read the file into a single column using read_csv(), the whole file is read into memory and no data is lost.

import pandas as pd

file_name = r"C:\....\file.txt"

# Arbitrary delimiter: the file contains no pipes, so each line lands in one column.
df = pd.read_csv(file_name, delimiter="|", names=["col0"])
print(df.shape)  # ~30M rows

Of course, without seeing the contents or format of the file, this could be caused by something on my end, but I wanted to see whether anyone else has had this issue in the past. As a sanity check, I tested a couple of rows deep in the file and they all seem to be formatted correctly (further verified when I was able to pull the file into an Oracle DB with Talend using the same specs).
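Spot-checking a few rows can miss a malformed line somewhere else in a 30M-row file. A minimal sketch of a fuller check follows, assuming that every well-formed line should be 132 bytes long (the sum of the widths 3+7+9+11+51+51 given above). The function name `line_length_histogram` is hypothetical, and the file is opened in binary mode so that no bytes are interpreted or dropped by text decoding.

```python
from collections import Counter


def line_length_histogram(path):
    """Return a Counter mapping line length in bytes to the number of such lines.

    Line terminators (\r\n or \n) are stripped before measuring.
    """
    counts = Counter()
    with open(path, "rb") as f:  # binary mode: count raw bytes, no decoding
        for line in f:
            counts[len(line.rstrip(b"\r\n"))] += 1
    return counts


# Usage (the path is whatever file is being loaded):
#   hist = line_length_histogram(file_name)
#   print(hist.most_common(5))
# If any length other than the expected total width shows up, those lines
# are candidates for rows that a fixed-width parser may mishandle.
```

A single dominant length with a handful of outliers (for example a length of 0 for blank lines) would point at specific lines worth inspecting, although it would not by itself prove that they explain the missing rows.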

Let me know if anyone has any ideas. It would be great to run everything in Python and not go back and forth between tools once I begin to develop analytics.

Answer 1

Accepted answer by Marcin

A few lines of the input file would be useful to see what the data looks like. Nevertheless, I generated a random file in a format similar (I think) to yours and applied pd.read_fwf to it. This is the code for generating and reading it:

from random import random

import pandas as pd

file_name = r"/tmp/file.txt"
lines_no = int(30e6)

# Write 30M lines, each holding six space-separated numbers formatted to 10 characters.
with open(file_name, 'w') as f:
    for i in range(lines_no):
        if i % int(1e5) == 0:
            print("Writing progress: {:0.1f}%"
                  .format(float(i) / float(lines_no) * 100), end='\r')
        f.write(" ".join(["{:<10.8f}".format(random() * 10) for v in range(6)]) + "\n")

print("File created. Now read it using pd.read_fwf ...")

# Each field is 10 characters plus one separating space, hence a width of 11.
fwidths = [11, 11, 11, 11, 11, 11]

df = pd.read_fwf(file_name, widths=fwidths,
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])

# print(df)

print(df.shape)  # (30000000, 6)

So in this case, it seems to be working fine. I use Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. It takes a while to create the file and read it using pd.read_fwf, but it seems to work, at least for me and my setup.

The result is: (30000000, 6)

Example file created:

7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905
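To check on a given setup whether rows actually go missing, one option is to compare the raw line count of the file with the number of rows that pd.read_fwf returns. The sketch below is only an illustration: the helper names `count_raw_lines` and `count_fwf_rows` are hypothetical, and it assumes one record per line. It reads the file in chunks via the `chunksize` parameter, so the full DataFrame never has to fit in memory.

```python
import pandas as pd


def count_raw_lines(path):
    """Count newline-terminated lines by iterating over the file in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_fwf_rows(path, widths, names, chunksize=1000000):
    """Count the rows pd.read_fwf yields, reading chunksize rows at a time."""
    total = 0
    reader = pd.read_fwf(path, widths=widths, names=names, chunksize=chunksize)
    for chunk in reader:
        total += len(chunk)
    return total


# Usage, with the names from the example above:
#   raw = count_raw_lines(file_name)
#   parsed = count_fwf_rows(file_name, fwidths,
#                           ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
#   print(raw, parsed)
```

If the two counts differ, printing the running total after each chunk narrows down roughly where in the file the parser starts to diverge.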
