Notice: this page reproduces a popular StackOverflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/27416031/

Pandas read_fwf not Loading Entire Content of File

Tags: python, parsing, pandas, fixed-width

Asked by eroma934

I have a rather large fixed-width file (~30M rows, 4 GB). When I tried to create a DataFrame from it using pandas read_fwf(), only a portion of the file was loaded. Has anyone had a similar issue with this parser not reading the entire contents of a file?

import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3,7,9,11,51,51]

df = pd.read_fwf(file_name, widths=fwidths, names=["col0", "col1", "col2", "col3", "col4", "col5"])
print(df.shape)  # fewer than 30M rows

If I naively read the file into a single column using read_csv(), the whole file is read into memory and no data is lost.

import pandas as pd

file_name = r"C:\....\file.txt"

df = pd.read_csv(file_name, delimiter="|", names=["col0"])  # arbitrary delimiter (the file doesn't include pipes)
print(df.shape)  # ~30M rows
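
To put a number on the discrepancy described above, one option is to count the physical lines in the file and compare that with the number of rows read_fwf returns. The following is only a minimal sketch: count_lines, the demo file, and its two-column layout are hypothetical stand-ins, not the real 4 GB file or its specs.

```python
import os
import tempfile

import pandas as pd


def count_lines(path):
    """Count physical lines by streaming the file in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Tiny stand-in file (hypothetical data) with two fixed-width columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w") as f:
        f.write("abc1234567\ndef7654321\nghi0000001\n")

    raw_lines = count_lines(demo_path)
    df_demo = pd.read_fwf(demo_path, widths=[3, 7], names=["col0", "col1"])

# Any gap between the two numbers is the number of rows the parser dropped.
print("lines in file:", raw_lines)
print("rows parsed:  ", len(df_demo))
print("missing rows: ", raw_lines - len(df_demo))
```

Streaming the file line by line keeps memory use flat, so the same count should be feasible on the full file before deciding where to look for the missing rows.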

Of course, since nobody here can see the contents or format of the file, the problem could be something on my end, but I wanted to find out whether anyone else has run into this in the past. As a sanity check, I tested a couple of rows deep in the file, and they all appear to be formatted correctly (further verified when I was able to pull the file into an Oracle DB with Talend using the same specs).
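
The spot check on a few rows could be widened to every row by tallying line lengths. This is a rough sketch, assuming that a malformed row would show up as a length other than the sum of the declared widths (3+7+9+11+51+51 = 132); line_length_counts and the sample rows are made up for illustration.

```python
from collections import Counter


def line_length_counts(lines):
    """Tally line lengths, ignoring the trailing newline characters."""
    return Counter(len(line.rstrip("\r\n")) for line in lines)


fwidths = [3, 7, 9, 11, 51, 51]  # widths from the question
expected_len = sum(fwidths)      # 132 characters per well-formed row

# Hypothetical sample: two well-formed rows and one truncated row.
sample = ["x" * 132 + "\n", "y" * 132 + "\r\n", "z" * 40 + "\n"]
counts = line_length_counts(sample)

# Lengths that differ from the expected width are candidates for a closer look.
unexpected = {n: c for n, c in counts.items() if n != expected_len}
print(counts)
print(unexpected)
```

For the real file, the open file object can be passed directly, for example `with open(file_name) as f: counts = line_length_counts(f)`. An unexpected length is only a hint: rows with trimmed trailing spaces are shorter but may still parse correctly.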

Let me know if anyone has any ideas. It would be great to run everything in Python rather than go back and forth once I begin developing analytics.

Accepted answer by Marcin

A few lines of the input file would be useful to see what the data looks like. Nevertheless, I generated a random file in a format similar (I think) to yours and applied pd.read_fwf to it. This is the code for generating and reading it:

from random import random

import pandas as pd


file_name = r"/tmp/file.txt"

lines_no = int(30e6)

with open(file_name, 'w') as f:
    for i in range(lines_no):
        if i%int(1e5) == 0:
            print("Writing progress: {:0.1f}%"
                    .format(float(i) / float(lines_no)*100), end='\r')
        f.write(" ".join(["{:<10.8f}".format(random()*10) for v in range(6)])+"\n")


print("File created. Now read it using pd.read_fwf ...")

fwidths = [11,11,11,11,11,11]

df = pd.read_fwf(file_name, widths = fwidths,
               names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])


#print(df)

print(df.shape)  # (30000000, 6) on my setup

So in this case, it seems to be working fine. I use Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. It takes a while to create the file and read it using pd.read_fwf, but it seems to work, at least for me and my setup.

The result is: (30000000, 6)

Example file created:

7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905
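
Since creating and reading the full 30M-row file takes a while, a scaled-down variant of the same test can be handy for quick checks. The sketch below is an assumption-laden variation on the code above, not the original experiment: it writes only 1000 rows to a temporary file and counts rows by passing chunksize to pd.read_fwf, so the whole DataFrame is never held in memory at once. The row count and chunk size are arbitrary choices.

```python
import os
import tempfile
from random import random

import pandas as pd

lines_no = 1000  # scaled down from the 30e6 rows used in the answer
names = ["col0", "col1", "col2", "col3", "col4", "col5"]

with tempfile.TemporaryDirectory() as tmp_dir:
    small_path = os.path.join(tmp_dir, "file.txt")

    # Same row format as the answer: six left-aligned floats separated by spaces.
    with open(small_path, "w") as f:
        for _ in range(lines_no):
            row = " ".join("{:<10.8f}".format(random() * 10) for _ in range(6))
            f.write(row + "\n")

    # With chunksize, read_fwf yields DataFrames of at most 250 rows each.
    total_rows = 0
    for chunk in pd.read_fwf(small_path, widths=[11] * 6, names=names, chunksize=250):
        total_rows += len(chunk)

print(total_rows)
```

If the total from the chunked loop matches the number of lines written, the parser is reading every row for this format; a mismatch on the real file would point to something specific in its contents.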