How to read a fixed-width-format text file in Pandas

Question

Asked by user1234440

I just got my hands on pandas and am figuring out how to read a file. The file is from the WRDS database and is the S&P 500 constituents list going all the way back to the 1960s. I checked the file, and no matter what I do to import it using 'read_csv', I still can't display the data correctly.

df = read_csv('sp500-sb.txt')

df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 0 to 1230
Data columns: gvkeyx      from      thru     conm
                                        gvkey      co_conm
...(the column names)
dtypes: object(1)

What does the above chunk of output mean? Any help would be appreciated.

Answer 1

Accepted answer by user1234440

Wes answered me in an email. Cheers.


This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.
I recommend following the issue on github as your e-mail is about to disappear from my inbox :)
https://github.com/pydata/pandas/issues/920
best, Wes

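The conversion Wes mentions does not have to be done with unix tools. A minimal Python sketch is below; the helper name fwf_to_csv, the sample rows, and the column offsets are all invented for illustration, since the real layout of the WRDS file is not shown in the question.

```python
import csv
import io


def fwf_to_csv(src, dst, colspecs):
    """Copy fixed-width lines from src to dst as comma-separated rows.

    colspecs is a list of (start, end) character offsets, end exclusive.
    This helper is hypothetical; it is not part of pandas.
    """
    writer = csv.writer(dst, lineterminator="\n")
    for line in src:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # slice each field out of the line and drop the padding spaces
        writer.writerow([line[start:end].strip() for start, end in colspecs])


# Made-up sample data with three aligned columns.
src = io.StringIO(
    "gvkeyx from     thru\n"
    "000003 19640331 19700101\n"
)
dst = io.StringIO()
fwf_to_csv(src, dst, [(0, 6), (7, 15), (16, 24)])
print(dst.getvalue())
```

With real files, src and dst would be file objects opened for reading and writing, and the result can then be loaded with read_csv.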

Answer 2

Answered by WoodChopper

pandas.read_fwf() was added in pandas 0.7.3 (April 2012) to handle fixed-width files.

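For anyone on a current pandas, a minimal sketch of calling read_fwf might look like the following. The sample rows and the colspecs offsets are assumptions made up for this example; the actual column positions in sp500-sb.txt would need to be checked against the file.

```python
import io

import pandas as pd

# Made-up sample in a fixed-width layout; the real WRDS file may differ.
sample = (
    "gvkeyx from     thru     conm\n"
    "000003 19640331 19700101 S&P 500 Comp-Ltd\n"
    "000003 19700102 19991231 S&P 500 Comp-Ltd\n"
)

# (start, end) character offsets for each column, end exclusive.
colspecs = [(0, 6), (7, 15), (16, 24), (25, 45)]

# The first line is used as the header by default.
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs)

# The date columns arrive as integers such as 19640331; parse them explicitly.
df["from"] = pd.to_datetime(df["from"].astype(str), format="%Y%m%d")
df["thru"] = pd.to_datetime(df["thru"].astype(str), format="%Y%m%d")

print(df)
```

A file path can be passed in place of the StringIO object. If the offsets are not known, read_fwf can also try to infer them from the first rows when colspecs is left at its default.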

Answer 3

Answered by TR.

user, if you need to deal with the fixed format right now, you can use something like the following:


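def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
    # fields is a list of character offsets; consecutive pairs delimit one column
    # note: the file stays open until the returned generator is exhausted
    reader = open(filename, 'r')
    # skip first rows (e.g. a header line)
    for i in range(ignore_first_rows):
        next(reader)
    if first_column_is_index:
        # the first field becomes the row label, the remaining fields the values
        index = slice(0, fields[1])
        fields = [slice(*x) for x in zip(fields[1:-1], fields[2:])]
        return ((line[index], [line[x].strip() for x in fields]) for line in reader)
    else:
        # no index column: label each row by its position in the file
        fields = [slice(*x) for x in zip(fields[:-1], fields[1:])]
        return ((i, [line[x].strip() for x in fields]) for i, line in enumerate(reader))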

Here's a test program:


import pandas
import numpy
import tempfile

# create a data frame and write its text rendering to a temporary file
df = pandas.DataFrame(numpy.random.randn(100, 5))
file_ = tempfile.NamedTemporaryFile(mode='w', delete=True)
file_.write(df.to_string())
file_.flush()

# specify fields (character offsets where each column starts and ends)
fields = [0, 3, 12, 22, 32, 42, 52]
items = fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1)
# DataFrame.from_items was removed in pandas 1.0; from_dict with orient='index' builds one row per item
df2 = pandas.DataFrame.from_dict(dict(items), orient='index')

# need to specify the datatypes, otherwise everything is a string
df2 = df2.astype(float)
df2.index = [int(x) for x in df2.index]

# check
assert (df - df2).abs().max().max() < 1E-6

This should do the trick if you need it right now, but bear in mind that the function above is very simple, in particular it doesn't do anything about data types.

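Since the function returns every field as a string, one way to recover numeric types afterwards is to try pandas.to_numeric on each column. This is only a sketch: the helper name coerce_numeric and the sample frame are invented here, and columns that fail to convert are simply left as they are.

```python
import pandas as pd


def coerce_numeric(frame):
    """Return a copy with every fully numeric column converted.

    Hypothetical helper, not part of pandas.
    """
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns as strings
    return out


# Made-up frame of strings, as the slicing function above would produce.
raw = pd.DataFrame({"a": ["1.5", "-2.25"], "b": ["3", "4"], "name": ["x", "y"]})
typed = coerce_numeric(raw)
print(typed.dtypes)
```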

Answer 4

Answered by TR.

What do you mean by display? Doesn't df['gvkey'] give you the data in the gvkey column?

If what you do is print the whole data frame to the console, then take a look at df.to_string(), but it'll be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:


import pandas
import numpy

df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)])
df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)])

print(df1)   # <--- substitute df2 to see the difference
print()
print(df1['col1'])
print()
print(df1.to_string())
