Python: Importing a text file: No columns to parse from file

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40193452/

Time: 2020-08-19 23:15:21  Source: igfitidea

Importing text file: No columns to parse from file

python pandas hadoop-streaming

Asked by mezz

I am trying to read input from sys.stdin. This is a map-reduce program for Hadoop. The input file is in txt format. Preview of the data set:

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013
62  257 2   879372434
286 1014    5   879781125
200 222 5   876042340
210 40  3   891035994
224 29  3   888104457
303 785 3   879485318
122 387 5   879270459
194 274 2   879539794
291 1042    4   874834944

The code I have been trying:

import sys
import pandas as pd

df = pd.read_csv(sys.stdin, error_bad_lines=False)

I have also tried delimiter='\t', header=False, and defining column names. Nothing seems to work; the error I am getting is:

[root@sandbox lab]# cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py
Traceback (most recent call last):
  File "/root/lab/mid-1-reducer.py", line 8, in <module>
    df = pd.read_csv(sys.stdin,delimiter='\t')
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 388, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
    self._make_engine(self.engine)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file

However, when I try this directly in Python (not in Hadoop), it works fine.

I have looked through StackOverflow posts; one of them suggested using try and except. Applying that leaves me with an empty file. Can anybody help? Thanks

Accepted answer by DerWeh

Using try and except only lets you handle errors and continue in spite of them. It won't magically fix your errors.

read_csv expects CSV files (comma-separated by default), which your input is obviously not. A quick look into the documentation:

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

This looks like the right argument. Use:

pandas.read_csv(filepath_or_buffer, delim_whitespace=True).

Using delimiter='\t' should also work, unless the tabs have been expanded (replaced by spaces). As we can't really tell from the preview, delim_whitespace seems to be the better option.
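As a minimal sketch of the whitespace-delimited read described above, the following parses a few rows from the question's preview out of an in-memory buffer instead of sys.stdin. It uses sep=r"\s+", which the documentation quoted above gives as the equivalent of delim_whitespace=True. The helper name read_whitespace_table and the SAMPLE constant are assumptions made for illustration; they do not come from the original post.

```python
import io

import pandas as pd

# Three rows copied from the question's preview. The columns are separated
# by runs of spaces here, which is what expanded tabs would look like.
SAMPLE = """196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
"""


def read_whitespace_table(buffer):
    """Parse a headerless table whose columns are split by spaces or tabs.

    Hypothetical helper for illustration only.
    """
    # sep=r"\s+" treats any run of whitespace as a single separator;
    # header=None because the data has no header row.
    return pd.read_csv(buffer, sep=r"\s+", header=None)


df = read_whitespace_table(io.StringIO(SAMPLE))
print(df.shape)
```

In the reducer itself, the same helper could be handed sys.stdin in place of the io.StringIO buffer.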

If this doesn't help, just print out the contents of sys.stdin to check whether the text is actually being passed to your script.
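One possible way to do that check, sketched under a few assumptions: the helper peek_stream is hypothetical, and it writes its diagnostics to a separate log stream (sys.stderr by default) so that the script's standard output, which the next stage of a pipe reads, stays untouched. The demonstration uses in-memory streams; in a real script, sys.stdin would be passed instead.

```python
import io
import sys


def peek_stream(stream, log=sys.stderr):
    """Read everything from `stream`, report what arrived, and return it
    wrapped in a fresh buffer so it can still be parsed afterwards.

    Hypothetical helper for illustration only.
    """
    text = stream.read()
    if not text.strip():
        log.write("input is empty: nothing reached this stage\n")
    else:
        # repr() makes tabs visible as \t, so the delimiter can be verified.
        log.write("first line: %r\n" % text.splitlines()[0])
    return io.StringIO(text)


# Demonstration with in-memory streams instead of sys.stdin.
log = io.StringIO()
buf = peek_stream(io.StringIO("196\t242\t3\t881250949\n"), log)
empty = peek_stream(io.StringIO(""), log)
print(log.getvalue())
```

The returned buffer can then be given to pandas.read_csv in place of sys.stdin, since the original stream has already been consumed.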

Edit: I just saw that you use:

cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py

Is this intended? This way, mid-1-reducer.py processes the output of mid-1-mapper.py, not the contents of u.data. If the mapper prints nothing, the reducer receives empty input, which is exactly the situation EmptyDataError reports. If you want to process the content of the file u.data, consider reading the file and not sys.stdin.
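To illustrate the point about the pipe, here is a sketch of a mapper that actually writes something for the reducer to read. The asker's real mapper is not shown in the post, so run_mapper and its behavior (passing each row through with its fields re-joined by tabs) are purely assumptions for illustration.

```python
import io


def run_mapper(lines, out):
    """Write every non-blank input row to `out`, fields joined by tabs.

    Hypothetical pass-through mapper; whatever it writes is the only
    input the next stage of the pipe will see.
    """
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if not fields:
            continue  # skip blank lines
        out.write("\t".join(fields) + "\n")


# Demonstration with in-memory streams. In a real mapper script the call
# would be run_mapper(sys.stdin, sys.stdout).
out = io.StringIO()
run_mapper(io.StringIO("196 242 3   881250949\n\n22  377 1   878887116\n"), out)
print(out.getvalue())
```

With output in this shape, a reducer reading sys.stdin with delimiter='\t' would have rows to parse.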

Answer by Grainier

You have to set delim_whitespace to True to use whitespace as the separator.

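import sys
import pandas as pd

if __name__ == '__main__':
    # header=None: the data has no header row.
    # delim_whitespace=True: split columns on any run of spaces or tabs.
    df = pd.read_csv(sys.stdin, header=None, delim_whitespace=True)
    print(df)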