Python import text file: No columns to parse from file

Question

Asked by mezz

I am trying to read input from sys.stdin. This is a map-reduce program for Hadoop. The input is a plain-text file. Preview of the data set:

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013
62  257 2   879372434
286 1014    5   879781125
200 222 5   876042340
210 40  3   891035994
224 29  3   888104457
303 785 3   879485318
122 387 5   879270459
194 274 2   879539794
291 1042    4   874834944

Code that I have been trying:

import sys
import pandas as pd
df = pd.read_csv(sys.stdin, error_bad_lines=False)

I have also tried delimiter='\t', header=False, and defining column names. Nothing seems to work. The error I am getting is:

[root@sandbox lab]# cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py
Traceback (most recent call last):
  File "/root/lab/mid-1-reducer.py", line 8, in <module>
    df = pd.read_csv(sys.stdin,delimiter='\t')
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 388, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
    self._make_engine(self.engine)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file

However, when I try this directly in Python (not in Hadoop), it works fine.

I have looked through Stack Overflow posts, and one of them suggested using try and except. Applying that leaves me with an empty file. Can anybody help? Thanks

Answer 1

Accepted answer by DerWeh

Using try and except just lets you continue in spite of errors and handle them. It won't magically fix your errors.

By default, read_csv expects comma-separated files, which your input obviously is not. A quick look into the documentation:

delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

This seems like the right argument. Use

pandas.read_csv(filepath_or_buffer, delim_whitespace=True)

Using delimiter='\t' should also work, unless the tabs have been expanded (replaced by spaces). As we can't really tell from the preview, delim_whitespace seems to be the better option.
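To make the difference concrete, here is a minimal sketch that parses a few rows from the preview, one separated by tabs and two by runs of spaces. It uses sep=r"\s+", which the documentation quoted above describes as equivalent to delim_whitespace=True (newer pandas releases deprecate delim_whitespace in favor of this spelling). The column names are hypothetical and chosen only for illustration, and the sketch assumes Python 3 with pandas installed.

```python
import io

import pandas as pd

# A few rows from the question's preview; the first row uses tabs,
# the others use runs of spaces (as if the tabs had been expanded).
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186 302 3   891717742\n"
    "22  377 1   878887116\n"
)

# Hypothetical column names, not taken from the original post.
COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def read_whitespace_table(buffer):
    """Parse a header-less table whose fields are separated by any whitespace."""
    return pd.read_csv(buffer, sep=r"\s+", header=None, names=COLUMNS)


df = read_whitespace_table(io.StringIO(SAMPLE))
print(df)
```

In the reducer, sys.stdin would take the place of the io.StringIO object.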

If this doesn't help, just print out the contents of sys.stdin to check whether the text is actually being passed to your script.
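One way to do that check without losing the data is sketched below. The helper peek_stream is a hypothetical name, not part of pandas or Hadoop: it reads the whole stream, logs a line count and the first few lines to stderr (so the output piped to the next stage stays clean), and returns a copy that can still be parsed. The sketch assumes Python 3 and uses an in-memory stand-in for sys.stdin so that it runs on its own.

```python
import io
import sys


def peek_stream(stream, limit=5, log=sys.stderr):
    """Read the whole stream, log its first lines, and return a replayable copy."""
    text = stream.read()
    lines = text.splitlines()
    log.write("received %d line(s)\n" % len(lines))
    for line in lines[:limit]:
        # repr() shows tabs as \t, so tabs expanded to spaces are easy to spot.
        log.write(repr(line) + "\n")
    return io.StringIO(text)


# Stand-in for sys.stdin; in the reducer this would be peek_stream(sys.stdin).
fake_stdin = io.StringIO("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")
buffer = peek_stream(fake_stdin)
```

A report of zero lines means that nothing reached the script, and no read_csv option can fix that.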

Edit: I just saw that you use

cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py

Is this intended? This way, mid-1-reducer.py processes the output of mid-1-mapper.py, not the file itself. If you want to process the content of the file u.data, consider reading the file and not sys.stdin.
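The traceback is consistent with the reducer receiving an empty stream, which is what would happen if the mapper printed nothing. That is an assumption, since the mapper's code is not shown. The sketch below reproduces the error with an empty in-memory buffer; read_or_none is a hypothetical helper name, and the sketch assumes Python 3 with a pandas version that exposes pandas.errors.EmptyDataError.

```python
import io

import pandas as pd


def read_or_none(buffer):
    """Return a DataFrame, or None when the buffer holds no data at all."""
    try:
        return pd.read_csv(buffer, sep=r"\s+", header=None)
    except pd.errors.EmptyDataError:
        # pandas raises this ("No columns to parse from file") on empty input.
        return None


# An upstream stage that prints nothing leaves the reader with an empty stream.
empty = read_or_none(io.StringIO(""))
filled = read_or_none(io.StringIO("196 242 3 881250949\n"))
print(empty is None, filled.shape)
```

Catching the exception only reports the problem. The actual fix is to make sure the previous stage in the pipe emits the rows the reducer is supposed to parse.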

Answer 2

Answer by Grainier

You have to set delim_whitespace to True to use whitespace as the separator.

import sys
import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv(sys.stdin, header=None, delim_whitespace=True)
    print(df)  # parenthesized form runs on both Python 2 and Python 3
