Read CSV file using Pandas: complex separator
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/30248128/
Asked by Mohammad Saifullah
I have a CSV file that I want to read using Python pandas. The header and lines look like the following:
A ^B^C^D^E ^F ^G ^H^I^J^K^L^M^N
Clearly the separator is ^, but sometimes there are odd spaces around it. How can I read this file correctly?
I am using the following command to read the csv file:
df = pd.read_csv('input.csv', sep='^')
Answered by EdChum
Use the regex \s*\^, which means zero or more whitespace characters followed by ^. You have to specify the python engine here to avoid a warning about regex support:
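In [152]:
import io
import pandas as pd

t="""A ^B^C^D^E ^F ^G ^H^I^J^K^L^M^N"""
# raw string, so the regex backslashes are not read as string escapes
df = pd.read_csv(io.StringIO(t), sep=r'\s*\^', engine='python')
df.columns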
Out[152]:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'], dtype='object')
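The same separator can be checked against data rows as well as the header. The following is a minimal sketch, assuming made-up sample rows (the values are not from the question) and only three columns for brevity:

```python
import io
import pandas as pd

# Hypothetical sample: a header plus two rows with stray spaces before the carets.
sample = "A ^B^C\n1 ^2^3\n4^5 ^6\n"

# Zero or more whitespace characters followed by a literal caret.
df = pd.read_csv(io.StringIO(sample), sep=r"\s*\^", engine="python")

print(list(df.columns))
print(df.values.tolist())
```

Note that this pattern only absorbs whitespace before a caret. If spaces can also follow the caret, a pattern such as \s*\^\s* would be needed instead.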
Answered by Malik Brahimi
Can't you supply a regex as the separator?
sep = r'[\^\s]+'  # pass the pattern as a plain string; read_csv documents sep as a str
Answered by Zachary Cross
Your separator can be a regular expression, so try something like this:
df = pd.read_csv('input.csv', sep="[ ^]+", engine='python')
The regular expression treats any run of consecutive spaces or carets (^) as a single separator.
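To see what this character class does to a single line, the pattern can be tried with Python's re module directly. This is only an illustration on made-up strings, assuming pandas applies the pattern to each line in a comparable way:

```python
import re

pattern = re.compile(r"[ ^]+")

# A run of spaces and carets counts as one separator.
fields = pattern.split("A ^B^C ^D")
print(fields)

# Caveat: two carets in a row are also one separator,
# so an empty field between them disappears.
collapsed = pattern.split("1^^3")
print(collapsed)
```

If the file can contain empty fields, a pattern that matches exactly one caret per separator (such as the \s*\^ form) is the safer choice.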
Answered by Alexander
Read the file as you have done and then strip extra whitespace for each column which is a string:
df = (pd.read_csv('input.csv', sep="^")
        # apply passes each column as a Series, so the str check is done per value
        .apply(lambda x: x.map(lambda v: v.strip() if isinstance(v, str) else v)))
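A self-contained sketch of this approach on made-up data (the sample values are assumptions, not from the question). It also strips the column labels, which come back with the same stray spaces and are not touched by apply:

```python
import io
import pandas as pd

# Hypothetical sample with stray spaces around labels and values.
sample = "A ^B^C\nx ^1^ y\nz^2^w \n"

df = pd.read_csv(io.StringIO(sample), sep="^")

# Strip the column labels, which apply() over values does not touch.
df.columns = df.columns.str.strip()

# Strip string values one by one; non-strings pass through unchanged.
df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v))

print(list(df.columns))
print(df.values.tolist())
```

Checking each value with isinstance keeps numeric columns and missing values intact, at the cost of being slower than a vectorized .str.strip() on large files.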
Answered by user2030378
If the only whitespace in your file is the extra whitespace between columns (i.e. no columns have raw text with spaces), an easy fix would be to simply remove all the spaces in the file. An example command to do that would be:
<input.csv tr -d '[:blank:]' > new_input.txt
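The same cleanup can be done in Python before parsing, without writing a second file. A minimal sketch, assuming the content fits in memory and using a made-up sample string in place of the real file:

```python
import io
import pandas as pd

# Hypothetical file content; in practice this would come from open('input.csv').read().
raw = "A ^B^C\n1 ^2^3\n4^5 ^6\n"

# Delete spaces and tabs (the characters in the POSIX blank class); newlines are kept.
cleaned = raw.translate(str.maketrans("", "", " \t"))

# With the blanks gone, a plain caret separator is enough.
df = pd.read_csv(io.StringIO(cleaned), sep="^")

print(cleaned)
print(list(df.columns))
```

As with the shell command, this is only safe when no field legitimately contains spaces.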

