Python: Customizing the separator in pandas read_csv
Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/41235111/
Customizing the separator in pandas read_csv
Asked by Peaceful
I am reading many different data files into various pandas DataFrames. The columns in these files are separated by spaces. However, the number of spaces differs from file to file (some use a single space, others two spaces, and so on). Thus, every time I import a file, I have to open it manually, check how many spaces are used, and pass that many spaces in sep:
import pandas as pd
# The number of spaces in sep has to be adjusted by hand for each file
df = pd.read_csv('myfile.dat', sep=' ')
Is there any way I can tell pandas to treat "any number of spaces" as the separator? Also, is there any way I can tell pandas to use either tabs (\t) or spaces as the separator?
Accepted answer by Ted Petrou
Yes, you can use a simple regular expression like sep='\s+' to denote one or more whitespace characters. Because \s matches tabs as well as spaces, this also covers files that mix the two.
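As a minimal sketch of this answer, the snippet below parses an in-memory string instead of a file. The sample data and the column names x, y, z are made up for illustration; only sep=r'\s+' comes from the answer. A raw string is used so that Python does not try to interpret \s as an escape sequence.

```python
import io

import pandas as pd

# Hypothetical sample data: columns are separated by a mix of single
# spaces, multiple spaces, and tabs.
raw = "x  y\tz\n1 2   3\n4\t5 6\n"

# r"\s+" treats any run of whitespace (spaces or tabs) as one separator.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```

The same call should work with a file path such as 'myfile.dat' in place of the StringIO object.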
Answer by piRSquared
You can also use the parameter skipinitialspace=True, which skips the leading spaces after any delimiter.
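A small sketch of how this option might be combined with a single-space separator, again using made-up in-memory data. With sep=' ', the first space ends a field and skipinitialspace=True discards the spaces that follow it, so runs of spaces behave like one delimiter.

```python
import io

import pandas as pd

# Hypothetical sample data: a varying number of spaces between columns,
# with no tabs and no trailing spaces.
raw = "x  y   z\n1 2    3\n4   5 6\n"

# The single space is the delimiter; any extra spaces right after it
# are skipped instead of producing empty fields.
df = pd.read_csv(io.StringIO(raw), sep=" ", skipinitialspace=True)
print(df)
```

Note that this approach handles spaces only, not tabs, and lines that end in trailing spaces may still yield an extra empty column.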
Answer by nlahri
You can directly use delim_whitespace:
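import pandas as pd
# Note: delim_whitespace is deprecated as of pandas 2.2;
# sep=r'\s+' is the documented replacement.
df = pd.read_csv('myfile.dat', delim_whitespace=True)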
The argument delim_whitespace controls whether or not whitespace (e.g. ' ' or '\t') will be used as the separator. See the pandas.read_csv documentation for details.
Answer by Dustin Williams
One thing I found is that if you use an unsupported separator, Pandas/Dask will have to use the Python engine instead of the C engine, which is a good deal slower.
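To illustrate the engine point, here is a rough sketch with made-up data. In pandas, sep=r'\s+' is a special case that the C engine handles, while a general regular expression such as r'[ \t]+' (chosen here only as an example) needs the Python engine, which is requested explicitly below. Both calls are expected to produce the same result on this input; only the parsing speed on large files should differ.

```python
import io

import pandas as pd

# Hypothetical sample data mixing spaces and tabs.
raw = "x  y\tz\n1 2   3\n4\t5 6\n"

# r"\s+" is special-cased, so the fast C engine can be used.
df_c = pd.read_csv(io.StringIO(raw), sep=r"\s+", engine="c")

# A general regex separator is not supported by the C engine,
# so the slower Python engine is selected explicitly.
df_py = pd.read_csv(io.StringIO(raw), sep=r"[ \t]+", engine="python")

print(df_c.equals(df_py))
```

When whitespace is the only separator needed, preferring r'\s+' keeps the parse on the C engine.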