Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/50628861/

Date: 2020-09-14 05:37:38  Source: igfitidea

.dat file import in pandas

python, python-3.x, pandas, dataframe

Asked by Mrowkacala

I want to import this publicly available file using pandas. I tried reading it simply as a CSV (I just renamed .dat to .csv):

clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")

However, in some cases the county name is composed of two words, not just one. In those cases the extra word shifts the rest of the row one column to the right (for example, the name "Hot Springs" ends up split across two columns). How can I fix this for the entire dataset at once?

Answered by Scott Boston

No need to rename the .dat to .csv. Instead you can use a regex that matches two or more spaces as a column separator.

Try using the sep parameter:

# Regex separators are handled by the python engine; the raw string keeps \s intact.
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+', engine='python')

Output:

            0      1     2      3      4     5      6      7     8     9    10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141
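
To see why the two-or-more-spaces pattern helps, here is a minimal offline sketch that reads an in-memory sample instead of the URL. The SAMPLE text is an assumption made for illustration: the first two rows reuse values from the output above (truncated to four fields), and the "Hot Springs, AR" row uses made-up placeholder numbers.

```python
import io

import pandas as pd

# Illustrative sample: fields are separated by two or more spaces.
# The "Hot Springs, AR" numbers are made-up placeholders, not real data.
SAMPLE = (
    "Autauga, AL      30.92  31.7  57623\n"
    "Baldwin, AL      26.24  35.5  84935\n"
    "Hot Springs, AR  0.00   0.0   0\n"
)

# Splitting on any run of whitespace also splits inside "Hot Springs",
# so that row ends up one field longer than the others.
naive_fields = [line.split() for line in SAMPLE.splitlines()]
print([len(fields) for fields in naive_fields])

# Splitting only on runs of two or more whitespace characters keeps
# names that contain a single space in one field.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, sep=r"\s\s+", engine="python")
print(df)
```

Every row then has the same number of columns, with the full "county, state" label in column 0.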

If you want the state as a separate column, you can use sep='\s\s+|,', which means: split columns on two or more spaces OR on a comma.

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+|,', engine='python')

Output:

        0    1      2     3      4        5     6      7      8     9     10     11
0  Autauga   AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin   AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour   AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount   AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock   AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0
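
One thing to watch with this pattern: in the file the comma is followed by a single space, so the state value will likely keep a leading space (" AL"). A possible variation, sketched below on an in-memory sample, lets the comma branch also consume any whitespace that follows it. The SAMPLE text and the alternative pattern are assumptions for illustration, not part of the original answer; the rows reuse values from the output above.

```python
import io

import pandas as pd

# Illustrative sample with rows shortened from the output above.
SAMPLE = (
    "Autauga, AL  30.92  31.7  57623\n"
    "Baldwin, AL  26.24  35.5  84935\n"
)

# Pattern from the answer: the comma is a separator, but the single
# space after it stays attached to the state field.
with_space = pd.read_csv(io.StringIO(SAMPLE), header=None,
                         sep=r"\s\s+|,", engine="python")
print(repr(with_space.loc[0, 1]))

# Variation: ",\s*" consumes the comma plus any whitespace after it,
# so the state field starts directly at the first letter.
trimmed = pd.read_csv(io.StringIO(SAMPLE), header=None,
                      sep=r",\s*|\s\s+", engine="python")
print(repr(trimmed.loc[0, 1]))
```

If you prefer to keep the original pattern, stripping the column afterwards with the .str.strip() accessor should give the same result.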

Answered by Tim Nixon

You can use a regular expression as a separator. In your specific case, all the delimiters are more than one space whereas the spaces in the names are just single spaces.

import pandas as pd

clinton = pd.read_csv("clinton1.csv", sep=r'\s{2,}', header=None, engine='python')
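
The pattern \s{2,} used here and the pattern \s\s+ both mean "two or more whitespace characters", so either should split this file the same way. A quick check with Python's re module, using a made-up line with placeholder values (the line variable is an assumption for illustration):

```python
import re

# Made-up line: single space inside the name, two or more between fields.
line = "Hot Springs, AR   0.00  0.0"

fields_braces = re.split(r"\s{2,}", line)
fields_plus = re.split(r"\s\s+", line)

# Both patterns produce the same fields, with the name kept whole.
print(fields_braces)
print(fields_braces == fields_plus)
```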