.dat file import in pandas
Original question: http://stackoverflow.com/questions/50628861/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Mrowkacala
I want to import this publicly available file using pandas. I tried reading it simply as a CSV (I just renamed the .dat file to .csv):
clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")
However, in some cases the county name is composed of two words, not just one. In those cases the rest of the row is shifted one column to the right in my data frame. It looks like this (the name Hot Springs ends up in two columns). How can I fix this for the entire dataset at once?
Answered by Scott Boston
No need to rename the .dat to .csv. Instead, you can use a regex that matches two or more whitespace characters as the column separator.
Try using the sep parameter:
import pandas as pd
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+', engine='python')
Output:
             0      1     2      3      4     5      6      7     8     9   10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141
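To try the separator without downloading anything, here is a minimal sketch that parses a small in-memory sample. The rows are made up for illustration (the "Hot Springs, AR" values in particular are invented) and only mimic the layout described above: fields separated by two or more spaces, names containing single spaces.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the .dat file (illustrative values only):
# fields are separated by two or more spaces, names contain single spaces.
sample = "\n".join([
    "Autauga, AL  30.92  31.7  57623",
    "Baldwin, AL  26.24  35.5  84935",
    "Hot Springs, AR  40.00  35.0  50000",
])

# A raw string keeps the backslashes intact for the regex engine.
df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s\s+", engine="python")
print(df)
```

Because a single space never matches the pattern, the two-word name should stay in column 0 instead of pushing the numbers to the right.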
If you want the state as a separate column, you can use sep=r'\s\s+|,', which splits columns on two or more spaces OR a comma.
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+|,', engine='python')
Output:
         0   1      2     3      4        5     6      7      8     9    10     11
0  Autauga  AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin  AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour  AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount  AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock  AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0
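One detail worth checking when splitting on the comma: the single space that follows it is not matched by the pattern, so it may remain attached to the state code. The sketch below, again on made-up sample rows rather than the real file, applies the same separator and then strips that column as a precaution.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the .dat file (illustrative values only).
sample = "\n".join([
    "Autauga, AL  30.92  31.7  57623",
    "Hot Springs, AR  40.00  35.0  50000",
])

# Split on two or more whitespace characters OR on a comma.
df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s\s+|,", engine="python")

# The lone space after the comma is not a separator on its own,
# so remove any leading/trailing whitespace from the state column.
df[1] = df[1].str.strip()
print(df)
```

Note that this pattern splits on every comma, so it assumes names themselves never contain one.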
Answered by Tim Nixon
You can use a regular expression as a separator. In your specific case, all the delimiters are more than one space whereas the spaces in the names are just single spaces.
import pandas as pd
clinton = pd.read_csv("clinton1.csv", sep=r'\s{2,}', header=None, engine='python')
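The pattern \s{2,} used here and the \s\s+ from the other answer both mean "two or more whitespace characters". A quick way to convince yourself is to split one line with Python's re module; the row below is made up for illustration.

```python
import re

# Made-up row: single spaces inside the name, wider gaps between fields.
line = "Hot Springs, AR  40.00   35.0  50000"

fields_a = re.split(r"\s{2,}", line)
fields_b = re.split(r"\s\s+", line)

print(fields_a)
print(fields_a == fields_b)
```

Both calls should yield the same four fields, with the name kept whole.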