Importing a .dat file in pandas

Question

Asked by Mrowkacala

I want to import this publicly available file using pandas. I simply read it as a CSV (I just renamed the file from .dat to .csv):

clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")

However, in some cases the county name consists of two words rather than one. In those rows the data is shifted one column to the right (for example, the name Hot Springs ends up split across two columns). How can I fix this for the entire dataset at once?

Answer 1

Answered by Scott Boston

No need to rename the .dat to .csv. Instead you can use a regex that matches two or more spaces as a column separator.

Try using the sep parameter:

# A raw string keeps Python from treating \s as a string escape sequence
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+', engine='python')

Output:

            0      1     2      3      4     5      6      7     8     9    10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141

If you want the state as a separate column, you can use the pattern \s\s+|, as the separator, which splits columns on two or more whitespace characters OR on a comma.

# A raw string keeps Python from treating \s as a string escape sequence
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+|,', engine='python')

Output:

        0    1      2     3      4        5     6      7      8     9     10     11
0  Autauga   AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin   AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour   AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount   AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock   AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0
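One thing to watch with the comma variant: splitting on the comma alone appears to leave the space that follows it attached to the state field (" AL" rather than "AL"). A minimal sketch of one way around that, assuming the separator is widened to a comma plus any trailing whitespace. The in-memory sample, the Hot Springs values, and the column names below are made up for illustration and are not taken from the real file:

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the file. The first two rows reuse
# values from the output above; the Hot Springs numbers are invented.
sample = io.StringIO(
    "Autauga, AL  30.92  31.7\n"
    "Baldwin, AL  26.24  35.5\n"
    "Hot Springs, AR  40.00  35.1\n"
)

# Split on a comma plus any whitespace after it, OR on two or more whitespace
# characters. A single space inside a name matches neither alternative.
df = pd.read_csv(sample, header=None, sep=r",\s*|\s{2,}", engine="python")

# Placeholder column names, chosen only for this example.
df.columns = ["county", "state", "col_2", "col_3"]
print(df)
```

If the leading space does not bother you, the simpler pattern from the answer is enough; a df[1].str.strip() on the state column afterwards would be another option.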

Answer 2

Answered by Tim Nixon

You can use a regular expression as the separator. In your specific case, every column delimiter is more than one space, whereas the spaces inside the names are single spaces.

import pandas as pd

clinton = pd.read_csv("clinton1.csv", sep=r'\s{2,}', header=None, engine='python')
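To see why the two-or-more-spaces rule works, it can help to split a single line by hand. This is only a sketch using the standard re module; the line below is a made-up example in the same layout as the file, not a real row:

```python
import re

# Hypothetical line: a two-word name, then columns separated by two spaces.
line = "Hot Springs, AR  40.0  35.1  12345"

# Splitting on any whitespace breaks the two-word name into separate fields.
naive = line.split()
print(naive)

# Splitting only on runs of two or more whitespace characters keeps it intact.
fields = re.split(r"\s{2,}", line.strip())
print(fields)
```

The same pattern passed as sep to read_csv is applied line by line, which is why it requires engine='python'.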
