如何在 Pandas 中读取 .txt

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/41509435/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 02:44:05  来源:igfitidea点击:

How to Read .txt in Pandas

pythonpython-3.xpandas

提问by Nick Duddy

I'm trying to pull a txt file which has two series of data into pandas. So far I've tried the variations below which I've source from other posts on stack. So far it will only read in as one series.

我正在尝试将一个包含两个系列数据的 txt 文件提取到 Pandas 中。到目前为止,我已经尝试了下面的变体,我从堆栈上的其他帖子中获取了这些变体。到目前为止,它只会作为一个系列阅读。

The data I'm using is available here

我正在使用的数据可在此处获得

icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", delim_whitespace=True, header=None)
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, sep="/t")
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, delimiter=r"\s+")

I'm sure I'm doing something really obviously wrong but I can't see it.

我确定我在做一些明显错误的事情,但我看不到。

回答by MaxU

try to use sep=r'\s{2,}'as separator - it means use as separator twoor more spaces or tabs:

尝试sep=r'\s{2,}'用作分隔符 - 这意味着将两个或多个空格或制表符用作分隔符:

In [28]: df = pd.read_csv(url, sep=r'\s{2,}', engine='python', header=None, names=['ID','Name'])

In [29]: df
Out[29]:
        ID                                                Name
0     A000  Cholera due to Vibrio cholerae 01, biovar cholerae
1     A001     Cholera due to Vibrio cholerae 01, biovar eltor
2     A009                                Cholera, unspecified
3    A0100                          Typhoid fever, unspecified
4    A0101                                  Typhoid meningitis
5    A0102                Typhoid fever with heart involvement
6    A0103                                   Typhoid pneumonia
7    A0104                                   Typhoid arthritis
8    A0105                               Typhoid osteomyelitis
9    A0109              Typhoid fever with other complications
10    A011                                 Paratyphoid fever A
11    A012                                 Paratyphoid fever B
12    A013                                 Paratyphoid fever C
13    A014                      Paratyphoid fever, unspecified
14    A020                                Salmonella enteritis
15    A021                                   Salmonella sepsis
16   A0220         Localized salmonella infection, unspecified
17   A0221                               Salmonella meningitis
18   A0222                                Salmonella pneumonia
19   A0223                                Salmonella arthritis
20   A0224                            Salmonella osteomyelitis
21   A0225                           Salmonella pyelonephritis
22   A0229           Salmonella with other localized infection
23    A028               Other specified salmonella infections
24    A029                   Salmonella infection, unspecified
..     ...                                                 ...
671   B188                       Other chronic viral hepatitis
672   B189                Chronic viral hepatitis, unspecified
673   B190       Unspecified viral hepatitis with hepatic coma
674  B1910  Unspecified viral hepatitis B without hepatic coma
675  B1911     Unspecified viral hepatitis B with hepatic coma
676  B1920  Unspecified viral hepatitis C without hepatic coma
677  B1921     Unspecified viral hepatitis C with hepatic coma
678   B199    Unspecified viral hepatitis without hepatic coma
679    B20          Human immunodeficiency virus [HIV] disease
680   B250                         Cytomegaloviral pneumonitis
681   B251                           Cytomegaloviral hepatitis
682   B252                        Cytomegaloviral pancreatitis
683   B258                      Other cytomegaloviral diseases
684   B259                Cytomegaloviral disease, unspecified
685   B260                                      Mumps orchitis
686   B261                                    Mumps meningitis
687   B262                                  Mumps encephalitis
688   B263                                  Mumps pancreatitis
689  B2681                                     Mumps hepatitis
690  B2682                                   Mumps myocarditis
691  B2683                                     Mumps nephritis
692  B2684                                Mumps polyneuropathy
693  B2685                                     Mumps arthritis
694  B2689                           Other mumps complications
695   B269                          Mumps without complication

[696 rows x 2 columns]

alternatively you can use read_fwf()method

或者你可以使用read_fwf()方法

回答by EdChum

Your file is a fixed width file so you can use read_fwf, here the default params are able to infer the column widths:

您的文件是固定宽度的文件,因此您可以使用read_fwf,这里的默认参数能够推断列宽:

In [106]:
df = pd.read_fwf(r'icd10cm_codes_2017.txt', header=None)
df.head()

Out[106]:
       0                                                  1
0   A000  Cholera due to Vibrio cholerae 01, biovar chol...
1   A001    Cholera due to Vibrio cholerae 01, biovar eltor
2   A009                               Cholera, unspecified
3  A0100                         Typhoid fever, unspecified
4  A0101                                 Typhoid meningitis

If you know the names you want for the column names you can pass these to read_fwf:

如果您知道列名称所需的名称,则可以将它们传递给read_fwf

In [107]:
df = pd.read_fwf(r'C:\Users\alanwo\Downloads\icd10cm_codes_2017.txt', header=None, names=['col1', 'col2'])
df.head()

Out[107]:
    col1                                               col2
0   A000  Cholera due to Vibrio cholerae 01, biovar chol...
1   A001    Cholera due to Vibrio cholerae 01, biovar eltor
2   A009                               Cholera, unspecified
3  A0100                         Typhoid fever, unspecified
4  A0101                                 Typhoid meningitis

Or just overwrite the columnsattribute after reading:

或者只是columns在阅读后覆盖属性:

df.columns = ['col1', 'col2']

As to why what you tried failed, read_tableuses tabs as the default separator but the file just has spaces and is fixed width

至于您尝试失败的原因,read_table使用制表符作为默认分隔符,但文件只有空格且宽度固定