How to Read .txt in Pandas
Original question: http://stackoverflow.com/questions/41509435/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Nick Duddy
I'm trying to pull a txt file that has two series of data into pandas. So far I've tried the variations below, which I sourced from other posts on Stack Overflow, but the file is only ever read in as one series.
The data I'm using is available here
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", delim_whitespace=True, header=None)
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, sep="/t")
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, delimiter=r"\s+")
I'm sure I'm doing something really obviously wrong but I can't see it.
Answered by MaxU
Try using sep=r'\s{2,}' as the separator; it treats a run of two or more whitespace characters (spaces or tabs) as the delimiter:
In [28]: df = pd.read_csv(url, sep=r'\s{2,}', engine='python', header=None, names=['ID','Name'])
In [29]: df
Out[29]:
ID Name
0 A000 Cholera due to Vibrio cholerae 01, biovar cholerae
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
5 A0102 Typhoid fever with heart involvement
6 A0103 Typhoid pneumonia
7 A0104 Typhoid arthritis
8 A0105 Typhoid osteomyelitis
9 A0109 Typhoid fever with other complications
10 A011 Paratyphoid fever A
11 A012 Paratyphoid fever B
12 A013 Paratyphoid fever C
13 A014 Paratyphoid fever, unspecified
14 A020 Salmonella enteritis
15 A021 Salmonella sepsis
16 A0220 Localized salmonella infection, unspecified
17 A0221 Salmonella meningitis
18 A0222 Salmonella pneumonia
19 A0223 Salmonella arthritis
20 A0224 Salmonella osteomyelitis
21 A0225 Salmonella pyelonephritis
22 A0229 Salmonella with other localized infection
23 A028 Other specified salmonella infections
24 A029 Salmonella infection, unspecified
.. ... ...
671 B188 Other chronic viral hepatitis
672 B189 Chronic viral hepatitis, unspecified
673 B190 Unspecified viral hepatitis with hepatic coma
674 B1910 Unspecified viral hepatitis B without hepatic coma
675 B1911 Unspecified viral hepatitis B with hepatic coma
676 B1920 Unspecified viral hepatitis C without hepatic coma
677 B1921 Unspecified viral hepatitis C with hepatic coma
678 B199 Unspecified viral hepatitis without hepatic coma
679 B20 Human immunodeficiency virus [HIV] disease
680 B250 Cytomegaloviral pneumonitis
681 B251 Cytomegaloviral hepatitis
682 B252 Cytomegaloviral pancreatitis
683 B258 Other cytomegaloviral diseases
684 B259 Cytomegaloviral disease, unspecified
685 B260 Mumps orchitis
686 B261 Mumps meningitis
687 B262 Mumps encephalitis
688 B263 Mumps pancreatitis
689 B2681 Mumps hepatitis
690 B2682 Mumps myocarditis
691 B2683 Mumps nephritis
692 B2684 Mumps polyneuropathy
693 B2685 Mumps arthritis
694 B2689 Other mumps complications
695 B269 Mumps without complication
[696 rows x 2 columns]
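The same call can be tried without downloading anything. The following is a minimal sketch, assuming a small hand-made sample (the `sample` string and the `df_sep` name are illustrative, not part of the original answer) that mimics the file's layout of a code, a run of spaces, and a description:

```python
import io

import pandas as pd

# Hypothetical sample: code, at least two spaces, then a description
# that itself contains single spaces and commas.
sample = (
    "A000    Cholera due to Vibrio cholerae 01, biovar cholerae\n"
    "A009    Cholera, unspecified\n"
    "A0100   Typhoid fever, unspecified\n"
)

# A regex separator requires the Python parsing engine.
df_sep = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s{2,}",
    engine="python",
    header=None,
    names=["ID", "Name"],
)
print(df_sep)
```

Because single spaces inside the description do not match `\s{2,}`, each line splits into exactly two fields. Note that this relies on every code being followed by at least two spaces.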
Alternatively, you can use the read_fwf() method.
Answered by EdChum
Your file is a fixed-width file, so you can use read_fwf; here the default parameters are able to infer the column widths:
In [106]:
df = pd.read_fwf(r'icd10cm_codes_2017.txt', header=None)
df.head()
Out[106]:
0 1
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
If you know the column names you want, you can pass them to read_fwf:
In [107]:
df = pd.read_fwf(r'C:\Users\alanwo\Downloads\icd10cm_codes_2017.txt', header=None, names=['col1', 'col2'])
df.head()
Out[107]:
col1 col2
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
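For experimenting without the real file, here is a minimal sketch of read_fwf on an in-memory sample. Two things are assumptions rather than part of the answer: the 8-character width of the code column, and the choice to pass `widths` explicitly, since width inference on a sample this small may not match what it finds on the full file. The names `rows`, `sample_fwf`, and `df_fwf` are illustrative.

```python
import io

import pandas as pd

# Hypothetical rows; each code is padded to an assumed width of 8 characters.
rows = [
    ("A000", "Cholera due to Vibrio cholerae 01, biovar cholerae"),
    ("A009", "Cholera, unspecified"),
    ("A0100", "Typhoid fever, unspecified"),
]
sample_fwf = "\n".join(f"{code:<8}{desc}" for code, desc in rows) + "\n"

# Explicit widths: 8 characters for the code, up to 80 for the description.
df_fwf = pd.read_fwf(
    io.StringIO(sample_fwf),
    widths=[8, 80],
    header=None,
    names=["col1", "col2"],
)
print(df_fwf)
```

The padding spaces around each field are stripped, so `col1` holds the bare codes.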
Or just overwrite the columns attribute after reading:
df.columns = ['col1', 'col2']
As to why what you tried failed: read_table uses tabs as the default separator, but the file contains only spaces and is fixed width.
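The effect of each separator choice can be illustrated on a single line with the standard `re` module. This is only a sketch of the splitting behaviour, not of pandas' parser itself, and the sample `line` is an assumed example of the file's layout:

```python
import re

# Hypothetical line: a code, padding spaces, then a multi-word description.
line = "A000    Cholera due to Vibrio cholerae 01, biovar cholerae"

# Splitting on tabs finds nothing to split on: the line stays one field.
by_tab = line.split("\t")

# Splitting on any whitespace breaks the description into separate words.
by_any_space = re.split(r"\s+", line)

# Splitting on two or more whitespace characters yields code and description.
by_two_spaces = re.split(r"\s{2,}", line)

print(len(by_tab), len(by_any_space), len(by_two_spaces))
```

A tab separator leaves everything in one column, while `\s+` produces one field per word, which is why neither gives the desired two columns.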