pandas 将SAS数据文件导入python数据框

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/30911872/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 23:29:06  来源:igfitidea点击:

Import SAS data file into python data frame

pythonpandassas

提问by pdevar

I'm working on a data set (PSID) that gives data in a SAS format (a .txt and another file containing instructions to interpret the data). I cannot find anything in Python to read this type of data.

我正在研究一个数据集 (PSID),它以 SAS 格式提供数据(一个 .txt 和另一个包含解释数据指令的文件)。我在 Python 中找不到任何东西来读取这种类型的数据。

Does anyone know of a pre-existing module/script to read SAS data?

有谁知道一个预先存在的模块/脚本来读取 SAS 数据?

Edit(added from a comment to an answer): The data is in ascii/text and the start of a row of data looks like this:

编辑(从评论添加到答案):数据在 ascii/text 中,一行数据的开头如下所示:

3 10 1015000 150013200 00 002500 00 00

3 10 1015000 150013200 00 002500 00 00

采纳答案by pdevar

So I've written a package that can import the data. It can be found here:

所以我写了一个可以导入数据的包。在这里能找到它:

https://pypi.python.org/pypi/psid_py

https://pypi.python.org/pypi/psid_py

It's my first package, so sorry for the sloppy work. Additionally, it's only been tested against the PSID data sets and I'm positive that there are mistakes for other SAS formats. However, something's better than nothing.

这是我的第一个包裹,很抱歉草率的工作。此外,它仅针对 PSID 数据集进行了测试,我确信其他 SAS 格式存在错误。然而,有总比没有好。

Beyond reading in sas data, it will also build a panel data for you, in case you're into that kind of thing.

除了读取 sas 数据外,它还将为您构建面板数据,以防您遇到这种情况。

回答by josiah

As of version 17, Pandas now supports reading sas files with the .xpt file extension. See this linkto pandas documentation for additional details.

从版本 17 开始,Pandas 现在支持读取文件扩展名为 .xpt 的 sas 文件。有关其他详细信息,请参阅指向Pandas 文档的链接

df = pd.read_sas('sas_xport.xpt')

回答by JohnE

When you have the option to download a SAS dataset you will often also have the option to download a Stata dataset (this is indeed the case for PSID btw). In that case, the easiest way will likely be to import with read_stata(this might change in the future, but I believe is a very accurate statement as of today).

当您可以选择下载 SAS 数据集时,您通常还可以选择下载 Stata 数据集(顺便说一句,PSID 确实是这种情况)。在这种情况下,最简单的方法可能是导入 with read_stata(这在未来可能会发生变化,但我相信这是一个非常准确的声明)。

Less convenient, but almost always an option, is to download a text file (usually referred to as text, ascii, or csv). Those tend to come in two flavors: delimited (with comma or tab), or space separated (columnar or tabulated). If the file is comma or tab delimited, use read_csvand set the delimiter as appropriate. If it's space delimited or tabular, you might have good luck with read_csv, or you might be better off with read_fwfor read_table. Depends a bit on the variable types and formatting.

不太方便,但几乎总是一种选择,是下载文本文件(通常称为文本、ascii 或 csv)。这些往往有两种风格:分隔(用逗号或制表符),或空格分隔(柱状或表格)。如果文件以逗号或制表符分隔,请使用read_csv并适当设置分隔符。如果它的空间分隔或表格,你可能有好运气read_csv,否则你可能会更好用read_fwfread_table。取决于变量类型和格式。

From what I have read, sas7bdatmentioned by @hd1 seems to work well but is not part of pandas yet. For that reason, I tend to default to read_stataor read_csvbut hopefully sas7bdatalso works well and perhaps will be brought into pandas in the future. Also, I'm wondering about the speed of sas7bdat. read_csvhas been pretty fast for a long time and read_statais very fast in the latest versions (since 15.0, I believe). I'm not sure about the speed of sas7bdat?

从我读到的内容来看,sas7bdat@hd1 提到的似乎效果很好,但还不是Pandas的一部分。出于这个原因,我倾向于默认read_stataread_csv但希望sas7bdat也能很好地工作,也许将来会被引入Pandas。另外,我想知道sas7bdat. read_csv很长一段时间以来read_stata都非常快,并且在最新版本中非常快(我相信从 15.0 开始)。我不确定sas7bdat?

回答by Arun

I know this is an old post however just wanted to provide an efficient method if someone reaches this page through google.

我知道这是一篇旧帖子,但只是想提供一种有效的方法,如果有人通过谷歌访问此页面。

we can use pyreadstat to do the same (can get data as well as metadata).

我们可以使用 pyreadstat 来做同样的事情(可以获取数据和元数据)。

import pyreadstat
df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')

or save the pandas frame using following code

或使用以下代码保存Pandas框架

pyreadstat.write_xport(df, "path/to/destination.xpt", file_label="test", column_labels=column_labels)

回答by Tom

Look at the SAS code that you get. You should see that it is very consistently formatted so that you can parse out the variable names and the columns to read for those variables. For example in this paper https://psidonline.isr.umich.edu/Guide/FileStructure.pdfyou can see that the INPUT statement is of the form:

查看您获得的 SAS 代码。您应该看到它的格式非常一致,以便您可以解析变量名称和要读取这些变量的列。例如在这篇论文https://psidonline.isr.umich.edu/Guide/FileStructure.pdf你可以看到 INPUT 语句的形式:

INPUT
    ER30001 2 - 5
    ER30002 6 - 8
    ER30642 1528 - 1532
    ER30643 1533 - 1534
...
;

So just read the SAS program and generate the appropriate Python to read the text file using the same variable names.

因此,只需读取 SAS 程序并生成适当的 Python 即可使用相同的变量名称读取文本文件。

回答by Davoud Taghawi-Nejad

The data is in a fixed with table. Fixed with means that for example the 3rd value starts in every row at the 15th letter and goes until the 114 letter.

数据在一个固定的表中。Fixed with 意味着例如第三个值在每一行的第 15 个字母处开始,直到第 114 个字母。

1.Open your SAS or SPSS input statement. In case of SAS you will find something like this:

1. 打开您的 SAS 或 SPSS 输入语句。在 SAS 的情况下,您会发现如下内容:

@1  ANO_CENSO   5.  /*Ano do Censo*/
@6  PK_COD_ENTIDADE 9.  /*Código da Escola*/
@15 NO_ENTIDADE $Char100.   /*Nome da Escola*/
@115    COD_ORGAO_REGIONAL_INEP $Char5. /*Código do órg?o Regional de Ensino*/

the left number is the column (in number of letters, where the according value)

左边的数字是列(以字母数表示,其中对应的值)

2.In the script below, fill in the columns, first is the column name second the tuple is the first and the column of the first and the last letter/number of the variable. NOTE THAT SAS starts counting at 1 and python at 0.

2.在下面的脚本中,填写列,首先是列名,其次是元组是第一个和变量的第一个和最后一个字母/数字的列。注意 SAS 从 1 开始计数,python 从 0 开始计数。

import pandas as pd

columns=(
('ANO', (0, 5)),
('CODE', (5, 14)),
('DESC_SITUACAO_FUNCIONAMENTO', (119, 134)),
('FK_COD_ESTADO', (176, 178)),
('SIGLA', (178, 181)),
)

df = pd.read_fwf('TS_ESCOLA.TXT', names=zip(*columns)[0], colspecs=zip(*columns)[1], header=None)

// pd.read_fwf is the fixed with reader of pandas.

// pd.read_fwf 是pandas 的固定阅读器。