Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors, citing the original address. Original StackOverflow question: http://stackoverflow.com/questions/38930583/

Time: 2020-09-14 01:48:40  Source: igfitidea

How to get Text from b'Text' in the pandas object type after using read_sas?

Tags: python, object, pandas, dataframe

Asked by doktr

I'm trying to read data from a SAS .sas7bdat file using the pandas function read_sas:

import pandas as pd
df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat')
df.head()

The df dataframe has two data types: float64 and object. I am completely satisfied with the float64 columns, since I can freely convert them to int, string, etc. The problem is with the object columns, whose values appear in the dataframe wrapped like this:

b'Text'

or like this:

b'12345'

instead of

Text

or

12345

I can't convert these values to string or int respectively, or to a "normal" object data type. I also can't eliminate the b'' using slicing or replace techniques, so I'm not able to use the columns with the object data type. Please tell me how I can get rid of the b''.

Answered by MAFiA303

Add the argument encoding="utf-8" to the read_sas call.

The line then becomes:

df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat', encoding="utf-8")
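
To see what decoding changes without needing a SAS file, here is a minimal sketch in plain Python. It assumes the object columns hold Python bytes values, which is what the b'...' prefix indicates; the variable names are illustrative and not part of the original question.

```python
# The b'...' prefix marks a bytes object, not a str.
raw_text = b"Text"
raw_number = b"12345"

# str() on bytes keeps the b'' wrapper, so it does not yield clean text.
wrapped = str(raw_text)

# Decoding with the matching encoding yields a real str,
# which can then be converted further (for example to int).
text = raw_text.decode("utf-8")
number = int(raw_number.decode("utf-8"))

print(repr(wrapped))
print(repr(text))
print(number)
```

Passing encoding to read_sas asks pandas to perform this kind of decoding while reading, so the columns arrive as text instead of bytes.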

Answered by Eric

First, figure out the encoding of your SAS dataset. In SAS, run proc contents on the dataset and check the "Encoding" field. In my case, the encoding was "latin1 Western (ISO)". Then pass the matching encoding name to read_sas:

df = pd.read_sas('filename', format = 'sas7bdat', encoding = 'latin-1')
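
The reason for checking the dataset encoding first can be illustrated with a small sketch in plain Python. The sample word is made up for illustration; the point is only that bytes written as latin-1 may fail to decode as UTF-8 when they contain non-ASCII characters.

```python
# A word with an accented character, stored as latin-1 bytes
# (hypothetical sample value, not taken from the original dataset).
raw = "caf\u00e9".encode("latin-1")

# Decoding with the encoding the data was written in works.
decoded = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 fails, because the byte 0xE9
# is not a complete UTF-8 sequence on its own.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(decoded, utf8_failed)
```

Pure ASCII data decodes identically under both encodings, so a mismatch may only show up once a non-ASCII value appears.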

Answered by Adrien Pacifico

In my case, using the encoding argument of pd.read_sas() produced very large dataframes, which led to memory-related errors.

Another way to deal with the problem is to read the data without the encoding argument and then decode the byte strings into strings afterwards, using the appropriate encoding (e.g. utf8).

Example:

Example dataframe:


import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [b"a", b"b", b"c"],
                   "C": ["a", "b", "c"]})

Transform byte strings to strings:

for col in df:
    # Check the first value by position; a bytes value marks a byte-string column
    if isinstance(df[col].iloc[0], bytes):
        print(col, "will be transformed from bytestring to string")
        df[col] = df[col].str.decode("utf8")  # or any other encoding
print(df)

Output:

   A  B  C
0  1  a  a
1  2  b  b
2  3  c  c
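
A slightly more defensive variant is sketched below. It is not part of the original answer: the helper name decode_bytes_columns and the extra sample columns are hypothetical. It assumes that a byte-string column may start with a missing value, in which case checking only the first element would skip the column, so it tests each value instead.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with bytes values in object columns decoded to str.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.select_dtypes(include=["object"]):
        # Decode only bytes values; leave str, None and NaN untouched.
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out


df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [None, b"b", b"c"],   # first value is missing
                   "C": ["a", "b", "c"],
                   "D": [b"1", b"2", b"3"]})  # numbers stored as bytes

clean = decode_bytes_columns(df)
# Once decoded, numeric text can be converted like any other string column.
clean["D"] = clean["D"].astype(int)
print(clean)
```

Checking every value is slower than Series.str.decode on large columns, so this trade-off may only be worthwhile when missing or mixed values are expected.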

Useful links:

  1. Pandas Series.str.decode() page on GeeksforGeeks (where I found my solution)

  2. What is the difference between a string and a byte string?
