Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/16988526/

Time: 2020-08-19 00:08:42  Source: igfitidea

Pandas reading csv as string type

python pandas

Asked by daver

I have a data frame with alphanumeric keys that I want to save as a csv and read back later. For various reasons I need to explicitly read this key column as strings: some keys are strictly numeric or, even worse, look like 1234E5, which Pandas interprets as a float. This obviously makes the key completely useless.
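
To illustrate the problem, here is a minimal sketch with made-up keys (the column name key and the sample values are only assumptions for illustration, not my real data):

```python
import io
import pandas as pd

# Hypothetical keys that look numeric but are meant to be text.
csv_text = "key\n1234E5\n007\n"

# With no dtype given, pandas infers a numeric type for the column.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["key"].dtype)
print(inferred["key"].tolist())
```

Both keys come back as floats, so the original text of the key is lost.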

The problem is that when I specify a string dtype for the data frame, or for any column of it, I just get garbage back. Here is some example code:

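import numpy as np
import pandas as pd

savefile = 'a'  # assumed path; the accepted answer reads the file back as 'a'
df = pd.DataFrame(np.random.rand(2,2),
                  index=['1A', '1B'],
                  columns=['A', 'B'])
df.to_csv(savefile)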

The data frame looks like:

           A         B
1A  0.209059  0.275554
1B  0.742666  0.721165

Then I read it like so:

df_read = pd.read_csv(savefile, dtype=str, index_col=0)

and the result is:

   A  B
B  (  <

Is this a problem with my computer, or something I'm doing wrong here, or just a bug?

Accepted answer by Andy Hayden

Update: this has been fixed. From 0.11.1, passing str/np.str is equivalent to using object.
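
A minimal sketch of what this looks like on a recent pandas, assuming a made-up two-column CSV (the column names key and value and the sample data are illustrative, not taken from the question):

```python
import io
import pandas as pd

# Hypothetical sample data: keys that look numeric but must stay text.
csv_text = "key,value\n1234E5,1.5\n007,2.5\n"

# dtype=str asks for every column to be read as text.
all_text = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(all_text["key"].tolist())

# A per-column mapping keeps the key as text and lets pandas infer the rest.
key_only = pd.read_csv(io.StringIO(csv_text), dtype={"key": str})
print(key_only.dtypes)
```

The per-column form is probably the better fit when only the key column needs protecting, since the numeric columns stay numeric.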

Use the object dtype:

In [11]: pd.read_csv('a', dtype=object, index_col=0)
Out[11]:
                      A                     B
1A  0.35633069074776547     0.745585398803751
1B  0.20037376323337375  0.013921830784260236

or better yet, just don't specify a dtype:

In [12]: pd.read_csv('a', index_col=0)
Out[12]:
           A         B
1A  0.356331  0.745585
1B  0.200374  0.013922

but bypassing the type sniffer and truly returning only strings requires a hacky use of converters:

In [13]: pd.read_csv('a', converters={i: str for i in range(100)})
Out[13]:
                      A                     B
1A  0.35633069074776547     0.745585398803751
1B  0.20037376323337375  0.013921830784260236

where 100 is some number greater than or equal to your total number of columns.

It's best to avoid the str dtype; see for example here.

Answer by Chris Conlan

As Anton T said in his comment, pandas will randomly turn object types into float types using its type sniffer, even if you pass dtype=object, dtype=str, or dtype=np.str.

Since you can pass a dictionary of functions where the key is a column index and the value is a converter function, you can do something like this (e.g. for 100 columns).

pd.read_csv('some_file.csv', converters={i: str for i in range(0, 100)})

If you don't know how many columns you will read, you can even pass range(0, N) for an N much larger than the number of columns.
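
If guessing N feels fragile, one alternative sketch is to read only the header first and build the converters from the actual column names. This variant is not from the answers here, and the sample data and variable names are assumptions for illustration:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real file.
csv_text = "key,value\n1234E5,1.5\n007,2.5\n"

# Read only the header row (nrows=0) to learn the column names.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Map every real column name to str so each field stays text.
df = pd.read_csv(io.StringIO(csv_text), converters={name: str for name in columns})
print(df)
```

This costs a second pass over the header, but the converters then match the file exactly.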

Answer by DanielRS

Use a converter that applies to any column if you don't know the columns beforehand:

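import pandas as pd

class StringConverter(dict):
    """A dict that claims to hold a str converter for every column."""

    def __contains__(self, item):
        # Report every column name or index as present.
        return True

    def __getitem__(self, item):
        # Always hand back str as the converter.
        return str

    def get(self, key, default=None):
        # Match dict.get's (key, default) signature; the default is never used.
        return str

# file_or_buffer is a placeholder for your path or file-like object.
pd.read_csv(file_or_buffer, converters=StringConverter())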