Handling HUGE numbers in numpy or pandas
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/21591109/
Asked by Terence Chow
I am doing a competition where I am provided data that is anonymized. Quite a few of the columns have HUGE values; the largest was 40 digits long! I used pd.read_csv, but those columns were converted to objects as a result.
My original plan was to scale the data down, but since the values are seen as objects I can't do arithmetic on them.
Does anyone have a suggestion on how to handle huge numbers in Pandas or Numpy?
Note that I've tried converting the values to uint64 with no luck. I get the error "long too big to convert".
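For context, the overflow error follows from the fixed width of uint64: it can hold at most 2**64 - 1, which is a 20-digit number, so a 40-digit value cannot fit. A minimal Python 3 sketch of that limit, where `big` is a made-up value for illustration and not taken from the competition data:

```python
import numpy as np

# Largest value a uint64 can hold: 2**64 - 1, which has 20 digits.
uint64_max = int(np.iinfo(np.uint64).max)
digits = len(str(uint64_max))

# A hypothetical 40-digit value, similar in size to the ones in the question.
big = 10**39

try:
    np.uint64(big)
    overflowed = False
except OverflowError:
    # NumPy refuses to squeeze the value into 64 bits.
    overflowed = True

print(digits, overflowed)
```

The exact wording of the error message depends on the Python and NumPy versions in use.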
Answered by DSM
If you have a mixed-type column -- some integers, some strings -- stored in a dtype=object column, you can still convert to ints and perform arithmetic. Starting from a mixed-type column:
>>> df = pd.DataFrame({"A": [11**44, "11"*22]})
>>> df
                                                A
0  6626407607736641103900260617069258125403649041
1    11111111111111111111111111111111111111111111
[2 rows x 1 columns]
>>> df.dtypes, list(map(type, df.A))
(A    object
dtype: object, [<type 'long'>, <type 'str'>])
We can convert to ints:
>>> df["A"] = df["A"].apply(int)
>>> df.dtypes, list(map(type, df.A))
(A    object
dtype: object, [<type 'long'>, <type 'long'>])
>>> df
                                                A
0  6626407607736641103900260617069258125403649041
1    11111111111111111111111111111111111111111111
[2 rows x 1 columns]
And then perform arithmetic:
>>> df // 11
                                               A
0  602400691612421918536387328824478011400331731
1    1010101010101010101010101010101010101010101
[2 rows x 1 columns]
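The session above is from Python 2, where large integers have type long. A rough Python 3 sketch of the same idea follows; Python 3's int is arbitrary precision and replaces long. The final scaling step is only an assumption about what "scaling the data down" might mean, and the names `divided`, `largest`, and `scaled` are illustrative:

```python
import pandas as pd

# Same mixed-type column as above: one Python int, one digit string.
df = pd.DataFrame({"A": [11**44, "11" * 22]})

# The column stays dtype=object, but every element becomes an exact int.
df["A"] = df["A"].apply(int)

# Exact integer arithmetic, applied element by element.
divided = df["A"] // 11

# One possible way to scale down: divide by the largest value.
# The result is an ordinary float64 column, so it is approximate.
largest = max(df["A"])
scaled = df["A"].apply(lambda v: v / largest).astype(float)

print(divided.tolist())
print(scaled.tolist())
```

Arithmetic on an object column runs at Python speed rather than NumPy speed, which is the price paid for keeping every digit.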
Answered by dawg
You can use the Pandas converters argument to call int, or some other custom converter function, on the strings as they are being imported:
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,1,"Tiny"
4,-9999999999999999999999999999999999999999,"Really negative"
'''
# Each raw string in the Big_Num column is passed through int() on import.
df = pd.read_csv(StringIO(txt), converters={'Big_Num': int})
print(df)
Prints:
   line                                    Big_Num                           text
0     1   1234567890123456789012345678901234567890      That sure is a big number
1     2   9999999999999999999999999999999999999999  That is an even BIGGER number
2     3                                          1                           Tiny
3     4  -9999999999999999999999999999999999999999                Really negative
Now test arithmetic:
n = df["Big_Num"][1]
print(n, n + 1)
Prints:
9999999999999999999999999999999999999999 10000000000000000000000000000000000000000
If any values in the column might cause int to croak, you can do this:
txt='''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,0.000000000000000001,"Tiny"
4,"a string","Use 0 for strings"
'''
def conv(s):
    # Try an exact int first, then a float, and fall back to 0 for anything else.
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return 0

df = pd.read_csv(StringIO(txt), converters={'Big_Num': conv})
print(df)
Prints:
   line                                   Big_Num                           text
0     1  1234567890123456789012345678901234567890      That sure is a big number
1     2  9999999999999999999999999999999999999999  That is an even BIGGER number
2     3                                     1e-18                           Tiny
3     4                                         0              Use 0 for strings
Then every value in the column will be either a Python int or a float and will support arithmetic.
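One way to check that claim is to list the Python type of each element after import. The following is a small self-contained Python 3 sketch; the shortened sample data and the name `kinds` are illustrative, and `conv` repeats the fallback chain from above:

```python
from io import StringIO

import pandas as pd

txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
3,0.000000000000000001,"Tiny"
4,"a string","Use 0 for strings"
'''

def conv(s):
    """Same fallback chain as above: int, then float, then 0."""
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return 0

df = pd.read_csv(StringIO(txt), converters={"Big_Num": conv})

# Collect the Python type name of each element in the converted column.
kinds = [type(v).__name__ for v in df["Big_Num"]]
print(kinds)
```

Note that the float fallback is approximate, so only the values that parse as int keep every digit.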
Answered by Andy Hayden
Edit: These can't be (accurately) represented as floats either; the conversion just doesn't raise when you try. It is probably best to use object dtype and longs, as in DSM's answer.
But you can do it inaccurately (using @DSM's example):
In [11]: df = pd.DataFrame({"A": [11**44, "11"*22]}).astype(float)
In [12]: df
Out[12]: 
              A
0  6.626408e+45
1  1.111111e+43
[2 rows x 1 columns]
In [13]: df.dtypes
Out[13]: 
A    float64
dtype: object
But it may not be what you want...
In [21]: df.iloc[0, 0]
Out[21]: 6.6264076077366411e+45
In [22]: long(df.iloc[0, 0])
Out[22]: 6626407607736641089115845702792172379125579776L
In [23]: 11 ** 44
Out[23]: 6626407607736641103900260617069258125403649041L
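The mismatch above can be reproduced without pandas. A float64 has a 53-bit significand, which is roughly 15 to 17 significant decimal digits, so a 46-digit integer keeps only its leading digits. A minimal Python 3 sketch, with illustrative variable names:

```python
exact = 11**44

# Round-trip through a 64-bit float, which is what a float column stores.
approx = int(float(exact))

# Only the leading digits survive; the rest is rounding noise.
error = abs(exact - approx)
leading_match = str(exact)[:15] == str(approx)[:15]

print(exact)
print(approx)
print(error, leading_match)
```

Whether an error of this size matters depends on the use: it is harmless for rough scaling, but not if the values act as identifiers or must be compared exactly.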
As DSM suggests, convert to long (and use object dtype) so as not to lose accuracy:
In [31]: df = pd.DataFrame({"A": [11**44, "11"*22]}).apply(long, 1)
In [32]: df
Out[32]: 
0    6626407607736641103900260617069258125403649041
1      11111111111111111111111111111111111111111111
dtype: object

