Pandas: Location of a row with error

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/26660313/
Asked by user4199637
I am pretty new to Pandas and am trying to find out where my code breaks. Say I am doing a type conversion:
df['x']=df['x'].astype('int')
...and I get an error: "ValueError: invalid literal for long() with base 10: '1.0692e+06'"
In general, if I have 1000 entries in the dataframe, how can I find out which entry causes the break? Is there anything in ipdb to output the current location (i.e. where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.
Answered by unutbu
The error you are seeing might be due to the value(s) in the x column being strings:
In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
In [16]: df['x'].astype('int')
ValueError: invalid literal for long() with base 10: '1.0692e+06'
Ideally, the problem can be avoided by making sure the values stored in the DataFrame are already ints, not strings, when the DataFrame is built. How to do that depends, of course, on how you are building the DataFrame.
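For instance, if the data comes from a CSV file, the numeric strings can be parsed as floats while reading and then cast to int. This is only a sketch: the file name data.csv and the single x column are assumptions for illustration, not part of the original question.
import pandas as pd

# Hypothetical example: parse 'x' as float at read time, then cast to int.
df = pd.read_csv('data.csv', dtype={'x': float})
df['x'] = df['x'].astype(int)   # safe now: '1.0692e+06' was parsed as 1069200.0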
After the fact, the DataFrame could be fixed using applymap:
import ast
df = df.applymap(ast.literal_eval).astype('int')
but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
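As a side note that is not part of the original answer: pd.to_numeric offers a vectorized way to both convert the column and locate the values that fail to parse.
import pandas as pd

df = pd.DataFrame({'x': ['1.0692e+06', 'not a number']})

# errors='coerce' turns unparseable values into NaN instead of raising,
# so the offending rows can be found by checking for NaN afterwards.
converted = pd.to_numeric(df['x'], errors='coerce')
print(df[converted.isna()])   # rows whose 'x' could not be converted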
Usually you could drop to a debugger when an exception is raised to inspect the problematic value of the row.
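(In IPython that would mean running %debug right after the exception; in a plain script it would look something like the following sketch, which is not part of the original answer.)
import pdb

try:
    df['x'] = df['x'].astype('int')
except ValueError:
    pdb.post_mortem()   # inspect the frame where the exception was raised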
However, in this case the exception is happening inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code does the looping through the values in df['x'], so the Python debugger is not helpful here -- it won't allow you to see which value the exception is being raised on from within the C-compiled code.
There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.
So instead I would revert to a low-brow solution: iterate through the values in a Python loop and use try...except to catch the first error:
df = pd.DataFrame({'x':['1.0692e+06']})
for i, item in enumerate(df['x']):
    try:
        int(item)
    except ValueError:
        print('ERROR at index {}: {!r}'.format(i, item))
yields
ERROR at index 0: '1.0692e+06'
Answered by crypdick
To report all rows which fail to map due to any exception:
df.apply(my_function)  # throws various exceptions at unknown rows

# print the exception, the index, and the row content for every failing row
for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row))
        print(e)
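As a concrete, self-contained illustration (my_function and the sample data below are hypothetical and not part of the original answer):
import pandas as pd

def my_function(row):
    # Hypothetical per-row operation that fails on non-numeric strings.
    return int(float(row['x']))

df = pd.DataFrame({'x': ['1.0692e+06', 'oops', '42']})

for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row['x']))
        print(e)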
Answered by Patrick Ng
I hit the same problem, and as I have a big input file (3 million rows), enumerating all rows would take a long time. Therefore I wrote a binary search to locate the offending row.
import pandas as pd
import sys

def binarySearch(df, l, r, func):
    while l <= r:
        mid = l + (r - l) // 2
        # Check if we hit the exception at mid
        result = func(df, mid, mid+1)
        if result:
            return mid, result
        result = func(df, l, mid)
        if result is None:
            # If there is no exception in the left half, ignore it
            l = mid + 1
        else:
            r = mid - 1
    # If we reach here, then no offending row was found
    return -1, None

def check(df, start, end):
    result = None
    try:
        # In my case, I want to find out which row causes this failure
        df.iloc[start:end].uid.astype(int)
    except Exception as e:
        result = str(e)
    return result

df = pd.read_csv(sys.argv[1])
index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))
print(result)
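A minimal self-contained run of the same functions (the in-memory DataFrame below is an assumption for illustration; the original reads a CSV path from sys.argv):
df = pd.DataFrame({'uid': ['1', '2', '1.0692e+06', '4']})
index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))   # -> index: 2
print(result)                      # -> the ValueError message for '1.0692e+06'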

