pandas 熊猫合并错误类型错误：“int”和“str”的实例之间不支持“>”

Question

提问by Mohammad ElNesr

I have a dataset with several tables, each in the form of countries, years, and some indicators. I have converted all the excel tables to csv files, then merged them into one table.

我有一个包含多个表的数据集，每个表的形式为国家、年份和一些指标。我已将所有 excel 表转换为 csv 文件，然后将它们合并为一张表。

The problem is that I have some tables that refuse to be merged, and the following message appears TypeError: '>' not supported between instances of 'int' and 'str'I tried everything I can, but no luck, still the same error appears! Also, I tried with hundreds of different files, but there are still tens of files that face this problem.

问题是我有一些表拒绝合并，出现以下消息TypeError: '>' not supported between instances of 'int' and 'str'我尝试了一切，但没有运气，仍然出现相同的错误！此外，我尝试了数百个不同的文件，但仍有数十个文件面临此问题。

For the sample files file17.csvand file35.csv(In case someone needs to repeat it). Here are the code I used:

对于示例文件file17.csv和file35.csv（以防有人需要重复）。这是我使用的代码：

# To load the first file
import pandas as pd
filename1 = 'file17.csv'
df1 = pd.read_csv(filename1, encoding='cp1252', low_memory=False)
df1.set_index(['Country', 'Year'], inplace=True)
df1.dropna(axis=0, how='all', inplace=True)
df1.head()

Out>>>

出>>>

+-------------+------+--------+--------+
|             |      | ind500 | ind356 |
| Country     | Year |        |        |
| Afghanistan | 1800 | 603.0  | NaN    |
|             | 1801 | 603.0  | NaN    |
|             | 1802 | 603.0  | NaN    |
|             | 1803 | 603.0  | NaN    |
|             | 1804 | 603.0  | NaN    |
+-------------+------+--------+--------+

In>>>

在>>>

# To load the second file
filename2 = 'file35.csv'
df2 = pd.read_csv(filename2, encoding='cp1252', low_memory=False)
df2.set_index(['Country', 'Year'], inplace=True)
df2.dropna(axis=0, how='all', inplace=True)
df2.head()

Out>>>

出>>>

# To merge the two dataframes
gross_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
gross_df.dropna(axis=0, how='all', inplace=True)
print (gross_df.shape)
gross_df.to_csv('merged.csv')

Important notice:I noticed that in all the successful files, the columns names appear in ascending orders i.e. ind001, ind009, ind012, as they were sorted automatically. while the files with errors have one or more columns with misordered placement like ind500followed by in356in the first table and the same applies to the second sample provided.

重要提示：我注意到在所有成功的文件中，列名称以升序出现，即ind001, ind009, ind012，因为它们是自动排序的。而有错误的文件有一个或多个列的排列顺序错误，例如第一个表中的ind500后跟，in356同样适用于提供的第二个示例。

Notice that the two dataframesindiceswo indices (Countryand year)

请注意，两个 dataframesindiceswo 索引 (Country和year)

The error

错误

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
    480         try:
--> 481             sorter = values.argsort()
    482             ordered = values.take(sorter)

TypeError: '>' not supported between instances of 'int' and 'str'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-11-960b2698de60> in <module>()
----> 1 gross_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False)
      2 gross_df.dropna(axis=0, how='all', inplace=True)
      3 print (gross_df.shape)
      4 gross_df.to_csv('merged.csv')

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
     53                          copy=copy, indicator=indicator)
---> 54     return op.get_result()
     55 
     56 

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
    567                 self.left, self.right)
    568 
--> 569         join_index, left_indexer, right_indexer = self._get_join_info()
    570 
    571         ldata, rdata = self.left._data, self.right._data

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _get_join_info(self)
    720             join_index, left_indexer, right_indexer = \
    721                 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722                              sort=self.sort)
    723         elif self.right_index and self.how == 'left':
    724             join_index, left_indexer, right_indexer = \

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\indexes\base.py in join(self, other, how, level, return_indexers, sort)
   2995             else:
   2996                 return self._join_non_unique(other, how=how,
-> 2997                                              return_indexers=return_indexers)
   2998         elif self.is_monotonic and other.is_monotonic:
   2999             try:

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\indexes\base.py in _join_non_unique(self, other, how, return_indexers)
   3076         left_idx, right_idx = _get_join_indexers([self.values],
   3077                                                  [other._values], how=how,
-> 3078                                                  sort=True)
   3079 
   3080         left_idx = _ensure_platform_int(left_idx)

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
    980 
    981     # get left & right join labels and num. of levels at each location
--> 982     llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
    983 
    984     # get flat i8 keys from label lists

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _factorize_keys(lk, rk, sort)
   1409     if sort:
   1410         uniques = rizer.uniques.to_array()
-> 1411         llab, rlab = _sort_labels(uniques, llab, rlab)
   1412 
   1413     # NA group

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _sort_labels(uniques, left, right)
   1435     labels = np.concatenate([left, right])
   1436 
-> 1437     _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
   1438     new_labels = _ensure_int64(new_labels)
   1439     new_left, new_right = new_labels[:l], new_labels[l:]

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
    483         except TypeError:
    484             # try this anyway
--> 485             ordered = sort_mixed(values)
    486 
    487     # labels:

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in sort_mixed(values)
    469         str_pos = np.array([isinstance(x, string_types) for x in values],
    470                            dtype=bool)
--> 471         nums = np.sort(values[~str_pos])
    472         strs = np.sort(values[str_pos])
    473         return _ensure_object(np.concatenate([nums, strs]))

C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\numpy\core\fromnumeric.py in sort(a, axis, kind, order)
    820     else:
    821         a = asanyarray(a).copy(order="K")
--> 822     a.sort(axis=axis, kind=kind, order=order)
    823     return a
    824 

TypeError: '>' not supported between instances of 'int' and 'str'

Answer 1

采纳答案by MaxU

This error indicates that indices in merged DF have different dtypes

此错误表明合并 DF 中的索引具有不同的 dtype

Demo - how to convert string index level to int:

演示 - 如何将字符串索引级别转换为int：

In [183]: df
Out[183]:
              0         1         2         3
bar 1 -0.205037  0.762509  0.816608 -1.057907
    2  1.249104  0.338777 -0.982084  0.329330
baz 1  0.845695 -0.996365  0.548100 -0.113733
    2  1.247092 -2.674061 -0.071993 -0.734242
foo 1 -1.233825 -0.195377 -0.240303  1.168055
    2 -0.108942 -0.615612 -1.299512  0.908641
qux 1  0.844421  0.251425 -0.506877  1.307800
    2  0.038580  0.045072 -0.262974  0.629804

In [184]: df.index
Out[184]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['1', '2']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])

In [185]: df.index.get_level_values(1)
Out[185]: Index(['1', '2', '1', '2', '1', '2', '1', '2'], dtype='object')

In [187]: df.index = df.index.set_levels(df.index.get_level_values(1) \
                       .map(lambda x: pd.to_numeric(x, errors='coerce')), level=1)

Result:

结果：

In [189]: df.index.get_level_values(1)
Out[189]: Int64Index([1, 2, 1, 2, 1, 2, 1, 2], dtype='int64')

UPDATE:try this:

更新：试试这个：

In [247]: d1 = pd.read_csv('https://docs.google.com/uc?id=1jUsbr5pw6sUMvewI4fmbpssroG4RZ7LE&export=download', index_col=[0,1])

In [248]: d2 = pd.read_csv('https://docs.google.com/uc?id=1Ufx6pvnSC6zQdTAj05ObmV027fA4-Mr3&export=download', index_col=[0,1])

In [249]: d2 = d2[pd.to_numeric(d2.index.get_level_values(1), errors='coerce').notna()]

In [250]: d2.index = d2.index.set_levels(d2.index.get_level_values(1).map(lambda x: pd.to_numeric(x, errors='coerce')), level=1)

In [251]: d1.reset_index().merge(d2.reset_index(), on=['Country','Year'], how='outer').set_index(['Country','Year'])
Out[251]:
                            ind500  ind356  ind475  ind476        ind456
Country               Year
Afghanistan           1800   603.0     NaN     NaN     NaN           NaN
                      1801   603.0     NaN     NaN     NaN           NaN
                      1802   603.0     NaN     NaN     NaN           NaN
                      1803   603.0     NaN     NaN     NaN           NaN
                      1804   603.0     NaN     NaN     NaN           NaN
                      1805   603.0     NaN     NaN     NaN           NaN
                      1806   603.0     NaN     NaN     NaN           NaN
                      1807   603.0     NaN     NaN     NaN           NaN
                      1808   603.0     NaN     NaN     NaN           NaN
                      1809   603.0     NaN     NaN     NaN           NaN
...                            ...     ...     ...     ...           ...
Bahamas, The          1967     NaN     NaN     NaN     NaN  18381.131314
Gambia, The           1967     NaN     NaN     NaN     NaN    937.355288
Korea, Dem. Rep.      1967     NaN     NaN     NaN     NaN   1428.689253
Lao PDR               1967     NaN     NaN     NaN     NaN   1412.359955
Netherlands Antilles  1967     NaN     NaN     NaN     NaN  14076.731352
Russian Federation    1967     NaN     NaN     NaN     NaN  11794.726437
Serbia and Montenegro 1967     NaN     NaN     NaN     NaN   2987.080489
Syrian Arab Republic  1967     NaN     NaN     NaN     NaN   2015.913906
Yemen, Rep.           1967     NaN     NaN     NaN     NaN   1075.693355
Bahamas, The          1968     NaN     NaN     NaN     NaN  18712.082830

[46607 rows x 5 columns]

pandas 熊猫合并错误类型错误：“int”和“str”的实例之间不支持“>”

提问by Mohammad ElNesr

采纳答案by MaxU

相关推荐

最近更新

标签

pandas 熊猫合并错误类型错误：“int”和“str”的实例之间不支持“>”

提问by Mohammad ElNesr

采纳答案by MaxU

相关推荐

使用映射器时，pandas DataFrame.rename 意外的关键字参数“axis”

pandas 按年和月分组 Panda Pivot Table

pandas 如何在熊猫多索引数据框中仅选择索引列？

pandas 使用 Graphviz 绘制决策树时出现“特征名称的长度与特征数量不匹配”错误

相关推荐

最近更新

标签