pandas TypeError: unsupported operand type(s) for -: 'str' and 'str' in Python 3.x Anaconda

Question

Asked by Sitz Blogz

I am trying to count some instances per hour in a large dataset. The code below seems to work fine on Python 2.7, but I had to upgrade to the latest Python 3.x with all packages updated on Anaconda. When I try to execute the program, I get the following error.

Code:

import pandas as pd
from datetime import datetime,time
import numpy as np

fn = r'00_input.csv'
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.ix[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

#r.to_csv('results.csv', index=False)
#print(r[r.LogCount > 0])
#print (r['StartTime'], r['EndTime'], r['Day'], r['LogCount'], r['UniqueIDCount'])

rout =  r[['Date', 'StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount'] ]
#print rout
rout.to_csv('o_1_hour.csv', index=False, header=False)

Where should I make changes to get an error-free execution?

Error:

File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py", line 686, in <lambda>
    lambda x: op(x, rvalues))

TypeError: unsupported operand type(s) for -: 'str' and 'str'

I appreciate the help. Thanks in advance.

Answer 1

Answered by jezrael

I think you need to change to header=0 so that the first row is selected as the header; the column names are then replaced by the list cols.
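
To illustrate why the header argument matters here, the following is a minimal sketch on made-up data (the CSV text and the variable names as_data and as_header are assumptions, not taken from the original file). It assumes the file starts with a header line: with header=None that line is read as a data row, so the time columns end up non-numeric, which is the situation in which subtraction fails.

```python
import io
import pandas as pd

# Hypothetical sample data: the first line is a header row, as the answer assumes.
csv_text = (
    "UserId,UserMAC,HotspotID,StartTime,StopTime\n"
    "1,aa:bb,h1,1400000000,1400003600\n"
    "2,cc:dd,h2,1400001800,1400005400\n"
)
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']

# header=None: the header line is kept as a data row, so the columns hold strings
as_data = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

# header=0: the first line is consumed as the header and renamed to `cols`
as_header = pd.read_csv(io.StringIO(csv_text), header=0, names=cols)

print(as_data.StartTime.dtype)    # non-numeric: the header text is mixed into the column
print(as_header.StartTime.dtype)  # integer dtype

# Subtraction works once the columns are numeric
duration = as_header.StopTime - as_header.StartTime
print(duration.tolist())
```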

If there is still a problem, you need to_numeric, because some values in StartTime and StopTime are strings. These are parsed to NaN, replaced by 0, and finally the column is converted to int:

cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv('canada_mini_unixtime.csv', header=0, names=cols)
#print (df)

df['StartTime'] = pd.to_numeric(df['StartTime'], errors='coerce').fillna(0).astype(int)
df['StopTime'] =  pd.to_numeric(df['StopTime'], errors='coerce').fillna(0).astype(int)
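
As a small illustration of what the to_numeric chain does, here is a sketch on a made-up Series (the values and the names raw, parsed, and clean are assumptions for the example only):

```python
import pandas as pd

# Hypothetical column with one non-numeric entry mixed in
raw = pd.Series(['1400000000', 'StartTime', '1400003600'])

# errors='coerce' turns unparseable strings into NaN instead of raising
parsed = pd.to_numeric(raw, errors='coerce')
print(parsed.isna().tolist())

# NaN cannot be stored in a plain int column, so fill it before casting
clean = parsed.fillna(0).astype(int)
print(clean.tolist())
```

Note that a 0 fill corresponds to 1970-01-01 when converted with unit='s', so it would pull StartTime.min() back to the epoch. Depending on the data, dropping such rows with dropna may be preferable.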

This part is unchanged:

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

ix is deprecated in the latest version of pandas, so use loc instead, with the column name inside []:

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df.loc[np.abs(df.m - 2*row.start - interval) < df.d + interval, 'UserId']
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

print (r)
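
A note for readers on newer pandas: as far as I know, the ix indexer, the pd.datetime alias, and the dt.weekday_name accessor used in this thread were all removed in later pandas releases. The following is a sketch of the same loop on a tiny made-up dataset (three hypothetical sessions in Unix seconds), using loc, pd.Timestamp, and dt.day_name() instead. It is not the original poster's data or a definitive rewrite.

```python
import numpy as np
import pandas as pd

# Hypothetical sessions: A and B overlap hour 0, B also overlaps hour 1, C is in hour 2
df = pd.DataFrame({
    'UserId':    [1,    1,    2],
    'StartTime': [600,  3000, 7300],
    'StopTime':  [1200, 4000, 7400],
})
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# Three one-hour buckets starting at the Unix epoch
idx = pd.date_range('1970-01-01', periods=3, freq=pd.Timedelta(hours=1))
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that a boundary is not counted twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # same overlap test as in the answer, selecting the column through loc
    u = df.loc[np.abs(df.m - 2*row['start'] - interval) < df.d + interval, 'UserId']
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

# dt.day_name() stands in for the removed dt.weekday_name
r['Day'] = pd.to_datetime(r['start'], unit='s').dt.day_name().str[:3]
print(r)
```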

Answer 2

Answered by Ken Wei

df['d'] = df.StopTime - df.StartTime is attempting to subtract one string from another. I don't know what your data looks like, but chances are that you want to parse StopTime and StartTime as dates. Try

df = pd.read_csv(fn, header=None, names=cols, parse_dates=[3,4])

instead of df = pd.read_csv(fn, header=None, names=cols).
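
A minimal sketch of the parse_dates suggestion, assuming the two time columns hold date strings (the sample rows below are made up; the real file may look different):

```python
import io
import pandas as pd

# Hypothetical file whose time columns hold date strings, not Unix seconds
csv_text = (
    "1,aa:bb,h1,2016-05-01 10:00:00,2016-05-01 10:30:00\n"
    "2,cc:dd,h2,2016-05-01 11:15:00,2016-05-01 12:00:00\n"
)
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']

# parse_dates=[3, 4] converts the 4th and 5th columns to datetimes while reading
df = pd.read_csv(io.StringIO(csv_text), header=None, names=cols, parse_dates=[3, 4])

# Subtracting datetime columns yields Timedelta values instead of raising
df['d'] = df.StopTime - df.StartTime
print(df['d'].dt.total_seconds().tolist())
```

If the columns instead hold Unix seconds, as the unit='s' calls in the question suggest, converting them with pd.to_datetime(column, unit='s') after reading would be the equivalent step.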
