Importing financial data into Python Pandas using read_csv
Original question: http://stackoverflow.com/questions/27534746/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by invoker
I have a .csv with the following structure:
date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc
I am trying to read them into a pandas DataFrame using the following code:
import pandas as pd
data = pd.read_csv('my_awesome_csv.csv', parse_dates=[[0,1]],
                   infer_datetime_format=True)
This works just fine except that I would like to control the data types in each column. When I run the following code in the interpreter I discover that the numbers in quotes do not get recognized as numbers, either dollars or otherwise.
In [10]: data.dtypes
Out[10]:
date_begin_date_end    object
name                   object
name_code               int64
active_accounts        object  # Problem, I want this to be a number
transaction_amount     object  # Ditto, I want this to be a number (it's a dollar amount)
transaction_count      object  # Still a number!
dtype: object
I have done some snooping around in the Pandas csv documentation but haven't found what I'm looking for: how to declare numeric types for amounts that are saved in the csv as strings with commas and dollar signs. My ultimate goal here is to be able to do some arithmetic operations on the values in these columns.
Any thoughts?
Answered by unutbu
You could use vectorized string methods to parse those columns after the call to read_csv:
import pandas as pd
import decimal
D = decimal.Decimal

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

# Strip the thousands separators, then convert the counts to integers.
for col in ('active_accounts', 'transaction_count'):
    data[col] = data[col].str.replace(',', '', regex=False).astype(int)

# Keep only sign, digits and the decimal point, then build Decimal objects.
# regex=True is explicit because newer pandas treats the pattern literally by default.
data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '', regex=True).map(D))

print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object
print(data)
yields
    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456
1    2/1/2008 2/29/2008  Name_1       1001            43210
2    3/1/2008 3/31/2008  Name_1       1001           485079
3  12/1/2008 12/31/2008  Name_1       1001            87543
4    1/1/2008 1/31/2008  Name_2       1002           268456
5    2/1/2008 2/29/2008  Name_2       1002            53210

   transaction_amount  transaction_count
0          7890123.45              67890
1           987654.32             109876
2       1265789433.98             777888
3           432098987              87987
4           890123.45              97890
5           987654.32             109876
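To try the vectorized approach without a file on disk, here is a minimal self-contained sketch. The in-memory sample CSV_TEXT and the helper name load_financial_csv are made up for illustration and only mirror the layout described in the question. It passes regex=True explicitly, on the assumption that a recent pandas is installed (where str.replace treats the pattern literally by default), and it builds the Decimal objects with Series.map.

```python
import io
from decimal import Decimal

import pandas as pd

# Hypothetical two-row sample modeled on the layout in the question.
CSV_TEXT = '''date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
'''


def load_financial_csv(source):
    """Read the CSV, then clean the quoted numeric columns with .str methods."""
    data = pd.read_csv(source)
    for col in ('active_accounts', 'transaction_count'):
        # Remove thousands separators before converting to integers.
        data[col] = data[col].str.replace(',', '', regex=False).astype('int64')
    # Keep only sign, digits and the decimal point, then build Decimal objects.
    cleaned = data['transaction_amount'].str.replace(r'[^-+\d.]', '', regex=True)
    data['transaction_amount'] = cleaned.map(Decimal)
    return data


data = load_financial_csv(io.StringIO(CSV_TEXT))

# Arithmetic now works: exact decimal sum of the dollar amounts.
total = sum(data['transaction_amount'], Decimal(0))
print(data.dtypes)
print(total)
```

The transaction_amount column keeps the object dtype because it holds Python Decimal objects; if exact decimal arithmetic is not needed, converting with astype(float) instead would give a float64 column.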
PS. read_csv does have a converters parameter with which you could supply a function to parse the problematic columns. These functions are called once for each string, so if you have a lot of rows, that could require a lot of Python function calls. Handling the columns using vectorized string methods, as shown above, should be far quicker.
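import pandas as pd
import re
import decimal
D = decimal.Decimal

def make_parser(cls):
    # Build a converter that strips everything except sign, digits and
    # the decimal point, then passes the cleaned string to cls.
    def parse_commas(text):
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

# The converter keys are the positions of the columns in the file.
data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True,
                   converters={4: to_int, 5: to_decimal, 6: to_int})

print(data)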
yields
    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456
1    2/1/2008 2/29/2008  Name_1       1001            43210
2    3/1/2008 3/31/2008  Name_1       1001           485079
3  12/1/2008 12/31/2008  Name_1       1001            87543
4    1/1/2008 1/31/2008  Name_2       1002           268456
5    2/1/2008 2/29/2008  Name_2       1002            53210

   transaction_amount  transaction_count
0          7890123.45              67890
1           987654.32             109876
2       1265789433.98             777888
3           432098987              87987
4           890123.45              97890
5           987654.32             109876
and the values in the transaction_amount column are decimal.Decimal objects:
In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')
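As a quick sanity check, the sketch below runs the converters approach and the vectorized string approach on the same in-memory sample and confirms that they produce the same values. The sample CSV_TEXT and the helper names strip_to_number, to_int and to_decimal are hypothetical, and the converters are keyed by column name rather than by position, which read_csv also accepts.

```python
import io
import re
from decimal import Decimal

import pandas as pd

# Hypothetical two-row sample modeled on the layout in the question.
CSV_TEXT = '''date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
'''


def strip_to_number(text):
    """Drop everything except sign, digits and the decimal point."""
    return re.sub(r'[^-+\d.]', '', text)


def to_int(text):
    return int(strip_to_number(text))


def to_decimal(text):
    return Decimal(strip_to_number(text))


# Approach 1: converters, called once per cell while the file is parsed.
via_converters = pd.read_csv(
    io.StringIO(CSV_TEXT),
    converters={
        'active_accounts': to_int,
        'transaction_amount': to_decimal,
        'transaction_count': to_int,
    },
)

# Approach 2: read the raw strings first, then clean whole columns at once.
via_strings = pd.read_csv(io.StringIO(CSV_TEXT))
for col in ('active_accounts', 'transaction_count'):
    via_strings[col] = via_strings[col].str.replace(',', '', regex=False).astype('int64')
via_strings['transaction_amount'] = (
    via_strings['transaction_amount']
    .str.replace(r'[^-+\d.]', '', regex=True)
    .map(Decimal)
)

# Both approaches should agree cell for cell on the cleaned columns.
columns = ['active_accounts', 'transaction_amount', 'transaction_count']
same = all(via_converters[c].tolist() == via_strings[c].tolist() for c in columns)
print(same)
```

Which approach is faster depends on the data size and pandas version, so timing both on a realistic file (for example with timeit) is the safest way to choose.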

