Parsing key-value pairs into pandas DataFrame columns
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/27085116/
Parsing key-value pairs into DataFrame columns
Asked by Peter
I have key-value pairs that are nested row-wise in a pandas Series. What's the most efficient/optimized way to split them into individual columns? (I can unzip and do transformations, but what's the best way?)
I do not know:
- Key names beforehand
- Number of keys in each record
- Order of keys in each record
The rows are a list of unicode strings. Once extracted, values will always be bigint.
Input:
parsedSeries.head()
0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]
Expected output:
record key1 key2 key3
0 774 238 NAN
1 524 101 848
2 NAN NAN 843
3 232 457 298
Note that the input consists of lists containing Unicode strings of the format u"X=Y", where X is assumed to adhere to any required conventions for serving as the name of an attribute in Python, and Y can always be interpreted as an integer. For example, the following can be used to construct the sample data above:
import pandas
parsedSeries = pandas.Series([[u"key1=774", u"key2=238"],
                              [u"key1=524", u"key2=101", u"key3=848"],
                              [u"key3=843"],
                              [u"key1=232", u"key3=298", u"key2=457"]])
Accepted answer by DSM
The "best" solution probably involves not finding yourself in this situation in the first place. Most of the time when you have non-scalar quantities in a Series or DataFrame you've already taken a step in the wrong direction, because you can't really apply vector ops.
In any case, starting from your Series, you could do something like this:
>>> s = parsedSeries  # the Series shown in the question
>>> ds = [dict(w.split('=', 1) for w in x) for x in s]
>>> pd.DataFrame.from_records(ds)
key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
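Note that nothing above converts the split values, so the resulting columns hold strings. A possible follow-up, assuming a pandas version that provides pandas.to_numeric, is to coerce the columns after building the frame:
>>> df = pd.DataFrame.from_records(ds)
>>> df = df.apply(pd.to_numeric)  # object columns become numeric; the NaNs force float dtype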
Answered by ely
A very slight tweak to DSM's use of from_records, for the sake of processing the values as integers rather than strings.
def key_to_int(split_vals):
return (split_vals[0], int(split_vals[1]))
def dictify(row):
return dict(key_to_int(elem.split("=")) for elem in row)
pandas.DataFrame.from_records(parsedSeries.map(dictify))
gives
Out[518]:
key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
[4 rows x 3 columns]
where the values are integers (the columns are still float type due to the NaN values and NumPy's continued lack of support for an integer NaN value).
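For what it's worth, a sketch of keeping genuinely integer values despite the missing entries, assuming a pandas version recent enough to ship the nullable Int64 extension dtype:
df = pandas.DataFrame.from_records(parsedSeries.map(dictify))
df = df.astype("Int64")  # missing entries become <NA>; present values stay integers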
Answered by Phil Cooper
OK, the final answer for you might differ based on how exact your example is. Specifically, the regexes used to parse the data might need adjusting.
Let's do some imports and set up your data file:
import re
import pandas as pd
from StringIO import StringIO  # Python 2; on Python 3 use "from io import StringIO"
f = StringIO("""0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]""")
Now we're ready to start. First just some regex magic to get a dict representation of your rows:
# get the dicts
rows = [dict(re.findall('(key[0-9]*)=([0-9]*)',l)) for l in f]
# convert values to ints
rows = [dict((k,int(v)) for k,v in row.items()) for row in rows]
rows
Output:
[{'key1': 774, 'key2': 238},
{'key1': 524, 'key2': 101, 'key3': 848},
{'key3': 843},
{'key1': 232, 'key2': 457, 'key3': 298}]
That was just regex but you are just about there:
pd.DataFrame(rows)
Output:
key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
Convert to a one-liner if you like but I leave it in two steps so you can tweak the regex to suit your actual data file.
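For reference, the one-liner version might look roughly like this (same regex, same assumptions about the key names); the f.seek(0) rewinds the StringIO in case the two-step version above already consumed it:
f.seek(0)  # rewind the data file before re-reading it
df = pd.DataFrame([
    dict((k, int(v)) for k, v in re.findall('(key[0-9]*)=([0-9]*)', line))
    for line in f
])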

