Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/27085116/

Time: 2020-09-13 22:42:06  Source: igfitidea

Parsing key-value pairs into DataFrame columns

Tags: python, performance, parsing, pandas

Asked by Peter

I have key-value pairs that are nested row-wise in a pandas Series. What's the most efficient/optimized way to split them into individual columns? (I can unzip and do transformations, but what's the best way?)

I do not know:

  • Key names beforehand
  • Number of keys in each record
  • Order of keys in each record

The rows are a list of unicode strings. Once extracted, values will always be bigint.

Input:

parsedSeries.head()

0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]

Expected output:

record   key1   key2   key3
0        774    238    NAN
1        524    101    848
2        NAN    NAN    843
3        232    457    298

Note that the input consists of lists containing Unicode strings of the format u"X=Y", where X is assumed to adhere to any required conventions for serving as the name of an attribute in Python, and Y can always be interpreted as an integer. For example, the following can be used to construct the sample data above:

import pandas

# Sample data: each row is a list of "key=value" strings.
parsedSeries = pandas.Series([[u"key1=774", u"key2=238"],
                              [u"key1=524", u"key2=101", u"key3=848"],
                              [u"key3=843"],
                              [u"key1=232", u"key3=298", u"key2=457"]])

Accepted answer by DSM

The "best" solution probably involves not finding yourself in this situation in the first place. Most of the time when you have non-scalar quantities in a Series or DataFrame you've already taken a step in the wrong direction, because you can't really apply vector ops.

In any case, starting from your Series, you could do something like this:

>>> ds = [dict(w.split('=', 1) for w in x) for x in s]
>>> pd.DataFrame.from_records(ds)
  key1 key2 key3
0  774  238  NaN
1  524  101  848
2  NaN  NaN  843
3  232  457  298
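
For comparison, here is a rough sketch of a different route that stays inside pandas' own string and reshaping methods instead of building dicts in Python. It is not part of the answer above; it assumes a pandas version that provides Series.explode (0.25 or later), and the names pairs and wide are only illustrative. Whether it is actually faster would need to be measured on real data.

```python
import pandas as pd

s = pd.Series([["key1=774", "key2=238"],
               ["key1=524", "key2=101", "key3=848"],
               ["key3=843"],
               ["key1=232", "key3=298", "key2=457"]])

# One "key=value" string per row; the original record label is repeated in the index.
pairs = s.explode().str.split("=", n=1, expand=True)
pairs.columns = ["key", "value"]
pairs["value"] = pairs["value"].astype("int64")

# Reshape the long table back to one row per record and one column per key.
wide = pairs.pivot(columns="key", values="value")
print(wide)
```

Keys that are missing from a record show up as NaN in the result, so the columns end up as floats, just as with from_records.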

Answer by ely

Here is a very slight tweak to DSM's use of from_records, for the sake of processing the values as integers rather than strings.

def key_to_int(split_vals):
    return (split_vals[0], int(split_vals[1]))

def dictify(row):
    return dict(key_to_int(elem.split("=")) for elem in row)

pandas.DataFrame.from_records(parsedSeries.map(dictify))

gives

Out[518]: 
   key1  key2  key3
0   774   238   NaN
1   524   101   848
2   NaN   NaN   843
3   232   457   298

[4 rows x 3 columns]

where the values are integers (the columns are still of float type due to the NaN values and NumPy's continued lack of support for an integer NaN value).
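
If integer columns are needed despite the missing entries, one possible follow-up is pandas' nullable integer extension dtype. This is a minimal sketch that assumes pandas 0.24 or later, where the "Int64" dtype is available; it is not part of the original answer, and rows and df_int are illustrative names.

```python
import pandas as pd

rows = [{"key1": 774, "key2": 238},
        {"key1": 524, "key2": 101, "key3": 848},
        {"key3": 843},
        {"key1": 232, "key3": 298, "key2": 457}]

# Every column here has at least one missing key, so each becomes float64.
df = pd.DataFrame.from_records(rows)

# Cast to the nullable integer dtype; missing entries become <NA>.
df_int = df.astype("Int64")
print(df_int.dtypes)
```

The cast only succeeds when every non-missing value is a whole number, which holds here because the values were parsed with int.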

Answer by Phil Cooper

OK, the final answer for you might differ depending on how closely your example matches your real data. Specifically, the regexes used to parse the data might need adjusting.

Let's do some imports and set up your data file:

import re
import pandas as pd
# Python 3 location; on Python 2 use `from StringIO import StringIO` instead.
from io import StringIO

f = StringIO("""0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]""")

Now we're ready to start. First just some regex magic to get a dict representation of your rows:

# get the dicts
rows = [dict(re.findall('(key[0-9]*)=([0-9]*)',l)) for l in f]
# convert values to ints
rows = [dict((k,int(v)) for k,v in row.items()) for row in rows]
rows

Output:

[{'key1': 774, 'key2': 238},
 {'key1': 524, 'key2': 101, 'key3': 848},
 {'key3': 843},
 {'key1': 232, 'key2': 457, 'key3': 298}]

That was just regex, but you are almost there:

pd.DataFrame(rows)

Output:

  key1 key2 key3
0  774  238  NaN
1  524  101  848
2  NaN  NaN  843
3  232  457  298

Convert it to a one-liner if you like, but I left it in two steps so you can tweak the regex to suit your actual data file.
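
For reference, a one-liner version might look roughly like the sketch below. It assumes Python 3 (hence io.StringIO) and tightens the pattern from `*` to `+` so that an empty value can never reach int(); that change is an assumption about the data, not something the original answer requires.

```python
import re
import pandas as pd
from io import StringIO

f = StringIO("""0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]""")

# Parse each line into a {key: int} dict and build the frame in one expression.
df = pd.DataFrame([{k: int(v) for k, v in re.findall(r"(key[0-9]+)=([0-9]+)", line)}
                   for line in f])
print(df)
```

The leading record number on each line is ignored because it is not followed by "=", so only the key=value pairs are captured.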
