Subsetting data in Python

Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/3806878/

Time: 2020-08-18 12:49:52  Source: igfitidea

Tags: python, arrays, r, numpy, subset

Asked by user308827

I want to use the equivalent of the subset command in R for some Python code I am writing.

Here is my data:

col1    col2    col3    col4    col5
100002  2006    1.1 0.01    6352
100002  2006    1.2 0.84    304518
100002  2006    2   1.52    148219
100002  2007    1.1 0.01    6292
10002   2006    1.1 0.01    5968
10002   2006    1.2 0.25    104318
10002   2007    1.1 0.01    6800
10002   2007    4   2.03    25446
10002   2008    1.1 0.01    6408

I want to subset the data based on the contents of col1 and col2. (The unique values in col1 are 100002 and 10002, and in col2 are 2006, 2007 and 2008.)

This can be done in R using the subset command. Is there anything similar in Python?

Accepted answer by Joe Kington

While the iterator-based answers are perfectly fine, if you're working with numpy arrays (as you mention that you are) there are better and faster ways of selecting things:

import numpy as np
data = np.array([
        [100002, 2006, 1.1, 0.01, 6352],
        [100002, 2006, 1.2, 0.84, 304518],
        [100002, 2006, 2,   1.52, 148219],
        [100002, 2007, 1.1, 0.01, 6292],
        [10002,  2006, 1.1, 0.01, 5968],
        [10002,  2006, 1.2, 0.25, 104318],
        [10002,  2007, 1.1, 0.01, 6800],
        [10002,  2007, 4,   2.03, 25446],
        [10002,  2008, 1.1, 0.01, 6408]])

subset1 = data[data[:,0] == 100002]
subset2 = data[data[:,0] == 10002]

This yields

subset1:

array([[  1.00002e+05,   2.006e+03,   1.10e+00, 1.00e-02,   6.352e+03],
       [  1.00002e+05,   2.006e+03,   1.20e+00, 8.40e-01,   3.04518e+05],
       [  1.00002e+05,   2.006e+03,   2.00e+00, 1.52e+00,   1.48219e+05],
       [  1.00002e+05,   2.007e+03,   1.10e+00, 1.00e-02,   6.292e+03]])

subset2:

array([[  1.0002e+04,   2.006e+03,   1.10e+00, 1.00e-02,   5.968e+03],
       [  1.0002e+04,   2.006e+03,   1.20e+00, 2.50e-01,   1.04318e+05],
       [  1.0002e+04,   2.007e+03,   1.10e+00, 1.00e-02,   6.800e+03],
       [  1.0002e+04,   2.007e+03,   4.00e+00, 2.03e+00,   2.5446e+04],
       [  1.0002e+04,   2.008e+03,   1.10e+00, 1.00e-02,   6.408e+03]])

If you didn't know the unique values in the first column beforehand, you can use either numpy.unique (formerly numpy.unique1d, which has since been removed from numpy) or the builtin function set to find them.
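
A minimal sketch of both options, assuming the first column of the question's data has been copied into its own array (the names `col1`, `unique_sorted` and `unique_set` are illustrative):

```python
import numpy as np

# First column of the question's data, repeated here so the block runs on its own.
col1 = np.array([100002, 100002, 100002, 100002,
                 10002, 10002, 10002, 10002, 10002])

# np.unique returns the distinct values as a sorted array.
unique_sorted = np.unique(col1)

# set() also drops duplicates, but yields an unordered Python set.
unique_set = set(col1.tolist())

print(unique_sorted)
print(unique_set)
```

np.unique is usually the better fit when the result feeds back into array comparisons such as data[:,0] == value, since it stays in numpy and comes back sorted.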

Edit: I just realized that you wanted to select data where you have unique combinations of two columns... In that case, you might do something like this:

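import itertools

col1 = data[:,0]
col2 = data[:,1]

# Try every (col1, col2) pairing and keep those that match at least one row.
subsets = {}
for val1, val2 in itertools.product(np.unique(col1), np.unique(col2)):
    subset = data[(col1 == val1) & (col2 == val2)]
    if len(subset) > 0:  # skip combinations that never occur in the data
        subsets[(val1, val2)] = subset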

(I'm storing the subsets as a dict, with the key being a tuple of the combination... There are certainly other (and better, depending on what you're doing) ways to do this!)
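
One such alternative, sketched with only the standard library: walk the rows once and append each row to a list keyed by its (col1, col2) pair. The names `rows` and `subsets_by_key` are illustrative, and the rows here are plain tuples (a shortened copy of the question's data) rather than a numpy array.

```python
from collections import defaultdict

# A few rows from the question's data, as plain tuples.
rows = [
    (100002, 2006, 1.1, 0.01, 6352),
    (100002, 2006, 1.2, 0.84, 304518),
    (100002, 2007, 1.1, 0.01, 6292),
    (10002, 2006, 1.1, 0.01, 5968),
    (10002, 2008, 1.1, 0.01, 6408),
]

# Single pass: each row lands in the list for its (col1, col2) key.
subsets_by_key = defaultdict(list)
for row in rows:
    subsets_by_key[(row[0], row[1])].append(row)

for key, group in sorted(subsets_by_key.items()):
    print(key, len(group))
```

Only combinations that actually occur get a key, so no emptiness check is needed. The trade-off is that the groups are Python lists, not numpy arrays.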

Answer by wheaties

I'm not familiar with R or with how its subset command works, but based on your description I suggest you take a look at the groupby function in itertools. Given a key function that returns a value for each item, it forms groups based on that function's output. Taken from the groupby documentation:

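from itertools import groupby

# Assumption for illustration: group rows by their (col1, col2) values.
def keyfunc(row):
    return (row[0], row[1])

groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)  # groupby only merges adjacent equal keys
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)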

and then you've got your subsets. However, do be careful: the groups that groupby yields are not full-fledged lists. They're iterators, which is why the loop above copies each one with list(g).

I am assuming that your values are being returned on a row-by-row basis.
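
A small sketch of why the sort step matters, using made-up (col1, col2) pairs: groupby only merges adjacent items with equal keys, so unsorted input splits one key into several groups. The names `pairs` and `first_col` are illustrative.

```python
from itertools import groupby

pairs = [(100002, 2006), (10002, 2006), (100002, 2007), (10002, 2007)]

def first_col(pair):
    # Key function: group on the first element only.
    return pair[0]

# Without sorting, every run of equal keys becomes its own group.
unsorted_keys = [k for k, _ in groupby(pairs, first_col)]

# After sorting by the same key, each key appears exactly once.
sorted_groups = {k: list(g)
                 for k, g in groupby(sorted(pairs, key=first_col), first_col)}

print(unsorted_keys)
print(sorted_groups)
```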

Answer by ngroot

subset() in R is pretty much analogous to filter() in Python. As the reference notes, filter() is equivalent to a comprehension with an if clause, so the most concise and clear way to write the code might be

[ item for item in items if item.col2 == 2006 ] 

if, for example, your data rows were in an iterable called items.
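
To make the comparison concrete, here is a sketch that assumes each row is a namedtuple, so that item.col2 works. The `Row` type and the contents of `items` are illustrative, not from the question.

```python
from collections import namedtuple

# Hypothetical row type with one field per column.
Row = namedtuple("Row", ["col1", "col2", "col3", "col4", "col5"])

items = [
    Row(100002, 2006, 1.1, 0.01, 6352),
    Row(100002, 2007, 1.1, 0.01, 6292),
    Row(10002, 2006, 1.2, 0.25, 104318),
    Row(10002, 2008, 1.1, 0.01, 6408),
]

# List comprehension with a condition.
by_comprehension = [item for item in items if item.col2 == 2006]

# The same selection with filter(); in Python 3 it returns a lazy
# iterator, so wrap it in list() to materialise the rows.
by_filter = list(filter(lambda item: item.col2 == 2006, items))

print(by_comprehension == by_filter)
```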
