Subsetting data in Python

Question

Asked by user308827

I want to use the equivalent of the subset command in R for some Python code I am writing.

Here is my data:

col1    col2  col3  col4  col5
100002  2006  1.1   0.01  6352
100002  2006  1.2   0.84  304518
100002  2006  2     1.52  148219
100002  2007  1.1   0.01  6292
10002   2006  1.1   0.01  5968
10002   2006  1.2   0.25  104318
10002   2007  1.1   0.01  6800
10002   2007  4     2.03  25446
10002   2008  1.1   0.01  6408

I want to subset the data based on the contents of col1 and col2. (The unique values in col1 are 100002 and 10002, and in col2 are 2006, 2007 and 2008.)

This can be done in R using the subset command. Is there anything similar in Python?

Answer 1

Accepted answer by Joe Kington

While the iterator-based answers are perfectly fine, if you're working with numpy arrays (as you mention that you are) there are better and faster ways of selecting things:

import numpy as np
data = np.array([
        [100002, 2006, 1.1, 0.01, 6352],
        [100002, 2006, 1.2, 0.84, 304518],
        [100002, 2006, 2,   1.52, 148219],
        [100002, 2007, 1.1, 0.01, 6292],
        [10002,  2006, 1.1, 0.01, 5968],
        [10002,  2006, 1.2, 0.25, 104318],
        [10002,  2007, 1.1, 0.01, 6800],
        [10002,  2007, 4,   2.03, 25446],
        [10002,  2008, 1.1, 0.01, 6408]    ])

subset1 = data[data[:,0] == 100002]
subset2 = data[data[:,0] == 10002]

This yields

subset1:

array([[  1.00002e+05,   2.006e+03,   1.10e+00, 1.00e-02,   6.352e+03],
       [  1.00002e+05,   2.006e+03,   1.20e+00, 8.40e-01,   3.04518e+05],
       [  1.00002e+05,   2.006e+03,   2.00e+00, 1.52e+00,   1.48219e+05],
       [  1.00002e+05,   2.007e+03,   1.10e+00, 1.00e-02,   6.292e+03]])

subset2:

array([[  1.0002e+04,   2.006e+03,   1.10e+00, 1.00e-02,   5.968e+03],
       [  1.0002e+04,   2.006e+03,   1.20e+00, 2.50e-01,   1.04318e+05],
       [  1.0002e+04,   2.007e+03,   1.10e+00, 1.00e-02,   6.800e+03],
       [  1.0002e+04,   2.007e+03,   4.00e+00, 2.03e+00,   2.5446e+04],
       [  1.0002e+04,   2.008e+03,   1.10e+00, 1.00e-02,   6.408e+03]])

If you didn't know the unique values in the first column beforehand, you can use either numpy.unique (called numpy.unique1d in old NumPy releases) or the builtin function set to find them.
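
As a minimal sketch of both options, using only the first column of the data from the question (the names unique_sorted and unique_set are illustrative, not from the original answer):

```python
import numpy as np

# First column of the question's data
col1 = np.array([100002, 100002, 100002, 100002,
                 10002, 10002, 10002, 10002, 10002])

# np.unique returns the distinct values as a sorted array
unique_sorted = np.unique(col1)

# set() also works, but the result is unordered;
# tolist() converts NumPy scalars to plain Python ints first
unique_set = set(col1.tolist())

print(unique_sorted)
print(unique_set)
```

np.unique is probably the better fit when the result feeds back into array indexing, since it stays in NumPy and comes back sorted.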

Edit: I just realized that you wanted to select data where you have unique combinations of two columns... In that case, you might do something like this:

import itertools

col1 = data[:,0]
col2 = data[:,1]

subsets = {}
for val1, val2 in itertools.product(np.unique(col1), np.unique(col2)):
    subset = data[(col1 == val1) & (col2 == val2)]
    # Keep only the combinations that actually occur in the data
    if len(subset):
        subsets[(val1, val2)] = subset

(I'm storing the subsets as a dict, with the key being a tuple of the combination... There are certainly other (and better, depending on what you're doing) ways to do this!)
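
One such alternative, as a hedged sketch rather than the answer's own method: np.unique with axis=0 (available in NumPy 1.13 and later) finds only the combinations that actually occur, so there are no empty combinations to filter out. The names keys and inverse are illustrative.

```python
import numpy as np

data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2006, 1.2, 0.84, 304518],
    [100002, 2006, 2,   1.52, 148219],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002,  2006, 1.1, 0.01, 5968],
    [10002,  2006, 1.2, 0.25, 104318],
    [10002,  2007, 1.1, 0.01, 6800],
    [10002,  2007, 4,   2.03, 25446],
    [10002,  2008, 1.1, 0.01, 6408],
])

# Distinct (col1, col2) rows, plus the index of the distinct row
# that each original row maps to
keys, inverse = np.unique(data[:, :2], axis=0, return_inverse=True)
inverse = inverse.ravel()  # flatten in case the inverse comes back 2-D

# One subset per combination that is present in the data
subsets = {tuple(key): data[inverse == i] for i, key in enumerate(keys)}

for key, subset in subsets.items():
    print(key, subset.shape)
```

The keys are tuples of floats because the array is a float array; looking up subsets[(100002, 2006)] still works, since equal ints and floats hash the same in Python.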

Answer 2

Answer by wheaties

I'm not familiar with R or with how its subset command works, but based on your description I suggest you take a look at the groupby function in itertools. Given a key function that returns a value for each item, it forms groups of consecutive items that share the same key. Taken from the groupby documentation:

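from itertools import groupby

groups = []
uniquekeys = []
# keyfunc is any function that maps a row to the value to group by
data = sorted(data, key=keyfunc)  # groupby only merges adjacent items, so sort first
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)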

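and then you've got your subsets. However, be careful: the groups that groupby yields are iterators, not full-fledged lists, and each one stops being usable once groupby advances to the next group. That is why the loop copies each group into a list.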

I am assuming that your values are being returned on a row-by-row basis.
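
A small worked sketch of the groupby approach on the question's data, assuming the rows are plain tuples and that the grouping key is the (col1, col2) pair. The names rows, keyfunc and subsets are illustrative choices, not part of the original answer.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    (100002, 2006, 1.1, 0.01, 6352),
    (100002, 2006, 1.2, 0.84, 304518),
    (100002, 2006, 2,   1.52, 148219),
    (100002, 2007, 1.1, 0.01, 6292),
    (10002,  2006, 1.1, 0.01, 5968),
    (10002,  2006, 1.2, 0.25, 104318),
    (10002,  2007, 1.1, 0.01, 6800),
    (10002,  2007, 4,   2.03, 25446),
    (10002,  2008, 1.1, 0.01, 6408),
]

# Group key: the (col1, col2) pair of each row
keyfunc = itemgetter(0, 1)

# Sort by the same key first, because groupby only groups adjacent rows;
# list(group) materializes each group before groupby moves on
subsets = {
    key: list(group)
    for key, group in groupby(sorted(rows, key=keyfunc), key=keyfunc)
}

for key, group in subsets.items():
    print(key, len(group))
```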

Answer 3

Answer by ngroot

subset() in R is pretty much analogous to filter() in Python. As the reference notes, a call to filter() is equivalent to a comprehension with an if clause, so the most concise and clear way to write the code might be

[ item for item in items if item.col2 == 2006 ]

if, for example, your data rows were in an iterable called items.
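
To make the comparison concrete, here is a hedged sketch that assumes the rows are stored as namedtuples so that item.col2 works. The Row type and the variable names are hypothetical, chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical row type so that attribute access such as item.col2 works
Row = namedtuple("Row", ["col1", "col2", "col3", "col4", "col5"])

items = [
    Row(100002, 2006, 1.1, 0.01, 6352),
    Row(100002, 2006, 1.2, 0.84, 304518),
    Row(100002, 2006, 2,   1.52, 148219),
    Row(100002, 2007, 1.1, 0.01, 6292),
    Row(10002,  2006, 1.1, 0.01, 5968),
    Row(10002,  2006, 1.2, 0.25, 104318),
    Row(10002,  2007, 1.1, 0.01, 6800),
    Row(10002,  2007, 4,   2.03, 25446),
    Row(10002,  2008, 1.1, 0.01, 6408),
]

# List comprehension with an if clause
by_comprehension = [item for item in items if item.col2 == 2006]

# The same selection with filter(); in Python 3 it returns a lazy
# iterator, so wrap it in list() to get a list
by_filter = list(filter(lambda item: item.col2 == 2006, items))

print(len(by_comprehension), by_comprehension == by_filter)
```

Conditions on several columns combine with and, for example item.col1 == 10002 and item.col2 == 2006.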
