Original source: http://stackoverflow.com/questions/3806878/
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Subsetting data in Python
Asked by user308827
I want to use the equivalent of the subset command in R for some Python code I am writing.
Here is my data:
col1 col2 col3 col4 col5
100002 2006 1.1 0.01 6352
100002 2006 1.2 0.84 304518
100002 2006 2 1.52 148219
100002 2007 1.1 0.01 6292
10002 2006 1.1 0.01 5968
10002 2006 1.2 0.25 104318
10002 2007 1.1 0.01 6800
10002 2007 4 2.03 25446
10002 2008 1.1 0.01 6408
I want to subset the data based on the contents of col1 and col2. (The unique values in col1 are 100002 and 10002, and in col2 are 2006, 2007 and 2008.)
This can be done in R using the subset command. Is there anything similar in Python?
Accepted answer by Joe Kington
While the iterator-based answers are perfectly fine, if you're working with numpy arrays (as you mention that you are) there are better and faster ways of selecting things:
import numpy as np

data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2006, 1.2, 0.84, 304518],
    [100002, 2006, 2, 1.52, 148219],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002, 2006, 1.1, 0.01, 5968],
    [10002, 2006, 1.2, 0.25, 104318],
    [10002, 2007, 1.1, 0.01, 6800],
    [10002, 2007, 4, 2.03, 25446],
    [10002, 2008, 1.1, 0.01, 6408],
])

# Boolean-mask indexing: keep the rows whose first column matches
subset1 = data[data[:, 0] == 100002]
subset2 = data[data[:, 0] == 10002]
This yields
subset1:
array([[ 1.00002e+05, 2.006e+03, 1.10e+00, 1.00e-02, 6.352e+03],
[ 1.00002e+05, 2.006e+03, 1.20e+00, 8.40e-01, 3.04518e+05],
[ 1.00002e+05, 2.006e+03, 2.00e+00, 1.52e+00, 1.48219e+05],
[ 1.00002e+05, 2.007e+03, 1.10e+00, 1.00e-02, 6.292e+03]])
subset2:
array([[ 1.0002e+04, 2.006e+03, 1.10e+00, 1.00e-02, 5.968e+03],
[ 1.0002e+04, 2.006e+03, 1.20e+00, 2.50e-01, 1.04318e+05],
[ 1.0002e+04, 2.007e+03, 1.10e+00, 1.00e-02, 6.800e+03],
[ 1.0002e+04, 2.007e+03, 4.00e+00, 2.03e+00, 2.5446e+04],
[ 1.0002e+04, 2.008e+03, 1.10e+00, 1.00e-02, 6.408e+03]])
If you didn't know the unique values in the first column beforehand, you can use either numpy.unique (called numpy.unique1d in old NumPy releases) or the builtin function set to find them.
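As a minimal sketch of that step, assuming the first column is available as a NumPy array (the name `col1` is chosen here for illustration), both approaches find the same distinct values but return different container types:

```python
import numpy as np

# First column of the question's data (copied by hand for this sketch)
col1 = np.array([100002, 100002, 100002, 100002,
                 10002, 10002, 10002, 10002, 10002])

# np.unique returns a sorted NumPy array of the distinct values
unique_np = np.unique(col1)

# set() returns an unordered Python set; tolist() converts to plain ints first
unique_set = set(col1.tolist())
```

`np.unique` is usually the more convenient choice when the result feeds back into array comparisons, since it stays in NumPy and comes back sorted.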
Edit: I just realized that you wanted to select data where you have unique combinations of two columns... In that case, you might do something like this:
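import itertools

col1 = data[:, 0]
col2 = data[:, 1]
subsets = {}
for val1, val2 in itertools.product(np.unique(col1), np.unique(col2)):
    subset = data[(col1 == val1) & (col2 == val2)]
    # Keep only the combinations that actually occur in the data.
    # (len() is used rather than np.any() so that a non-empty subset
    # consisting entirely of zeros would not be dropped.)
    if len(subset) > 0:
        subsets[(val1, val2)] = subset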
(I'm storing the subsets as a dict, with the key being a tuple of the combination... There are certainly other (and better, depending on what you're doing) ways to do this!)
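One of those other ways, sketched here under the assumption of NumPy 1.13 or later (where `np.unique` accepts an `axis` argument), is to ask NumPy directly for the (col1, col2) pairs that occur, instead of testing every possible combination. The variable names `pairs` and `mask` are illustrative only:

```python
import numpy as np

data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2006, 1.2, 0.84, 304518],
    [100002, 2006, 2, 1.52, 148219],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002, 2006, 1.1, 0.01, 5968],
    [10002, 2006, 1.2, 0.25, 104318],
    [10002, 2007, 1.1, 0.01, 6800],
    [10002, 2007, 4, 2.03, 25446],
    [10002, 2008, 1.1, 0.01, 6408],
])

# Distinct rows of the first two columns, i.e. the pairs that really occur
pairs = np.unique(data[:, :2], axis=0)

subsets = {}
for val1, val2 in pairs:
    # Rows where both columns match this pair
    mask = (data[:, 0] == val1) & (data[:, 1] == val2)
    # Convert the float keys to ints so lookups read naturally
    subsets[(int(val1), int(val2))] = data[mask]
```

A single combination can then be looked up as `subsets[(10002, 2007)]`, which is roughly what `subset(data, col1 == 10002 & col2 == 2007)` would give in R.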
Answer by wheaties
I'm not familiar with R or with how its subset command works, but based on your description I suggest you take a look at the groupby function in itertools. Given a key function that returns a value for each item, groupby forms groups of consecutive items that share the same key, which is why the data is sorted by that key first. Taken from the groupby documentation:
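from itertools import groupby

groups = []
uniquekeys = []
# keyfunc is the function that computes the grouping key for each item
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))  # Store group iterator as a list
    uniquekeys.append(k)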
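and then you've got your subsets. However, do be careful: the groups that groupby yields are not full-fledged lists. They are iterators that share the underlying iterable, so each one has to be consumed (for example with list(g), as above) before moving on to the next group.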
I am assuming that your values are being returned on a row-by-row basis.
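Applied to the question's data, the groupby approach might look like the sketch below. It assumes the rows are plain tuples and uses `operator.itemgetter(0, 1)` as the key function so that rows are grouped on the (col1, col2) pair; the names `rows` and `subsets` are illustrative:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    (100002, 2006, 1.1, 0.01, 6352),
    (100002, 2006, 1.2, 0.84, 304518),
    (100002, 2006, 2, 1.52, 148219),
    (100002, 2007, 1.1, 0.01, 6292),
    (10002, 2006, 1.1, 0.01, 5968),
    (10002, 2006, 1.2, 0.25, 104318),
    (10002, 2007, 1.1, 0.01, 6800),
    (10002, 2007, 4, 2.03, 25446),
    (10002, 2008, 1.1, 0.01, 6408),
]

# Key function: the (col1, col2) pair of each row
keyfunc = itemgetter(0, 1)

subsets = {}
# groupby only merges adjacent items, so sort by the same key first
for key, group in groupby(sorted(rows, key=keyfunc), key=keyfunc):
    subsets[key] = list(group)  # materialize the iterator before moving on
```

Because `sorted` is stable, rows within each group keep their original relative order.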
Answer by ngroot
subset() in R is pretty much analogous to filter() in Python. As the reference notes, filter(function, iterable) is equivalent to a comprehension with an if clause, so the most concise and clear way to write the code might be
[ item for item in items if item.col2 == 2006 ]
if, for example, your data rows were in an iterable called items and each row exposed its columns as attributes such as col2.
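To make the comparison concrete, here is a small sketch that assumes the rows are stored as namedtuples. The `Row` type is hypothetical and only a few of the question's rows are included; it is there so that `item.col2` works as in the comprehension above:

```python
from collections import namedtuple

# Hypothetical row type so that columns are available as attributes
Row = namedtuple("Row", ["col1", "col2", "col3", "col4", "col5"])

items = [
    Row(100002, 2006, 1.1, 0.01, 6352),
    Row(100002, 2007, 1.1, 0.01, 6292),
    Row(10002, 2006, 1.1, 0.01, 5968),
    Row(10002, 2008, 1.1, 0.01, 6408),
]

# List comprehension with an if clause
with_comprehension = [item for item in items if item.col2 == 2006]

# The same selection written with filter(); in Python 3 it returns an iterator
with_filter = list(filter(lambda item: item.col2 == 2006, items))
```

Both forms select the same rows; the comprehension tends to read more clearly when the condition is a simple expression.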

