Original question: http://stackoverflow.com/questions/23170721/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Adding a Pandas Dataframe-column to a new dataframe
Asked by FooBar
Using Pandas, I have some data that I want to add to my "results" dataframe. That is, I have
naics = someData
which can look like this:
    indnaics  ind1990
89     81393      873
However, it can have more than one row. I want to add these rows to my results dataframe, together with a variable called year. If there is more than one row, all rows should get the same year value. This is what I have tried so far:
for job in jobs:
    df2 = iGetThisFromJob()
    years = df2.year.unique()
    naics = iGetThisFromJob()
    if len(naics) == 0:
        continue
    for year in years:
        wages = df2.incwage[df2.year == year]
        # Add all the data to results, this is how I try it
        rows = pd.DataFrame([dict(year=year, incwage=mean(wages), )])
        # I also want to add the column indnaics from my naics
        rows['naics'] = naics.indnaics
        results = results.append(rows, ignore_index=True)
However, although naics.indnaics contains data, I cannot add it to the rows object this way.
naics.indnaics
Out[1052]:
89    81393
rows['naics'] = naics.indnaics
rows
Out[1051]:
        incwage  year  naics
0  45853.061224  2002    NaN
If there is anything else wrong with my code, please tell me. I'm only beginning to learn pandas.
Thanks!
/edit Expected output:
        incwage  year  naics
0  45853.061224  2002  81393
0  45853.061224  2002  12312
/edit Suggested solution:
index = arange(0, len(naics))
columns = ['year', 'incwage', 'naics']
rows = pd.DataFrame(index=index, columns=columns)
rows.year = year
rows.incwage = mean(wages)
rows.naics = naics.indnaics.values
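As a rough, self-contained sketch of the suggested solution above: the data values below are hypothetical stand-ins for the question's naics and wages, and pd.concat is swapped in for results.append, since DataFrame.append was removed in pandas 2.0. Passing scalars next to an array in the DataFrame constructor broadcasts the scalars to every row.

```python
import pandas as pd

# Hypothetical stand-ins for the question's data (values are made up).
naics = pd.DataFrame(
    {"indnaics": [81393, 12312], "ind1990": [873, 873]},
    index=[89, 90],
)
wages = pd.Series([40000.0, 50000.0])
year = 2002

frames = []  # collect one small frame per (job, year) and concatenate once

# Scalars (year, mean wage) are broadcast to the length of the array.
# .to_numpy() drops the 89/90 labels, so no index alignment takes place.
rows = pd.DataFrame(
    {
        "year": year,
        "incwage": wages.mean(),
        "naics": naics["indnaics"].to_numpy(),
    }
)
frames.append(rows)

results = pd.concat(frames, ignore_index=True)
print(results)
```

Collecting the per-iteration frames in a list and concatenating once at the end avoids copying the growing results frame on every pass through the loop.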
Answered by joris
You get a NaN value because the indices do not match: in rows['naics'] = naics.indnaics, rows has index 0 while naics.indnaics has index 89, and the assignment tries to align the two on their index labels.
You could, for example, solve that by taking only the values (e.g. naics.indnaics.values). With a toy example:
In [30]: df = pd.DataFrame({'A':[0], 'B':[1]})

In [31]: df
Out[31]:
   A  B
0  0  1

In [32]: s = pd.Series([2], index=[83])

In [33]: s
Out[33]:
83    2
dtype: int64

In [35]: df['new_column'] = s

In [36]: df
Out[36]:
   A  B  new_column
0  0  1         NaN

In [37]: df['new_column'] = s.values

In [38]: df
Out[38]:
   A  B  new_column
0  0  1           2
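Besides taking the raw values, another way to sidestep the alignment is to give the series an index that matches the dataframe. The following is a minimal sketch on the same toy data, assuming the rows should simply be paired by position; the names mismatched and realigned are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [0], "B": [1]})   # row label 0
s = pd.Series([2], index=[83])            # row label 83

# Labels differ, so label alignment finds no match and fills in NaN.
mismatched = df.assign(new_column=s)

# Dropping the Series' labels gives it a fresh 0..n-1 index that matches df.
realigned = df.assign(new_column=s.reset_index(drop=True))

print(mismatched)
print(realigned)
```

This only lines up when the dataframe itself has a default 0..n-1 index; otherwise taking the values is the safer choice.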
If you want to add a series that may have more values, there are a couple of options I can think of.
For example, first reindex the dataframe to the length of the series:
In [75]: s
Out[75]:
83    2
84    4
dtype: int64

In [76]: df
Out[76]:
   A  B
0  0  1

In [77]: df = df.reindex(np.zeros(len(s)))

In [78]: df
Out[78]:
   A  B
0  0  1
0  0  1

In [79]: df['new_column'] = s.values

In [80]: df
Out[80]:
   A  B  new_column
0  0  1           2
0  0  1           4
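The same row-repeating idea can be sketched as a standalone script. This variant is an assumption on my part rather than the answer's exact method: it repeats the single row with Index.repeat instead of reindex, and then resets the index so the result has unique labels.

```python
import pandas as pd

df = pd.DataFrame({"A": [0], "B": [1]})
s = pd.Series([2, 4], index=[83, 84])

# Repeat the single row label once per element of s, then select by label.
expanded = df.loc[df.index.repeat(len(s))].reset_index(drop=True)

# Assign by position, ignoring the 83/84 labels of the Series.
expanded["new_column"] = s.to_numpy()

print(expanded)
```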
Or the other way around: add the dataframe to the series (after first converting the series to a dataframe):
In [90]: ss = s.to_frame().set_index(np.array([0,0]))

In [91]: ss[df.columns] = df

In [92]: ss
Out[92]:
   0  A  B
0  2  0  1
0  4  0  1

[2 rows x 3 columns]
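A hedged variant of this second option that avoids the duplicated index: start from the series and broadcast the single dataframe row onto it as scalars. The column name new_column and the variable ss follow the toy example; the use of assign is my own choice, not something the answer prescribes.

```python
import pandas as pd

df = pd.DataFrame({"A": [0], "B": [1]})
s = pd.Series([2, 4], index=[83, 84])

# Turn the Series into a one-column frame with a fresh 0..n-1 index.
ss = s.to_frame("new_column").reset_index(drop=True)

# Broadcast the values of df's only row to every row of ss.
ss = ss.assign(**df.iloc[0].to_dict())

print(ss)
```

This assumes the dataframe has exactly one row; with several rows there is no single set of scalars to broadcast.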

