Python: How to make good reproducible pandas examples
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20109391/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
How to make good reproducible pandas examples
Asked by Marius
Having spent a decent amount of time watching both the r and pandas tags on SO, the impression I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.
How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
But many example datasets need more complicated structure, e.g.:
- datetime indices for time series data
- Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
- MultiIndex or Panel data
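As an aside, a workable stand-in for R's expand.grid() is itertools.product; here is a minimal sketch (the helper name and column values are made up for illustration):

```python
import itertools
import pandas as pd

def expand_grid(**columns):
    # One row per combination of the supplied values, expand.grid()-style
    rows = list(itertools.product(*columns.values()))
    return pd.DataFrame(rows, columns=list(columns))

grid = expand_grid(color=['red', 'green'], size=[1, 2, 3])
# 2 colors x 3 sizes -> 6 rows
```

pd.MultiIndex.from_product covers a similar need when the combinations are destined for an index.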
For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your data structure?
Accepted answer by Andy Hayden
Note: The ideas here are pretty generic for Stack Overflow, indeed for questions in general.
Disclaimer: Writing a good question is HARD.
The Good:
do include a small* example DataFrame, either as runnable code:

In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

or make it "copy and pasteable" using pd.read_clipboard(sep='\s\s+'). You can format the text for Stack Overflow highlighting with Ctrl+K (or prepend four spaces to each line), or place three tildes above and below your code with the code unindented:

In [2]: df
Out[2]:
   A  B
0  1  2
1  1  3
2  4  6

test pd.read_clipboard(sep='\s\s+') yourself.

* I really do mean small; the vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

* Every rule has an exception. The obvious one is performance issues (in which case definitely use %timeit and possibly %prun), where you should generate a large frame (consider using np.random.seed so we have the exact same frame):

df = pd.DataFrame(np.random.randn(100000000, 10))

Saying that, "make this code fast for me" is not strictly on topic for the site...

write out the outcome you desire (similarly to above):

In [3]: iwantthis
Out[3]:
   A  B
0  1  5
1  4  6

Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

do show the code you've tried:

In [4]: df.groupby('A').sum()
Out[4]:
   B
A
1  5
4  6

But say what's incorrect: the A column is in the index rather than a column.

do show you've done some research (search the docs, search Stack Overflow), and give a summary:

"The docstring for sum simply states 'Compute sum of group values'. The groupby docs don't give any examples for this."

Aside: the answer here is to use df.groupby('A', as_index=False).sum().

if it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure**:

df['date'] = pd.to_datetime(df['date'])  # this column ought to be dates

** Sometimes this is the issue itself: they were strings.
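To make the aside above concrete, here is a quick sketch, using the same toy frame, showing as_index=False keeping A as a regular column:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# A stays a regular column instead of becoming the index
out = df.groupby('A', as_index=False).sum()
#    A  B
# 0  1  5
# 1  4  6
```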
The Bad:
don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with pandas' default display, but nonetheless annoying:

In [11]: df
Out[11]:
     C
A B
1 2  3
  2  6

The correct way is to include an ordinary DataFrame with a set_index call:

In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])

In [13]: df
Out[13]:
     C
A B
1 2  3
  2  6

do provide insight into what it is when giving the outcome you want:

   B
A
1  1
5  0

Be specific about how you got the numbers (what are they)... double check they're correct.

If your code throws an error, do include the entire stack trace (this can be edited out later if it's too noisy). Show the line number (and the corresponding line of your code which it's raising against).
The Ugly:
don't link to a csv we don't have access to (ideally don't link to an external source at all...):

df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options

Most data is proprietary, we get that: make up similar data and see if you can reproduce the problem (something small).
don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.
Essays are bad, it's easier with small examples.
don't include 10+ (100+??) lines of data munging before getting to your actual question.
Please, we see enough of this in our day jobs. We want to help, but not like this....
Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.
Anyways, have fun learning Python, NumPy and Pandas!
Answered by JohnE
How to create sample datasets
This is mainly to expand on @AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) numpy give you a variety of tools for this, such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.
After importing numpy and pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.
import numpy as np
import pandas as pd
np.random.seed(123)
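A small sketch of why the seed matters: resetting it replays exactly the same random stream, so readers see the same frame you do.

```python
import numpy as np

np.random.seed(123)
first = np.random.randn(3)

np.random.seed(123)  # same seed -> identical draws
second = np.random.randn(3)

assert (first == second).all()
```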
A kitchen sink example
Here's an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:
df = pd.DataFrame({
# some ways to create random data
'a':np.random.randn(6),
'b':np.random.choice( [5,7,np.nan], 6),
'c':np.random.choice( ['panda','python','shark'], 6),
# some ways to create systematic groups for indexing or groupby
# this is similar to r's expand.grid(), see note 2 below
'd':np.repeat( range(3), 2 ),
'e':np.tile( range(2), 3 ),
# a date range and set of random dates
'f':pd.date_range('1/1/2011', periods=6, freq='D'),
'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
freq='D'), 6, replace=False)
})
This produces:
a b c d e f g
0 -1.085631 NaN panda 0 0 2011-01-01 2011-08-12
1 0.997345 7 shark 0 1 2011-01-02 2011-11-10
2 0.282978 5 panda 1 0 2011-01-03 2011-10-30
3 -1.506295 7 python 1 1 2011-01-04 2011-09-07
4 -0.578600 NaN shark 2 0 2011-01-05 2011-02-27
5 1.651437 7 python 2 1 2011-01-06 2011-02-03
Some notes:
- np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in the ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
- For a more direct replacement for r's expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
- You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
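A minimal sketch of the np.repeat/np.tile pattern described above, producing every (d, e) combination much like expand.grid():

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame({
    'd': np.repeat(range(3), 2),  # 0 0 1 1 2 2
    'e': np.tile(range(2), 3),    # 0 1 0 1 0 1
})
# the 6 rows cover all 3 x 2 combinations exactly once
```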
Fake stock market data
In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here's a short example that combines np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:
stocks = pd.DataFrame({
'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
'price':(np.random.randn(100).cumsum() + 10) })
Now we have a sample dataset with 100 lines (25 dates per ticker), but we have only used 4 lines to do it, making it easy for everyone else to reproduce without copying and pasting 100 lines of code. You can then display subsets of the data if it helps to explain your question:
>>> stocks.head(5)
date price ticker
0 2011-01-01 9.497412 aapl
1 2011-01-02 10.261908 aapl
2 2011-01-03 9.438538 aapl
3 2011-01-04 9.515958 aapl
4 2011-01-05 7.554070 aapl
>>> stocks.groupby('ticker').head(2)
date price ticker
0 2011-01-01 9.497412 aapl
1 2011-01-02 10.261908 aapl
25 2011-01-01 8.277772 goog
26 2011-01-02 7.714916 goog
50 2011-01-01 5.613023 yhoo
51 2011-01-02 6.397686 yhoo
75 2011-01-01 11.736584 msft
76 2011-01-02 11.944519 msft
Answered by Alexander
The Challenge

One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don't have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you'd like help with, you can easily help yourself by providing data that others can then use to help solve your problem.
The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.
Please clearly state your question upfront.

After taking the time to write your question and any sample code, try to read it and provide an 'Executive Summary' for your reader which summarizes the problem and clearly states the question.
Original question:
原始问题:
I have this data...
I want to do this...
I want my result to look like this...
However, when I try to do [this], I get the following problem...
I've tried to find solutions by doing [this] and [that].
How do I fix it?
Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.
Revised Question:
Question: How can I do [this]?
I've tried to find solutions by doing [this] and [that].
When I've tried to do [this], I get the following problem...
I'd like my final results to look like this...
Here is some minimal code that can reproduce my problem...
And here is how to recreate my sample data:
df = pd.DataFrame({'A': [...], 'B': [...], ...})
PROVIDE SAMPLE DATA IF NEEDED!!!
Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:
stocks = pd.DataFrame({
'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
'price':(np.random.randn(100).cumsum() + 10) })
If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):
>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
1: Timestamp('2011-01-01 00:00:00'),
2: Timestamp('2011-01-01 00:00:00'),
3: Timestamp('2011-01-01 00:00:00'),
4: Timestamp('2011-01-02 00:00:00')},
'price': {0: 10.284260107718254,
1: 11.930300761831457,
2: 10.93741046217319,
3: 10.884574289565609,
4: 11.78005850418319},
'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}
>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
1: Timestamp('2011-01-01 00:00:00'),
2: Timestamp('2011-01-01 00:00:00'),
3: Timestamp('2011-01-01 00:00:00'),
4: Timestamp('2011-01-02 00:00:00'),
5: Timestamp('2011-01-24 00:00:00'),
6: Timestamp('2011-01-25 00:00:00'),
7: Timestamp('2011-01-25 00:00:00'),
8: Timestamp('2011-01-25 00:00:00'),
9: Timestamp('2011-01-25 00:00:00')},
'price': {0: 10.284260107718254,
1: 11.930300761831457,
2: 10.93741046217319,
3: 10.884574289565609,
4: 11.78005850418319,
5: 10.017209045035006,
6: 10.57090128181566,
7: 11.442792747870204,
8: 11.592953372130493,
9: 12.864146419530938},
'ticker': {0: 'aapl',
1: 'aapl',
2: 'aapl',
3: 'aapl',
4: 'aapl',
5: 'msft',
6: 'msft',
7: 'msft',
8: 'msft',
9: 'msft'}}
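On the answerer's side, a pasted to_dict output rebuilds with a single constructor call. A sketch with made-up values (note the Timestamp import so the pasted literals evaluate):

```python
import pandas as pd
from pandas import Timestamp  # lets pasted Timestamp(...) literals evaluate

# hypothetical dict as it might be pasted from a question
d = {'date': {0: Timestamp('2011-01-01'), 1: Timestamp('2011-01-02')},
     'price': {0: 10.28, 1: 11.93},
     'ticker': {0: 'aapl', 1: 'aapl'}}

df = pd.DataFrame(d)  # one call recreates the frame
```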
You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):
stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date 100 non-null datetime64[ns]
price 100 non-null float64
ticker 100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
NOTE: If your DataFrame has a MultiIndex:
If your DataFrame has a MultiIndex, you must first reset the index before calling to_dict. You then need to recreate the index using set_index:
# MultiIndex example. First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
price
date ticker
2011-01-01 aapl 10.284260
aapl 11.930301
aapl 10.937410
aapl 10.884574
2011-01-02 aapl 11.780059
...
# After resetting the index and passing the DataFrame to `to_dict`, make sure to use
# `set_index` to restore the original MultiIndex. This DataFrame can then be restored.
d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
price
date ticker
2011-01-01 aapl 10.284260
aapl 11.930301
aapl 10.937410
aapl 10.884574
2011-01-02 aapl 11.780059
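The round-trip above can be checked end-to-end; here is a compact sketch (with a smaller made-up frame) asserting the rebuilt MultiIndex frame matches the original:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
stocks = pd.DataFrame({
    'ticker': np.repeat(['aapl', 'goog'], 2),
    'date': np.tile(pd.date_range('2011-01-01', periods=2), 2),
    'price': np.random.randn(4),
})
df = stocks.set_index(['date', 'ticker'])

# reset_index -> to_dict -> DataFrame -> set_index round-trips the frame
d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
assert df.equals(df_new)
```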
Answered by piRSquared
Diary of an Answerer
My best advice for asking questions would be to play on the psychology of the people who answer questions. Being one of those people, I can give insight into why I answer certain questions and why I don't answer others.
Motivations
I'm motivated to answer questions for several reasons
- Stackoverflow.com has been a tremendously valuable resource to me. I wanted to give back.
- In my efforts to give back, I've found this site to be an even more powerful resource than before. Answering questions is a learning experience for me and I like to learn. Read this answer and comment from another vet. This kind of interaction makes me happy.
- I like points!
- See #3.
- I like interesting problems.
All my purest intentions are great and all, but I get that satisfaction whether I answer 1 question or 30. What drives my choices for which questions to answer has a huge component of point maximization.
I'll also spend time on interesting problems but that is few and far between and doesn't help an asker who needs a solution to a non-interesting question. Your best bet to get me to answer a question is to serve that question up on a platter ripe for me to answer it with as little effort as possible. If I'm looking at two questions and one has code I can copy paste to create all the variables I need... I'm taking that one! I'll come back to the other one if I have time, maybe.
Main Advice
Make it easy for the people answering questions.
- Provide code that creates variables that are needed.
- Minimize that code. If my eyes glaze over as I look at the post, I'm on to the next question or getting back to whatever else I'm doing.
- Think about what you're asking and be specific. We want to see what you've done because natural languages (English) are inexact and confusing. Code samples of what you've tried help resolve inconsistencies in a natural language description.
- PLEASE show what you expect!!! I have to sit down and try things. I almost never know the answer to a question without trying some things out. If I don't see an example of what you're looking for, I might pass on the question because I don't feel like guessing.
Your reputation is more than just your reputation.
I like points (I mentioned that above). But those points aren't really really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest and I hope others can see that. What that means for an asker is, we remember the behaviors of askers. If you don't select answers and upvote good answers, I remember. If you behave in ways I don't like or in ways I do like, I remember. This also plays into which questions I'll answer.
Anyway, I can probably go on, but I'll spare all of you who actually read this.
Answered by sds
Here is my version of dput - the standard R tool to produce reproducible reports - for Pandas DataFrames. It will probably fail for more complex frames, but it seems to do the job in simple cases:
import pandas as pd

def dput(x):
    # Return copy-pasteable code that recreates a Series or DataFrame
    if isinstance(x, pd.Series):
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (list(x), x.dtype, x.index)
    if isinstance(x, pd.DataFrame):
        return "pd.DataFrame({" + ", ".join(
            "'%s': %s" % (c, dput(x[c])) for c in x.columns
        ) + ("}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput", type(x), x)
now,
df = pd.DataFrame({'a': [1, 2, 3, 4, 2, 1, 3, 1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a, "foo")
assert du.equals(eval(dput(du)))
di = df.copy()  # copy, so changing the index below doesn't also alter df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))
Note that this produces a much more verbose output than DataFrame.to_dict, e.g.,
pd.DataFrame({ 'foo_1':pd.Series([1, 0, 0, 0, 0, 1, 0, 1],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)), 'foo_2':pd.Series([0, 1, 0, 0, 1, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)), 'foo_3':pd.Series([0, 0, 1, 0, 0, 0, 1, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)), 'foo_4':pd.Series([0, 0, 0, 1, 0, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1))}, index=pd.RangeIndex(start=0, stop=8, step=1))
vs
{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1}, 'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0}, 'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0}, 'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}
for du above, but it preserves column types. E.g., in the above test case,
du.equals(pd.DataFrame(du.to_dict()))
==> False
because du.dtypes is uint8 and pd.DataFrame(du.to_dict()).dtypes is int64.
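A minimal sketch of that dtype loss (behavior as of the pandas versions discussed here; exact dtypes may vary by version):

```python
import pandas as pd

du = pd.DataFrame({'foo': [1, 0, 1]}, dtype='uint8')

# to_dict hands back plain Python ints, so rebuilding widens uint8 to int64
rebuilt = pd.DataFrame(du.to_dict())
assert str(du['foo'].dtype) == 'uint8'
assert str(rebuilt['foo'].dtype) == 'int64'
```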

