Python: linear regression with a pandas DataFrame

Question

Asked by TimStuart

I have a dataframe in pandas that I'm using to produce a scatterplot, and want to include a regression line for the plot. Right now I'm trying to do this with polyfit.

Here's my code:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from numpy import *

table1 = pd.DataFrame.from_csv('upregulated_genes.txt', sep='\t', header=0, index_col=0)
table2 = pd.DataFrame.from_csv('misson_genes.txt', sep='\t', header=0, index_col=0)
table1 = table1.join(table2, how='outer')

table1 = table1.dropna(how='any')
table1 = table1.replace('#DIV/0!', 0)

# scatterplot
plt.scatter(table1['log2 fold change misson'], table1['log2 fold change'])
plt.ylabel('log2 expression fold change')
plt.xlabel('log2 expression fold change Misson et al. 2005')
plt.title('Root Early Upregulated Genes')
plt.axis([0,12,-5,12])

# this is the part I'm unsure about
regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1)

plt.show()

But I get the following error:

TypeError: cannot concatenate 'str' and 'float' objects

Does anyone know where I'm going wrong here? I'm also unsure how to add the regression line to my plot. Any other general comments on my code would also be hugely appreciated, I'm still a beginner.

Answer 1

Accepted answer by Dan Allan

Instead of replacing '#DIV/0!' by hand, force the data to be numeric. This does two things at once: it ensures that the result has a numeric dtype (not str), and it substitutes NaN for any entries that cannot be parsed as a number. Example:

In [5]: Series([1, 2, 'blah', '#DIV/0!']).convert_objects(convert_numeric=True)
Out[5]: 
0     1
1     2
2   NaN
3   NaN
dtype: float64
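
Note that convert_objects was later deprecated and then removed in pandas 1.0. A minimal sketch of the same idea on current pandas, assuming pd.to_numeric with errors='coerce' as the replacement: the small table below is made up for illustration (only the column names come from the question). Coercing first and dropping NaN rows afterwards matters, because entries like '#DIV/0!' only become NaN once they have been coerced.

```python
import pandas as pd

# Hypothetical stand-in for the joined table; values are invented.
table = pd.DataFrame({
    "log2 fold change misson": ["1.5", "2.0", "#DIV/0!", "4.0"],
    "log2 fold change": [1.0, "blah", 3.0, 4.5],
})

# Coerce every column to a numeric dtype; unparseable entries become NaN.
numeric = table.apply(pd.to_numeric, errors="coerce")

# Only now drop the rows where either value is missing.
clean = numeric.dropna(how="any")
```

After this, both columns of clean are float64 and can be passed to polyfit or to either of the fitting functions below.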

This should fix your error. But, on the general subject of fitting a line to data, I keep handy two ways of doing this that I like better than polyfit. The second of the two is more robust (and can potentially return much more detailed information about the statistics) but it requires statsmodels.

import pandas as pd
from scipy.stats import linregress

def fit_line1(x, y):
    """Return slope, intercept of best fit line."""
    # Remove entries (rows) where either x or y is NaN.
    clean_data = pd.concat([x, y], axis=1).dropna(axis=0)
    # items() yields (column name, column) pairs; iteritems() was removed in pandas 2.0.
    (_, x), (_, y) = clean_data.items()
    slope, intercept, r, p, stderr = linregress(x, y)
    return slope, intercept # could also return stderr

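import numpy as np
import statsmodels.api as sm

def fit_line2(x, y):
    """Return slope, intercept of best fit line."""
    X = sm.add_constant(x) # prepends a column of ones for the intercept
    model = sm.OLS(y, X, missing='drop') # ignores entries where x or y is NaN
    fit = model.fit()
    # params is ordered (constant, x); asarray avoids label-vs-position indexing issues.
    intercept, slope = np.asarray(fit.params)
    return slope, intercept # could also return stderr in each via fit.bse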

To plot it, do something like

import numpy as np

m, b = fit_line2(x, y)
N = 100 # could be just 2 if you are only drawing a straight line...
points = np.linspace(x.min(), x.max(), N)
plt.plot(points, m*points + b)
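
If you would rather stay with polyfit, as in the question, it should work once the columns are numeric and free of NaN. A rough sketch, using made-up points that lie exactly on y = 2x + 1 rather than the real gene table; the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented data on the line y = 2x + 1 (not the data from the question).
x = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# For degree 1, polyfit returns the coefficients highest power first: slope, intercept.
m, b = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y)
points = np.linspace(x.min(), x.max(), 2)  # two points are enough for a straight line
ax.plot(points, m * points + b)
# Call fig.savefig(...) here, or plt.show() with an interactive backend.
```

The return value of polyfit is just the coefficient array, so the line has to be drawn explicitly with a plot call, which is the step missing from the code in the question.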
