将 Pandas DataFrame 转换为 Orange Table

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/26320638/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:34:14  来源:igfitidea点击:

Converting Pandas DataFrame to Orange Table

pythonpandasdataframeorange

提问by hlin117

I notice that this is an issue on GitHub already. Does anyone have any code that converts a Pandas DataFrame to an Orange Table?

我注意到这已经是GitHub 上的一个问题。有没有人有将 Pandas DataFrame 转换为 Orange Table 的代码?

Explicitly, I have the following table.

明确地说,我有下表。

       user  hotel  star_rating  user  home_continent  gender
0         1     39          4.0     1               2  female
1         1     44          3.0     1               2  female
2         2     63          4.5     2               3  female
3         2      2          2.0     2               3  female
4         3     26          4.0     3               1    male
5         3     37          5.0     3               1    male
6         3     63          4.5     3               1    male

回答by TurtleIzzy

The documentation of Orange package didn't cover all the details. Table._init__(Domain, numpy.ndarray)works only for intand floataccording to lib_kernel.cpp.

Orange 包的文档没有涵盖所有细节。Table._init__(Domain, numpy.ndarray)仅适用于intfloat根据lib_kernel.cpp.

They really should provide an C-level interface for pandas.DataFrames, or at least numpy.dtype("str")support.

他们确实应该为 提供 C 级接口pandas.DataFrames,或者至少numpy.dtype("str")提供支持。

Update: Adding table2df, df2tableperformance improved greatly by utilizing numpy for int and float.

更新:添加table2dfdf2table通过将 numpy 用于 int 和 float 大大提高了性能。

Keep this piece of script in your orange python script collections, now you are equipped with pandas in your orange environment.

将这段脚本保存在您的橙色 python 脚本集合中,现在您在橙色环境中配备了Pandas。

Usage: a_pandas_dataframe = table2df( a_orange_table ), a_orange_table = df2table( a_pandas_dataframe )

用法a_pandas_dataframe = table2df( a_orange_table )a_orange_table = df2table( a_pandas_dataframe )

Note: This script works only in Python 2.x, refer to @DustinTang 's answerfor Python 3.x compatible script.

注意:此脚本仅适用于 Python 2.x,请参阅 @DustinTang对 Python 3.x 兼容脚本的回答

import pandas as pd
import numpy as np
import Orange

#### For those who are familiar with pandas
#### Correspondence:
####    value <-> Orange.data.Value
####        NaN <-> ["?", "~", "."] # Don't know, Don't care, Other
####    dtype <-> Orange.feature.Descriptor
####        category, int <-> Orange.feature.Discrete # category: > pandas 0.15
####        int, float <-> Orange.feature.Continuous # Continuous = core.FloatVariable
####                                                 # refer to feature/__init__.py
####        str <-> Orange.feature.String
####        object <-> Orange.feature.Python
####    DataFrame.dtypes <-> Orange.data.Domain
####    DataFrame.DataFrame <-> Orange.data.Table = Orange.orange.ExampleTable 
####                              # You will need this if you are reading sources

def series2descriptor(d, discrete=False):
    if d.dtype is np.dtype("float"):
        return Orange.feature.Continuous(str(d.name))
    elif d.dtype is np.dtype("int"):
        return Orange.feature.Continuous(str(d.name), number_of_decimals=0)
    else:
        t = d.unique()
        if discrete or len(t) < len(d) / 2:
            t.sort()
            return Orange.feature.Discrete(str(d.name), values=list(t.astype("str")))
        else:
            return Orange.feature.String(str(d.name))


def df2domain(df):
    featurelist = [series2descriptor(df.icol(col)) for col in xrange(len(df.columns))]
    return Orange.data.Domain(featurelist)


def df2table(df):
    # It seems they are using native python object/lists internally for Orange.data types (?)
    # And I didn't find a constructor suitable for pandas.DataFrame since it may carry
    # multiple dtypes
    #  --> the best approximate is Orange.data.Table.__init__(domain, numpy.ndarray),
    #  --> but the dtype of numpy array can only be "int" and "float"
    #  -->  * refer to src/orange/lib_kernel.cpp 3059:
    #  -->  *    if (((*vi)->varType != TValue::INTVAR) && ((*vi)->varType != TValue::FLOATVAR))
    #  --> Documents never mentioned >_<
    # So we use numpy constructor for those int/float columns, python list constructor for other

    tdomain = df2domain(df)
    ttables = [series2table(df.icol(i), tdomain[i]) for i in xrange(len(df.columns))]
    return Orange.data.Table(ttables)

    # For performance concerns, here are my results
    # dtndarray = np.random.rand(100000, 100)
    # dtlist = list(dtndarray)
    # tdomain = Orange.data.Domain([Orange.feature.Continuous("var" + str(i)) for i in xrange(100)])
    # tinsts = [Orange.data.Instance(tdomain, list(dtlist[i]) )for i in xrange(len(dtlist))] 
    # t = Orange.data.Table(tdomain, tinsts)
    #
    # timeit list(dtndarray)  # 45.6ms
    # timeit [Orange.data.Instance(tdomain, list(dtlist[i])) for i in xrange(len(dtlist))] # 3.28s
    # timeit Orange.data.Table(tdomain, tinsts) # 280ms

    # timeit Orange.data.Table(tdomain, dtndarray) # 380ms
    #
    # As illustrated above, utilizing constructor with ndarray can greatly improve performance
    # So one may conceive better converter based on these results


def series2table(series, variable):
    if series.dtype is np.dtype("int") or series.dtype is np.dtype("float"):
        # Use numpy
        # Table._init__(Domain, numpy.ndarray)
        return Orange.data.Table(Orange.data.Domain(variable), series.values[:, np.newaxis])
    else:
        # Build instance list
        # Table.__init__(Domain, list_of_instances)
        tdomain = Orange.data.Domain(variable)
        tinsts = [Orange.data.Instance(tdomain, [i]) for i in series]
        return Orange.data.Table(tdomain, tinsts)
        # 5x performance


def column2df(col):
    if type(col.domain[0]) is Orange.feature.Continuous:
        return (col.domain[0].name, pd.Series(col.to_numpy()[0].flatten()))
    else:
        tmp = pd.Series(np.array(list(col)).flatten())  # type(tmp) -> np.array( dtype=list (Orange.data.Value) )
        tmp = tmp.apply(lambda x: str(x[0]))
        return (col.domain[0].name, tmp)

def table2df(tab):
    # Orange.data.Table().to_numpy() cannot handle strings
    # So we must build the array column by column,
    # When it comes to strings, python list is used
    series = [column2df(tab.select(i)) for i in xrange(len(tab.domain))]
    series_name = [i[0] for i in series]  # To keep the order of variables unchanged
    series_data = dict(series)
    print series_data
    return pd.DataFrame(series_data, columns=series_name)

回答by Creo

Answer below from a closed issue on github

下面从github 上的一个已关闭问题中回答

from Orange.data.pandas_compat import table_from_frame
out_data = table_from_frame(df)

Where df is your dataFrame. So far I've only noticed a need to manually define a domain to handle dates if the data source wasn't 100% clean and to the required ISO standard.

其中 df 是您的数据帧。到目前为止,我只注意到如果数据源不是 100% 干净且符合所需的 ISO 标准,则需要手动定义一个域来处理日期。

I realize this is an old question and a lot changed from when it was first asked - but this question comes up top on google search results on the topic.

我意识到这是一个老问题,与第一次提出时相比发生了很大变化 - 但这个问题在该主题的谷歌搜索结果中名列前茅。

回答by astaric

In order to convert pandas DataFrame to Orange Table you need to construct a domain, which specifies the column types.

为了将 Pandas DataFrame 转换为 Orange Table,您需要构造一个域,该域指定列类型。

For continuous variables, you only need to provide the name of the variable, but for Discrete variables, you also need to provide a list of all possible values.

对于连续变量,您只需要提供变量的名称,但对于离散变量,您还需要提供所有可能值的列表。

The following code will construct a domain for your DataFrame and convert it to an Orange Table:

以下代码将为您的 DataFrame 构建一个域并将其转换为橙色表:

import numpy as np
from Orange.feature import Discrete, Continuous
from Orange.data import Domain, Table
domain = Domain([
    Discrete('user', values=[str(v) for v in np.unique(df.user)]),
    Discrete('hotel', values=[str(v) for v in np.unique(df.hotel)]),
    Continuous('star_rating'),
    Discrete('user', values=[str(v) for v in np.unique(df.user)]),
    Discrete('home_continent', values=[str(v) for v in np.unique(df.home_continent)]),
    Discrete('gender', values=['male', 'female'])], False)
table = Table(domain, [map(str, row) for row in df.as_matrix()])

The map(str, row) step is needed so Orange know that the data contains values of discrete features (and not the indices of values in the values list).

需要 map(str, row) 步骤,以便 Orange 知道数据包含离散特征的值(而不是值列表中值的索引)。

回答by DustinTang

This code is revised from @TurtleIzzy for Python3.

此代码是从 @TurtleIzzy 为 Python3 修订的。

import numpy as np
from Orange.data import Table, Domain, ContinuousVariable, DiscreteVariable


def series2descriptor(d):
    if d.dtype is np.dtype("float") or d.dtype is np.dtype("int"):
        return ContinuousVariable(str(d.name))
    else:
        t = d.unique()
        t.sort()
        return DiscreteVariable(str(d.name), list(t.astype("str")))

def df2domain(df):
    featurelist = [series2descriptor(df.iloc[:,col]) for col in range(len(df.columns))]
    return Domain(featurelist)

def df2table(df):
    tdomain = df2domain(df)
    ttables = [series2table(df.iloc[:,i], tdomain[i]) for i in range(len(df.columns))]
    ttables = np.array(ttables).reshape((len(df.columns),-1)).transpose()
    return Table(tdomain , ttables)

def series2table(series, variable):
    if series.dtype is np.dtype("int") or series.dtype is np.dtype("float"):
        series = series.values[:, np.newaxis]
        return Table(series)
    else:
        series = series.astype('category').cat.codes.reshape((-1,1))
        return Table(series)

回答by JanezD

Something like this?

像这样的东西?

table = Orange.data.Table(df.as_matrix())

The columns in Orange will get generic names (a1, a2...). If you want to copy the names and the types from the data frame, construct Orange.data.Domain object (http://docs.orange.biolab.si/reference/rst/Orange.data.domain.html#Orange.data.Domain.init) from the data frame and pass it as the first argument above.

Orange 中的列将获得通用名称 (a1, a2...)。如果要从数据框中复制名称和类型,请构造 Orange.data.Domain 对象(http://docs.orange.biolab.si/reference/rst/Orange.data.domain.html#Orange.data 。域。初始化从数据帧),并把它传递与上述第一个参数。

See the constructors in http://docs.orange.biolab.si/reference/rst/Orange.data.table.html.

请参阅http://docs.orange.biolab.si/reference/rst/Orange.data.table.html 中的构造函数。

回答by omer sagy

table_from_frame, which is available in Python 3, doesn't allow the definition of a class column and therefore, the generated table cannot be used directly to train a classification model. I tweaked the table_from_frame function so it'll allow the definition of a class column. Notice that the class name should be given as an additional parameter.

Python 3 中提供的 table_from_frame 不允许定义类列,因此生成的表不能直接用于训练分类模型。我调整了 table_from_frame 函数,以便它允许定义类列。请注意,应将类名作为附加参数给出。

"""Pandas DataFrame?Table conversion helpers"""
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_categorical_dtype, is_object_dtype,
    is_datetime64_any_dtype, is_numeric_dtype,
)

from Orange.data import (
    Table, Domain, DiscreteVariable, StringVariable, TimeVariable,
    ContinuousVariable,
)

__all__ = ['table_from_frame', 'table_to_frame']


def table_from_frame(df,class_name, *, force_nominal=False):
    """
    Convert pandas.DataFrame to Orange.data.Table

    Parameters
    ----------
    df : pandas.DataFrame
    force_nominal : boolean
        If True, interpret ALL string columns as nominal (DiscreteVariable).

    Returns
    -------
    Table
    """

    def _is_discrete(s):
        return (is_categorical_dtype(s) or
                is_object_dtype(s) and (force_nominal or
                                        s.nunique() < s.size**.666))

    def _is_datetime(s):
        if is_datetime64_any_dtype(s):
            return True
        try:
            if is_object_dtype(s):
                pd.to_datetime(s, infer_datetime_format=True)
                return True
        except Exception:  # pylint: disable=broad-except
            pass
        return False

    # If df index is not a simple RangeIndex (or similar), put it into data
    if not (df.index.is_integer() and (df.index.is_monotonic_increasing or
                                       df.index.is_monotonic_decreasing)):
        df = df.reset_index()

    attrs, metas,calss_vars = [], [],[]
    X, M = [], []

    # Iter over columns
    for name, s in df.items():
        name = str(name)
        if name == class_name:
            discrete = s.astype('category').cat
            calss_vars.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
            X.append(discrete.codes.replace(-1, np.nan).values)
        elif _is_discrete(s):
            discrete = s.astype('category').cat
            attrs.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
            X.append(discrete.codes.replace(-1, np.nan).values)
        elif _is_datetime(s):
            tvar = TimeVariable(name)
            attrs.append(tvar)
            s = pd.to_datetime(s, infer_datetime_format=True)
            X.append(s.astype('str').replace('NaT', np.nan).map(tvar.parse).values)
        elif is_numeric_dtype(s):
            attrs.append(ContinuousVariable(name))
            X.append(s.values)
        else:
            metas.append(StringVariable(name))
            M.append(s.values.astype(object))

    return Table.from_numpy(Domain(attrs, calss_vars, metas),
                            np.column_stack(X) if X else np.empty((df.shape[0], 0)),
                            None,
                            np.column_stack(M) if M else None)

回答by Shimon Doodkin

from Orange.data.pandas_compat import table_from_frame,table_to_frame
df= table_to_frame(in_data)
#here you go
out_data = table_from_frame(df)

based on answer of Creo

基于 Creo 的回答