Python: get a list from a pandas DataFrame column

Notice: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/22341271/

Date: 2020-08-19 00:45:38  Source: igfitidea

get list from pandas dataframe column

Tags: python, list, pandas

Asked by yoshiserry

I have an Excel document which looks like this:

cluster load_date   budget  actual  fixed_price
A   1/1/2014    1000    4000    Y
A   2/1/2014    12000   10000   Y
A   3/1/2014    36000   2000    Y
B   4/1/2014    15000   10000   N
B   4/1/2014    12000   11500   N
B   4/1/2014    90000   11000   N
C   7/1/2014    22000   18000   N
C   8/1/2014    30000   28960   N
C   9/1/2014    53000   51200   N

I want to return the contents of column 1 (cluster) as a list, so I can run a for loop over it and create an Excel worksheet for every cluster.

Is it also possible to return the contents of a whole row as a list? For example:

list = [], list[column1] or list[df.ix(row1)]

Accepted answer by Ben

Pandas DataFrame columns are pandas Series when you pull them out, and you can call x.tolist() on a Series to turn it into a Python list. Alternatively, you can cast it with list(x).

import pandas as pd

data_dict = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
             'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(data_dict)

print(f"DataFrame:\n{df}\n")
print(f"column types:\n{df.dtypes}")

col_one_list = df['one'].tolist()

col_one_arr = df['one'].to_numpy()

print(f"\ncol_one_list:\n{col_one_list}\ntype:{type(col_one_list)}")
print(f"\ncol_one_arr:\n{col_one_arr}\ntype:{type(col_one_arr)}")

Output:

DataFrame:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

column types:
one    float64
two      int64
dtype: object

col_one_list:
[1.0, 2.0, 3.0, nan]
type:<class 'list'>

col_one_arr:
[ 1.  2.  3. nan]
type:<class 'numpy.ndarray'>
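
Applied to the question's use case, the same calls might look like the sketch below. The small frame is a stand-in for the spreadsheet (only two of its columns), and the variable names are illustrative, not part of the original answer.

```python
import pandas as pd

# Small stand-in for the question's spreadsheet (only two of its columns).
df = pd.DataFrame({
    "cluster": ["A", "A", "B", "C"],
    "budget": [1000, 12000, 15000, 22000],
})

cluster_list = df["cluster"].tolist()            # every value, duplicates included
cluster_names = df["cluster"].unique().tolist()  # one entry per cluster
first_row = df.iloc[0].tolist()                  # a whole row as a list

for name in cluster_names:
    # Rows belonging to this cluster; a worksheet could be written from `subset`.
    subset = df[df["cluster"] == name]
    print(name, len(subset))
```

Looping over cluster_names rather than cluster_list avoids visiting the same cluster several times.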

Answer by Anirudh Bandi

This returns a numpy array:

arr = df["cluster"].to_numpy()


This returns a numpy array of unique values:

unique_arr = df["cluster"].unique()

You can also use numpy to get the unique values, although there are differences between the two methods (Series.unique() keeps the order of first appearance, while np.unique() returns sorted values):

import numpy as np

arr = df["cluster"].to_numpy()
unique_arr = np.unique(arr)
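
One difference between the two methods is ordering. The sketch below uses a made-up Series (not the question's data) to show it: Series.unique() keeps values in order of first appearance, whereas np.unique() returns them sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the two orderings differ.
ser = pd.Series(["B", "A", "C", "A", "B"])

by_appearance = ser.unique().tolist()               # order of first appearance
sorted_unique = np.unique(ser.to_numpy()).tolist()  # sorted order

print(by_appearance)
print(sorted_unique)
```

If the worksheets should follow the order in which clusters appear in the file, Series.unique() is probably the more natural choice.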

Answer by Harvey

Example conversion:

NumPy array -> pandas DataFrame -> list from one pandas column

NumPy array

import numpy as np
import pandas as pd

data = np.array([[10,20,30], [20,30,60], [30,60,90]])

Convert the NumPy array into a pandas DataFrame

dataPd = pd.DataFrame(data = data)

print(dataPd)
    0   1   2
0  10  20  30
1  20  30  60
2  30  60  90

Convert one pandas column to a list

pdToList = list(dataPd[2])  # the column labels are the integers 0, 1, 2, so index with 2, not '2'

Answer by Natasha

Assuming the dataframe read from the Excel sheet is named df, start with an empty list (e.g. dataList), iterate through the dataframe row by row, and append each row to the list:

dataList = [] #empty list
for index, row in df.iterrows(): 
    mylist = [row.cluster, row.load_date, row.budget, row.actual, row.fixed_price]
    dataList.append(mylist)

Or,

dataList = [] #empty list
for row in df.itertuples(): 
    mylist = [row.cluster, row.load_date, row.budget, row.actual, row.fixed_price]
    dataList.append(mylist)

Now, if you print dataList, you will get each row as a list inside dataList.
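
The same result can be had without naming each column by hand. The following is a sketch on a reduced, made-up frame, and relies on itertuples(index=False) so that the index is left out of each row.

```python
import pandas as pd

# Reduced stand-in for the question's data.
df = pd.DataFrame({
    "cluster": ["A", "B"],
    "budget": [1000, 15000],
    "fixed_price": ["Y", "N"],
})

# Each namedtuple row becomes a plain list; index=False drops the row index.
dataList = [list(row) for row in df.itertuples(index=False)]
print(dataList)
```

This keeps working when columns are added or renamed, since no column names appear in the loop.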

Answer by Markus Dutschke

As this question has attracted a lot of attention and there are several ways to accomplish the task, let me present several options.

These are all one-liners, by the way ;)

Starting with:

df
  cluster load_date budget actual fixed_price
0       A  1/1/2014   1000   4000           Y
1       A  2/1/2014  12000  10000           Y
2       A  3/1/2014  36000   2000           Y
3       B  4/1/2014  15000  10000           N
4       B  4/1/2014  12000  11500           N
5       B  4/1/2014  90000  11000           N
6       C  7/1/2014  22000  18000           N
7       C  8/1/2014  30000  28960           N
8       C  9/1/2014  53000  51200           N

Overview of potential operations:

ser_aggCol (collapse each column to a list)
cluster          [A, A, A, B, B, B, C, C, C]
load_date      [1/1/2014, 2/1/2014, 3/1/2...
budget         [1000, 12000, 36000, 15000...
actual         [4000, 10000, 2000, 10000,...
fixed_price      [Y, Y, Y, N, N, N, N, N, N]
dtype: object


ser_aggRows (collapse each row to a list)
0     [A, 1/1/2014, 1000, 4000, Y]
1    [A, 2/1/2014, 12000, 10000...
2    [A, 3/1/2014, 36000, 2000, Y]
3    [B, 4/1/2014, 15000, 10000...
4    [B, 4/1/2014, 12000, 11500...
5    [B, 4/1/2014, 90000, 11000...
6    [C, 7/1/2014, 22000, 18000...
7    [C, 8/1/2014, 30000, 28960...
8    [C, 9/1/2014, 53000, 51200...
dtype: object


df_gr (here you get lists for each cluster)
                             load_date                 budget                 actual fixed_price
cluster                                                                                         
A        [1/1/2014, 2/1/2014, 3/1/2...   [1000, 12000, 36000]    [4000, 10000, 2000]   [Y, Y, Y]
B        [4/1/2014, 4/1/2014, 4/1/2...  [15000, 12000, 90000]  [10000, 11500, 11000]   [N, N, N]
C        [7/1/2014, 8/1/2014, 9/1/2...  [22000, 30000, 53000]  [18000, 28960, 51200]   [N, N, N]


a list of separate dataframes for each cluster

df for cluster A
  cluster load_date budget actual fixed_price
0       A  1/1/2014   1000   4000           Y
1       A  2/1/2014  12000  10000           Y
2       A  3/1/2014  36000   2000           Y

df for cluster B
  cluster load_date budget actual fixed_price
3       B  4/1/2014  15000  10000           N
4       B  4/1/2014  12000  11500           N
5       B  4/1/2014  90000  11000           N

df for cluster C
  cluster load_date budget actual fixed_price
6       C  7/1/2014  22000  18000           N
7       C  8/1/2014  30000  28960           N
8       C  9/1/2014  53000  51200           N

just the values of column load_date
0    1/1/2014
1    2/1/2014
2    3/1/2014
3    4/1/2014
4    4/1/2014
5    4/1/2014
6    7/1/2014
7    8/1/2014
8    9/1/2014
Name: load_date, dtype: object


just the values of column number 2
0     1000
1    12000
2    36000
3    15000
4    12000
5    90000
6    22000
7    30000
8    53000
Name: budget, dtype: object


just the values of row number 7
cluster               C
load_date      8/1/2014
budget            30000
actual            28960
fixed_price           N
Name: 7, dtype: object


============================== JUST FOR COMPLETENESS ==============================


you can convert a series to a list
['C', '8/1/2014', '30000', '28960', 'N']
<class 'list'>


you can convert a dataframe to a nested list
[['A', '1/1/2014', '1000', '4000', 'Y'], ['A', '2/1/2014', '12000', '10000', 'Y'], ['A', '3/1/2014', '36000', '2000', 'Y'], ['B', '4/1/2014', '15000', '10000', 'N'], ['B', '4/1/2014', '12000', '11500', 'N'], ['B', '4/1/2014', '90000', '11000', 'N'], ['C', '7/1/2014', '22000', '18000', 'N'], ['C', '8/1/2014', '30000', '28960', 'N'], ['C', '9/1/2014', '53000', '51200', 'N']]
<class 'list'>

the content of a dataframe can be accessed as a numpy.ndarray
[['A' '1/1/2014' '1000' '4000' 'Y']
 ['A' '2/1/2014' '12000' '10000' 'Y']
 ['A' '3/1/2014' '36000' '2000' 'Y']
 ['B' '4/1/2014' '15000' '10000' 'N']
 ['B' '4/1/2014' '12000' '11500' 'N']
 ['B' '4/1/2014' '90000' '11000' 'N']
 ['C' '7/1/2014' '22000' '18000' 'N']
 ['C' '8/1/2014' '30000' '28960' 'N']
 ['C' '9/1/2014' '53000' '51200' 'N']]
<class 'numpy.ndarray'>

code:

# prefix ser refers to pd.Series object
# prefix df refers to pd.DataFrame object
# prefix lst refers to list object

import pandas as pd
import numpy as np

df=pd.DataFrame([
        ['A',   '1/1/2014',    '1000',    '4000',    'Y'],
        ['A',   '2/1/2014',    '12000',   '10000',   'Y'],
        ['A',   '3/1/2014',    '36000',   '2000',    'Y'],
        ['B',   '4/1/2014',    '15000',   '10000',   'N'],
        ['B',   '4/1/2014',    '12000',   '11500',   'N'],
        ['B',   '4/1/2014',    '90000',   '11000',   'N'],
        ['C',   '7/1/2014',    '22000',   '18000',   'N'],
        ['C',   '8/1/2014',    '30000',   '28960',   'N'],
        ['C',   '9/1/2014',    '53000',   '51200',   'N']
        ], columns=['cluster', 'load_date',   'budget',  'actual',  'fixed_price'])
print('df',df, sep='\n', end='\n\n')

ser_aggCol=df.aggregate(lambda x: [x.tolist()], axis=0).map(lambda x:x[0])
print('ser_aggCol (collapse each column to a list)',ser_aggCol, sep='\n', end='\n\n\n')

ser_aggRows=pd.Series(df.values.tolist()) 
print('ser_aggRows (collapse each row to a list)',ser_aggRows, sep='\n', end='\n\n\n')

df_gr=df.groupby('cluster').agg(lambda x: list(x))
print('df_gr (here you get lists for each cluster)',df_gr, sep='\n', end='\n\n\n')

lst_dfFiltGr=[ df.loc[df['cluster']==val,:] for val in df['cluster'].unique() ]
print('a list of separate dataframes for each cluster', sep='\n', end='\n\n')
for dfTmp in lst_dfFiltGr:
    print('df for cluster '+str(dfTmp.loc[dfTmp.index[0],'cluster']),dfTmp, sep='\n', end='\n\n')

ser_singleColLD=df.loc[:,'load_date']
print('just the values of column load_date',ser_singleColLD, sep='\n', end='\n\n\n')

ser_singleCol2=df.iloc[:,2]
print('just the values of column number 2',ser_singleCol2, sep='\n', end='\n\n\n')

ser_singleRow7=df.iloc[7,:]
print('just the values of row number 7',ser_singleRow7, sep='\n', end='\n\n\n')

print('='*30+' JUST FOR COMPLETENESS '+'='*30, end='\n\n\n')

lst_fromSer=ser_singleRow7.tolist()
print('you can convert a series to a list',lst_fromSer, type(lst_fromSer), sep='\n', end='\n\n\n')

lst_fromDf=df.values.tolist()
print('you can convert a dataframe to a nested list',lst_fromDf, type(lst_fromDf), sep='\n', end='\n\n')

arr_fromDf=df.values
print('the content of a dataframe can be accessed as a numpy.ndarray',arr_fromDf, type(arr_fromDf), sep='\n', end='\n\n')

As pointed out by cs95, other methods should be preferred over the pandas .values attribute from pandas version 0.24 on. I use it here because most people will (as of 2019) still have an older version that does not support the new recommendations. You can check your version with print(pd.__version__).
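
For readers on pandas 0.24 or later, the .values calls above could be swapped for DataFrame.to_numpy(). This is a sketch on a shortened, made-up version of the frame; it reuses the variable names from the code above.

```python
import pandas as pd

# Shortened stand-in for the frame used above.
df = pd.DataFrame(
    [["A", "1/1/2014", "1000"], ["B", "4/1/2014", "15000"]],
    columns=["cluster", "load_date", "budget"],
)

arr_fromDf = df.to_numpy()           # in place of df.values
lst_fromDf = df.to_numpy().tolist()  # in place of df.values.tolist()

print(type(arr_fromDf))
print(lst_fromDf)
```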

Answer by kamran kausar

# Collect the values of every column, one column after another, into a single flat list
amount = list()
for col in df.columns:
    val = list(df[col])
    for v in val:
        amount.append(v)
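
The nested loop above flattens the frame column by column. A more compact sketch of the same idea, using itertools.chain on a made-up two-column frame, could be:

```python
from itertools import chain

import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Concatenate the per-column lists in column order.
amount = list(chain.from_iterable(df[col].tolist() for col in df.columns))
print(amount)
```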

Answer by Ramin Melikov

If your column only has one value, something like pd.Series.tolist() may produce an error. To guarantee that it will work for all cases, use the code below:

(
    df
        .filter(['column_name'])
        .values
        .reshape(1, -1)
        .ravel()
        .tolist()
)
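
A quick check on a hypothetical one-row frame: as far as I am aware, Series.tolist() handles a single-value column, and both approaches return the same list. My assumption is that the error this answer has in mind arises when the selection returns a scalar rather than a Series (for example df.loc[0, 'column_name']), since a scalar has no guaranteed tolist() method.

```python
import pandas as pd

# Hypothetical frame whose column holds a single value.
df = pd.DataFrame({"column_name": [5]})

via_tolist = df["column_name"].tolist()
via_chain = (
    df
    .filter(["column_name"])
    .values
    .reshape(1, -1)
    .ravel()
    .tolist()
)

print(via_tolist, via_chain)
```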