pandas read_excel: multiple tables on the same sheet

Question

Asked by bsd

Is it possible to read multiple tables from a single sheet of an Excel file using pandas? Something like: read table1 from row 0 to row 100, read table2 from row 102 to row 202, ...

Answer 1

Answered by MaxU

Assuming we have the following Excel file:

Solution: we parse the first sheet (index: 0)

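import pandas as pd

xl = pd.ExcelFile(fn)  # fn: path to the Excel file
# xl.book is the engine's workbook object; sheet_by_index() is xlrd API (.xls files)
nrows = xl.book.sheet_by_index(0).nrows

# first table: keep the header row plus 10 data rows, skip everything below
df1 = xl.parse(0, skipfooter=nrows-(10+1)).dropna(axis=1, how='all')
# second table: skip the first 12 sheet rows so its header becomes the column row
df2 = xl.parse(0, skiprows=12).dropna(axis=1, how='all')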

EDIT: skip_footer was replaced with skipfooter

Result:

In [123]: df1
Out[123]:
    a   b   c
0  78  68  33
1  62  26  30
2  99  35  13
3  73  97   4
4  85   7  53
5  80  20  95
6  40  52  96
7  36  23  76
8  96  73  37
9  39  35  24

In [124]: df2
Out[124]:
   c1  c2  c3 c4
0  78  88  59  a
1  82   4  64  a
2  35   9  78  b
3   0  11  23  b
4  61  53  29  b
5  51  36  72  c
6  59  36  45  c
7   7  64   8  c
8   1  83  46  d
9  30  47  84  d
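
On newer pandas versions the same split can probably be written without the engine-specific xl.book object, because read_excel accepts nrows alongside skiprows. A minimal sketch, assuming the row offsets are known in advance; the helper name read_table_block and its parameters are hypothetical and not part of pandas:

```python
import pandas as pd

def read_table_block(path, first_row, n_rows, sheet_name=0):
    """Read one table from a sheet that holds several stacked tables.

    first_row: 0-based sheet row that holds the table's header (assumption).
    n_rows:    number of data rows below that header (assumption).
    """
    block = pd.read_excel(path, sheet_name=sheet_name,
                          skiprows=first_row, nrows=n_rows)
    # Columns used only by a wider neighbouring table may come back all-NaN.
    return block.dropna(axis=1, how='all')
```

For the file above, read_table_block(fn, 0, 10) and read_table_block(fn, 12, 10) would correspond to df1 and df2, assuming the second header sits on sheet row 12 as in the original code.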

Answer 2

Answered by Rotem

I wrote the following code to identify the multiple tables automatically, in case you have many files you need to process and don't want to look in each one to get the right row numbers. The code also looks for non-empty rows above each table and reads those as table metadata.

import numpy as np
import pandas as pd

def parse_excel_sheet(file, sheet_name=0, threshold=5):
    '''Parse multiple tables from an Excel sheet into multiple DataFrame objects.

    Returns (dfs, df_mds), where dfs is a list of DataFrames and df_mds is a
    list of their potential associated metadata.'''
    xl = pd.ExcelFile(file)
    entire_sheet = xl.parse(sheet_name=sheet_name)

    # count the non-NaN cells in each row, then the change in that count between adjacent rows
    n_values = np.logical_not(entire_sheet.isnull()).sum(axis=1)
    n_values_deltas = n_values[1:] - n_values[:-1].values

    # define the beginnings and ends of tables using delta in n_values
    table_beginnings = n_values_deltas > threshold
    table_beginnings = table_beginnings[table_beginnings].index
    table_endings = n_values_deltas < -threshold
    table_endings = table_endings[table_endings].index
    if len(table_beginnings) < len(table_endings) or len(table_beginnings) > len(table_endings)+1:
        raise ValueError('Could not detect equal number of beginnings and ends')

    # look for metadata before the beginnings of tables
    md_beginnings = []
    for start in table_beginnings:
        md_start = n_values.iloc[:start][n_values==0].index[-1] + 1
        md_beginnings.append(md_start)

    # make data frames
    dfs = []
    df_mds = []
    for ind in range(len(table_beginnings)):
        start = table_beginnings[ind]+1
        if ind < len(table_endings):
            stop = table_endings[ind]
        else:
            stop = entire_sheet.shape[0]
        df = xl.parse(sheet_name=sheet_name, skiprows=start, nrows=stop-start)
        dfs.append(df)

        md = xl.parse(sheet_name=sheet_name, skiprows=md_beginnings[ind], nrows=start-md_beginnings[ind]-1).dropna(axis=1)
        df_mds.append(md)
    return dfs, df_mds
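
For sheets where tables are separated by at least one completely empty row, a simpler heuristic than the threshold-based detection above may be enough. The following is a minimal sketch of that blank-row variant, not Rotem's method: the function name split_on_blank_rows is hypothetical, the sheet is assumed to have been read with header=None, and metadata rows directly above a table are not handled (they would be treated as part of the block).

```python
import pandas as pd

def split_on_blank_rows(raw):
    """Split a header-less sheet dump into one DataFrame per table.

    raw: DataFrame read with header=None, so header cells are ordinary data.
    Tables are assumed to be separated by at least one fully empty row.
    """
    is_blank = raw.isnull().all(axis=1)
    # The running count of blank rows is constant within each table,
    # so it can serve as a group label.
    block_id = is_blank.cumsum()
    tables = []
    for _, block in raw[~is_blank].groupby(block_id[~is_blank]):
        block = block.dropna(axis=1, how='all')
        # First row of the block is taken as the header (assumption).
        header, body = block.iloc[0], block.iloc[1:]
        body = body.reset_index(drop=True)
        body.columns = header.tolist()
        tables.append(body)
    return tables

# Tiny hand-made stand-in for pd.read_excel(fn, header=None)
raw = pd.DataFrame([
    ['a', 'b', None],
    [1, 2, None],
    [3, 4, None],
    [None, None, None],
    ['c1', 'c2', 'c3'],
    [5, 6, 'x'],
])
tables = split_on_blank_rows(raw)
```

Here tables holds two frames, one with columns a and b and one with columns c1, c2 and c3. Because the cells were read as mixed data, the resulting columns may have dtype object and need converting afterwards.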

Answer 3

Answered by splinter

First read in the entire CSV file:

import pandas as pd
# raw string so the backslash in the Windows-style path is not treated as an escape
df = pd.read_csv(r'path_to\your_data.csv')

and then obtain the individual frames, for example using:

df1 = df.iloc[:100,:]
df2 = df.iloc[100:200,:]
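
One caveat with positional slicing is that the second table's header row ends up inside the data, since only the first row of the file becomes column names. A minimal sketch of promoting that row to a header, using a small hand-made frame as a stand-in for the file (the row positions and values here are assumptions for illustration only):

```python
import pandas as pd

# Stand-in for a whole-file read; row 3 is the blank separator,
# row 4 holds the second table's header.
df = pd.DataFrame({
    'a': [78, 62, 99, None, 'c1', 5],
    'b': [68, 26, 35, None, 'c2', 6],
})

df1 = df.iloc[:3, :]

# Slice from the second header onward, then promote that row to column names.
chunk = df.iloc[4:, :].reset_index(drop=True)
df2 = chunk.iloc[1:].reset_index(drop=True)
df2.columns = chunk.iloc[0].tolist()
# Columns are dtype object after the mixed read; let pandas re-infer them.
df2 = df2.infer_objects()
```

The same pattern applies to a sheet loaded with read_excel; only the slice boundaries change.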
