Python Pandas - 读取包含多个表的 csv 文件

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/34184841/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 00:21:50  来源:igfitidea点击:

Python Pandas - Read csv file containing multiple tables

pythonexcelpython-2.7csvpandas

提问by JahMyst

I have a single .csvfile containing multiple tables.

我有一个.csv包含多个表的文件。

Using Pandas, what would be the best strategy to get two DataFrame inventoryand HPBladeSystemRackfrom this one file ?

使用Pandas,这将是拿到两数据帧的最佳策略inventory,并HPBladeSystemRack从这个文件?

The input .csvlooks like this:

输入.csv如下所示:

Inventory       
System Name            IP Address    System Status
dg-enc05                             Normal
dg-enc05_vc_domain                   Unknown
dg-enc05-oa1           172.20.0.213  Normal

HP BladeSystem Rack         
System Name               Rack Name   Enclosure Name
dg-enc05                  BU40  
dg-enc05-oa1              BU40        dg-enc05
dg-enc05-oa2              BU40        dg-enc05

The best I've come up with so far is to convert this .csvfile into Excel workbook (xlxs), split the tables into sheets and use:

到目前为止,我想出的最好方法是将此.csv文件转换为 Excel 工作簿 ( xlxs),将表格拆分为工作表并使用:

inventory = read_excel('path_to_file.csv', 'sheet1', skiprow=1)
HPBladeSystemRack = read_excel('path_to_file.csv', 'sheet2', skiprow=2)

However:

然而:

  • This approach requires xlrdmodule.
  • Those log files have to be analyzed in real time, so that it would be way better to find a way to analyze them as they come from the logs.
  • The real logs have far more tables than those two.
  • 这种方法需要xlrd模块。
  • 这些日志文件必须实时分析,因此最好找到一种方法来分析它们,因为它们来自日志。
  • 真正的日志比这两个表多得多。

回答by DSM

If you know the table names beforehand, then something like this:

如果您事先知道表名,则如下所示:

df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}

should work to produce a dictionary with keys as the table names and values as the subtables.

应该努力生成一个字典,键作为表名,值作为子表。

>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
...     print("table:", k)
...     print(v)
...     print()
...     
table: HP BladeSystem Rack
              0          1               2
6   System Name  Rack Name  Enclosure Name
7      dg-enc05       BU40             NaN
8  dg-enc05-oa1       BU40        dg-enc05
9  dg-enc05-oa2       BU40        dg-enc05

table: Inventory
                    0             1              2
1         System Name    IP Address  System Status
2            dg-enc05           NaN         Normal
3  dg-enc05_vc_domain           NaN        Unknown
4        dg-enc05-oa1  172.20.0.213         Normal

Once you've got that, you can set the column names to the first rows, etc.

完成后,您可以将列名设置为第一行等。

回答by Stefan

I assume you know the names of the tables you want to parse out of the csvfile. If so, you could retrieve the indexpositions of each, and select the relevant slices accordingly. As a sketch, this could look like:

我假设您知道要从csv文件中解析出的表的名称。如果是这样,您可以检索index每个的位置,并相应地选择相关的切片。作为草图,这可能如下所示:

df = pd.read_csv('path_to_file')    
index_positions = []
for table in table_names:
    index_positions.append(df[df['col_with_table_names']==table].index.tolist()[0])

## Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_position.index(position)
    tables[table_names[table_no] = df.loc[position:index_positions[table_no+10]]

There are certainly more elegant solutions but this should give you a dictionarywith the table names as keysand the corresponding tables as values.

当然还有更优雅的解决方案,但这应该会给你一个dictionary表名 askeys和相应的表 as values

回答by JahMyst

Pandas doesn't seem to be ready to do this easily, so I ended up doing my own split_csvfunction. It only requires table names and will output .csvfiles named after each table.

Pandas 似乎还没有准备好轻松地做到这一点,所以我最终完成了自己的split_csv功能。它只需要表名,并将输出.csv以每个表命名的文件。

import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths

table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]

def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)

def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print title, start_index, end_index
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    with open(output_path, 'w') as output_file, open(csv_path, 'rb') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)

def detect_tables_from_csv(csv_path, table_names):
    output = []
    with open(csv_path, 'rb') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    match = match[0] # get the first matching element
                    try:
                        end_index = idx - 1
                        start_index
                    except NameError:
                        start_index = 0
                    else:
                        output.append((previous_match, start_index, end_index))
                    print "Found new table", col
                    start_index = idx
                    previous_match = match
                    match = False

        end_index = idx  # last 'end_index' set to EOF
        output.append((previous_match, start_index, end_index))
        return output


if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print "This file doesn't exist. Aborting."
        print e
        exit(1)