Python Pandas dataframe reading exact specified range in an excel sheet
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/38560748/
Asked by spiff
I have a lot of different tables (and other unstructured data) in an Excel sheet. I need to create a dataframe from the range 'A3:D20' in 'Sheet2' of the Excel workbook 'data'.
All the examples I come across drill down only to the sheet level and do not show how to pick an exact range.
import openpyxl
import pandas as pd
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.get_sheet_by_name('Sheet2')
range = ['A3':'D20'] #<-- how to specify this?
spots = pd.DataFrame(sheet.range) #what should be the exact syntax for this?
print (spots)
Once I get this, I plan to look up data in column A and find its corresponding value in column B.
Edit 1: I realised that openpyxl takes too long, so I have switched to pandas.read_excel('data.xlsx','Sheet2') instead, which is much faster, at that stage at least.
Edit 2: For the time being, I have put my data in just one sheet and:
- removed all other info
- added column names
- applied index_col on my leftmost column
- then used wb.loc[]
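The lookup described in the question (find a key in column A, return the matching value in column B) can be sketched as follows. This is only an illustration: the small table is made up so that the snippet runs without a file, and the names `table` and `value` are hypothetical. With a real workbook the frame would come from something like pd.read_excel('data.xlsx', 'Sheet2', index_col=0).

```python
import pandas as pd

# Hypothetical stand-in for the cleaned-up sheet: column A holds the keys,
# column B holds the values to look up.
table = pd.DataFrame(
    {"A": ["alpha", "beta", "gamma"], "B": [10, 20, 30]}
).set_index("A")  # same effect as index_col on the leftmost column

# Label-based lookup: row label "beta", column "B".
value = table.loc["beta", "B"]
print(value)
```

Using the leftmost column as the index is what makes .loc usable for this kind of lookup; it assumes the keys in column A are unique.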
Accepted answer by ???S???
One way to do this is to use the openpyxl module.
Here's an example:
from openpyxl import load_workbook
wb = load_workbook(filename='data.xlsx',
                   read_only=True)
ws = wb['Sheet2']
# Read the cell values into a list of lists
data_rows = []
for row in ws['A3':'D20']:
    data_cols = []
    for cell in row:
        data_cols.append(cell.value)
    data_rows.append(data_cols)
# Transform into dataframe
import pandas as pd
df = pd.DataFrame(data_rows)
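A variation on the same idea, as a rough sketch rather than a definitive recipe: openpyxl's iter_rows can return plain values for a bounded block of cells, and the first row of the block can be used as column names. The assumptions here are mine, not the answer's: the workbook is built in memory so the snippet runs without data.xlsx, the sample rows are made up, the range is shortened to A3:D5, and the first row of the range is assumed to hold the headers.

```python
from openpyxl import Workbook
import pandas as pd

# Build a small in-memory workbook (hypothetical data) in place of data.xlsx.
wb = Workbook()
ws = wb.create_sheet("Sheet2")
ws["A1"] = "unrelated note above the table"
sample = [
    ["name", "qty", "price", "city"],   # row 3: assumed header row
    ["apple", 3, 1.5, "Paris"],         # row 4
    ["pear", 5, 2.0, "Lyon"],           # row 5
]
for r, values in enumerate(sample, start=3):
    for c, v in enumerate(values, start=1):
        ws.cell(row=r, column=c, value=v)

# Read only the block A3:D5 as tuples of values (not cell objects).
rows = list(
    ws.iter_rows(min_row=3, max_row=5, min_col=1, max_col=4, values_only=True)
)

# Treat the first row of the block as column names, the rest as data.
header, *body = rows
df = pd.DataFrame(body, columns=list(header))
print(df)
```

For the range in the question, max_row would be 20 instead of 5. If the range has no header row, passing all of `rows` to pd.DataFrame gives default integer column names, as in the answer above.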
Answer by shane
Use the following arguments from pandas read_excel documentation:
- skiprows : list-like
  - Rows to skip at the beginning (0-indexed)
- parse_cols : int or list, default None
  - If None then parse all columns,
  - If int then indicates last column to be parsed
  - If list of ints then indicates list of column numbers to be parsed
  - If string then indicates comma separated list of column names and column ranges (e.g. "A:E" or "A,C,E:F")
I imagine the call will look like:
df = read_excel(filename, 'Sheet2', skiprows = 2, parse_cols = 'A:D')
Answer by ddnsimplon
My answer, tested with pandas 0.25, works well:
pd.read_excel('resultat-elections-2012.xls', sheet_name = 'France entière T1T2', skiprows = 2, nrows= 5, usecols = 'A:H')
pd.read_excel('resultat-elections-2012.xls', index_col = None, skiprows= 2, nrows= 5, sheet_name='France entière T1T2', usecols=range(0,8))
So: I need the data after the first two lines (skiprows = 2), then the desired number of lines (nrows = 5), and columns A to H (usecols).
Be careful: @shane's answer needs to be updated to use the newer pandas parameters (usecols in place of parse_cols).
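To tie this back to the range in the question, here is a minimal sketch of a helper that turns an A1-style range such as 'A3:D20' into the skiprows, nrows and usecols arguments used above. The helper names column_index and range_to_read_excel_kwargs are hypothetical (they are not part of pandas), and the sketch assumes a simple rectangular range with no sheet prefix or '$' markers.

```python
import re


def column_index(letters: str) -> int:
    """Convert an Excel column label such as 'A' or 'AA' to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def range_to_read_excel_kwargs(cell_range: str) -> dict:
    """Translate a range like 'A3:D20' into keyword arguments for read_excel."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a valid range: {cell_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    first_row, last_row = int(first_row), int(last_row)
    if first_row < 1 or last_row < first_row:
        raise ValueError(f"rows out of order in {cell_range!r}")
    if column_index(last_col) < column_index(first_col):
        raise ValueError(f"columns out of order in {cell_range!r}")
    return {
        "usecols": f"{first_col.upper()}:{last_col.upper()}",
        "skiprows": first_row - 1,            # rows above the range
        "nrows": last_row - first_row + 1,    # rows inside the range
        "header": None,                       # treat every row as data
    }


kwargs = range_to_read_excel_kwargs("A3:D20")
print(kwargs)
```

The result could then be passed along as pd.read_excel('data.xlsx', sheet_name='Sheet2', **kwargs). Because header is None, the first row of the range is read as data; if row 3 actually holds the column names, dropping the header key and reducing nrows by one should keep the read inside the range, though that is worth checking against the pandas version in use.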