How to create a DataFrame from a table in a Word document (.docx) file using pandas
Original question: http://stackoverflow.com/questions/47977367/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by pyd
I have a Word file (.docx) containing a table of data, and I am trying to create a pandas DataFrame from that table. I have used the docx and pandas modules, but I could not create a DataFrame.
from docx import Document

document = Document('req.docx')
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
I also tried to read the table as a DataFrame with pd.read_table("path of the file").
I can read the data cell by cell, but I want to read the entire table or any particular column. Thanks in advance.
Answered by MaxU
docx always reads data from Word tables as text (strings).
If we want to parse data with correct dtypes we can do one of the following:
- manually specify dtype for all columns (not flexible)
- write our own code to guess correct dtypes (too difficult, and Pandas IO methods already do it well)
- convert data into CSV format and let pd.read_csv() guess/infer correct dtypes (I've chosen this way)
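To make the first option concrete, here is a minimal sketch of building a DataFrame straight from cell strings and casting a column by hand. The `rows` list is a hypothetical stand-in for the text that `cell.text` would yield; no Word file is read here.

```python
import pandas as pd

# Hypothetical stand-in for
# [[cell.text for cell in row.cells] for row in table.rows];
# every value is a string, as it would be when read from a Word table.
rows = [
    ["A", "B"],
    ["1", "B1"],
    ["2", "B2"],
]

# Use the first row as the header and the remaining rows as data.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Every column starts out holding strings, so each numeric column
# has to be cast explicitly, and this mapping must be kept up to date
# for every table layout.
df = df.astype({"A": "int64"})
```

This works, but the dtype mapping has to be written for each table, which is why it is described as not flexible.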
Many thanks to @Anton vBR for improving the function!
import pandas as pd
import io
import csv
from docx import Document

def read_docx_tables(filename, tab_id=None, **kwargs):
    """
    parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)

    Parameters:
        filename:   file name of a Word Document
        tab_id:     parse a single table with the index: [tab_id] (counting from 0).
                    When [None] - return a list of DataFrames (parse all tables)
        kwargs:     arguments to pass to `pd.read_csv()` function

    Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
    """
    def read_docx_tab(tab, **kwargs):
        vf = io.StringIO()
        writer = csv.writer(vf)
        for row in tab.rows:
            writer.writerow(cell.text for cell in row.cells)
        vf.seek(0)
        return pd.read_csv(vf, **kwargs)

    doc = Document(filename)
    if tab_id is None:
        return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
    else:
        try:
            return read_docx_tab(doc.tables[tab_id], **kwargs)
        except IndexError:
            print('Error: specified [tab_id]: {} does not exist.'.format(tab_id))
            raise
NOTE: you may want to add more checks and exception handling...
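The CSV round trip can be tried without a Word file. The sketch below uses a hypothetical `rows` list in place of table cells and shows why the text is routed through `csv.writer` rather than joined with commas: cells that contain commas or double quotes should survive as single fields.

```python
import csv
import io

import pandas as pd

# Hypothetical cell text, including a comma and double quotes inside cells.
rows = [
    ["A", "B", "C,X"],
    ["1", "B1", "val1, val2, val3"],
    ["2", "B2", 'Test "quotes"'],
]

buf = io.StringIO()
# csv.writer quotes any cell containing the delimiter or a quote character,
# so such cells are written as one field each.
csv.writer(buf).writerows(rows)
buf.seek(0)

# read_csv parses the buffer and infers a dtype for each column.
df = pd.read_csv(buf)
```

Here column `A` should come back as integers, while the header `C,X` and the comma-separated cell stay intact.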
Examples:
In [209]: dfs = read_docx_tables(fn)

In [210]: dfs[0]
Out[210]:
   A   B               C,X
0  1  B1                C1
1  2  B2                C2
2  3  B3  val1, val2, val3

In [211]: dfs[0].dtypes
Out[211]:
A       int64
B      object
C,X    object
dtype: object

In [212]: dfs[0].columns
Out[212]: Index(['A', 'B', 'C,X'], dtype='object')

In [213]: dfs[1]
Out[213]:
   C1  C2          C3    Text column
0  11  21         NaN  Test "quotes"
1  12  23  2017-12-31            NaN

In [214]: dfs[1].dtypes
Out[214]:
C1              int64
C2              int64
C3             object
Text column    object
dtype: object

In [215]: dfs[1].columns
Out[215]: Index(['C1', 'C2', 'C3', 'Text column'], dtype='object')
Parsing dates:
In [216]: df = read_docx_tables(fn, tab_id=1, parse_dates=['C3'])

In [217]: df
Out[217]:
   C1  C2          C3    Text column
0  11  21         NaT  Test "quotes"
1  12  23  2017-12-31            NaN

In [218]: df.dtypes
Out[218]:
C1                      int64
C2                      int64
C3             datetime64[ns]
Text column            object
dtype: object
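Since extra keyword arguments are forwarded to `pd.read_csv()`, options such as `usecols` should also be usable for reading only particular columns, which is what the question asked about. The sketch below applies them to `pd.read_csv()` directly on hypothetical CSV text rather than on a real Word file.

```python
import io

import pandas as pd

# Hypothetical CSV text, similar to what the helper builds from a table.
csv_text = "C1,C2,C3\n11,21,\n12,23,2017-12-31\n"

# usecols limits parsing to the named columns; parse_dates converts C3.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["C1", "C3"],
    parse_dates=["C3"],
)

# A single column can then be selected as a Series.
c1 = df["C1"]
```

With the function above, the equivalent call would presumably be `read_docx_tables(fn, tab_id=1, usecols=['C1', 'C3'], parse_dates=['C3'])`.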