Original URL: http://stackoverflow.com/questions/22018003/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Converting HTML table to CSV file from shell
Asked by R. Leroi
I'm trying to convert a file containing an HTML table to CSV format. An excerpt from the file follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
Untitled Page
</title></head>
<body>
<form name="form1" method="post" action="mypricelist.aspx" id="form1">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />
<div>
<table id="price_list" border="0">
<tr>
<td>ProdCode</td><td>Description</td><td>Your Price</td>
</tr><tr>
<td>ab101</td><td>loruem</td><td>1.1</td>
</tr><tr>
<td>ab102</td><td>ipsum</td><td>0.1</td>
</tr><tr>
I tried using
xls2csv -x -c\; evprice.xls > evprice.csv
but that gives me an error saying:
evprice.xls is not OLE file or Error
I searched online, and the explanation I found is that the file is not a real XLS file but just HTML.
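The "not OLE file" message fits that explanation: legacy .xls files are OLE2 compound documents, which always begin with a fixed 8-byte signature, whereas this file begins with HTML markup. A minimal sketch of checking the first bytes from Python is below. The helper names `sniff_format` and `sniff_file` are hypothetical, and the HTML check is a rough heuristic, not a full content sniffer.

```python
# Signature that opens every OLE2 compound file (the container of legacy .xls).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE_MAGIC):
        return "ole"
    # Skip an optional UTF-8 byte order mark and leading whitespace.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text.startswith((b"<!doctype html", b"<html", b"<table")):
        return "html"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 512 bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(512))
```

For the file in this question, `sniff_file("evprice.xls")` would be expected to report "html", since the excerpt starts with a DOCTYPE declaration.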
When I try
file evprice.xls
it says it's HTML, so I found a 'solution' using LibreOffice:
libreoffice --headless -convert-to csv ./evprice.xls
Well, this does not give an error, but the CSV output file is all weird, like opening an exe file in Notepad.
It contains a lot of strange characters like these:
—??-t9ü~?óXtK¢
Does anyone know why this is happening, and have a working solution?
Answered by Richard
I have built a Python utility which converts all the tables in an HTML file into separate CSV files.
You can find it here.
The crux of the script is this:
# Python 2 script; uses BeautifulSoup 3 (the `BeautifulSoup` package, not `bs4`).
import csv
import sys

from BeautifulSoup import BeautifulSoup

filename = "MY_HTML_FILE"

print "Opening file"
with open(filename, 'r') as fin:
    html = fin.read()

print "Parsing file"
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

print "Preemptively removing unnecessary tags"
for s in soup('script'):
    s.extract()

print "CSVing file"
for tablecount, table in enumerate(soup.findAll("table")):
    print "Processing Table #%d" % (tablecount)
    # One CSV per table; the first command-line argument is the output name prefix.
    with open(sys.argv[1] + str(tablecount) + '.csv', 'wb') as csvfile:
        fout = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in table.findAll('tr'):
            # Header cells (<th>) and data cells (<td>) are treated alike.
            cols = row.findAll(['td', 'th'])
            if cols:
                cols = [x.text for x in cols]
                fout.writerow(cols)
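The script above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same idea (walk each table, emit one CSV row per `tr`) can be sketched with only the standard library. This is not the answer author's code: it swaps BeautifulSoup for `html.parser`, and the names `TableExtractor` and `html_tables_to_csv` are made up for illustration. It assumes tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None   # cells of the <tr> being read, or None
        self._cell = None  # text fragments of the <td>/<th> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            # Rows without any cells are skipped, as in the script above.
            if self._row:
                self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_tables_to_csv(html, delimiter=","):
    """Return one CSV string per <table> found in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for rows in parser.tables:
        buf = io.StringIO()
        csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
        results.append(buf.getvalue())
    return results
```

To write files instead of strings, each returned CSV text can be saved with `open(name, "w", newline="")`. Passing `delimiter=";"` produces the semicolon-separated output that the `xls2csv -c\;` attempt in the question was aiming for.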