bash: Converting an HTML table to a CSV file from the shell

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/22018003/

Time: 2020-09-18 09:41:29  Source: igfitidea

Converting HTML table to CSV file from shell

Tags: linux, bash, csv, libreoffice

Asked by R. Leroi

I'm trying to convert a file containing an HTML table to CSV format. An excerpt from this file follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
    Untitled Page
    </title></head>
    <body>
        <form name="form1" method="post" action="mypricelist.aspx" id="form1">
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />

    <div>
        <table id="price_list" border="0">
    <tr>
        <td>ProdCode</td><td>Description</td><td>Your Price</td>
    </tr><tr>
        <td>ab101</td><td>loruem</td><td>1.1</td>
    </tr><tr>
        <td>ab102</td><td>ipsum</td><td>0.1</td>
    </tr><tr>

I tried using

    xls2csv -x -c\; evprice.xls > evprice.csv

but that gives me this error:

    evprice.xls is not OLE file or Error

I googled it, and apparently this is because the file isn't a proper XLS file but just HTML.
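One way to confirm this without trusting the file extension is to look at the first bytes of the file. The following is a minimal sketch, assuming that a genuine legacy `.xls` file starts with the OLE2 compound-document signature and that an HTML file starts with a doctype, an `<html>` tag, or contains a `<table>` tag early on. The function name `sniff_format` is hypothetical and is not part of `xls2csv` or any other tool mentioned here.

```python
# OLE2 compound-document signature found at the start of real legacy .xls files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(data: bytes) -> str:
    """Guess from the leading bytes whether a file is OLE (.xls) or HTML.

    Returns "ole", "html", or "unknown". This is a heuristic, not a validator.
    """
    if data.startswith(OLE_MAGIC):
        return "ole"
    # Skip an optional UTF-8 byte-order mark and leading whitespace,
    # then compare case-insensitively.
    head = data[:1024].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if head.startswith((b"<!doctype html", b"<html")) or b"<table" in head:
        return "html"
    return "unknown"
```

Calling `sniff_format(open("evprice.xls", "rb").read(1024))` on a file like the excerpt above would be expected to return `"html"`, which is consistent with the "is not OLE file" error.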

When I run

    file evprice.xls

it says the file is HTML, so I found a 'solution' using LibreOffice:

    libreoffice --headless -convert-to csv ./evprice.xls 

This does not give an error, but the CSV output file is garbage, like opening an .exe file in Notepad.

It contains a lot of strange characters like these:

    —??-t9ü~?óXtK¢

Does anyone know why this is happening, and have a working solution?

Answered by Richard

I have built a Python utility which converts all the tables in an HTML file into separate CSV files.

You can find it here.

The crux of the script is this:

    import csv
    import sys

    # Requires Python 3 and the bs4 package (BeautifulSoup 4).
    from bs4 import BeautifulSoup

    # The HTML file to convert is given as the first command-line argument.
    filename = sys.argv[1]

    print("Opening file")
    with open(filename, "r") as fin:
        html = fin.read()

    print("Parsing file")
    # BeautifulSoup 4 converts HTML entities to text automatically.
    soup = BeautifulSoup(html, "html.parser")

    print("Preemptively removing unnecessary tags")
    for s in soup("script"):
        s.extract()

    print("CSVing file")
    # Each table goes to its own file: <filename>0.csv, <filename>1.csv, ...
    for tablecount, table in enumerate(soup.find_all("table")):
        print("Processing Table #%d" % tablecount)
        # newline="" lets the csv module control line endings itself.
        with open(filename + str(tablecount) + ".csv", "w", newline="") as csvfile:
            fout = csv.writer(csvfile, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
            for row in table.find_all("tr"):
                cols = row.find_all(["td", "th"])
                if cols:
                    fout.writerow([x.get_text() for x in cols])
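
If installing BeautifulSoup is not an option, the same idea can be sketched with only the Python 3 standard library. This is not the utility linked above, just an illustration of the technique under some assumptions: the tables are not nested, every row is closed with `</tr>` and every cell with `</td>` or `</th>`, and the names `TableExtractor` and `tables_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by row and by table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that contained no cells
                self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def tables_to_csv(html, delimiter=","):
    """Return one CSV string per <table> found in the HTML text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for rows in parser.tables:
        buf = io.StringIO()
        csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
        results.append(buf.getvalue())
    return results
```

Passing `delimiter=";"` produces semicolon-separated output, matching the `-c\;` option used with `xls2csv` in the question. Each returned string can then be written to its own `.csv` file.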