bash: Convert an HTML table to a CSV file from the shell

Question

Asked by R. Leroi

I'm trying to convert a file containing an HTML table to CSV format. An excerpt from the file follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
    Untitled Page
    </title></head>
    <body>
        <form name="form1" method="post" action="mypricelist.aspx" id="form1">
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />

    <div>
        <table id="price_list" border="0">
    <tr>
        <td>ProdCode</td><td>Description</td><td>Your Price</td>
    </tr><tr>
        <td>ab101</td><td>loruem</td><td>1.1</td>
    </tr><tr>
        <td>ab102</td><td>ipsum</td><td>0.1</td>
    </tr><tr>

I tried using

    xls2csv -x -c\; evprice.xls > evprice.csv

but that gives me this error:

    evprice.xls is not OLE file or Error

I googled it, and apparently this happens because the file isn't a real XLS file, just HTML.

When I run

    file evprice.xls

it reports that the file is HTML, so I found a 'solution' using LibreOffice:

    libreoffice --headless -convert-to csv ./evprice.xls

This doesn't give an error, but the CSV output file is garbled, like opening an exe file in Notepad.

It contains a lot of strange characters like these:

    —??-t9ü~?óXtK￠

Does anyone know why this is happening, and does anyone have a working solution?

Answer 1

Answered by Richard

I have built a Python utility that converts every table in an HTML file into a separate CSV file.

You can find it here.

The crux of the script is this:

    import csv
    import sys

    # Requires BeautifulSoup 4 (pip install beautifulsoup4) and Python 3
    from bs4 import BeautifulSoup

    # Path of the HTML file to convert, given on the command line
    filename = sys.argv[1]

    print("Opening file")
    with open(filename, 'r') as fin:
        html = fin.read()

    print("Parsing file")
    soup = BeautifulSoup(html, 'html.parser')  # HTML entities are decoded automatically

    print("Preemptively removing unnecessary tags")
    for s in soup('script'):
        s.extract()

    print("CSVing file")
    for tablecount, table in enumerate(soup.find_all('table')):
        print("Processing Table #%d" % tablecount)
        # One output file per table: <input name>0.csv, <input name>1.csv, ...
        with open(filename + str(tablecount) + '.csv', 'w', newline='') as csvfile:
            fout = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            for row in table.find_all('tr'):
                cols = row.find_all(['td', 'th'])
                if cols:
                    # Write the text content of each cell as one CSV row
                    fout.writerow([x.text for x in cols])
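
If installing BeautifulSoup is not an option, the same idea (walk every table, emit one CSV row per `tr`) can be sketched with only the Python 3 standard library. This is a minimal sketch, not the utility linked above: the names `TableExtractor` and `html_tables_to_csv` are made up for this example, and it assumes simple, non-nested tables in which every row and cell has a closing tag.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> as a list of rows.

    Hypothetical helper for illustration; assumes non-nested tables
    with explicit </td>, </th> and </tr> closing tags.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that contain no cells
                self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, etc.) is ignored
        if self._cell is not None:
            self._cell.append(data)


def html_tables_to_csv(html):
    """Return one CSV-formatted string per table found in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for rows in parser.tables:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        results.append(buf.getvalue())
    return results
```

For example, `html_tables_to_csv(open("evprice.xls").read())` would return a list with one CSV string per table, which can then be written to files. A real parser such as BeautifulSoup is more forgiving of malformed markup, so this sketch is best suited to well-formed exports like the excerpt in the question.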
