How can I convert an HTML table to CSV?

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1403087/

Date: 2020-08-29 00:49:50  Source: igfitidea

Tags: html, csv, html-table

Asked by pavium

How do I convert the contents of an HTML table (<table>) to CSV format? Is there a library or Linux program that does this? It would be similar to copying a table in Internet Explorer and pasting it into Excel.

Answer by pavium

This method is not really a library or a program, but for ad hoc conversions you can:

  • put the HTML for a table in a text file called something.xls
  • open it with a spreadsheet
  • save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.

But you probably would prefer a Perl or Ruby script...

Answer by DRendar

Sorry for resurrecting an ancient thread, but I recently wanted to do this with a 100% portable bash script. So here's my solution using only grep and sed.

The script below was bashed out very quickly, so it could be made much more elegant, but I'm really just getting started with sed/awk etc.

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n\r' | sed 's/<\/TR[^>]*>/\n/Ig'  | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.

Here's the explanation:

Get the contents of the URL using cURL, sending stderr to /dev/null (no progress meter)

curl "http://www.webpagewithtableinit.com/" 2>/dev/null 

Keep only table elements (return only lines containing TABLE, TR, TH, or TD tags)

| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

Remove any whitespace at the beginning of each line.

| sed 's/^[\ \t]*//g' 

Remove newlines and carriage returns

| tr -d '\n\r' 

Replace </TR> with a newline

| sed 's/<\/TR[^>]*>/\n/Ig'  

Remove TABLE and TR tags

| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' 

Remove a leading <TD> or <TH> (^<TD>, ^<TH>) and a trailing </TD> or </TH> (</TD>$, </TH>$) from each line

| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' 

Replace </TD><TD> with a comma

| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
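
For readers who would rather stay in one language, the steps above can be mirrored in Python with the re module. This is only a rough sketch: the function name table_html_to_csv_lines and the sample markup are made up for illustration, and it deliberately copies the regex approach of the pipeline, so it shares the same limitations (no nested tables, no tags inside cells, no comma escaping).

```python
import re

TAG_FLAGS = re.IGNORECASE


def table_html_to_csv_lines(html):
    """Mirror the grep/sed/tr steps with regular expressions (same limitations)."""
    # grep: keep only lines that contain a TABLE, TD, TR or TH tag
    kept = [line for line in html.splitlines()
            if re.search(r"</?(table|td|tr|th)", line, TAG_FLAGS)]
    # sed + tr: strip leading whitespace, then join everything into one string
    joined = "".join(line.lstrip(" \t") for line in kept)
    # sed: turn each </TR> into a newline, then drop TABLE and TR tags
    joined = re.sub(r"</tr[^>]*>", "\n", joined, flags=TAG_FLAGS)
    joined = re.sub(r"</?(table|tr)[^>]*>", "", joined, flags=TAG_FLAGS)
    rows = []
    for line in joined.split("\n"):
        # sed: remove the leading <TD>/<TH> and the trailing </TD>/</TH>
        line = re.sub(r"^<t[dh][^>]*>|</?t[dh][^>]*>$", "", line, flags=TAG_FLAGS)
        # sed: replace each </TD><TD> boundary with a comma
        line = re.sub(r"</t[dh][^>]*><t[dh][^>]*>", ",", line, flags=TAG_FLAGS)
        if line:
            rows.append(line)
    return rows


# Hypothetical sample input
sample = """<table>
  <tr>
    <th>a</th><th>b</th>
  </tr>
  <tr>
    <td>1</td><td>2</td>
  </tr>
</table>"""

csv_lines = table_html_to_csv_lines(sample)
print("\n".join(csv_lines))
```

Splitting on newlines before the last two substitutions plays the role of sed working line by line, which is why ^ and $ behave the same way here.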

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.
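
To illustrate the comma caveat, here is a minimal Python sketch, assuming the cell values have already been extracted into lists (the sample rows are invented). The standard csv module quotes any field that contains the delimiter, so no manual escaping is needed.

```python
import csv
import io

# Hypothetical rows, as they might come out of a table; one cell contains a comma
rows = [
    ["Name", "City"],
    ["Smith, John", "London"],
]

# Write to an in-memory buffer; csv.writer quotes fields that contain the delimiter
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text, end="")
```

Passing delimiter="\t" to csv.writer would be the "different delimiter" route instead.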

Hope this helps someone!

Answer by audiodude

Here's a Ruby script that uses Nokogiri: http://nokogiri.rubyforge.org/nokogiri/

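require 'nokogiri'

# table_string is assumed to hold the HTML source containing the table
doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  cells = row.xpath('td').map do |cell|
    # Collapse runs of whitespace and escape embedded quotes by doubling them
    '"' + cell.text.gsub(/\s+/, ' ').strip.gsub('"', '""') + '"'
  end
  # Join with commas so there is no trailing separator at the end of the row
  puts cells.join(',')
end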

Worked for my basic test case.

Answer by Yuval

Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. I'm not sure how it'll handle nested tables (it'll probably do bad stuff) or multiple tables (they'll probably just appear one after another). It doesn't handle colspan or rowspan. Enjoy.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
import sys
import re


class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False


parser = HTMLTableParser() 
parser.feed(sys.stdin.read()) 
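
The program above writes tab-delimited text. If real CSV output is wanted, one possible variation (a sketch, not the author's code) is to collect the cells into rows and hand them to the standard csv module. The names TableToRows and html_table_to_csv are hypothetical, and the sketch assumes Python 3, explicit closing tags, no nested tables, and no colspan or rowspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of td/th cells into a list of rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished rows
        self.current_row = None   # cells of the row being read, or None
        self.current_cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th") and self.current_row is not None:
            self.current_cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.current_cell is not None:
            # Join the fragments and collapse runs of whitespace to single spaces
            text = " ".join("".join(self.current_cell).split())
            self.current_row.append(text)
            self.current_cell = None
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None

    def handle_data(self, data):
        if self.current_cell is not None:
            self.current_cell.append(data)


def html_table_to_csv(html):
    """Return the table cells found in html as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample input
sample = "<table><tr><th>a</th><th>b</th></tr><tr><td>1, 2</td><td>x  y</td></tr></table>"
result = html_table_to_csv(sample)
print(result, end="")
```

To use it the same way as the original program, html_table_to_csv(sys.stdin.read()) could be written to sys.stdout.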

Answer by toms.work

Just to add to these answers (as I've recently been attempting a similar thing): if Google Spreadsheets is your spreadsheet program of choice, simply do these two things.

1. Strip everything out of your HTML file around the table opening/closing tags and resave it as another HTML file.

2. Import that HTML file directly into Google Spreadsheets and you'll have your information beautifully imported (top tip: if you used inline styles in your table, they will be imported as well!)

Saved me loads of time and figuring out different conversions.
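
Step 1 can also be scripted. A minimal Python sketch, assuming the page contains a single, non-nested table; the function name extract_first_table and the sample page are invented for illustration, and a regular expression is not a general HTML parser.

```python
import re


def extract_first_table(html):
    """Return the first <table>...</table> block, or None if there is none.

    Assumption: tables are not nested.
    """
    match = re.search(r"<table\b.*?</table\s*>", html, re.IGNORECASE | re.DOTALL)
    return match.group(0) if match else None


# Hypothetical sample page
page = ("<html><body><h1>Report</h1>"
        "<TABLE id='t'><tr><td>1</td></tr></TABLE>"
        "<p>footer</p></body></html>")
table_html = extract_first_table(page)
print(table_html)
```

The returned string can then be written to a new file (say, table_only.html, a made-up name) and imported as described in step 2.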

Answer by Bhagirath

Assuming that you've designed an HTML page containing a table, I would recommend this solution. It worked like a charm for me:

$(document).ready(() => {
  $("#buttonExport").click(e => {
    // Getting values of current time for generating the file name
    const dateTime = new Date();
    const day      = dateTime.getDate();
    const month    = dateTime.getMonth() + 1;
    const year     = dateTime.getFullYear();
    const hour     = dateTime.getHours();
    const minute   = dateTime.getMinutes();
    const postfix  = `${day}.${month}.${year}_${hour}.${minute}`;

    // Creating a temporary HTML link element (they support setting file names)
    const downloadElement = document.createElement('a');

    // Getting data from our `div` that contains the HTML table
    const dataType  = 'data:application/vnd.ms-excel';
    const tableDiv  = document.getElementById('divData');
    const tableHTML = tableDiv.outerHTML.replace(/ /g, '%20');

    // Setting the download source
    downloadElement.href = `${dataType},${tableHTML}`;

    // Setting the file name
    downloadElement.download = `exported_table_${postfix}.xls`;

    // Trigger the download
    downloadElement.click();

    // Just in case, prevent default behaviour
    e.preventDefault();
  });
});

Courtesy: http://www.kubilayerdogan.net/?p=218

You can change the file extension to .csv here:

downloadElement.download = `exported_table_${postfix}.csv`;

Answer by Chris Simmons

I'm not sure if there is a pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.

Answer by jmcnamara

With Perl you can use the HTML::TableExtract module to extract the data from the table, and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.

Answer by Met Kiani

Here is a simple solution without any external library:

https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/

It works for me without any issues.

Answer by atomicules

Based on audiodude's answer, but simplified by using the built-in CSV library:

require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')

doc.xpath('//table//tr').each do |row|
    tarray = [] #temporary array
    row.xpath('td').each do |cell|
        tarray << cell.text #Build array of that row of data.
    end
    csv << tarray #Write that row out to csv file
end

csv.close

I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write it out as an array to the CSV file in one step. But I could only figure out how to do it by iterating over each cell and building a temporary array of each cell's content.