How can I convert an HTML table to CSV?
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/1403087/
Asked by pavium
How do I convert the contents of an HTML table (<table>) to CSV format? Is there a library or Linux program that does this? This is similar to copying tables in Internet Explorer and pasting them into Excel.
Answered by pavium
This method is not really a library OR a program, but for ad hoc conversions you can:
- put the HTML for a table in a text file called something.xls
- open it with a spreadsheet
- save it as CSV.
I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.
But you probably would prefer a Perl or Ruby script...
Answered by DRendar
Sorry for resurrecting an ancient thread, but I recently wanted to do this with a 100% portable bash script. So here's my solution using only grep and sed.
The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...
curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n\r' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
As you can see, I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.
Here's the explanation:
Get the contents of the URL using cURL, sending stderr to /dev/null (no progress meter):
curl "http://www.webpagewithtableinit.com/" 2>/dev/null
Keep only table elements (return only lines containing TABLE, TR, TH, or TD tags):
| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'
Remove any whitespace at the beginning of each line:
| sed 's/^[\ \t]*//g'
Remove newlines:
| tr -d '\n\r'
Replace </TR> with a newline:
| sed 's/<\/TR[^>]*>/\n/Ig'
Remove TABLE and TR tags:
| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'
Remove ^<TD>, ^<TH>, </TD>$, and </TH>$:
| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'
Replace </TD><TD> with a comma:
| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.
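To illustrate the comma caveat, here is a minimal Python sketch (not part of the grep/sed pipeline above) of what CSV escaping usually looks like. Python's standard csv module wraps a field in double quotes when it contains a comma or a quote, and doubles any embedded quotes. The helper name rows_to_csv and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows (lists of strings) to CSV text using the csv module's default quoting."""
    buffer = io.StringIO()
    # lineterminator="\n" is chosen here for Unix-style output; the csv default is "\r\n"
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cell values: one contains a comma, one contains double quotes
sample_rows = [["Name", "Note"], ["Smith, John", 'He said "hi"']]
print(rows_to_csv(sample_rows), end="")
```

Fields without commas or quotes are written unchanged, so the output stays readable for simple tables.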
Hope this helps someone!
Answered by audiodude
Here's a Ruby script that uses nokogiri (http://nokogiri.rubyforge.org/nokogiri/):
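require 'nokogiri'
# table_string is expected to hold the HTML that contains the table
doc = Nokogiri::HTML(table_string)
doc.xpath('//table//tr').each do |row|
  cells = row.xpath('td').map do |cell|
    # CSV escapes a double quote by doubling it, not with a backslash
    text = cell.text.gsub("\n", ' ').gsub('"', '""').gsub(/(\s){2,}/m, '')
    "\"#{text}\""
  end
  # Join with commas so the row has no trailing separator
  puts cells.join(',')
end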
Worked for my basic test case.
Answered by Yuval
Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan or rowspan. Enjoy.
# Python 3: HTMLParser lives in html.parser (in Python 2 it was the HTMLParser module)
from html.parser import HTMLParser
import sys
import re

class HTMLTableParser(HTMLParser):
    # The defaults write tab-separated output; pass cell_delim="," for commas (no quoting is applied)
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            # Write the row delimiter before every row except the first
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            # Write the cell delimiter before every cell except the first in a row
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #    sys.stdout.write(" ")
            # Collapse runs of whitespace inside the cell text
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False

# Read HTML from stdin and write the delimited table to stdout
parser = HTMLTableParser()
parser.feed(sys.stdin.read())
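The program above writes tab-delimited text and does no quoting. As a rough sketch of the same idea that produces real CSV, the Python 3 variant below collects the cells into rows first and lets the standard csv module handle commas and quotes. The names TableToRows and html_table_to_csv are made up for this example. It assumes tables are not nested, and like the original it ignores colspan and rowspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None outside a cell

    def _end_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:
                self._row = []  # cell without an enclosing <tr>
            self._end_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()  # flush a row left open at the end of the input


def html_table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

To use it as a filter like the original program, something along the lines of sys.stdout.write(html_table_to_csv(sys.stdin.read())) should work. Rows from several tables are simply written one after another.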
Answered by toms.work
Just to add to these answers (as I've recently been attempting a similar thing): if Google Spreadsheets is your spreadsheet program of choice, simply do these two things.
1. Strip everything out of your HTML file around the table opening/closing tags and resave it as another HTML file.
2. Import that HTML file directly into Google Spreadsheets and you'll have your information beautifully imported (top tip: if you used inline styles in your table, they will be imported as well!).
Saved me loads of time and figuring out different conversions.
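Step 1 can also be done by script instead of by hand. The following is only a sketch in Python: the function name extract_first_table is made up, and the regular expression assumes the page has no nested tables (a nested table would be cut off at the first closing tag).

```python
import re


def extract_first_table(html):
    """Return the first <table>...</table> block in html, or None if there is none."""
    match = re.search(r"<table\b.*?</table\s*>", html, flags=re.IGNORECASE | re.DOTALL)
    return match.group(0) if match else None


# Hypothetical page content, for illustration only
page = "<html><body><h1>Report</h1><table><tr><td>1</td></tr></table><p>footer</p></body></html>"
print(extract_first_table(page))
```

The returned string can then be saved as a new HTML file and imported into the spreadsheet as described in step 2.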
Answered by Bhagirath
Assuming that you've designed an HTML page containing a table, I would recommend this solution. Worked like a charm for me:
$(document).ready(() => {
  $("#buttonExport").click(e => {
    // Getting values of current time for generating the file name
    const dateTime = new Date();
    const day = dateTime.getDate();
    const month = dateTime.getMonth() + 1;
    const year = dateTime.getFullYear();
    const hour = dateTime.getHours();
    const minute = dateTime.getMinutes();
    const postfix = `${day}.${month}.${year}_${hour}.${minute}`;
    // Creating a temporary HTML link element (they support setting file names)
    const downloadElement = document.createElement('a');
    // Getting data from our `div` that contains the HTML table
    const dataType = 'data:application/vnd.ms-excel';
    const tableDiv = document.getElementById('divData');
    const tableHTML = tableDiv.outerHTML.replace(/ /g, '%20');
    // Setting the download source
    downloadElement.href = `${dataType},${tableHTML}`;
    // Setting the file name
    downloadElement.download = `exported_table_${postfix}.xls`;
    // Trigger the download
    downloadElement.click();
    // Just in case, prevent default behaviour
    e.preventDefault();
  });
});
Courtesy: http://www.kubilayerdogan.net/?p=218
You can change the exported file's extension to .csv here (this changes only the file name; the downloaded content is still the table's HTML):
downloadElement.download = `exported_table_${postfix}.csv`;
Answered by Chris Simmons
I'm not sure if there is a pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.
Answered by jmcnamara
With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.
Answered by Met Kiani
Here is a simple solution without any external lib:
https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/
It works for me without any issue.
Answered by atomicules
Based on audiodude's answer, but simplified by using the built-in CSV library:
require 'nokogiri'
require 'csv'
doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')
doc.xpath('//table//tr').each do |row|
  tarray = [] # temporary array
  row.xpath('td').each do |cell|
    tarray << cell.text # Build array of that row of data.
  end
  csv << tarray # Write that row out to csv file
end
csv.close
I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write it out as an array to the CSV file in one step. But I could only figure out how to do it by iterating over each cell and building a temporary array of each cell's content.