原文地址: http://stackoverflow.com/questions/17227294/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Removing html tags from a string in R
Asked by Ryan Warnick
I'm trying to read web page source into R and process it as strings. I'm trying to take the paragraphs out and remove the html tags from the paragraph text. I'm running into the following problem:
I tried implementing a function to remove the html tags:
cleanFun=function(fullStr)
{
  #find location of tags and citations
  tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);
  #create storage for tag strings
  tagStrings=list()
  #extract and store tag strings
  for(i in 1:dim(tagLoc)[1])
  {
    tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
  }
  #remove tag strings from paragraph
  newStr=fullStr
  for(i in 1:length(tagStrings))
  {
    newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
  }
  return(newStr)
};
This works for some tags but not all. One example where it fails is the following string:
test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
The goal would be to obtain:
cleanFun(test)="junk junk junk junk"
However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.
Answered by Scott Ritchie
This can be achieved simply through regular expressions and the grep family:
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
This will also work with multiple html tags in the same string!
This finds every instance of the pattern <.*?> in htmlString and replaces it with the empty string "". The ? in .*? makes the match non-greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> separately instead of the whole string.
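The behaviour described above can be checked with a short base R sketch. The input strings are made up for illustration, and the second call assumes only that gsub is vectorised over its input, which holds in base R:

```r
# Same one-liner as in the answer, written without the explicit return()
cleanFun <- function(htmlString) {
  gsub("<.*?>", "", htmlString)
}

# Opening and closing tags are matched separately, so the text between survives
single <- cleanFun("<a> junk </a>")
print(single)  # " junk "

# gsub is vectorised, so a character vector of fragments works too
many <- cleanFun(c("<p>one</p>", "two<br/>three"))
print(many)    # "one" "twothree"
```

Note that the whitespace around the removed tags is left untouched; trimming it would be a separate step.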
Answered by David Robinson
You can also do this with two functions in the rvest package:
library(rvest)
strip_html <- function(s) {
  html_text(read_html(s))
}
Example output:
> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
Answered by Peyton
Another approach, using tm.plugin.webmining, which uses XML internally.
> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
Answered by Tyler Rinker
An approach using the qdap package:
library(qdap)
bracketX(test, "angle")
## > bracketX(test, "angle")
## [1] "junk junk junk junk"
Answered by user1609452
It is best not to parse HTML using regular expressions; see the Stack Overflow question "RegEx match open tags except XHTML self-contained tags".
Use a package like XML. Read in the HTML source, parse it using, for example, htmlParse, and use XPath expressions to find the quantities relevant to you.
UPDATE:
To answer the OP's question
require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)
Answered by PAC
It may be easier with sub or gsub?
> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"
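One caveat worth illustrating: the greedy pattern <.*> gives the desired result here only because the test string contains a single tag. The sketch below uses a made-up two-tag string to show the difference, and offers <[^>]*> as one possible alternative (an assumption of this example, not something proposed in the answer) that does not depend on lazy quantifiers:

```r
# Hypothetical input with an opening and a closing tag
two_tags <- "keep <b>bold</b> this"

# Greedy: .* runs from the first "<" to the last ">", swallowing "bold" as well
greedy <- gsub("<.*>", "", two_tags)
print(greedy)   # "keep  this"

# Negated character class: each match stops at the first ">" it meets
negated <- gsub("<[^>]*>", "", two_tags)
print(negated)  # "keep bold this"
```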
Answered by Hong Ooi
First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string would actually look like if escaped characters were treated literally.
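A small base R sketch of this point, using a shortened version of the tag from the question (the variable names are illustrative only):

```r
# \" inside a string literal is a single character: the double quote
quote_only <- "\""
print(nchar(quote_only))  # 1

test <- "<a href=\"/wiki/abstraction_(mathematics)\">"

# The string contains no backslash at all (fixed = TRUE searches for a literal "\")
has_backslash <- grepl("\\", test, fixed = TRUE)
print(has_backslash)      # FALSE

# print() shows the escaped form; cat() writes the characters actually stored
print(test)
cat(test, "\n")
```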
Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
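This also suggests one plausible reason the question's cleanFun fails on the example (an inference from the code, not something the answers spell out): the extracted tag text is passed to str_replace_all as a pattern, so the parentheses in (mathematics) act as a regex group rather than literal characters, and the pattern no longer matches the tag it came from. The sketch below reproduces the effect with base R's gsub instead of stringr, on the assumption that parentheses behave the same way in both regex engines; the names tag, as_regex and as_literal are illustrative.

```r
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

# Pull out the first tag, roughly what the question's substr() step produces
tag <- regmatches(test, regexpr("<[^>]*>", test))

# Used as a regex, "(" and ")" form a group, so the literal parentheses
# in the string are never matched and nothing is replaced
as_regex <- gsub(tag, "", test)
print(identical(as_regex, test))  # TRUE

# Treated as a fixed (literal) string, the tag is found and removed
as_literal <- gsub(tag, "", test, fixed = TRUE)
print(as_literal)                 # "junk junk junk junk"
```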
Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.