string: Remove html tags from a string in R

Question

Asked by Ryan Warnick

I'm trying to read web page source into R and process it as strings. I'm trying to take the paragraphs out and remove the html tags from the paragraph text. I'm running into the following problem:

I tried implementing a function to remove the html tags:

cleanFun=function(fullStr)
{
 #find location of tags and citations
 tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);

 #create storage for tag strings
 tagStrings=list()

 #extract and store tag strings
 for(i in 1:dim(tagLoc)[1])
 {
   tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
 }

 #remove tag strings from paragraph
 newStr=fullStr
 for(i in 1:length(tagStrings))
 {
   newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
 }
 return(newStr)
};

This works for some tags but not all. An example where it fails is the following string:

test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

The goal would be to obtain:

cleanFun(test)="junk junk junk junk"

However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.

Answer 1

Answered by Scott Ritchie

This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This will also work with multiple html tags in the same string!

This finds every instance of the pattern <.*?> in htmlString and replaces it with the empty string "". The ? in .*? makes it non-greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> separately instead of the whole string.
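
A quick way to see the difference is to run both the non-greedy and the greedy pattern on a string with an opening and a closing tag. This is only a minimal sketch in base R; the variable names are made up for illustration.

```r
# A string with an opening and a closing tag
s <- "<a> junk </a>"

# Non-greedy: each tag is matched separately, so the text between them survives
lazy <- gsub("<.*?>", "", s)

# Greedy: a single match runs from the first "<" to the last ">"
greedy <- gsub("<.*>", "", s)

print(lazy)    # [1] " junk "
print(greedy)  # [1] ""
```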

Answer 2

Answered by David Robinson

You can also do this with two functions in the rvest package:

library(rvest)

strip_html <- function(s) {
    html_text(read_html(s))
}

Example output:

> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Note that you should not use regexes to parse HTML.
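
One case where a pattern such as <.*?> goes wrong is a > character inside a quoted attribute value, which HTML allows. The sketch below uses a made-up input string (an assumption, not taken from the question) to show the leftover markup in base R; a parser-based function such as strip_html above would be expected to cope with it.

```r
# Hypothetical input: the title attribute contains a literal ">"
s <- "<a title=\"x > y\">link</a>"

# The non-greedy pattern stops at the first ">", which sits inside the attribute
leftover <- gsub("<.*?>", "", s)

print(leftover)  # [1] " y\">link"
```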

Answer 3

Answered by Peyton

Another approach uses tm.plugin.webmining, which uses XML internally.

> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Answer 4

Answered by Tyler Rinker

An approach using the qdap package:

library(qdap)
bracketX(test, "angle")

## > bracketX(test, "angle")
## [1] "junk junk junk junk"

Answer 5

Answered by user1609452

It is best not to parse html using regular expressions; see the question "RegEx match open tags except XHTML self-contained tags".

Use a package like XML. Read in the html code, parse it using, for example, htmlParse, and use XPath expressions to find the quantities relevant to you.

UPDATE:

To answer the OP's question

require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)
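
As an illustration of the calls above, the sketch below parses an HTML string rather than a file and pulls out the text of each paragraph. The input string and the XPath //p are assumptions chosen for the example, not taken from the OP's page; asText = TRUE tells htmlParse to treat its argument as markup instead of a file name.

```r
library(XML)

# Hypothetical page source held in a string
html <- "<html><body><p>junk junk<a href=\"/wiki/x\"> junk</a> junk</p><p>second</p></body></html>"

# Parse the string as HTML
doc <- htmlParse(html, asText = TRUE)

# xmlValue returns the text content of each <p> node, without the nested tags
paras <- xpathSApply(doc, "//p", xmlValue)
print(paras)
```

A different XPath expression would be needed if the text of interest is not inside p elements.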

Answer 6

Answered by PAC

Might it be easier with sub or gsub?

> test  <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"

Answer 7

Answered by Hong Ooi

First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string actually looks like once the escapes have been interpreted.
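
The point about escapes can be checked directly in base R. The sketch below reuses the string from the question; the names n_chars and has_backslash are only illustrative.

```r
# The string from the question, written with escaped double quotes
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

# print() displays the escapes; cat() writes the actual characters
print(test)
cat(test, "\n")

# "\"" is a single character, not a backslash followed by a quote
n_chars <- nchar("\"")

# Search for a literal backslash in the string; none is present
has_backslash <- grepl("\\", test, fixed = TRUE)

print(n_chars)        # [1] 1
print(has_backslash)  # [1] FALSE
```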

Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
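
Because the extracted tag is then used as a regular expression, the parentheses in (mathematics) are read as grouping operators rather than literal characters, which is a likely reason the replacement finds nothing for this string. The sketch below reproduces the effect with base R's gsub instead of stringr (a substitution made only to keep the example free of dependencies); fixed = TRUE makes the pattern literal.

```r
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

# The tag exactly as it appears in the string
tag <- "<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\">"

# As a regex, "(" and ")" form groups, so the literal parentheses never match
as_regex <- gsub(tag, "", test)

# As a fixed string, the tag is matched character for character
as_fixed <- gsub(tag, "", test, fixed = TRUE)

print(identical(as_regex, test))  # [1] TRUE
print(as_fixed)                   # [1] "junk junk junk junk"
```

With stringr, wrapping the pattern in fixed() expresses the same intent, although a parser remains the more robust choice for real pages.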

Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.
