JavaScript: How to get all text from all tags in one array?

Question

Asked by smotru

I need to create an array that contains all the text from a page, without using jQuery. This is my HTML:

<html>
<head>
    <title>Hello world!</title>
</head>
<body>
    <h1>Hello!</h1>
    <p>
        <div>What are you doing?</div>
        <div>Fine, and you?</div>
    </p>
    <a href="http://google.com">Thank you!</a>
</body>
</html>

Here is what I want to get:

text[1] = "Hello world!";
text[2] = "Hello!";
text[3] = "What are you doing?";
text[4] = "Fine, and you?";
text[5] = "Thank you!";

Here is what I have tried, but it does not seem to work correctly in my browser:

var elements = document.getElementsByTagName('*');
console.log(elements);

P.S. I need to use document.getElementsByTagName('*'); and exclude "script" and "style".

Answer 1

Answered by iConnor

var array = [];

var elements = document.body.getElementsByTagName("*");

for (var i = 0; i < elements.length; i++) {
    var current = elements[i];
    // Check that the element has no child elements and that it is not empty
    if (current.children.length === 0 && current.textContent.replace(/ |\n/g, '') !== '') {
        array.push(current.textContent);
    }
}

You could do something like the code above.

Demo:

result = ["What are you doing?", "Fine, and you?"]

Or you could use document.documentElement.getElementsByTagName('*'); to start from the root <html> element instead of <body>.

Also make sure your code runs inside a DOMContentLoaded handler like this:

document.addEventListener('DOMContentLoaded', function(){

   /// Code...
});
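
If the script may also run after parsing has finished (for example when it is loaded with async or injected later), DOMContentLoaded will already have fired and the handler above would never be called. A common guard is to check document.readyState first. The sketch below assumes a hypothetical helper named onReady; the doc parameter exists only so the helper can be exercised without a browser.

```javascript
// Hypothetical helper (not from the original answer): run callback once the
// DOM is parsed, whether or not DOMContentLoaded has already fired.
function onReady(doc, callback) {
  if (doc.readyState === "loading") {
    // Still parsing: wait for the event
    doc.addEventListener("DOMContentLoaded", callback);
  } else {
    // "interactive" or "complete": the DOM has already been parsed
    callback();
  }
}

// Only runs in a browser, where a global document exists
if (typeof document !== "undefined") {
  onReady(document, function () {
    console.log("DOM ready, elements:", document.getElementsByTagName("*").length);
  });
}
```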

If it's just the title you need, you may as well do this:

array.push(document.title);

This saves looping through scripts and styles.
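
Since the question asks to keep getElementsByTagName('*') and to exclude "script" and "style", the loop from this answer could be extended roughly as follows. This is a sketch, not the answerer's code: the name collectLeafText is made up here, and the function accepts any array-like list of elements so that it can also be tried outside a browser.

```javascript
// Hypothetical helper: collects the trimmed text of leaf elements,
// ignoring <script> and <style>.
function collectLeafText(elements) {
  var texts = [];
  for (var i = 0; i < elements.length; i++) {
    var current = elements[i];
    var tag = String(current.tagName).toUpperCase();
    if (tag === "SCRIPT" || tag === "STYLE") continue;
    var text = current.textContent.trim();
    // Keep only elements with no child elements and some non-whitespace text
    if (current.children.length === 0 && text !== "") {
      texts.push(text);
    }
  }
  return texts;
}

// In a browser, once the DOM has loaded:
if (typeof document !== "undefined") {
  console.log(collectLeafText(document.getElementsByTagName("*")));
}
```

Like the original loop, this only looks at elements without child elements, so text that sits next to a child element (such as "Fine, " in <div>Fine, <span>and you?</span></div>) is not collected.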

Answer 2

Answered by Pointy

If you want the contents of the entire page, you should be able to use:

var allText = document.body.textContent;

In Internet Explorer before IE9, there was the property innerText, which is similar but not identical. The MDN page about textContent has more detail.

One problem here is that textContent will also give you the content of any <style> or <script> tags, which may or may not be what you want. If you don't want that, you could use something like this:

function getText(startingPoint) {
  var text = "";
  function gt(start) {
    if (start.nodeType === 3)
      text += start.nodeValue;
    else if (start.nodeType === 1)
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE")
        for (var i = 0; i < start.childNodes.length; ++i)
          gt(start.childNodes[i]);
  }
  gt(startingPoint);
  return text;
}

Then:

var allText = getText(document.body);

Note: this (or document.body.innerText) will get you all the text, but in depth-first order. Getting all the text from a page in the order in which a human actually sees it once the page is rendered is a much more difficult problem, because it would require the code to understand the visual effects (and visual semantics!) of the layout as dictated by CSS (etc.).

Edit: if you want the text "stored into an array", I suppose on a node-by-node basis (?), you'd simply substitute array appends for the string concatenation in the above:

function getTextArray(startingPoint) {
  var text = [];
  function gt(start) {
    if (start.nodeType === 3)
      text.push(start.nodeValue);
    else if (start.nodeType === 1)
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE")
        for (var i = 0; i < start.childNodes.length; ++i)
          gt(start.childNodes[i]);
  }
  gt(startingPoint);
  return text;
}
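
On a real page the array returned by getTextArray also contains whitespace-only entries, because the indentation between tags is stored as text nodes. A minimal sketch of a variant that trims each entry and drops the empty ones is shown below. The name getCleanTextArray and the el/txt helpers are made up for illustration, and the plain-object tree is only a stand-in for real DOM nodes.

```javascript
// Variant of getTextArray (assumed name) that skips whitespace-only text nodes.
function getCleanTextArray(startingPoint) {
  var text = [];
  function gt(node) {
    if (node.nodeType === 3) {                 // text node
      var value = node.nodeValue.trim();
      if (value !== "") text.push(value);
    } else if (node.nodeType === 1 &&          // element node
               node.tagName !== "SCRIPT" && node.tagName !== "STYLE") {
      for (var i = 0; i < node.childNodes.length; ++i) gt(node.childNodes[i]);
    }
  }
  gt(startingPoint);
  return text;
}

// Illustrative stand-ins for DOM nodes, so the sketch runs outside a browser
function el(tagName, childNodes) {
  return { nodeType: 1, tagName: tagName, childNodes: childNodes };
}
function txt(value) {
  return { nodeType: 3, nodeValue: value };
}

// Roughly <body><h1>Hello!</h1> <script>var a;</script></body>
var fakeBody = el("BODY", [
  el("H1", [txt("Hello!")]),
  txt("\n  "),
  el("SCRIPT", [txt("var a;")])
]);
console.log(getCleanTextArray(fakeBody)); // logs an array containing only "Hello!"
```

In a browser the call would simply be getCleanTextArray(document.body).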

Answer 3

Answered by jinwei

    <html>
    <head>
            <title>Hello world!</title>
    </head>
    <body>
            <h1>Hello!</h1>
            <p>
                    <div>What are you doing?</div>
                    <div>Fine, 
                        <span> and you? </span>
                    </div>
            </p>
            <a href="http://google.com">Thank you!</a>
            <script type="text/javascript">
                function getLeafNodesOfHTMLTree(root) {
                    if (root.nodeType == 3) {
                        return [root];
                    } else {
                        var all = [];
                        for (var i = 0; i < root.childNodes.length; i++) {
                            var ret2 = getLeafNodesOfHTMLTree(root.childNodes[i]);
                            all = all.concat(ret2);
                        }
                        return all;
                    }
                }
                var allnodes = getLeafNodesOfHTMLTree(document.getElementsByTagName("html")[0]);
                console.log(allnodes);
                // In modern browsers that support Array.prototype.filter and map
                allnodes = allnodes.filter(function (node) {
                    return node && node.nodeValue && node.nodeValue.replace(/\s/g, '').length;
                });
                allnodes = allnodes.map(function (node) {
                    return node.nodeValue;
                });
                console.log(allnodes);
            </script>
    </body>
    </html>

Answer 4

Answered by Louis Ricci

Walk the DOM tree, collect all the text nodes, and read the nodeValue of each text node.

var result = [];
var itr = document.createTreeWalker(
    document.getElementsByTagName("html")[0],
    NodeFilter.SHOW_TEXT,
    null, // no filter
    false);
while(itr.nextNode()) {
    if(itr.currentNode.nodeValue != "")
        result.push(itr.currentNode.nodeValue);
}
alert(result);
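
The walker above also returns the text inside <script> and <style> elements, as well as whitespace-only nodes. One way to leave those out is to pass a filter as the third argument instead of null. The following is a sketch rather than part of the original answer: acceptVisibleText is a made-up name, and the two numeric constants mirror NodeFilter.FILTER_ACCEPT (1) and NodeFilter.FILTER_REJECT (2) so that the filter can be exercised outside a browser.

```javascript
var FILTER_ACCEPT = 1; // same value as NodeFilter.FILTER_ACCEPT
var FILTER_REJECT = 2; // same value as NodeFilter.FILTER_REJECT

// Hypothetical filter: reject text inside <script>/<style> and blank text.
function acceptVisibleText(node) {
  var parent = node.parentNode;
  var tag = parent && parent.tagName ? parent.tagName.toUpperCase() : "";
  if (tag === "SCRIPT" || tag === "STYLE") return FILTER_REJECT;
  return /\S/.test(node.nodeValue) ? FILTER_ACCEPT : FILTER_REJECT;
}

// Only runs in a browser, where document and NodeFilter exist
if (typeof document !== "undefined") {
  var texts = [];
  var walker = document.createTreeWalker(
    document.documentElement,
    NodeFilter.SHOW_TEXT,
    { acceptNode: acceptVisibleText });
  while (walker.nextNode()) {
    texts.push(walker.currentNode.nodeValue.trim());
  }
  console.log(texts);
}
```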

Alternate method: split the textContent of the <html> element on newlines.

var result = document.getElementsByTagName("html")[0].textContent.split("\n");
// Iterate backwards so that splice() does not skip the entry that shifts into index i
for (var i = result.length - 1; i >= 0; i--)
    if (result[i] == "")
        result.splice(i, 1);
alert(result);

Answer 5

Answered by Ilya Streltsyn

This seems to be a one-line solution:

document.body.innerHTML.replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g,'').split(/<[^>]*>/g)

This may fail if there are complicated scripts in the body, though. I know that parsing HTML with regular expressions is not a very clever idea, but for simple cases or for demo purposes it can still be suitable, can't it? :)
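
To see what the one-liner does, it can be applied to a plain string instead of document.body.innerHTML. In the sketch below, splitTextByTags is a hypothetical wrapper around the same two regular expressions: the replace call strips the leading and trailing tags and glues adjacent tags together, so that split then sees a single pseudo-tag between two pieces of text.

```javascript
// Hypothetical wrapper around the one-liner, taking the HTML as a string.
function splitTextByTags(html) {
  return html
    .replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g, "")
    .split(/<[^>]*>/g);
}

console.log(splitTextByTags("<h1>Hello!</h1><div>What are you doing?</div>"));
// -> ["Hello!", "What are you doing?"]

console.log(splitTextByTags("<p>a</p><script>var x = 1;</script>"));
// -> ["a", "var x = 1;"]
```

The second call shows that script contents come back like any other text, so this approach does not by itself exclude "script" and "style".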
