How to use a regular expression to extract JSON fields?

Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and keep the same license. StackOverflow original: http://stackoverflow.com/questions/14349889/

Date: 2020-09-03 18:43:08

Tags: regex, json, text, replace, editpad

Asked by James Cooper

Beginner RegExp question. I have lines of JSON in a text file, each with slightly different fields, but there are 3 fields I want to extract from each line if it has them, ignoring everything else. How would I use a regex (in EditPad or anywhere else) to do this?

Example:

"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24

I want to extract URL, TITLE, and TAGS.

Answered by FrankieTheKneeMan

/"(url|title|tags)":"((\\"|[^"])*)"/i

I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by /, which you probably won't have to put in EditPad) matches:

"

A literal ".

(url|title|tags)

Any of the three literal strings "url", "title" or "tags". In regular expressions, parentheses are by default used to create groups, and the pipe character is used to alternate, like a logical 'or'. To match these characters literally, you'd have to escape them.

":"

Another literal string.

(

The beginning of another group. (Group 2)

    (

Another group (3)

        \\"

The literal string \" - the backslash itself has to be escaped (written \\), because otherwise it would be interpreted as escaping the next character, and you never know what that'll do.

        |

or...

        [^"]

Any single character except a double quote. The brackets denote a character class (or set): a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.

    )

End of group 3...

    *

The asterisk causes the previous regular expression (in this case, group 3) to be repeated zero or more times, here causing the matcher to match anything that could be inside the double quotes of a JSON string.

)"

The end of group 2, and a literal ".

I've done a few non-obvious things here that may come in handy:

  1. Group 2, when dereferenced using backreferences, will be the actual string assigned to the field. This is useful when getting the actual value.
  2. The i at the end of the expression makes it case insensitive.
  3. Group 1 contains the name of the captured field.
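
As a rough illustration of how groups 1 and 2 can be read back, here is a minimal Ruby sketch. It assumes the sample data sits on a single line (shortened here) and uses String#scan to collect every match; the variable names are only illustrative. Note that the tags field is not picked up, because its value is an array and not a quoted string.

```ruby
# The string-only pattern from the answer above.
pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i

# Shortened version of the sample line from the question.
line = %q("url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing"],"index":2931)

# scan yields [group 1, group 2, group 3] per match; keep the field name
# (group 1) and its value (group 2), drop the last repeated character (group 3).
pairs = line.scan(pattern).map { |name, value, _last_char| [name, value] }

pairs.each { |name, value| puts "#{name}: #{value}" }
# url: http://www.netcharles.com/orwell/essays.htm
# title: Orwell Essays & Journalism Section - Charles' George Orwell Links
```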

EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.

Your new Regex is:

/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i

All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:

\[(S(,S)*)?\]

This matches a literal opening bracket (hence the backslashes), optionally followed by a comma-separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly described as 'making the previous expression optional', it can also be thought of as matching exactly 0 or 1 times.

With the same S notation, here's the whole dirty regular expression:

/"(url|title|tags)":(S|\[(S(,S)*)?\])/i

If it helps, here's a view of it in action.
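
The S notation maps fairly directly onto regex interpolation in Ruby. The sketch below is one possible rendering, not the answer's exact expression: it swaps the inner capturing groups for non-capturing ones ((?:...)) so that only the field name and the raw value are captured, and it assumes the sample data is on one line (shortened here). Because group 2 now includes the surrounding quotes or brackets, the raw value is decoded with JSON.parse, wrapped in an array so that bare strings parse on any version of the json library.

```ruby
require 'json'

# S: one double-quoted JSON string (non-capturing variant, an assumption).
s = /"(?:\\"|[^"])*"/

# "(url|title|tags)":(S|\[(S(,S)*)?\]) with S interpolated.
field = /"(url|title|tags)":(#{s}|\[(?:#{s}(?:,#{s})*)?\])/i

# Shortened version of the sample line from the question.
line = %q("url":"http://www.netcharles.com/orwell/essays.htm","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","essays"],"index":2931)

# Group 1 is the field name, group 2 the raw JSON text of its value.
fields = line.scan(field).map do |name, raw|
  [name.downcase, JSON.parse("[#{raw}]").first]
end.to_h

puts fields["tags"].inspect
# ["orwell", "writing", "essays"]
```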

Answered by creep3007

This question is a bit older, but I browsed a bit on my PC and found this expression. I posted it as a Gist; it could be useful to others.

EDIT:

# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10

(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.$]*)

# test document
[
  {
    "_id": "56af331efbeca6240c61b2ca",
    "index": 120000,
    "guid": "bedb2018-c017-429E-b520-696ea3666692",
    "isActive": false,
    "balance": ",202,350",
    "object": {
        "name": "am",
        "lastname": "lang"
    }
  }
]
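
For readers who want to try the expression from Ruby, here is a small sketch using the named groups key and value. The input string is a made-up one-line sample, not the test document above. One thing to be aware of: the value character class contains a comma, so an unquoted number followed by a comma is captured together with that comma.

```ruby
# The key/value expression from the answer, unchanged.
pair = /(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.$]*)/

# Hypothetical one-line sample.
text = '{"index": 120000, "name": "am", "isActive": false}'

# First match, read through the named groups.
first = pair.match(text)
puts first[:key]    # index
puts first[:value]  # 120000,   (trailing comma included, see note above)

# With named groups, scan returns one [key, value] array per match.
all_pairs = text.scan(pair)
puts all_pairs.inspect
# [["index", "120000,"], ["name", "am"], ["isActive", "false"]]
```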

Answered by Douglas G. Allen

Why does it have to be a Regular Expression object?

Here we can just use a Hash object first and then go search it.

mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}

The output of which would be

=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}

Not that I want to avoid using Regexp, but don't you think it would be easier to take it a step at a time until you're getting the data you want to search through further? Just my humble opinion.

mh.values_at(:url, :title, :tags)

The output:

["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
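
If the text file really holds one complete JSON object per line (an assumption; the sample in the question is shown without the surrounding braces), the same idea can be applied without typing the hash by hand. This sketch parses each line with the standard json library and keeps only the wanted fields that are present. The two input lines are invented for illustration.

```ruby
require 'json'

# Hypothetical input: one JSON object per line, with differing fields.
lines = [
  '{"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays","tags":["orwell","essays"],"index":2931}',
  '{"url":"http://example.com/","index":7}'
]

wanted = [:url, :title, :tags]

rows = lines.map do |line|
  record = JSON.parse(line, symbolize_names: true)
  # Keep a wanted field only if this line actually has it.
  record.select { |key, _value| wanted.include?(key) }
end

rows.each { |row| puts row.inspect }
```

Unlike values_at, which returns nil for missing keys, select simply leaves absent fields out of the result.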

Taking the pattern that FrankieTheKneeMan gave you:

pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i

we can search the mh hash by converting it to a JSON string.

require 'json'  # provides Hash#to_json

/#{pattern}/.match(mh.to_json)

The output:

=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">

Of course this is all done in Ruby, which is not one of your tags, but I hope it's relevant.

But oops! Looks like we can't do all three at once with that pattern, so I will do them one at a time.

pattern = /"(title)":"((\\"|[^"])*)"/i

/#{pattern}/.match(mh.to_json)

#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">

pattern = /"(tags)":"((\\"|[^"])*)"/i

/#{pattern}/.match(mh.to_json)

=> nil

Sorry about that last one. It will have to be handled differently.

Answered by mikewhit

I adapted regex to work with JSON in my own library. I've detailed the algorithm's behavior below.

First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:

"matched".search("ch") // yields 3

For a JSON string, this works exactly the same, unless you are searching explicitly for commas and curly brackets, in which case I'd recommend some prior transform of your JSON object before performing the regex (i.e. think :, {, }).
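
To make the "starts and lengths" step concrete, here is a small Ruby sketch (the snippet above is JavaScript; Ruby is used here only for illustration). The helper name match_spans is hypothetical and is not taken from the library.

```ruby
# Hypothetical helper: collect [start, length] for every match of pattern.
def match_spans(text, pattern)
  spans = []
  text.scan(pattern) do
    m = Regexp.last_match            # MatchData for the current match
    spans << [m.begin(0), m[0].length]
  end
  spans
end

puts match_spans("matched", /ch/).inspect
# [[3, 2]]

json_text = '{"name":"am","lastname":"lang"}'
key_spans = match_spans(json_text, /"(?:name|lastname)":/)
puts key_spans.inspect
# [[1, 7], [13, 11]]
```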

Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax while recursively going backwards from the match index. For instance, the pseudocode might look as follows:

find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with same name as theKey up to position of theKey, traverse the object until keys named theKey has been discovered theNumber times
return this object called parentChain
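
One way to read the pseudocode is sketched below in Ruby. This is an interpretation under stated assumptions, not the library's actual code: the helper names path_to_nth_key and locate are hypothetical, keys are assumed to contain no characters that JSON would escape, and the object is stringified compactly with JSON.generate so that key order in the text matches a depth-first walk of the object.

```ruby
require 'json'

# Hypothetical helper: depth-first walk in document order; returns the path
# (hash keys and array indexes) to the n-th occurrence (1-based) of key.
def path_to_nth_key(node, key, n, path = [], counter = [0])
  case node
  when Hash
    node.each do |k, v|
      if k == key
        counter[0] += 1
        return path + [k] if counter[0] == n
      end
      found = path_to_nth_key(v, key, n, path + [k], counter)
      return found if found
    end
  when Array
    node.each_with_index do |v, i|
      found = path_to_nth_key(v, key, n, path + [i], counter)
      return found if found
    end
  end
  nil
end

# Hypothetical helper: map the first regex match on the JSON text back to
# the owning key, its value, and the chain of parents above it.
def locate(doc, pattern)
  text = JSON.generate(doc)
  m = pattern.match(text)
  return nil unless m

  # All keys that appear before the match; the last one is "theKey".
  keys_before = text[0...m.begin(0)].scan(/"((?:\\.|[^"\\])*)"\s*:/).flatten
  the_key = keys_before.last
  return nil unless the_key

  the_number = keys_before.count(the_key)   # occurrences up to and including theKey
  path = path_to_nth_key(doc, the_key, the_number)
  { key: the_key, value: doc.dig(*path), parent_chain: path[0...-1] }
end

doc = { "a" => { "name" => "x" }, "b" => { "name" => "target" } }
hit = locate(doc, /target/)
puts hit.inspect
```

Here the match falls inside the second "name" value, so the sketch reports the key name with the parent chain ["b"].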

With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.

You can see the library and code I authored at http://json.spiritway.co/
