Original URL: http://stackoverflow.com/questions/26519455/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Error parsing JSON file with the jsonlite package
Asked by adolfohc
I'm trying to read a JSON file into R, but I get this error:
Error in parseJSON(txt) : parse error: trailing garbage
[ 33.816101, -117.979401 ] } { "a": "Mozilla\/4.0 (compatibl
(right here) ------^
I downloaded the file from http://1usagov.measuredvoice.com/ and unzipped it using 7zip, then ran the following code in R:
library(jsonlite)
jsonData <- fromJSON("usagov_bitly_data2013-05-17-1368832207")
I'm not sure why this error happens. I searched Google but found no information. Can someone help me? Is this a problem with the file or with my code?
Answer by hrbrmstr
ANOTHER UPDATE
You can use the ndjson package to process this ndjson/streaming JSON data. It's faster than jsonlite::stream_in() and always produces a completely "flat" data frame:
system.time(bitly01 <- ndjson::stream_in("usagov_bitly_data2013-05-17-1368832207.gz"))
## user system elapsed
## 0.146 0.004 0.154
system.time(bitly02 <- jsonlite::stream_in(file("usagov_bitly_data2013-05-17-1368832207.gz"), verbose=FALSE, pagesize=10000))
## user system elapsed
## 0.419 0.008 0.427
If we examine the resulting data frames, you'll see that ndjson expands ll into ll.0 and ll.1, whereas jsonlite gives you a list column that you have to deal with later.
ndjson:
dplyr::glimpse(bitly01)
## Observations: 3,959
## Variables: 19
## $ a <chr> "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 ...
## $ al <chr> "en-US", "en-us", "en-US,en;q=0.5", "en-US", "en", "en-US", "en-US,en;q=0.5", "en-us", "e...
## $ c <chr> "US", NA, "US", "US", NA, "US", "US", NA, "AU", NA, "US", "US", "US", "US", "US", "US", "...
## $ cy <chr> "Anaheim", NA, "Fort Huachuca", "Houston", NA, "Mishawaka", "Hammond", NA, "Sydney", NA, ...
## $ g <chr> "15r91", "ifIpBW", "10DaxOu", "TysVFU", "10IGW7m", "13GrCeP", "YmtpnZ", "13oM0hV", "15r91...
## $ gr <chr> "CA", NA, "AZ", "TX", NA, "IN", "WI", NA, "02", NA, "OH", "MD", "KY", "OR", "IL", "TX", "...
## $ h <chr> "10OBm3W", "ifIpBW", "10DaxOt", "TChsoQ", "10IGW7l", "13GrCeP", "YmtpnZ", "15PUeH0", "10O...
## $ hc <dbl> 1365701422, 1302189369, 1368814585, 1354719206, 1368738258, 1368130510, 1363711958, 13687...
## $ hh <chr> "j.mp", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "go...
## $ l <chr> "pontifier", "bitly", "jaxstrong", "o_5004fs3lvd", "peacecorps", "bitly", "bitly", "nasat...
## $ ll.0 <dbl> 33.8161, NA, 31.5273, 29.7633, NA, 41.6123, 45.0070, NA, -33.8615, NA, 39.5151, 39.1317, ...
## $ ll.1 <dbl> -117.9794, NA, -110.3607, -95.3633, NA, -86.1381, -92.4591, NA, 151.2055, NA, -84.3983, -...
## $ nk <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ r <chr> "direct", "http://www.usa.gov/", "http://www.facebook.com/l.php?u=http%3A%2F%2F1.usa.gov%...
## $ t <dbl> 1368832205, 1368832207, 1368832209, 1368832209, 1368832208, 1368832209, 1368832210, 13688...
## $ tz <chr> "America/Los_Angeles", "", "America/Phoenix", "America/Chicago", "", "America/Indianapoli...
## $ u <chr> "http://www.nsa.gov/", "http://answers.usa.gov/system/selfservice.controller?CONFIGURATIO...
## $ _heartbeat_ <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kw <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
jsonlite:
dplyr::glimpse(bitly02)
## Observations: 3,959
## Variables: 18
## $ a <chr> "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 ...
## $ c <chr> "US", NA, "US", "US", NA, "US", "US", NA, "AU", NA, "US", "US", "US", "US", "US", "US", "...
## $ nk <int> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ tz <chr> "America/Los_Angeles", "", "America/Phoenix", "America/Chicago", "", "America/Indianapoli...
## $ gr <chr> "CA", NA, "AZ", "TX", NA, "IN", "WI", NA, "02", NA, "OH", "MD", "KY", "OR", "IL", "TX", "...
## $ g <chr> "15r91", "ifIpBW", "10DaxOu", "TysVFU", "10IGW7m", "13GrCeP", "YmtpnZ", "13oM0hV", "15r91...
## $ h <chr> "10OBm3W", "ifIpBW", "10DaxOt", "TChsoQ", "10IGW7l", "13GrCeP", "YmtpnZ", "15PUeH0", "10O...
## $ l <chr> "pontifier", "bitly", "jaxstrong", "o_5004fs3lvd", "peacecorps", "bitly", "bitly", "nasat...
## $ al <chr> "en-US", "en-us", "en-US,en;q=0.5", "en-US", "en", "en-US", "en-US,en;q=0.5", "en-us", "e...
## $ hh <chr> "j.mp", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "go...
## $ r <chr> "direct", "http://www.usa.gov/", "http://www.facebook.com/l.php?u=http%3A%2F%2F1.usa.gov%...
## $ u <chr> "http://www.nsa.gov/", "http://answers.usa.gov/system/selfservice.controller?CONFIGURATIO...
## $ t <int> 1368832205, 1368832207, 1368832209, 1368832209, 1368832208, 1368832209, 1368832210, 13688...
## $ hc <int> 1365701422, 1302189369, 1368814585, 1354719206, 1368738258, 1368130510, 1363711958, 13687...
## $ cy <chr> "Anaheim", NA, "Fort Huachuca", "Houston", NA, "Mishawaka", "Hammond", NA, "Sydney", NA, ...
## $ ll <list> [<33.8161, -117.9794>, NULL, <31.5273, -110.3607>, <29.7633, -95.3633>, NULL, <41.6123, ...
## $ _heartbeat_ <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kw <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
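If you stay with jsonlite, the ll list column can be flattened by hand. The following is only a minimal base R sketch: the small data frame is a made-up stand-in for bitly02, and the helper name coord is hypothetical. It copies each latitude/longitude pair into two numeric columns and uses NA where the record has no coordinates.

```r
# Hypothetical stand-in for bitly02: a data frame with a list column `ll`
# in which some records have no coordinates (NULL).
df <- data.frame(cy = c("Anaheim", NA, "Houston"), stringsAsFactors = FALSE)
df$ll <- list(c(33.8161, -117.9794), NULL, c(29.7633, -95.3633))

# Helper (name is an assumption): return the i-th coordinate, or NA if absent.
coord <- function(x, i) if (length(x) >= i) x[[i]] else NA_real_

# Expand the list column into two plain numeric columns, mirroring the
# ll.0 / ll.1 naming that ndjson produces.
df$ll.0 <- vapply(df$ll, coord, numeric(1), i = 1)
df$ll.1 <- vapply(df$ll, coord, numeric(1), i = 2)

# Drop the original list column once it has been flattened.
df$ll <- NULL

str(df)
```

The same two vapply() calls should work on the real data frame as long as every non-empty ll entry holds the latitude first and the longitude second, which is what the output above suggests.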
UPDATE
The latest version of the jsonlite package supports streaming JSON (which is what this actually is). You can now read it with one line like so:
json_file <- stream_in(file("usagov_bitly_data2013-05-17-1368832207"))
See also Jeroen's answer below for stream-parsing it directly over http.
OLD ANSWER
It turns out this is a "pseudo-JSON" file. I come across these in many naive API systems I work with. Each line is valid JSON, but the individual objects aren't in a JSON array. You need to use readLines and then build your own valid JSON array from it and pass that into fromJSON:
library(jsonlite)
# read in individual JSON lines
json_file <- "usagov_bitly_data2013-05-17-1368832207"
# turn it into a proper array by separating each object with a "," and
# wrapping that up in an array with "[]"'s.
dat <- fromJSON(sprintf("[%s]", paste(readLines(json_file), collapse=",")))
dim(dat)
## [1] 3959 18
str(dat)
## 'data.frame': 3959 obs. of 18 variables:
## $ a : chr "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile "| __truncated__ "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4"| __truncated__ "Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0" "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; SGH-T889 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile S"| __truncated__ ...
## $ c : chr "US" NA "US" "US" ...
## $ nk : int 0 0 1 1 0 0 1 0 0 0 ...
## $ tz : chr "America/Los_Angeles" "" "America/Phoenix" "America/Chicago" ...
## $ gr : chr "CA" NA "AZ" "TX" ...
## $ g : chr "15r91" "ifIpBW" "10DaxOu" "TysVFU" ...
## $ h : chr "10OBm3W" "ifIpBW" "10DaxOt" "TChsoQ" ...
## $ l : chr "pontifier" "bitly" "jaxstrong" "o_5004fs3lvd" ...
## $ al : chr "en-US" "en-us" "en-US,en;q=0.5" "en-US" ...
## $ hh : chr "j.mp" "1.usa.gov" "1.usa.gov" "1.usa.gov" ...
## ... (goes on for a while, many columns)
I combined the readLines call with the paste/sprintf call because the object.size of the resulting (temporary) object is 2,025,656 bytes (~2MB) and I didn't feel like doing an rm on a separate temporary variable.
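To see why fromJSON rejects the raw file while the wrapped version parses, here is a small self-contained sketch. The two records are invented for illustration and are much shorter than the real bitly lines. It first feeds the newline-separated objects to fromJSON as they are, catching the error, and then applies the same paste/sprintf wrapping.

```r
library(jsonlite)

# Hypothetical sample: one complete JSON object per line, as in the real file.
ndjson_lines <- c('{"c":"US","nk":0}', '{"c":"AU","nk":1}')

# Passing the objects back to back is not a single JSON document, so the
# parser stops after the first object; capture the message instead of failing.
raw_text <- paste(ndjson_lines, collapse = "\n")
res <- tryCatch(fromJSON(raw_text), error = function(e) conditionMessage(e))
print(res)

# Joining the objects with "," and wrapping them in "[]" yields one valid
# JSON array, which fromJSON simplifies into a data frame.
json_array <- sprintf("[%s]", paste(ndjson_lines, collapse = ","))
dat <- fromJSON(json_array)
print(dat)
```

The wrapping trick holds the whole file in memory as a single string, which is fine for a file of this size but is one reason the streaming readers are a better fit for larger dumps.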
Answer by Jeroen
This format is called ndjson and is designed for streaming import (including gzip). Just use this:
con <- url("http://1usagov.measuredvoice.com/bitly_archive/usagov_bitly_data2013-05-17-1368832207.gz")
mydata <- jsonlite::stream_in(gzcon(con))
Alternatively, use the curl package for better performance or to customize the HTTP request:
library(curl)
con <- curl("http://1usagov.measuredvoice.com/bitly_archive/usagov_bitly_data2013-05-17-1368832207.gz")
mydata <- jsonlite::stream_in(gzcon(con))
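If the URL is unavailable, the gzip streaming path can still be tried locally. This is a sketch under assumptions: the records and the temporary file are made up, and gzfile() is swapped in for gzcon(url(...)) because the data sits on disk rather than behind HTTP.

```r
library(jsonlite)

# Hypothetical sample records standing in for the real bitly data.
ndjson_lines <- c(
  '{"c":"US","tz":"America/Chicago","nk":1}',
  '{"c":"AU","tz":"Australia/Sydney","nk":0}'
)

# Write them to a temporary gzip-compressed file, one JSON object per line.
gz_path <- tempfile(fileext = ".gz")
out <- gzfile(gz_path, "w")
writeLines(ndjson_lines, out)
close(out)

# gzfile() decompresses while reading; stream_in() parses line by line and
# combines the records into a single data frame.
dat <- stream_in(gzfile(gz_path), verbose = FALSE)
unlink(gz_path)

print(dat)
```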
Answer by Gil Hornung
The tidyjson package can also read this "json lines" format:
read_json("my.json",format="jsonl")
The output is then parsed using a series of pipes, rather than leaving you with lists nested inside data frames.

