Windows: cannot read a Unicode .csv into R

Note: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/16838613/

Time: 2020-09-09 10:33:57  Source: igfitidea

Cannot read unicode .csv into R

Tags: windows, r, csv, utf-8, read.table

Asked by Ando Khachatryan

I have a .csv file, which contains the following data:

"Ա","Բ"
1,10
2,20

I cannot read it into R so that the column names are displayed like they are in the file.

d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
head(d)

Produces the following:

> d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection './Data/1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on './Data/1.csv'
> head(d)
[1] X.
<0 rows> (or 0-length row.names)

Meanwhile, doing the same without specifying the fileEncoding produces this:

> d <- read.csv("./Data/1.csv")
> head(d)
  ?. ?2
1  1 10
2  2 20

When I run the "file" utility to find out the encoding of the file, it says it is UTF-8:

Data.csv: UTF-8 Unicode text, with CRLF line terminators
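
A rough way to repeat this check from inside R is to look at the raw bytes of the file. The sketch below is only an illustration: the temporary file is a hypothetical stand-in for ./Data/1.csv, built on the assumption that the header holds the two characters U+0531 and U+0532, and validUTF8() requires R 3.3.0 or later.

```r
# Hypothetical stand-in for ./Data/1.csv: the same table written as UTF-8 bytes
# with CRLF line endings (assumed content, not the asker's actual file).
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("\"\u0531\",\"\u0532\"\r\n1,10\r\n2,20\r\n"), path)

# Read the whole file as raw bytes so no decoding happens yet.
bytes <- readBin(path, what = "raw", n = file.size(path))

# A UTF-8 byte-order mark would be the three bytes EF BB BF at the start.
has_bom <- length(bytes) >= 3 &&
  identical(bytes[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))

# Check whether the byte sequence is well-formed UTF-8.
is_utf8 <- validUTF8(rawToChar(bytes))

print(bytes[1:8])
print(c(has_bom = has_bom, is_utf8 = is_utf8))
```

If the bytes are valid UTF-8, the problem is more likely to lie in how R decodes or displays the text than in the file itself.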

I am using RStudio, Windows 7, R version 2.15.2, 32-bit.

Thanks in advance.

Answered by puslet88

I wrote a longer answer on the same issue here: R on Windows: character encoding hell.

Quick answer: using the parameter encoding instead of fileEncoding should fix your first issue. You may not be able to read the characters in either the console or the table view in RStudio, but you will be able to use them in formulas.

d <- read.csv("./Data/1.csv", encoding="UTF-8")
head(d)
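
For context, the two arguments do different things according to the read.table documentation: fileEncoding asks R to re-encode the file into the native encoding while reading it, whereas encoding only marks the strings that were read as being in the given encoding. A minimal, self-contained sketch of the encoding variant follows. The temporary file and its contents are assumptions standing in for the real 1.csv, and check.names = FALSE is added here only so that the header is not altered by make.names().

```r
# Hypothetical stand-in for the real file: a small UTF-8 CSV whose header
# holds U+0531 and U+0532 (assumed content).
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("\"\u0531\",\"\u0532\"\n1,10\n2,20\n"), tmp)

# encoding = "UTF-8" marks the strings as UTF-8 instead of re-encoding them.
# check.names = FALSE keeps the header exactly as it appears in the file.
d <- read.csv(tmp, encoding = "UTF-8", check.names = FALSE)

# Inspect the declared encoding and the code points of the column names,
# which does not depend on how the console renders them.
print(Encoding(names(d)))
print(lapply(names(d), utf8ToInt))
print(d)
```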

Having saved your table into a UTF-8 file:

> test2 <- read.csv("test2.csv", header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", encoding = "UTF-8")
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'test2.csv'

This is how it looks in the console and in the RStudio viewer:

> test2
        V1       V2
1 <U+0531> <U+0532>
2        1       10
3        2       20

Importantly, however, you can still manipulate this within R. In my case, the character Ա typed into the script window has UTF-8 encoding, and a grep correctly finds it in your table.

> Encoding("Ա")
[1] "UTF-8"
> grep("Ա", as.character(test2[1,1]))
[1] 1
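
When the console cannot draw the characters, comparing code points is one way to confirm that the data are intact. The following is a small sketch rather than part of the original answer. It writes the characters as \u escapes so that it does not depend on the encoding of the script file, and the variable names are illustrative.

```r
# The first header character, written as an escape so the script stays ASCII.
x <- "\u0531"

# utf8ToInt() returns the Unicode code point of a single character.
code_point <- utf8ToInt(x)
label <- sprintf("U+%04X", code_point)

# fixed = TRUE does a literal match instead of treating the pattern as a regex.
hits <- grepl(x, c("\u0531", "\u0532"), fixed = TRUE)

print(code_point)
print(label)
print(hits)
```

The label matches the <U+0531> shown in the console output above, which suggests the value is stored correctly even though it is not rendered as a glyph.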

You may need to find suitable encoding variants that work on your settings, or possibly change them. Unfortunately I am not sure where it is done.

You might not be able to make it look pretty at every stage, but it is definitely possible to get it to work in a Windows 7 environment as well.

Answered by Statwonk

I tried two ways to replicate your problem.

I copied the characters above into RStudio, saved it to a csv with this code:

write.csv(c("Ա","Բ",
             1,10,
             2,20), "test.csv")

df <- read.csv("test.csv")

This worked fine.

Then I thought, well, maybe R is cheating when I save it to CSV with R? So I just pasted the characters into a text file and saved it as a CSV. This approach doesn't have problems either.

Here's my session info:

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
[4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
[7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] party_1.0-9       modeltools_0.2-21 strucchange_1.4-7 sandwich_2.2-10   zoo_1.7-10       
[6] GGally_0.4.4      reshape_0.8.4     plyr_1.8          ggplot2_0.9.3.1  

loaded via a namespace (and not attached):
[1] coin_1.0-23        colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
[5] gtable_0.1.2       labeling_0.2       lattice_0.20-23    MASS_7.3-29       
[9] munsell_0.4.2      mvtnorm_0.9-9995   proto_0.3-10       RColorBrewer_1.0-5
[13] reshape2_1.2.2     scales_0.2.3       splines_3.0.1      stringr_0.6.2 

Answered by Julien Colomb

I had the same problem and found out that the file was corrupted.

I opened the file with OpenOffice and saved it again using the "UTF8" character set (you need to tick the "Edit filter settings" box), then imported it with read.csv() (without the encoding or fileEncoding option), and it worked fine.
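
A similar re-save can be sketched in R itself with iconv(). This is not what the answer did. It assumes the problem was that the file was stored in a different encoding than expected, and the UTF-16LE source file below is a purely hypothetical example of such a file.

```r
# Hypothetical input: the table stored as UTF-16LE bytes instead of UTF-8.
src <- tempfile(fileext = ".csv")
utf16 <- iconv("\"\u0531\",\"\u0532\"\n1,10\n2,20\n",
               from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(utf16, src)

# Read the raw bytes and convert them to a UTF-8 string.
raw_in <- readBin(src, what = "raw", n = file.size(src))
txt <- iconv(list(raw_in), from = "UTF-16LE", to = "UTF-8")

# Write the converted text back out as UTF-8 bytes, then read it as usual.
dst <- tempfile(fileext = ".csv")
writeBin(charToRaw(txt), dst)
d <- read.csv(dst, encoding = "UTF-8", check.names = FALSE)
print(d)
```

The source encoding has to be known or guessed for this to work, which is the part OpenOffice handles interactively in its import dialog.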
