PHP: UTF-8 character encoding

Question

Asked by Daniel Clark

I am scraping a list of RSS feeds by using cURL, and then I am reading and parsing the RSS data with SimpleXML. The sorted data is then inserted into a mySQL database.

However, as you can see at http://dansays.co.uk/research/MNA/rss.php, I am having several issues with characters not displaying correctly.

Examples:

aGuitar Hero: Van Halena Trailer And Tracklist Available

NV 10/10/09 a“ Salt Lake City, UT 10/11/09 a“ Denver, CO 10/13/09 a“

I have tried using htmlentities and htmlspecialchars on the data before inserting it into the database, but that doesn't seem to resolve the issue.

How can I resolve this issue?

Thanks for any advice.

Updated

I've tried what Greg suggested, and the issue is still here...

Here is the code I used to do SET NAMES in PDO:

$dbh = new PDO($dbstring, $username, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->query('SET NAMES "utf8"');

I echoed the SimpleXML data before it is sorted and inserted into the database, and I now believe the problem has something to do with cURL...

Here is what I have for cURL:

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);

Issue Resolved

I had to set the content charset in the RSS/HTML page to "UTF-8" to resolve this issue. I guess this isn't a real fix as the char problems are still there in the raw data. Looking forward to proper support for it in PHP6!

Answer 1

Accepted answer by ianaré

Like all debugging, you start by isolating the problem:

"I am scraping a list of RSS feeds by using cURL," - look at the XML from the RSS feed that's giving the problem (there's more than one feed, so some feeds may be right, and the ones that are wrong may be wrong in different ways).

"and then I am reading and parsing the RSS data with SimpleXML." - print out the field that SimpleXML read out. Is it OK, or does a problem show up?

"The sorted data is then inserted into a mySQL database." - print out hex(field), length(field), and char_length(field) for the piece of data that's giving the problem.
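
The same kind of byte-level check can be done on the PHP side, before the data ever reaches MySQL. A minimal sketch, assuming the mbstring extension is available; describeBytes() is a hypothetical helper name, not something from the question:

```php
<?php
// Hypothetical helper: summarises a string roughly the way MySQL's HEX(),
// LENGTH() and CHAR_LENGTH() would, so each pipeline stage can be compared.
function describeBytes(string $value): array
{
    return [
        'hex'         => bin2hex($value),                    // raw bytes, like HEX(field)
        'byte_length' => strlen($value),                     // like LENGTH(field)
        'char_length' => mb_strlen($value, 'UTF-8'),         // like CHAR_LENGTH(field) on a UTF-8 column
        'valid_utf8'  => mb_check_encoding($value, 'UTF-8'), // false means the bytes are not UTF-8
    ];
}

// A correctly encoded left double quote (U+201C): three bytes, one character.
print_r(describeBytes("\xE2\x80\x9C"));
```

Calling it on the raw cURL output, on the value SimpleXML returns, and on the value read back from the database should show at which stage the bytes change.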

EDIT

Take the feed http://hangout.altsounds.com/external.php?type=RSS2, put it into the validator http://validator.w3.org/feed/. They're declaring their content type as iso-8859-1 but some of the actual content, such as the quotes, is in something like cp1252 - for example they're using the byte 0x93 to represent the left quote - http://www.fileformat.info/info/unicode/char/201C/charset_support.htm.

What's annoying about this is that this doesn't show up in some tools - Firefox seems to guess what's going on and show the quotes correctly, and more to the point, SimpleXML converts the 0x93 into utf8, so it comes out as 0xc293, which exacerbates the problem.
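
The byte-level difference is easy to reproduce. A small sketch, assuming the mbstring extension is available: decoding 0x93 as ISO-8859-1 yields the control character U+0093, while decoding it as Windows-1252 yields the intended left quote U+201C.

```php
<?php
$raw = "\x93"; // the byte the feed uses for a left double quote

// What a parser that trusts the ISO-8859-1 declaration produces.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// What the feed's author presumably meant.
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c293
echo bin2hex($asCp1252), "\n"; // e2809c
```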

EDIT 2

A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" with "Windows-1252" before passing it to SimpleXML. It won't work 100% because it turns out that some parts of the feed are in UTF-8.
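
A rough sketch of that relabelling, assuming the underlying libxml build can decode Windows-1252 (typical when it is compiled with iconv support). relabelLatin1AsCp1252() is a hypothetical helper name, and the sample feed string is made up for illustration:

```php
<?php
// Hypothetical helper: relabels a mislabelled feed so that libxml decodes
// it as Windows-1252 instead of ISO-8859-1.
function relabelLatin1AsCp1252(string $xml): string
{
    // Only touch the encoding attribute of the XML declaration at the very start.
    $fixed = preg_replace(
        '/^(<\?xml\b[^>]*?\bencoding\s*=\s*["\'])ISO-8859-1(["\'])/i',
        '${1}Windows-1252${2}',
        $xml,
        1
    );
    return $fixed ?? $xml; // preg_replace returns null on a regex error
}

// Made-up sample: declared as ISO-8859-1 but using cp1252 quote bytes.
$data  = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n<rss><title>\x93Quoted\x94</title></rss>";
$fixed = relabelLatin1AsCp1252($data);

$doc = new SimpleXMLElement($fixed, LIBXML_NOCDATA);
echo bin2hex((string) $doc->title), "\n"; // inspect the bytes SimpleXML hands back
```

Feeds that declare a different encoding, or no encoding at all, pass through unchanged.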

The general approach, assuming that you can't get everyone in the world to correct their feeds, is to isolate whatever workarounds you require to the interface with the external system that's emitting the malformed data, and to pass in pure clear utf8 to the hub of your system. Save a dated copy of the raw external feed so you can remember in future why the workaround was required, separate off and comment the code lines that implement the workaround so it's easy to get at and change if and when the external organisation corrects its feed (or breaks it in a different way), and check it again from time to time. Unfortunately instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution - the best you can do is isolate, document, and monitor.
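
One way to keep such a workaround isolated at the boundary is a single, clearly commented function that everything from the external feed passes through. This is only a sketch under stated assumptions: normalizeExternalText() is a hypothetical name, the mbstring extension is assumed, and Windows-1252 is assumed as the fallback because that is what this particular feed was found to contain.

```php
<?php
/**
 * WORKAROUND (hypothetical example): the external feed mixes Windows-1252
 * and UTF-8 bytes. Keep a dated copy of the raw feed next to this note and
 * remove the function once the feed is corrected.
 */
function normalizeExternalText(string $raw): string
{
    // Already valid UTF-8: pass through untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise assume Windows-1252, the encoding observed in the broken parts.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}

echo bin2hex(normalizeExternalText("\x93Hi\x94")), "\n"; // e2809c4869e2809d
```

The validity check is a heuristic: it works per string, so it should be applied to the smallest chunk that can be isolated. Running it over a whole document that mixes both encodings would re-encode the parts that were already UTF-8.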

Answer 2

Answered by Greg

Your page is being served as UTF-8 so I'd point my finger at the database.

Make sure the connection is in UTF-8 before any SELECTs or INSERTs - in MySQL:

SET NAMES "utf8"
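
With PDO, the connection charset can also be requested in the DSN instead of issuing SET NAMES as a separate query. A minimal sketch, assuming the pdo_mysql driver on PHP 5.3.6 or later; the host, database name, and credentials are placeholders, and buildMysqlDsn() is a hypothetical helper. utf8mb4 is used here because MySQL's utf8 charset stores at most three bytes per character.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that asks for a UTF-8 connection.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'feeds'); // placeholder host and database
echo $dsn, "\n"; // mysql:host=localhost;dbname=feeds;charset=utf8mb4

// Connecting needs a real server, so it is left commented out:
// $dbh = new PDO($dsn, $username, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
```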

Answer 3

Answered by ianaré

Just a quick note about CURLOPT_ENCODING: it sets the Accept-Encoding header, which is not at all the same thing as character encoding. Supported accept encodings are "identity", "deflate", and "gzip".
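
In other words, CURLOPT_ENCODING controls compression, not the charset. A sketch of a fetch function under that reading, assuming the curl extension is loaded; fetchFeed() is a hypothetical name. Note also that with CURLOPT_RETURNTRANSFER set to 0, as in the question, curl_exec() prints the body and returns true instead of returning the data.

```php
<?php
// Hypothetical helper: fetches a feed and returns its raw bytes together
// with the Content-Type the server reported.
function fetchFeed(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, false);        // keep response headers out of the body
    curl_setopt($ch, CURLOPT_ENCODING, '');         // empty string: accept every compression cURL supports

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    // The charset, if the server declares one, is part of the Content-Type header.
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return ['body' => $body, 'content_type' => $contentType];
}
```

The returned body is still in whatever character encoding the feed actually uses, so any conversion to UTF-8 has to happen as a separate step.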

Answer 4

Answered by Ionuț G. Stan

It may have to do with the XML prologue, which looks like this for that particular feed you linked to:

<?xml version="1.0" encoding="ISO-8859-1" ?>

As far as I know, libxml, on which SimpleXML is based, looks for this kind of thing. I'm not sure about XML files, but I'm sure that with HTML strings it looks for META elements that specify the charset.

Try stripping the XML prologue (I once solved a similar problem by stripping the HTML META tags), and don't forget to utf8_encode() the data before feeding it to SimpleXMLElement.
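
A sketch of that idea, assuming the mbstring extension; toUtf8WithoutProlog() is a hypothetical helper. utf8_encode() always treats its input as ISO-8859-1 and is deprecated as of PHP 8.2, so mb_convert_encoding() is used here instead, with Windows-1252 as the assumed source encoding because that is what the accepted answer found in this feed. With the declaration removed, the parser falls back to UTF-8.

```php
<?php
// Hypothetical helper: removes the XML declaration and converts the rest
// of the document to UTF-8 from an assumed source encoding.
function toUtf8WithoutProlog(string $xml, string $from = 'Windows-1252'): string
{
    // Drop a leading XML declaration so its encoding label cannot mislead the parser.
    $body = preg_replace('/^\s*<\?xml\b.*?\?>\s*/s', '', $xml, 1) ?? $xml;

    return mb_convert_encoding($body, 'UTF-8', $from);
}

// Made-up sample: declared as ISO-8859-1 but using cp1252 quote bytes.
$sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n<rss><title>\x93Hi\x94</title></rss>";
echo toUtf8WithoutProlog($sample), "\n";
```

This converts the whole document from one encoding, so it shares the limitation noted in the accepted answer: parts of the feed that are already UTF-8 would be converted a second time.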
