Original question: http://stackoverflow.com/questions/15092437/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
python encoding utf-8
Asked by vekah
I am writing some scripts in Python. I build a string and save it to a file. The string holds a lot of data taken from the directory tree and the filenames of a directory. According to convmv, my whole directory tree is in UTF-8.
I want to keep everything in UTF-8 because I will save it in MySQL afterwards. For now, in MySQL, which is in UTF-8, I have problems with some characters (like é or è; I'm French).
I want Python to always treat strings as UTF-8. I read some information on the internet and came up with the following.
My script begins like this:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    def createIndex():
        import codecs
        toUtf8=codecs.getencoder('UTF8')
        #lot of operations & building indexSTR the string who matter
        findex=open('config/index/music_vibration_'+date+'.index','a')
        findex.write(codecs.BOM_UTF8)
        findex.write(toUtf8(indexSTR)) #this bugs!
When I run it, I get this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)
Edit:
I can see that the accents are written correctly in my file. After creating this file, I read it and write its contents into MySQL.
But I have an encoding problem there, and I don't understand why.
My MySQL database is in utf8, or at least seems to be: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.
My function looks like this:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    def saveIndex(index,date):
        import MySQLdb as mdb
        import codecs

        sql = mdb.connect('localhost','admin','*******','music_vibration')
        sql.charset="utf8"
        findex=open('config/index/'+index,'r')
        lines=findex.readlines()
        for line in lines:
            if line.find('#artiste') != -1:
                artiste=line.split('[:::]')
                artiste=artiste[1].replace('\n','')

                c=sql.cursor()
                c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
                nbr=c.fetchone()
                if nbr[0]==0:
                    c=sql.cursor()
                    iArt+=1
                    c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8'))
The artist names, which display correctly in the file, are written incorrectly into the database. What is the problem?
Accepted answer by Martijn Pieters
You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode (implicitly, with the default ASCII codec) before it can encode it back to UTF-8. That is what is failing here:
    >>> data = u'\u00c3'            # Unicode data
    >>> data = data.encode('utf8')  # encoded to UTF-8
    >>> data
    '\xc3\x83'
    >>> data.encode('utf8')         # Try to *re*-encode it
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Just write your data directly to the file; there is no need to encode already-encoded data.
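The implicit step can be made explicit. The following is only a sketch with illustrative variable names (they are not from the question): it performs by hand the ASCII decode that Python 2 attempts behind the scenes when .encode() is called on a byte string, and contrasts it with decoding using the codec the bytes were actually written in.

```python
# UTF-8 bytes for the character U+00C3, the same value as in the session above.
encoded = u'\u00c3'.encode('utf8')

# Decoding as ASCII is what Python 2 tries implicitly before re-encoding a
# byte string; it fails on any byte above 0x7f.
try:
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec the bytes were written in works as expected.
decoded = encoded.decode('utf8')
```

If the data is already UTF-8 bytes, neither step is needed before writing it to a file opened with plain open().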
If you build up unicode values instead, you would indeed have to encode them before they can be written to a file. You'd want to use codecs.open() for that, which returns a file object that will encode unicode values to UTF-8 for you.
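As a minimal sketch of that approach, assuming the index is built as unicode text: the file path and the sample content below are made up for illustration and are not the asker's real data.

```python
import codecs
import os
import tempfile

# Hypothetical index content, built as unicode text rather than encoded bytes.
index_text = u'#artiste[:::]Zo\u00e9\n#artiste[:::]Ren\u00e9e\n'

# Hypothetical path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'example.index')

# codecs.open() returns a file object that encodes unicode to UTF-8 on write.
with codecs.open(path, 'a', 'utf8') as findex:
    findex.write(index_text)

# Reading the raw bytes back shows what actually ended up on disk.
with open(path, 'rb') as raw:
    raw_bytes = raw.read()
```

With this pattern the encoding happens in exactly one place, at the file boundary, and the rest of the script only handles unicode values.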
You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
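One reason to avoid the BOM is that it becomes part of the data for any reader that does not expect it. The sketch below (illustrative names only) shows that a plain UTF-8 decode keeps the BOM as a leading U+FEFF character, while Python's 'utf-8-sig' codec strips it:

```python
import codecs

text = u'\u00e9'  # the character é

# The same bytes the question's script produces: BOM first, then UTF-8 data.
with_bom = codecs.BOM_UTF8 + text.encode('utf8')

# A plain UTF-8 decode keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode('utf8')

# The 'utf-8-sig' codec strips a leading BOM when one is present.
stripped = with_bom.decode('utf-8-sig')
```

A script that later splits or compares the first line of such a file would see the extra character unless it decodes with 'utf-8-sig'.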
For your MySQL insert problem, you need to do two things:
1. Add charset='utf8' to your MySQLdb.connect() call.
2. Use unicode objects, not str objects, when querying or inserting, but use SQL parameters so the MySQL connector can do the right thing for you:

        artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

        c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

        # ...

        c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
It may actually work better if you use codecs.open() to decode the contents automatically instead:
    import codecs
    import MySQLdb as mdb

    sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
    artists_inserted = 0

    # codecs.open() decodes each line from UTF-8 to unicode while reading
    with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
        for line in findex:
            if u'#artiste' not in line:
                continue

            artiste=line.split(u'[:::]')[1].strip()

            cursor = sql.cursor()
            cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
            if not cursor.fetchone()[0]:
                cursor = sql.cursor()
                cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
                artists_inserted += 1
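The same read-decode-insert flow can be tried without a MySQL server. The sketch below is not the MySQLdb code itself: it swaps in an in-memory SQLite database from the standard library, so the placeholders are ? instead of %s, and the table layout and the sample index file are assumptions made for illustration.

```python
import codecs
import os
import sqlite3
import tempfile

# Hypothetical index file holding UTF-8 bytes, in the "[:::]" layout of the question.
path = os.path.join(tempfile.mkdtemp(), 'example.index')
with open(path, 'wb') as f:
    f.write(b'#artiste[:::]Zo\xc3\xa9\n#album[:::]ignored\n#artiste[:::]Zo\xc3\xa9\n')

# Stand-in for the MySQL connection (assumption): an in-memory SQLite database
# with a guessed table layout.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE artistes (id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)')

artists_inserted = 0
# codecs.open() decodes each line from UTF-8 to unicode while reading.
with codecs.open(path, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue
        artiste = line.split(u'[:::]')[1].strip()
        cursor = conn.cursor()
        # Values are passed as parameters, never concatenated into the SQL text.
        cursor.execute('SELECT COUNT(id) FROM artistes WHERE nom=?', (artiste,))
        if not cursor.fetchone()[0]:
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(?, 99, ?)',
                           (artiste, artiste + u'/'))
            artists_inserted += 1
conn.commit()

rows = conn.execute('SELECT nom, status, path FROM artistes').fetchall()
```

The second occurrence of the same artist is skipped by the COUNT check, and the accented name comes back from the database as the same unicode value that was read from the file.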
You may want to brush up on Unicode and UTF-8 and encodings. I can recommend the following articles:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Answer by Ev Haus
Unfortunately, the string.encode() method is not always reliable. Check out this thread for more information: What is the fool proof way to convert some string (utf-8 or else) to a simple ASCII string in python

