Correctly reading text from a Windows-1252 (cp1252) file in Python

Question

Asked by Krisjanis Zvaigzne

As the title suggests, my problem is correctly reading input from a Windows-1252 encoded file in Python and inserting that input into a SQLAlchemy/MySQL table.

The current system setup:
Windows 7 VM with "Roger Access Control System" which outputs the file;
Ubuntu 12.04 LTS VM with a shared-folder to the Windows system so I can access the file, using "Python 2.7.3".

Now to the actual problem. The input is a file in a VM shared folder, generated on the Windows 7 system by Roger Access Control System (see roger.pl for details). The file is called "PREvents.csv" and, as the name suggests, contains a ";"-separated list of data.

An example format of the data:

2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja;
2013-03-19;15:58:40;100;Jānis;Dumburs;1;Uznemums1;2;Izeja;

The 4th field contains the card owner's first name, the 5th contains the owner's last name, and the 6th contains the owner's assigned group.
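
The field layout described here can be sketched as a small parser. This is only an illustration written for Python 3: the `Event` tuple and the `parse_event` helper are hypothetical names that mirror the variables used later in the question, not part of the original script.

```python
from collections import namedtuple

# Hypothetical field names, chosen to match the variables used in the question.
Event = namedtuple(
    "Event",
    "date time userid firstname lastname groupid groupname typed pointname",
)

def parse_event(line):
    """Split one ';'-separated record into named fields."""
    parts = line.rstrip("\r\n").split(";")
    # Each record ends with a trailing ';', so the last item is an empty string.
    if parts and parts[-1] == "":
        parts.pop()
    return Event(*parts)

sample = "2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja;\r\n"
event = parse_event(sample)

# ascii() keeps the output independent of the terminal encoding.
print(ascii(event.firstname), event.lastname, event.groupname)
```

Naming the fields makes it harder to mix up positions than unpacking ten variables from `line.split(';')`.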

The issue is that any of these three fields can contain characters specific to the Latvian language. In the example file the word "Jānis" contains the letter "ā", which is code point 257 (U+0101) in Unicode.

As I'm used to, I open the file as such:

try:
    f = codecs.open(file, 'rb', 'cp1252')
except IOError:
    f = codecs.open(file, 'wb', 'cp1252')

So far, everything works: it opens the file, and I move on to iterate over each line (this is a continuously running script, so pardon the loop):

while True:
    line = f.readline()

    if not line:
        # Pause loop for 1 second
        time.sleep(1)
    else:
        # Split the line into list
        date, timed, userid, firstname, lastname, groupid, groupname, typed, pointname, empty = line.split(';')

And this is where the issues start. If I print repr(firstname), it prints u'J\xe2nis', which is, as far as I understand, not correct: \xe2 does not represent the Latvian character "ā".
Further down the loop, depending on the event type, I assign the variables to a SQLAlchemy object and insert or update:

if typed == '0':  # Entry type
    event = Events(
        period,
        fullname,
        userid,
        groupname,
        timestamp,
        0,
        0
    )
    session.add(event)
else:  # Exit type
    event = session.query(Events).filter(
        Events.period == period,
        Events.exit == 0,
        Events.userid == userid
    ).first()
    if event is not None:
        event.exit = timestamp
        event.spent = timestamp - event.entry

# Commit changes to database
session.commit()

In my search for answers I've found how to define the default encoding to use:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Which hasn't helped me in any way.

Basically, all of this means I cannot insert the owner's correct first and last name, or the owner's assigned group name, if they contain any Latvian-specific characters. For example:

Instead of the character "ā" it inserts "a"

I'd also like to add that I cannot change the encoding of the "PREvents.csv" file, and the "RACS" system does not support writing to UTF-8 or Unicode files: if you try either, the system inserts random symbols for the Latvian-specific characters.

Please let me know if any other information is needed; I'll gladly provide it :)

Any help would be highly appreciated.

Answer 1

Accepted answer by phihag

CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:

>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
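
A quick check of the claim above, written for Python 3 (where every str is Unicode). It decodes the same bytes with cp1252 and also, purely as a hypothesis that nothing in the question confirms, with cp1257 (Windows Baltic), a code page that maps byte 0xE2 to ā. The variable names are illustrative.

```python
raw = b"J\xe2nis"  # the bytes behind the first name, as stored in the file

as_cp1252 = raw.decode("cp1252")
# Assumption: cp1257 is only a guess at what RACS might really be writing.
as_cp1257 = raw.decode("cp1257")

# ascii() shows the code points without depending on the terminal encoding.
print(ascii(as_cp1252))
print(ascii(as_cp1257))

# cp1252 has no slot for U+0101 (ā), so encoding it fails.
try:
    "\u0101".encode("cp1252")
    cp1252_has_a_macron = True
except UnicodeEncodeError:
    cp1252_has_a_macron = False
print(cp1252_has_a_macron)
```

If the cp1257 result matches the names that should appear, opening the file with that encoding instead of cp1252 is worth trying; if not, the file most likely does contain â.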

Answer 2

Answer by djc

I think u'J\xe2nis' is correct, see:

>>> print u'J\xe2nis'.encode('utf-8')
Jânis

Are you getting actual errors from SQLAlchemy or in your application's output?

Answer 3

Answer by Ângelo Polotto

I had the same problem with some XML files. I solved it by reading the file with ANSI encoding (Windows-1252) and writing a file with UTF-8 encoding:

import os

# Directory containing this script
path = os.path.dirname(__file__)

file_name = 'my_input_file.xml'

if __name__ == "__main__":
    # Read the whole file as Windows-1252 (Python 3: open() accepts encoding)
    with open(os.path.join(path, file_name), 'r', encoding='cp1252') as f1:
        lines = f1.read()
    # Write the same text back out as UTF-8; "with" closes the file on error too
    with open(os.path.join(path, 'my_output_file.xml'), 'w', encoding='utf-8') as f2:
        f2.write(lines)
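
The same idea can be wrapped in a reusable function. This is a minimal Python 3 sketch, not part of the original answer: `transcode` and the temporary file names are hypothetical, and it copies line by line so that large files are not read into memory at once.

```python
import os
import tempfile

def transcode(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Copy a text file, re-encoding it line by line."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)

# Demonstration on a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.xml")
dst_path = os.path.join(workdir, "output.xml")

with open(src_path, "wb") as f:
    f.write(b"<name>J\xe2nis</name>\r\n")  # cp1252 bytes

transcode(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()
print(converted)
```

The source encoding is a parameter, so the same helper works if the input turns out to use a different Windows code page.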
