Original question: http://stackoverflow.com/questions/4278636/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Removing unknown characters from a text file
Asked by meepmeep
I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:
stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
Answered by Martin v. Löwis
You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (byte 0x1A, which Notepad++ shows as SUB) is treated as the end of the file. Specify 'rb' instead of 'r' in open().
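The change can be sketched as follows. This is a minimal sketch, assuming Python 3, where a file opened with 'rb' yields bytes, so the separator has to be the bytes literal b'\x00'. The helper name extract_numbers and the sample data are hypothetical and only there to make the example self-contained.

```python
import os
import tempfile


def extract_numbers(path):
    """Return the numeric fields of a NUL-separated file as stripped strings."""
    # 'rb' disables text-mode handling, so a Ctrl-Z (0x1A) byte is just data.
    with open(path, 'rb') as handle:
        data = handle.read()

    values = []
    for raw_value in data.split(b'\x00'):
        # Assumption: the numbers are plain ASCII; non-ASCII bytes are dropped.
        text = raw_value.decode('ascii', errors='ignore').strip()
        try:
            float(text)
        except ValueError:
            continue  # empty or non-numeric field
        values.append(text)
    return values


# Hypothetical sample: control characters up front, then NUL-separated numbers.
sample = b'\x04\x1a\x03\x001.5\x00\x00 42 \x00abc\x00-7e2\x00'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
numbers = extract_numbers(tmp.name)
os.remove(tmp.name)
print(numbers)
```

The leading EOT, SUB and ETX bytes end up in a field of their own, which fails the float() check and is skipped, just like the empty fields between consecutive NUL bytes.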
Answered by Chris Pfohl
I don't know for sure whether this will work, but you could try using the I/O functions in the codecs module:
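import codecs

# codecs.open takes the same path and mode arguments as open, followed by the encoding
inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
# iterate over the file object itself to get one line at a time
for line in inFile:
    do_stuff()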
You can treat inFile just like a normal file object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace: h=open(os.path.join(root, rawfile),'r')
with h=codecs.open(os.path.join(root, rawfile),'r', 'utf-8')
Answered by Kissaki
The file.read() function reads until EOF. Since you say it stops too early, you want to continue reading the file even after hitting an EOF. Make sure to stop once you have read the entire file: check the position in the file via file.tell() whenever you hit an EOF, and stop when the position reaches the file size (read the file size before you start reading).
As this is rather complex, you may want to use file.next and iterate over bytes.
To remove non-ASCII characters, you can either use a whitelist of specific characters or check each byte you read against a range you define. For example: is the byte between 0x30 and 0x39 (a digit)? Then keep it, save it somewhere, or add it to a string. See an ASCII table.
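The whitelist idea can be sketched like this. It is only an illustration, assuming Python 3 and data already read in binary mode; the names ALLOWED, SEPARATOR and keep_whitelisted are hypothetical, and the exact set of allowed characters is an assumption that should be adapted to the real data.

```python
# Bytes to keep: the digits 0x30-0x39 plus the other characters a float
# literal can contain. This particular set is an assumption.
ALLOWED = frozenset(b'0123456789+-.eE')
SEPARATOR = 0x00  # the NUL byte that separates fields in the question


def keep_whitelisted(data):
    """Drop every byte that is neither whitelisted nor the NUL separator."""
    # Iterating over a bytes object yields integers in Python 3.
    return bytes(b for b in data if b in ALLOWED or b == SEPARATOR)


# Hypothetical sample with control characters, a non-ASCII byte and spaces.
raw = b'\x04\x1a\x03' + b'12.5\x00\x00\xff3e4\x00 7 \x00'
cleaned = keep_whitelisted(raw)
fields = [f.decode('ascii') for f in cleaned.split(b'\x00') if f]
print(fields)
```

Keeping the separator in the output matters here: stripping the NUL bytes along with everything else would glue neighbouring numbers together.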