windows: Removing unknown characters from a text file

Question

Asked by meepmeep

I have a large number of files containing data I am trying to process using a Python script.

The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).

In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:

import os

stripped_data=[]
# PATH is the top-level directory that holds the data files
for root,dirs,files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName)= os.path.split(rawfile)
        (fileBaseName, fileExtension)=os.path.splitext(fileName)
        h=open(os.path.join(root, rawfile),'r')
        line=h.read()
        h.close()
        # fields are separated by NUL characters; keep only those that parse as numbers
        for raw_value in line.split('\x00'):
            try:
                test=float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass

However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.

How can I remove all non-ASCII characters from these files prior to processing?

Answer 1

Answered by Martin v. Löwis

You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (\x1a, which Notepad++ displays as SUB) is treated as an end-of-file marker, so the read stops there. Specify 'rb' instead of 'r' in open().
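
As a minimal sketch of that suggestion, assuming Python 3 and assuming the numbers themselves are plain ASCII, the read-and-split step from the question could look roughly like this in binary mode. The function name read_numeric_fields is hypothetical and not part of the original script:

```python
def read_numeric_fields(path):
    """Return the NUL-separated fields of a file that parse as floats.

    Hypothetical helper: the name and the ASCII assumption are
    illustrative, not taken from the original script.
    """
    # 'rb' turns off text-mode handling, so a Ctrl-Z byte (0x1A)
    # is read as ordinary data instead of ending the read early.
    with open(path, 'rb') as handle:
        raw = handle.read()

    values = []
    for field in raw.split(b'\x00'):
        # Drop any bytes outside the ASCII range, then trim whitespace.
        text = field.decode('ascii', errors='ignore').strip()
        try:
            float(text)
        except ValueError:
            continue  # empty field or not a number
        values.append(text)
    return values
```

Inside the os.walk loop, calling read_numeric_fields(os.path.join(root, rawfile)) and extending stripped_data with the result would stand in for the open/read/split lines.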

Answer 2

Answered by Chris Pfohl

I don't know if this will work for sure, but you could try using the IO methods in the codecs module:

import codecs

inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
# iterate over the file line by line
for line in inFile:
    do_stuff(line)  # placeholder for your own processing

You can treat inFile just like a normal file object.

This may or may not help you, but it probably will.

[EDIT]

Basically you'll replace h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r', 'utf-8')

Answer 3

Answered by Kissaki

The file.read() function reads until EOF. Since you say it stops too early, you want to keep reading the file even after hitting an EOF, and stop only once you have read the entire file. You can do this by checking the position in the file via file.tell() when you hit an EOF, and stopping once that position reaches the file size (read the file size before you start reading).

As this is rather complex, you may want to use file.next and iterate over bytes.

To remove non-ASCII characters, you can either use a whitelist of specific characters or check each byte you read against a range you define. For example, if the byte is between 0x30 and 0x39 (a digit), keep it: save it somewhere or append it to a string. See an ASCII table.
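
The whitelist idea can be sketched as follows, assuming Python 3 and data that was read in binary mode. The names ALLOWED and keep_allowed_bytes are hypothetical, and the exact set of permitted bytes is an assumption you would adjust to match your data:

```python
# Hypothetical whitelist: bytes that can appear in a decimal number,
# plus the NUL byte that separates the fields.
ALLOWED = frozenset(b'0123456789+-.eE\x00')


def keep_allowed_bytes(data, allowed=ALLOWED):
    """Return a copy of data that contains only whitelisted bytes."""
    # Iterating over a bytes object yields integers in Python 3,
    # which is why the whitelist is a set of integer byte values.
    return bytes(b for b in data if b in allowed)
```

For example, keep_allowed_bytes(b'\x04\x1a\x0312.5\x00 7\x00') drops the EOT, SUB and ETX bytes and the space, leaving b'12.5\x007\x00', which can then be split on b'\x00' as before.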
