How to convert LF to CRLF in Python?

Question

Asked by Rushy Panchal

I found a list of the majority of English words online, but the line breaks are of unix-style (encoded in Unicode: UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html

How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.

This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard

It should be:

bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard

How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).

I don't know where to start because I've never worked with unicode. Thanks in advance!

Using rU as the parameter (as suggested), with this in my code:

with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()

I get this error:

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>

Can anyone help me with this?

Answer 1

Accepted answer by NPE

Instead of converting, you should be able to just open the file using Python's universal newline support:

f = open('words.txt', 'rU')

(Note the U.)
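
As a side note for Python 3 readers: universal newlines are already the default in text mode there, and the 'U' flag was deprecated and later removed (in Python 3.11). A minimal sketch of the same idea without 'U' follows; the helper name read_words and the demo file are hypothetical, and UTF-8 is assumed because the question says the files use it.

```python
import os
import tempfile

def read_words(path, encoding="utf-8"):
    """Return one word per line, whatever line-ending style the file uses."""
    # newline=None (the default) enables universal newlines in text mode:
    # '\n', '\r\n' and '\r' are all translated to '\n' while reading.
    with open(path, encoding=encoding, newline=None) as f:
        return [line.rstrip("\n") for line in f]

# Demo with a throwaway file that mixes LF and CRLF endings.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "words.txt")
    with open(demo_path, "wb") as f:
        f.write(b"backbite\nbackbiter\r\nbackboard\n")
    words = read_words(demo_path)

print(words)  # ['backbite', 'backbiter', 'backboard']
```

Passing the encoding explicitly keeps the result independent of the platform default.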

Answer 2

Answered by dugres

You can use the replace method of strings, like this:

txt.replace('\n', '\r\n')

EDIT: in your case:

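# Assumes the files are UTF-8, as stated in the question.
with open('input.txt', encoding='utf-8') as inp, \
     open('output.txt', 'w', encoding='utf-8', newline='') as out:
    txt = inp.read()  # text mode reads every line ending as '\n'
    txt = txt.replace('\n', '\r\n')
    # newline='' writes '\r\n' unchanged; the default would emit '\r\r\n' on Windows
    out.write(txt)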
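
If the goal really is to rewrite the files, one alternative is to work on raw bytes, which sidesteps the encoding question entirely: in UTF-8 the bytes 0x0A and 0x0D only ever represent LF and CR. The sketch below is an illustration under that assumption; the function names lf_to_crlf and convert_file are made up for this example and are not part of any library.

```python
import re
from pathlib import Path

def lf_to_crlf(data: bytes) -> bytes:
    """Convert bare LF to CRLF, leaving existing CRLF pairs untouched."""
    # The negative lookbehind skips any '\n' that already follows a '\r',
    # so running the conversion twice does not produce '\r\r\n'.
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)

def convert_file(src, dst):
    """Read src as bytes, normalise line endings to CRLF, write to dst."""
    Path(dst).write_bytes(lf_to_crlf(Path(src).read_bytes()))

print(lf_to_crlf(b"bit\nbackbite\r\nbackbiter\n"))
# b'bit\r\nbackbite\r\nbackbiter\r\n'
```

For the 26 files mentioned in the question, convert_file could be called once per file in a loop; at roughly 80,000 words in total, reading each file fully into memory should not be a concern.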

Answer 3

Answered by Eric Rahmig

You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use Python's universal newlines mode.

The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but open() was not given an encoding, so Python falls back to the platform default (cp1252 on your machine, as the traceback shows) to convert the bytes read from the file into Python 3 strings (i.e. sequences of Unicode code points). Some bytes in those files cannot be decoded with the cp1252 encoding, and that raises the UnicodeDecodeError while the file is being iterated, before str(line) is ever reached.

If you pass encoding='utf-8' to open() you should no longer get the UnicodeDecodeError. The str(line) call is then unnecessary, because in Python 3 every line read in text mode is already a str (and a str has no decode method). Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
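
The encoding mismatch can be reproduced in isolation. The sketch below is only an illustration: the character "č" is a hypothetical sample chosen because its UTF-8 encoding contains the byte 0x8d from the traceback, not something known to be in the word list.

```python
# "č" (U+010D) encodes to the two bytes 0xC4 0x8D in UTF-8.
raw = "č\n".encode("utf-8")

try:
    raw.decode("cp1252")      # 0x8d has no mapping in cp1252
    cp1252_ok = True
except UnicodeDecodeError as exc:
    cp1252_ok = False
    print("cp1252 failed on byte", hex(raw[exc.start]))

# Decoding with the encoding the data was written in succeeds.
text = raw.decode("utf-8")
print(repr(text))
```

Opening a file with open(path, encoding='utf-8') applies the same UTF-8 decoding to every chunk read from disk.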

Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.

Answer 4

Answered by Joab Leite

You can use the cereja package:

pip install cereja

import cereja
cereja.lf_to_crlf(dir_or_file_path)

or

cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])

You can substitute any other line-ending standard. See the filetools module.
