How do I get rid of the b prefix on a string in Python?

Question

Asked by Stan Shunpike

Many of the tweets I am importing have this issue, where they read:

b'I posted a new photo to Facebook'

I gather the b indicates it is a bytes object. But this is proving problematic because in the CSV files I end up writing, the b doesn't go away and interferes with later code.

Is there a simple way to remove this b prefix from my lines of text?

Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.

Here's the link content I'm analyzing:

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link'

Code Attempt

outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)

Error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>

Answer 1

Answered by hiro protagonist

You need to decode the bytes if you want a string:

b = b'1234'
print(b.decode('utf-8'))  # '1234'
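Applied to a whole list of tweets, the same idea might look like the sketch below. The sample data and the helper name to_text are made up for illustration; the helper leaves values that are already str untouched, so it is safe to call on mixed input.

```python
# Hypothetical sample data standing in for tweet text that arrived as bytes.
raw_tweets = [
    b'I posted a new photo to Facebook',
    'caf\u00e9 time'.encode('utf-8'),
    'already a str',
]


def to_text(value, encoding='utf-8'):
    """Decode bytes with the given encoding; return str values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


clean_tweets = [to_text(t) for t in raw_tweets]
print(clean_tweets)
```

Every element of clean_tweets is a str, so no b prefix appears when the values are printed or written out.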

Answer 2

Answered by Jonathan Komar

It is just letting you know that the object you are printing is not a string but a bytes object, displayed as a bytes literal. People explain this in incomplete ways, so here is my take.

Consider creating a bytes object by typing a bytes literal (that is, defining a bytes object directly in the source, e.g. by typing b'') and converting it into a string object by decoding it as UTF-8. (Note that converting here means decoding.)

byte_object= b"test" # byte object by literally typing characters
print(byte_object) # Prints b'test'
print(byte_object.decode('utf8')) # Prints "test" without quotations

As you can see, we simply call the .decode('utf8') method.
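To make the difference between the two types more concrete, here is a small sketch using a made-up sample word. It only uses built-in str and bytes methods; the choice of "naïve" is just an example of text containing one non-ASCII character.

```python
text = "na\u00efve"               # str: a sequence of Unicode code points
data = text.encode("utf-8")       # bytes: a sequence of integers in 0..255

print(type(text).__name__, len(text))   # str 5
print(type(data).__name__, len(data))   # bytes 6
print(data)                             # b'na\xc3\xafve'
print(data.decode("utf-8") == text)     # True

# Decoding with a codec that cannot represent the bytes raises an error
# unless an error handler such as errors="replace" is given.
try:
    data.decode("ascii")
except UnicodeDecodeError:
    print("ascii cannot decode these bytes")
print(data.decode("ascii", errors="replace"))
```

The one non-ASCII character takes two bytes in UTF-8, which is why the lengths differ.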

Bytes in Python

https://docs.python.org/3.3/library/stdtypes.html#bytes

String literals are described by the following lexical definitions:

https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>

Answer 3

Answered by salmanwahed

You need to decode it to convert it to a string. See the answer here about bytes literals in Python 3.

In [1]: b'I posted a new photo to Facebook'.decode('utf-8')
Out[1]: 'I posted a new photo to Facebook'

Answer 4

Answered by Avinash Chougule

How to remove the b' ' characters from a decoded string in Python:

import base64
a='cm9vdA=='
b=base64.b64decode(a).decode('utf-8')
print(b)

Answer 5

Answered by Fernando D Jaime

On Python 3.6 with Django 2.0, calling decode on a bytes literal does not work as expected. I get the right result when I print it, but the b'value' is still there even though it prints correctly.

This is what I'm encoding:

'uid': urlsafe_base64_encode(force_bytes(user.pk)),

This is what I'm decoding:

uid = force_text(urlsafe_base64_decode(uidb64))

This is what the Django 2.0 documentation says:

urlsafe_base64_encode(s)[source]

Encodes a bytestring in base64 for use in URLs, stripping any trailing equal signs.

urlsafe_base64_decode(s)[source]

Decodes a base64 encoded string, adding back any trailing equal signs that might have been stripped.

This is my account_activation_email_test.html file:

{% autoescape off %}
Hi {{ user.username }},

Please click on the link below to confirm your registration:

http://{{ domain }}{% url 'accounts:activate' uidb64=uid token=token %}
{% endautoescape %}

This is my console response:

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: [email protected]
Date: Fri, 20 Apr 2018 06:26:46 -0000
Message-ID: <152420560682.16725.4597194169307598579@Dash-U>

Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/b'MjU'/4vi-fasdtRf2db2989413ba/

As you can see, uid = b'MjU'

Expected: uid = MjU

Test in the console:

$ python
Python 3.6.4 (default, Apr  7 2018, 00:45:33) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.utils.http import urlsafe_base64_encode, urlsafe_base64_decode
>>> from django.utils.encoding import force_bytes, force_text
>>> var1=urlsafe_base64_encode(force_bytes(3))
>>> print(var1)
b'Mw'
>>> print(var1.decode())
Mw
>>>

After investigating, it seems to be related to Python 3. My workaround was quite simple:

'uid': user.pk,

I receive it as uidb64 in my activate function:

user = User.objects.get(pk=uidb64)

And voilà:

Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: [email protected]
Date: Fri, 20 Apr 2018 20:44:46 -0000
Message-ID: <152425708646.11228.13738465662759110946@Dash-U>


Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/45/4vi-3895fbb6b74016ad1882/

Now it works fine. :)
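For readers who would rather keep the base64 step, the b'MjU' in the URL can be reproduced and avoided with the standard library alone. The sketch below uses base64.urlsafe_b64encode as a stand-in for the Django helper (an assumption, based on the console session above showing that the helper returned bytes); the URL pattern and the primary key 25 are made up for illustration.

```python
import base64

user_pk = 25  # hypothetical primary key

# The encoder returns bytes; trailing "=" padding is stripped for use in URLs.
encoded = base64.urlsafe_b64encode(str(user_pk).encode("utf-8")).rstrip(b"=")

# Formatting bytes into a str embeds their repr, including the b'...' wrapper.
bad_url = "/activate/%s/" % encoded
# Decoding to str first gives the clean token.
good_url = "/activate/%s/" % encoded.decode("ascii")

print(bad_url)    # /activate/b'MjU'/
print(good_url)   # /activate/MjU/

# Reversing it: restore the padding, decode the base64, then parse the number.
token = encoded.decode("ascii")
padded = token + "=" * (-len(token) % 4)
recovered_pk = int(base64.urlsafe_b64decode(padded).decode("utf-8"))
print(recovered_pk)  # 25
```

In other words, the b'...' appears because a bytes object was formatted into text, so decoding it before it reaches the template is an alternative to dropping the encoding altogether.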

Answer 6

Answered by DevJoe

I got it working by encoding only the output as UTF-8. Here is a code example:

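import csv

# api, user, and file_name are defined elsewhere in the script
new_tweets = api.GetUserTimeline(screen_name=user, count=200)
result = new_tweets[0]
try:
    text = result.text
except AttributeError:
    text = ''

# Encode only at the output boundary by opening the file as UTF-8
with open(file_name, 'a', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([text])  # one row with a single column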

I.e., do not encode when collecting data from the API; encode only the output (print or write).
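The traceback in the question shows the file being encoded with cp1252, the default on that Windows setup, which is why characters outside that code page failed. A self-contained sketch of the "encode only on output" approach follows; the rows and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows of already-decoded text (str), including a character
# that cp1252 cannot represent.
rows = [["I posted a new photo to Facebook"], ["caf\u00e9 \u2615 time"]]
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# Passing encoding="utf-8" overrides the platform default; newline="" is what
# the csv module documentation recommends for files handed to csv.writer.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with the same encoding returns the original text.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)  # True
```

The text stays as str the whole way through, so there is never a bytes object whose b prefix could end up in the CSV.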

Answer 7

Answered by Joseph Boyd

Assuming you don't want to decode it as others are suggesting here, you can convert it to a string with str() and then just strip the leading b' and the trailing '. Note that any non-ASCII bytes remain as literal escape sequences in the result.

>>> x = "Hi there \ufffd".encode("utf-8")
>>> x
b'Hi there \xef\xbf\xbd'
>>> str(x)[2:-1]
'Hi there \\xef\\xbf\\xbd'

Answer 8

Answered by Kamol Roy

Although the question is very old, I think this may help anyone facing the same problem. Here the text is a string like the one below:

text= "b'I posted a new photo to Facebook'"

Thus you cannot remove the b by decoding it, because it is not a bytes object; it is a str whose content merely looks like a bytes literal. I did the following to remove it:

cleaned_text = text.split("b'", 1)[1][:-1]  # drop the leading b' and the trailing '

which gives "I posted a new photo to Facebook"
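If the stored text may also contain escaped non-ASCII bytes (for example \xc3\xa9), slicing leaves those escapes in place. One alternative, sketched below under the assumption that the strings really are the repr of a bytes object, is to parse them back with ast.literal_eval and then decode. The helper name unwrap_bytes_repr is made up for illustration.

```python
import ast


def unwrap_bytes_repr(s, encoding="utf-8"):
    """Turn a str that looks like "b'...'" back into decoded text.

    Strings that are not a valid bytes literal are returned unchanged.
    """
    if s[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(s)  # safely parses Python literals only
        except (ValueError, SyntaxError):
            return s
        if isinstance(value, bytes):
            return value.decode(encoding)
    return s


text = "b'I posted a new photo to Facebook'"
print(unwrap_bytes_repr(text))                # I posted a new photo to Facebook
print(unwrap_bytes_repr("b'caf\\xc3\\xa9'"))  # café
print(unwrap_bytes_repr("plain text"))        # plain text
```

ast.literal_eval only evaluates literals, so unlike eval it does not run arbitrary code found in the CSV.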
