Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/41918836/

Time: 2020-08-20 01:52:33  Source: igfitidea

How do I get rid of the b-prefix in a string in python?

python

Asked by Stan Shunpike

A bunch of the tweets I am importing have this issue, where they read:

b'I posted a new photo to Facebook'

I gather the b indicates it is a bytes object. But this is proving problematic because in the CSV files that I end up writing, the b doesn't go away and interferes with later code.

Is there a simple way to remove this b prefix from my lines of text?

Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.


Here's the link content I'm analyzing:

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link'

Code Attempt

outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)

Error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>

Answered by hiro protagonist

You need to decode the bytes if you want a string:

b = b'1234'
print(b.decode('utf-8'))  # prints 1234 (a str, shown without the b prefix)
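
As a minimal sketch of applying this to a whole batch, assuming the tweets arrive as a mix of bytes and str values (the sample data and the helper name to_text are made up for illustration), you could decode only the bytes items and leave real strings alone:

```python
# Hypothetical sample data: two valid UTF-8 byte strings, one with an invalid byte,
# and one value that is already a str.
raw_tweets = [
    b'I posted a new photo to Facebook',
    b'caf\xc3\xa9',
    b'bad byte \xff',
    'already text',
]


def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        # errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
        return value.decode(encoding, errors='replace')
    return value


texts = [to_text(t) for t in raw_tweets]
print(texts)
```

Whether errors='replace' is appropriate depends on your data; the default errors='strict' raises UnicodeDecodeError instead, which may be preferable if silent substitution would hide a problem.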

Answered by Jonathan Komar

The b prefix is just letting you know that the object you are printing is not a string but a bytes object, shown in the form of a bytes literal. People explain this in incomplete ways, so here is my take.

Consider creating a bytes object by typing a bytes literal (that is, defining the bytes object literally, e.g. by typing b'', rather than constructing it some other way) and converting it into a string object by decoding it as UTF-8. (Note that converting here means decoding.)

byte_object= b"test" # byte object by literally typing characters
print(byte_object) # Prints b'test'
print(byte_object.decode('utf8')) # Prints "test" without quotations

You see that we simply call the .decode('utf8') method.
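
To see where the b prefix actually comes from, the following sketch (variable names are illustrative) compares calling str() on a bytes object, which returns its printable representation including the prefix and quotes, with decoding it:

```python
byte_object = b"test"

# str() with no encoding argument gives the repr of the bytes object,
# so the prefix and the quotes become part of the resulting text.
as_repr = str(byte_object)

# Decoding interprets the bytes as UTF-8 and yields the text itself.
as_text = byte_object.decode("utf8")

# Passing an encoding to str() is equivalent to decoding.
also_text = str(byte_object, "utf8")

print(as_repr)
print(as_text)
print(type(byte_object).__name__, type(as_text).__name__)
```

This is typically how the prefix ends up in output files: somewhere a bytes object was turned into text with str() or string formatting rather than decoded.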

Bytes in Python

https://docs.python.org/3.3/library/stdtypes.html#bytes

String literals are described by the following lexical definitions:

https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>

Answered by salmanwahed

You need to decode it to convert it to a string. Check the answer here about bytes literals in Python 3.

In [1]: b'I posted a new photo to Facebook'.decode('utf-8')
Out[1]: 'I posted a new photo to Facebook'

Answered by Avinash Chougule

How to remove the b' ' characters from a base64-decoded value in Python:

import base64
a='cm9vdA=='
b=base64.b64decode(a).decode('utf-8')
print(b)

Answered by Fernando D Jaime

On Python 3.6 with Django 2.0, decoding a bytes literal does not work as expected. Yes, I get the right result when I print it, but the b'value' is still there even when it prints correctly.

This is what I'm encoding:

'uid': urlsafe_base64_encode(force_bytes(user.pk)),

This is what I'm decoding:

uid = force_text(urlsafe_base64_decode(uidb64))


This is what Django 2.0 says:

urlsafe_base64_encode(s)

Encodes a bytestring in base64 for use in URLs, stripping any trailing equal signs.

urlsafe_base64_decode(s)

Decodes a base64 encoded string, adding back any trailing equal signs that might have been stripped.



This is my account_activation_email_test.html file:

{% autoescape off %}
Hi {{ user.username }},

Please click on the link below to confirm your registration:

http://{{ domain }}{% url 'accounts:activate' uidb64=uid token=token %}
{% endautoescape %}


This is my console response:

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: [email protected]
Date: Fri, 20 Apr 2018 06:26:46 -0000
Message-ID: <152420560682.16725.4597194169307598579@Dash-U>

Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/b'MjU'/4vi-fasdtRf2db2989413ba/

As you can see, uid = b'MjU'

expected uid = MjU



Test in console:

$ python
Python 3.6.4 (default, Apr  7 2018, 00:45:33) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.utils.http import urlsafe_base64_encode, urlsafe_base64_decode
>>> from django.utils.encoding import force_bytes, force_text
>>> var1=urlsafe_base64_encode(force_bytes(3))
>>> print(var1)
b'Mw'
>>> print(var1.decode())
Mw
>>> 

After investigating, it seems to be related to Python 3. My workaround was quite simple:

'uid': user.pk,

I receive it as uidb64 in my activate function:

user = User.objects.get(pk=uidb64)

and voila:

Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: [email protected]
Date: Fri, 20 Apr 2018 20:44:46 -0000
Message-ID: <152425708646.11228.13738465662759110946@Dash-U>


Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/45/4vi-3895fbb6b74016ad1882/

now it works fine. :)
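
If you would rather keep the base64 step than pass the raw primary key, one option is to decode the bytes to str before the value reaches the template. The sketch below uses only the standard library base64 module rather than the Django helpers, so encode_uid and decode_uid are hypothetical stand-ins and not Django functions:

```python
import base64


def encode_uid(pk):
    """Return a URL-safe base64 token for pk as str (no b prefix, no padding)."""
    raw = str(pk).encode("utf-8")
    token = base64.urlsafe_b64encode(raw).rstrip(b"=")
    # The encoder returns bytes; decode so templates render plain text.
    return token.decode("ascii")


def decode_uid(token):
    """Reverse encode_uid: restore the stripped padding, then decode."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


uid = encode_uid(25)
print(uid)
print(decode_uid(uid))
```

In the Django code above, the equivalent would be calling .decode() on the result of urlsafe_base64_encode before putting it in the template context, as the console test with var1.decode() suggests.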

Answered by DevJoe

I got it done by encoding only the output, using utf-8. Here is the code example:

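import csv

# api, user and file_name are defined elsewhere in the original script.
new_tweets = api.GetUserTimeline(screen_name=user, count=200)
result = new_tweets[0]
try:
    text = result.text
except Exception:
    text = ''

# newline='' is what the csv module expects for files it writes.
with open(file_name, 'a', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([text])  # one row, one column; writerows(text) would emit one row per character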

In other words: do not encode when collecting data from the API; encode only the output (print or write).
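
A self-contained sketch of the same idea, assuming the tweets are already str values (the sample tweets and the temporary file path are invented for illustration): keep the text as str and let open() perform the UTF-8 encoding, which also avoids depending on the platform default codec such as cp1252 seen in the traceback in the question.

```python
import csv
import os
import tempfile

# Hypothetical sample data; the second tweet contains a character
# that the cp1252 codec cannot encode.
tweets = ["I posted a new photo to Facebook", "Sunrise \U0001F305 over the bay"]

path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# The strings stay as str; the file object encodes them on write.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([t] for t in tweets)  # one single-column row per tweet

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```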

Answered by Joseph Boyd

Assuming you don't want to immediately decode it again like others are suggesting here, you can convert it to a string with str() and then just strip the leading b' and the trailing '.

>>> x = "Hi there " 
>>> x = "Hi there ".encode("utf-8") 
>>> x
b"Hi there \xef\xbf\xbd"
>>> str(x)[2:-1]
"Hi there \xef\xbf\xbd"   

Answered by Kamol Roy

Although the question is very old, I think this may be helpful to anyone facing the same problem. Here the text is a string like the one below:

text= "b'I posted a new photo to Facebook'"

Thus you cannot remove the b by decoding it, because it's not a bytes object; it is a str that happens to start with b'. I did the following to remove it:

cleaned_text = text.split("b'", 1)[1][:-1]  # drop the leading b' and the trailing quote

which will give "I posted a new photo to Facebook"
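
If the text may also contain escape sequences such as \xc3\xa9, slicing leaves them in place as literal backslashes. A hedged alternative, sketched below with the hypothetical helper name unwrap_bytes_repr, is to parse the string back into a real bytes object with ast.literal_eval and then decode it:

```python
import ast

text = "b'I posted a new photo to Facebook'"


def unwrap_bytes_repr(value, encoding="utf-8"):
    """If value looks like the repr of a bytes object, return the decoded text.

    Anything that does not parse as a bytes literal is returned unchanged.
    """
    if value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value
        if isinstance(parsed, bytes):
            return parsed.decode(encoding)
    return value


print(unwrap_bytes_repr(text))
print(unwrap_bytes_repr("b'caf\\xc3\\xa9'"))
print(unwrap_bytes_repr("plain text"))
```

ast.literal_eval only evaluates Python literals, so it is a safer choice here than eval, but it is still better to avoid writing the repr of bytes into the file in the first place.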
