Convert JSON to CSV with pandas

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/50558077/



python, json, python-3.x, pandas, csv

Asked by farheen

I have a JSON file containing 46k+ tweets, in English as well as other languages, which I want to save as a CSV file. Below is a part of the JSON file.


    [{"user_id": 938118866135343104, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @PTISPOfficial: ??????? ????? ????? ?? ???? ??????? ??? ????? ?????  ???? ??? ???????? ?? ??????? ??? ??? ?? ??? ?????? ?? ??? ??????? ?…", "tweet_id": 976166125502427136}
{"user_id": 959235642, "date_time": "03/20/2018 18:38:35", "tweet_content": "At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. \n\n#PZvQG \n#ABC", "tweet_id": 976166125535973378}
{"user_id": 395163528, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali\nWatch ball by ball highlights at (link removed)\n\n#PZvQG #HBLPSL #PSL2018 @_crici…", "tweet_id": 976166126202839040}
{"user_id": 3117825702, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @JeremyMcLellan: Rumor has it Amir Liaquat isn't allowed to play in #PSL2018 because he keeps switching teams every week.", "tweet_id": 976166126483902466}
{"user_id": 3310967346, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…", "tweet_id": 976166126559354880}
{"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35", "tweet_content": "I wanted a super over\n#PZvQG", "tweet_id": 976166126836178944}
{"user_id": 347132028, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ??  #PZvQG", "tweet_id": 976166126685171713}
{"user_id": 3461853618, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @walterMiitty: It's harder than I thought to tell the truth\nIt's gonna leave you in pieces\nAll alone with your demons\nAnd I know that we…", "tweet_id": 976166126924201986}]

I followed this solution to convert it into CSV but got an invalid syntax error on an Urdu tweet. I also tried this:


    import json

    with open("PeshVsQuetta.json") as f:
        all_tweets = []
        for line in f:
            text_dict = json.loads(line)
            all_tweets.append(text_dict)

    print(all_tweets[0]['tweet_content'])

which gives me the following error.


    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 148: character maps to <undefined>
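The 'charmap' codec in this traceback is the Windows locale default, which open() falls back to when no encoding is passed. A minimal sketch of the same read with the encoding given explicitly, assuming the file is a valid UTF-8 JSON array:

    import json

    # Passing encoding='utf-8' avoids the locale-default 'charmap' codec on Windows.
    with open("PeshVsQuetta.json", encoding="utf-8") as f:
        all_tweets = json.load(f)  # parse the whole array at once

    print(all_tweets[0]['tweet_content'])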

I even saved the JSON file as a txt file and tried this:


    import pandas as pd
    from ast import literal_eval

    columns = ['Tweet ID', 'Author ID', 'Tweet', 'Time']
    df1 = pd.DataFrame(columns=columns)
    f = open('PeshvsQuetta.txt', encoding='utf-8')
    counter = 1
    for line in f:
        if counter != 1:
            s1 = literal_eval(line)
            ser = pd.Series([s1['tweet_id'], s1['user_id'], s1['tweet_content'], s1['date_time']],
                            index=['Tweet ID', 'Author ID', 'Tweet', 'Time'])
            df1 = df1.append(ser, ignore_index=True)
        counter = counter + 1
    df1.to_csv('PeshVsQuetta1.csv', encoding='utf-8', index=False, columns=columns)

But in the resulting CSV file each series is saved in a single cell, there are lots of empty rows, and some tweets are split across multiple rows. Below is a screenshot.


[Screenshot of the resulting CSV file]

Any help would be truly appreciated.


Answered by Martin Evans

You should just be able to use Pandas as follows:


import pandas as pd

with open('PeshVsQuetta.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input)

df.to_csv('PeshVsQuetta.csv', encoding='utf-8', index=False)

This assumes that your JSON file contains a BOM at the start. For the data you have given above, this produces the following CSV file:


date_time,tweet_content,tweet_id,user_id
2018-03-20 18:38:35,RT @PTISPOfficial: ??????? ????? ????? ?? ???? ??????? ??? ????? ?????  ???? ??? ???????? ?? ??????? ??? ??? ?? ??? ?????? ?? ??? ??????? ?…,976166125502427136,938118866135343104
2018-03-20 18:38:35,"At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. 

#PZvQG 
#ABC",976166125535973378,959235642
2018-03-20 18:38:35,"RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali
Watch ball by ball highlights at (link removed)

#PZvQG #HBLPSL #PSL2018 @_crici…",976166126202839040,395163528
2018-03-20 18:38:35,RT @JeremyMcLellan: Rumor has it Amir Liaquat isn't allowed to play in #PSL2018 because he keeps switching teams every week.,976166126483902466,3117825702
2018-03-20 18:38:35,RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…,976166126559354880,3310967346
2018-03-20 18:38:35,"I wanted a super over
#PZvQG",976166126836178944,701494826194354179
2018-03-20 18:38:35,RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ??  #PZvQG,976166126685171713,347132028
2018-03-20 18:38:35,"RT @walterMiitty: It's harder than I thought to tell the truth
It's gonna leave you in pieces
All alone with your demons
And I know that we…",976166126924201986,3461853618
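
If you are not sure whether the file really starts with a BOM, a quick check is to look at its first three raw bytes; utf-8-sig strips the marker when it is present and behaves like plain UTF-8 when it is not, so it is safe either way:

import codecs

# True only if the file begins with the UTF-8 byte-order mark b'\xef\xbb\xbf'
with open('PeshVsQuetta.json', 'rb') as f_raw:
    print(f_raw.read(3) == codecs.BOM_UTF8)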

Note: some of your fields contain newlines, so the output may look a bit odd. An application reading this will still handle it correctly, as long as you tell it the encoding is UTF-8 when importing it.

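For example, reading the CSV back with pandas and an explicit UTF-8 encoding gives one row per tweet, with the embedded newlines preserved inside the quoted fields:

import pandas as pd

df = pd.read_csv('PeshVsQuetta.csv', encoding='utf-8')
print(len(df))                     # 8 rows for the sample data above
print(df.loc[1, 'tweet_content'])  # the multi-line tweet comes back as a single value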