Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/42234523/

Clean up excel data with python using pandas package

Tags: python, python-3.x, pandas, numpy

Asked by Tommy

I have read an xls file into Python with pandas using pd.read_excel

I am trying to clean up my data, but I'm way out of my league.

There is a blank line between every record. In the example pic those are Excel rows 4, 9 and 11.

There is a comments column, designated in the example (see link) as "col_F". Every record has at least one cell that contains text. The person that created this xls file split longer comments into multiple cells.

I would like to concatenate all of the data in col_F for a particular record into one cell.

I will also trim out blank records once I figure out how to properly concatenate col_F.

I am using Python version 3.5.0, numpy 1.12.0 and pandas 0.19.2

Here is what I have so far:

import numpy as np
import pandas as pd

# read_excel already returns a DataFrame, so the pd.DataFrame() wrapper below is redundant
data = pd.read_excel("C:/blah/blahblah/file.xls", header=0, nrows=10000)
df = pd.DataFrame(data)

I appreciate any suggestion or insight!!

Thanks!

How the raw data looks: (screenshot of the raw data in Excel)

Update: this is how my data looks when I load it into Python: (screenshot of the raw data read into Python)

This is what my desired outcome would look like: (screenshot of what I would like the DataFrame to look like when finished)

Accepted answer by Tommy

Figured it out!!! Big shout out to the Norfolk Data Science Club for the helping hand.

Import the pandas, sqlite3 and os packages

import pandas as pd
import sqlite3
import os

Specify filepaths and filenames

filepath = "C:/blah/blahblah/randomfolder"
filename = "raw_data.xlsx"
filePathFileName = filepath + "/" + filename
outputPathFileName = filepath + "/cleaned_" + filename
outputFileName = "cleaned_" + filename

Use pandas to read in the Excel file.

df = pd.read_excel(filePathFileName, header=0, nrows=14)

remove blank rows

df.dropna(how='all', inplace=True)

fill in the gaps in our data (forward-fill each record's values down into its continuation rows)

df.ffill(inplace=True)

create an in-memory sqlite database and a connection to it

con = sqlite3.connect(":memory:")
con.isolation_level = None
cur = con.cursor()

create a table for our data in sqlite

df.to_sql('example_data', con)

SQL query to aggregate our data

df_pdsql = pd.read_sql_query("SELECT col_A, col_B, col_C, col_D, col_E, GROUP_CONCAT(col_F, ' ') AS col_F FROM example_data GROUP BY col_A", con)
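
As an aside, roughly the same aggregation can also be done in pandas alone, without the sqlite detour. A minimal sketch, assuming the forward-filled df from above (the variable name df_grouped is mine):

# pandas-only equivalent of the GROUP_CONCAT query: one row per record,
# with the col_F comment fragments joined by spaces
df_grouped = df.groupby('col_A', as_index=False).agg(
    {'col_B': 'first',
     'col_C': 'first',
     'col_D': 'first',
     'col_E': 'first',
     'col_F': lambda s: ' '.join(s.astype(str))})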

write our df to xlsx file

df_pdsql.to_excel(outputPathFileName, sheet_name='test', index=False)

let user know where file is located

print("Your new file is located in: " + outputPathFileName)

close connection to sqlite database

con.close()

Answered by titipata

This is a really crude solution that I wrote along the lines of what @Boud explained in the comments. I create example data here first:

import numpy as np
import pandas as pd

# example data; blank Excel cells come in as empty strings here
df = pd.DataFrame([
    ['record1', '10', 'date1', 'optional', 'comment'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', ''], 
    ['record2', '100', 'date2', '', 'comment'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', ''],
    ['record3', '10000', 'date3', '', 'comment']], 
    columns=['col_A', 'col_B', 'col_C', 'col_D', 'col_F'])
df.replace('', np.nan, regex=True, inplace=True)  # turn empty strings into NaN

Note that the blanks here should be filled with NaN instead of empty strings. Basically, you first drop the rows you don't need using dropna.

df.dropna(axis=0, how='all', inplace=True) # drop NaN by row

Then you can fill col_A from the previous records.

new_col = []
row_name = ''
for r in df.col_A:
    if not pd.isnull(r):
        row_name = r
    new_col.append(row_name)
df.col_A = new_col
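
For what it's worth, the loop above is essentially a forward fill on col_A, so the built-in method should give the same result (a one-line sketch):

df['col_A'] = df['col_A'].ffill()  # propagate each record name down its continuation rows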

After that, you can group by applying the groupby function to column A and aggregate the other columns by joining their strings as follows.

gr = df.groupby('col_A')

def join_str(x):
    x = list(map(str, x))
    x = [e for e in x if str(e) != 'nan']
    return ' '.join(x)

gr.agg({'col_B' : join_str, 
        'col_C': join_str, 
        'col_D': join_str, 
        'col_F': join_str}).reset_index()
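
If you want to keep the aggregated frame rather than just display it, assign the result; it can then be exported the same way as in the accepted answer (a small usage sketch, the output filename is just a placeholder):

cleaned = gr.agg({'col_B': join_str,
                  'col_C': join_str,
                  'col_D': join_str,
                  'col_F': join_str}).reset_index()
cleaned.to_excel('cleaned_example.xlsx', sheet_name='test', index=False)  # placeholder filename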