Parsing a Pandas DataFrame

Question

Asked by Bardiya Choupani

I have the following data in a single data frame, which I parsed from XML:

index                               xml_data    
0                                   \n      
1                               sessionKey  
2            JKX6G3_07092016_1476953673631  
3                                   \n      
4                                   Number  
5                                   JKX6G3  
6                                   \n      
7                            CreateDate 
8                            1468040400000  
9                                   \n      
10                              Id  
11                                83737626  
12                                       1  
13                                  \n      
14                             customerAge  
15                                      64  
16                                       1

I would like to turn every row that follows a "\n" into a column name, with the row after it as that column's value. For example:

sessionKey                     Number  CreateDate   Id        Age
JKX6G3_07092016_1476953673631  JKX6G3  1.46804E+12  83737626  64

Is there a more elegant way of doing this than looping with for row in doc_df.itertuples(): and parsing each row by hand?

Answer 1

Answered by Serenity

import pandas as pd
import numpy as np

# set dataframe
...

# collect column names: the row right after each '\n'
columns = []
count_n = 0
for i in range(0, len(df)-1):
    if (df.iloc[i]['xml_data'] == '\n'):
        columns.append(df.iloc[i+1]['xml_data'])
        count_n += 1

# generate new df    
new_df = pd.DataFrame(columns = columns, index = np.arange(count_n))
j = 0
count = 0
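# set values: the key follows each '\n' and its value follows the key
for i in range(0, len(df)-2):
    if (df.iloc[i]['xml_data'] == '\n'):
        # .at writes directly into new_df; chained iloc[j][col] assignment
        # may not write through in recent pandas versions
        new_df.at[j, df.iloc[i+1]['xml_data']] = df.iloc[i+2]['xml_data']
        count += 1
        # after one value per column, move on to the next output row
        if count == len(new_df):
            count = 0
            j += 1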

new_df.dropna(inplace=True)

print(new_df)

Result:

                      sessionKey  Number     CreateDate        Id customerAge
0  JKX6G3_07092016_1476953673631  JKX6G3  1468040400000  83737626          64
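
The frame construction is elided in the code above, so for experimenting it may help to have a stand-in. The following is a minimal sketch, not part of the original answer: it assumes every entry in xml_data is a string and that the separator is exactly "\n", rebuilds a frame like the one in the question, and then replaces the explicit loops with a Series.shift comparison (a swapped-in technique, named here plainly). The extra "1" rows after Id and customerAge are simply ignored, as in the loop version.

```python
import pandas as pd

# Hypothetical reconstruction of the frame from the question (all values as strings)
df = pd.DataFrame({"xml_data": [
    "\n", "sessionKey", "JKX6G3_07092016_1476953673631",
    "\n", "Number", "JKX6G3",
    "\n", "CreateDate", "1468040400000",
    "\n", "Id", "83737626", "1",
    "\n", "customerAge", "64", "1",
]})

s = df["xml_data"]

# A row is a key when the row just before it is the "\n" separator
is_key = s.shift(1).eq("\n")
# A row is a value when the row two positions before it is the separator
is_value = s.shift(2).eq("\n")

keys = s[is_key].to_numpy()
values = s[is_value].to_numpy()

# One output row, with one column per key
result = pd.DataFrame([values], columns=keys)
print(result)
```

This sketch assumes each separator is followed by at least two rows; a trailing "\n" would leave keys and values with different lengths.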

Answer 2

Answered by piRSquared

I'd find the positions of the "\n" entries, add 1 to locate the keys, and add 2 to locate the values. Then build an array and, from it, a dataframe.

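import numpy as np
import pandas as pd

v = df.xml_data.values
# row 0 of the broadcast sum holds key positions, row 1 holds value positions
a, b = np.where(v == '\n')[0][None, :] + [[1], [2]]
pd.DataFrame([v[b]], columns=v[a])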

                      sessionKey  Number     CreateDate        Id customerAge
0  JKX6G3_07092016_1476953673631  JKX6G3  1468040400000  83737626          64
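
The one-liner above leans on NumPy broadcasting, which can be hard to read at first. The sketch below unpacks just that step using NumPy alone. The array v is a hypothetical stand-in for df.xml_data.values, assuming all entries are strings, and the names sep, key_pos and value_pos are introduced here for illustration only.

```python
import numpy as np

# Hypothetical stand-in for df.xml_data.values from the question
v = np.array([
    "\n", "sessionKey", "JKX6G3_07092016_1476953673631",
    "\n", "Number", "JKX6G3",
    "\n", "CreateDate", "1468040400000",
    "\n", "Id", "83737626", "1",
    "\n", "customerAge", "64", "1",
], dtype=object)

# Positions of the "\n" separators as a 1-D integer array
sep = np.where(v == "\n")[0]

# Shape (1, n) plus shape (2, 1) broadcasts to shape (2, n):
# row 0 is separator position + 1 (keys), row 1 is position + 2 (values)
offsets = sep[None, :] + np.array([[1], [2]])
key_pos, value_pos = offsets

print(sep)
print(key_pos)
print(value_pos)
print(v[key_pos])
```

Indexing v with key_pos and value_pos then yields the column names and the single row of values that the answer passes to pd.DataFrame.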
