
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/35344282/

Date: 2020-09-14 00:41:08  Source: igfitidea

How to read the csv file properly if each row contains different number of fields (number quite big)?

Tags: python, csv, pandas

Asked by user5779223

I have a text file from amazon, containing the following info:

 #      user        item     time   rating     review text (the header is added by me for explanation, not in the text file)
  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant

As you see, the data is separated by spaces, and each row has a different number of columns. Moreover, the review text itself contains spaces. Here is the code I have tried:

pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part

And such an error occurs:

ValueError: Passed header names mismatches usecols

When I tried to read all the columns:

pd.read_csv(filename, sep = " ", header = None)

And the error this time is:

Error tokenizing data. C error: Expected 229 fields in line 3, saw 320

And given that the review text is so long in many rows, the method of adding header names for each column in this question cannot work.

I wonder how to read the csv file so that I can either keep the review text or skip it, respectively. Thank you in advance!

EDIT:

The problem has been solved by Martin Evans perfectly. But now I am playing with another data set with a similar but different format. Now the order of the data is reversed:

     # review text                          user        item     time   rating      (the header is added by me for explanation, not in the text file)
   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3     

Do you have any idea how to read it properly? Any help would be appreciated!

Answer by Martin Evans

As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', newline='') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1], restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)

This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant


If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:

import pandas as pd
import csv


rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv') as f_input:
    for row in f_input:
        cols = [col[::-1] for col in row.rstrip()[::-1].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123   
1      3  14423321  TGjsk123  hjf2329ccc   

                                    review  
0  I love this phone as it is easy to used  
1                          Suck restaurant  

row.rstrip()[::-1] strips the line ending and then reverses the text of the whole line. Each line is then split on spaces, and a list comprehension re-reverses each split entry. Finally, each row is built by taking the fixed four column entries (now at the start); the remaining entries are joined back together with a space and added as the final column.
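As a quick standalone illustration of the reversed-split trick, here is the same idea applied to one sample line in the edited format (the line below is made up for demonstration):

```python
line = "  I love this phone  isjiad123    TYh23hs9     13160032    5    \n"

# reverse the stripped line, split on single spaces, drop empties,
# then re-reverse each token back into readable text
cols = [col[::-1] for col in line.rstrip()[::-1].split(' ') if col]

fixed = cols[:4]                     # rating, time, item, user
review = ' '.join(cols[4:][::-1])    # remaining tokens restored to reading order

print(fixed)   # ['5', '13160032', 'TYh23hs9', 'isjiad123']
print(review)  # I love this phone
```

Because the fixed fields sit at the end of the line, reading from the back makes their positions predictable regardless of how long the review is.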

The benefit of this approach is that it does not rely on your input data being in an exactly fixed width format, and you don't have to worry if the column widths being used change over time.

Answer by chthonicdaemon

It looks like this is a fixed width file. Pandas supplies read_fwf for this exact purpose. The following code reads the file correctly for me. You may want to mess around with the widths a little if it doesn't work perfectly.

pandas.read_fwf('test.fwf', 
                 widths=[13, 12, 13, 5, 100], 
                 names=['user', 'item', 'time', 'rating', 'review'])
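Before pointing it at the real file, the widths can be sanity-checked on an in-memory sample; the sample text below mirrors the rows from the question:

```python
import io
import pandas as pd

sample = (
    "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n"
    "  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant\n"
)

# same widths as above, applied to a StringIO instead of a file on disk
df = pd.read_fwf(io.StringIO(sample),
                 widths=[13, 12, 13, 5, 100],
                 names=['user', 'item', 'time', 'rating', 'review'])
print(df[['user', 'item', 'rating']])
```

read_fwf strips the padding whitespace from each field, so the values come out clean and the numeric columns are parsed as integers.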

If the columns still line up with the edited version (where the review comes first), you just need to add the correct specification. A guide line like the following helps to do this quickly:

0        1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
  I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                          hjf2329ccc   TGjsk123     14423321    3     

So the new command becomes:

pandas.read_fwf('test.fwf', 
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]], 
                names=['review', 'user', 'item', 'time', 'rating'])

Answer by alexis

Since the first four (now last four) of the fields are never going to contain spaces or need to be surrounded by quotes, let's forget about the csv library and use python's awesome string handling directly. Here is a one-liner that splits each line into exactly five columns, courtesy of the maxsplit argument to rsplit():

with open("myfile.dat") as data:
    frame = pd.DataFrame(line.strip().rsplit(maxsplit=4) for line in data)

The above should solve your problem, but I prefer to unpack it into a generator function that is easier to understand, and can be extended if necessary:

def splitfields(data):
    """Generator that parses the data correctly into fields"""
    for line in data:
        fields = line.rsplit(maxsplit=4)
        fields[0] = fields[0].strip()   # trim line-initial spaces
        yield fields

with open("myfile.dat") as data:
    frame = pd.DataFrame(splitfields(data))

Both versions avoid having to build a large ordinary array in memory only to hand it over to the DataFrame constructor. As each line of input is read from the file, it is parsed and immediately added to the dataframe.

The above is for the format in the updated question, which has the free text on the left. (For the original format, use line.split instead of line.rsplit, and strip the last field, not the first.)

    I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3

There's more you could do depending on what the data actually looks like: If the fields are separated by exactly four spaces (as it seems from your example), you could split on "    " (four spaces) instead of splitting on all whitespace. That would also work correctly if some other fields can contain spaces. In general, pre-parsing like this is flexible and extensible; I leave the code simple since there's no evidence from your question that more is needed.

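A related option, sketched here under the assumption that fields are padded by at least two spaces while the review itself only ever contains single spaces, is to split on runs of spaces with a regular expression:

```python
import re

line = "  I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5"

# split on two-or-more spaces so single spaces inside the review survive
fields = [f for f in re.split(r' {2,}', line.strip()) if f]
print(fields)
# ['I love this phone as it is easy to used', 'isjiad123', 'TYh23hs9', '13160032', '5']
```

Unlike rsplit with maxsplit, this does not assume the free text sits at one end of the line, only that real field boundaries are wider than one space.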

Answer by atkat12

usecols refers to the names of the columns in the input file. If your file doesn't have columns named like that (user, item, rating), it won't know which columns you're referring to. Instead you should pass indices like usecols=[0,1,2].

Also, names refers to what you're calling the columns you import. So, I think you cannot have four names when importing 3 columns. Does this work?

pd.read_csv(filename, sep = " ", 
                      header = None, 
                      names = ["user","item","rating"], 
                      usecols = [0,1,2])

The tokenizing error looks like a problem with the delimiter. It may try to parse your review text column as many columns, because "I" "love" "this" ... are all separated by spaces. Hopefully if you're only reading the first three columns you can avoid throwing an error, but if not you could consider parsing row-by-row (for example, here: http://cmdlinetips.com/2011/08/three-ways-to-read-a-text-file-line-by-line-in-python/) and writing to a DataFrame from there.

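To see why a bare single-space separator struggles with this data, note how every extra space in the padding produces an empty field when splitting (a quick illustration of the delimiter problem described above):

```python
line = "  disjiad123    TYh23hs9     13160032    5"

# consecutive spaces yield empty strings between the real fields
parts = line.split(' ')
print(parts[:6])   # ['', '', 'disjiad123', '', '', '']
```

This is why usecols=[0,1,2] with sep=" " would pick up empty padding fields rather than the actual user, item, and time columns; sep="\s+" avoids the issue by treating any run of whitespace as one separator.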

Answer by jezrael

I think the best approach is using pandas read_csv:

import pandas as pd
import io

temp = """  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant so I love cooking pizza with onion ham garlic tomatoes """


#estimated max length of columns
N = 20

#after testing, replace io.StringIO(temp) with filename
df = pd.read_csv(io.StringIO(temp),
                 sep = r"\s+", #separator is arbitrary whitespace
                 header = None, #first row is not header, read all data to df
                 names=range(N))
print(df)
           0         1         2   3     4           5     6      7     8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone    as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant    so      I  love   

        9      10    11     12   13      14        15  16  17  18  19  
0       it     is  easy     to  use     NaN       NaN NaN NaN NaN NaN  
1  cooking  pizza  with  onion  ham  garlic  tomatoes NaN NaN NaN NaN

#get order of wanted columns
df = df.iloc[:, [0,1,2]]
#rename columns
df.columns = ['user','item','time']
print(df)
         user      item      time
0  disjiad123  TYh23hs9  13160032
1  hjf2329ccc  TGjsk123  14423321

If you need all columns, you need a preprocessing step to find the maximum number of columns for the parameter usecols, and then a postprocessing step to join the last columns into one:

import pandas as pd
import csv

#preprocessing
def get_max_len():
    with open('file1.csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            num.append(len(''.join(row).split()))
        m = max(num)
        #print m
        return m


df = pd.read_csv('file1.csv',
                         sep = r"\s+", #separator is arbitrary whitespace
                         header = None, #first row is not header, read all data to df
                         usecols = range(get_max_len())) #read up to the maximum number of columns
print(df)
           0         1         2   3     4           5     6      7    8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone   as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant   NaN    NaN  NaN   

    9    10    11   12   13  
0   it   is  easy   to  use  
1  NaN  NaN   NaN  NaN  NaN 
#df from 4th col to last
print(df.iloc[:, 4:])
     4           5     6      7    8    9    10    11   12   13
0     I        love  this  phone   as   it   is  easy   to  use
1  Suck  restaurant   NaN    NaN  NaN  NaN  NaN   NaN  NaN  NaN

#concatenate columns to one review text
df['review text'] = df.iloc[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, str)]), axis=1)
df = df.rename(columns={0:'user', 1:'item', 2:'time',3:'rating'})

#get string columns
cols = [x for x in df.columns if isinstance(x, str)]

#filter only string columns
print(df[cols])
         user      item      time  rating  \
0  disjiad123  TYh23hs9  13160032       5   
1  hjf2329ccc  TGjsk123  14423321       3   

                              review text  
0  I love this phone as it is easy to use  
1                         Suck restaurant  

Answer by Gerhard Hagerer

I would iterate over each line and replace the consecutive spaces with a semicolon, then call str.split() with the semicolon as separator. It could look like the following:

import re

data = [["user","item","rating", "review"]]
with open("your.csv") as f:
    for line in f:
        # collapse each run of two or more spaces into a single semicolon
        data += [re.sub(r' {2,}', ';', line.strip()).split(';')]
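The accumulated rows could then be handed to pandas. A minimal sketch (the sample lines are illustrative, and the header here includes a time column, since the question's rows actually carry five fields):

```python
import re
import pandas as pd

raw_lines = [
    "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use",
    "  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant",
]

# collapse multi-space gaps into semicolons, then split into fields
rows = [re.sub(r' {2,}', ';', line.strip()).split(';') for line in raw_lines]
df = pd.DataFrame(rows, columns=["user", "item", "time", "rating", "review"])
print(df)
```

This works as long as the review text itself never contains two consecutive spaces, which would otherwise be mistaken for a field boundary.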

Answer by rajan

I think OP is using Amazon's review data, and if so I have also found this input file to be difficult to read. I'm not 100% sure, but I think the reason that pandas.read_csv has a hard time with this is that the review_body column has tabs that replace newline characters (for whatever reason).

I tried a few of the solutions and I ended up building a new solution based on that proposed by @alexis. The solutions here did not work because the files in the link I provided have the following column names (note that "review_body" is neither at the end or the beginning of the list):

['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_body', 'review_date']

I apologize in advance for the similarity of the variable names. For example, there is a stopCol and a stopCols. I know ... very bad form.

import pandas as pd

# filstr is the path to the TSV file (defined elsewhere)

# declare dictionary to contain columns from left-to-right search
forwCols = {}
# declare dictionary to contain "review_body" column
stopCols = {}
# declare dictionary to contain columns from right-to-left search
revrCols = {}

with open(filstr,'r') as TSVfile:
    lines    = TSVfile.readlines()
    # The header should have the maximum num of cols
    numCols  = len(lines[0].split())
    # Find which column index corresponds to 'review body' col
    stopCol  = lines[0].split().index('review_body')
    colNames = lines[0].split()

for lineInt in range(1,len(lines)):
    # populate dict with cols until the column with tabs
    forwCols[lineInt] = lines[lineInt].\
                        split('\t',maxsplit=14)[:stopCol]
    # reverse list
    revrCols[lineInt] = lines[lineInt].rsplit('\t',maxsplit=2)[2:]
    forwLine = '\t'.join(forwCols[lineInt])
    revrLine = '\t'.join(revrCols[lineInt])
    # this next line removes the contents of the line that exists in
    # the dicts that are created already
    stopCols[lineInt] = \
            lines[lineInt].replace(forwLine,'').replace(revrLine,'')

# Create three DFs using the three dicts just created
revDF  = pd.DataFrame.from_dict(forwCols,orient='index',\
                        columns=colNames[:stopCol])
dateDF = pd.DataFrame.from_dict(revrCols,orient='index',columns=['review_date'])
revbodyDF = pd.DataFrame.from_dict(stopCols,orient='index',\
                                   columns=['review_body'])

# join the three DFs together on indices
combineDF1 = revbodyDF.merge(right=dateDF,how='outer',left_index=True,\
                             right_index=True)
combineDF = revDF.merge(right=combineDF1,how='outer',\
                             left_index=True,right_index=True)

The solution above is a brute-force approach, but it is the only way I could see to make this work when the column containing tabs isn't the first or last column.
