Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/32312309/

Time: 2020-09-13 23:50:15  Source: igfitidea

Check if header exists with Python pandas

python, csv, pandas

Asked by JDE876

I have a question. Is there a way to check whether a column header exists in a file, or to skip rows until the header is reached? Say I have a group of files: one with a header on the first row, another with the header on the second row following some useless text on the first row, and another with no header at all. I want to skip all rows before the column header, or detect whether a header even exists, without specifying "skiprows" in the code. There are a number of hard-coded ways to do this. I have used regexes, replaces, etc., but I am looking for a more universal idea that covers all bases. I have even made a raw input prompt that lets you enter the number of rows you want to skip. That method worked, but I want something that does not rely on user input and just detects column headers on its own. I am just looking for a few ideas, if any. I am working mainly with CSV-type files and would like to do this with Python.

Answered by

csv.Sniffer has a has_header() function that should return True if the first row appears to be a header. A procedure for using it would be to first remove all empty rows from the top, up to the first non-empty row, and then run csv.Sniffer.has_header(). My experience is that the header must be on the first line for has_header() to return True, and that it returns False if the number of header fields does not match the number of data fields for at least one row in its scan range. The scan range (the size of the sample read from the file) must be set by the user; 1024 or 2048 are typical values. I even tried setting it much higher so that the entire file would be read, but it still failed to recognize the header if it was not on the first line. All my testing was done using Python 2.7.10.
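
As a rough illustration of the has_header() call described above, here is a minimal sketch. It assumes Python 3 (the answer itself was tested on Python 2.7.10), passes the sample as an in-memory string rather than reading a file, and uses a hypothetical helper name, looks_like_header, that is not part of the csv module. The sample rows are the ones from the test file shown later in this answer.

```python
import csv

# Sample rows taken from the test file in this answer.
WITH_HEADER = (
    "header,better,leader,fodder,blather,super\n"
    "1,2,3,,,\n"
    "4,5,6,7,8,9\n"
    "3,4,5,6,7,\n"
    "2,,,,,\n"
)
# The same rows with the header line cut off.
WITHOUT_HEADER = WITH_HEADER.split("\n", 1)[1]


def looks_like_header(sample):
    """Return csv.Sniffer's guess about whether the first row is a header.

    Hypothetical helper for illustration only.
    """
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # Raised when the sniffer cannot work out a delimiter for the sample.
        return False


print(looks_like_header(WITH_HEADER))
print(looks_like_header(WITHOUT_HEADER))
```

has_header() is a heuristic: it compares the first row against the rows that follow it, so the result depends on what the sample contains.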

Here is an example of using csv.Sniffer in a script. It first determines whether a file has a recognizable header. If not, it renames the file, creates a new, empty file with the original name, then opens the renamed file for reading and the new file for writing, and copies the renamed file's contents to the new file, excluding leading blank lines. Finally, it retests the new file for a header to determine whether removing the blank lines made a difference.

import csv
from datetime import datetime
import os
import re
import shutil
import sys
import time

# repr() of each common delimiter; repr('\t') is the four characters '\t'
common_delimeters = set(["' '", "'\\t'", "','"])

def sniff(filepath):
    with open(filepath, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(2048))
        delimiter = repr(dialect.delimiter)
        if delimiter not in common_delimeters:
            print filepath,'has uncommon delimiter',delimiter
        else:
            print filepath,'has common delimiter',delimiter
        csvfile.seek(0)
        if csv.Sniffer().has_header(csvfile.read(2048)):
            print filepath, 'has a header'
            return True
        else:
            print filepath, 'does not have a header'
            return False

def remove_leading_blanks(filepath):
    # test filepath  for header and delimiter
    print 'testing',filepath,'with sniffer'
    has_header = sniff(filepath)
    if has_header:
        print 'no need to remove leading blank lines if any in',filepath
        return True
    # make copy of filepath appending current date-time to its name
    if os.path.isfile(filepath):
        now = datetime.now().strftime('%Y%d%m%H%M%S')
        m = re.search(r'(\.[A-Za-z0-9_]+)\Z',filepath)
        bakpath = ''
        if m != None:
            bakpath = filepath.replace(m.group(1),'') + '.' + now + m.group(1)
        else:
            bakpath = filepath + '.' + now       
        try:
            print 'renaming', filepath,'to', bakpath
            os.rename(filepath, bakpath)
        except:
            print 'renaming operation failed:', sys.exc_info()[0]
            return False
        print 'creating a new',filepath,'from',bakpath,'minus leading blank lines'
        # now open renamed file and copy it to original filename
        # except for leading blank lines
        time.sleep(2)
        try:
            with open(bakpath) as o, open (filepath, 'w') as n:
                p = False
                for line in o:
                    if p == False:
                        if line.rstrip():
                            n.write(line)
                            p = True
                        else:
                            continue
                    else:
                        n.write(line)
        except IOError as e:
            print 'file copy operation failed: %s' % e.strerror   
            return False
        print 'testing new',filepath,'with sniffer'       
        has_header = sniff(filepath)
        if has_header:
            print 'the header problem with',filepath,'has been fixed'
            return True
        else:
            print 'the header problem with',filepath,'has not been fixed'
            return False

Given this CSV file, where the header is actually on line 11 (the leading blank lines are not shown here):

header,better,leader,fodder,blather,super
1,2,3,,,
4,5,6,7,8,9
3,4,5,6,7,
2,,,,,

remove_leading_blanks() determined that it did not have headers, then removed the leading blank lines and determined that it did have headers. Here is the trace of its console output:

testing test1.csv with sniffer...
test1.csv has uncommon delimiter '\r'
test1.csv does not have a header
renaming test1.csv to test1.20153108142923.csv
creating a new test1.csv from test1.20153108142923.csv minus leading blank lines
testing new test1.csv with sniffer
test1.csv has common delimiter ','
test1.csv has a header
the header problem with test1.csv has been fixed
done ok
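
The effect seen in this trace can be sketched in memory, without renaming any files. This is only an illustration, assuming Python 3 and plain "\n" line endings; the helper name strip_leading_blank_lines and the PADDED sample (ten blank lines in front of the rows above) are made up for the example and are not part of the original script.

```python
import csv
from itertools import dropwhile

DATA = (
    "header,better,leader,fodder,blather,super\n"
    "1,2,3,,,\n"
    "4,5,6,7,8,9\n"
    "3,4,5,6,7,\n"
    "2,,,,,\n"
)
# Ten blank lines first, so the header ends up on line 11.
PADDED = "\n" * 10 + DATA


def strip_leading_blank_lines(text):
    """Drop lines from the top until the first line with visible content."""
    lines = text.splitlines(keepends=True)
    return "".join(dropwhile(lambda line: not line.strip(), lines))


sniffer = csv.Sniffer()
# has_header() treats the first row as the header candidate; here that row
# is an empty line, so there is nothing for it to recognize.
before = sniffer.has_header(PADDED)
after = sniffer.has_header(strip_leading_blank_lines(PADDED))
print(before, after)
```

Working on a string keeps the original file untouched, at the cost of reading the sample into memory first.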

While this may work much of the time, it does not appear to be generally reliable, because headers and their placement vary too much. Still, maybe it's better than nothing.
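
One possible extension of the same idea, for files where the header follows some non-blank junk text, is to retry has_header() from successive starting lines. The sketch below assumes Python 3; find_header_row and its min_rows parameter are hypothetical names introduced for this example, and the approach inherits all the unreliability noted above (it can report a data row as a header when later rows have empty cells).

```python
import csv


def find_header_row(text, min_rows=3):
    """Return the 0-based index of the first line that csv.Sniffer accepts
    as a header, or None if no starting line qualifies.

    Hypothetical helper for illustration only.
    """
    lines = text.splitlines()
    sniffer = csv.Sniffer()
    # Only try offsets that leave at least min_rows lines; with too little
    # data left, has_header() has nothing to compare the candidate against.
    for offset in range(len(lines) - min_rows + 1):
        sample = "\n".join(lines[offset:]) + "\n"
        try:
            if sniffer.has_header(sample):
                return offset
        except csv.Error:
            # No delimiter could be determined from this offset; keep going.
            continue
    return None


DATA = (
    "header,better,leader,fodder,blather,super\n"
    "1,2,3,,,\n"
    "4,5,6,7,8,9\n"
    "3,4,5,6,7,\n"
    "2,,,,,\n"
)
JUNK_FIRST = "some useless text\n" + DATA
NUMERIC_ONLY = "1,2,3\n4,5,6\n7,8,9\n"

print(find_header_row(DATA))
print(find_header_row(JUNK_FIRST))
print(find_header_row(NUMERIC_ONLY))
```

If an index is found, it could be passed as skiprows to pandas.read_csv; if the result is None, header=None would tell pandas not to treat any row as a header.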

See csv.Sniffer, csv.py and _csv.c for more info. PyMOTW's "csv – Comma-separated value files" has a good tutorial review of the csv module, with details on Dialects.
