How to parse this custom log file in Python

Question

Asked by steven.levey

I am using Python logging to generate log files when processing and I am trying to READ those log files into a list/dict which will then be converted into JSON and loaded into a nosql database for processing.

The file gets generated with the following format.

2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

NOTE: There are actually \n breaks before each NEW date you see, but I can't seem to represent them here.

Basically I am trying to read in this text file and produce a json object that looks like this:

{
    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'
}
...

{
    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}

The problem I am having:

I can add each line into a list or dict etc BUT the ERROR message sometimes goes over multiple lines so I end up splitting it up incorrectly.

Tried:

I have tried to use code like the below to split the lines only on valid dates, but I can't seem to get the error messages that span multiple lines. I also tried regular expressions and think that's a possible solution, but I can't seem to find the right regex to use... NO CLUE how it works, so I tried a bunch of copy-paste without any success.

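import itertools as it

listNew = []
with open(filename, 'r') as f:
    # key is True for runs of lines that start with '2015' and False for the
    # runs in between (e.g. traceback lines), which are dropped here.
    for key, group in it.groupby(f, lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)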

Tried some crazy regex but no luck here either:

logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)

Would appreciate any help...thanks

EDIT:

Posted a Solution below for anyone else struggling with the same thing.

Answer 1

Accepted answer by steven.levey

Using @Joran Beasley's answer I came up with the following solution and it seems to work:

Main Points:

My log files ALWAYS follow the same structure: {Date} - {Name} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed them. For example, the {Date} is always 23 characters long and I only want the first 19.
Using line.startswith("2015") is crazy because the dates will eventually change, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could be specific.
The file is read by the first function, generateDicts(), which calls matchDate() to check whether the line being processed starts with the {Date} format I am looking for.
A NEW dict is created every time a valid {Date} is found, and every following line is added to it until the NEXT valid {Date} is encountered.

Function to split up the log files.

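def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith(matchDate(line)):
            # A new record starts: emit the previous one first.
            if currentDict:
                yield currentDict
            # Fields are " - " separated: date, logger name, level, message.
            # maxsplit=3 leaves any " - " inside the message untouched.
            date, _name, level, text = line.split(" - ", 3)
            currentDict = {"date": date[:19], "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. part of a traceback): append it to the message.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))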

Function to check whether the line being processed starts with a {Date} in the format I am looking for. It has to be defined before generateDicts() is called.

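import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" timestamp if the line starts with one.
    matched = re.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', line)
    if matched:
        return matched.group()
    # Sentinel that a log line is not expected to start with, so that
    # line.startswith() fails for continuation lines.
    return "NONE"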
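
The question's end goal is JSON that can be loaded into a NoSQL database. As a rough sketch of that last step, assuming records shaped like the dicts yielded by generateDicts() (the two sample records below are abbreviated and made up for illustration), the standard json module can serialize them either as one array or as one document per line:

```python
import json

# Two records shaped like the dicts yielded by generateDicts() (values abbreviated).
records = [
    {"date": "2015-05-22 16:46:46", "type": "INFO",
     "text": "Starting to Wait for Files\n"},
    {"date": "2015-05-22 16:48:48", "type": "ERROR",
     "text": "Failed: Waiting for files\nTraceback (most recent call last):\n"},
]

# One JSON array for the whole log file ...
as_array = json.dumps(records, indent=4)

# ... or one JSON document per line. Newlines inside a message are escaped
# as \n by json.dumps, so each record stays on a single line.
as_lines = "\n".join(json.dumps(record) for record in records)

print(as_lines)
```

Which of the two layouts is more convenient depends on the import tool of the database in use; that choice is not something the original post specifies.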

Answer 2

Answer by Joran Beasley

Create a generator (I'm on a generator bender today):

def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # Split on " - " (at most 3 times) so the dashes inside the date survive.
            date, _name, level, text = line.split(" - ", 3)
            currentDict = {"date": date, "type": level, "text": text}
        else:
            currentDict["text"] += line
    yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))

There may be a few minor typos... I didn't actually run this.

Answer 3

Answer by Deepak

The solution provided by @steven.levey is perfect. One addition I would like to make is to use this regex pattern to determine whether the line is well formed and to extract the required values, so that we don't have to split the line again after checking its format with a regex.

pattern = r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'
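
A minimal sketch of how this pattern might be used, assuming the logger name is always __main__ as in the sample log. The function name parse_log and the shortened sample lines are made up for illustration; lines that do not match the pattern (such as traceback lines) are appended to the previous record's message:

```python
import io
import re

# Pattern from the answer above, written as a raw string.
LINE_RE = re.compile(r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)')

def parse_log(lines):
    """Yield one dict per log record; unmatched lines extend the previous message."""
    record = None
    for line in lines:
        found = LINE_RE.match(line)
        if found:
            # A new record starts: emit the previous one first.
            if record is not None:
                yield record
            date, level, message = found.groups()
            record = {"Date": date, "Type": level, "Message": message}
        elif record is not None:
            # Continuation line, e.g. part of a traceback.
            record["Message"] += line
    if record is not None:
        yield record

# Shortened sample in the same format as the log in the question.
sample = io.StringIO(
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success: Return Code - 0 and FileCount 1\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
)
records = list(parse_log(sample))
print(len(records))
```

Because the message is captured by the last group, a " - " inside the message (as in "Return Code - 0") does not need any special handling.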

Answer 4

Answer by Lincoln Randall McFarland

You can get the fields you are looking for directly from the regex using groups. You can even name them:

>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
...
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'

Then create a date class where the constructor's kwargs match the group names. Use a little **magic to create an instance of the object directly from the regex groupdict and you are cooking with gas. In the constructor you can then figure out whether 2016 is a leap year and Feb 29 exists.
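
One way this could look, as a sketch only: LogDate is a hypothetical class name, and handing the leap-year check to datetime.date is an assumption about how one might implement the validation, not something the answer prescribes.

```python
import datetime
import re

DATE_RE = re.compile(
    r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) '
    r'(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)'
)

class LogDate:
    """Hypothetical date class whose keyword arguments mirror the regex group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        # datetime.date raises ValueError for impossible dates,
        # such as Feb 29 in a year that is not a leap year.
        self.date = datetime.date(int(a_year), int(a_month), int(a_day))
        self.hour = int(an_hour)
        self.minute = int(a_minute)
        self.second = float(a_second)

found = DATE_RE.match('2016-02-29 12:34:56.789')
if found is not None:
    # ** unpacks the groupdict into the constructor's keyword arguments.
    log_date = LogDate(**found.groupdict())
    print(log_date.date, log_date.hour, log_date.minute, log_date.second)
```

Note that this regex expects a dot before the fractional seconds, whereas the log in the question uses a comma (16:46:46,985), so the pattern would need adjusting to capture the milliseconds there.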

-lrm

Answer 5

Answer by Sadashiv Raj Bharadwaj

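records = []
with open('bla.txt', 'r') as file:
    for line in file:
        parts = line.split(' - ')
        # Lines with fewer than four fields (e.g. traceback lines) are skipped.
        if len(parts) >= 4:
            # Note: parts[3] stops at the next ' - ', so a message such as
            # 'Return Code - 0 ...' is cut short.
            records.append({'Date': parts[0], 'Type': parts[2], 'Message': parts[3]})
print(records)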

Output:

[{
    'Date': '2015-05-22 16:46:46,985',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:46:56,645',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:47:46,488',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:48:48,180',
    'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
    'Type': 'ERROR'
}, {
    'Date': '2015-05-22 16:49:17,918',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:32,160',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:39,329',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:53:30,706',
    'Message': 'Starting to Wait for Files',
    'Type': 'INFO'
}]
