How to parse this custom log file in Python
Original source: http://stackoverflow.com/questions/30627810/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by steven.levey
I am using Python's logging module to generate log files during processing, and I am trying to read those log files into a list/dict, which will then be converted into JSON and loaded into a NoSQL database for processing.
The file is generated in the following format:
2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files
NOTE: There are actually \n breaks before each new date you see, but I can't seem to represent them here.
Basically, I am trying to read in this text file and produce a JSON object that looks like this:
{
'Date': '2015-05-22 16:46:46,985',
'Type': 'INFO',
'Message':'Starting to Wait for Files'
}
...
{
'Date': '2015-05-22 16:48:48,180',
'Type': 'ERROR',
'Message':'Failed: Waiting for files the Files from Cloud Storage: gs://folder/anotherfolder/ Traceback (most recent call last):
File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}
The problem I am having:
I can add each line to a list or dict, but the ERROR message sometimes runs over multiple lines, so I end up splitting it incorrectly.
Tried:
I have tried code like the following to split the lines only on valid dates, but I can't seem to capture the error messages that span multiple lines. I also tried regular expressions and think they are a possible solution, but I can't find the right regex to use. I have no clue how they work, so I tried a bunch of copy-paste without any success.
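import itertools as it

listNew = []
with open(filename, 'r') as f:
    # groupby alternates between runs of lines that start with '2015' and runs that do not
    for key, group in it.groupby(f, lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)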
I also tried some crazy regex, but had no luck there either:
logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)
I would appreciate any help. Thanks!
EDIT:
I posted a solution below for anyone else struggling with the same thing.
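For reference, one reason the re.split attempt above is hard to work with: when the pattern contains capturing groups, re.split also returns the text of every group, so the result is littered with fragments such as '20', '05' and '22'. Below is a minimal sketch of an alternative, assuming Python 3.7+ (which allows splitting on zero-width matches) and the timestamp layout shown in the sample log. The names `sample`, `RECORD_START` and `split_records` are illustrative and not part of the original post.

```python
import re

# Hypothetical sample text standing in for the file contents (fileData in the question).
sample = (
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n"
)

# Zero-width lookahead: split *before* each line that starts with a timestamp,
# without consuming or capturing any text (needs Python 3.7+).
RECORD_START = r"^(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} - )"


def split_records(text):
    """Return one string per log record, tracebacks included."""
    chunks = re.split(RECORD_START, text, flags=re.MULTILINE)
    # The zero-width match at position 0 produces an empty first chunk; drop it.
    return [chunk for chunk in chunks if chunk]


records = split_records(sample)
for record in records:
    print(repr(record))
```

Each returned string still has to be split into date, type and message, which is what the answers below deal with.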
Accepted answer by steven.levey
Using @Joran Beasley's answer, I came up with the following solution, and it seems to work:
Main Points:
- My log files ALWAYS follow the same structure: {Date} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed. For example, the {Date} is always 23 characters long and I only want the first 19.
- Using line.startswith("2015") is crazy because the dates will eventually change, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could be specific.
- The file is read by the first function, generateDicts(), which calls matchDate() to check whether the line being processed starts with the {Date} format I am looking for.
- A new dict is created every time a valid {Date} is found, and everything that follows is added to it until the next valid {Date} is encountered.
Function to split up the log files:
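def generateDicts(log_fh):
    # Yield one dict per log record; lines without a leading date (such as
    # traceback lines) are appended to the text of the record before them.
    currentDict = {}
    for line in log_fh:
        if line.startswith(matchDate(line)):
            if currentDict:
                yield currentDict
            # In the sample log the fields are {Date} - {Name} - {Type} - {Message};
            # maxsplit=3 keeps any " - " inside the message intact.
            parts = line.split(" - ", 3)
            currentDict = {"date": parts[0][:19], "type": parts[2], "text": parts[-1]}
        elif currentDict:
            currentDict["text"] += line
    if currentDict:
        yield currentDict

# matchDate() must already be defined when this runs.
with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))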
Function to check whether the line being processed starts with a {Date} in the format I am looking for:
import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" timestamp, or "NONE" if the line has none
    matchThis = ""
    matched = re.match(r'\d\d\d\d-\d\d-\d\d\ \d\d:\d\d:\d\d', line)
    if matched:
        # matches a date and adds it to matchThis
        matchThis = matched.group()
    else:
        matchThis = "NONE"
    return matchThis
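The question also asks for JSON output, which the answer above stops short of. Here is a minimal end-to-end sketch under a few assumptions: every dated line has the four " - " separated fields seen in the sample log, the key names follow the JSON shape requested in the question, and `parse_log`, `DATE_RE` and `sample_log` are illustrative names rather than anything from the original answer.

```python
import io
import json
import re

# Assumed timestamp prefix, based on the sample log in the question.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def parse_log(lines):
    """Yield one dict per record; undated lines extend the previous message."""
    current = None
    for line in lines:
        if DATE_RE.match(line):
            if current is not None:
                yield current
            # Assumes exactly the layout "date - name - level - message";
            # a dated line with fewer fields would raise ValueError here.
            date, _name, level, message = line.split(" - ", 3)
            current = {"Date": date, "Type": level, "Message": message}
        elif current is not None:
            current["Message"] += line
    if current is not None:
        yield current


# Hypothetical in-memory stand-in for an open log file.
sample_log = io.StringIO(
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success: Return Code - 0 and FileCount 1\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
)

records = list(parse_log(sample_log))
print(json.dumps(records, indent=2))
```

Messages keep their trailing newlines in this sketch; strip them before serializing if the database should not store them.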
Answer by Joran Beasley
Create a generator (I'm on a generator bend today):
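def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # fields are separated by " - "; maxsplit=3 keeps any " - " inside the message
            parts = line.split(" - ", 3)
            currentDict = {"date": parts[0], "type": parts[2], "text": parts[-1]}
        elif currentDict:
            # continuation line (e.g. part of a traceback): append to the current record
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))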
There may be a few minor typos; I didn't actually run this.
Answer by Deepak
The solution provided by @steven.levey is perfect. One addition I would make is to use this regex pattern both to check that the line is well-formed and to extract the required values. That way we don't have to split the line again after using a regex to check its format.
pattern = r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'
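As a rough illustration of how this pattern could be used, the sketch below compiles it (written as a raw string) and reads the three groups in one step. It assumes the logger name is always `__main__`, as the pattern hard-codes; `LINE_RE` and `parse_line` are hypothetical names introduced here.

```python
import re

# The pattern from this answer, written as a raw string and compiled once.
LINE_RE = re.compile(r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)')


def parse_line(line):
    """Return a dict for a well-formed log line, or None for anything else."""
    matched = LINE_RE.match(line)
    if matched is None:
        # Not a dated log line (for example, part of a traceback).
        return None
    date, level, message = matched.groups()
    return {"Date": date, "Type": level, "Message": message.rstrip("\n")}


sample = ("2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files "
          "from Cloud Storage: Return Code - 0 and FileCount 1\n")
print(parse_line(sample))
print(parse_line("Traceback (most recent call last):\n"))
```

A None result is the cue to append the line to the previous record's message, as the generator-based answers do.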
Answer by Lincoln Randall McFarland
You can get the fields you are looking for directly from the regex using groups. You can even name them:
>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
...
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'
Then create a date class whose constructor's keyword arguments match the group names. Use a little ** magic (keyword-argument unpacking) to create an instance directly from the regex groupdict, and you are cooking with gas. In the constructor you can then work out whether 2016 is a leap year and whether Feb 29 exists.
-lrm
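The date-class idea in the answer above might look roughly like the sketch below. `LogDate` is a hypothetical class written for illustration, not code from the answer; it assumes the group names of the regex shown there and uses the standard library's calendar.monthrange for the leap-year check.

```python
import calendar
import re

DATE_RE = re.compile(
    r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) '
    r'(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)'
)


class LogDate:
    """Hypothetical class whose keyword arguments mirror the regex group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        self.year = int(a_year)
        self.month = int(a_month)
        self.day = int(a_day)
        self.hour = int(an_hour)
        self.minute = int(a_minute)
        self.second = float(a_second)
        # monthrange returns (weekday of the 1st, days in month) and accounts
        # for leap years, so Feb 29 is accepted only when it exists.
        days_in_month = calendar.monthrange(self.year, self.month)[1]
        if not 1 <= self.day <= days_in_month:
            raise ValueError(f"{self.year}-{self.month:02d} has no day {self.day}")


found = DATE_RE.match('2016-02-29 12:34:56.789')
if found is not None:
    # ** unpacks the groupdict into keyword arguments.
    stamp = LogDate(**found.groupdict())
    print(stamp.year, stamp.month, stamp.day)
```

In practice datetime.strptime covers much of this, but the sketch keeps to the approach the answer describes.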
Answer by Sadashiv Raj Bharadwaj
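records = []
with open('bla.txt', 'r') as log_file:
    for line in log_file:
        # Note: lines without three ' - ' separators (e.g. tracebacks) are skipped,
        # and a message that itself contains ' - ' is cut off at that point.
        parts = line.split(' - ')
        if len(parts) >= 4:
            d = dict()
            d['Date'] = parts[0]
            d['Type'] = parts[2]
            d['Message'] = parts[3]
            records.append(d)
print(records)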
Output:
[{
'Date': '2015-05-22 16:46:46,985',
'Message': 'Starting to Wait for Files\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:46:56,645',
'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:47:46,488',
'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:48:48,180',
'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
'Type': 'ERROR'
}, {
'Date': '2015-05-22 16:49:17,918',
'Message': 'Starting to Wait for Files\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:49:32,160',
'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:49:39,329',
'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:53:30,706',
'Message': 'Starting to Wait for Files',
'Type': 'INFO'
}]
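The 'Return Code' messages in the output above are cut short because the message itself contains ' - ', so an unlimited split produces a fifth piece that is never read. A small sketch of the difference, using one line from the sample log; the variable names are illustrative only:

```python
line = ("2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files "
        "from Cloud Storage: Return Code - 0 and FileCount 1\n")

# Unlimited split: the message is broken at its own " - ".
naive = line.split(" - ")

# At most three splits: date, logger name, level, and the whole message.
capped = line.split(" - ", 3)

print(len(naive), repr(naive[3]))
print(len(capped), repr(capped[3]))
```

Passing a maxsplit of 3 keeps the full message in the fourth field. Multi-line tracebacks still need the continuation handling shown in the generator-based answers.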