Log file to Pandas Dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40305122/


Log file to Pandas Dataframe

python, python-3.x, pandas, dataframe, data-analysis

Asked by ukbaz

I have log files with many lines in the form:

LogLevel    [13/10/2015 00:30:00.650]  [Message Text]

My goal is to convert each line of the log file into a tidy data frame. I have tried to do that by splitting the lines on the [ character, but I am still not getting a neat dataframe.

My code:

import pandas as pd

level = []
time = []
text = []

with open(filename) as inf:
    for line in inf:
        parts = line.split('[')
        if len(parts) > 2:
            level.append(parts[0])
            time.append(parts[1])
            text.append(parts[2])
            print(parts[0], parts[1], parts[2])

df = pd.DataFrame({'Level': level, 'Time': time, 'Text': text})

Here's my printed data frame:

Info      10/08/16 10:56:09.843]   In Function CCatalinaPrinter::ItemDescription()]

Info      10/08/16 10:56:09.843]   Sending UPC Description Message ]

How can I improve this to strip the whitespace and the other ']' character?

Thank you

Answered by jezrael

You can use read_csv with the separator \s*\[ (whitespace followed by [):

import pandas as pd
from io import StringIO

temp=u"""LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]"""
# after testing, replace StringIO(temp) with the log file name
df = pd.read_csv(StringIO(temp), sep=r"\s*\[", names=['Level','Time','Text'], engine='python')

Then remove ] with str.strip and convert the Time column with to_datetime:

df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
df.Text = df.Text.str.strip(']')

print (df)
      Level                    Time          Text
0  LogLevel 2015-10-13 00:30:00.650  Message Text
1  LogLevel 2015-10-13 00:30:00.650  Message Text
2  LogLevel 2015-10-13 00:30:00.650  Message Text
3  LogLevel 2015-10-13 00:30:00.650  Message Text

print (df.dtypes)
Level            object
Time     datetime64[ns]
Text             object
dtype: object
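
For reference, on newer pandas versions the same result can be had by reading the raw lines and pulling the three fields with a single regex via str.extract. A minimal sketch, assuming a hypothetical file name app.log:

import pandas as pd

# Each line looks like: LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
pattern = r'^(?P<Level>\S+)\s+\[(?P<Time>[^\]]+)\]\s+\[(?P<Text>[^\]]*)\]'

with open('app.log') as fh:                 # hypothetical file name
    raw = pd.Series(fh.read().splitlines())

df = raw.str.extract(pattern)               # named groups become the columns
df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y %H:%M:%S.%f')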

Answered by jxramos

I had to parse mine manually, since my separator showed up in my message body and the message body could span multiple lines as well, e.g. if an exception were thrown from my Flask application and the stack trace recorded.

Here's my log creation format...

logging.basicConfig( filename="%s/%s_MyApp.log" % ( Utilities.logFolder , datetime.datetime.today().strftime("%Y%m%d-%H%M%S")) , level=logging.DEBUG, format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s", datefmt="%Y-%m-%d %H:%M:%S" )
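
For reference, a minimal sketch of the record shape this configuration emits (the file name, process id and message below are illustrative, not taken from the original logs):

import logging

logging.basicConfig(
    filename="MyApp_example.log",    # hypothetical file name
    level=logging.DEBUG,
    format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logging.getLogger("Utilities").info("Sending UPC Description Message")
# Produces a comma-separated record such as:
#   2016-08-10 10:56:09,Utilities,1234,20,Sending UPC Description Message
# The leading "YYYY-MM-DD HH:MM:SS," prefix is what the parser below keys on
# to tell a new record from a continuation line.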

And the parsing code in my Utilities module:

Utilities.py

import re
import pandas

logFolder = "./Logs"

logLevelToString = { "50" : "CRITICAL",
                     "40" : "ERROR"   ,
                     "30" : "WARNING" ,
                     "20" : "INFO"    ,
                     "10" : "DEBUG"   ,
                     "0"  : "NOTSET"  } # https://docs.python.org/3.6/library/logging.html#logging-levels

def logFile2DataFrame( filePath ) :
    dfLog = pandas.DataFrame( columns=[ 'Timestamp' , 'Module' , 'ProcessID' , 'Level' , 'Message' ] )
    # A new record starts with a "YYYY-MM-DD HH:MM:SS," timestamp prefix
    tsPattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},"

    with open( filePath , 'r' ) as logFile :
        numRows = -1
        for line in logFile :
            if re.search( tsPattern , line ) :
                tokens    = line.split(",")
                timestamp = tokens[0]
                module    = tokens[1]
                processID = tokens[2]
                level     = logLevelToString[ tokens[3] ]
                message   = ",".join( tokens[4:] )  # re-join any commas inside the message body
                numRows += 1
                dfLog.loc[ numRows ] = [ timestamp , module , processID , level , message ]
            else :
                # Multiline message, integrate it into last record
                dfLog.loc[ numRows , 'Message' ] += line
    return dfLog
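
A quick usage sketch (the log file name here is hypothetical):

import Utilities

# Hypothetical usage: parse one generated log file and inspect the result.
dfLog = Utilities.logFile2DataFrame( "./Logs/20160810-105609_MyApp.log" )
print( dfLog.head() )
print( dfLog[ 'Level' ].value_counts() )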

I actually created this helper to let me view my logs directly from my Flask app, as I have a handy template that renders a DataFrame. It should accelerate debugging quite a bit, since wrapping the Flask app in a Tornado WSGI server prevents the normal debug page that Flask shows when an exception gets thrown. If anyone knows how to restore that functionality in such a setup, please share.
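
The template itself is not shown in the answer; a minimal sketch of the same idea using DataFrame.to_html instead (the route name, file path and module layout are assumptions):

from flask import Flask

import Utilities

app = Flask( __name__ )

@app.route( "/logs" )
def showLogs() :
    # Parse a (hypothetical) log file and render it as a plain HTML table.
    dfLog = Utilities.logFile2DataFrame( "./Logs/20160810-105609_MyApp.log" )
    return dfLog.to_html( index=False )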