Log file to Pandas Dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40305122/


Log file to Pandas Dataframe

python, python-3.x, pandas, dataframe, data-analysis

Asked by ukbaz

I have log files with many lines in the form:

LogLevel    [13/10/2015 00:30:00.650]  [Message Text]

My goal is to convert each line of the log file into a tidy data frame. I have tried to do that by splitting the lines on the [ character, but I am still not getting a neat dataframe.

My code:

import pandas as pd

level = []
time = []
text = []

with open(filename) as inf:
    for line in inf:
        parts = line.split('[')
        if len(parts) > 2:
            level.append(parts[0])
            time.append(parts[1])
            text.append(parts[2])
            print(parts[0], parts[1], parts[2])

df = pd.DataFrame({'Level': level, 'Time': time, 'Text': text})

Here's my printed data frame:

Info      10/08/16 10:56:09.843]   In Function CCatalinaPrinter::ItemDescription()]

Info      10/08/16 10:56:09.843]   Sending UPC Description Message ]

How can I improve this to strip the whitespace and the other ']' character?

Thank you

Answered by jezrael

You can use read_csv with the separator \s*\[ (whitespace followed by [):

import pandas as pd
from io import StringIO

temp=u"""LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]"""
# after testing, replace StringIO(temp) with the log file name
df = pd.read_csv(StringIO(temp), sep=r"\s*\[", names=['Level','Time','Text'], engine='python')

Then remove ] with str.strip and convert the Time column with to_datetime:

df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
df.Text = df.Text.str.strip(']')

print (df)
      Level                    Time          Text
0  LogLevel 2015-10-13 00:30:00.650  Message Text
1  LogLevel 2015-10-13 00:30:00.650  Message Text
2  LogLevel 2015-10-13 00:30:00.650  Message Text
3  LogLevel 2015-10-13 00:30:00.650  Message Text

print (df.dtypes)
Level            object
Time     datetime64[ns]
Text             object
dtype: object
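
For reference, on newer pandas versions the same result can be had by reading the raw lines and pulling the three fields with a single regex via str.extract. A minimal sketch, assuming a hypothetical file name app.log:

import pandas as pd

# Each line looks like: LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
pattern = r'^(?P<Level>\S+)\s+\[(?P<Time>[^\]]+)\]\s+\[(?P<Text>[^\]]*)\]'

with open('app.log') as fh:                 # hypothetical file name
    raw = pd.Series(fh.read().splitlines())

df = raw.str.extract(pattern)               # named groups become the columns
df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y %H:%M:%S.%f')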

Answered by jxramos

I had to parse mine manually, since my separator showed up in my message body and the message body could span multiple lines as well, e.g. if an exception were thrown from my Flask application and the stack trace recorded.

Here's my log creation format...

logging.basicConfig( filename="%s/%s_MyApp.log" % ( Utilities.logFolder , datetime.datetime.today().strftime("%Y%m%d-%H%M%S")) , level=logging.DEBUG, format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s", datefmt="%Y-%m-%d %H:%M:%S" )
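
For reference, a minimal sketch of the record shape this configuration emits (the file name, process id and message below are illustrative, not taken from the original logs):

import logging

logging.basicConfig(
    filename="MyApp_example.log",    # hypothetical file name
    level=logging.DEBUG,
    format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logging.getLogger("Utilities").info("Sending UPC Description Message")
# Produces a comma-separated record such as:
#   2016-08-10 10:56:09,Utilities,1234,20,Sending UPC Description Message
# The leading "YYYY-MM-DD HH:MM:SS," prefix is what the parser below keys on
# to tell a new record from a continuation line.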

And the parsing code in my Utilities module:

Utilities.py

import re
import pandas

logFolder = "./Logs"

logLevelToString = { "50" : "CRITICAL",
                     "40" : "ERROR"   ,
                     "30" : "WARNING" ,
                     "20" : "INFO"    ,
                     "10" : "DEBUG"   ,
                     "0"  : "NOTSET"  } # https://docs.python.org/3.6/library/logging.html#logging-levels

def logFile2DataFrame( filePath ) :
    dfLog = pandas.DataFrame( columns=[ 'Timestamp' , 'Module' , 'ProcessID' , 'Level' , 'Message' ] )
    # A new record starts with a "YYYY-MM-DD HH:MM:SS," timestamp prefix
    tsPattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},"

    with open( filePath , 'r' ) as logFile :
        numRows = -1
        for line in logFile :
            if re.search( tsPattern , line ) :
                tokens    = line.split(",")
                timestamp = tokens[0]
                module    = tokens[1]
                processID = tokens[2]
                level     = logLevelToString[ tokens[3] ]
                message   = ",".join( tokens[4:] )  # re-join any commas inside the message body
                numRows += 1
                dfLog.loc[ numRows ] = [ timestamp , module , processID , level , message ]
            else :
                # Multiline message, integrate it into last record
                dfLog.loc[ numRows , 'Message' ] += line
    return dfLog
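
A quick usage sketch (the log file name here is hypothetical):

import Utilities

# Hypothetical usage: parse one generated log file and inspect the result.
dfLog = Utilities.logFile2DataFrame( "./Logs/20160810-105609_MyApp.log" )
print( dfLog.head() )
print( dfLog[ 'Level' ].value_counts() )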

I actually created this helper to let me view my logs directly from my Flask app, as I have a handy template that renders a DataFrame. It should accelerate debugging quite a bit, since wrapping the Flask app in a Tornado WSGI server prevents the normal debug page that Flask shows when an exception gets thrown. If anyone knows how to restore that functionality in such a setup, please share.
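
The template itself is not shown in the answer; a minimal sketch of the same idea using DataFrame.to_html instead (the route name, file path and module layout are assumptions):

from flask import Flask

import Utilities

app = Flask( __name__ )

@app.route( "/logs" )
def showLogs() :
    # Parse a (hypothetical) log file and render it as a plain HTML table.
    dfLog = Utilities.logFile2DataFrame( "./Logs/20160810-105609_MyApp.log" )
    return dfLog.to_html( index=False )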