Attribution: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/19994396/

Retrieved: 2020-08-19 15:12:16. Source: igfitidea

Best way to identify and extract dates from text Python?

Tags: python, parsing, date, nlp

Asked by redct

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.

For example, I have a large list of strings (that usually take the form of English sentences or statements) that take a variety of forms:

Central design committee session Tuesday 10/22 6:30 pm

Th 9/19 LAB: Serial encoding (Section 2.2)

There will be another one on December 15th for those who are unable to make it today.

Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm

He will be flying in Sept. 15th.

While these dates are in-line with natural text, none of them are in specifically natural language forms themselves (e.g., there's no "The meeting will be two weeks from tomorrow"—it's all explicit).

As someone who doesn't have much experience with this kind of processing, what would be the best place to begin? I've looked into things like the dateutil.parser module and parsedatetime, but those seem to be for after you've isolated the date.

Because of this, is there any good way to separate the date from the extraneous text, e.g.

input:  Th 9/19 LAB: Serial encoding (Section 2.2)
output: ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']

or something similar? It seems like this sort of processing is done by applications like Gmail and Apple Mail, but is it possible to implement in Python?

Accepted answer by akoumjian

I was also looking for a solution to this and couldn't find any, so a friend and I built a tool to do this. I thought I would come back and share it in case others find it helpful.

datefinder -- find and extract dates inside text

Here's an example:

import datefinder

string_with_dates = '''
    Central design committee session Tuesday 10/22 6:30 pm
    Th 9/19 LAB: Serial encoding (Section 2.2)
    There will be another one on December 15th for those who are unable to make it today.
    Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
    He will be flying in Sept. 15th.
    We expect to deliver this between late 2021 and early 2022.
'''

matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)

Answer by Kyle Kelley

If you can identify the segments that actually contain the date information, parsing them can be fairly simple with parsedatetime. There are a few things to consider, though: your dates don't have years, and you should pick a locale.

>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> p.parse("December 15th")
((2013, 12, 15, 0, 13, 30, 4, 319, 0), 1)
>>> p.parse("9/18 11:59 pm")
((2014, 9, 18, 23, 59, 0, 4, 319, 0), 3)
>>> # It chooses 2014 since that's the *next* occurrence of 9/18
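
The "next occurrence" rule mentioned in that last comment can be sketched with the standard library alone. This only illustrates the idea and is not parsedatetime's actual implementation; `next_occurrence` and its parameters are hypothetical names, and the reference date is passed in explicitly so the result is reproducible.

```python
from datetime import date


def next_occurrence(month, day, today):
    """Return the first date with this month/day that is not before `today`.

    Assumption: a year-less date refers to the upcoming occurrence.
    Note: date() raises ValueError for Feb 29 in a non-leap year.
    """
    candidate = date(today.year, month, day)
    if candidate < today:
        # This year's occurrence has already passed, so roll over to next year.
        candidate = date(today.year + 1, month, day)
    return candidate


# Reference date taken from the session above (day 319 of 2013 is Nov 15).
reference = date(2013, 11, 15)
print(next_occurrence(12, 15, reference))
print(next_occurrence(9, 18, reference))
```

Depending on the text source, the previous occurrence may be the better guess (for example, when processing archived messages), so the direction of the rollover is a design choice rather than a given.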

It doesn't always work perfectly when you have extraneous text.

>>> p.parse("9/19 LAB: Serial encoding")
((2014, 9, 19, 0, 15, 30, 4, 319, 0), 1)
>>> p.parse("9/19 LAB: Serial encoding (Section 2.2)")
((2014, 2, 2, 0, 15, 32, 4, 319, 0), 1)

Honestly, this seems like a problem where it is simple enough to match a handful of particular formats and pick the most likely candidate in each sentence. Beyond that, it would be a decent machine learning problem.
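
As a rough illustration of matching particular formats, here is a minimal sketch that uses only the `re` module. The pattern set is an assumption that covers just the formats in the question's sample strings, and `split_date` is a hypothetical helper name, not part of parsedatetime or any other library.

```python
import re

# Assumed formats: optional weekday, then "9/19" or "Sept. 15th", then optional time.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\.?")
WEEKDAY = (r"(?:Mon(?:day)?|Tue(?:s(?:day)?)?|Wed(?:nesday)?"
           r"|Th(?:u(?:r(?:s(?:day)?)?)?)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)\.?")
SLASH_DATE = r"\d{1,2}/\d{1,2}(?:/\d{2,4})?"
MONTH_DATE = MONTH + r"\s+\d{1,2}(?:st|nd|rd|th)?"
TIME = r"\d{1,2}:\d{2}\s*(?:am|pm)"

DATE_RE = re.compile(
    (r"\b(?:" + WEEKDAY + r"\s+)?"
     + r"(?:" + SLASH_DATE + r"|" + MONTH_DATE + r")\b"
     + r"(?:\s+" + TIME + r"\b)?"),
    re.IGNORECASE,
)


def split_date(text):
    """Return (date_text, remaining_text), or (None, text) if nothing matches."""
    match = DATE_RE.search(text)
    if match is None:
        return None, text
    # Cut the matched span out and collapse the leftover whitespace.
    rest = " ".join((text[:match.start()] + text[match.end():]).split())
    return match.group(0), rest


print(split_date("Th 9/19 LAB: Serial encoding (Section 2.2)"))
print(split_date("Central design committee session Tuesday 10/22 6:30 pm"))
```

The extracted date string can then be handed to a parser such as parsedatetime. Only the first match per string is returned, and a word like "May" followed by a number would be a false positive, so real input would need stricter rules.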

Answer by hardcode

Hi, I'm not sure the approach below counts as machine learning, but you may try it:

  • Add some context from outside the text, e.g. the publishing time of the text message or post, the current time, etc. (your text doesn't say anything about the year).
  • Extract all tokens, using whitespace as the separator. You should get something like this:

    ['Th','Wednesday','9:34pm','7:34','pm','am','9/18','9/','/18', '19','12']
    
  • Process them with rule sets, e.g. starting from weekday names and/or variations of the components that form a time, and mark the matches: '%d:%dpm', '%d am', '%d/%d', '%d/ %d', etc. may mean a time or date. Note that a value may be split across tokens, e.g. "12 / 31" is a 3-gram ('12','/','31') that should become the single token of interest "12/31".

  • "See" which tokens surround the marked tokens such as "9:45pm", e.g. ('Th','9/19','9:45pm') is a 3-gram formed from "interesting" tokens, and apply rules to it that may determine its meaning.

  • Run a more specific analysis. For example, given 31/12, 31 > 12 means d/m, or vice versa; but given 12/12, which part is the month and which is the day can only be decided from context built from the text and/or from outside it.
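
The tagging and disambiguation steps above could be sketched roughly as follows. All names here (`tag_token`, `resolve_numeric_date`, the labels and the rule table) are hypothetical and chosen for illustration; the year is taken from an externally supplied reference date, as the first bullet suggests.

```python
import re
from datetime import date

WEEKDAYS = {"mon", "monday", "tue", "tues", "tuesday", "wed", "wednesday",
            "th", "thu", "thur", "thurs", "thursday", "fri", "friday",
            "sat", "saturday", "sun", "sunday"}

# Assumed rule set: label -> pattern that must match the whole token.
PATTERNS = [
    ("TIME", re.compile(r"\d{1,2}:\d{2}(?:am|pm)?", re.IGNORECASE)),
    ("MERIDIEM", re.compile(r"am|pm", re.IGNORECASE)),
    ("NUMERIC_DATE", re.compile(r"\d{1,2}/\d{1,2}")),
]


def tag_token(token):
    """Return a label for an 'interesting' token, or None otherwise."""
    if token.lower().rstrip(".") in WEEKDAYS:
        return "WEEKDAY"
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return None


def resolve_numeric_date(first, second, reference, day_first=False):
    """Interpret 'first/second' using the 31 > 12 rule, with the year from context."""
    if first > 12 and second <= 12:
        day, month = first, second        # e.g. 31/12 can only be d/m
    elif second > 12 and first <= 12:
        month, day = first, second        # e.g. 9/19 can only be m/d
    elif day_first:
        day, month = first, second        # ambiguous: fall back to the chosen convention
    else:
        month, day = first, second
    return date(reference.year, month, day)


tokens = "Th 9/19 9:45pm LAB".split()
print([(token, tag_token(token)) for token in tokens])
print(resolve_numeric_date(31, 12, date(2013, 11, 15)))
```

The `day_first` flag stands in for the context that the last bullet says is needed when both numbers are 12 or less; a pair where both numbers exceed 12 raises `ValueError` from `date()`.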

Cheers

Answer by Prabin S

import datefinder
string_with_dates = """
                    entries are due by January 4th, 2017 at 8:00pm
                    created 01/15/2005 by ACME Inc. and associates.
                    """
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)

Answer by Samkit Jain

You can use the dateutil module's parse method with the fuzzy option.

>>> from dateutil.parser import parse
>>> parse("Central design committee session Tuesday 10/22 6:30 pm", fuzzy=True)
datetime.datetime(2018, 10, 22, 18, 30)
>>> parse("There will be another one on December 15th for those who are unable to make it today.", fuzzy=True)
datetime.datetime(2018, 12, 15, 0, 0)
>>> parse("Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm", fuzzy=True)
datetime.datetime(2018, 3, 9, 23, 59)
>>> parse("He will be flying in Sept. 15th.", fuzzy=True)
datetime.datetime(2018, 9, 15, 0, 0)
>>> parse("Th 9/19 LAB: Serial encoding (Section 2.2)", fuzzy=True)
datetime.datetime(2002, 9, 19, 0, 0)

Answer by Afsan Abdulali Gujarati

I am surprised that there is no mention of SUTime and dateparser's search_dates method.

from sutime import SUTime
import os
import json
from dateparser.search import search_dates

str1 = "Let's meet sometime next Thursday" 

# You'll get more information about these jar files from SUTime's github page
jar_files = os.path.join(os.path.dirname(__file__), 'jars')
sutime = SUTime(jars=jar_files, mark_time_ranges=True)

print(json.dumps(sutime.parse(str1), sort_keys=True, indent=4))
"""output: 
[
    {
        "end": 33,
        "start": 20,
        "text": "next Thursday",
        "type": "DATE",
        "value": "2018-10-11"
    }
]
"""

print(search_dates(str1))
#output:
#[('Thursday', datetime.datetime(2018, 9, 27, 0, 0))]

Although I have tried other modules like dateutil, datefinder and natty (I couldn't get duckling to work with Python), these two seem to give the most promising results.

The results from SUTime are more reliable, as the code snippet above shows. However, SUTime fails in some basic scenarios, such as parsing a text like

"I won't be available until 9/19"

or

"I won't be available between (September 18-September 20)."

It gives no result for the first text and only the month and year for the second. However, both are handled quite well by the search_dates method. search_dates is more aggressive and will return all possible dates related to any words in the input text.

I haven't yet found a way to make search_dates parse the text strictly for dates. If I find one, it will be my first choice over SUTime, and I will make sure to update this answer.

Answer by Ramtin M. Seraj

Newer versions of the dateparser library provide search functionality.

Example

from dateparser.search import search_dates

dates = search_dates('Central design committee session Tuesday 10/22 6:30 pm')
print(dates)  # a list of (matched_text, datetime) tuples, or None if nothing is found