Java: recognise an arbitrary date string

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3850784/

Time: 2020-10-30 03:40:28  Source: igfitidea

Recognise an arbitrary date string

Tags: java, date, classification

Asked by Joel

I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and day (e.g. 12/12/10); I just need to classify the string as being a date, rather than converting it to a Date object. So, this is really a classification problem rather than a parsing problem.

I will have pieces of text such as:

"bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"

and I need to be able to recognise the start and end boundary for each date string within.

I was wondering if anyone knew of any java libraries that can do this. My google-fu hasn't come up with anything so far.

UPDATE: I need to be able to recognise the widest possible set of ways of representing dates. Of course, the naive solution might be to write an if statement for every conceivable format, but a pattern recognition approach, with a trained model, is ideally what I'm after.

Accepted answer by Puspendu Banerjee

Use JChronic

You may want to use DateParser2 from the edu.mit.broad.genome.utils package.

Answer by Bozho

You can loop all available date formats in Java:

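import java.text.DateFormat;
import java.text.ParseException;
import java.util.Locale;

static boolean isDate(String dateString) {
    for (Locale locale : DateFormat.getAvailableLocales()) {
        // the style constants run from FULL (0) to SHORT (3)
        for (int style = DateFormat.FULL; style <= DateFormat.SHORT; style++) {
            DateFormat df = DateFormat.getDateInstance(style, locale);
            try {
                df.parse(dateString);
                return true; // or return the Date object obtained from parse()
            } catch (ParseException ex) {
                // unparseable with this format, try the next one
            }
        }
    }
    return false; // no built-in format accepted the string
}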

This however won't account for any custom date formats.
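
One way to cover custom formats is to add your own patterns to the same kind of loop. The following is only a sketch under assumptions that are not in the answer: the class name `CustomFormatCheck`, the pattern list and the choice of `Locale.ENGLISH` are all hypothetical. It also switches off lenient parsing and uses a `ParsePosition`, so that out-of-range values are rejected and the whole string has to be consumed.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Locale;

final class CustomFormatCheck {
    // Hypothetical list of extra patterns; extend it with the formats you expect.
    private static final String[] PATTERNS = {
        "dd MMM yy", "dd MMM yyyy", "dd/MM/yy", "dd/MM/yyyy", "yyyy-MM-dd"
    };

    static boolean isDate(String s) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.ENGLISH);
            df.setLenient(false);                  // reject values such as day 32
            ParsePosition pos = new ParsePosition(0);
            // parse(String, ParsePosition) returns null on failure instead of throwing
            if (df.parse(s, pos) != null && pos.getIndex() == s.length()) {
                return true;                       // parsed, and nothing was left over
            }
        }
        return false;
    }
}
```

With these patterns, `CustomFormatCheck.isDate("12 Jan 09")` is accepted while `"12 Jan 09 bla"` is not, because trailing text is left unparsed.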

Answer by darioo

Rules that might help you in your quest:

  1. Make or find some sort of database of known words that match months: abbreviated and full names, like Jan or January. The search must be case insensitive, because fEBruaRy is also a month, although the person typing it must have been drunk. If you plan to search for non-English months, a database is also needed, because no heuristic will find out that "Wrzesień" is Polish for September.
  2. For English only, check out ordinal numbers and also make a database for the numbers 1 to 31. These will be useful for days and months. If you want to use this approach for other languages, then you will have to do your own research.
  3. Once again, English only: check for "Anno Domini" and "Before Christ", that is, AD and BC respectively. They can also appear in the form A.D. and B.C.
  4. Concerning the numbers themselves that will represent days, months and years, you must know where your limit is. Is it 0-9999, or more? That is, do you want to search for dates that represent years beyond year 9999? If not, then strings that have 1-4 consecutive digits are good guesses for a valid day, month or year.
  5. Days and months have one or two digits. Leading zeros are acceptable, so strings of the form 0*, where * can be 1-9, are acceptable.
  6. Separators can be tricky, but if you don't allow inconsistent formatting like 10/20\1999, then you will save yourself a lot of grief. This is because 10*20*1999 can be a valid date, with * usually being one element of the set {-,_, ,:,/,\,.,','}, but it's possible that * is a combination of 2 or 3 elements of that set. Once again, you must choose acceptable separators. 10?20?1999 can be a valid date for somebody with a weird sense of elegance. 10 / 20 / 1999 can also be a valid date, but 10_/20_/1999 would be a very strange one.
  7. There are cases with no separator. For example: 10Jan1988. These cases use words from rule 1.
  8. There are special cases, like February having 28 or 29 days, depending on leap year. Also, months have 30 or 31 days.

I think these are enough for a "naive" classification; a linguistics expert might help you more.
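
Rules 1 and 2 can be sketched in a few lines of Java. This is a minimal illustration, not the answer author's code: the class `MonthWords` and its methods are hypothetical names, and the "database" is simply the English month names that `java.time.Month` can produce. For other languages you would add entries for further locales.

```java
import java.time.Month;
import java.time.format.TextStyle;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class MonthWords {
    private static final Map<String, Integer> MONTHS = new HashMap<>();
    static {
        for (Month m : Month.values()) {
            // rule 1: store full and abbreviated names in lower case
            MONTHS.put(m.getDisplayName(TextStyle.FULL, Locale.ENGLISH).toLowerCase(Locale.ROOT), m.getValue());
            MONTHS.put(m.getDisplayName(TextStyle.SHORT, Locale.ENGLISH).toLowerCase(Locale.ROOT), m.getValue());
        }
    }

    /** Returns 1..12 for a word such as "Jan" or "fEBruaRy", or -1 if it is not a known month. */
    static int monthNumber(String word) {
        return MONTHS.getOrDefault(word.toLowerCase(Locale.ROOT), -1);
    }

    /** Rule 2: parses "1", "01", "1st", "22nd" or "31st" into 1..31, or returns -1. */
    static int parseDay(String token) {
        String digits = token.toLowerCase(Locale.ROOT).replaceFirst("(st|nd|rd|th)$", "");
        if (!digits.matches("\\d{1,2}")) {
            return -1;
        }
        int day = Integer.parseInt(digits);
        return (day >= 1 && day <= 31) ? day : -1;
    }
}
```

Lower-casing both the stored names and the query is what makes the lookup case insensitive.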

Now, an idea for your algorithm. Speed doesn't matter. There might be multiple passes over the same string. Optimize when it starts to matter. When you suspect that you have found a date string, store it somewhere "safe" in a ListOfPossibleDates and examine it once again with more rigid rules, using combinations of rules 1 to 8. When you believe a date string is valid, feed it to the Date class to see if it's really valid. 32nd March 1999 is not valid once you convert it to a format that Date will understand.
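
The final validity check could look like the sketch below. It swaps in `java.time` with strict resolution rather than the older `Date` class mentioned above, and it assumes the candidate has already been normalised to the form "32 March 1999" (ordinal suffix removed). The class name `StrictCheck` and the single pattern are illustrative assumptions.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Locale;

final class StrictCheck {
    // "uuuu" is the year field to use together with ResolverStyle.STRICT
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH)
                         .withResolverStyle(ResolverStyle.STRICT);

    static boolean isRealDate(String candidate) {
        try {
            LocalDate.parse(candidate, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;   // e.g. day 32, or 29 February in a non-leap year
        }
    }
}
```

Strict resolution also covers rule 8: "29 February 1999" is rejected while "29 February 2000" is accepted.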

One important recurring pattern is lookbehind and lookaround. When you believe a valid entity (day, month, year) is found, you'll have to see what lies behind and after. A stack based mechanism or recursion might help here.

Steps:

  1. Search your string for words from rule 1. If you find any of them, note the location and the month. Now, go a few characters back and a few ahead to see what awaits you. If there are no spaces before and after your month, and there are numbers, like in rule 7, check them for validity. If one of them represents a day (must be 0-31) and the other a year (must be 0-9999, possibly with AD or BC), you have one candidate. If there are the same separators before and after, apply rule 6. Always remember that you must be sure that a valid combination exists. So, 32Jan1999 won't do.
  2. Search your string for other English words, from rules 2 and 3. Proceed as in step 1.
  3. Search for separators. Empty space will be the trickiest. Try to find them in pairs. So, if you have one "/" in your string, find another one and see what lies in between. If you find a combination of separators, do the same thing. Also, use the algorithm from step 2.
  4. Search for digits. Valid ones are 0-9999 with leading zeroes allowed. If you find one, look for separators like in step 3.

Since there are countless possibilities, you won't be able to catch them all. Once you have found a pattern that you believe could occur again, store it somewhere so you can use it as a regex for parsing other strings.

Let's take your example, "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla". After you extract the first date, 12 Jan 09, then use the rest of that string ("bla bla bla 01/04/10 bla bla bla") and apply all above steps once again. This way you'll be sure you didn't miss anything.
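
In Java, repeatedly continuing with "the rest of the string" is what `Matcher.find()` does when called in a loop, and `start()`/`end()` give the boundaries the question asks for. The sketch below is a deliberately small stand-in for the full set of rules: the class `DateSpans` and its single pattern (day, separator, numeric or abbreviated English month, separator, 2-4 digit year) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateSpans {
    private static final String MONTH = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec";
    // day, separator, numeric or abbreviated month, separator, 2-4 digit year
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b\\d{1,2}[ /.-](?:\\d{1,2}|" + MONTH + ")[ /.-]\\d{2,4}\\b",
        Pattern.CASE_INSENSITIVE);

    /** Returns {start, end} offsets (end exclusive) of every candidate in the text. */
    static List<int[]> findSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {            // each call resumes after the previous match
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla";
        for (int[] span : findSpans(text)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

For the example string this reports two spans, covering "12 Jan 09" and "01/04/10". Candidates found this way would still need the validity checks described earlier.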

I hope these suggestions will be of at least some help. If no library exists that does all these dirty steps (and more) for you, then you have a tough road ahead of you. Good luck!

Answer by Matt

A very good date parser in Java is Natty; you can try it here.

Answer by Martijn Courteaux

I did it with a huge regex (self created):

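import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because the regex lives in a Java string literal.
public static final String DATE_REGEX = "\\b([0-9]{1,2} ?([\\-/\\\\] ?[0-9]{1,2} ?| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ?)([\\-/\\\\]? ?('?[0-9]{2}|[0-9]{4}))?)\\b";
public static final Pattern DATE_PATTERN = Pattern.compile(DATE_REGEX, Pattern.CASE_INSENSITIVE); // Case insensitive is to match also "mar" and not only "Mar" for March

public static boolean containsDate(String str)
{
    Matcher matcher = DATE_PATTERN.matcher(str);
    // matches() tests the whole string; use matcher.find() to search inside longer text
    return matcher.matches();
}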

This matches the following dates:

06 Sep 2010
12-5-2005
07 Mar 95
30 DEC '99
1101

And not this:

444/11/11
bla11/11/11
11/11/11blah

It also matches dates between symbols like [],(), ,:

Yesterday (6 nov 2010)

It matches dates without year:

Yesterday, 6 nov, was a rainy day...

But it also matches:

86-44/1234
00-00-0000
11/11

And these no longer look like dates. But this is something you can solve by checking whether the numbers are possible values for a month, day and year.
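
A possible form of that check, for all-numeric matches only, is sketched here. The class `Plausibility` and its method names are hypothetical; it tries both day-month-year and month-day-year order and lets `LocalDate.of` decide whether the combination exists. Matches without a year (such as 11/11) or with a month name are out of scope for this sketch and simply return false.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

final class Plausibility {
    /** True if three numeric fields form a real date as day-month-year or month-day-year. */
    static boolean isPlausible(String match) {
        String[] parts = match.trim().split("\\s*[-/\\\\]\\s*");
        if (parts.length != 3) {
            return false;                          // no year: cannot be checked here
        }
        try {
            int a = Integer.parseInt(parts[0]);
            int b = Integer.parseInt(parts[1]);
            int year = Integer.parseInt(parts[2]);
            return isDate(year, b, a) || isDate(year, a, b);
        } catch (NumberFormatException e) {
            return false;                          // non-numeric fields are not handled
        }
    }

    private static boolean isDate(int year, int month, int day) {
        try {
            LocalDate.of(year, month, day);        // throws for month 44, day 0, 30 February, ...
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }
}
```

Under these assumptions, 86-44/1234 and 00-00-0000 are rejected, while 12-5-2005 is kept.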

Answer by carlosdc

I am sure researchers in information extraction have looked at this problem, but I couldn't find a paper.

One thing you can try is to do it as a multi-step process. (1) After collecting as much data as you can, extract features. Some features that come to mind: the number of numbers that appear in the string, the number of numbers from 1-31 that appear in the string, the number of numbers from 1-12 that appear in the string, the number of month names that appear in the string, and so on. (2) Learn from the features using some type of binary classification method (an SVM, for example). (3) Finally, when a new string comes by, extract the features and query the SVM for a prediction.
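
The feature-extraction step might be sketched as follows. Only the four features named above are computed; the class `DateFeatures`, the order of the feature vector and the English-only month pattern are assumptions, and training the classifier itself is left to whichever machine learning library you choose.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateFeatures {
    private static final Pattern NUMBER = Pattern.compile("\\d+");
    private static final Pattern MONTH_NAME = Pattern.compile(
        "\\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
        + "|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)\\b",
        Pattern.CASE_INSENSITIVE);

    /** Returns {numbers, numbers in 1..31, numbers in 1..12, month names}. */
    static int[] extract(String s) {
        int numbers = 0, upTo31 = 0, upTo12 = 0, months = 0;
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            numbers++;
            if (m.group().length() <= 4) {          // skip long digit runs to avoid overflow
                int value = Integer.parseInt(m.group());
                if (value >= 1 && value <= 31) upTo31++;
                if (value >= 1 && value <= 12) upTo12++;
            }
        }
        Matcher names = MONTH_NAME.matcher(s);
        while (names.find()) {
            months++;
        }
        return new int[] { numbers, upTo31, upTo12, months };
    }
}
```

For "12 Jan 09" this yields the vector {2, 2, 2, 1}: two numbers, both within 1-31 and 1-12, and one month name.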

Answer by MD. Mohiuddin Ahmed

Here is a simple Natty example:

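import com.joestelmach.natty.*;
import java.util.Date;
import java.util.List;

// parse() returns groups of dates; take the dates of the first group
List<Date> dates = new Parser().parse("Start date 11/30/2013 , end date Friday, Sept. 7, 2013").get(0).getDates();
System.out.println(dates.get(0));
System.out.println(dates.get(1));

// output (the time of day and zone depend on when and where it is run):
// Sat Nov 30 11:14:30 BDT 2013
// Sat Sep 07 11:14:30 BDT 2013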

Answer by David Watson

What I would do is look for date characteristics, rather than the dates themselves. For example, you could search for slashes (to get dates of the form 1/1/1001), dashes (1 - 1 - 1001), and month names and abbreviations (Jan 1 1001 or January 1 1001). When you get a hit for one of these, collect the nearby words (2 on each side should be fine) and store them in an array of strings. Once you have scanned all input, check this string array with a function that goes into a bit more depth and pulls out the actual date strings, using the methods found here. The important thing is just getting the general dates down to a manageable level.
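
The first pass described here could be sketched like this. The class `CandidateWindows`, the whitespace tokenisation and the English-only month pattern are assumptions for illustration; overlapping hits produce overlapping windows, which the deeper second pass would have to deduplicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class CandidateWindows {
    // A word is a "hit" if it contains a slash or dash, or is a month name or abbreviation.
    private static final Pattern HIT = Pattern.compile(
        ".*[/-].*|(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
        + "|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)[.,]?",
        Pattern.CASE_INSENSITIVE);

    /** Returns each hit together with up to `radius` words on either side. */
    static List<String> collect(String text, int radius) {
        String[] words = text.trim().split("\\s+");
        List<String> windows = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (HIT.matcher(words[i]).matches()) {
                int from = Math.max(0, i - radius);
                int to = Math.min(words.length, i + radius + 1);
                windows.add(String.join(" ", Arrays.copyOfRange(words, from, to)));
            }
        }
        return windows;
    }
}
```

With a radius of 2, the question's example string yields the windows "bla 12 Jan 09 bla" and "bla bla 01/04/10 bla bla".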

Answer by Paweł Dyda

It is virtually impossible to recognize all possible date formats as dates using "standard" algorithms. That's just because there are so many of them.

We humans are capable of doing that only because we have learned that something like 2010-03-31 resembles a date. In other words, I would suggest using machine learning algorithms and teaching your program to recognize valid date sequences. With the Google Prediction API, that should be feasible.

Or you can use Regular Expressions as suggested above, to detect some but not all date formats.

Answer by zserge

Maybe you should use regular expressions?

Hopefully this one would work for mm-dd-yyyy format:

^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$

Here (0[1-9]|1[012]) matches a month 01..12, (0[1-9]|[12][0-9]|3[01]) matches a day 01..31 and (19|20)\d\d matches a year from 1900 to 2099.

Fields can be delimited by a dash, space, slash or dot.
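
Used from Java, the expression needs its backslashes doubled inside the string literal. A minimal sketch, with the wrapper class `MmDdYyyy` being a hypothetical name:

```java
import java.util.regex.Pattern;

final class MmDdYyyy {
    // The regex from this answer, with backslashes doubled for a Java string literal.
    private static final Pattern MM_DD_YYYY = Pattern.compile(
        "^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\\d\\d$");

    static boolean matches(String s) {
        return MM_DD_YYYY.matcher(s).matches();
    }
}
```

Note that the pattern only checks digit ranges, so a string such as 02-31-2010 still matches even though that day does not exist.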

Regards, Serge
