difflib.SequenceMatcher isjunk optional parameter query: how to ignore whitespace, tabs, and empty lines?

Date: 2020-03-06 14:51:33  Source: igfitidea

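I am trying to use difflib.SequenceMatcher to compute the similarity between two files. The two files are almost identical, except that one contains some extra spaces and empty lines that the other does not. I am trying to use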

s = difflib.SequenceMatcher(isjunk, text1, text2)
ratio = s.ratio()

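for this purpose.

So the question is how to write the lambda expression for this isjunk argument so that SequenceMatcher discounts all whitespace, empty lines, and so on. I tried passing lambda x: x == " ", but the results are not great. For two texts that are very similar, the ratio comes out very low, which is highly counterintuitive.

For testing purposes, here are two strings that can be used: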

What Motivates jwovu to do your Job
  Well? OK, this is an entry trying to
  win 0 worth of software development
  books despite the fact that I don‘t
  read 
  
  programming books. In order to win the
  prize you have to write an entry and

  what motivatesfggmum to do your job
  well. Hence this post. First
  motivation 
  
  money. I know, this doesn‘t sound like
  a great inspiration to many, and
  saying that money is one of the
  motivation factors might just blow my
  chances away. 
  
  As if money is a taboo in programming
  world. I know there are people who
  can‘t be motivated by money.   Mme, on
  the other hand, am living in a real
  world, 
  
  with house mortgage to pay, myself to
  feed and bills to cover. So I can‘t
  really exclude money from my
  consideration. If I can get a large
  sum of money for 
  
  doing a good job, then   definitely
  boost my morale. I won‘t care whether
  I am using an old workstation, or
  forced to share rooms or cubicle with
  other 
  
  people, or have to put up with an
  annoying boss, or whatever. The fact
  that at the end of the day I will walk
  off with a large pile of money itself
  is enough 
  
  for me to overcome all the obstacles,
  put up with all the hard feelings and
  hurt egos, tolerate a slow computer
  and even endure

Here is the other string:

What Motivates You to do your Job
  Well? OK, this is an entry trying to
  win 0 worth of software development
  books, despite the fact that I don't
  read programming books. In order to
  win the prize you have to write an
  entry and describes what motivates you
  to do your job well. Hence this post.
  
  First motivation, money. I know, this
  doesn't sound like a great inspiration
  to many, and saying that money is one
  of the motivation factors might just
  blow my chances away. As if money is a
  taboo in programming world. I know
  there are people who can't be
  motivated by money. Kudos to them. Me,
  on the other hand, am living in a real
  world, with house mortgage to pay,
  myself to feed and bills to cover. So
  I can't really exclude money from my
  consideration.
  
  If I can get a large sum of money for
  doing a good job, then thatwill
  definitely boost my morale. I won't
  care whether I am using an old
  workstation, or forced to share rooms
  or cubicle with other people, or have
  to put up with an annoying boss, or
  whatever. The fact that at the end of
  the day I will walk off with a large
  pile of money itself is enough for me
  to overcome all the obstacles, put up
  with all the hard feelings and hurt
  egos, tolerate a slow computer and
  even endure

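I ran the command above with isjunk set to lambda x: x == " ", and the ratio was only 0.36.

Answers

I haven't used difflib.SequenceMatcher, but have you considered preprocessing the files to remove all blank lines and whitespace (perhaps with a regular expression) and then doing the comparison?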
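A minimal sketch of that preprocessing idea, assuming Python 3 and that collapsing every run of whitespace into a single space is acceptable for your files. The helper names normalize_whitespace and similarity and the two short sample strings are made up for illustration. Passing autojunk=False is an optional choice: it switches off difflib's automatic "popular element" heuristic, which otherwise applies to sequences of 200 or more items.

```python
import difflib
import re


def normalize_whitespace(text):
    """Collapse each run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(text1, text2):
    """Similarity ratio of two texts after whitespace normalization."""
    a = normalize_whitespace(text1)
    b = normalize_whitespace(text2)
    # autojunk=False (an assumption about what you want) keeps difflib from
    # treating frequently occurring characters in long texts as junk.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical sample strings that differ only in whitespace.
sample1 = "What motivates you\n  to do your job\n\n  well?"
sample2 = "What motivates you to do\tyour job well?"
print(similarity(sample1, sample2))  # 1.0
```

Because the whitespace is gone before the comparison, it can no longer affect either the matching or the lengths used to compute the ratio.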

Using your sample strings:

>>> s=difflib.SequenceMatcher(lambda x: x == '\n', s1, s2)
>>> s.ratio()
0.94669848846459825

Interestingly, if " " is also included as junk:

>>> s=difflib.SequenceMatcher(lambda x: x in ' \n', s1, s2)
>>> s.ratio()
0.7653142402545744

It looks like newlines have a greater influence than spaces.
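One thing that may explain why a junk filter does not push the ratio toward 1: as far as I understand difflib, isjunk does not delete anything. Junk elements of the second sequence are only kept from anchoring a match, and they still count toward the lengths that ratio() divides by. A tiny sketch with made-up strings:

```python
import difflib

a = "abc"
b = "a b c"  # same letters as a, plus two extra spaces

# No junk filter at all.
plain = difflib.SequenceMatcher(None, a, b).ratio()

# Spaces marked as junk: they are still part of b's length.
with_junk = difflib.SequenceMatcher(lambda ch: ch == " ", a, b).ratio()

# Spaces actually removed before comparing.
stripped = difflib.SequenceMatcher(None, a, b.replace(" ", "")).ratio()

# ratio() is 2 * matches / (len(a) + len(b)); here 2 * 3 / 8 for the first two.
print(plain, with_junk, stripped)  # 0.75 0.75 1.0
```

So if the goal is to make whitespace differences count for nothing, removing or normalizing the whitespace first is the more direct route.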

If you match all whitespace, the similarity is better:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

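However, difflib is not ideal for this kind of problem: these are two nearly identical documents, yet typos and the like produce differences for difflib where a human reader would not see many.

Try reading up on tf-idf, Bayesian probability, vector space models, and w-shingling.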
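To give a feel for one of these, here is a rough sketch of w-shingling: each document becomes the set of its runs of w consecutive words, and two documents are compared by the size of the intersection over the size of the union (Jaccard resemblance). The function names, the default w, and the sample strings are illustrative assumptions, not part of any library.

```python
def shingles(text, w=3):
    """Return the set of all runs of w consecutive words in text."""
    words = text.split()  # split() ignores spaces, tabs, and newlines
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text1, text2, w=3):
    """Jaccard resemblance of the two shingle sets, between 0.0 and 1.0."""
    s1, s2 = shingles(text1, w), shingles(text2, w)
    if not s1 and not s2:
        return 1.0  # both texts are too short to produce any shingle
    return len(s1 & s2) / len(s1 | s2)


# Hypothetical samples: different whitespace and one changed word.
doc_a = "the quick brown fox jumps"
doc_b = "the  quick\nbrown fox leaps"
print(resemblance(doc_a, doc_b, w=2))  # 0.6
```

Since the text is split into words first, differences in whitespace and empty lines do not change the score at all.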

I have written an implementation of tf-idf, applied it to a vector space, and used the dot product as a distance measure to classify documents.
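That implementation is not shown here, so the following is only a standard-library sketch of the general idea, not the code described above. It assumes whitespace tokenization, a smoothed idf, and unit-length vectors, so that the dot product equals the cosine similarity. All names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption) keeps weights positive even for
        # terms that occur in every document.
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(weight * weight for weight in vec.values()))
        vectors.append({t: wt / norm for t, wt in vec.items()} if norm else vec)
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


# Hypothetical documents: the first two differ only in whitespace.
docs = ["money motivates me", "money  motivates\nme", "slow computer"]
v0, v1, v2 = tfidf_vectors(docs)
print(dot(v0, v1), dot(v0, v2))
```

With this setup, documents that differ only in whitespace score close to 1, and documents that share no words score 0.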

Given the texts above, the test is indeed as suggested:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, to speed things up a little, you can take advantage of CPython's method-wrappers:

difflib.SequenceMatcher(" \t\n".__contains__, doc1, doc2).ratio()

This avoids many Python function calls.
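A quick way to check that the two predicates are interchangeable, and to time them on your own machine. The sample strings and variable names are made up for illustration, and no timing figures are claimed here since they depend on the interpreter and hardware.

```python
import difflib
import timeit

WHITESPACE = " \t\n"


def is_junk_plain(ch):
    """Ordinary Python function, equivalent to lambda x: x in " \t\n"."""
    return ch in WHITESPACE


# Bound method of the string object; the membership test runs in C.
is_junk_bound = WHITESPACE.__contains__

# Hypothetical sample documents.
doc1 = "What motivates you\n  to do your job well?"
doc2 = "What motivates you to do\tyour job well?"

ratio_plain = difflib.SequenceMatcher(is_junk_plain, doc1, doc2).ratio()
ratio_bound = difflib.SequenceMatcher(is_junk_bound, doc1, doc2).ratio()
print(ratio_plain == ratio_bound)  # True: both predicates classify alike

# Time only the predicate calls, one call per character of doc2.
t_plain = timeit.timeit(lambda: [is_junk_plain(c) for c in doc2], number=10000)
t_bound = timeit.timeit(lambda: [is_junk_bound(c) for c in doc2], number=10000)
print(t_plain, t_bound)
```

In recent CPython versions, difflib appears to call isjunk only once per distinct element of the second sequence, so the overall gain may be small. It is worth measuring before relying on it.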