Disclaimer: this page is a translation mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it you must likewise follow the CC BY-SA license and attribute the original authors (not the translator). Original: http://stackoverflow.com/questions/17352469/

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

Tags: java, python, r, machine-learning, classification

Asked by SAL

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.

Here are a couple of lines:

1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx  @SuryaRay

Here is the total data set: http://pastebin.com/eJuEb4eB

I need to build a model that classifies "Apple" (Inc.) tweets apart from the rest.

I'm not looking for a general overview of machine learning; rather, I'm looking for an actual model in code (Python preferred).

Accepted answer by AMADANON Inc.

I would do it as follows:

  1. Split the sentence into words, normalise them, and build a dictionary
  2. For each word, store how many times it occurred in tweets about the company, and how many times it appeared in tweets about the fruit - these tweets must be confirmed by a human
  3. When a new tweet comes in, look up every word of the tweet in the dictionary and calculate a weighted score - words used frequently in relation to the company would get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, would not have much of a score (a sketch follows this list)
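
A minimal Python sketch of this counting approach (the names and the scoring formula are illustrative, not a reference implementation):

from collections import defaultdict
import re

def normalise(word):
    # lowercase and strip punctuation so "Apple," and "apple" collide
    return re.sub(r'[^a-z0-9]', '', word.lower())

def build_counts(labelled_tweets):
    # labelled_tweets: iterable of (label, text); label '1' = company, '0' = fruit
    counts = defaultdict(lambda: {'company': 0, 'fruit': 0})
    for label, text in labelled_tweets:
        for word in filter(None, (normalise(w) for w in text.split())):
            counts[word]['company' if label == '1' else 'fruit'] += 1
    return counts

def score(tweet, counts):
    # positive total leans company, negative leans fruit
    total = 0.0
    for word in filter(None, (normalise(w) for w in tweet.split())):
        entry = counts.get(word)
        if entry:
            c, f = entry['company'], entry['fruit']
            total += (c - f) / float(c + f)
    return total

Words seen with only one sense push the score hard in that direction, while words seen with both mostly cancel out, which is exactly the behaviour step 3 asks for.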

Answered by Neil McGuigan

What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities.

Essentially, it looks at the content and context of the word (looking a few words back and forward) to estimate the probability that the word is a named entity.

Good software can look at other features of words, such as their length or shape (like "Vcv" if it starts with "vowel-consonant-vowel").

A very good library (GPL) is Stanford's NER

Here's the demo: http://nlp.stanford.edu:8080/ner/

Some sample text to try:

I was eating an apple over at Apple headquarters and I thought about Apple Martin, the daughter of the Coldplay guy

(the 3class and 4class classifiers get it right)

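If you want to script this rather than use the web demo, here is a sketch using NLTK's wrapper around Stanford NER; the jar and model paths are placeholders you would point at your own Stanford NER download:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# placeholder paths - adjust to wherever you unpacked Stanford NER
st = StanfordNERTagger('classifiers/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner.jar')

sentence = ("I was eating an apple over at Apple headquarters and I thought "
            "about Apple Martin, the daughter of the Coldplay guy")
print(st.tag(word_tokenize(sentence)))
# the company mention should come back as ORGANIZATION, "Apple Martin"
# as PERSON, and the lowercase fruit "apple" as O (outside any entity)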

Answered by Ian Ozsvald

I have a semi-working system that solves this problem, open sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple word sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat competitive with existing solutions and (crucially) is customisable.

There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!

I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall (precision and recall). Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but with low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code: https://github.com/ianozsvald/social_media_brand_disambiguator

I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have their name. That's why I believe this straightforward approach will work.

I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when you have a variable document length; for me the tweets are only one or two sentences, and my testing results didn't show improvement with TF-IDF).

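Concretely, that vectoriser configuration in scikit-learn looks something like this:

from sklearn.feature_extraction.text import CountVectorizer

# binary=True records only the presence/absence of a term (no counts),
# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))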

I use the basic tokenizer, which is crude but surprisingly useful. It ignores @ and # (so you lose some context) and of course doesn't expand a URL. I then train using logistic regression, and it seems that this problem is somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work).

The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.

This works for Apple as people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split to one category or the other. This condition may not hold when considering something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.

I have a series of blog posts describing this project, including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).

If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications, and that way you can force high precision by trading against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.

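A sketch of that precision-for-recall trade, assuming a fitted scikit-learn pipeline whose final step is a LogisticRegression (the threshold is something you would tune):

import numpy as np

def confident_predictions(pipe, tweets, threshold=0.9):
    # keep only the predictions the model is at least `threshold` sure of;
    # everything else is left unclassified, raising precision at the cost of recall
    probs = pipe.predict_proba(tweets)  # shape: (n_tweets, n_classes)
    confident = []
    for tweet, row in zip(tweets, probs):
        best = int(np.argmax(row))
        if row[best] >= threshold:
            confident.append((tweet, pipe.classes_[best]))
    return confident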

Here's a possible algorithmic approach using scikit-learn:

  • Use a binary CountVectorizer (I don't think term-counts in short messages add much information, as most words occur only once)
  • Start with a Decision Tree classifier. It'll have explainable performance (see Overfitting with a Decision Tree for an example).
  • Move to logistic regression
  • Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?); a coefficient-inspection sketch follows this list
  • Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach
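
As a sketch of the error-investigation bullet above (the helper name is made up; learn1.py in the linked repo is the worked version), the most heavily weighted terms can be read straight off a fitted LogisticRegression:

import numpy as np

def show_top_terms(vectorizer, logreg, n=20):
    # pair every vocabulary term with its logistic-regression weight;
    # which end maps to which class depends on how your labels are encoded
    names = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
    order = np.argsort(logreg.coef_[0])
    print("strongest class-0 terms:", names[order[:n]])
    print("strongest class-1 terms:", names[order[-n:]])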

Things to consider:

  • You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 per class (see the overfitting note below)
  • Improve the tokeniser (very easy with scikit-learn) to keep # and @ in tokens, and maybe add a capitalised-brand detector (as user @user2425429 notes)
  • Consider a non-linear classifier (like @oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
  • A tweet-specific part-of-speech tagger (in my humble opinion not Stanford's as @Neil suggests - it performs poorly on poor Twitter grammar in my experience)
  • Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1 l2 penalisation)

Re: overfitting. My dataset of 2000 items is a 10-minute snapshot of 'apple' tweets from Twitter. About 2/3 of the tweets are for Apple Inc, 1/3 for other apple uses. I pull out a balanced subset (about 584 rows, I think) of each class and do five-fold cross-validation for training.

Since I only have a 10-minute time window, I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshot, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.

Answered by Sudipta

You can do the following:

  1. Make a dict of words containing their count of occurrence in fruit-related and company-related tweets. This can be achieved by feeding it some sample tweets whose inclination we know.

  2. Using enough previous data, we can find out the probability of a word occurring in a tweet about Apple Inc.

  3. Multiply the individual probabilities of the words to get the probability of the whole tweet.

A simplified example:

p_f = Probability of fruit tweets.

p_w_f = Probability of a word occurring in a fruit tweet.

p_t_f = Combined probability of all the words in a tweet occurring in a fruit tweet = p_w1_f * p_w2_f * ...

p_f_t = Probability of the fruit sense, given a particular tweet.

p_c, p_w_c, p_t_c, p_c_t are the respective values for the company.

A Laplace smoother of value 1 is added to eliminate the problem of zero frequency for new words that are not in our database.

old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets = total_company_words = total_fruit_words = 0

for tweet in old_tweets:
    company = old_tweets[tweet]
    for word in tweet.lower().split(" "):
        if not word in known_words:
            known_words[word] = {"company":0, "fruit":0 }
        if company == "1":
            known_words[word]["company"] += 1
            total_company_words += 1
        else:
            known_words[word]["fruit"] += 1
            total_fruit_words += 1

    if company == "1":
        total_company_tweets += 1
    else:
        total_fruit_tweets += 1
total_tweets = len(old_tweets)

def predict_tweet(new_tweet, K=1):
    # Laplace-smoothed class priors. float() guards against Python 2
    # integer division; it is harmless on Python 3.
    p_f = (total_fruit_tweets + K) / float(total_tweets + K * 2)
    p_c = (total_company_tweets + K) / float(total_tweets + K * 2)
    new_words = new_tweet.lower().split(" ")

    p_t_f = p_t_c = 1.0
    for word in new_words:
        try:
            wordFound = known_words[word]
        except KeyError:
            wordFound = {'fruit': 0, 'company': 0}
        p_w_f = (wordFound['fruit'] + K) / float(total_fruit_words + K * len(known_words))
        p_w_c = (wordFound['company'] + K) / float(total_company_words + K * len(known_words))
        # accumulate inside the loop, otherwise only the last word counts
        p_t_f *= p_w_f
        p_t_c *= p_w_c

    # Applying Bayes' rule
    p_f_t = p_f * p_t_f / (p_t_f * p_f + p_t_c * p_c)
    p_c_t = p_c * p_t_c / (p_t_f * p_f + p_t_c * p_c)
    if p_c_t > p_f_t:
        return "Company"
    return "Fruit"

Answered by oiez

If you don't have an issue using an outside library, I'd recommend scikit-learn, since it can probably do this better and faster than anything you could code yourself. I'd just do something like this:

Build your corpus. I did the list comprehensions for clarity, but depending on how your data is stored you might need to do different things:

def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
    corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
    labels = [1 for x in xrange(len(apple_inc_tweets))] + [0 for x in xrange(len(apple_fruit_tweets))]
    return (corpus, labels)

The important thing is you end up with two lists that look like this:

(['apple inc tweet i love ios and iphones', 'apple iphones are great', 'apple fruit tweet i love pie', 'apple pie is great'], [1, 1, 0, 0])

The [1, 1, 0, 0] represent the positive and negative labels.

Then, you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text processing steps together so you only have to call one object when training/predicting:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

def train(corpus, labels):
    pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
                     ('tfidf', TfidfTransformer(norm='l2')),
                     ('clf', LinearSVC())])
    pipe.fit(corpus, labels)  # fit rather than fit_transform: the final step is a classifier
    return pipe

Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the words, splits them, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy rating (doing cross validation tests and a grid search for the best parameters is a bit involved, so I won't get into it here). The LinearSVC is a standard text classification algorithm.

Finally, you predict the category of tweets:

def predict(pipe, tweet):
    prediction = pipe.predict([tweet])
    return prediction

Again, the tweet needs to be in a list, so I assumed it was entering the function as a string.

Put all those into a class or whatever, and you're done. At least, with this very basic example.

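Tying the three functions together, a hypothetical end-to-end run (apple_inc_tweets and apple_fruit_tweets are assumed to be lists of labelled tweet strings):

corpus, labels = corpus_builder(apple_inc_tweets, apple_fruit_tweets)
pipe = train(corpus, labels)
print(predict(pipe, "the new apple ios update looks great"))  # hopefully [1]
print(predict(pipe, "baked an apple pie this weekend"))       # hopefully [0]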

I didn't test this code so it might not work if you just copy-paste, but if you want to use scikit-learn it should give you an idea of where to start.

EDIT: tried to explain the steps in more detail.

Answered by Paul Dubs

Using a decision tree seems to work quite well for this problem. At least it produces a higher accuracy than a naive bayes classifier with my chosen features.

If you want to play around with some possibilities, you can use the following code, which requires nltk to be installed. The nltk book is also freely available online, so you might want to read a bit about how all of this actually works: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

#coding: utf-8
import nltk
import random
import re

def get_split_sets():
    structured_dataset = get_dataset()
    train_set = set(random.sample(structured_dataset, int(len(structured_dataset) * 0.7)))
    test_set = [x for x in structured_dataset if x not in train_set]

    train_set = [(tweet_features(x[1]), x[0]) for x in train_set]
    test_set = [(tweet_features(x[1]), x[0]) for x in test_set]
    return (train_set, test_set)

def check_accuracy(times=5):
    s = 0
    for _ in xrange(times):
        train_set, test_set = get_split_sets()
        c = nltk.classify.DecisionTreeClassifier.train(train_set)
        # Uncomment to use a naive bayes classifier instead
        #c = nltk.classify.NaiveBayesClassifier.train(train_set)
        s += nltk.classify.accuracy(c, test_set)

    return s / times


def remove_urls(tweet):
    tweet = re.sub(r'http:\/\/[^ ]+', "", tweet)
    tweet = re.sub(r'pic.twitter.com/[^ ]+', "", tweet)
    return tweet

def tweet_features(tweet):
    words = [x for x in nltk.tokenize.wordpunct_tokenize(remove_urls(tweet.lower())) if x.isalpha()]
    features = dict()
    for bigram in nltk.bigrams(words):
        features["hasBigram(%s)" % ",".join(bigram)] = True
    for trigram in nltk.trigrams(words):
        features["hasTrigram(%s)" % ",".join(trigram)] = True  
    return features

def get_dataset():
    dataset = """copy dataset in here
"""
    structured_dataset = [('fruit' if x[0] == '0' else 'company', x[2:]) for x in dataset.splitlines()]
    return structured_dataset

if __name__ == '__main__':
    print check_accuracy()

Answered by SAL

Thank you for the comments thus far. Here is a working solution I prepared with PHP. I'd still be interested in hearing from others about more algorithmic approaches to this same problem.

<?php

// Confusion Matrix Init
$tp = 0;
$fp = 0;
$fn = 0;
$tn = 0;
$arrFP = array();
$arrFN = array();

// Load All Tweets to string
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://pastebin.com/raw.php?i=m6pP8ctM');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$strCorpus = curl_exec($ch);
curl_close($ch);

// Load Tweets as Array
$arrCorpus = explode("\n", $strCorpus);
foreach ($arrCorpus as $k => $v) {
    // init
    $blnActualClass = substr($v,0,1);
    $strTweet = trim(substr($v,2));

    // Score Tweet
    $intScore = score($strTweet);

    // Build Confusion Matrix and Log False Positives & Negatives for Review
    if ($intScore > 0) {
        if ($blnActualClass == 1) {
            // True Positive
            $tp++;
        } else {
            // False Positive
            $fp++;
            $arrFP[] = $strTweet;
        }
    } else {
        if ($blnActualClass == 1) {
            // False Negative
            $fn++;
            $arrFN[] = $strTweet;
        } else {
            // True Negative
            $tn++;
        }
    }
}

// Confusion Matrix and Logging
echo "
           Predicted
            1     0
Actual 1   $tp     $fp
Actual 0    $fn    $tn

";

if (count($arrFP) > 0) {
    echo "\n\nFalse Positives\n";
    foreach ($arrFP as $strTweet) {
        echo "$strTweet\n";
    }
}

if (count($arrFN) > 0) {
    echo "\n\nFalse Negatives\n";
    foreach ($arrFN as $strTweet) {
        echo "$strTweet\n";
    }
}

function LoadDictionaryArray() {
    $strDictionary = <<<EOD
10|iTunes
10|ios 7
10|ios7
10|iPhone
10|apple inc
10|apple corp
10|apple.com
10|MacBook
10|desk top
10|desktop
1|config
1|facebook
1|snapchat
1|intel
1|investor
1|news
1|labs
1|gadget
1|apple store
1|microsoft
1|android
1|bonds
1|Corp.tax
1|macs
-1|pie
-1|clientes
-1|green apple
-1|banana
-10|apple pie
EOD;

    $arrDictionary = explode("\n", $strDictionary);
    foreach ($arrDictionary as $k => $v) {
        $arr = explode('|', $v);
        $arrDictionary[$k] = array('value' => $arr[0], 'term' => strtolower(trim($arr[1])));
    }
    return $arrDictionary;
}

function score($str) {
    $str = strtolower($str);
    $intScore = 0;
    foreach (LoadDictionaryArray() as $arrDictionaryItem) {
        if (strpos($str,$arrDictionaryItem['term']) !== false) {
            $intScore += $arrDictionaryItem['value'];
        }
    }
    return $intScore;
}
?>

The above outputs:

           Predicted
            1     0
Actual 1   31     1
Actual 0    1    17


False Positives
1|Royals apple #ASGame @mlb @ News Corp Building http://instagram.com/p/bBzzgMrrIV/


False Negatives
-1|RT @MaxFreixenet: Apple no tiene clientes. Tiene FANS// error.... PAGAS por productos y apps, ergo: ERES CLIENTE.

Answered by user2425429

In all the examples that you gave, Apple (Inc.) was referred to as either Apple or apple inc, so a possible way could be to search for:

  • a capital "A" in Apple

  • an "inc" after apple

  • words/phrases like "OS", "operating system", "Mac", "iPhone", ...

  • or a combination of them

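A rough rule-based sketch of those heuristics in Python (the keyword list is illustrative and far from exhaustive; note that any tweet starting a sentence with "Apple" will trip the capital-letter rule):

import re

COMPANY_HINTS = re.compile(
    r'\b(ios|iphone|ipad|mac(book)?|itunes|operating system)\b', re.IGNORECASE)

def looks_like_company(tweet):
    if re.search(r'\bApple\b', tweet):                       # capital "A" in Apple
        return True
    if re.search(r'\bapple\s+inc\b', tweet, re.IGNORECASE):  # "inc" after apple
        return True
    return bool(COMPANY_HINTS.search(tweet))                 # product/OS vocabulary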

Answered by Adam Gibson

To simplify the answers based on Conditional Random Fields a bit... context is huge here. You will want to pick out features in those tweets that clearly distinguish Apple the company from apple the fruit. Let me outline a list of features here that might be useful for you to start with. For more information look up noun phrase chunking, and something called BIO labels. See http://www.cis.upenn.edu/~pereira/papers/crf.pdf

Surrounding words: Build a feature vector for the previous word and the next word, or if you want more features perhaps the previous 2 and next 2 words. You don't want too many words in the model or it won't match the data very well. In Natural Language Processing, you are going to want to keep this as general as possible.

Other features to get from surrounding words include the following:

  • Whether the first character is a capital
  • Whether the last character in the word is a period
  • The part of speech of the word (look up part-of-speech tagging)
  • The text of the word itself
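
A sketch of the per-token feature dictionary these suggestions add up to, in the shape most CRF packages accept (the feature names are made up):

def token_features(tokens, i):
    # feature dict for tokens[i], with a +/-1 word window for context
    word = tokens[i]
    return {
        'word': word.lower(),
        'first_char_is_capital': word[:1].isupper(),
        'last_char_is_period': word.endswith('.'),
        'prev_word': tokens[i - 1].lower() if i > 0 else '<START>',
        'next_word': tokens[i + 1].lower() if i + 1 < len(tokens) else '<END>',
        # a real system would also add a part-of-speech feature here
    }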

I don't advise this, but to give more examples of features specifically for Apple:

WordIs(Apple)

NextWordIs(Inc.)

You get the point. Think of Named Entity Recognition as describing a sequence, and then using some math to tell a computer how to calculate that.

Keep in mind that natural language processing is a pipeline-based system. Typically, you break things into sentences, move to tokenization, then do part-of-speech tagging or even dependency parsing.

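That pipeline, sketched with NLTK (assuming its standard tokenizer and tagger models are installed):

import nltk

text = "I was eating an apple over at Apple headquarters. Then I left."
for sentence in nltk.sent_tokenize(text):  # 1. sentence splitting
    tokens = nltk.word_tokenize(sentence)  # 2. tokenization
    print(nltk.pos_tag(tokens))            # 3. part-of-speech tagging
# 'Apple' typically comes back NNP (proper noun) while 'apple' is NN,
# which is itself a useful disambiguation feature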

This is all to get you a list of features you can use in your model to identify what you're looking for.

Answered by Pushpendre

Use LibShortText. This Python utility has already been tuned to work for short text categorization tasks, and it works well. The most you'll have to do is write a loop to pick the best combination of flags. I used it to do supervised speech-act classification in emails and the results were up to 95-97% accurate (during 5-fold cross validation!).

And it comes from the makers of LIBSVM and LIBLINEAR, whose support vector machine (SVM) implementation is used in sklearn and cran, so you can be reasonably assured that their implementation is not buggy.
