scala - How can I create a TF-IDF for Text Classification using Spark?
Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24548290/
How can I create a TF-IDF for Text Classification using Spark?
Asked by eliasah
I have a CSV file with the following format:
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is an integer and the product_titleX is a String, for example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
So far I am using Spark with Scala, following the tutorials I have found on the official page and at Berkeley AmpCamp 3 and 4.
So I'm reading the file:
val file = sc.textFile("offers.csv")
Then I'm mapping it into tuples, as an RDD[Array[String]]:
val tuples = file.map(line => line.split(",")).cache
and after that I'm transforming the tuples into pairs, RDD[(Int, String)]:
val pairs = tuples.map(line => (line(0).toInt, line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks
Answered by Metropolis
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
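A minimal sketch (not from the original answer) of how these two structures could be built from the offers.csv file, assuming a naive whitespace tokenizer and using the raw tokens themselves as token_ids:

lines = sc.textFile("offers.csv").map(lambda line: line.split(","))

# corpus: (document_id, [token_ids]) -- the product title split into tokens
corpus = lines.map(lambda cols: (cols[0], cols[1].lower().split()))

# inv_index: (token_id, [document_ids]) -- corpus inverted, one entry per
# distinct (token, document) pair so each document is listed at most once
inv_index = (corpus
             .flatMap(lambda doc: [(token, doc[0]) for token in doc[1]])
             .distinct()
             .groupByKey()
             .mapValues(list))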
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter

def wc_per_row(row):
    # Count how many times each token occurs in one document
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return cnt.items()

# tf: (document_id, [(token_id, count), ...])
tf = corpus.map(lambda (x, y): (x, wc_per_row(y)))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda (x, y): (x, len(y)))
num_documents = tf.count()

# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda (k, v): (k, 1. + log10(float(num_documents) / v))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples
                          for (k2, v2) in idf_tuples if k1 == k2]

tfidf = tf.map(lambda (k, v): (k, calc_tfidf(v, idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
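One way to avoid that collect, sketched here only as a rough alternative (the names idf_rdd and tf_by_token are not from the original answer), is to keep the idf weights in an RDD and join on the token_id instead:

# Keep idf as an RDD instead of collecting it to the driver
idf_rdd = df.mapValues(lambda v: 1. + log10(float(num_documents) / v))

# Re-key tf by token so it can be joined with idf_rdd on token_id
tf_by_token = tf.flatMap(
    lambda kv: [(token, (kv[0], count)) for (token, count) in kv[1]])

# The join yields (token_id, ((document_id, count), idf)); regroup by document
tfidf = (tf_by_token
         .join(idf_rdd)
         .map(lambda kv: (kv[1][0][0], (kv[0], kv[1][0][1] * kv[1][1])))
         .groupByKey()
         .mapValues(list))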
If anyone can improve on this, I'm very interested.
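For reference, more recent Spark releases also ship a built-in HashingTF and IDF in MLlib (pyspark.mllib.feature), which sidestep the manual inverted index and token_id mapping entirely. A minimal sketch, assuming the same naive whitespace tokenization of the product titles from offers.csv:

from pyspark.mllib.feature import HashingTF, IDF

# One list of tokens per product title
documents = sc.textFile("offers.csv").map(lambda line: line.split(",")[1].lower().split())

hashing_tf = HashingTF()            # hashes tokens into a fixed-size feature space
tf_vectors = hashing_tf.transform(documents)
tf_vectors.cache()                  # IDF.fit makes a second pass over the data

idf_model = IDF().fit(tf_vectors)
tfidf_vectors = idf_model.transform(tf_vectors)

The resulting tfidf_vectors is an RDD of sparse vectors that can be wrapped in LabeledPoint objects (with whatever class labels are available) and passed to NaiveBayes.train from pyspark.mllib.classification, which is what the original question is after.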

