Python PySpark: partitioning data using partitionBy

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/35973590/

Date: 2020-08-19 17:13:16  Source: igfitidea

pyspark partitioning data using partitionby

python, apache-spark, pyspark, partitioning, rdd

Asked by user2543622

I understand that the partitionBy function partitions my data. If I use rdd.partitionBy(100), it will partition my data by key into 100 parts, i.e. data associated with similar keys will be grouped together.


  1. Is my understanding correct?
  2. Is it advisable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
  3. What if my data is not in key, value format? Can I still use this function?
  4. Let's say my data is serial_number_of_student,student_name. In this case, can I partition my data by student_name instead of the serial_number?
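For reference, here is a minimal, hypothetical sketch of the call being asked about (the data and variable names are made up):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# partitionBy works on a pair RDD, i.e. an RDD of (key, value) tuples.
pairs = sc.parallelize([(i % 10, i) for i in range(1000)])

# Hash-partition by key into 100 partitions.
partitioned = pairs.partitionBy(100)
print(partitioned.getNumPartitions())  # 100
```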

Answered by zero323

  1. Not exactly. Spark, including PySpark, uses hash partitioning by default. Apart from identical keys, there is no practical similarity between keys assigned to a single partition.
  2. There is no simple answer here. It all depends on the amount of data and the available resources. Too many or too few partitions will degrade performance.

    Some resources claim the number of partitions should be around twice the number of available cores. On the other hand, a single partition typically shouldn't contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB (see SPARK-6235).

    Finally, you have to correct for potential data skew. If some keys are overrepresented in your dataset, it can result in suboptimal resource usage and potential failures. (A short sketch of points 1 and 2 follows this list.)

  3. No, or at least not directly. You can use the keyBy method to convert an RDD to the required format. Moreover, any Python object can be treated as a key-value pair as long as it implements the required methods that make it behave like an Iterable of length two. See How to determine if object is a valid key-value pair in PySpark.

  4. It depends on the types. As long as the key is hashable*, then yes. Typically that means it has to be an immutable structure, and all the values it contains have to be immutable as well. For example, a list is not a valid key but a tuple of integers is. (A sketch of points 3 and 4 follows the quoted glossary entry below.)
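A small sketch of points 1 and 2, assuming a local SparkContext; the integer keys are made up, and glom() is used only to peek at which keys end up together:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Point 1: the default partitioner hashes the key, so keys are grouped by
# hash(key) % numPartitions, not by any notion of "similarity".
pairs = sc.parallelize([(k, None) for k in range(12)])
for i, part in enumerate(pairs.partitionBy(4).glom().collect()):
    print("partition", i, "keys:", sorted(k for k, _ in part))
# With small integer keys this typically groups keys by k % 4, so 0, 4 and 8
# share a partition while the "similar" keys 0 and 1 do not.

# Point 2: the available core count is a common reference point when picking
# a partition count, but it is only a starting point.
print("default parallelism:", sc.defaultParallelism)
```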


To quote the Python glossary:


An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.

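A sketch of points 3 and 4 using made-up student records: keyBy builds (key, value) pairs keyed by student_name, and the tuple-vs-list contrast illustrates the hashability rule quoted above:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (serial_number_of_student, student_name) records.
students = sc.parallelize([(1, "alice"), (2, "bob"), (3, "alice"), (4, "carol")])

# Point 3: keyBy turns a plain RDD into (key, value) pairs.
# Point 4: the key here is student_name, a hashable str, so partitionBy accepts it.
by_name = students.keyBy(lambda record: record[1]).partitionBy(4)
for i, part in enumerate(by_name.glom().collect()):
    print("partition", i, [key for key, _ in part])

# Hashability: a tuple of integers is a valid key, a mutable list is not.
hash((1, 2))           # works
try:
    hash([1, 2])
except TypeError as err:
    print(err)         # unhashable type: 'list'
```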

Answered by Souvik Saha Bhowmik

I recently used partitionBy. What I did was restructure my data so that all the records I want to put in the same partition have the same key, which in turn is a value from the data. My data was a list of dictionaries, which I converted into tuples with the key taken from the dictionary. Initially partitionBy was not keeping the same keys in the same partition. But then I realized the keys were strings. I cast them to int, but the problem persisted. The numbers were very large. I then mapped these numbers to small numeric values and it worked. So my takeaway was that the keys need to be small integers.

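A rough sketch of the restructuring described above, with made-up field names; mapping the large string ids to small consecutive integers mirrors what the answer reports doing:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical input: a list of dicts whose id field is a large number stored as a string.
records = [{"id": "900000001", "name": "alice"},
           {"id": "900000042", "name": "bob"},
           {"id": "900000001", "name": "carol"}]

# Map each distinct id to a small consecutive integer.
id_to_small = {rec_id: i for i, rec_id in enumerate(sorted({r["id"] for r in records}))}

# Convert each dict into a (small_int_key, record) tuple and partition by that key.
pairs = sc.parallelize(records).map(lambda r: (id_to_small[r["id"]], r))
partitioned = pairs.partitionBy(2)

for i, part in enumerate(partitioned.glom().collect()):
    print("partition", i, [key for key, _ in part])
```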