Group Spark DataFrame by date (Python)
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/34946051/
Group spark dataframe by date
Asked by César García Tapia
I've loaded a DataFrame from a SQLServer table. It looks like this:
>>> df.show()
+--------------------+----------+
|           timestamp|     Value|
+--------------------+----------+
|2015-12-02 00:10:...|     652.8|
|2015-12-02 00:20:...|     518.4|
|2015-12-02 00:30:...|     524.6|
|2015-12-02 00:40:...|     382.9|
|2015-12-02 00:50:...|     461.6|
|2015-12-02 01:00:...|     476.6|
|2015-12-02 01:10:...|     472.6|
|2015-12-02 01:20:...|     353.0|
|2015-12-02 01:30:...|     407.9|
|2015-12-02 01:40:...|     475.9|
|2015-12-02 01:50:...|     513.2|
|2015-12-02 02:00:...|     569.0|
|2015-12-02 02:10:...|     711.4|
|2015-12-02 02:20:...|     457.6|
|2015-12-02 02:30:...|     392.0|
|2015-12-02 02:40:...|     459.5|
|2015-12-02 02:50:...|     560.2|
|2015-12-02 03:00:...|     252.9|
|2015-12-02 03:10:...|     228.7|
|2015-12-02 03:20:...|     312.2|
+--------------------+----------+
Now I'd like to group (and sum) the values by hour (or day, or month, or...), but I don't really have a clue about how I can do that.
That's how I load the DataFrame. I've got the feeling that this isn't the right way to do it, though:
query = """
SELECT column1 AS timestamp, column2 AS value
FROM table
WHERE blahblah
"""
sc = SparkContext("local", 'test')
sqlctx = SQLContext(sc)
df = sqlctx.load(source="jdbc",
url="jdbc:sqlserver://<CONNECTION_DATA>",
dbtable="(%s) AS alias" % query)
Is it ok?
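(For reference, a minimal sketch of the same load written with the DataFrameReader API available since Spark 1.4; the connection string and query are just the placeholders from the question above:)

df = sqlctx.read.format("jdbc").options(
    url="jdbc:sqlserver://<CONNECTION_DATA>",
    dbtable="(%s) AS alias" % query).load()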
Accepted answer by zero323
Since 1.5.0 Spark provides a number of functions like dayofmonth, hour, month or year which can operate on dates and timestamps. So if timestamp is a TimestampType, all you need is a correct expression. For example:
from pyspark.sql.functions import hour, mean

(df
    .groupBy(hour("timestamp").alias("hour"))
    .agg(mean("value").alias("mean"))
    .show())
## +----+------------------+
## |hour|              mean|
## +----+------------------+
## |   0|508.05999999999995|
## |   1| 449.8666666666666|
## |   2| 524.9499999999999|
## |   3|264.59999999999997|
## +----+------------------+
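To group and sum per calendar day instead, a minimal sketch along the same lines (assuming the same column names as above; to_date is also available from 1.5.0):

from pyspark.sql.functions import to_date, sum as sum_

(df
    .groupBy(to_date("timestamp").alias("day"))
    .agg(sum_("value").alias("value_sum"))
    .show())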
Pre-1.5.0 your best option is to use HiveContext and Hive UDFs, either with selectExpr:
df.selectExpr("year(timestamp) AS year", "value").groupBy("year").sum()
## +----+---------+----------+
## |year|SUM(year)|SUM(value)|
## +----+---------+----------+
## |2015|    40300|    9183.0|
## +----+---------+----------+
or raw SQL:
df.registerTempTable("df")
sqlContext.sql("""
SELECT MONTH(timestamp) AS month, SUM(value) AS values_sum
FROM df
GROUP BY MONTH(timestamp)""")
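Note that the bare .sum() in the selectExpr variant aggregates every numeric column, which is why SUM(year) shows up in the output above. A hedged sketch that sums only value, using the same column names:

(df.selectExpr("year(timestamp) AS year", "value")
   .groupBy("year")
   .agg({"value": "sum"})
   .show())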
Just remember that the aggregation is performed by Spark and not pushed down to the external source. Usually this is the desired behavior, but there are situations where you may prefer to perform the aggregation as a subquery to limit data transfer.
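For example, a rough sketch of pushing a daily aggregation into the JDBC subquery itself, so that only the aggregated rows leave the database (the table and column names are just the placeholders from the question):

agg_query = """
SELECT CAST(column1 AS DATE) AS day, SUM(column2) AS value_sum
FROM table
WHERE blahblah
GROUP BY CAST(column1 AS DATE)
"""

df_daily = sqlctx.load(source="jdbc",
                       url="jdbc:sqlserver://<CONNECTION_DATA>",
                       dbtable="(%s) AS alias" % agg_query)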
Answered by Morit
Also, you can use date_format to create any time period you wish. Group by a specific day:
from pyspark.sql import functions as F
df.select(F.date_format('timestamp','yyyy-MM-dd').alias('day')).groupby('day').count().show()
Group by a specific month (just change the format):
df.select(F.date_format('timestamp','yyyy-MM').alias('month')).groupby('month').count().show()
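Since the original goal was a sum rather than a count, a small sketch combining date_format with an aggregation (the value column name is assumed from the question):

(df
    .groupby(F.date_format('timestamp', 'yyyy-MM-dd').alias('day'))
    .agg(F.sum('value').alias('value_sum'))
    .show())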