Group Spark DataFrame by date (Python)
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/34946051/
Group spark dataframe by date
Asked by César García Tapia
I've loaded a DataFrame from a SQLServer table. It looks like this:
>>> df.show()
+--------------------+----------+
|           timestamp|     Value|
+--------------------+----------+
|2015-12-02 00:10:...|     652.8|
|2015-12-02 00:20:...|     518.4|
|2015-12-02 00:30:...|     524.6|
|2015-12-02 00:40:...|     382.9|
|2015-12-02 00:50:...|     461.6|
|2015-12-02 01:00:...|     476.6|
|2015-12-02 01:10:...|     472.6|
|2015-12-02 01:20:...|     353.0|
|2015-12-02 01:30:...|     407.9|
|2015-12-02 01:40:...|     475.9|
|2015-12-02 01:50:...|     513.2|
|2015-12-02 02:00:...|     569.0|
|2015-12-02 02:10:...|     711.4|
|2015-12-02 02:20:...|     457.6|
|2015-12-02 02:30:...|     392.0|
|2015-12-02 02:40:...|     459.5|
|2015-12-02 02:50:...|     560.2|
|2015-12-02 03:00:...|     252.9|
|2015-12-02 03:10:...|     228.7|
|2015-12-02 03:20:...|     312.2|
+--------------------+----------+
Now I'd like to group (and sum) the values by hour (or day, or month, or...), but I don't really have a clue about how I can do that.
That's how I load the DataFrame. I've got the feeling that this isn't the right way to do it, though:
query = """
SELECT column1 AS timestamp, column2 AS value
FROM table
WHERE blahblah
"""
sc = SparkContext("local", 'test')
sqlctx = SQLContext(sc)
df = sqlctx.load(source="jdbc",
url="jdbc:sqlserver://<CONNECTION_DATA>",
dbtable="(%s) AS alias" % query)
Is it ok?
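(For reference, a minimal sketch of the same load written with the DataFrameReader API available since Spark 1.4; the connection string and query are just the placeholders from the question above:)

df = sqlctx.read.format("jdbc").options(
    url="jdbc:sqlserver://<CONNECTION_DATA>",
    dbtable="(%s) AS alias" % query).load()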
Accepted answer by zero323
Since 1.5.0 Spark provides a number of functions like dayofmonth, hour, month or year which can operate on dates and timestamps. So if timestamp is a TimestampType, all you need is a correct expression. For example:
from pyspark.sql.functions import hour, mean

(df
    .groupBy(hour("timestamp").alias("hour"))
    .agg(mean("value").alias("mean"))
    .show())
## +----+------------------+
## |hour|              mean|
## +----+------------------+
## |   0|508.05999999999995|
## |   1| 449.8666666666666|
## |   2| 524.9499999999999|
## |   3|264.59999999999997|
## +----+------------------+
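To group and sum per calendar day instead, a minimal sketch along the same lines (assuming the same column names as above; to_date is also available from 1.5.0):

from pyspark.sql.functions import to_date, sum as sum_

(df
    .groupBy(to_date("timestamp").alias("day"))
    .agg(sum_("value").alias("value_sum"))
    .show())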
Pre-1.5.0 your best option is to use HiveContext and Hive UDFs, either with selectExpr:
df.selectExpr("year(timestamp) AS year", "value").groupBy("year").sum()
## +----+---------+----------+
## |year|SUM(year)|SUM(value)|
## +----+---------+----------+
## |2015|    40300|    9183.0|
## +----+---------+----------+
or raw SQL:
df.registerTempTable("df")
sqlContext.sql("""
SELECT MONTH(timestamp) AS month, SUM(value) AS values_sum
FROM df
GROUP BY MONTH(timestamp)""")
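Note that the bare .sum() in the selectExpr variant aggregates every numeric column, which is why SUM(year) shows up in the output above. A hedged sketch that sums only value, using the same column names:

(df.selectExpr("year(timestamp) AS year", "value")
   .groupBy("year")
   .agg({"value": "sum"})
   .show())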
Just remember that the aggregation is performed by Spark and not pushed down to the external source. Usually this is the desired behavior, but there are situations where you may prefer to perform the aggregation as a subquery to limit data transfer.
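For example, a rough sketch of pushing a daily aggregation into the JDBC subquery itself, so that only the aggregated rows leave the database (the table and column names are just the placeholders from the question):

agg_query = """
SELECT CAST(column1 AS DATE) AS day, SUM(column2) AS value_sum
FROM table
WHERE blahblah
GROUP BY CAST(column1 AS DATE)
"""

df_daily = sqlctx.load(source="jdbc",
                       url="jdbc:sqlserver://<CONNECTION_DATA>",
                       dbtable="(%s) AS alias" % agg_query)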
Answered by Morit
Also, you can use date_format to create any time period you wish. Group by a specific day:
from pyspark.sql import functions as F
df.select(F.date_format('timestamp','yyyy-MM-dd').alias('day')).groupby('day').count().show()
Group by a specific month (just change the format):
df.select(F.date_format('timestamp','yyyy-MM').alias('month')).groupby('month').count().show()
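Since the original goal was a sum rather than a count, a small sketch combining date_format with an aggregation (the value column name is assumed from the question):

(df
    .groupby(F.date_format('timestamp', 'yyyy-MM-dd').alias('day'))
    .agg(F.sum('value').alias('value_sum'))
    .show())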