Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50249759/

Date: 2020-09-14 05:32:48  Source: igfitidea

Apache Airflow or Apache Beam for data processing and job scheduling

Tags: pandas, airflow, apache-beam

Asked by LouisB

I'm trying to give useful information but I am far from being a data engineer.

I am currently using the Python library pandas to execute a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month.

I don't really know Beam or Airflow; I quickly read through the docs and it seems that both can achieve that. Which one should I use?

Accepted answer by kaxil

Apache Airflow is not a data processing engine.

Airflow is a platform to programmatically author, schedule, and monitor workflows.

Cloud Dataflow is a fully managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor Dataflow jobs. Airflow also allows you to retry your job if it fails (the number of retries is configurable). You can also configure Airflow to send alerts to Slack or email if your Dataflow pipeline fails.
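For illustration, here is a minimal sketch of what that setup could look like, assuming Airflow 2.x import paths; the project name, email address, and pipeline script are placeholders, and a BashOperator launching the Beam pipeline is just one option (the Google provider package also ships dedicated Dataflow operators).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                        # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,            # or hook up a Slack callback instead
    "email": ["data-team@example.com"],  # hypothetical alert address
}

with DAG(
    dag_id="monthly_dataflow_job",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@monthly",
    default_args=default_args,
    catchup=False,
) as dag:
    launch_dataflow = BashOperator(
        task_id="launch_dataflow",
        # Launch a Beam pipeline on the Dataflow runner; the script path and
        # project are placeholders for your own pipeline.
        bash_command=(
            "python /pipelines/my_pipeline.py "
            "--runner DataflowRunner --project my-gcp-project"
        ),
    )
```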

Answer by cryanbhu

The other answers are quite technical and hard to understand. I was in your position before, so I'll explain in simple terms.

Airflow can do anything. It has a BashOperator and a PythonOperator, which means it can run any bash script or any Python script.
It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, in an easy-to-view and easy-to-use UI.
Also, it is easy to set up and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e. using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
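As a rough sketch of what that looks like in practice (Airflow 2.x import paths; the file paths and the transform body are made up for the example), a monthly pipeline that runs a pandas transformation after a fetch step could be defined like this:

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform(**_):
    # Placeholder pandas transformation: read the inputs, reshape, write out.
    df = pd.read_csv("/data/input.csv")
    df.groupby("category").sum().to_excel("/data/report.xlsx")


with DAG(
    dag_id="monthly_pandas_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    fetch = BashOperator(
        task_id="fetch_inputs",
        bash_command="cp /incoming/*.csv /data/",  # any shell command works here
    )
    process = PythonOperator(task_id="pandas_transform", python_callable=transform)

    fetch >> process  # run the transform only after the fetch succeeds
```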

Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink, etc.) out there.
The intent is that you learn just Beam and can run it on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.

Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
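A minimal sketch of that idea, assuming the apache-beam Python SDK (file names are placeholders): the pipeline below runs locally on the DirectRunner, and the same code runs on Cloud Dataflow by switching the runner options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local run with the DirectRunner; pass e.g. --runner=DataflowRunner,
# --project, --region and --temp_location instead to execute the same
# pipeline on Cloud Dataflow.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "DropEmpty" >> beam.Filter(lambda row: row and row[0])
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("output")
    )
```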

GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).

So you can use:
- a manual Airflow implementation, doing the data processing on the instance itself (if your data is small, or your instance is powerful enough, you can process data on the machine running Airflow; this is why many are confused about whether Airflow can process data or not)
- a manual Airflow implementation calling Beam jobs
- Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
- Cloud Composer running data processing containers in Composer's own Kubernetes cluster environment, using Airflow's KubernetesPodOperator (KPO); see the sketch after this list
- Cloud Composer running data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but this time in a better-isolated fashion, by creating a new node pool and specifying that the KPO pods are to be run in that node pool
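Here is a hypothetical sketch of the KubernetesPodOperator option mentioned above; the import path varies between Airflow/provider versions, and the image, namespace, and command are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="pod_based_processing",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    process_in_pod = KubernetesPodOperator(
        task_id="pandas_transform_pod",
        name="pandas-transform",
        namespace="default",
        image="gcr.io/my-project/pandas-job:latest",  # your own container image
        cmds=["python", "run_transform.py"],
        get_logs=True,
        # To isolate the workload, schedule the pod onto a dedicated node pool
        # (e.g. via a node selector or affinity; the exact argument name differs
        # across Airflow/provider versions).
    )
```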

My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement); you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience with Airflow, if you want to be a data engineer you should probably learn it.
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself and managing the Airflow webserver and scheduler processes.

Answer by JanKanis

Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.

But when you dig a bit deeper there are big differences in what they do and the programming models they support.

Airflow is a task management system. The nodes of the DAG are tasks and Airflow makes sure to run them in the proper order, making sure one task only starts once its dependency tasks have finished. Dependent tasks don't run at the same time but only one after another. Independent tasks can run concurrently.

Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on it.

The two have some overlapping use cases but there are a lot of things only one of the two can do well.

Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact, Airflow doesn't even care what the tasks do; it just needs to start them and see whether they finished or failed. If tasks need to pass data to one another, you need to coordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code, but they can also be any external program or a web service call.
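A small sketch of that coordination, assuming Airflow 2.x: the first task writes its output to an agreed location and shares only the path (here via XCom, which is meant for small metadata, not the data itself), and the downstream task reads from that path. All paths and names are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(ti):
    path = "/data/extracted.csv"
    # ... write the extracted data to `path` here ...
    ti.xcom_push(key="csv_path", value=path)  # share only the location, not the data


def transform(ti):
    path = ti.xcom_pull(task_ids="extract", key="csv_path")
    # ... read `path` and do the actual processing here ...
    print(f"processing {path}")


with DAG(
    dag_id="handoff_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t1 >> t2
```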

In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult, if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.

In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, the data processing is part of the core interfaces, so it can't really do anything else. An invocation of a Beam step typically handles a single or a few data elements, not the full dataset. Because of this, Beam also supports datasets of unbounded length, which is not something Airflow can natively cope with.
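To make the contrast concrete, here is a small Beam sketch (the record format is made up): the DoFn's process method is called per element, so the same transform works whether the source is a bounded file or an unbounded stream.

```python
import apache_beam as beam


class ParseRecord(beam.DoFn):
    def process(self, element):
        # `element` is a single line/record, never the whole dataset.
        fields = element.split(",")
        if len(fields) >= 2:
            yield {"id": fields[0], "value": float(fields[1])}


with beam.Pipeline() as p:
    (
        p
        | beam.Create(["a,1.5", "b,2.0", "bad-row"])  # stand-in for a real source
        | beam.ParDo(ParseRecord())
        | beam.Map(print)
    )
```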

Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. By the way, Google has also cloudified Airflow into a service, Google Cloud Composer.

*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.
