
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question on StackOverflow: http://stackoverflow.com/questions/38407784/


Check if S3 file has been modified

bash, shell, amazon-s3

Asked by loop

How can I use a shell script to check whether an Amazon S3 file (a small .xml file) has been modified? I'm currently using curl to check every 10 seconds, but this makes a lot of GET requests.

curl "s3.aws.amazon.com/bucket/file.xml"
if cmp "file.xml" "current.xml"
then
     echo "no change"
else
     echo "file changed"
     cp "file.xml" "current.xml"
fi 
sleep(10s)

Is there a better way to check every 10 seconds that reduces the number of GET requests? (This is built on top of a Rails app, so I could possibly build a handler in Rails?)

Answered by Bruno Reis

Let me start by telling you some facts about S3. You might already know this, but in case you don't, you will see that your current code could have some "unexpected" behavior.

S3 and "Eventual Consistency"

S3 和“最终一致性”

S3 provides "eventual consistency" for overwritten objects. From the S3 FAQ, you have:

S3 为被覆盖的对象提供“最终一致性”。从S3 常见问题解答中,您有:

Q: What data consistency model does Amazon S3 employ?

Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.


Eventual consistency for overwrites means that, whenever an object is updated (i.e., whenever your small XML file is overwritten), clients retrieving the file MAY see the new version, or they MAY see the old version. For how long? For an unspecified amount of time. It typically achieves consistency in much less than 10 seconds, but you have to assume that it could, eventually, take more than 10 seconds to achieve consistency. More interestingly (sadly?), even after a successful retrieval of the new version, clients MAY still receive the older version later.

One thing that you can be assured of is: if a client starts downloading a version of the file, it will download that entire version (in other words, there's no chance that you would receive, for example, the first half of the XML file as the old version and the second half as the new version).

With that in mind, notice that your script could fail to identify the change within your 10-second timeframe: you could make multiple requests, even after a change, before your script downloads the changed version. And even then, after you detect the change, it is (unfortunately) entirely possible that the next request would download the previous(!) version and trigger yet another "change" in your code, and then the next would give the current version and trigger yet another "change" in your code!



If you are OK with the fact that S3 provides eventual consistency, there's a way you could possibly improve your system.

Idea 1: S3 event notifications + SNS

You mentioned that you thought about using SNS. That could definitely be an interesting approach: you could enable S3 event notifications and then get a notification through SNS whenever the file is updated.
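
As a rough sketch (the bucket name and topic ARN below are placeholders), enabling S3 event notifications that publish to an SNS topic could look something like this with the AWS CLI; note that the topic's access policy must also allow S3 to publish to it:

# publish a notification to SNS whenever an object is created or overwritten in the bucket
aws s3api put-bucket-notification-configuration \
    --bucket my-bucket \
    --notification-configuration '{
        "TopicConfigurations": [{
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:file-updates",
            "Events": ["s3:ObjectCreated:*"]
        }]
    }'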

How do you get the notification? You would need to create a subscription, and here you have a few options.

Idea 1.1: S3 event notifications + SNS + a "web app"

If you have a "web application", i.e., anything running on a publicly accessible HTTP endpoint, you could create an HTTP subscriber, so SNS will call your server with the notification whenever it happens. This might or might not be possible or desirable in your scenario.
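
A minimal sketch of such a subscription, assuming a hypothetical https://example.com/s3-events endpoint (SNS will first send a SubscriptionConfirmation message that your endpoint has to confirm):

# subscribe an HTTPS endpoint to the topic that receives the S3 notifications
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123456789012:file-updates \
    --protocol https \
    --notification-endpoint https://example.com/s3-events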

Idea 2: S3 event notifications + SQS

You could create a message queue in SQS and have S3 deliver the notifications directly to the queue. This would also be possible as S3 event notifications + SNS + SQS, since you can add a queue as a subscriber to an SNS topic (the advantage being that, in case you need to add functionality later, you could add more queues and subscribe them to the same topic, therefore getting "multiple copies" of the notification).

To retrieve the notification you'd make a call to SQS. You'd still have to poll - i.e., have a loop and call GET on SQS (which costs about the same as S3 GETs, or maybe a tiny bit more depending on the region). The slight difference is that you could reduce the total number of requests a bit -- SQS supports long-polling requests of up to 20 seconds: you make the GET call on SQS and, if there are no messages, SQS holds the request for up to 20 seconds, returning immediately if a message arrives, or returning an empty response if no messages are available within those 20 seconds. So, you would send only 1 GET every 20 seconds and get faster notifications than you currently do. You could potentially halve the number of GETs you make (once every 10s to S3 vs once every 20s to SQS).

Also - you could choose to use one single SQS queue to aggregate all changes to all XML files, or multiple SQS queues, one per XML file. With a single queue, you would greatly reduce the overall number of GET requests. With one queue per XML file, that's when you could potentially "halve" the number of GET requests compared to what you have now.
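
For illustration, a long-polling loop against SQS might look roughly like this (the queue URL is a placeholder, and parsing of the response body is left out):

while true; do
    # long poll: SQS holds this call for up to 20 seconds if the queue is empty
    msgs=$(aws sqs receive-message \
        --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/file-updates \
        --wait-time-seconds 20 \
        --max-number-of-messages 10)
    if [ -n "$msgs" ]; then
        echo "file changed - refresh the local copy"
        # remember to delete the received messages from the queue afterwards
    fi
done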

Idea 3: S3 event notifications + AWS Lambda

You can also use a Lambda function for this. This could require some more changes in your environment - you wouldn't use a shell script to poll; instead, S3 can be configured to call a Lambda function for you in response to an event, such as an update to your XML file. You could write your code in Java, JavaScript or Python (some people have devised "hacks" to use other languages as well, including Bash).

The beauty of this is that there's no more polling, and you don't have to maintain a web server (as in "idea 1.1"). Your code "simply runs" whenever there's a change.
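
As a sketch under assumed names (the function name, bucket and account ID are placeholders), wiring S3 directly to a Lambda function could look something like this; S3 also has to be granted permission to invoke the function:

# allow S3 to invoke the function
aws lambda add-permission \
    --function-name notify-file-changed \
    --statement-id s3-invoke \
    --action lambda:InvokeFunction \
    --principal s3.amazonaws.com \
    --source-arn arn:aws:s3:::my-bucket

# invoke the function whenever an object in the bucket is created or overwritten
aws s3api put-bucket-notification-configuration \
    --bucket my-bucket \
    --notification-configuration '{
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:notify-file-changed",
            "Events": ["s3:ObjectCreated:*"]
        }]
    }'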

Notice that, no matter which one of these ideas you use, you still have to deal with eventual consistency. In other words, you'd know that a PUT/POST has happened, but once your code sends a GET, you could still receive the older version...

Idea 4: Use DynamoDB instead

If you have the ability to make a more structural change to the system, you could consider using DynamoDB for this task.

The reason I suggest this is that DynamoDB supports strong consistency, even for updates. Notice that it's not the default - by default, DynamoDB operates in eventual consistency mode, but the "retrieval" operations (GetItem, for example) support fully consistent reads.

Also, DynamoDB has what we call "DynamoDB Streams", a mechanism that allows you to get a stream of changes made to any (or all) items on your table. These notifications can be polled, or they can even be used in conjunction with a Lambda function that is called automatically whenever a change happens! This, plus the fact that DynamoDB can be used with strong consistency, could possibly help you solve your problem.

In DynamoDB, it's usually a good practice to keep the records small. You mentioned in your comments that your XML files are about 2 kB - I'd say that could be considered "small enough", so it would be a good fit for DynamoDB. (The reasoning: DynamoDB reads are typically billed in multiples of 4 kB, so fully reading one of your XML files would consume just 1 read; also, depending on how you do it, for example using a Query operation instead of a GetItem operation, you could possibly read 2 XML files from DynamoDB while consuming just 1 read operation.)
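
For example (the table and attribute names here are made up for illustration), a strongly consistent read of the stored XML from a shell script might look like:

# read the item with strong consistency; the result always reflects the latest write
aws dynamodb get-item \
    --table-name xml-files \
    --key '{"FileName": {"S": "file.xml"}}' \
    --consistent-read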


Answered by Rash

I can think of another way using S3 Versioning; this would require the least amount of changes to your code.

Versioning is a means of keeping multiple variants of an object in the same bucket.

This would mean that every time a new file.xml is uploaded, S3 will create a new version of it.
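
Versioning has to be enabled on the bucket first; assuming a placeholder bucket name, that is a one-time call such as:

# turn on versioning for the bucket (one-time setup)
aws s3api put-bucket-versioning \
    --bucket my-bucket \
    --versioning-configuration Status=Enabled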

In your script, instead of getting the object and comparing it, get the HEAD of the object, which contains the VersionId field. Compare this version with the previous version to find out if the file has changed.

If the file has indeed changed, get the new file, record its new version, and save both locally so that next time you can use this version to check whether an even newer version has been uploaded.
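
A rough bash sketch of that check, with placeholder bucket/key names and no error handling, might look like this:

# fetch only the object's metadata and extract its current version id
latest=$(aws s3api head-object \
    --bucket my-bucket \
    --key file.xml \
    --query VersionId --output text)

if [ "$latest" != "$(cat current.version 2>/dev/null)" ]; then
    echo "file changed"
    # download the new content and remember its version for the next check
    aws s3 cp s3://my-bucket/file.xml current.xml
    echo "$latest" > current.version
else
    echo "no change"
fi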

Note 1: You will still be making lots of calls to S3, but instead of fetching the entire file every time, you are only fetching the file's metadata, which is much faster to retrieve and much smaller.

Note 2: However, if your aim is to reduce the number of calls, the easiest solution I can think of is using Lambda. You can trigger a Lambda function every time a file is uploaded, which then calls the REST endpoint of your service to notify you of the file change.