`published` / `unpublished` webhooks sent before content is available - kentico-kontent

So I'm trying to build a content pipeline with Kentico Cloud. One of the requirements is that pressing the big green Publish button is not the final step of the process. The published content then has to be collected, its representation transformed, and the result forwarded elsewhere. Subscribing to publish / unpublish webhook events and then processing the related content looked like the way to go, but apparently these events sometimes fire before the content is available via the Delivery API.
What are my options? I really don't want to do polling - the nested structure of the content combined with the inability to filter by parent items makes it far from trivial.

As it turns out, the answer is right there in the API docs, in the description of the X-KC-Wait-For-Loading-New-Content request header:
If the requested content has changed since the last request, the header determines whether to wait while fetching content. This can be useful when retrieving changed content in reaction to a webhook call. By default, when the header is not set, the API serves old content (if cached by the CDN) while it's fetching the new content to minimize wait time. To always fetch new content, set the header value to true.
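For illustration, here is a minimal Python sketch of a webhook handler's fetch step that sets this header (X-KC-Wait-For-Loading-New-Content). The base URL, the `/items/{codename}` path shape, and the helper name `build_fresh_item_request` are assumptions for the example, not taken from the question; check them against your own project's Delivery API endpoint.

```python
from urllib.request import Request

# Assumed Delivery API base URL; replace with the one your project uses.
DELIVERY_BASE_URL = "https://deliver.kontent.ai"


def build_fresh_item_request(project_id: str, codename: str) -> Request:
    """Build a GET request for one content item that asks the API to wait
    for new content instead of serving a stale CDN copy."""
    url = f"{DELIVERY_BASE_URL}/{project_id}/items/{codename}"
    return Request(url, headers={"X-KC-Wait-For-Loading-New-Content": "true"})


# Hypothetical project ID and item codename, for illustration only.
request = build_fresh_item_request("my-project-id", "my_article")
```

Passing the result to `urllib.request.urlopen(request)` from the webhook handler would then perform the actual fetch.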


Google Cloud Storage metadata updates

I have a bit of a two-part question regarding the nature of metadata update notifications in GCS. // For the mods: if I should split this into two, let me know and I will.
I have a bucket in Google Cloud Storage, with Pub/Sub notifications configured for object metadata changes. I routinely get doubled metadata updates, seemingly out of nowhere. What happens is that at one point, a Cloud Run container reads the object designated by the notification and does some things that result in
a) a new file being added.
b) an email being sent.
And this should be the end of it.
However, approx. 10 minutes later, a second notification fires for the same object, with the metageneration incremented but no actual changes evident in the notification object.
Strangely, the ETag seems to change minimally (CJ+2tfvk+egCEG0 -> CJ+2tfvk+egCEG4), but the CRC32C and MD5 checksums remain the same - this is correct in the sense that the object is not being written.
The question is twofold, then:
- What exactly constitutes an increment in the metageneration attribute, when no metadata is being set/updated?
- How can the ETag change if the underlying data does not, as shown by the checksums? (I guess the documentation does say "that they will change whenever the underlying data changes"[1], which does not strictly mean they cannot change otherwise.)
1: https://cloud.google.com/storage/docs/hashes-etags#_ETags
As commented by @Brandon Yarbrough, if the metageneration number increases, the most likely cause is an explicit call from somewhere unexpected that updates the metadata in some fashion. One way to verify that no extra update calls are being executed is to enable Stackdriver or bucket access logs.
Regarding the ETag changes, the ETag documentation on Cloud Storage states that
Users should make no assumptions about those ETags except that they will change whenever the underlying data changes.
This means a data change is the only event that is guaranteed to change the ETag. Other events may change it as well, so you should not rely on ETags to detect file changes.
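If the goal is to ignore the second notification, one option is to compare fields other than the ETag. The sketch below assumes you keep the previous notification payload for each object and that the payload carries the object resource fields `generation`, `metageneration`, `crc32c` and `md5Hash`; the function name and the sample values (apart from the two ETags quoted in the question) are made up for illustration.

```python
def describe_change(previous: dict, current: dict) -> str:
    """Classify a notification relative to the last one seen for the object.

    The ETag is deliberately ignored: it may change without a data change.
    """
    if current["generation"] != previous["generation"]:
        return "new-object-version"
    checksums = ("crc32c", "md5Hash")
    if any(current.get(k) != previous.get(k) for k in checksums):
        return "data-changed"
    if current["metageneration"] != previous["metageneration"]:
        return "metadata-only"
    return "duplicate"


# Hypothetical payloads; only the ETags come from the question.
first = {"generation": "1", "metageneration": "1",
         "crc32c": "AAAAAA==", "md5Hash": "abc", "etag": "CJ+2tfvk+egCEG0"}
second = dict(first, metageneration="2", etag="CJ+2tfvk+egCEG4")
print(describe_change(first, second))
```

A consumer could then skip its side effects (the new file and the email) whenever the result is "metadata-only" or "duplicate".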

How to invalidate AWS APIGateway cache

We have a service which inserts certain values into DynamoDB. For the sake of this question, let's say it's a key:value pair, i.e. customer_id:customer_email. The inserts don't happen that frequently, and once an insert is done, that specific key doesn't get updated.
What we have done is create a client library which, provided with customer_id will fetch customer_email from dynamodb.
Given that the customer_id data is static, we were thinking of adding a cache in front of the table, but we are not sure what will happen in the following use case:
client_1 uses our library to fetch customer_email for customer_id = 2.
The customer doesn't exist so API Gateway returns not found
APIGateway will cache this response
For any subsequent calls, this cached response will be sent
Now another system inserts customer_id = 2 with its email address. This system doesn't know whether this response has been cached previously. It doesn't even know that any other system has fetched this specific data. How can we invalidate the cache for this specific customer_id when it gets inserted into DynamoDB?
You can send a request to the API endpoint with a Cache-Control: max-age=0 header which will cause it to refresh.
This could open your application up to attack as a bad actor can simply flood an expensive endpoint with lots of traffic and buckle your servers/database. In order to safeguard against that it's best to use a signed request.
In case it's useful to people, here's .NET code to create the signed request:
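That .NET snippet isn't included here. As a rough stand-in, the following Python sketch builds the headers for a Signature Version 4 signed GET that carries Cache-Control: max-age=0, using only the standard library. It assumes long-lived IAM credentials (no session token), a URL without a query string, and a path made only of unreserved characters; the function name and the example URL are hypothetical. The calling identity would also need permission to invalidate the cache (the `execute-api:InvalidateCache` action).

```python
import datetime
import hashlib
import hmac
from urllib.parse import urlparse


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signed_invalidation_headers(url, region, access_key, secret_key, now=None):
    """Return headers for a SigV4-signed GET asking API Gateway to refresh
    the cache entry for `url` (assumes no query string, no session token)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parsed = urlparse(url)
    service = "execute-api"

    # Canonical request: method, path, empty query, headers, signed list, body hash.
    canonical_headers = (
        "cache-control:max-age=0\n"
        f"host:{parsed.netloc}\n"
        f"x-amz-date:{amz_date}\n"
    )
    signed_headers = "cache-control;host;x-amz-date"
    payload_hash = hashlib.sha256(b"").hexdigest()
    canonical_request = "\n".join(
        ["GET", parsed.path or "/", "", canonical_headers, signed_headers, payload_hash]
    )

    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Derive the signing key step by step, then sign.
    key = _sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    return {
        "Cache-Control": "max-age=0",
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
```

The returned dictionary can be passed as `headers` to `urllib.request.Request(url, headers=...)`; urllib adds the Host header from the URL, which matches the value that was signed.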
We've built a Lambda which takes care of refilling the cache with updated results. It's quite a manual process, with very little reusable code, but it works.
Lambda is triggered by the application itself following application needs. For example, in CRUD operations the Lambda is triggered upon successful execution of POST, PATCH and DELETE on a specific resource, in order to clear the general GET request (i.e. clear GET /books whenever POST /book succeeded).
Unfortunately, if you have a View with a server-side paginated table you are going to face all sorts of issues because invalidating /books is not enough since you actually may have /books?page=2, /books?page=3 and so on....a nightmare!
I believe APIG should allow for more granular control of cache entries, otherwise many use cases aren't covered. It would be enough if they allowed us to choose a root cache group for each request, so that we could manage cache entries by group rather than by single request (which, imho, is also less common).
Did you look at this: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html ?
There is a way to invalidate the entire cache or a particular cache entry.

What kind of data should/can each SQS message contain?

Suppose I have a task of updating a user via a third-party API call. Is it okay to put the actual user data inside the message (if it fits)? Or should I only provide an ID in the message so the worker can retrieve the updated record from my local database?
You need to check what level of compliance is required for your infrastructure, to see what kind of data you want to put in the queue.
If there aren't any compliance restrictions, you are free to put any kind of data in your own infrastructure on AWS.
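To make the two options concrete, here is a small Python sketch that builds the message body either way. The `type` field, the `may_embed_data` flag (standing in for the compliance decision) and the `max_bytes` parameter (the queue's size limit, which you would look up for your own setup) are assumptions for the example.

```python
import json


def build_message_body(user: dict, max_bytes: int, may_embed_data: bool = True) -> str:
    """Embed the user record if allowed and small enough; otherwise send only the ID."""
    if may_embed_data:
        body = json.dumps({"type": "user.updated", "user": user})
        if len(body.encode("utf-8")) <= max_bytes:
            return body
    # Fallback: the worker looks the record up in the local database by ID.
    return json.dumps({"type": "user.updated", "user_id": user["id"]})


# Hypothetical record and limit, for illustration only.
print(build_message_body({"id": 42, "email": "a@example.com"}, max_bytes=1024))
```

Note that an ID-only message means the worker sees whatever the record looks like when it runs, while an embedded record is a snapshot from the time the message was sent.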

In Sitecore, how to view the content of the submit queue file?

So in the Sitecore site, in Data/Submit Queue, there is a file without an extension that represents the content of the Submit Queue.
If you try viewing it as a text file, it shows some content, but there are some strange characters in the mix.
So, has someone made an application to view this file? Is it supposed to be in a specific format that should be opened with an application able to read that format?
Extra info: Sitecore 8.0, and no, there is nothing about it in the control panel or in sitecore/admin.
Mark is right, the submit queue isn't meant for users to view. A couple of months ago, I wrote a post on this exact subject.
From Akinori Taira, a member of the xDB product team:
In the event that the collections database is unavailable, there is a
special ‘Submit Queue’ mechanism that flushes captured data to the
local hard drive (the ‘Data\Submit Queue’ folder by default). When
the collections database comes back online, a background worker
process submits the data from the ‘Submit Queue’ on disk.
No, you're not meant to be opening the Submit Queue and doing anything with it.
It is used by xDB (in your case) to submit data, when the xDB cannot be reached. It will be a format related to MongoDB in some way, but I've never seen any formal documentation for it.
Sitecore 8.1: Purpose of Submit Queue and MediaIndexing folders under $(dataFolder)
This file contains the analytics data that was not flushed to the Mongo database.
In case xDB collection server is unavailable, Sitecore would/must handle this situation correctly. There is a special 'Submit Queue' mechanism introduced that flushes captured data to local server hard drive ( 'Data\Submit Queue' folder by default ) in case xDB is not available.
When xDB is up again, a background worker would submit the data saved on disk, so no data is lost.
As a quick suggestion on this I recommend you to check whether your MongoDB server is available for your Sitecore instance. Once it becomes available, all data from the file should be flushed to the xDB.
The submit queue file stores serialized values as follows: first value - number of entities, second value - position of the next entity, which must be submitted to xDB, the next values contain serialized analytics data.
The submit queue is processed using this class: Sitecore.Analytics.Data.DataAccess.SubmitQueue.FileSubmitQueue
If you want to debug it to see how it is processed, decompile the class, create your own class, and replace the following entry in Sitecore.Analytics.Tracking.config:
<queue type="Sitecore.Analytics.Data.DataAccess.SubmitQueue.FileSubmitQueue, Sitecore.Analytics" singleInstance="true" />

Amazon S3 conditional put object

I have a system in which I get a lot of messages. Each message has a unique ID, but it can also receive updates during its lifetime. As the time between a message being sent and being handled can be very long (weeks), the messages are stored in S3. For each message only the last version is needed. My problem is that occasionally two messages with the same ID arrive together, but in two versions (older and newer).
Is there a way for S3 to have a conditional PutObject request where I can declare "put this object unless I have a newer version in S3"?
I need an atomic operation here
That's not the use-case for S3, which is eventually-consistent. Some ideas:
You could try to partition your messages - all messages that start with A-L go to one box, M-Z go to another box. Then each box locally checks that there are no duplicates.
Your best bet is probably some kind of database. Depending on your use case, you could use a regular SQL database, or maybe a simple RAM-only database like Redis. Write to multiple Redis DBs at once to avoid SPOF.
There is SWF which can make a unique processing queue for each item, but that would probably mean more HTTP requests than just checking in S3.
David's idea about turning on versioning is interesting. You could have a daemon that periodically trims off the old versions. When reading, you would have to do "read repair" where you search the versions looking for the newest object.
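The database suggestion above could look roughly like the following Python sketch, which uses SQLite from the standard library as a stand-in for "a regular SQL database". It assumes each message carries a comparable integer version; the table, column and function names are made up for the example, and the upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id TEXT PRIMARY KEY, version INTEGER NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def put_if_newer(conn: sqlite3.Connection, msg_id: str, version: int, body: str) -> None:
    """Insert the message, or overwrite it only if the stored version is older.

    The check and the write happen in a single statement inside one
    transaction, so an older message cannot overwrite a newer one.
    """
    with conn:
        conn.execute(
            "INSERT INTO messages (id, version, body) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET "
            "version = excluded.version, body = excluded.body "
            "WHERE excluded.version > messages.version",
            (msg_id, version, body),
        )


def get_message(conn: sqlite3.Connection, msg_id: str):
    row = conn.execute(
        "SELECT version, body FROM messages WHERE id = ?", (msg_id,)
    ).fetchone()
    return row  # (version, body) or None


conn = open_store()
put_if_newer(conn, "msg-1", 2, "newer")
put_if_newer(conn, "msg-1", 1, "older")  # ignored: stored version is newer
```

The same pattern (a conditional upsert keyed on the message ID) carries over to other SQL databases, with the large message bodies optionally kept in S3 and only the ID and version tracked in the database.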
Couldn't this be solved by using tags, and using a Condition on that when using PutObject? See "Example 3: Allow a user to add object tags that include a specific tag key and value" here: https://docs.aws.amazon.com/AmazonS3/latest/dev/object-tagging.html#tagging-and-policies