Hazelcast Reducer to update map - mapreduce

I have a reducer that returns an object that is PartitionAware and is supposed to be stored in the same partition as the other objects it was sourced from and, as I understand it, the job itself. Does it make sense to update the map directly from the Reducer? If it implements HazelcastInstanceAware, I can technically access the map to store the result. The idea is to minimize cluster traffic.
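For illustration, here is a minimal sketch of that idea against the Hazelcast 3.x mapreduce API, assuming simplified String/Long types and a hypothetical target map called "results"; whether doing the write from the reduce phase is actually advisable is exactly what is being asked.

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IMap;
import com.hazelcast.mapreduce.Reducer;
import com.hazelcast.mapreduce.ReducerFactory;

// Hypothetical sketch: the factory is HazelcastInstanceAware, so the reducers it
// creates can write their per-key result straight into a target IMap on the
// member running the reduce phase.
public class StoreToMapReducerFactory
        implements ReducerFactory<String, Long, Long>, HazelcastInstanceAware {

    private transient HazelcastInstance hazelcastInstance;

    @Override
    public void setHazelcastInstance(HazelcastInstance hazelcastInstance) {
        this.hazelcastInstance = hazelcastInstance;
    }

    @Override
    public Reducer<Long, Long> newReducer(final String key) {
        // "results" is a placeholder map name.
        final IMap<String, Long> results = hazelcastInstance.getMap("results");
        return new Reducer<Long, Long>() {
            private long sum;

            @Override
            public void reduce(Long value) {
                sum += value;
            }

            @Override
            public Long finalizeReduce() {
                // Store the result directly from the reducer; the value is still
                // returned so the job's normal result collection keeps working.
                results.put(key, sum);
                return sum;
            }
        };
    }
}
```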


What are the minimum SAS permissions that the EventProcessorClient needs from Storage Accounts?

I couldn't find which SAS permissions I need to grant for a storage account that I'm using solely to connect to Event Hubs for consumption.
[picture of the permission options]
So it's stored in blobs, and it definitely needs to read... but does it also need update, or write?
The documentation only shows examples with connection strings.
The EventProcessorClient needs to be able to:
List blobs in a container
Add a new blob to a container
Update an existing blob in the container (metadata only)
Read an existing blob in the container (metadata only)
We generally recommend using a container dedicated to the processor and allowing the processor control over that container.
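For reference, a hedged sketch of generating a container SAS with roughly those rights using the azure-storage-blob client for Java; the container name and the 30-day expiry are placeholders, and mapping the list above onto the Read, List, Add, Create and Write permissions is my assumption.

```java
import java.time.OffsetDateTime;

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.blob.sas.BlobContainerSasPermission;
import com.azure.storage.blob.sas.BlobServiceSasSignatureValues;

public class CheckpointContainerSas {
    public static void main(String[] args) {
        // Assumes a connection string with the account key is available in an
        // environment variable; the container name is a placeholder.
        BlobContainerClient container = new BlobContainerClientBuilder()
                .connectionString(System.getenv("STORAGE_CONNECTION_STRING"))
                .containerName("eventhub-checkpoints")
                .buildClient();

        // Read + List cover reading and enumerating the checkpoint/ownership
        // blobs; Add/Create/Write cover creating new blobs and updating their
        // metadata when ownership or checkpoints change.
        BlobContainerSasPermission permissions = new BlobContainerSasPermission()
                .setReadPermission(true)
                .setListPermission(true)
                .setAddPermission(true)
                .setCreatePermission(true)
                .setWritePermission(true);

        String sasToken = container.generateSas(
                new BlobServiceSasSignatureValues(
                        OffsetDateTime.now().plusDays(30), permissions));

        System.out.println(sasToken);
    }
}
```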

DynamoDB Upsert - Update or Create?

We use DynamoDB UpdateItem.
This acts as an "upsert", as we can learn from the documentation:
Edits an existing item's attributes, or adds a new item to the table if it does not already exist. [...]
When we make a request, we set ReturnValues to ALL_OLD to determine whether an item was created or an existing item was updated. This works great and allows us to differentiate between update and create.
As an additional requirement, we also want the ALL_NEW image back, while still knowing which type of operation was performed.
Question: Is this possible to do in a single request or do we have to make a second (get) request?
This is not supported in DynamoDB out of the box; there is no ALL or NEW_AND_OLD_IMAGES option as there is in DynamoDB Streams, but you can always go DIY.
When you do the UpdateItem call, you have the UpdateExpression, which is basically the list of changes to apply to the item. Given that you told DynamoDB to return the item as it looked before that operation, you can construct the new state locally.
Just create a copy of the ALL_OLD response and locally apply the changes from the UpdateExpression to it. That's definitely faster than two API calls, at the cost of a slightly more complex implementation.
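A hedged sketch of that DIY approach with the AWS SDK for Java v2; the table, key, and attribute names are placeholders, and "applying the expression locally" is simplified here to re-applying a single SET.

```java
import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ReturnValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemResponse;

public class UpsertWithOldAndNew {
    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();

        // Placeholder key and values.
        Map<String, AttributeValue> key = Map.of(
                "pk", AttributeValue.builder().s("user#42").build());
        Map<String, AttributeValue> values = Map.of(
                ":name", AttributeValue.builder().s("Alice").build());

        UpdateItemResponse response = dynamo.updateItem(UpdateItemRequest.builder()
                .tableName("users")
                .key(key)
                .updateExpression("SET #n = :name")
                .expressionAttributeNames(Map.of("#n", "name"))
                .expressionAttributeValues(values)
                .returnValues(ReturnValue.ALL_OLD) // old image; empty if created
                .build());

        // An empty old image means the item was created rather than updated.
        boolean created = response.attributes().isEmpty();

        // Reconstruct the new image locally: start from the old image and
        // re-apply the same change the UpdateExpression made.
        Map<String, AttributeValue> newImage = new HashMap<>(response.attributes());
        newImage.putAll(key);
        newImage.put("name", values.get(":name"));

        System.out.println((created ? "created: " : "updated: ") + newImage);
    }
}
```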

AWS S3 Bucket policy to prevent Object updates

I have a set of objects in an S3 bucket, all with a common prefix. I want to prevent updating of the currently existing objects, but allow users to add new objects under the same prefix.
As I understand it, the s3:PutObject action is used both to update existing objects AND to create new ones.
Is there a bucket policy that can limit updating, while allowing creating?
e.g. forbid modifying the already existing s3://bucket/Input/obj1, but allow creating s3://bucket/Input/obj2
Edit, for context: we're using S3 as a store for regression test data, used to test our transformations. As we're continuously adding new test data, we want to ensure that the already ingested input data doesn't change; this would remove one of the current causes of failed tests. All our input data is stored under the same prefix, and likewise for the expected data.
No, this is not possible.
The same API call, and the same permissions, are used to upload an object regardless of whether an object already exists with the same name.
You could use Amazon S3 Versioning to retain both the old object and the new object, but that depends on how you will be using the objects.
It is not possible in the way you describe, but there is a mechanism of sorts called S3 Object Lock, which allows you to lock a specific version of a file. It will not prevent the creation of new versions of the file, but the version you lock will be immutable.
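A hedged sketch of locking an existing version with the AWS SDK for Java v2; the bucket, key, version id, and one-year retention are placeholders, and the bucket must have been created with Object Lock enabled (which also turns on versioning).

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ObjectLockRetention;
import software.amazon.awssdk.services.s3.model.ObjectLockRetentionMode;
import software.amazon.awssdk.services.s3.model.PutObjectRetentionRequest;

public class LockExistingVersion {
    public static void main(String[] args) {
        S3Client s3 = S3Client.create();

        s3.putObjectRetention(PutObjectRetentionRequest.builder()
                .bucket("regression-test-data")   // placeholder bucket
                .key("Input/obj1")                // placeholder key
                .versionId("EXAMPLE-VERSION-ID")  // placeholder version id
                .retention(ObjectLockRetention.builder()
                        .mode(ObjectLockRetentionMode.COMPLIANCE)
                        // This version cannot be overwritten or deleted until the
                        // date below; new versions of Input/obj1 can still be added.
                        .retainUntilDate(Instant.now().plus(365, ChronoUnit.DAYS))
                        .build())
                .build());
    }
}
```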

How to build an index of S3 objects when data exceeds object metadata limit?

Building an index of S3 objects can be very useful to make them quickly searchable: the natural, most obvious way is to store additional data in the object metadata and use a Lambda to write it to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2 KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add some information (not contained in the file or the object name) to a database, and this data exceeds 2 KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeeds, one can retry until success (a sketch of the idempotent index write is below). What happens if you perform a PUT of an identical object on S3 and you have versioning activated? Will S3 increase the version? In that case, implementing idempotency requires a single writer to be active at any time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?
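For the first strategy, a hedged sketch of what the idempotent second call could look like with the AWS SDK for Java v2; the table name, the key schema, and the choice to key the index entry on bucket/key and versionId are my assumptions.

```java
import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class IndexWriter {
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Called after the S3 PUT has succeeded (e.g. from an S3 event notification).
    // Keying the index item on bucket/key and versionId makes the write idempotent:
    // retrying after a failure overwrites the same item with the same data, so the
    // two-step "PUT object, then PUT index entry" sequence can simply be retried
    // until both halves have succeeded.
    public void indexObject(String bucket, String key, String versionId,
                            Map<String, AttributeValue> extraMetadata) {
        Map<String, AttributeValue> item = new HashMap<>(extraMetadata);
        item.put("pk", AttributeValue.builder().s(bucket + "/" + key).build());
        item.put("sk", AttributeValue.builder().s(versionId).build());

        dynamo.putItem(PutItemRequest.builder()
                .tableName("s3-object-index") // placeholder table name
                .item(item)
                .build());
    }
}
```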

Get arbitrary object from Riak bucket

Is there a way to get a random object from a specific bucket by using Riak's HTTP API? Let's say that you have no knowledge about the contents of a bucket, the only thing you know is that all objects in a bucket share a common data structure. What would be a good way to get any object from a bucket, in order to show its data structure? Preferably using MapReduce over Search, since Search will flatten the response.
The best option is to use predictable keys so you don't have to find them. Since that is not always possible, secondary indexing is the next best.
If you are using eLevelDB, you can query the special $bucket implicit index with max_results set to 1, which will return a single key. You would then issue a get request for that key.
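A hedged sketch of that query over the HTTP API, issued from Java here just to show the URL shapes; the host, port, and bucket name are placeholders, and the exact form of the $bucket index query follows the Riak 1.4+ secondary-index pagination docs as I recall them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RandomKeyVia2i {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8098"; // placeholder Riak HTTP endpoint
        String bucket = "users";               // placeholder bucket name

        HttpClient http = HttpClient.newHttpClient();

        // Query the special $bucket index with max_results=1 to get one key back.
        HttpResponse<String> keys = http.send(
                HttpRequest.newBuilder(URI.create(
                        base + "/buckets/" + bucket + "/index/$bucket/" + bucket
                                + "?max_results=1")).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // The body looks like {"keys":["somekey"],"continuation":"..."}; parse out
        // the key, then fetch the object with GET /buckets/<bucket>/keys/<key>.
        System.out.println(keys.body());
    }
}
```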
If you are using Bitcask, you have 2 options:
list all of the keys in the bucket
Key listing in Bitcask will need to fold over every value in all buckets in order to return the list of keys in a single bucket. Effectively this means reading your entire dataset from disk, so this is very heavy on the system and could bring a production cluster to its knees.
MapReduce
MapReduce over a full bucket uses a similar query to key listing, so it is also very heavy on the system. Since the map phase function is executed separately for each object, if your map phase returns the object, every object in the bucket would be passed over the network to the node running the reduce phase. Thus it would be more efficient (read: less disastrous) to have the map phase function return just the key with no data, then have your reduce phase return the first item in the list; this leaves you needing to issue a get request for the object once you have the key name (see the sketch after this list).
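A hedged sketch of such a keys-only MapReduce job submitted to the /mapred HTTP endpoint from Java; the bucket name is a placeholder, and the JavaScript phase functions are only my illustration of "map returns the key, reduce keeps one item".

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FirstKeyViaMapReduce {
    public static void main(String[] args) throws Exception {
        // Full-bucket input: this still folds over the whole bucket, so treat it
        // as a last resort on a production cluster.
        String query = "{"
                + "\"inputs\": \"users\","
                + "\"query\": ["
                + "  {\"map\": {\"language\": \"javascript\","
                + "             \"source\": \"function(v) { return [v.key]; }\"}},"
                + "  {\"reduce\": {\"language\": \"javascript\","
                + "                \"source\": \"function(values) { return values.slice(0, 1); }\"}}"
                + "]}";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://localhost:8098/mapred"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(query))
                        .build(),
                HttpResponse.BodyHandlers.ofString());

        // The body is a JSON list containing the single key; GET the object
        // afterwards with /buckets/users/keys/<key>.
        System.out.println(response.body());
    }
}
```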
While it is technically possible to find a key in a given bucket when you have no information about the keys or the contents, if you designed your system to create a key named <<"schema">> or <<"sample">> that contains a sample object in each bucket, you could simply issue a get request for that key instead of searching, folding, or mapping.
If you are using Riak 2.x, then Search (http://docs.basho.com/riak/latest/dev/using/search/) is recommended over MapReduce or 2i queries in most use cases, and it is available via the HTTP API.