Bulk check if value exists before inserting - mapreduce

I have a database in CouchDB with documents like:
{"foo": {"bar1": "baz", "bar2": 18, "bar3": 23.2}}
Is there a way for me to do batch checking of each value before inserting new documents?
What I want to achieve: if any document in the database already has the key-value pairs foo.bar1 = baz and foo.bar2 = 1, where baz equals the value I am about to insert, the batch function should not insert the new document.
More concretely, foo.bar1 is a datetime; if that datetime already exists and another value in the same document has a given value, no update should be executed.
I could solve this with single-document inserts by first requesting the existing values and letting the client decide, but that would be very time consuming with all the data sent back and forth between the client and CouchDB. I would also prefer to rely on the integrity of the database for these kinds of checks. Or is that the SQL way of solving the problem?

You can use the Bulk Document API. If you POST the list of keys you care about to _all_docs, you get back one row per key; keys that do not exist come back with "error": "not_found", so you can filter those out and only insert the documents that are genuinely new.
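As a rough Python sketch of that flow, assuming you encode the uniqueness criterion (e.g. the datetime) in the document _id and using the requests library (the database URL and function name are made up for illustration):

    import requests

    COUCH_URL = "http://localhost:5984/mydb"   # assumed CouchDB database URL

    def insert_missing(docs_by_id):
        """docs_by_id: dict mapping candidate _id -> document body."""
        # One request: ask _all_docs about every candidate key at once.
        resp = requests.post(COUCH_URL + "/_all_docs",
                             json={"keys": list(docs_by_id)})
        resp.raise_for_status()

        # Rows for keys that do not exist come back with "error": "not_found".
        missing = [row["key"] for row in resp.json()["rows"] if "error" in row]

        # Bulk-insert only the documents whose keys were not found.
        new_docs = [dict(docs_by_id[key], _id=key) for key in missing]
        if new_docs:
            requests.post(COUCH_URL + "/_bulk_docs",
                          json={"docs": new_docs}).raise_for_status()
        return new_docs

If the criterion cannot live in the _id, the same keys-style lookup can be done against a map view that emits [foo.bar1, foo.bar2] as its key.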

Related

I have a boto3 Lambda function which uses table.scan to scan DynamoDB. I want to send limited data to the frontend to show 10 entries; I can't use query.

I would like some help implementing pagination in boto3 without query or NextToken. I want the DynamoDB data to be sent to the frontend 10 entries at a time, with each request returning the next 10 entries.
At first I thought I could use query, but the primary key would filter out most of the data, and I want all the results, just 10 entries at a time.
On the frontend it has to work like page 1 (10 entries), page 2 (next 10), and so on.
From what I gather from your question, there's no direct approach.
The LastEvaluatedKey approach Mark B describes is good, but not for your case: with it you can only move forward to the next page, not jump to an arbitrary page or go directly back to a previous one.
You could implement a not-so-cost-efficient approach by adding a page-number field to the table and using a Lambda function to maintain the pagination in the database.
To implement pagination with a DynamoDB scan operation you have to save the LastEvaluatedKey value that was returned from the first scan, and pass that as the ExclusiveStartKey on the next scan.
On the first scan operation (page == 0) you would simply pass Limit=10 in the scan parameters.
On the second scan operation (page == 1) you would pass Limit=10, ExclusiveStartKey=<lastScanResult.LastEvaluatedKey>.
Note that you will have to return the LastEvaluatedKey value to the frontend as part of your result data; the frontend will have to store it in memory and send it back as part of the request data when it asks for the next page.
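In boto3 that flow is only a few lines; here is a minimal sketch (the table name is made up for the example):

    import boto3

    table = boto3.resource("dynamodb").Table("MyTable")  # assumed table name

    def get_page(last_evaluated_key=None, page_size=10):
        """Return one page of items plus the key needed to fetch the next page."""
        kwargs = {"Limit": page_size}
        if last_evaluated_key:
            # Resume the scan exactly where the previous page stopped.
            kwargs["ExclusiveStartKey"] = last_evaluated_key
        result = table.scan(**kwargs)
        # LastEvaluatedKey is absent once the scan has reached the end of the table.
        return result["Items"], result.get("LastEvaluatedKey")

    # Page 0: no ExclusiveStartKey.
    items, next_key = get_page()
    # Page 1: pass the key returned by the previous scan back in.
    items, next_key = get_page(last_evaluated_key=next_key)

The frontend would carry next_key back and forth with each request, as the note above describes.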

Parallel insertion in Dynamo DB

I have 3 clients. All of them want to insert items in the same database.
Whenever a client sends a request, I need to:
1. Read the last entered record in DDB.
2. Increase its id by 1.
3. Push the new request into DDB with the increased id.
What's the best AWS-based architecture to implement this?
What if there were 100 clients?
What use is it to have an increasing Id as a partition key, assuming that's the use-case?
Unlike in relational databases, where this would be a good pattern, in a key-value store it typically is not, as it leads to difficulty reading the data back.
My suggestion would be to use a meaningful id that is already known to your application, so that you can read the items back efficiently. If those known values are not unique on their own, you can add a sort key; the partition key and sort key together form the primary key and define uniqueness.
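A rough sketch of that suggestion (the Requests table and the clientId/createdAt key names are hypothetical, not taken from the question):

    import time
    import uuid

    import boto3

    table = boto3.resource("dynamodb").Table("Requests")  # hypothetical table

    def store_request(client_id, payload):
        # Partition key: an id the application already knows (the client id),
        # so the items can later be read back with a cheap Query per client.
        # Sort key: timestamp plus a random suffix, so concurrent writers never
        # collide and no read-increment-write cycle is needed.
        table.put_item(
            Item={
                "clientId": client_id,
                "createdAt": f"{int(time.time() * 1000)}#{uuid.uuid4().hex[:8]}",
                "payload": payload,
            }
        )

Because no writer has to read the previous item first, this works the same way for 3 clients or 100.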

How does google Big Query handle table updates with missing fields?

I'm interested in using a streaming pipeline from Google Pub/Sub to BigQuery, and I wanted to know how it would handle a case where an updated JSON object is sent with fields/branches missing that already exist in the BigQuery table/schema. For example, will it set the value in the table to empty/null, retain what's in the table and update only the fields/branches that are present, or simply fail because the sent object does not match the schema one to one?

Finding expired data in aws dynamoDB

I have a requirement to store some data in DynamoDB with a status and a timestamp, e.g. <START, 20180203073000>.
The status flips to STOP when I receive a message from SQS. But to make my system error-proof, I need some mechanism to identify items in DynamoDB whose START status is older than 1 day and set their status to STOP, so that they don't wait indefinitely for the message to arrive from SQS.
Is there an AWS feature I can use to achieve this without polling for data at regular intervals?
Not sure if this will fit your needs, but here is one possibility:
1. Enable TTL on your DynamoDB table. This works if your timestamp attribute is a Number data type containing the time in epoch format. Once the timestamp expires, the corresponding item is deleted from the table in the background.
2. Enable Streams on your DynamoDB table. Items that are deleted by TTL will be sent to the stream.
3. Create a trigger that connects the DynamoDB stream to a Lambda function. In your case the trigger will receive your entire deleted item.
4. In the Lambda function, modify the record (set 'START' to 'STOP'), remove the timestamp attribute (items with no TTL attribute are not deleted) and re-insert it into the table; a sketch of such a handler is shown below.
This way you avoid the table scans searching for expired items, but on the other hand there is some cost associated with the Lambda executions.
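A minimal sketch of such a handler, assuming the stream is configured to include old images and the TTL attribute is called expiresAt (both the table and attribute names are made up for illustration):

    import boto3
    from boto3.dynamodb.types import TypeDeserializer

    table = boto3.resource("dynamodb").Table("StatusTable")  # assumed table name
    deserializer = TypeDeserializer()

    def handler(event, context):
        for record in event["Records"]:
            # TTL deletions arrive as REMOVE events; the old item image is only
            # present if the stream view type includes old images.
            if record["eventName"] != "REMOVE":
                continue
            old_image = record["dynamodb"].get("OldImage", {})
            item = {k: deserializer.deserialize(v) for k, v in old_image.items()}
            if item.get("status") != "START":
                continue
            item["status"] = "STOP"        # flip the status
            item.pop("expiresAt", None)    # drop the TTL attribute so the item is not expired again
            table.put_item(Item=item)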
You can try creating a GSI with the status as the partition key and the timestamp as the sort key. When querying for expired items, use a key condition expression like status = "START" AND timestamp < <one day ago>.
Be careful though, because this basically creates two hot partitions (START and STOP), so make sure the index projection only includes the attributes you need and no more.
If you have an attribute that is set while status = START but doesn't exist otherwise, you can take advantage of a sparse index (DynamoDB doesn't index an item in a GSI if the GSI key attributes don't exist on that item, so you don't need to filter those items out at query time).
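For illustration, querying such a GSI from boto3 could look like this (the index name, attribute names, and the epoch-seconds timestamp format are assumptions):

    import time

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("StatusTable")    # assumed table name
    one_day_ago = int(time.time()) - 24 * 60 * 60              # assumes an epoch-seconds timestamp

    # Partition key of the GSI is the status, sort key is the timestamp.
    response = table.query(
        IndexName="status-timestamp-index",                    # assumed GSI name
        KeyConditionExpression=Key("status").eq("START") & Key("timestamp").lt(one_day_ago),
    )
    expired_items = response["Items"]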

Real Time Google Analytics API - Identify user session

I'm retrieving event data using the Real Time Google Analytics API, so as to trigger responses each time certain conditions are met, while the user navigates.
This is my actual query on the Google Analytics Real Time API (which works perfectly!):
    return service.data().realtime().get(
        ids='ga:' + profile_id,
        metrics='rt:totalEvents',
        dimensions='rt:eventAction,rt:eventLabel,rt:eventCategory',
        max_results='25').execute()
I'd like to show the results grouped by each particular session or user, so as to trigger a message to that particular user if certain conditions are met.
Is that possible? And if so, how do I apply these criteria to this query?
"Trigger a message to a particular user" would imply that you either have personally identifiable data stored in GA, which would violate Googles TOS, or that you map an anonymous ID (clientid or UserID or similar) to a key stored in an external database (which might be legally murky, depending on your legislation). Since I don't want to throw away the answer I have written before reading your question to the end :-) I am going to assume the latter.
So, is that possible? No, not really. By default GA does not identify neither an identifier for the user (client id or user id) nor for the session (a session identifier is present only in the BigQuery export schema).
The realtime API has a very limited set of dimensions (mostly I think because data aggregation does not happen in realtime), so you can't even use custom dimensions. Your only chance would be to overwrite one of the standard fields, i.e. campaign information.
Of course this destroys the original data in the field. So you should use an extra view for the API query, send a custom dimension with the user identifier along, and then use an advanced filter to copy the custom dimension value to a standard field (while you original data is safe in your other data views). This is a bit hackish, though.
Also the realtime API only displays the current hit per user, so you cannot group by user in the query in any case - you'd need to download and store the data to an external database and do your aggregation there.
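If you go the hackish route above, the query from the question could be extended along these lines (rt:campaign is a standard realtime dimension; that it carries your copied-over user identifier is an assumption based on the filter setup described earlier):

    # Same realtime query as above, with the repurposed campaign field added so
    # each row also carries the user identifier copied there by the view filter.
    return service.data().realtime().get(
        ids='ga:' + profile_id,
        metrics='rt:totalEvents',
        dimensions='rt:eventAction,rt:eventLabel,rt:eventCategory,rt:campaign',
        max_results='25').execute()

Each returned row then includes that identifier, which you can store externally and aggregate per user, as noted above.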