Are there ways to find the amount of data queried per Lambda statement for AWS Redshift?

I am trying to find the amount of data queried per statement from AWS Lambda on Redshift, but all I can find is the amount of data queried per query ID. There are multiple Lambdas that I am running, but I can't seem to relate the Lambdas to the query IDs.
I looked through the documentation on the AWS Redshift system views, but there don't seem to be any tables that contain these values.

There are a few ways to do this. First off, the Lambda can find its session id with PG_BACKEND_PID(); this can be logged from the Lambda so you can report on all statements from that session. Or you can add a unique comment to all the queries coming from the Lambda and search for it in svl_statementtext. Or you can do both. Once you have the query id and session id, you look at the query statistics (SVL_QUERY_REPORT or other catalog tables).
Be aware that query ids and session ids repeat over time, so also check the date to make sure you are not looking at a query from some time ago.
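As a minimal sketch of the tagging approach from a Python Lambda - the connection details, table name, and comment format below are placeholders, and the redshift_connector driver is just one way to connect:

import uuid
import redshift_connector  # assumed driver; psycopg2 etc. would work the same way

def handler(event, context):
    conn = redshift_connector.connect(
        host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        database="dev", user="awsuser", password="...",            # placeholder
    )
    cur = conn.cursor()
    # Capture this session's id so it can be logged with the Lambda's output.
    cur.execute("SELECT PG_BACKEND_PID()")
    session_id = cur.fetchone()[0]
    # Tag every statement with a unique comment so it can be found later.
    tag = f"lambda-run-{uuid.uuid4()}"
    cur.execute(f"SELECT COUNT(*) FROM my_table /* {tag} */")  # my_table is a placeholder
    print(f"session_id={session_id} tag={tag}")  # ends up in CloudWatch Logs

On the Redshift side, something like SELECT query, pid, starttime FROM stl_query WHERE querytxt LIKE '%lambda-run-%' should then surface the query ids (and session ids) to feed into SVL_QUERY_REPORT.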

Related

Dataset shows only 5 event tables after re-linking Firebase with another Google Analytics account

Recently unlinked and re-linked a Firebase project with a different Google Analytics account.
The BigQuery integration configured to export GA data created a new dataset, and data started populating it.
The old dataset, corresponding to the unlinked "default" GA account and containing ~2 years of data, is still accessible in the BigQuery UI; however, only the 5 most recent event_ tables (5 days' worth of event data) are visible in the dataset.
Is it possible to extract historical data from the old, unlinked dataset?
What I'd suggest is to run some queries to further validate the data you have in your BigQuery dataset.
In this case, I would start by getting the dates for each table to see how many days of data the dataset contains.
SELECT event_date
FROM `firebase-public-project.analytics_153293282.events_*`
GROUP BY event_date ORDER BY event_date
EDIT
A better way to do this, and to list all the tables in the dataset, is to use the bq command-line tool; see the reference here.
bq ls firebase-public-project:analytics_153293282
This will list every event_ table that still exists in the dataset.
You could also do a COUNT(event_date), so you can see how many records you have per day, and compare this with what you can see in your Firebase project.
SELECT event_date, COUNT(event_date) AS records
FROM `firebase-public-project.analytics_153293282.events_*`
GROUP BY event_date ORDER BY event_date
In case there is data missing, you could use table decorators to try to recover that data; see the example here.
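For illustration only (the table suffix and the one-hour offset are hypothetical), a snapshot-decorator query with the google-cloud-bigquery Python client; decorators require legacy SQL:

from google.cloud import bigquery

client = bigquery.Client()
# Table decorators only work with legacy SQL; @-3600000 asks for a snapshot
# of the table as it was one hour ago (the offset is in milliseconds).
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
sql = """
SELECT COUNT(*) AS records
FROM [firebase-public-project:analytics_153293282.events_20200101@-3600000]
"""
for row in client.query(sql, job_config=job_config):
    print(row.records)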
About the table's expiration date: in short, a default expiration time can be set at the dataset level, and it applies to new tables (existing tables require a manual update of their expiration time, one by one); an expiration time can also be set when a table is created. To see whether there was any change to the expiration time, you could look in your logs for protoPayload.methodName="tableservice.update" and check whether an expireTime was set, as follows:
tableUpdateRequest: {
  resource: {
    expireTime: "2020-12-31T00:00:00Z"
    ...
  }
}
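You can also check the current expiration settings directly; a small sketch with the google-cloud-bigquery Python client (the dataset is the one from the question, the table name is illustrative):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("firebase-public-project.analytics_153293282.events_20200101")
print(table.expires)  # None means no expiration is set on this particular table

dataset = client.get_dataset("firebase-public-project.analytics_153293282")
print(dataset.default_table_expiration_ms)  # dataset-level default applied to new tables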
Besides this, if you have a GCP support plan, you could reach out to them for further assistance on what could have happened with the tables in that dataset. Otherwise, you could open an issue in the public issue tracker. Keep in mind that Firebase doesn't delete your data when unlinking a Firebase project from BigQuery, so in theory the data should still be there.

DynamoDB - UUID and avoiding a full table scan

This is my use case:
I have a JSON API with 200k objects. The dataset looks a little something like this: date, bike model, production time in minutes. I use Lambda to read from the JSON API and write to DynamoDB via HTTP request. The Lambda function runs every day and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? Which AWS database is more suitable for my use case then?
Yes, a UUID or any other unique identifier (e.g. date + bike model + creation time) as the partition key is fine.
It seems your daily average-value job is really a data analytics job rather than a transactional one. I would suggest going with a service that supports data analytics, such as Amazon Redshift. You should be able to feed data into such a service using DynamoDB Streams. Alternatively, you can stream the data into S3 and use a service like Athena to get the daily average.
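As a hedged sketch of the S3 + Athena route, assuming a Python Lambda subscribed to the table's DynamoDB stream (with a stream view type that includes the new image); the bucket name and attribute names are placeholders:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-production-times-bucket"  # placeholder

def handler(event, context):
    # Each stream record carries the new item image in DynamoDB's typed JSON format.
    lines = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            lines.append(json.dumps({
                "date": image["date"]["S"],                            # placeholder attributes
                "bike_model": image["bike_model"]["S"],
                "production_min": float(image["production_min"]["N"]),
            }))
    if lines:
        # Newline-delimited JSON under one prefix, so Athena can query the whole folder.
        key = f"raw/{context.aws_request_id}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode())

An Athena query over that prefix could then compute AVG(production_min) grouped by date.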
There is a simple database model that you could use for this task:
PartitionKey: a UUID, or any combination of fields that provides uniqueness.
SortKey: the production date as a string, e.g. 2020-07-28
If you then create a secondary index that uses the production date as its partition key and includes the production time, you can query (not scan) the secondary index for a specific date and perform any calculations you need on the production time. You can also provision the read/write capacity of the secondary index and the table independently.
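A minimal sketch of querying such an index with boto3; the table, index and attribute names are assumptions:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ProductionTimes")  # placeholder table name

# Query (not scan) the secondary index for a single production date.
resp = table.query(
    IndexName="production_date-index",  # placeholder index name
    KeyConditionExpression=Key("production_date").eq("2020-07-28"),
)
items = resp["Items"]  # pagination via LastEvaluatedKey omitted for brevity
avg_minutes = sum(i["production_min"] for i in items) / len(items) if items else 0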
Regarding your third question, I don't see any real benefit to using DynamoDB for this task. Any RDS engine (e.g. MySQL), Redshift, or even S3 + Athena can easily handle such a use case. If you require real-time analytics, you could even consider Amazon Kinesis.

Logstash and looking up additional data from a relational table?

I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering the logstash/elasticsearch/kibana combination, but we have additional data on our users stored in a Redshift database. So in addition to the mobile data, I would like to pull in additional data from Redshift about the user at the time of interaction with the mobile app.
However, I've read in some places that doing an actual database query through logstash isn't feasible, but you can use a dictionary file to do a lookup of each user.
I have two questions regarding this approach
Is there a limit to how large this lookup file can be? Mine would be < 500K records, so I'd imagine it would be fine?
Can the process of building the lookup file from Redshift tables be fully automated (ideally through AWS services), i.e. each night the lookup table is refreshed and published to Logstash, and then used for breakouts in Kibana?
The way we're currently doing it is processing a daily JSON file with a Lambda function, posting it to S3 and then reading it into a Redshift table. This data is then processed into sessions and joined with other tables to generate the final dataset to be used for visualization. This is currently done in Tableau but we are exploring other options (such as QuickSight, or possibly the ELK stack).
Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.
Thanks!
Logstash 7 has a jdbc_streaming filter plugin for dynamically enriching your events from a database, as well as a jdbc_static filter for data that changes rarely.
As you found, you can also use the translate filter. The documentation says they've tested "very large" datasets up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (e.g. cron) without restarting Logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be increased.
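A rough sketch of that nightly refresh in Python, assuming the job can reach Redshift with the redshift_connector driver and write the YAML dictionary wherever Logstash is configured to read it; the connection details, query and path are placeholders:

import yaml
import redshift_connector  # assumed driver

conn = redshift_connector.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="analytics", user="awsuser", password="...",      # placeholder
)
cur = conn.cursor()
cur.execute("SELECT user_id, user_segment FROM users")  # placeholder columns

# The translate filter accepts a flat key/value dictionary (YAML, JSON or CSV).
lookup = {str(user_id): segment for user_id, segment in cur.fetchall()}

with open("/etc/logstash/dictionaries/users.yml", "w") as f:  # path Logstash watches
    yaml.safe_dump(lookup, f)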

How to query data in AWS AppSync in a specific range then sort its result by another key?

I created a table named BlogAuthor in AWS DynamoDB with the following structure:
authorId | orgId | age | name
Later I need to make a query like this: get all authors from organization id = orgId123 with age between 30 and 50, then sort their names in alphabetical order.
I'm not sure it's possible to perform such a query in DynamoDB (later I'll use it in AppSync), so my first solution was to create an index (GSI) with partitionKey=orgId and sortKey=age (final name: orgId-age-index).
Next, when I query DynamoDB with partition key orgId=orgId123, sort key age in [30;50] and no filter, I get a list of authors. However, there is no way to sort that list by name with this query.
I tried another solution: create a new index with partitionKey=orgId and sortKey=name, then query (not scan) DynamoDB with partition key orgId=orgId123, an empty sort key value (because I only want to sort by name rather than look up a specific name), and a filter on age in the range [30;50]. This solution seems to work; however, I notice the filter is applied to the result list - for example, with a result list of 100 items, after applying the age filter there may be only 70 items remaining, or none at all, whereas I would always like it to return 100 items.
Could you please tell me if there is anything wrong with my approaches? Or is it even possible to make such a query in DynamoDB?
Another (small) question about connecting that table to an AppSync API: if it's not possible to perform such a query in DynamoDB, does that mean it's not possible in AppSync either?
You are not going to be able to do everything you want in a single DynamoDB query.
Option 1:
You can do what you want with your first approach (the orgId-age-index) as long as you are OK with sorting the objects by name on the client; a boto3 sketch is shown below. This would work for organizations with a relatively small number of people.
Pros:
Allows you to efficiently query the users in a particular organization within an age range.
Cons:
Results are not sorted by name on the server.
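For illustration, querying the orgId-age-index from the question with boto3 and sorting on the client (the table, attribute and index names come from the question; pagination is omitted):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("BlogAuthor")

resp = table.query(
    IndexName="orgId-age-index",
    KeyConditionExpression=Key("orgId").eq("orgId123") & Key("age").between(30, 50),
)
# The index cannot return the items ordered by name, so sort client-side.
authors = sorted(resp["Items"], key=lambda item: item["name"])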
Option 2:
Use your second index (partitionKey=orgId, sortKey=name) and page through the organization's users in name order, filtering on age.
Pros:
Allows you to paginate through the users of an organization ordered by name.
Cons:
You cannot efficiently get all users in an organization within an age range. You would effectively be scanning the index and would need multiple round trip calls.
Option 3:
A third option would be to stream information from DynamoDB into Elasticsearch using DynamoDB Streams and AWS Lambda. Once the data is in Elasticsearch, you can run much more advanced queries. You can see more information on the Elasticsearch search APIs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html.
Pros:
Much more powerful query engine.
Cons:
More overhead with the DynamoDB stream and the AWS Lambda function.

Deleting Data from DynamoDb Table automatically

Is there any kind of lifecycle retention period concept in DynamoDB?
I mean, is there any way to have data inside a table deleted automatically after some time, like the retention period we can set in S3?
Thanks,
DynamoDB has introduced a Time to Live (TTL) feature. You can create a numeric field and set its value to the time (in seconds since the epoch) when you want the record to be deleted.
DynamoDB will automatically delete the record shortly after the specified time (deletion is not instantaneous). Note that TTL has to be enabled on a per-table basis.
You can find more details at: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-how-to.html
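A short sketch of enabling TTL and writing an expiring item with boto3; the table and attribute names are placeholders:

import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on the table, pointing it at a numeric attribute of your choice.
dynamodb.update_time_to_live(
    TableName="my-table",  # placeholder
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write an item that should be removed roughly 30 days from now.
expires_at = int(time.time()) + 30 * 24 * 60 * 60
dynamodb.put_item(
    TableName="my-table",
    Item={
        "pk": {"S": "some-key"},               # placeholder key
        "expires_at": {"N": str(expires_at)},  # epoch seconds
    },
)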
No, there is no "retention" setting available in DynamoDB.
You could run a daily/monthly query that uses a date field to filter results, and use that output to determine which items to delete. This would need to be implemented in your own code.
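A rough sketch of that approach with boto3, assuming an ISO-formatted created_at attribute; the table, key and attribute names are placeholders:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("my-table")  # placeholder

# Find items older than the cutoff date (a scan here; a date-keyed index would be cheaper).
resp = table.scan(FilterExpression=Attr("created_at").lt("2020-01-01"))

# Delete them in batches; batch_writer handles chunking and retries.
with table.batch_writer() as batch:
    for item in resp["Items"]:  # pagination via LastEvaluatedKey omitted
        batch.delete_item(Key={"pk": item["pk"]})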
Some users choose to use separate tables to provide ageing. For example, create a separate table for each month. Then, delete old tables once they pass a certain age. However, your software would have to know how to handle multiple tables of data.
Examples:
Reference to monthly rotation of tables
Understand Access Patterns for Time Series Data