AWS Timestream: Unable to ingest records into AWS Timestream

As you all know, AWS Timestream was made generally available last week.
Since then, I have been experimenting with it and trying to understand how it models and stores data.
I am facing an issue ingesting records into Timestream.
I have certain records dated 23rd April 2020. When I try to insert these records into a Timestream table, I get a RecordRejected error.
According to this link, a record is rejected if it has the same dimensions and timestamp as an existing record, or if its timestamp is outside the retention period of the table's memory store.
I have set the retention period of my table's memory store to 12 months, so, according to the documentation, only records with a timestamp more than 12 months old should be rejected.
However, the record mentioned above gets rejected even though its timestamp is within 12 months of now.
On investigating further, I noticed that records with today's date (5th Oct 2020) are ingested successfully, but records dated 30 days earlier, i.e. 5th September 2020, are not. To confirm this, I also tried inserting records dated 6th September and a few other dates between 5th September and today; all of these were inserted successfully.
Could somebody explain why I am unable to insert records whose timestamps are within the memory store's retention period? Timestream only lets me insert records that are, at most, about 30 days old.
I would also like to know whether there is a way to insert historical data directly into the magnetic store. The memory store retention period may not be sufficient for my use case, and I may need to insert data that is 2 years old or more. I understand this is not a classic use case for Timestream, but I am still curious to know.
I am stuck on this issue and would really appreciate some help.
Thank you in advance.

I had a very similar issue, and for me it turned out that I had to set the Memory Store Retention Period to 8766 hours - which is slightly MORE than one year (8766 hours works out to 365.25 days). I've no clue why that is, or why it works, but it worked for me when importing older data.
PS: I'm pretty sure it's a bug in Timestream.
PPS: I found the value by using the "default" set in the AWS console. No other value worked for me.
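For completeness, the same retention setting can be applied outside the console. A minimal sketch using boto3 (the database and table names are placeholders):

import boto3

# Sketch: raise the memory store retention so that year-old timestamps are accepted.
# 8766 hours = 365.25 days, i.e. just over a calendar year.
client = boto3.client("timestream-write", region_name="us-east-1")

client.update_table(
    DatabaseName="my_database",  # placeholder
    TableName="my_table",        # placeholder
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 8766,
        "MagneticStoreRetentionPeriodInDays": 73000,
    },
)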

Timestream loads data into the memory store only if the timestamp is within the timespan of its retention period. So if the retention period is 1 day, the timestamp can't be more than 1 day ago.
AWS TimeStream: Records that are older than one day are rejected
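To make the failure mode concrete, here is a minimal sketch (Python with boto3; the table, dimension and measure names are made up) of a write whose timestamp falls outside the memory store window, and of how the per-record rejection reasons can be read back:

import time
import boto3

client = boto3.client("timestream-write", region_name="us-east-1")

# A record timestamped ~40 days in the past, i.e. outside a 30-day memory store window.
record = {
    "Dimensions": [{"Name": "device_id", "Value": "sensor-001"}],
    "MeasureName": "temperature",
    "MeasureValue": "21.5",
    "MeasureValueType": "DOUBLE",
    "Time": str(int(time.time() * 1000) - 40 * 24 * 3600 * 1000),
    "TimeUnit": "MILLISECONDS",
}

try:
    client.write_records(
        DatabaseName="my_database", TableName="my_table", Records=[record]
    )
except client.exceptions.RejectedRecordsException as err:
    # Each rejected record carries a Reason, e.g. a timestamp outside the
    # memory store retention window.
    for rejected in err.response["RejectedRecords"]:
        print(rejected["RecordIndex"], rejected["Reason"])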

Related

Why is this happening in the BigQuery Sandbox environment? It only creates empty partitioned tables in BigQuery

It's the same for me as well: when I run the DDL statement below to create a partitioned table, it creates an empty partitioned table, and I'm not sure why. This is happening in the BigQuery Sandbox environment.
Could someone please explain why this is happening?
CREATE OR REPLACE TABLE
  stack_it.questions_2018_clustered
PARTITION BY DATE(creation_date)
CLUSTER BY tags
AS
SELECT id, title, accepted_answer_id, creation_date, answer_count, comment_count, favorite_count, view_count, tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE creation_date BETWEEN '2018-01-01' AND '2019-01-01';
As mentioned in this documentation, in a BigQuery Sandbox environment tables and partitions expire after 60 days, and this limit cannot be increased unless you upgrade out of the Sandbox. This 60-day limit is the reason for the empty partitioned table.
The partition expiration date is calculated independently for each partition, based on the partition time. Since the new table is created with dates that are more than 60 days old (data from 2018), any partitions for those dates are dropped immediately. I tested this behaviour by creating a table with recent dates (within 60 days), and the new partitioned table was populated with the data as expected. For more information on partition expiration, refer here.
To remove the BigQuery Sandbox limit, I would suggest you upgrade your project. After you upgrade from the Sandbox you can still use the free tier, but you can incur charges. To manage BigQuery quotas, consider setting up cost controls. More info on upgrading from the Sandbox and updating expiration periods can be found here.
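If you want to confirm what expiration the Sandbox applied, a small sketch with the BigQuery Python client (using the dataset and table names from the question) can print the partition expiration on the table and the dataset default:

from google.cloud import bigquery

# Sketch: inspect the partition expiration the Sandbox applied.
client = bigquery.Client()

dataset = client.get_dataset("stack_it")
table = client.get_table("stack_it.questions_2018_clustered")

print("dataset default partition expiration (ms):", dataset.default_partition_expiration_ms)
if table.time_partitioning:
    print("table partition expiration (ms):", table.time_partitioning.expiration_ms)
print("rows currently in the table:", table.num_rows)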

ETL Job in AWS Glue for Historical Data

I have a client who sends us data each month. However, each month's file contains all of their historical data as well as the new data from the current month. For example, in November 2020, the client sent us data from September 2018 - November 2020 inclusive. The file they will send in December 2020 will include data from September 2018 - December 2020 inclusive.
I want to perform an ETL job in Glue for only the new data (i.e. the data from the current month) each month. For example, I would like to extract just the December 2020 data from a CSV file that contains data from September 2018-December 2020. I have tried the following approaches to no avail:
Making use of job bookmarking: when enabled, the job will "ignore" data files that it has previously processed, but it still performs the ETL task on the entire new data file each month, since each month's file is new to the bookmark.
Making use of the SplitRows class: while this can accomplish what I am looking for, it would require that I create a separate job for each month and year, since the comparison-dict would have to be tailored to the month of interest. If possible, I would like to avoid having to do this.
While I am looking for solutions using AWS Glue, if none exist, I am open to making use of other AWS services (e.g. AWS DataPipeline) if necessary.
Thanks in advance for any assistance you can provide!
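Not a full answer, but one way to avoid a per-month comparison-dict is to compute the target month at run time and filter the DynamicFrame with Glue's Filter transform. A minimal sketch, under the assumption of a Glue catalog table named monthly_feed with a string date column creation_date (both names are hypothetical, not the asker's schema):

import datetime

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

# Sketch: keep only the rows belonging to the month the job runs in.
glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="client_data", table_name="monthly_feed"  # hypothetical names
)

current_month = datetime.date.today().strftime("%Y-%m")  # e.g. "2020-12"

# Rows whose creation_date starts with the current year-month are the new data.
new_rows = Filter.apply(
    frame=dyf,
    f=lambda row: str(row["creation_date"]).startswith(current_month),
)

print("rows kept for this month:", new_rows.count())

The month string could also be passed in as a job parameter instead of being derived from the run date, which makes backfills easier.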

Why are there random dates data on intraday tables in Bigquery?

I have set up a Google Analytics to BigQuery daily export for one of our views. As I understand it, the intraday table, which is populated three times a day, is deleted once the corresponding ga_sessions_ daily table is populated.
Recently I observed data for random dates in the intraday tables.
Looking at the logs in Stackdriver, I don't see any anomaly.
Could someone please explain this case?
Please refer to this image.
The BigQuery UI only proposes the existing partitions for selection. If you have no data for a given date, there is no partition, and thus no proposal in the GUI.
It can seem strange, but it's useful!

How to get expired table data in BigQuery, if the expiration was more than two days ago?

I have a process in which I get table data in BigQuery on a daily basis. I need some old table data, but unfortunately the tables have expired, and they expired more than two days ago. I know we can get table data back if it was deleted less than two days ago, but is that possible for an expired table when more than two days have passed?
I tried using a timestamp from two days back with the bq tool, but I need data that was deleted more than two days ago.
GCP Support here!
Actually, if you read through the SO question linked by niczky12, and as stated in the documentation:
It's possible to restore a table within 2 days of deletion. By leveraging snapshot decorator functionality, you may be able to reference a table prior to the deletion event and then copy it. Note the following:
You cannot reference a deleted table if you have already created a new table with the same name in the same dataset.
You cannot reference a deleted table if you deleted the dataset that housed the table, and you have already created a new dataset with the same name.
At this point, unfortunately it is impossible to restore the deleted data.
BigQuery tables don't necessarily expire in 2 days; you can set the expiration to whatever you like:
https://cloud.google.com/bigquery/docs/managing-tables#updating_a_tables_expiration_time
Once they have expired, there's no way to retrieve the table unless you have snapshots in place. In that case, you can restore a snapshot and use it to get the data you want. See this SO question on how to do that:
How can I undelete a BigQuery table?
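For reference, the copy-a-snapshot approach looks roughly like this with the BigQuery Python client (the table IDs and the two-day offset are placeholders, and it only works while you are still inside the recovery window):

import time

from google.cloud import bigquery

client = bigquery.Client()

table_id = "my_project.my_dataset.my_table"                       # placeholder
recovered_table_id = "my_project.my_dataset.my_table_recovered"   # placeholder

# Snapshot decorator: table@<milliseconds since epoch>, here roughly 2 days ago.
snapshot_ms = int(time.time() * 1000) - 2 * 24 * 3600 * 1000
snapshot_table_id = "{}@{}".format(table_id, snapshot_ms)

job = client.copy_table(snapshot_table_id, recovered_table_id)
job.result()  # wait for the copy job to finish
print("Restored", snapshot_table_id, "to", recovered_table_id)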
To add to this for future searchers: I was able to follow the explanation below on Medium and restore data that was still there 7 days ago.
When I tried a date that was out of range, the Cloud Shell in the UI actually returned the maximum time you can go back, in epoch milliseconds (7 days). Just type that into the converter below and add 1 or 2 hours and you should be good. Don't copy the exact value the console provides, as it is already outdated by the time it's printed.
https://medium.com/@dhafnar/how-to-instantly-recover-a-table-in-google-bigquery-544a9b7e7a8d
https://www.epochconverter.com/
And make sure to set future tables to never expire (or to an expiration date you know)! This can be set in the table details, and also at the dataset level, in the Cloud Console environment for BigQuery.
(as of 2022, at least) In general, you can recover data from BQ tables for 7 days via time travel. See the GCP doc:
https://cloud.google.com/bigquery/docs/time-travel
and the related:
How can I undelete a BigQuery table?
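A quick sketch of such a time-travel read via the BigQuery Python client (the table name and the 6-day offset are placeholders; it only works while the 7-day window has not passed):

from google.cloud import bigquery

client = bigquery.Client()

# Read the table as it was 6 days ago; time travel covers at most the last 7 days.
query = """
SELECT *
FROM `my_project.my_dataset.my_table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 DAY)
"""

for row in client.query(query).result():
    print(row)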

Timezone related issues in BigQuery (for partitioning and query)

We have a campaign management system in which we create and run campaigns on various channels. When a user clicks or accesses any of the ads (as part of a campaign), the system generates a log. Our system is hosted in GCP, and the logs are exported to BigQuery using the ‘Exports’ feature.
In BigQuery, the log table is partitioned on the ‘timestamp’ field (the time the log was generated). We understand that BigQuery stores timestamps in UTC, so the partitions are also based on UTC time.
Using this log table, we need to generate daily reports, for example the number of impressions per day per campaign, and we need to show these reports in ET (Eastern) time.
Because the BigQuery table is partitioned on UTC time, a query for an ET day would potentially need to scan multiple partitions. Has anyone addressed this issue, or does anyone have suggestions to optimise the storage and queries so that they take full advantage of BigQuery's partitioning feature?
We are planning to use GCP Data Studio for the reports.
BigQuery should be smart enough to filter for the correct timezones when dealing with partitions.
For example:
SELECT MIN(datehour) time_start, MAX(datehour) time_end, ANY_VALUE(title) title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE DATE(datehour) = '2018-01-03'
5.0s elapsed, 4.56 GB processed
For this query we processed the 4.56GB in the 2018-01-03 partition. What if we want to adjust for a day in the US? Let's add this in the WHERE clause:
WHERE DATE(datehour, "America/Los_Angeles") = '2018-01-03'
4.4s elapsed, 9.04 GB processed
Now this query is automatically scanning 2 partitions, as it needs to go across days. For me this is good enough, as BigQuery is able to automatically figure this out.
But what if you wanted to permanently optimize for one timezone? You could create a generated, shifted DATE column - and partition the table by that column instead.
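A sketch of that approach, run through the BigQuery Python client (the dataset, table names and the America/New_York timezone are placeholders): materialize a DATE column already shifted to the reporting timezone and partition on it, so that a filter on local_date prunes to one partition per local day.

from google.cloud import bigquery

client = bigquery.Client()

# Sketch: rebuild the log table partitioned on a timezone-shifted date column.
# Dataset/table names and the timezone are placeholders.
ddl = """
CREATE OR REPLACE TABLE my_dataset.logs_local
PARTITION BY local_date AS
SELECT
  *,
  DATE(`timestamp`, "America/New_York") AS local_date
FROM my_dataset.logs
"""

client.query(ddl).result()  # filters on local_date now prune partitions directly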