Is there any way to keep only one week of data in a Redshift table - amazon-web-services

I have a source where a file gets populated every day, and every day it is loaded into a Redshift table.
I want to keep only one week of data in the table; after one week the older data should be deleted.
Please suggest a way to do that.

A common method is:
Load each day's data into a separate table
Use CREATE VIEW to create a combined view of the past week's tables
For example:
CREATE VIEW data
AS
SELECT * FROM monday_table
UNION ALL
SELECT * FROM tuesday_table
UNION ALL
SELECT * FROM wednesday_table
...etc
Your users can simply use the View as a normal table.
Then, each day when new data arrives, DROP or TRUNCATE the oldest table and load the new data.
Either load the new data into a table with the same name as the one you dropped/truncated, or re-create the view to include the new table and exclude the dropped one.
There is no automatic process for these steps, but you could make them part of the script that runs your load process.
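As a rough sketch of that daily step, assuming the day tables and view from the example above and that the load is a COPY from S3 (the bucket path and IAM role below are placeholders):
-- Rotate the oldest day out and load today's file into its slot (sketch only)
TRUNCATE TABLE monday_table;   -- assume monday_table currently holds the oldest day

COPY monday_table
FROM 's3://my-bucket/daily/latest.csv'                       -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'   -- placeholder role
CSV;

-- The view defined above keeps working because it still references monday_table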

Related

Schedule the creation of a partitioned table overwriting an existing table in BigQuery GCP

Yesterday I scheduled the daily overwriting of a table. The new table will be partitioned, as is the one being overwritten... It did not run at the scheduled time, nor did it give an error... It just did not start.
My feeling is that it has to do with the partitioning option. I should mention that the casting of the field date_formatted that will be used as the partition field works fine.
As far as I know, when scheduling a query you can't use create or replace table T partitioned by column C as select...
You start from the select... clause, as shown in the image, and I don't know if the problem comes from there.
PS: I had no trouble scheduling an append to a day-partitioned table with this same procedure.
The destination table is in the same dataset.
If the very same query is scheduled to deliver the results to a table with the same name, but in a different dataset (located in the same project), it works.
By the way, the input table and the output table were never the same.
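For reference, the DDL form the question refers to looks roughly like this; the dataset, table and source column names are placeholders, and whether the scheduled-query editor accepts this form is exactly what is in doubt here:
CREATE OR REPLACE TABLE my_dataset.my_table
PARTITION BY date_formatted
AS
SELECT
  *,
  DATE(raw_timestamp) AS date_formatted   -- raw_timestamp is a hypothetical source column being cast
FROM my_dataset.source_table;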

How to backfill partitioned data in BigQuery?

I am trying to backfill data from GCP billing export table to another table say T1.
Both tables are partitioned.
The scheduled query below runs every day to get yesterday's data.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
Now I need to backfill the data, say for 15th May - how do I do that?
I tried the backfill feature with the below query, expecting the backfill utility to take the past date, i.e. May 15th, as the value of the @run_date parameter, but that didn't help.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = @run_date
The data for 15th May is pulled from the source table (gcp_billing_export_v1) but is populated against the current date in the destination table, i.e. May 15th data is populated against June 22nd in the destination table T1. Where am I going wrong?
Any guidance?
Looks like you're using ingestion partitioning.
You would need to create a new table with the partitioning you want, i.e. EventDate, and populate that new table with historical and new daily data - as you can't overwrite an existing partition.
Link here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table
As @Lemon already pointed out, you're using ingestion-time partitioned tables (both source and destination), so you need to understand how they work. Ingestion-time partitioned tables are different from regular partitioned tables.
From the documentation:
When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data.
This type of table has a pseudocolumn named _PARTITIONTIME. The value of this column is the ingestion time for each row.
Since you are using SELECT * FROM gcp_billing_export_v1, you are getting all the data but without the _PARTITIONTIME column. And when you save that result into the destination table, the _PARTITIONTIME column is set according to the destination table's own ingestion time.
Thus you end up with old data carrying the current date in _PARTITIONTIME.
To avoid this, your destination table needs to be either a normal table or a regular (column-based) partitioned table.
You also need an extra column to hold the timestamp value from the source's _PARTITIONTIME column. You can then create a regular partition on this new column.
Then, to get _PARTITIONTIME in your result set, you need to reference the column explicitly in your query:
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = @run_date
The above query will return all the data from the gcp_billing_export_v1 table with one extra column, ingestionTime.
Now you can backfill the data for 15th May and save it to the new table.
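As a minimal sketch, the new destination table could be created like this, partitioned on the copied ingestion timestamp (the dataset and table names are placeholders, and the year for 15th May is assumed):
CREATE TABLE my_dataset.T1_new
PARTITION BY DATE(ingestionTime)
AS
SELECT *, _PARTITIONTIME AS ingestionTime
FROM my_dataset.gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE '2022-05-15';   -- one-off backfill for 15th May (year assumed)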
You can also tweak the query below to achieve the same thing:
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE_ADD(@run_date, INTERVAL -1 DAY)
It will run daily as per your need. Now, if you want to pull data for 15th May, you have to schedule the backfill for 16th May (as per the WHERE clause).

How to insert Billing Data from one Table into another Table in BigQuery

I have two tables, both containing billing data from GCP, in two different regions. I want to insert one table into the other. Both tables are partitioned by day, and the larger one is being written to by GCP for billing exports, which is why I want to insert the data into the larger table.
I am attempting the following:
Export the smaller table to Google Cloud Storage (GCS) so it can be imported into the other region.
Import the table from GCS into Big Query.
Use Big Query SQL to run INSERT INTO dataset.big_billing_table SELECT * FROM dataset.small_billing_table
However, I am running into a lot of issues, as it won't just let me insert (there are repeated fields in the schema, etc.). An example of the dataset can be found here https://bigquery.cloud.google.com/table/data-analytics-pocs:public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1
Thanks :)
## Update ##
So the issue was exporting and importing the data with the Avro format and using the auto-detect schema when importing the table back in (Timestamps were getting confused with integer types).
Solution
Export the small table in JSON format to GCS, use GCS to do the regional transfer of the files, and then import the JSON file into a BigQuery table without schema auto-detect (i.e. specify the schema manually). Then you can use INSERT INTO with no problems.
I was able to reproduce your case with the example data set you provided. I used dummy tables, generated from the below queries, in order to corroborate the cases:
Table 1: billing_bigquery
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='BigQuery' limit 1000
Table 2: billing_pubsub
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='Cloud Pub/Sub' limit 1000
I will propose two methods for performing this task. However, I must point out that the target and the source table must have the same column names, at least for the ones you are going to insert.
First, I used the INSERT INTO method. However, I would like to stress that, according to the documentation, if your table is partitioned you must include the column names which will be used to insert new rows. Therefore, using the dummy data already shown, it will be as follows:
INSERT INTO `billing_bigquery` (billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits) # invoice, cost_type omitted
SELECT billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits
FROM `billing_pubsub`
Notice that for nested fields I just write down the field name, for instance service and not service.description, because the whole record will be inserted. Furthermore, I did not select all the columns of the target table, but every column I selected for the target table must also be present in the source table's selection.
For the second method, you can simply use the Query settings button to append the small_billing_table to the big_billing_table. In the BigQuery console, click More >> Query settings. When the settings window appears, go to Destination table, check Set a destination table for query results, and fill in the fields Project name, Dataset name and Table name (these describe the destination table). Subsequently, under Destination table write preference check Append to table, which according to the documentation:
Append to table — Appends the query results to an existing table
Then you run the following query:
SELECT * FROM <project.dataset.source_table>
After running it, the source table's data should be appended to the target table.

Alter sort and distribution for dependent tables

This is the query to change the sort and dist key in Redshift tables.
CREATE TABLE new_dummy
DISTKEY (id)
SORTKEY (account_id,created_at)
AS (SELECT * FROM dummy);
ALTER TABLE dummy RENAME TO old_dummy;
ALTER TABLE new_dummy RENAME TO dummy;
DROP TABLE old_dummy;
It throws the below error:
ERROR: cannot drop table old_dummy because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too.
So is it not possible to change the keys for dependent tables?
It appears that you have VIEWs that are referencing the original (dummy) table.
When a table is renamed, the VIEW continues to point to the original table, regardless of what it is named. Therefore, trying to delete the table results in an error.
You will need to drop the view before dropping the table. You can then recreate the view to point to the new dummy table.
So, the flow would be:
Create new_dummy and load data
Drop view
Drop dummy
Rename new_dummy to dummy
Create view
You might think that this is bad, but it's actually a good feature, because renaming a table will not break any views. The view automatically stays with the correct table.
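As a rough sketch of that flow, reusing the DDL from the question (the view name dummy_view is hypothetical):
CREATE TABLE new_dummy
DISTKEY (id)
SORTKEY (account_id, created_at)
AS (SELECT * FROM dummy);

DROP VIEW dummy_view;                            -- the dependent view
DROP TABLE dummy;
ALTER TABLE new_dummy RENAME TO dummy;
CREATE VIEW dummy_view AS SELECT * FROM dummy;   -- recreate it against the new table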
UPDATE:
Based on Joe's comment below, the flow would be:
CREATE VIEW ... WITH NO SCHEMA BINDING
Then, for each reload:
Create new_dummy and load data
Drop dummy
Rename new_dummy to dummy
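A minimal sketch of the late-binding view, assuming a schema of public and a view name of dummy_view (Redshift requires schema-qualified table names inside a WITH NO SCHEMA BINDING view):
CREATE VIEW public.dummy_view AS
SELECT * FROM public.dummy
WITH NO SCHEMA BINDING;

-- With this in place, dropping and renaming the underlying table no longer breaks the view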
This answer is based upon the fact that you have foreign key references within the table definition, which are not compatible with the process of renaming and dropping tables.
Given this situation, I would recommend that you load data as follows:
Start a transaction
DELETE all existing rows from the table (DELETE FROM)
Load data with INSERT INTO
End the transaction
This means you are totally reloading the contents of the table. Wrapping it in a transaction means that there is no period where the table will appear 'empty'.
However, this leaves the table in a bit of a messy state, requiring a VACUUM to delete the old data.
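A minimal sketch of that pattern, assuming the fresh rows come from a staging table (dummy_staging is a placeholder name):
BEGIN;
DELETE FROM dummy;                              -- rows are only marked for deletion
INSERT INTO dummy SELECT * FROM dummy_staging;  -- reload the full contents
COMMIT;

VACUUM dummy;    -- reclaim the space left by the deleted rows (runs outside the transaction)
ANALYZE dummy;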
Alternatively, you could:
TRUNCATE the table
Load data with INSERT INTO
TRUNCATE does not require a cleanup since it clears all data associated with the table (not just marking it for deletion). However, TRUNCATE immediately commits the transaction, so there will be a gap where the table will be empty.
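The same sketch with TRUNCATE (same staging-table assumption); remember that the TRUNCATE commits immediately, so the table is briefly empty:
TRUNCATE dummy;
INSERT INTO dummy SELECT * FROM dummy_staging;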

First time Updating a table

I was recently given permission to update a single table in our database, but this is not something I have done before and I do not want to mess anything up. I have tried searching online for something similar to what I want to do, with no success.
The table name is dbo.Player_Miles and it only has two columns of data Player_ID and Miles both of which are set as (int,null).
Currently there are about 300K records in this table, and I have a CSV file I need to use to update it. The file contains 500K records, so I need to be able to:
INSERT the new records (~250K records)
UPDATE the records that have new information (~200K records)
Leave untouched any record that has the same information (~50K records) (although updating those to the same values would not hurt, it would be a resource hog I would guess)
Also leave untouched any records currently in the table that are not in the update file (~50K records).
I am using SSMS 2008 but the Server is 2000.
You could approach this in stages...
1) Backup the database
2) Create a temporary SQL table to hold your update records
create table Player_Miles_Updates (   -- staging table matching the columns of dbo.Player_Miles
  Player_ID int not null,
  Miles int null)
3) Load the records from your text file into your temporary table
bulk insert Player_Miles_Updates
from 'c:\temp\myTextRecords.csv'
with
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
4) Begin a transaction
begin transaction
5) Insert your new data
insert into Player_Miles (Player_ID, Miles)
select Player_ID, Miles
from Player_Miles_Updates
where Player_ID not in (select Player_ID from Player_Miles)
6) Update your existing data
update pm
set pm.Miles = pmu.Miles
from Player_Miles pm
join Player_Miles_Updates pmu on pm.Player_ID = pmu.Player_ID
7) Select a few rows to make sure what you wanted to happen, happened
select *
from Player_Miles
where Player_ID in (1,45,86,14,83) -- use IDs that you have seen in the CSV file
8a) If all went well
commit transaction
8b) If all didn't go well
rollback transaction
9) Delete the temporary table
drop table Player_Miles_Updates
You should use SSIS (or DTS, which was replaced by SSIS in SQL Server 2005).
Use the CSV as your source and "upsert" the data to your destination table.
In SSIS there are different ways to get this task done.
An easy way would be to use a lookup task on Player_ID.
If there's a match update the value and if there's no match just insert the new value.
See this link for more information on lookup-pattern-upsert