Is there a way (using existing templates) to select data from multiple tables by joining them in AWS Data Pipeline? My use case requires me to combine data from multiple RDS tables and export the result to Redshift.
For example, RDS has the tables School, Student, and District, and I want to export data like:
select sch.Name, stu.Name, dis.Name
from School sch
inner join Student stu on stu.schoolid = sch.id
inner join District dis on dis.id = sch.districtid;
Is there a way in AWS Data Pipeline for me to select data from multiple tables?
There is a field named "Select Query" in the data node. You can write your transformation SQL there to pull data from different tables.
Please refer to the image below.
[Image: the "Select Query" field in the Data Node configuration]
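For illustration, an abridged pipeline definition along these lines should be possible. This is a sketch, not a verified template: the object names (rds_source, rds_mysql, redshift_cluster, school_student_district) are invented, and the schedule, resources, and database objects are omitted. The selectQuery value is the join from the question.

{
  "objects": [
    {
      "id": "rds_source",
      "type": "SqlDataNode",
      "database": { "ref": "rds_mysql" },
      "table": "School",
      "selectQuery": "select sch.Name, stu.Name, dis.Name from School sch inner join Student stu on stu.schoolid = sch.id inner join District dis on dis.id = sch.districtid"
    },
    {
      "id": "redshift_target",
      "type": "RedshiftDataNode",
      "database": { "ref": "redshift_cluster" },
      "tableName": "school_student_district"
    },
    {
      "id": "copy_to_redshift",
      "type": "RedshiftCopyActivity",
      "input": { "ref": "rds_source" },
      "output": { "ref": "redshift_target" },
      "insertMode": "TRUNCATE"
    }
  ]
}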
You can create a single pipeline with a separate activity for each table that you want to replicate.
This way, you won't have to write a join query to replicate multiple tables.
Answering an old question, so that it can help others still searching for something like this.
I want to find out what my table sizes are (in BigQuery).
However, I want to sum up the sizes of all tables that belong to a specific set of sharded tables.
So I need to find metadata that shows that a table is part of a set of sharded tables.
For a single table I can already do this (see: How to get BigQuery storage size for a single table):
select
  sum(size_bytes)/pow(2, 30) as size_gb
from
  <your_dataset>.__TABLES__
But here I can't see whether a table is part of a sharded set of tables.
This is what my Google Analytics sharded tables look like in BigQuery:
[Image: Google Analytics sharded tables in the BigQuery UI]
So somewhere there must be metadata indicating that tables named, for example, ga_sessions_20220504 belong to a sharded set ga_sessions_.
Where/how can I find that metadata?
I think you are exploring the right query. Most of the time, I use the following query to drill down on shards and their sizes:
SELECT
  project_id,
  dataset_id,
  table_id,
  ARRAY_REVERSE(SPLIT(table_id, '_'))[OFFSET(0)] AS shard_pt,
  DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_dt,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_in_gb
FROM
  `<project>.<dataset>.__TABLES__`
WHERE
  table_id LIKE 'ga_sessions_%'
ORDER BY
  4 DESC
Result (on some random GA dataset I have access to, FYI):
[Image: query result, one row per ga_sessions_ shard]
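If you want one total per shard set rather than one row per shard, a variation like the following should work. It is a sketch that assumes the _YYYYMMDD suffix convention, with <project>.<dataset> as a placeholder as above:

-- Sketch: total size per sharded set; everything before the trailing
-- 8-digit date suffix is treated as the set's prefix.
SELECT
  REGEXP_EXTRACT(table_id, r'^(.+_)\d{8}$') AS shard_prefix,
  COUNT(*) AS shard_count,
  ROUND(SUM(size_bytes) / POW(1024, 3), 2) AS total_size_gb
FROM
  `<project>.<dataset>.__TABLES__`
WHERE
  REGEXP_CONTAINS(table_id, r'_\d{8}$')
GROUP BY
  shard_prefix
ORDER BY
  total_size_gb DESC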
There is no metadata on sharded tables available via SQL.
Tables are displayed as sharded in the BigQuery UI when you do the following:
Create 2 or more tables that have the following characteristics (a small sketch follows the list):
exist in the same dataset
have the exact same table schema
share the same prefix
have a suffix of the form _YYYYMMDD (e.g. 20210130)
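For illustration, two hypothetical tables that the BigQuery UI would group into a single sharded set events_ (project, dataset, and table names are all invented):

-- Hypothetical example: same dataset, identical schema, same prefix,
-- _YYYYMMDD suffixes; the UI collapses these into one "events_" entry.
CREATE TABLE `<project>.<dataset>.events_20210129` (user_id STRING, ts TIMESTAMP);
CREATE TABLE `<project>.<dataset>.events_20210130` (user_id STRING, ts TIMESTAMP);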
These are something of a legacy feature; they were more commonly used with BigQuery's legacy SQL.
This blog post was very insightful on the topic:
https://mark-mccracken.medium.com/bigquery-date-sharding-vs-date-partitioning-cee3754f7900
Thanks for taking the time to read this!
I have multiple tables within an AWS Glue catalog database and want to create an ER diagram from that database.
It should contain all the fields and data types.
Is there a straightforward way to achieve this, like pointing a schema-design tool such as DbSchema at the Glue catalog?
I have data coming from multiple hotels. These hotels are not using the same naming convention for storing their order information. I have a predefined dataset created in BigQuery (called hotel_order). I want to map the data coming from the different hotels to this single dataset in GCP, so it is easier for me to do comparisons in BigQuery.
If a column name from hotel1 matches a column name in the BigQuery dataset, then BigQuery should load the data into that column; if the column names (from the hotel order data and the dataset in BigQuery) don't match, then the column in BigQuery should be null. How do I implement this mapping in GCP?
If you want to join tables together and show a null value when a match doesn't exist, you can do so using a LEFT JOIN.
Rough example:
SELECT main.*, Hotel_One.*
FROM hotel.orders AS main
LEFT JOIN hotel_number_one AS Hotel_One
  ON main.order_information = Hotel_One.order_information
It's difficult to give a more detailed answer without more details or a working example using dbfiddle.
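If the goal is mapping differently named source columns into the single hotel_order schema (rather than joining rows), a UNION ALL sketch may be closer to what the question asks. Every table and column name below is hypothetical:

-- Hypothetical sketch: each hotel's feed is SELECTed into the common
-- hotel_order column layout; columns a hotel does not provide become NULL.
SELECT order_id, guest_name, room_type
FROM hotel1_orders
UNION ALL
SELECT booking_ref AS order_id, customer AS guest_name,
       CAST(NULL AS STRING) AS room_type
FROM hotel2_orders;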
I have folder structure as following in S3
Data
  table1/output/table1.csv
  table2/output/table2.csv
  table3/output/table3.csv
My ideal goal is to have a Glue Crawler create 3 respective tables. Instead, what is created is 1 table called data with partitions table1, table2, table3, and output. I have messed around with various combinations on the configuration page, but still no luck. Any recommendations?
The AWS Redshift team recommends using TRUNCATE to clean up a large table.
I have a continuously running EC2 service that keeps adding rows to a table. I would like to apply some purging mechanism, so that when the cluster is near full it will automatically delete old rows (say, using the index column).
Is there some best practice for doing that?
Do I need to write my own code to handle that? (If so, is there already a Python script for this that I could use, e.g. in a Lambda function?)
A common practice when dealing with continuous data is to create a separate table for each month, e.g. sales_2018_01, sales_2018_02.
Then create a VIEW that combines the tables:
CREATE VIEW sales AS
  SELECT * FROM sales_2018_01
  UNION ALL
  SELECT * FROM sales_2018_02;
Then, create a new table each month and remove the oldest month from the View. This effectively gives a 12-month rolling view of the data.
The benefit is that data does not have to be deleted from tables (which would then require a VACUUM). Instead, the old table can simply be dropped, or kept around for historical reporting with a different View.
See: Using Time Series Tables - Amazon Redshift
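A sketch of the monthly rotation, using the hypothetical sales_YYYY_MM tables from above and assuming the window currently spans sales_2018_01 through sales_2018_03:

-- Create next month's table with the same structure (Redshift LIKE clause).
CREATE TABLE sales_2018_04 (LIKE sales_2018_01);

-- Redefine the view first, so it never references a dropped table;
-- the oldest month rotates out of the window.
CREATE OR REPLACE VIEW sales AS
  SELECT * FROM sales_2018_02
  UNION ALL
  SELECT * FROM sales_2018_03
  UNION ALL
  SELECT * FROM sales_2018_04;

-- Dropping the old table is immediate and needs no VACUUM.
DROP TABLE sales_2018_01;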