AWS Glue joining

I am new to AWS Glue and am trying to join two Redshift SQL queries, but I am not sure how to keep only selected fields, since my main table has more than 1,000 fields.
Below is the query I am trying to reproduce in Glue.
SELECT v.col1,
       v.col2,
       s.col3
FROM (
    SELECT col1,
           col2
    FROM t1
    WHERE col1 > 0
) v
LEFT JOIN (
    SELECT col1,
           MAX(col3) AS col3  -- col3 must be aggregated (or added to GROUP BY) for the grouping to be valid
    FROM t2
    WHERE col1 > 0
    GROUP BY col1
) s ON v.col1 = s.col1

If you are writing in Python, I would either use Spark SQL or PySpark's join functions.
For Spark SQL:
1) Convert each DynamicFrame to an Apache Spark DataFrame using the toDF() function.
2) Register each DataFrame as a Spark SQL table using createOrReplaceTempView().
Then run SQL much like what you posted above; see the sketch below.
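A minimal sketch of that flow, assuming a standard Glue job where both tables are available in the Glue Data Catalog; the database and table names ("mydb", "t1", "t2") are placeholders, not from the original post:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the two tables as DynamicFrames (catalog names are hypothetical).
dyf1 = glue_context.create_dynamic_frame.from_catalog(database="mydb", table_name="t1")
dyf2 = glue_context.create_dynamic_frame.from_catalog(database="mydb", table_name="t2")

# 1) Convert the DynamicFrames to Spark DataFrames.
df1 = dyf1.toDF()
df2 = dyf2.toDF()

# 2) Register the DataFrames as temporary SQL views.
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

# 3) Run the SQL; selecting only the needed columns here means the
#    1,000+ other fields never flow through the rest of the job.
result = spark.sql("""
    SELECT v.col1, v.col2, s.col3
    FROM (SELECT col1, col2 FROM t1 WHERE col1 > 0) v
    LEFT JOIN (SELECT col1, MAX(col3) AS col3
               FROM t2
               WHERE col1 > 0
               GROUP BY col1) s
      ON v.col1 = s.col1
""")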
OR
Use PySpark:
left_join = t1.join(t2, t1.col1 == t2.col1, how='left')
left_join = left_join.filter(t1.col1 > 0)  # then filter afterwards; t1.col1 avoids the ambiguous duplicate column name
left_join = left_join.select(t1.col1, t1.col2, t2.col3)  # keep only the fields you need
Would that work for you?

Related

How to add column with query folding using snowflake connector

I am trying to add a new column to a Power Query result that is the result of subtracting one column from another. According to the Power BI documentation, basic arithmetic is supported with query folding, but for some reason it shows a failure to query fold. I also tried simply adding a column populated with the number 1, and it still did not work. Is there some trick to getting query folding to work for a new column on Snowflake?
If the computation is based only on data from the source, then it can be computed during table import as a SQL statement:
SELECT col1, col2, col1 + col2 AS computed_total
FROM my_table_name
EDIT:
The problem with this solution is that a native SQL statement for Snowflake is only supported in Power BI Desktop, and I want to have this stored in a dataflow (i.e., the Power BI web client) for reusability and other reasons.
Option 1:
Create a view instead of a table at the source:
CREATE OR REPLACE VIEW my_view
AS
SELECT col1, col2, col1 + col2 AS computed_total
FROM my_table_name;
Option 2:
Add a computed column to the table:
ALTER TABLE my_table_name
ADD COLUMN computed_total NUMBER(38,4) AS (col1 + col2);

Join 2 tables results in query timeout

I have a few tables created in AWS Athena under "TestDB". These tables were created by running an AWS Glue crawler over the S3 buckets. I am trying to create a new table by joining two existing tables under "TestDB". It is a simple left outer join, as follows:
CREATE TABLE IF NOT EXISTS TestTab1 AS (
    SELECT *
    FROM (
        SELECT col1, col2, col3, col4
        FROM "TestDB"."tab1"
        WHERE partition_0 = '10-24-2021'
          AND substring(datetimestamp, 1, 10) = '2021-10-24'
    ) a
    LEFT OUTER JOIN (
        SELECT col1, col2, col3, col4
        FROM "TestDB"."tab2"
        WHERE partition_0 = '10-24-2021'
          AND substring(datetimestamp, 1, 10) = '2021-10-24'
    ) b ON (a.col1 = b.col1)
)
The query scans around 5 GB of data but times out after ~30 minutes, since that is the timeout limit. Other than requesting an increase to the timeout limit, is there any other way to create a join of two tables on AWS?
It's very hard to say from the information you provide, but it's probably down to the result becoming very big, or an intermediate result becoming big enough for the executors to run out of memory and spill to disk.
Does running just the SELECT (without the CREATE TABLE) work? You can also try running EXPLAIN SELECT … to get the query plan and see if that tells you anything.
Your query is unnecessarily complex, with multiple nested SELECT statements. Athena's query planner will probably be smart enough to rewrite it to something like the following, which is easier to read and understand:
CREATE TABLE IF NOT EXISTS TestTab1 AS
SELECT col1, col2, col3, col4
FROM "TestDB"."tab1" a LEFT JOIN "TestDB"."tab2" b USING (col1)
WHERE a.partition_0 = '10-24-2021'
AND b.partition_0 = '10-24-2021'
AND substring(a.datetimestamp, 1, 10) = '2021-10-24'
AND substring(b.datetimestamp, 1, 10) = '2021-10-24'

How to update multiple columns in the same UPDATE statement when one column depends on another column's new value in Redshift

I want to update multiple columns in the same UPDATE statement, where one column depends on another column's new value.
Example:
Sample data: col1 and col2 are the column names and test_update is the table name.
SELECT * FROM test_update;
col1 col2
col-1 col-2
col-1 col-2
col-1 col-2
update test_update set col1 = 'new', col2=col1||'-new';
SELECT * FROM test_update;
col1 col2
new col-1-new
new col-1-new
new col-1-new
What I need to achieve is that col2 is updated to 'new-new', since the updated value of col1 is 'new'.
I think it may not be possible in one SQL statement. If it is possible, how can I do it? If it is not, what is the best way to handle this in a data warehouse environment: running multiple updates, first on col1 and then on col2, or something else?
I hope my question is clear.
You cannot update the second column based on the result of updating the first column. However, this can be achieved in a single statement by "pre-calculating" the result you want and then updating based on that.
The following update using a join is based on the example provided in the Redshift documentation; it assumes the table has an id key column to join on:
UPDATE test_update
SET col1 = precalc.col1,
    col2 = precalc.col2
FROM (
    SELECT id,
           'new' AS col1,
           'new' || '-new' AS col2  -- build col2 from col1's *new* value
    FROM test_update
) precalc
WHERE test_update.id = precalc.id;

Python UDF in Redshift faster with a CTE than with a direct SELECT query

I have a function written in Python for Redshift, code below (it calculates business days between two dates, accounting for South African holidays and weekends):
CREATE OR REPLACE FUNCTION b_days (start_date timestamp, end_date timestamp)
RETURNS INTEGER IMMUTABLE as $$
from pandas import date_range
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday
The code is super quick when tested:
1) in a Jupyter Notebook;
2) running the function in a simple select: select b_days('2018-06-29','2018-07-02') returns 2 as the answer;
3) using a CTE:
with cte as (
    select id, sdate, edate
    from bigtable -- with a million rows
)
select id, sdate, edate, b_days(sdate, edate)
from cte
But it runs non-stop when I run:
select id, sdate, edate, b_days(sdate, edate)
from bigtable -- with a million rows
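For context, here is a minimal sketch of what the elided body of such a UDF might compute, as it would run in a notebook; the holiday rules are an illustrative subset, not the original poster's list:

from pandas import date_range
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday

class ZAHolidayCalendar(AbstractHolidayCalendar):
    # Illustrative subset of South African public holidays
    rules = [
        Holiday("New Year's Day", month=1, day=1),
        Holiday('Human Rights Day', month=3, day=21),
        Holiday('Freedom Day', month=4, day=27),
        Holiday("Workers' Day", month=5, day=1),
        Holiday('Day of Reconciliation', month=12, day=16),
    ]

def b_days(start_date, end_date):
    # Business days = weekdays ('B' frequency) minus public holidays
    holidays = ZAHolidayCalendar().holidays(start_date, end_date)
    weekdays = date_range(start_date, end_date, freq='B')
    return len(weekdays.difference(holidays))

print(b_days('2018-06-29', '2018-07-02'))  # prints 2, matching the example above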

UFT API TEST: Create SQL query based on values from previous step activity at run time

Steps to be performed in the UFT API test:
Get the JSON response from the REST activity added to the test flow.
Add an Open DB Connection activity.
Add a Select Data activity with the query string:
SELECT Count(*) FROM Table1 WHERE COL1 = 'XXXX' AND COL2 = ' 1234'
(here the COL2 value has a length of 7 characters, including spaces)
In the above query, the values in the WHERE clause are received dynamically at run time from the JSON response.
When I try to link the query value using Link to Data Source with a custom expression, e.g.:
SELECT COUNT(*) FROM Table1 WHERE COL1 =
'{Step.ResponseBody.RESTACTIVITYxx.OBJECT[1].COL1}' AND COL2 =
'{Step.ResponseBody.RESTACTIVITYxx.OBJECT[1].COL2}'
then the query is changed (excluding the spaces in COL2) to:
SELECT Count(*) FROM Table1 WHERE COL1 = 'XXXX' AND COL2 = '1234'
I even tried the Concatenate and Replace String activities, but the same thing happens.
Please kindly help.
You can use the StringConcatenation action to build the query string, then use that string as the query in the Database "Select Data" activity.