saving athena results to another table with partitions - amazon-web-services

From AWS Athena
I am trying to concatenate multiple tables then save it with partitoned key.
after running
select *
from t1
union all
select *
from t2
select *
from t3
create table on the console creates query like this,
create table db.table_name
with(
format='parquet',
external_location=...
) AS
select *
from t1
union all
select *
from t2
select *
from t3;
But I want to add partitoned by column. I've tried
adding partitioned by on top and bottom. Also saved query result then created new table from that using CREATE EXTERNAL TABLE command (this works but return empty row -> even after running MSCK REPAIR)
From https://aws.amazon.com/premiumsupport/knowledge-center/athena-create-use-partitioned-tables/ It seems like I need to save data into S3 by partitions so in bucket1 it will have bucket1/2021, bucket1/2022 if 'year' column is the partition column. Correct? If yes, is there efficient to create partitioned buckets?

I have successfully created new, partitioned tables by using this method:
CREATE TABLE my_table
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year']
)
AS
SELECT
...

Related

Athena insert data into new added column

Trying to insert data into a new column I added. Athena does not have an update table command. Is there anyway to do this without reloading the whole table?
I created a test table and then added the column doing this:
ALTER TABLE MikeTest ADD COLUMNS (monthNum int);
I want to update the column with this SQL statement:
month(date_parse("date", '%m/%d/%Y'))
Amazon Athena reads its data from Amazon S3. It is not possible to 'update' a table because this would require re-writing the files in S3.
You could create a new table with the additional column:
CREATE TABLE new_table
WITH (
external_location = 's3://my_athena_results/folder/',
format = 'Parquet',
write_compression = 'SNAPPY'
)
AS
SELECT
*,
month(date_parse("date", '%m/%d/%Y')) as month
from old_table
This will copy the data to a new location in S3, while populating the new column

Athena query return empty result because of timing issues

I'm trying to create and query the Athena table based on data located in S3, and it seems that there are some timing issues.
How can I know when all the partitions have been loaded to the table?
The following code returns an empty result -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
But when I add some delay, it works greate -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
time.sleep(3)
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
The following is the query for creating the table -
query_create_table = '''
CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
`time` string,
`user_advertiser_id` string,
`predictions` float
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://{bucket}/path/'
'''
app_query_create_table = query_create_table.format(bucket=bucket,
athena_db=athena_db,
athena_db_partition=athena_db_partition)
I would love to get some help.
The start_query_execution call only starts the query, it does not wait for it to complete. You must run get_query_execution periodically until the status of the execution is successful (or failed).
Not related to your problem per se, but if you create a table with CREATE TABLE … AS there is no need to add partitions with MSCK REPAIR TABLE … afterwards, there will be no new partitions after the table has just been created that way – because it will be created with all the partitions produced by the query.
Also, in general, avoid using MSCK REPAIR TABLE, it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/

how to insert/update data in sql database using azure databricks notebook jdbc

I got lots of example to append/overwrite table in sql from AZ Databricks Notebook. But no single way to directly update, insert data using query or otherway.
ex. I want to update all row where (identity column)ID = 1143, so steps which I need to taken care are
val srMaster = "(SELECT ID, userid,statusid,bloburl,changedby FROM SRMaster WHERE ID = 1143) srMaster"
val srMasterTable = spark.read.jdbc(url=jdbcUrl, table=srMaster,
properties=connectionProperties)
srMasterTable.createOrReplaceTempView("srMasterTable")
val srMasterTableUpdated = spark.sql("SELECT userid,statusid,bloburl,140 AS changedby FROM srMasterTable")
import org.apache.spark.sql.SaveMode
srMasterTableUpdated.write.mode(SaveMode.Overwrite)
.jdbc(jdbcUrl, "[dbo].[SRMaster]", connectionProperties)
Is there any other sufficient way to achieve the same.
Note : Above code is also not working as SQLServerException: Could not drop object 'dbo.SRMaster' because it is referenced by a FOREIGN KEY constraint. , so it look like it drop table and recreate...not at all the solution.
You can use insert using a FROM statement.
Example: update values from another table in this table where a column matches.
INSERT INTO srMaster
FROM srMasterTable SELECT userid,statusid,bloburl,140 WHERE ID = 1143;
or
insert new values to rows where one of the existing column value matches
UPDATE srMaster SET userid = 1, statusid = 2, bloburl = 'https://url', changedby ='user' WHERE ID = '1143'
or just insert multiple values
INSERT INTO srMaster VALUES
(1, 10, 'https://url1','user1'),
(2, 11, 'https://url2','user2');
In SQL Server, you cannot drop a table if it is referenced by a FOREIGN KEY constraint. You have to either drop the child tables before removing the parent table, or remove foreign key constraints.
For a parent table, you can use the below query to get foreign key constraint names and the referencing table names:
SELECT name AS 'Foreign Key Constraint Name',
OBJECT_SCHEMA_NAME(parent_object_id) + '.' + OBJECT_NAME(parent_object_id) AS 'Child Table'
FROM sys.foreign_keys
WHERE OBJECT_SCHEMA_NAME(referenced_object_id) = 'dbo' AND
OBJECT_NAME(referenced_object_id) = 'PARENT_TABLE'
Then you can alter the child table and drop the constraint by its name using the below statement:
ALTER TABLE dbo.childtable DROP CONSTRAINT FK_NAME;

How to update redshift column: simple text replacement

I have a large target table with columns (id, value). I want to update value='old' to value='new'.
The simplest way would be to UPDATE target SET value='new' WHERE value='old';
However, this deletes and creates new rows and is not recommended, possibly. So I tried to do a merge column update:
# staging
CREATE TABLE stage (LIKE target INCLUDING DEFAULTS);
INSERT INTO stage (SELECT id, value FROM target WHERE value=`old`);
UPDATE stage SET value='new' WHERE value='old'; # ??? how do you update value?
# merge
begin transaction;
UPDATE target
SET value = stage.value FROM stage
WHERE target.id = stage.id and target.distkey = stage.distkey; # collocated join?
end transaction;
DROP TABLE stage;
This can't be the best way of creating the table stage: I have to do all these UPDATE delete/writes when I update this way. Is there a way to do it in the INSERT?
Is it necessary to force the collocated join when I use CREATE TABLE LIKE?
Are you updating all the rows in the table?
If yes you can use CTAS (create table as) which is recommended method
Assuming you table looks like this
table1
id, col1,col2, value
You can use the following SQL to create a new table
CREATE TABLE tmp_table AS
SELECT id, col1,col2, 'new_value'
FROM table1;
After you verify data in tmp_table
DROP TABLE table1;
ALTER TABLE tmp_table RENAME TO table1;
If you are not updating all the rows you can use a filter to do a CTAS and insert the rest of the rows to the new table, let me know if you need more info if this is the case
CREATE TABLE tmp_table AS
SELECT id, col1,col2, 'new_value'
FROM table1
WHERE value = 'old'
INSERT INTO tmp_table SELECT * from table1;
Next step would be DROP the tmp table and rename table1
Update: Based on your comment you can do the following, let me know if this solves your case.
This method basically creates a new table to replace your existing table.
I have used some of your code
CREATE TABLE stage (LIKE target INCLUDING DEFAULTS);
INSERT INTO stage SELECT id, 'new' FROM target WHERE value=`old`;
Above INSERT inserts rows to be updated with 'new', no need to run an UPDATE after this.
Bring unchanged rows
INSERT INTO stage SELECT id, value FROM target WHERE value!=`old`;
After this point you have target table which is your original table intact
stage table will have both sets of rows, updated rows with 'new' value and rows you did not want to change
To replace your target with stage
DROP TABLE target;
or to keep it further verification
ALTER TABLE target RENAME TO target_old;
ALTER TABLE stage RENAME TO target;
From a redshift developer:
This case doesn't require an upsert, or update+insert, and it is fine to just run the update:
UPDATE target SET value='new' WHERE value='old';
Another way would be to INSERT the rows you need and DELETE the other rows, but that's unnecessarily complicated.

How can I check the partition list from Athena in AWS?

I want to check the partition lists in Athena.
I used query like this.
show partitions table_name
But I want to search specific table existed.
So I used query like below but there was no results returned.
show partitions table_name partition(dt='2010-03-03')
Because dt contains hour data also.
dt='2010-03-03-01', dt='2010-03-03-02', ...........
So is there any way to search when I input '2010-03-03' then it search '2010-03-03-01', '2010-03-03-02'?
Do I have to separate partition like this?
dt='2010-03-03', dh='01'
And show partitions table_name returned only 500 rows in Hive. Is the same in Athena also?
In Athena v2:
Use this SQL:
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
(see the official aws docs)
In Athena v1:
There is a way to return the partition list as a resultset, so this can be filtered using LIKE. But you need to use the internal information_schema database like this:
SELECT partition_value
FROM information_schema.__internal_partitions__
WHERE table_schema = '<DB_NAME>'
AND table_name = '<TABLE_NAME>'
AND partition_value LIKE '2010-03-03-%'