Querying S3 inventory details in Athena

I have S3 inventory details in an S3 bucket and I am querying them through Athena.
My first two columns are shown below:
bucket key
bke-p0d-bke-lca-data dl/xxxxxx/plant/archive/01-01-2019/1546300856.json
bke-pod-bke-lca-data dl/xxxx/plant/archive/01-01-2019/1546300856.json
bke-pod-bke-lca-data dl/xxx/plant/archive/01-01-2019/1546300856.json
I need to split the key information as shown below:
bucket Categ Type Date File
bke-pod-bke-lca-data xxxxxx archive 01/01/2019 1546300856.json
bke-pod-bke-lca-data xxxx working 01/01/2019 1546300856.json
bke-pod-bke-lca-data xxx archive 01/01/2019 1546300856.json
I tried substr but it didn't work.
How do I split the key based on /?

The Presto 0.172 documentation, 6.8. String Functions and Operators, has:
split_part(string, delimiter, index)
Splits string on delimiter and returns the field index. Field indexes start with 1. If the index is larger than the number of fields, then null is returned.
So, you should be able to use something like this (the table name in the FROM clause is a placeholder for your inventory table):
SELECT
  bucket,
  split_part(key, '/', 2) as category,
  split_part(key, '/', 4) as type,
  split_part(key, '/', 5) as date,
  split_part(key, '/', 6) as file
FROM your_inventory_table

Related

How to move partitioned BigQuery tables to a GCS bucket as multiple files with the partition date in the file name, instead of the default numbers?

I'm trying to move BigQuery table data into GCS as multiple files (Avro/Parquet) with the partition date in the file name (not files named with the current datetime()).
I have tried the query below, and I could see that it inserts only current date() or current datetime(). It also extracts everything as one single file. I need multiple files based on the partition date.
EXPORT DATA OPTIONS(
  uri='gs://test/'||Currentdate()||'/_*.avro',
  format='avro',
  overwrite=true) AS
SELECT * from test_table
Instead of the current date, how can I add _PARTITIONDATE to the file name?
I have seen a similar question asked a few years back:
How can I export data from a big single non-partitioned table to Google Cloud Storage as Date Partitioned files?
But the solution was like this:
Query the original table by the column you want to partition and set the new table's desired partition as the destination. Do this as many times as the number of partitions you want.
bq query --allow_large_results --replace --noflatten_results \
  --destination_table 'mydataset.partitionedtable$20160101' \
  'SELECT stn,temp from [mydataset.oldtable] WHERE mo="01" AND da="01" limit 100'
In my case I have 100 days of partitions, and querying 100 times is not an optimal solution.
A possible solution in this case:
Write a Python script using the Python BigQuery client
Execute your query in the script
Group the results in Python by the partition date
You will have, for example, a structure like this:
'2022-05-22' => [
  {
    key1: value1,
    key2: value2,
  },
  {
    key1: value1,
    key2: value2,
  }
  ...
]
For each group, generate the file name based on the group's date key
Export the group's values to a GCS file (a sketch of this approach follows)
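A minimal Python sketch of that approach, using the google-cloud-bigquery and google-cloud-storage clients. The project, dataset, table, and bucket names are placeholders, and it writes one CSV per partition date for brevity (Avro/Parquet output would need an extra serialization step):
from collections import defaultdict
import csv
import io

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
bucket = storage.Client().bucket("my-export-bucket")  # placeholder bucket name

# Pull the partition date alongside the data; _PARTITIONDATE is the
# ingestion-time pseudo-column mentioned in the question.
query = """
    SELECT _PARTITIONDATE AS partition_date, *
    FROM `my-project.my_dataset.test_table`
"""
rows = bq_client.query(query).result()

# Group the rows by their partition date.
groups = defaultdict(list)
for row in rows:
    groups[row["partition_date"].isoformat()].append(dict(row))

# For each group, build the file name from the date and upload it to GCS.
for partition_date, records in groups.items():
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    bucket.blob(f"test/{partition_date}/data.csv").upload_from_string(buffer.getvalue())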

Joining based on email within strings on AWS Athena

I have two S3 buckets that I am looking to join on Athena. In the first bucket, I have an email address in a CSV file with an email column.
sample#email.com
In the other bucket, I have a JSON file with nested email addresses used by the client. The way this has been set up in Glue means the data looks like this:
[sample#email.com;email#sample.com;com#email.sample]
I am trying to join the data by finding the email from the first bucket inside the string from the second bucket. I have tried:
REGEXP_LIKE(lower("emailaddress"), lower("emails"))
with no success. I have also tried:
select "source".*, "target".*
FROM "source"
inner join "target"
on "membername" = "first_name"
and "memberlastname" = "last_name"
and '%'||lower("emailaddress")||'%' like lower("emails")
With no success. I am doing something wrong, and I cannot see where I am making the error.
It seems you need to reverse your like arguments:
-- sample data
WITH dataset (id, email) AS (
VALUES (1,'sample#email.com'),
(2,'non-present#email.com')
),
dataset2 (emails) as (
VALUES ('[sample#email.com;email#sample.com;com#email.sample]')
)
-- query
SELECT *
FROM dataset
INNER JOIN dataset2 on
lower(emails) like '%' || lower(email) || '%'
Output:
id email emails
1 sample#email.com [sample#email.com;email#sample.com;com#email.sample]

Import CSV file in S3 bucket with semicolon-separated fields

I am using AWS Data Pipeline to copy SQL data to a CSV file in AWS S3. Some of the data has a comma inside quoted strings, e.g.:
{"id":123455,"user": "some,user" .... }
While importing this CSV data into DynamoDB, the comma is taken as the end of the field value. This results in errors, as the data given in the mapping does not match the schema we have provided.
My solution is to separate the CSV fields with a ; (semicolon) while copying the data from SQL to the S3 bucket. That way, values within quotes will be taken as one field, and the data would look like this (note the blank space inside the quoted string after the comma):
{"id" : 12345; "user": "some, user";....}
My stack looks like this:
- database_to_s3:
    name: data-to-s3
    description: Dumps data to s3.
    dbRef: xxx
    selectQuery: >
      select * FROM USER;
    s3Url: '#{myS3Bucket}/xxxx-xxx/'
    format: csv
Is there any way I can use a delimiter to separate fields with a ; (semicolon)?
Thank you!
Give AWS Glue a try; there you can marshal your data before inserting it into DynamoDB.
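A minimal AWS Glue (PySpark) sketch of that idea; the S3 path and DynamoDB table name below are placeholders, and it assumes a semicolon-separated file with a header row and double-quoted values:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the semicolon-separated CSV; quoted values keep any embedded commas.
users = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://myS3Bucket/xxxx-xxx/"]},  # placeholder path
    format="csv",
    format_options={"separator": ";", "withHeader": True, "quoteChar": '"'},
)

# Write the marshalled rows straight to DynamoDB (placeholder table name).
glue_context.write_dynamic_frame.from_options(
    frame=users,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "USER"},
)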

Amazon S3 Select query: column values get left-shifted in CSV output when we query a non-existent key

I have a JSON file in S3 of the format:
{
"A":"a",
"C":{"C1":"c1","C2":"c2"},
"E":"e"
}
And when I query like this:
select S3Object.A,S3Object.C1,S3Object.C2,S3Object.E from S3Object
I get below in CSV output:
A,C1,C2,E
a,e
I understand that the correct query is below:
select S3Object.A,S3Object.C.C1,S3Object.C.C2,S3Object.E from S3Object
and that will give me the output
A,C.C1,C.C2,E
a,c1,c2,e
But when querying a non-existent column, why does the value e get shifted to appear under the C1 column instead of the E column?
I could not find any documentation from Amazon AWS about the behavior when querying a non-existent column. For example, with this query:
select S3Object.A,S3Object.D,S3Object.E from S3Object
The output being:
A,D,E
a,e
Is there any way to:
1. Hide D from the column header when it doesn't exist in the JSON source?
A,E
a,e
2. Or leave the field empty under the column header D?
A,D,E
a,,e
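For reference, a minimal boto3 sketch that runs the corrected query shown above; the bucket and key names are placeholders:
import boto3

s3 = boto3.client("s3")

# Run the corrected S3 Select query against the JSON document.
response = s3.select_object_content(
    Bucket="my-bucket",         # placeholder
    Key="path/to/object.json",  # placeholder
    ExpressionType="SQL",
    Expression="select S3Object.A, S3Object.C.C1, S3Object.C.C2, S3Object.E from S3Object",
    InputSerialization={"JSON": {"Type": "DOCUMENT"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; print the CSV records.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")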

Amazon Athena: How to store results after querying while skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena includes the column headers in S3. How can I skip storing the header column names, since I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
  `field1` string,
  `field2` string,
  `field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Ran an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, subsequent queries store even more data in that S3 directory, which confuses later executions.
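A hedged boto3 sketch of the first step, using an explicit output prefix so each run's results land in a predictable location; the query, database, and bucket names are placeholders:
import time

import boto3

athena = boto3.client("athena")

# Step 1: run the query, sending results to an explicit S3 prefix.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM source_table",           # placeholder query
    QueryExecutionContext={"Database": "my_database"},   # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/run-1/"},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish before pointing a new table at the output.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Steps 2 and 3 are the CREATE EXTERNAL TABLE over that prefix and the
# WHERE <datefield> NOT LIKE '#%' query described above.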