Joining based on email within strings on AWS Athena

I have two S3 buckets that I am looking to join on Athena. In the first bucket, I have an email address in a CSV file with an email column.
sample#email.com
In the other bucket, I have a JSON file with nested email addresses used by the client. The way this has been set up in Glue means the data looks like this:
[sample#email.com;email#sample.com;com#email.sample]
I am trying to join the data by finding the email from the first bucket inside the string from the second bucket. I have tried:
REGEXP_LIKE(lower("emailaddress"), lower("emails"))
with no success. I have also tried:
select "source".*, "target".*
FROM "source"
inner join "target"
on "membername" = "first_name"
and "memberlastname" = "last_name"
and '%'||lower("emailaddress")||'%' like lower("emails")
Again with no success. I am clearly doing something wrong, but I cannot see where I am making the error.

It seems you need to reverse your like arguments:
-- sample data
WITH dataset (id, email) AS (
VALUES (1,'sample#email.com'),
(2,'non-present#email.com')
),
dataset2 (emails) as (
VALUES ('[sample#email.com;email#sample.com;com#email.sample]')
)
-- query
SELECT *
FROM dataset
INNER JOIN dataset2 on
lower(emails) like '%' || lower(email) || '%'
Output:
id | email            | emails
1  | sample#email.com | [sample#email.com;email#sample.com;com#email.sample]
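If exact matching matters (LIKE can also match when one address happens to be a substring of another), one option is to strip the brackets, split the string on ';', and test for exact array membership. A minimal sketch against the same sample CTEs as above, assuming Athena's Presto functions replace, split and contains:
SELECT *
FROM dataset
INNER JOIN dataset2
  ON contains(
       split(replace(replace(lower(emails), '[', ''), ']', ''), ';'),
       lower(email)
     )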

Related

How to query AWS Athena where data is in JsonSerDe format?

I need to query some data in AWS Athena. The source data in S3 is compressed JSON in .gz format. It was created with the parameter
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
If I just do a select *, there is one column that looks like this:
{userid={s=my_email#gmail.com}, timestamp=2022-07-21 10:00:00, appID={s=greatApp}, etc.}
I am trying to query like this:
with dataset as
(select * FROM "default"."my_table" limit 10)
select json_extract(item, '$.userid') as user
from dataset;
But I'm getting an error:
Expected: json_extract(varchar(x), JsonPath) , json_extract(json, JsonPath)
Is there something wrong with my query?
I got it. You just use "dot" notation to access the keys:
select item.userid.s as user,
item.timestamp,
item.appID.s as appID
from my_table limit 10;
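For reference, the same dot notation also works inside the CTE from the question, so the limit 10 sample can be kept as-is; a sketch, assuming item is the struct column shown above:
with dataset as
  (select * FROM "default"."my_table" limit 10)
select item.userid.s as user,
       item.appID.s  as appID
from dataset;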

Import CSV file in S3 bucket with semicolon-separated fields

I am using AWS Data Pipelines to copy SQL data to a CSV file in AWS S3. Some of the data has a comma between string quotes, e.g.:
{"id":123455,"user": "some,user" .... }
While importing this CSV data into DynamoDB, the comma is taken as the end of the field value, which results in errors because the data given in the mapping does not match the schema we have provided.
My solution is to separate our CSV fields with a ; (semicolon) while copying the data from SQL to the S3 bucket. That way, values within the quotes will be taken as one, and the data would look like this (note the blank space within the quoted string after the comma):
{"id" : 12345; "user": "some, user";....}
My stack looks like this:
- database_to_s3:
    name: data-to-s3
    description: Dumps data to s3.
    dbRef: xxx
    selectQuery: >
      select * FROM USER;
    s3Url: '#{myS3Bucket}/xxxx-xxx/'
    format: csv
Is there any way I can use a delimiter to separate fields with a ; (semicolon)?
Thank you!
Give AWS Glue a try; there you can marshal your data before inserting it into DynamoDB.

Snowflake table is not accepting null values in date field

I have one table in Snowflake into which I am performing a bulk load.
One of the columns in the table is a date, but the source table, which is on SQL Server, has null values in that date column.
The flow of data is as :
sql_server-->S3 buckets -->snowflake_table
I am able to run the Sqoop job on EMR, but I am not able to load the data into the Snowflake table, as it is not accepting null values in the date column.
The error is :
Date '' is not recognized File 'schema_name/table_name/file1', line 2, character 18 Row 2,
column "table_name"["column_name":5] If you would like to continue loading when an error is
encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
Can anyone help me see where I am going wrong?
Using the command below, you can see the values from the stage file:
select t.$1, t.$2 from #mystage1 (file_format => myformat) t;
Based on the data you can change your copy command as below:
COPY INTO my_table(col1, col2, col3) from (select $1, $2, try_to_date($3) from #mystage1)
file_format=(type = csv FIELD_DELIMITER = '\u00EA' SKIP_HEADER = 1 NULL_IF = ('') ERROR_ON_COLUMN_COUNT_MISMATCH = false EMPTY_FIELD_AS_NULL = TRUE)
on_error='continue'
The error shows that the dates are not arriving as nulls. Rather, they're arriving as blank strings. You can address this a few different ways.
The cleanest way is to use the TRY_TO_DATE function on your COPY INTO statement for that column. This function will return database null when trying to convert a blank string into a date:
https://docs.snowflake.com/en/sql-reference/functions/try_to_date.html#try-to-date
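As a quick illustration of the behaviour (hypothetical literal values, not from the question's data), TRY_TO_DATE returns NULL for an empty string instead of raising the "Date '' is not recognized" error that a plain date cast would:
select try_to_date('')           as empty_string_result,   -- returns NULL
       try_to_date('2021-01-15') as valid_date_result;     -- returns 2021-01-15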

Does AWS Athena support Sequence Files?

Has anyone tried creating an AWS Athena table on top of Sequence Files? As per the documentation it looks like it is possible. I was able to execute the create table statement below.
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
STORED AS sequencefile
location 's3://bucket/sequencefile/';
The statement executed successfully, but when I try to read data from the table it throws the error below:
Your query has the following error(s):
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://viewershipforneo4j/2017-09-26/000030_0 (offset=372128055, length=62021342) using org.apache.hadoop.mapred.SequenceFileInputFormat: s3://viewershipforneo4j/2017-09-26/000030_0 not a SequenceFile
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 9f0983b0-33da-4686-84a3-91b14a39cd09.
The sequence files are valid. The issue here is that no delimiter is defined, i.e. the row format delimited fields terminated by clause is missing.
If, in your case, tab is the column delimiter and each record is on its own row, it would be:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
row format delimited fields terminated by '\t'
STORED AS sequencefile
location 's3://bucket/sequencefile/';
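If that diagnosis is right, recreating the table with the delimiter clause and running a quick sanity check against it (same table as defined above) should return rows instead of the HIVE_CANNOT_OPEN_SPLIT error:
select account_id, receiver_id, session_index, start_epoch
from sample_sequence
limit 10;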

Amazon Athena: How to store query results while skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena includes the column headers in S3. How can I skip storing the header column names, since I have to make a new table from the results and the headers are repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
1. Ran an Athena query, with the output going to Amazon S3
2. Created a new table pointing to this output, based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
3. Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, later queries store even more data in that S3 directory, which confuses subsequent executions.
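As a side note, one way to avoid both the header row and the clutter in the query-results directory is an Athena CTAS statement, whose output files do not include a header. A sketch with hypothetical table and bucket names (format, field_delimiter and external_location are CTAS table properties):
CREATE TABLE results_no_header
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://your-bucket/results-no-header/'
) AS
SELECT field1, field2, field3
FROM tableName;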