Athena CTAS replacing null values in tables with \N

When I use Athena CTAS to generate CSV files, null values in the Athena table are replaced by "\N".
How do I get it to just leave these values as empty columns?
The CTAS query I'm using is something like this:
CREATE TABLE table_name WITH (format = 'TEXTFILE', field_delimiter=',', external_location='s3://bucket_name/location') AS SELECT * FROM "db_name"."src_table_name";
Am I doing something wrong?

\N is the default NULL token for LazySimpleSerDe, and unfortunately CTAS does not expose any mechanism for changing it.
If you'd rather have empty fields for your NULL values, you have to ensure they are all empty strings, e.g. … AS SELECT COALESCE(col1, ''), COALESCE(col2, ''), …, as sketched below.
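As a minimal sketch (col1 and col2 are placeholders for your actual column names), the CTAS query would become:
CREATE TABLE table_name
WITH (format = 'TEXTFILE', field_delimiter = ',', external_location = 's3://bucket_name/location')
AS SELECT COALESCE(col1, '') AS col1, COALESCE(col2, '') AS col2
FROM "db_name"."src_table_name";
Note that COALESCE(col1, '') only type-checks for varchar columns; for other types you would need to CAST the column to varchar first.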

Related

Snowflake table is not accepting null values in date field

I have a table in Snowflake into which I am performing a bulk load using the COPY command.
One of the columns in the table is a date, but the source table, which is on SQL Server, has null values in that date column.
The flow of data is as :
sql_server-->S3 buckets -->snowflake_table
The Sqoop job on EMR works, but I am not able to load the data into the Snowflake table, as it does not accept null values in the date column.
The error is :
Date '' is not recognized File 'schema_name/table_name/file1', line 2, character 18 Row 2,
column "table_name"["column_name":5] If you would like to continue loading when an error is
encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
Can anyone help me figure out what I am missing?
Using the command below, you can see the values in the stage file:
select t.$1, t.$2 from @mystage1 (file_format => myformat) t;
Based on the data, you can adjust your COPY command as below:
COPY INTO my_table(col1, col2, col3) from (select $1, $2, try_to_date($3) from @mystage1)
file_format=(type = csv FIELD_DELIMITER = '\u00EA' SKIP_HEADER = 1 NULL_IF = ('') ERROR_ON_COLUMN_COUNT_MISMATCH = false EMPTY_FIELD_AS_NULL = TRUE)
on_error='continue'
The error shows that the dates are not arriving as nulls. Rather, they're arriving as blank strings. You can address this in a few different ways.
The cleanest way is to use the TRY_TO_DATE function on your COPY INTO statement for that column. This function will return database null when trying to convert a blank string into a date:
https://docs.snowflake.com/en/sql-reference/functions/try_to_date.html#try-to-date
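To illustrate (assuming Snowflake's default date format): TO_DATE('') raises the "Date '' is not recognized" error seen above, while TRY_TO_DATE simply yields NULL:
SELECT TRY_TO_DATE('');           -- returns NULL
SELECT TRY_TO_DATE('2020-01-15'); -- returns the date 2020-01-15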

Incorrect row format returned by DDL

I have CSV data with non-default formatting, and the row format below works as expected. But when I use the "Generate create table DDL" option, it does not return the same row format parameters.
original and correct row format:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'quoteChar' = '"'
)
Row format generated by SHOW CREATE TABLE xyz:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
I would like to know how to get back exactly the same create table statement that I used in the first place.
That is not possible. SHOW CREATE TABLE … will at best give you SQL that can actually be used, but many times it will give you something that won't even run, let alone create an identical copy of the table.
You should use the Glue API instead. Use GetTable to retrieve the table structure, modify what you need (the name, the database, and/or the location, for example), and then use CreateTable to create the new table.
What SHOW CREATE TABLE … does is look up the table metadata in Glue and then do a (poor) conversion of what it finds into SQL DDL. You will be much better off doing the Glue operations yourself.

Reading JSON data from Athena

I have created a table by mapping the JSON data below, but unfortunately I am not able to read the nested array within the JSON.
{
  "total": 10,
  "count": 100,
  "values": {
    "source": [
      {"sourceid": "10001", "source": "ABC"},
      {"sourceid": "10002", "source": "XYZ"}
    ]
  }
}
Athena table definition:
CREATE EXTERNAL TABLE source_master_data(
  total bigint,
  count bigint,
  values struct<source: array<struct<sourceid: string, source: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read the sourceid and source, but no luck. Can anyone help me out?
My attempt:
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
UNNEST needs to be applied to the array type. In your query, you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails, because values is a reserved word in Athena.
The overall query would look something like this:
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
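With the struct declared to include both fields, as in the table definition above, both values can be read in one pass; a sketch:
select t1.source.sourceid, t1.source.source
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)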

AWS Athena flattened data from nested JSON source

I'd like to create a table from a nested JSON in Athena. The solutions I have seen using tools like the Hive OpenX JSON SerDe attempt to mirror the full JSON structure in the SQL statement. I just want to get a few fields from the JSON file and create the table; I can't seem to find any resources on how to do that.
E.g.
JSON file: {"records": [{"a": "data1", "b": "data2", "c": "data3"}]}
The table I'd like to create should only have columns a and b.
I think what you are trying to achieve is to unnest the array, transforming each array entry into one row. This is possible by querying your data structure correctly.
Table definition:
CREATE EXTERNAL TABLE complex (
  records array<struct<a: string, b: string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/test1/';
Query:
select record.a, record.b from complex
cross join UNNEST(complex.records) as t1(record);

Matching number sequences in SQLite with random character separators

I have an SQLite database which has number sequences with random separators. For example:
_id  data
0    123-45/678>90
1    11*11-22-333
2    4-4-5-67891
I want to be able to query the database "intelligently", with and without the separators. For example, both of these queries should return _id=0:
SELECT _id FROM myTable WHERE data LIKE '%123-45%'
SELECT _id FROM myTable WHERE data LIKE '%12345%'
The first query works as-is, but the second is the problem. Because the separators appear randomly in the database, there are too many combinations to loop through in the search term.
I could create two columns, one with separators and one without, running each query against each column, but the database is huge so I want to avoid this if possible.
Is there some way to structure the second query to achieve this as-is? Something like a regex replace on each row during the query? Pseudo-code:
SELECT _id
FROM myTable
WHERE REPLACEALL(data,'(?<=\\d)[-/>*](?=\\d)','') LIKE '%12345%'
OK, this is far from nice, but you could simply nest the REPLACE function. Example:
SELECT _id FROM myTable
WHERE REPLACE(..... REPLACE(REPLACE(data, '-', ''), '_', ''), .... '<all other separators>', '') LIKE '%12345%'
When using this in practice (not that I would recommend it, but at least it's simple), you might want to wrap it in a function or a view, as sketched below.
EDIT: see the SQLite documentation for details on the REPLACE function.
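One way to avoid repeating the nested calls in every query is a view; a minimal sketch covering the four separators in the example data (the view and column names are illustrative):
CREATE VIEW myTable_normalized AS
SELECT _id,
       REPLACE(REPLACE(REPLACE(REPLACE(data, '-', ''), '/', ''), '>', ''), '*', '') AS data_plain
FROM myTable;
-- query the stripped column directly:
SELECT _id FROM myTable_normalized WHERE data_plain LIKE '%12345%';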
If I get it right, is this what you want?
SELECT _id
FROM myTable
WHERE REPLACE(REPLACE(REPLACE(REPLACE(data, '*', ''), '>', ''), '/', ''), '-', '') LIKE '%12345%'