In AWS Athena, how can I specify that the output values be double quoted ("value")? I managed to specify the delimiter using the field_delimiter property.
Assuming you have a table cust_transaction with two columns, id and amount, where amount is of int datatype, you can CTAS it as follows. The approach is quite manual and can be cumbersome if the number of columns is big. You will need to explicitly cast non-string data types to varchar too. Hope that helps. Is it what you were looking for?
create table cust_transaction_pipe_1
with (
  external_location = 's3://aws_bucket/cust_tx_pipe_1/',
  format = 'TEXTFILE',
  field_delimiter = '|'
)
as
select
  concat(chr(34), id, chr(34)) as id,
  concat(chr(34), cast(amount as varchar), chr(34)) as amount
from cust_transaction
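Here chr(34) is the ASCII double-quote character, so each value is written out wrapped in quotes.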
Related
I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schemas of these files are all the same, the formatting is all the same, the naming conventions are all the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:
id,date,string,int_values,double_values
"6F87U",2021-03-21,"Text",0,1.1483
"8DU87",2021-03-22,"More text, oh yes",1,2.525
"79LO2",2021-03-23,"Moar, give me moar, text",2,3.485489
When I run a Crawler with everything default and query with Athena like so:
select * from tb_csv_data
...the results in Athena are thus:
id       date        string        int_values     double_values
"6F87U"  2021-03-21  "Text"        0              1.1483
"8DU87"  2021-03-22  "More text    oh yes"        1
"79LO2"  2021-03-23  "Moar         give me moar   text
The problem at this level seems to be with proper detection (read: ignoring) of commas as delimiters within quotation marks. So I attached a CSV classifier with the appropriate characteristics to the Crawler, ran the Crawler again with the classifier attached, and the resulting table properties are these:
Input format org.apache.hadoop.mapred.TextInputFormat
Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters
quoteChar "
separatorChar ,
Table properties
sizeKey 4356512114
objectCount 3
UPDATED_BY_CRAWLER crawler-name
CrawlerSchemaSerializerVersion 1.0
recordCount 3145398
averageRecordSize 1384
CrawlerSchemaDeserializerVersion 1.0
compressionType none
columnsOrdered true
areColumnsQuoted true
delimiter ,
typeOfData file
The resulting table with the same simple Athena query as above seems to be correct:
id     date        string                    int_values  double_values
6F87U  2021-03-21  Text, yes                 0           1.1483
8DU87  2021-03-22  More text, oh yes         1           2.525
79LO2  2021-03-23  Moar, give me moar, text  2           3.485489
The automatic inference of data types is expected to produce this (let's simplify and presume the date is correct as a string):
Column name    Data type
id             string
date           string
string         string
int_values     bigint (or long)
double_values  double
...but instead they're all strings!
Column name    Data type
id             string
date           string
string         string
int_values     string
double_values  string
I need this data to be accurately queryable from Athena as it is, where it is, so what can I do without further processing of the raw data? I suppose I could manually adjust the table properties in the Console, but is that really correct when I need the entire pipeline to be automated? I also want to avoid having to cast types for 80+ fields in every query, as most of these columns are numeric. What can I do?
Thank you!
The limitation arises from the SerDe that you are using in your query. Refer to the note section in this doc, which has the following explanation:
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.
For a date type to be detected, it has to be in UNIX numeric format, such as 1562112000, according to the doc.
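Given that, one workaround is to leave the crawled columns as strings and cast at query time. A minimal sketch against the asker's table (column names taken from the sample; try_cast returns NULL instead of erroring on values it cannot convert):
select
  id,
  "date",  -- quoted because date is also a type keyword
  string,
  try_cast(int_values as bigint) as int_values,
  try_cast(double_values as double) as double_values
from tb_csv_data;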
I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
The CREATE TABLE statement runs fine, but as soon as I try to query the table, I get HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
I tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
I tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works when you specify all columns as STRING, but that's not what I need.
Therefore, the question is: is there any way to read a quoted CSV (quoting is important, as my real data is much more complex) into Athena with the correct column specification?
A quick and dirty way to handle this data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense, as LazySimpleSerDe is meant for data that does not contain quoted values. I suspect the solution is to use the org.apache.hadoop.hive.serde2.RegexSerDe SerDe. I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
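For example (a sketch: the header line does not match the regex above, so all of its fields, including id, come back as NULL):
SELECT *
FROM stackoverflow.test_null_unquoted
WHERE id IS NOT NULL;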
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately, there is no way to get both support for quoted fields and support for null values in Athena. You have to choose one or the other.
You can use OpenCSVSerDe and type all columns as string; that will give you support for quoted fields, and empty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe, because you can always create a view with all the type conversions and use the view for your regular queries.
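A minimal sketch of that view approach, reusing the table and columns from the question (the view name is made up here; this assumes all columns were declared as string in the OpenCSVSerDe table):
CREATE OR REPLACE VIEW test_null_unquoted_typed AS
SELECT
  TRY_CAST(id AS INTEGER) AS id,
  TRY_CAST(height AS INTEGER) AS height,
  TRY_CAST(age AS INTEGER) AS age,
  name
FROM schema.test_null_unquoted;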
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me. Use OpenCSVSerDe and declare all columns as string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/
I have a column of type map<string, string> in Athena, and it is not recognized by AWS QuickSight. I am trying to convert this field to varchar in QuickSight using SQL:
SELECT cast(body as varchar) FROM db.events;
But it fails with:
Cannot cast map(varchar,varchar) to varchar
How can I convert this field correctly so QuickSight can query against it?
I think there is no easy way to do that, but maybe there are some workarounds.
If each map has two keys with known names you can create two new columns:
SELECT
ELEMENT_AT(map_col,'key1') AS key1_col
,ELEMENT_AT(map_col,'key2') AS key2_col
FROM
(
SELECT
MAP(
ARRAY['key1','key2'],
ARRAY['val1','val2']
) AS map_col
)
Which will output:
key1_col  key2_col
val1      val2
If your map column has just one key you can adapt the snippet above and use it or use this one:
SELECT
ARRAY_JOIN(MAP_KEYS(map_col), ', ') AS keys
,ARRAY_JOIN(MAP_VALUES(map_col), ', ') AS vals
FROM
(
SELECT
MAP(
ARRAY['key1'],
ARRAY['val1']
) AS map_col
)
which will result in:
keys  vals
key1  val1
As said above, there is no direct way. If you have many keys, you can try the second snippet to create strings that store the keys and values, and later use calculated fields (maybe using split) to access them, as sketched below.
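For illustration, a sketch of pulling a single value back out of such a joined string on the Athena side (QuickSight's calculated fields offer a similar split function):
SELECT
  element_at(split(vals, ', '), 1) AS first_val
FROM
(
  SELECT
    ARRAY_JOIN(MAP_VALUES(MAP(ARRAY['key1'], ARRAY['val1'])), ', ') AS vals
)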
Hope it helps (:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production.
I have a table in the below format.
Name Department
Johny Dep1
Jacky Dep2
Ramu Dep1
I need an output in the below format.
Dep1 - Johny,Ramu
Dep2 - Jacky
I have tried the LISTAGG function, but there is a hard limit of 4000 characters. Since my db table is huge, this cannot be used in the app. The other option is to use the
SELECT CAST(COLLECT(Name)
But my framework allows me to execute only SELECT queries and no PL/SQL scripts. Hence I don't see any way to create a type using the CREATE TYPE command, which is required for the COLLECT command.
Is there any alternate way to achieve the above result using a SELECT query?
You should add GetClobVal, and you also need to RTRIM, as the result will have a trailing delimiter.
SELECT RTRIM(XMLAGG(XMLELEMENT(E,colname,',').EXTRACT('//text()')
ORDER BY colname).GetClobVal(),',') from tablename;
If you can't create types (you can't just use SQL*Plus to create one as a one-off?), but you're OK with COLLECT, then use a built-in array. There are several knocking around in the RDBMS. Run this query:
select owner, type_name, coll_type, elem_type_name, upper_bound, length
from all_coll_types
where elem_type_name = 'VARCHAR2';
e.g. on my db, I can use sys.DBMSOUTPUT_LINESARRAY, which is a varray of considerable size.
select department,
cast(collect(name) as sys.DBMSOUTPUT_LINESARRAY)
from emp
group by department;
A derivative of anuu_online's answer, but it handles unescaping the XML in the result:
dbms_xmlgen.convert(xmlagg(xmlelement(E, name||',')).extract('//text()').getclobval(),1)
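For example, applied to the emp table from the other answers (a sketch; the second argument 1 asks dbms_xmlgen.convert to decode XML entities such as &amp; back to their characters):
select department,
       dbms_xmlgen.convert(
         xmlagg(xmlelement(E, name || ',')).extract('//text()').getclobval(), 1
       ) as names
from emp
group by department;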
For IBM DB2, casting the result to a VARCHAR(10000) will give more than 4000 characters:
select column1, listagg(CAST(column2 AS VARCHAR(10000)), x'0A') AS "Concat column"...
I ended up with another approach, using the XMLAGG function, which doesn't have the hard limit of 4000:
select department,
XMLAGG(XMLELEMENT(E,name||',')).EXTRACT('//text()')
from emp
group by department;
You can use:
SELECT department
, REGEXP_REPLACE(XMLCAST(XMLAGG(XMLELEMENT(x, name, ',')) AS CLOB), ',$')
FROM emp
GROUP BY department
it will return CLOB that has no size limit, handles correctly XML entity escapes and separators.
Instead of REGEXP_REPLACE(..., ',$') you can use RTRIM(..., ','), which should be faster, but it will remove all separators from the end of the result (including those that can appear at the end of name, or previous ones if the last names are empty).
I have a table with one column having a large json object in the format below. The column datatype is VARCHAR
column1
--------
{"key":"value",....}
I'm interested in the first value of the column data.
In regex I can do it with .*?:(.*),.*, with group(1) giving me the value.
How can I use it in the SELECT query?
Don't do that, it's bad database design. Shred the keys and values into their own table as columns, or use the XML data type. XML would work fine because you can index the structure well, and you can use XPath queries on the data. XPath supports regexes natively.
You can use regular expressions with XQuery; you just need to call the fn:matches function from a SQL query or a FLWOR expression.
This is an example of how to use regular expressions from SQL:
db2 "with val as (
select t.text
from texts t
where xmlcast(xmlquery('fn:matches(\$TEXT,''^[A-Za-z 0-9]*$'')') as integer) = 0
)
select * from val"
For more information:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.xml.doc/doc/xqrfnmat.html
http://angocadb2.blogspot.fr/2014/04/regular-expressions-in-db2.html
DB2 doesn't have any built-in regex functionality, unfortunately. I did find an article about how to add this with libraries:
http://www.ibm.com/developerworks/data/library/techarticle/0301stolze/0301stolze.html
Without regexes, this operation would be a mess. You could make a function that goes through the string character by character to find the first value. Or, if you will need to do more than this one operation, you could make a procedure that parses the json and throws it into a table of keys/values. Neither one sounds fun, though.
In DB2 for z/OS you will have to pass the variable to XMLQUERY with the PASSING option:
db2 "with val as (
select t.text
from texts t
where xmlcast(xmlquery('fn:matches($TEXT,''^[A-Za-z 0-9]*$'')'
PASSING t.text as "TEXT") as integer) = 0
)
select * from val"