On Amazon Athena can I use compressed geospatial JSON data? - amazon-web-services

I'm using Athena to query some geospatial data encoded as GeoJSON. Querying the uncompressed GeoJSON files works fine, but if I compress those files using gzip I get:
HIVE_CURSOR_ERROR: Illegal character ((CTRL-CHAR, code 31)): only regular white space (\r, \n, \t) is allowed between tokens at [Source: org.apache.hadoop.fs.FSDataInputStream@6acf4fe5: org.apache.hadoop.fs.BufferedFSInputStream@2793fa25; line: 1, column: 2]
Is it possible to use compressed geospatial data on Athena?
EDIT: As requested, here is the create table statement I used:
CREATE EXTERNAL TABLE `locations`(
`id` bigint COMMENT 'from deserializer',
`boundaryshape` binary COMMENT 'from deserializer')
ROW FORMAT SERDE
'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT
'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mydata/transformed/'
TBLPROPERTIES (
'classification'='json',
'write.compression'='GZIP')

Related

Access S3 CSV file in Amazon Athena

I am trying to load files from S3 into Athena to perform a query operation, but all the column values are getting added to the first column.
I have a file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
This is the output I get (screenshot omitted): every value ends up in the first column.
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?
You should add skip.header.line.count to your table properties to skip the first row. Since you have defined all columns with the string data type, Athena was unable to differentiate between the header and the first data row.
DDL with property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
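Putting those pieces together, the full DDL might look like the following. This is only a sketch that reuses the column list and S3 location from the question; the escape and line-terminator settings are the ones shown above and may need adjusting to your data:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')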
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.

AWS Athena case sensitive orc.column.index.access=false?

I've created a table in Athena with the config below
CREATE EXTERNAL TABLE `extern` (`FirstName` STRING, `LastName` STRING, `Email` STRING, `Phone` STRING, `AddressLine1` STRING, `City` STRING, `State` STRING, `PostalCode` STRING,
`time_on_page` DECIMAL(10,3), `page` STRING, `login_time` TIMESTAMP) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ('orc.column.index.access'='false')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 's3://location/' tblproperties ("orc.compress"="ZLIB")
and when I query it with select * from extern I get empty values for all columns except time_on_page, page and login_time. However, the query returns data in all columns if I use ('orc.column.index.access'='true').
The column names in the ORC file schema exactly match what's defined in the create statement.
File Version: 0.12 with ORC_135
Rows: 13559
Compression: ZLIB
Compression size: 131072
Type: struct<FirstName:string,LastName:string,Email:string,Phone:string,AddressLine1:string,City:string,State:string,PostalCode:string,time_on_page:decimal(10,3),page:string,login_time:timestamp>
The question is: could it be that when orc.column.index.access=false, the engine tries to match column names case-sensitively?
Apparently Amazon's version of Presto is based on PrestoDB, and this feature doesn't seem to be available in PrestoDB as of this writing; PrestoSQL, however, seems to have had this support from release 312 onwards (ref: https://github.com/prestosql/presto/issues/802).
The solution is to make sure the schema in the ORC file uses lowercase column names so they can be matched with the names from the Hive metastore (which are always lowercased).
Debug code for PrestoDB: https://github.com/prestodb/presto/blob/6647e13f64883f7cfa89221d91b981bcc3a57618/presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java#L984
https://github.com/prestodb/presto/blob/95bcb3947cad1570e19a0adaebc58009aa362ada/presto-orc/src/main/java/com/facebook/presto/orc/OrcReader.java#L130
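If rewriting the data is an option, one way to end up with lowercase column names inside the ORC files is to let Athena itself rewrite them with a CTAS query. This is only a sketch: it assumes the existing files are readable (e.g. through a table created with 'orc.column.index.access'='true'), and the output table name and S3 prefix below are made up:
-- Rewrite the data as ORC; the new files take the metastore's
-- lowercase column names, so index access is no longer needed.
CREATE TABLE extern_lower
WITH (format = 'ORC', external_location = 's3://location-lowercase/') AS
SELECT * FROM extern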

Why is this SQL query on Amazon Athena not pulling the required table from s3 to my database?

What is the problem in my syntax that stops the query from running?
(The error and error code are mentioned below.)
All the names have been fixed.
"foldername3" has only one file and its name is pinmap.csv.
There are only 9 columns in the csv file.
CREATE EXTERNAL TABLE IF NOT EXISTS default.`pinmap`(
'circle' string,
'region' string,
'division' string,
'office' string,
'pin' int,
'office_type' string,
'delivery' string,
'district' string,
'state' string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/foldername3/'
TBLPROPERTIES (
'skip.header.line.count'='1');
Error code:
line 1:8: no viable alternative at input 'create external' (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
Ideally, the query should import the CSV file from S3 into Amazon Athena as a table named "pinmap" in the database named "default".
Try using backticks instead of apostrophes:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`pinmap`(
`circle` string,
`region` string,
`division` string,
`office` string,
`pin` int,
`office_type` string,
`delivery` string,
`district` string,
`state` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/foldername3/'
TBLPROPERTIES (
'skip.header.line.count'='1');
Result:
Query successful.
Also note that this query only defines meta information about your data in S3 (table schema, database, etc.), which is then stored in the AWS Glue Data Catalog. There is no actual import of the CSV files; they remain in S3.
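A quick way to verify the definition is to run a small query against it; Athena reads the CSV straight from S3 at query time:
SELECT * FROM "default"."pinmap" LIMIT 10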

Need to skip CSV header when reading from s3

When I'm trying to load a CSV file from S3, the header row is being loaded into the columns. I tried to skip the header with
TBLPROPERTIES (
"skip.header.line.count"="1")
But it still doesn't work.
Any advice, please?
CREATE EXTERNAL TABLE skipheader(
permalink string,
company string,
numemps bigint,
category string,
city string,
state string,
fundeddate string,
raisedamt bigint,
raisedcurrency string,
round string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/filename/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Looking at the release notes from when the feature was released, it says:
Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. This is currently supported for queries that use the OpenCSV SerDe, and not for Grok or Regex SerDes.
My interpretation of this is that it won't work with LazySimpleSerDe, which is what you get when you say ROW FORMAT DELIMITED, and that you have to use the OpenCSV SerDe:
CREATE EXTERNAL TABLE skipheader ( … )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/filename/'
TBLPROPERTIES ("skip.header.line.count"="1")
The OpenCSV SerDe works differently from LazySimpleSerDe: it has much more limited data type support, but on the other hand it is more configurable.
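If that limited type support becomes a problem (for example for the bigint columns in the table above), one workaround is to declare those columns as string in the DDL and convert them at query time; TRY_CAST returns NULL instead of failing on malformed values. A sketch using the column names from the table above:
SELECT permalink,
company,
TRY_CAST(numemps AS bigint) AS numemps,
TRY_CAST(raisedamt AS bigint) AS raisedamt
FROM skipheader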
If you can use the OpenCSV SerDe and make it work for you as described by Theo, go for it. However, if you have other tables in other formats, you can get around the issue in the following way, even though it is a bit of a hack: simply add a WHERE clause that excludes the header, like
SELECT * FROM skipheader WHERE permalink != 'permalink'. Recently, Athena added the ability to create a new table as the result of a query (https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html), so you could even filter out the headers and save the result to a new location using Athena if that works better for you.
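For example, a CTAS statement along these lines would write a header-free copy of the data to a new location (the table name and target S3 prefix here are made up for illustration):
CREATE TABLE skipheader_clean
WITH (format = 'PARQUET', external_location = 's3://bucketname/skipheader-clean/') AS
SELECT * FROM skipheader
WHERE permalink != 'permalink'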

AWS Glue/Data catalog showing quotes around data

When I query my files from the Data Catalog using Athena, all the data appears wrapped in quotes. Is it possible to remove those quotes?
I tried adding the quoteChar option in the table settings, but it didn't help.
UPDATE
As requested, the DDL:
CREATE EXTERNAL TABLE `holidays`(
`id` bigint,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-holidays',
'averageRecordSize'='84',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'recordCount'='29',
'sizeKey'='2494',
'skip.header.line.count'='1',
'typeOfData'='file')
I know it's late, but I think the issue is with the "Serde serialization lib".
In
AWS Glue --> click on the table --> Edit Table --> check "Serde serialization lib"
Its value should be "org.apache.hadoop.hive.serde2.OpenCSVSerde"
Then click Apply.
This should solve your issue.
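For reference, the equivalent change expressed as DDL would look roughly like this (a sketch based on the table above; the OpenCSV SerDe strips the configured quoteChar from the values):
CREATE EXTERNAL TABLE `holidays`(
`id` bigint,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'=',',
'quoteChar'='\"')
STORED AS TEXTFILE
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'skip.header.line.count'='1')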