Querying timestamp data in Athena when the timestamp format in the underlying JSON files has changed - amazon-web-services

I'm querying data in AWS Athena from JSON files stored in S3. I've loaded all the JSON files into Athena using AWS Glue, and it's been working perfectly so far. However, the timestamp formatting has changed in the JSON files from
2018-03-23 15:00:30.998
to
2018-08-29T07:59:50.568Z
So the table ends up having entries like this
2018-08-29T07:59:42.803Z
2018-08-29T07:59:42.802Z
2018-08-29T07:59:32.500Z
2018-03-23 15:03:43.232
2018-03-23 15:03:44.697
2018-03-23 15:04:11.951
This results in parsing errors when I try to run queries against the full DB.
How do I accommodate this in AWS Glue (or Athena), so I don't have to split up the data when querying? I've tried looking into custom classifiers, but I'm unsure of how to use them in this particular case.
Thanks in advance.

Unfortunately you have to unify the data. If you decide to use "2018-08-29T07:59:50.568Z" format you can read such data by using org.apache.hive.hcatalog.data.JsonSerDe library with the following serde property: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'

Related

How does Athena match data to the correct fields when creating a table from Amazon S3?

I'm still getting to grips with Athena, so apologies if this question doesn't make sense. I'm trying to create a table in Athena from a csv file I have uploaded to my Amazon S3 bucket. During the table creation process on Athena, I need to define field names and datatypes. How are these fields matched with those stored in the Amazon S3 bucket? Is it just the order in which they appear in the csv?
Yes indeed depending on the order.
https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
This depends on the data format and the serde used. For CSV data, columns are mapped by index. The same goes for the regex and grok serde, as well as ORC. JSON, Avro, and Parquet is instead mapped by name.
Grok, ORC and Parquet can be configured to be either index or name, but the above are the defaults.

Glue Crawler does not recognize Timestamps

I have JSON files in an S3 Bucket that may change their schema from time to time. To be able to analyze the data I want to run a glue crawler periodically on them, the analysis in Athena works in general.
Problem: My timestamp string is not recognized as timestamp
The timestamps currently have the following format 2020-04-06T10:37:38+00:00, but I have also tried others, e.g. 2020-04-06 10:37:38 - I have control over this and can adjust the format.
The suggestion to set the serde parameters might not work for my application, I want to have the scheme completely recognized and not have to define each field individually. (AWS Glue: Crawler does not recognize Timestamp columns in CSV format)
Manual adjustments in the table are generally not wanted, I would like to deploy Glue automatically within a CloudFormation stack.
Do you have an idea what else I can try?
This is a very common problem. The way we got around the problem when reading text/json files is we had an extra step in between to cast and set proper data types. The crawler data types are a bit iffy sometimes and is based on the data sample available at that point in time

How to create AWS Athena table via Glue crawler when the s3 data store has both json and .gz compressed files?

I have two problems in my intended solution:
1.
My S3 store structure is as following:
mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz
All json files have the same schema and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.
I have already tried with just one file format, e.g. if the files are just json or just gz then the crawler works perfectly but I am looking for a solution through which I can automate either type of file processing. I am open to write a custom script or any out of the box solution but need pointers where to start.
2.
The second issue that my json data has a field(column) which the crawler interprets as struct data but I want to make that field type as string. Reason being that if the type remains struct the date/hour partitions get a mismatch error as obviously struct data has not the same internal schema across the files. I have tried to make a custom classifier but there are no options there to describe data types.
I would suggest skipping using a crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially adding partitions, but it's much less pain than trying to make a crawler do what you want it to do.
You can of course also create the table from Athena, that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
Crawler will not take compressed and uncompressed data together , so it will not work out of box.
It is better to write spark job in glue and use spark.read()

Converting data in AWS S3 to another schema structure (also in S3)

quite a beginner's question -
I have log data stored in S3 files, in zipped JSON format.
The files reside in a directory hierarchy which reflects partitioning, in the following way: s3://bucket_name/year=2018/month=201805/day=201805/some_more_partitions/file.json.gz
I recently changed the schema of the logging to a slightly different directory structure. I Added some more partition levels, the fields currently reside inside of the JSON and I want to move them to the folder hierarchy. Also, I changed the inner JSON schema slightly. They reside in a different S3 bucket.
I wish to convert the old logs to the new format, because I have Athena mapping over the new schema structure.
Is AWS EMR the tool for this? If so, what's the simplest way to achieve this? I thought I need an EMR cluster of type step execution but it probably creates just one output file, no?
Thanks
Yes, Amazon EMR is an appropriate tool to use.
You could use Hive, which has similar-ish syntax to Athena:
Create an External Table pointing to your existing data, using your old schema
Create an External Table pointing to where you wish to store the data, using your new schema
INSERT INTO new-table SELECT * FROM old-table
If your intention is to query the data with Amazon Athena, you can use Amazon EMR to convert the data into Parquet format, which will give even better query performance.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Yes EMR can be used for such conversion.
Here's the sample code where to covert the data coming as csv (stg folder aka source folder) format to orc file format. You may want to do the insert overwrite in case you have overlapping partitions between your staging (aka source) files and Target files
DROP TABLE IF EXISTS db_stg.stg_table;
CREATE EXTERNAL TABLE `db_stg.stg_table`(
GEO_KEY string,
WK_BEG_DT string,
FIS_WK_NUM Double,
AMOUNT1 Double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket.name/stg_folder_name/'
TBLPROPERTIES ('has_encrypted_data'='false');
drop table db_tgt.target_table;
CREATE EXTERNAL TABLE db_tgt.target_table(
GEO_KEY string,
FIS_WK_NUM Double,
AMOUNT1 Double
)
PARTITIONED BY(FIS_WK_NUM)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
location 's3://bucket.name/tgt_folder_name/'
TBLPROPERTIES (
'orc.compress'='SNAPPY');
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table db_tgt.target_table partition(FIS_WK_NUM)
select
GEO_KEY ,
WK_BEG_DT ,
FIS_WK_NUM ,
AMOUNT1
from db_stg.stg_table;
Agree with John that converting to a columnar file format like Parquet or ORC (along with compression like SNAPPY) will give you the best performance with AWS Athena.
Remember the key to using Athena is to optimize the amount of data you scan an read. Hence, if the data is in columnar format and you are reading certain partitions, you AWS Athena cost will go down significantly. All you need to do is to make sure you are using the filter condition in your Athena queries that selects the required partitions.

How to split data when archiving from AWS database to S3

For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using the AWS data pipeline to move the data from the database to be in S3, according to this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and giving each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects