Data Transformation in AWS EMR without using Scala or Python - amazon-web-services

I have a star schema kind of database structure, like one fact table having all the id’s & skeys, whereas there are multiple dimension tables having the actual id, code, descriptions for the id’s referred in the fact table.
we are moving all these tables (fact & dimensions) to S3 (cloud) individually and each table data are split into multiple parquet files in S3 location (one S3 object per table)
Query: i need to perform a transformation on cloud (ie) i need strip of all the id’s & skeys referred in the fact table and replace it with the actual code that is residing in the dimension tables and create another file and store the final output back in S3 location. This file will later be consumed by Redshift for Analytics.
My Doubt:
Whats the best way to achieve this solution, cos i don’t need raw data (skeys & id’s) in Redshift for cost and storage optimization?
Do we need to first combine these split files (parquet) into one large file (ie) before performing the data transformation. Also, after data transformation, I am planning to save the final output file in parquet format, but the catch is, Redshift doesn’t allow copy of parquet file, so is there a workaround for that
I am not a hardcore programmer and want to avoid using scala/python in a EMR, but I am good at SQL, so is there a way to perform data transformation in cloud thru SQL thru EMR and save the output data into a file or files. Please advise

You should be able to run redshift type queries directly against your s3 parquet data by using amazon athena
some information on that
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/

Related

How to create AWS Athena table via Glue crawler when the s3 data store has both json and .gz compressed files?

I have two problems in my intended solution:
1.
My S3 store structure is as following:
mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz
All json files have the same schema and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.
I have already tried with just one file format, e.g. if the files are just json or just gz then the crawler works perfectly but I am looking for a solution through which I can automate either type of file processing. I am open to write a custom script or any out of the box solution but need pointers where to start.
2.
The second issue that my json data has a field(column) which the crawler interprets as struct data but I want to make that field type as string. Reason being that if the type remains struct the date/hour partitions get a mismatch error as obviously struct data has not the same internal schema across the files. I have tried to make a custom classifier but there are no options there to describe data types.
I would suggest skipping using a crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially adding partitions, but it's much less pain than trying to make a crawler do what you want it to do.
You can of course also create the table from Athena, that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
Crawler will not take compressed and uncompressed data together , so it will not work out of box.
It is better to write spark job in glue and use spark.read()

Load Parquet files into Redshift

I have a bunch of Parquet files on S3, i want to load them into redshift in most optimal way.
Each file is split into multiple chunks......what is the most optimal way to load data from S3 into Redshift?
Also, how do you create the target table definition in Redshift? Is there a way to infer schema from Parquet and create table programatically? I believe there is a way to do this using Redshift spectrum, but i want to know if this can be done in scripting.
Appreciate your help!
I am considering all AWS tools such as Glue, Lambda etc to do this the most optimal way(in terms of performance, security and cost).
The Amazon Redshift COPY command can natively load Parquet files by using the parameter:
FORMAT AS PARQUET
See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
The table must be pre-created; it cannot be created automatically.
Also note from COPY from Columnar Data Formats - Amazon Redshift:
COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
use parquet-tools from GitHub to dissect the file :
parquet-tool schema <filename> #will dump the schema w/datatypes
parquet-tool head <filename> #will dump the first 5 data structures
Use the jsonpaths file to specify mappings

Converting data in AWS S3 to another schema structure (also in S3)

quite a beginner's question -
I have log data stored in S3 files, in zipped JSON format.
The files reside in a directory hierarchy which reflects partitioning, in the following way: s3://bucket_name/year=2018/month=201805/day=201805/some_more_partitions/file.json.gz
I recently changed the schema of the logging to a slightly different directory structure. I Added some more partition levels, the fields currently reside inside of the JSON and I want to move them to the folder hierarchy. Also, I changed the inner JSON schema slightly. They reside in a different S3 bucket.
I wish to convert the old logs to the new format, because I have Athena mapping over the new schema structure.
Is AWS EMR the tool for this? If so, what's the simplest way to achieve this? I thought I need an EMR cluster of type step execution but it probably creates just one output file, no?
Thanks
Yes, Amazon EMR is an appropriate tool to use.
You could use Hive, which has similar-ish syntax to Athena:
Create an External Table pointing to your existing data, using your old schema
Create an External Table pointing to where you wish to store the data, using your new schema
INSERT INTO new-table SELECT * FROM old-table
If your intention is to query the data with Amazon Athena, you can use Amazon EMR to convert the data into Parquet format, which will give even better query performance.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Yes EMR can be used for such conversion.
Here's the sample code where to covert the data coming as csv (stg folder aka source folder) format to orc file format. You may want to do the insert overwrite in case you have overlapping partitions between your staging (aka source) files and Target files
DROP TABLE IF EXISTS db_stg.stg_table;
CREATE EXTERNAL TABLE `db_stg.stg_table`(
GEO_KEY string,
WK_BEG_DT string,
FIS_WK_NUM Double,
AMOUNT1 Double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket.name/stg_folder_name/'
TBLPROPERTIES ('has_encrypted_data'='false');
drop table db_tgt.target_table;
CREATE EXTERNAL TABLE db_tgt.target_table(
GEO_KEY string,
FIS_WK_NUM Double,
AMOUNT1 Double
)
PARTITIONED BY(FIS_WK_NUM)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
location 's3://bucket.name/tgt_folder_name/'
TBLPROPERTIES (
'orc.compress'='SNAPPY');
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table db_tgt.target_table partition(FIS_WK_NUM)
select
GEO_KEY ,
WK_BEG_DT ,
FIS_WK_NUM ,
AMOUNT1
from db_stg.stg_table;
Agree with John that converting to a columnar file format like Parquet or ORC (along with compression like SNAPPY) will give you the best performance with AWS Athena.
Remember the key to using Athena is to optimize the amount of data you scan an read. Hence, if the data is in columnar format and you are reading certain partitions, you AWS Athena cost will go down significantly. All you need to do is to make sure you are using the filter condition in your Athena queries that selects the required partitions.

How to split data when archiving from AWS database to S3

For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using the AWS data pipeline to move the data from the database to be in S3, according to this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and giving each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects

Best storage format to backup hive internal table

I have one hive internal table which has around 500 million records.
My hive is deployed on top of AWS EMR. I do not want to keep the AWS EMR always running. Hence I want to backup the hive internal table data.
One easy way of doing it to create an external table pointing to S3 Location and then moving all records into that external table using insert command.
When ever I need internal table back, I can use this external S3 table to get all the data back.
Since this table only purpose is for backup, I want to ask which stored as format will be best choice for me.
Hive as of now supports following formats
TEXTFILE
SEQUENCEFILE
ORC
PARQUET
AVRO
RCFILE
Also is there any other way to backup your internal tables other than the approach mentioned above.
In Short
I'd think changing file format(the list you mentioned) will not have much difference in size. But file size and type of access you want on that file plays crucial role your cloud account billing.
So consider following,
Compression - To reduce the size
Amazon Glacier - Cost effective solution than S3 in AWS, as the data is less likely to access (archival)
Things to consider when choosing a solution, How much time you can buy
To access file from archival storage.
to convert data format to Hive managed table (if you change during archival)
to data uncompress(each compression is trade of between time and size)
Extended answer
Here are some of the file formats with their decompression speed and space efficiency, pick the balanced(means time/space as per above questions) and available compression format for you.
more compress and compress benchmarks at