using gzip files in Data Crawler - amazon-web-services

I have gzip files in an S3 bucket. They are not CSV files; they are text files with columns separated by spaces. I am new to Glue. Is there some way to use a Glue Data Crawler to read this content?

Glue is just Spark under the hood, so you can use the same Spark code you would normally write to process a space-delimited file, e.g. splitting each line into columns yourself (a PySpark sketch is shown after the sample data below). The Glue crawler creates the table metadata by parsing the data, but if your data is space separated, the crawler won't be able to parse it: it will treat the whole line as a single text column. To crawl it properly, you will need to write a custom classifier using a Grok pattern. Unfortunately there is no clear example in the AWS documentation, so I am giving one below:
Assuming your data looks like the following (it can be inside the gzip files as well):
qwe 123 22.3 2019-09-02
asd 123 12.3 2019-09-02
de3 345 23.3 2019-08-22
we3 455 12.3 2018-08-11
ccc 543 12.0 2017-12-12
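For the "just process it with Spark" route mentioned at the top of this answer, a minimal PySpark sketch could look like the following; the S3 path is a placeholder and the column names are taken from the sample rows above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark decompresses .gz text files transparently, so the gzip files can be read directly
df = (
    spark.read.csv(
        "s3://my-bucket/my-prefix/",   # placeholder path to the gzip files
        sep=" ",                       # columns are separated by a single space
        inferSchema=True,
    )
    .toDF("name", "class_num", "balance", "balance_date")  # names based on the sample above
)
df.show()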
For the crawler route, first you have to create a custom classifier with the following settings:
Grok Pattern
%{NOTSPACE:name} %{INT:class_num} %{BASE10NUM:balance} %{CUSTOMDATE:balance_date}
Custom patterns
CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}
Now create a crawler using the custom classifier you just created. Run the crawler. Then check the metadata created in your database to see if it has recognised the data properly.
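If you prefer to set this up programmatically rather than in the console, a rough boto3 sketch could look like the following; the classifier, crawler, role, database, and path names are all placeholders:

import boto3

glue = boto3.client("glue")

# Create the custom Grok classifier described above
glue.create_classifier(
    GrokClassifier={
        "Name": "space-delimited-classifier",
        "Classification": "space_delimited",
        "GrokPattern": "%{NOTSPACE:name} %{INT:class_num} %{BASE10NUM:balance} %{CUSTOMDATE:balance_date}",
        "CustomPatterns": "CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}",
    }
)

# Attach the classifier to a crawler pointed at the gzip files in S3
glue.create_crawler(
    Name="space-delimited-crawler",
    Role="MyGlueRole",                  # placeholder IAM role
    DatabaseName="my_database",         # placeholder database
    Classifiers=["space-delimited-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},
)
glue.start_crawler(Name="space-delimited-crawler")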
Please let me know if you have any questions. You can also share a few lines from the file you are trying to process.
If you are new to Glue and keen to try it, you may like to read the blog I have written on LinkedIn regarding Glue. Please click this link.

Related

Regex pattern to pick latest file in NiFi

I have a NiFi flow where I am getting all data from S3 and putting it in the destination folder. Now the requirement is that when there is new data, only the latest data should be transferred. I have data files in S3 like below:
20201130-011101493.parquet
20201129-011101493.parquet
And the regex I tried:
\d[0-9]{8}.parquet
The problem is that it is not picking only the first file, which is the latest data, i.e. 30/11/2020.
How can I modify my regex so that it picks only the latest file, given that the job runs once per day? I also referred to this SO post, but I guess I am not able to get my regex right.
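One possible direction, sketched in Python rather than NiFi configuration: since the job runs once per day, the pattern can be anchored to the current date so only that day's file matches. The nine-digit suffix length is assumed from the file names above:

import re
from datetime import date

today = date.today().strftime("%Y%m%d")                # e.g. "20201130"
pattern = re.compile(rf"^{today}-\d{{9}}\.parquet$")   # match only today's file

files = ["20201130-011101493.parquet", "20201129-011101493.parquet"]
print([f for f in files if pattern.match(f)])          # ['20201130-011101493.parquet'] on 2020-11-30

The same anchored pattern, with the date substituted at run time, could then be used as the file filter regex in the flow.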

AWS Glue Crawler Unable to Classify CSV files

I'm unable to get either the default crawler classifier or a custom classifier to work against many of my CSV files. The classification is listed as 'UNKNOWN'. I've tried re-running existing classifiers, as well as creating new ones. Is anyone aware of a specific configuration for a custom classifier for CSV files that works for files of any size?
I'm also unable to find any errors specific to this issue in the logs.
Although I have seen reference to issues for JSON files over 1MB in size, I can't find anything detailing this same issue for CSV files, nor a solution to the problem.
AWS crawler could not classify the file type stores in S3 if its size >1MB
AWS Glue Crawler Classifies json file as UNKNOWN
Default CSV classifiers supported by Glue Crawler:
CSV - Checks for the following delimiters: comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Ctrl-A is the Unicode control character for Start Of Heading.
If you have any other delimiter, it will not work with the default CSV classifier. In that case you will have to write a Grok pattern.
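For example, for a hypothetical colon-delimited file with rows like john:25:US, the Grok pattern of a custom classifier could look something like this (the field names are just examples):

%{DATA:name}:%{INT:age}:%{WORD:country}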

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result to another S3 bucket, a relational DB, or the Glue Catalog.
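In a Python Glue job, the rough equivalent of getSourceWithFormat() is create_dynamic_frame.from_options(); a minimal sketch (bucket names and the output path are placeholders):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read every sales.json under the bucket, recursing through the store/date prefixes
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# ... apply any required transformations here ...

# Write the combined result to a placeholder destination
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/sales/"},
    format="parquet",
)
job.commit()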
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Read more about it here
Then you can use Relationalize to flatten out your JSON structure; read more about that here
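A minimal sketch of that step, assuming the crawler created a table named sales in a database named my_db, and using a placeholder staging path that Relationalize needs for scratch space:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the table created by the crawler (database and table names are placeholders)
sales = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="sales")

# Relationalize flattens nested JSON into a collection of flat DynamicFrames
flattened = Relationalize.apply(
    frame=sales,
    staging_path="s3://my-bucket/tmp/",  # scratch space used by Relationalize
    name="root",
)
flattened.select("root").toDF().show()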
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say that if your data is already in JSON format, it's probably smarter to convert it to JSONL than to convert it to CSV.
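If the source files hold a single JSON array rather than one object per line, a small pre-processing step can rewrite them as JSON Lines; a minimal sketch (the file names are just examples):

import json

# Rewrite a file containing one JSON array as one JSON object per line (JSON Lines)
with open("sales.json") as src, open("sales.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")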

How to handle schema changes in Glue and get the expected output in CSV?

I am trying to crawl some files having different schemas (data compatible) using AWS Glue.
As I read in the AWS documentation, Glue crawlers update the catalog tables for any change in the schema (adding new columns and removing missing columns).
I have checked the "Update the table definition in the Data Catalog" and "Create a single schema for each S3 path" while creating the crawler.
Example:
let's say I have a file "File1.csv" as shown below:
name,age,loc
Ravi,12,Ind
Joe,32,US
Say I have another file "File2.csv" as shown below:
name,age,height
Jack,12,160
Jane,32,180
After the crawler ran, the schema was updated to:
name,age,loc,height - this is as expected
But when I tried to read the files using Athena, or tried writing the content of both files to CSV using a Glue ETL job, I observed that the output looks like:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,160,,
Jane,32,180,,
The last two rows should have a blank for loc, as the second file didn't have a loc column.
Whereas the expected output is:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,,160
Jane,32,,180
In short, Glue is trying to fill up the columns in a contiguous manner in the combined output. Is there any way I can get the expected output?
I got the expected output with Parquet files. Initially I was using CSV, but the CSV deserializer doesn't understand how to put the elements into the correct position when the schema changes.
Changing the individual CSVs into Parquet and then crawling them one after another helped me handle the changing schema.
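A minimal PySpark sketch of that conversion (the paths are placeholders); each CSV is rewritten as Parquet before being crawled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert one CSV (with header) to Parquet; Parquet stores column names,
# so the crawler and downstream readers can align columns by name instead of by position
df = spark.read.csv("s3://my-bucket/input/File1.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3://my-bucket/parquet/File1/")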

AWS Glue custom crawler based on file name

What I am trying to do is crawl data in an S3 bucket with AWS Glue. The data is stored as nested JSON and the path looks like this:
s3://my-bucket/some_id/some_subfolder/datetime.json
When running the default crawler (no custom classifiers), it partitions the data based on the path and deserializes the JSON as expected; however, I would like to get the timestamp from the file name as well, in a separate field. For now, the crawler omits it.
For example, if I run the crawler on:
s3://my-bucket/10001/fromage/2017-10-10.json
I get table schema like this:
Partition 1: 10001
Partition 2: fromage
Array: JSON data
I did try to add a custom classifier based on a Grok pattern:
%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}
However, whenever I re-run the crawler it skips the custom classifier and uses the default JSON one. As a solution I could obviously append the file name to the JSON itself before running the crawler, but I was wondering if I can avoid this step?
Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:
s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json
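A minimal boto3 sketch of that re-layout, assuming the existing objects can simply be copied into a new Hive-style partitioned prefix (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

old_key = "10001/fromage/2017-10-10.json"
new_key = "id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json"

# Copy the object into the partitioned layout so the crawler picks up id, source and timestamp as partitions
s3.copy_object(
    Bucket="my-bucket",
    Key=new_key,
    CopySource={"Bucket": "my-bucket", "Key": old_key},
)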