AWS Glue Crawler Unable to Classify CSV files


I can't get either the default crawler classifier or a custom classifier to work against many of my CSV files. The classification is listed as 'UNKNOWN'. I've tried re-running existing classifiers as well as creating new ones. Is anyone aware of a specific configuration for a custom CSV classifier that works for files of any size?
I also can't find any errors specific to this issue in the logs.
Although I have seen references to issues with JSON files over 1MB in size, I can't find anything describing the same issue for CSV files, nor a solution to the problem.
AWS crawler could not classify the file type stores in S3 if its size >1MB
AWS Glue Crawler Classifies json file as UNKNOWN

Default CSV classifier supported by the Glue crawler:
CSV - Checks for the following delimiters: comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Ctrl-A is the Unicode control character for Start Of Heading.
If your files use any other delimiter, the default CSV classifier will not work. In that case you will have to write a custom classifier, for example with a grok pattern.
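Before writing a custom classifier, it can help to check whether a sample of the file uses one of the delimiters listed above at all. The following is a minimal sketch in plain Python, not part of Glue; the function name `guess_builtin_delimiter` is hypothetical, and the check is deliberately naive (it ignores quoting and only looks for a consistent delimiter count per line).

```python
# Delimiters the built-in CSV classifier checks for, per the list above.
SUPPORTED_DELIMITERS = [",", "|", "\t", ";", "\u0001"]


def guess_builtin_delimiter(sample_lines):
    """Return the first supported delimiter that appears the same
    non-zero number of times on every non-empty line, or None.

    Naive on purpose: quoted fields containing the delimiter will
    confuse it, so treat the result as a hint only.
    """
    lines = [line for line in sample_lines if line.strip()]
    if not lines:
        return None
    for delimiter in SUPPORTED_DELIMITERS:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return delimiter
    return None


if __name__ == "__main__":
    print(repr(guess_builtin_delimiter(["a|b|c", "1|2|3"])))
    print(repr(guess_builtin_delimiter(["a b c", "1 2 3"])))
```

If the function returns None for a representative sample, the file probably needs a custom classifier rather than the default one.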

Related

remove backslash from a .csv file to load data to redshift from s3

I am having trouble loading a CSV file that contains backslashes. Which delimiter or option can I use in my COPY command so that I don't get an error loading the data from S3 to Redshift?
I tried the QUOTE keyword, but it gave me a syntax error, so it seems the new format doesn't accept QUOTE. There is also no example in the Amazon docs showing how to achieve this.
Can anyone provide a correct command, or suggest a command that replaces special characters while loading the data? Or do I need to clean or preprocess my data before uploading it to S3? If the data is very large, that might not be feasible. If I do have to preprocess it, should I use PySpark or Python (pandas)?
Below is the COPY command I am using to copy data from S3 to Redshift:
COPY redshifttable from 'mys3filelocation'
CREDENTIALS 'aws_access_key_id=myaccess_key;aws_secret_access_key=mysecretID'
region 'us-west-2'
CSV
DATASET:
US063737,2019-11-07T10:23:25.000Z,richardkiganga,536737838,Terminated EOs,"",f,Uganda,Richard,Kiganga,Business owner,Round Planet DTV Uganda,richardkiganga,0.0,4,7.0,2021-06-1918:36:05,"","",panama-
Disc.s3.amazon.com/photos/…,\"\",Mbale,Wanabwa p/s,Eastern,"","",UACE Certificate,"",drive.google.com/file/d/148dhf89shh499hd9303-JHBn38bh/… phone,Mbale,energy_officer's_id_type,letty
mainzi,hakuna Cell,Agent,8,"","",4,"","","",+647739975493,Feature phone,"",0,Boda goda,"",1985-10-12,Male,"",johnatlhnaleviski,"",Wife
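If cleaning the file before upload turns out to be necessary, one possible approach is to stream the data in chunks and drop the backslashes, which keeps memory use flat even for large files. This is only a sketch under assumptions: the function name `strip_backslashes` and the chunk size are hypothetical, and it removes every backslash, including any that are meaningful in the data.

```python
import io


def strip_backslashes(src, dst, chunk_size=1024 * 1024):
    """Copy text from src to dst in fixed-size chunks, removing
    every backslash character along the way."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk.replace("\\", ""))


if __name__ == "__main__":
    # In-memory demo; real use would pass file objects opened in text mode.
    source = io.StringIO('photos,\\"\\",Mbale\n')
    target = io.StringIO()
    strip_backslashes(source, target)
    print(target.getvalue(), end="")
```

Because the function only needs objects with `read` and `write`, the same code works with local files opened via `open(path, newline="")`.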

Can glue Crawler read xml zip file

I have a zipped XML file. Can I create a schema for it using a Glue crawler?
I tried adding an XML classifier to the crawler to create the table, but since the file is zipped, the crawler is not able to read it. Does anyone have experience using zip files with a Glue crawler?

AWS Glue can read zip files, but the archive must contain only one file. From the docs:
ZIP (supported for archives containing only a single file). Note that Zip is not well-supported in other services (because of the archive).
However, XML support is very limited, and not all XML files can be read. For example, self-closing elements can't be read, as shown in the docs.
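The single-file restriction quoted above can be checked locally before uploading an archive. Here is a minimal sketch using Python's standard `zipfile` module; the helper names `is_single_file_zip` and `make_zip` are hypothetical, and the check says nothing about whether the XML inside is readable by the crawler.

```python
import io
import zipfile


def is_single_file_zip(zip_bytes):
    """Return True when the archive holds exactly one regular file.
    Directory entries are ignored."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        files = [info for info in archive.infolist() if not info.is_dir()]
        return len(files) == 1


def make_zip(members):
    """Build an in-memory zip from a {name: text} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    print(is_single_file_zip(make_zip({"data.xml": "<a><b>1</b></a>"})))
    print(is_single_file_zip(make_zip({"a.xml": "<a></a>", "b.xml": "<b></b>"})))
```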

How handle schema changes in glue and get the expected output in csv?

I am trying to crawl some files that have different (but compatible) schemas using AWS Glue.
According to the AWS documentation, Glue crawlers update the catalog tables for any change in the schema (adding new columns and removing missing ones).
I checked the "Update the table definition in the Data Catalog" and "Create a single schema for each S3 path" options while creating the crawler.
Example:
let's say I have a file "File1.csv" as shown below:
name,age,loc
Ravi,12,Ind
Joe,32,US
Say I have another file "File2.csv" as shown below:
name,age,height
Jack,12,160
Jane,32,180
After the crawler ran, the schema was updated to:
name,age,loc,height (this is as expected)
But when I tried to read the files using Athena, or tried writing the content of both files to CSV using a Glue ETL job, I observed that the output looks like:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,160,,
Jane,32,180,,
The last two rows should have a blank for loc, because the second file doesn't have a loc column.
Whereas the expected output is:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,,160
Jane,32,,180
In short, Glue fills the columns contiguously in the combined output. Is there any way to get the expected output?

I got the expected output with Parquet files. Initially I was using CSV, but the CSV deserializer doesn't know how to put the values into the correct positions when the schema changes.
Converting the individual CSVs to Parquet and then crawling them one after another helped me incorporate the changing schema.
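The misalignment described in the question comes from reading CSV rows by position rather than by column name. As an illustration only (plain Python with the standard `csv` module, not Glue or Athena code), the sketch below merges files by header name so that each value lands under its own column. The function name `merge_by_header` is hypothetical.

```python
import csv
import io


def merge_by_header(csv_texts):
    """Merge CSV documents by column name. Columns missing from a
    file are written as empty fields."""
    readers = [csv.DictReader(io.StringIO(text)) for text in csv_texts]

    # Build the union of column names, keeping first-seen order.
    columns = []
    for reader in readers:
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for reader in readers:
        for row in reader:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    file1 = "name,age,loc\nRavi,12,Ind\nJoe,32,US\n"
    file2 = "name,age,height\nJack,12,160\nJane,32,180\n"
    print(merge_by_header([file1, file2]), end="")
```

Self-describing formats such as Parquet carry the column names with the data, which is consistent with the behaviour reported in the answer above.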

SageMaker RCF Data

I have a DynamoDB table filled with data. I use Data Pipeline to export it to S3, which generates a folder with 3 files.
1) "139xx-x911-407x-83xx-06x5x659xx16" that contains all DB data in this format:
{"TimeStamp":{"s":"1539699960"},"SystemID":{"n":"1001"},"AccMin":{"n":"497"},"AccMax":{"n":"509"},"CustomerID":{"n":"10001"},"SensorID":{"n":"101"}}
2) "manifest"
{"name":"DynamoDB-export","version":3,
entries: [
{"url":"s3://cxxxx/2018-10-18-15-25-02/139xx-x911-407x-83xx-06x5x659xx16","mandatory":true}
]}
3) "_SUCCESS" No data inside.
I then go to SageMaker -> Training Jobs -> Create Training Job. Here I fill in everything to create a Random Cut Forest model and point it at the above data (I have tried both the manifest file and the larger data file).
The training fails with the error:
"ClientError: No data was found. Please make sure training data is provided."
What am I doing wrong?

Thank you for your interest in SageMaker.
The manifest is optional, but if provided it should conform to the schema described at https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html . Also, RandomCutForest does not support input data in JSON format. Only protobuf and CSV are supported, see https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
To get training working, you have to convert the input data to CSV or protobuf format and set the content_type value appropriately. If you want to use a manifest file, the S3 location should point to that file, and its content has to be fixed to conform to the schema. Alternatively, you can remove the manifest and point the S3 location to s3://bucket/path/to/data/.
I hope this helps.
Regards,
Yury
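The conversion step mentioned in the answer could look roughly like the sketch below, which turns the line-delimited DynamoDB export shown in the question into CSV rows. Several things here are assumptions rather than facts from the thread: the function name `export_to_csv`, the choice of `AccMin` and `AccMax` as feature columns, and the headerless layout, which should be checked against the Random Cut Forest documentation linked above.

```python
import csv
import io
import json

# Hypothetical choice of numeric feature columns from the export.
FEATURES = ["AccMin", "AccMax"]


def export_to_csv(json_lines, features=FEATURES):
    """Convert DynamoDB export lines (one JSON object per line, each
    attribute wrapped as {"n": "..."} or {"s": "..."}) into CSV text
    without a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        # Unwrap the single typed value of each selected attribute.
        writer.writerow([next(iter(item[name].values())) for name in features])
    return out.getvalue()


if __name__ == "__main__":
    sample = ('{"TimeStamp":{"s":"1539699960"},"SystemID":{"n":"1001"},'
              '"AccMin":{"n":"497"},"AccMax":{"n":"509"},'
              '"CustomerID":{"n":"10001"},"SensorID":{"n":"101"}}')
    print(export_to_csv([sample]), end="")
```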

using gzip files in Data Crawler

I have gzip files in an S3 bucket. They are not CSV files; they are text files with columns separated by spaces. I am new to Glue. Is there some way to use a Glue crawler to read this content?

Glue ETL jobs run on Spark under the hood, so you can use the same Spark code to process a space-delimited file (for example, by splitting each line on spaces). A Glue crawler, on the other hand, creates the table metadata by parsing the data. If your data is space-separated, the crawler won't be able to parse it with the built-in classifiers; it will treat each whole line as a single text column. To handle it, you need to write a custom classifier using a grok pattern. Unfortunately, the AWS documentation provides no clear example, so here is one:
Assuming your data is like below: (it can be in the gzip file as well)
qwe 123 22.3 2019-09-02
asd 123 12.3 2019-09-02
de3 345 23.3 2019-08-22
we3 455 12.3 2018-08-11
ccc 543 12.0 2017-12-12
First you have to create a custom classifier
Grok Pattern
%{NOTSPACE:name} %{INT:class_num} %{BASE10NUM:balance} %{CUSTOMDATE:balance_date}
Custom patterns
CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}
Now create a crawler using the custom classifier you just created. Run the crawler. Then check the metadata created in your database to see if it has recognised the data properly.
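To sanity-check the pattern against sample lines before running the crawler, the grok expression above can be approximated with an ordinary regular expression. The sketch below is plain Python, not Glue code; the regex fragments are rough stand-ins for the NOTSPACE, INT, BASE10NUM, and CUSTOMDATE patterns rather than exact equivalents, and the name `parse_line` is hypothetical.

```python
import re

# Rough regex approximation of:
# %{NOTSPACE:name} %{INT:class_num} %{BASE10NUM:balance} %{CUSTOMDATE:balance_date}
LINE_RE = re.compile(
    r"^(?P<name>\S+) "
    r"(?P<class_num>[+-]?\d+) "
    r"(?P<balance>[+-]?\d+(?:\.\d+)?) "
    r"(?P<balance_date>\d{4}-(?:0?[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]|[1-9]))$"
)


def parse_line(line):
    """Return the named fields of a matching line as a dict of
    strings, or None when the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    for sample in ["qwe 123 22.3 2019-09-02", "ccc 543 12.0 2017-12-12"]:
        print(parse_line(sample))
```

Lines for which `parse_line` returns None are the ones most likely to end up unclassified or collapsed into a single column.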
Please let me know if you have any questions. You can also share a few lines from the file you are trying to process.
