Glue custom classifiers for CSV with non-standard delimiter - amazon-web-services

I am trying to use AWS Glue to crawl a data set and make it available to query in Athena. My data set is a delimited text file that uses ^ to separate columns. Glue is not able to infer the schema for this data, as the built-in CSV classifier only recognises comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Is there a way of updating this classifier to include non-standard delimiters? The option to build custom classifiers only seems to support Grok, JSON or XML, which are not applicable in this case.

You will need to create a custom classifier with a Grok pattern and use that in the crawler. Suppose your data looks like the following, with four fields:
qwe^123^22.3^2019-09-02
To process the above data, your custom Grok pattern would look like this:
%{NOTSPACE:name}^%{INT:class_num}^%{BASE10NUM:balance}^%{CUSTOMDATE:balance_date}
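Note that CUSTOMDATE is not one of the built-in Grok patterns, so it has to be supplied as a custom pattern alongside the classifier (and, depending on the Grok engine, the literal carets may need to be escaped as \^). As a rough sketch only, with an assumed classifier name, classification label, and a CUSTOMDATE definition for a yyyy-MM-dd date, the classifier could be registered with boto3 like this:

import boto3

glue = boto3.client("glue")

# Register a Grok classifier for the caret-delimited file.
# The name and classification label are placeholders; adjust as needed.
glue.create_classifier(
    GrokClassifier={
        "Name": "caret-delimited",
        "Classification": "caret_csv",
        # Carets are escaped so they match literally.
        "GrokPattern": "%{NOTSPACE:name}\\^%{INT:class_num}\\^%{BASE10NUM:balance}\\^%{CUSTOMDATE:balance_date}",
        # CUSTOMDATE is assumed to be a yyyy-MM-dd date; change it to match your data.
        "CustomPatterns": "CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}",
    }
)

Attach the classifier to the crawler (for example via its list of custom classifiers when creating or updating it) so it is tried before the built-in CSV classifier.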
Please let me know if that worked for you.

Related

What is the equivalent of the SQL function REGEXP_EXTRACT in Azure Synapse?

I want to convert code that I was running in Netezza (SQL) to Azure Synapse (T-SQL). I was using the built-in Netezza SQL function REGEXP_EXTRACT, but this function is not built into Azure Synapse.
Here's the code I'm trying to convert:
-- Assume that "column_v1" has datatype Character Varying(3) and can take value between 0 to 999 or NULL
SELECT
column_v1
, REGEXP_EXTRACT(column_v1, '[0-9]+') as column_v2
FROM INPUT_TABLE
;
Thanks,
John
The regexExtract() function is supported in Synapse, in mapping data flows.
To implement it you need a couple of pieces. Here is a demo that I built, using the SalesLT.Customer data that Microsoft provides as sample data:
In Synapse -> Integrate tab:
Create a new pipeline.
Add a Data flow activity to your pipeline.
In the Data flow activity, under the Settings tab, create a new data flow.
Double-click the data flow (it should open it) and add a source (it can be Blob Storage, on-premises files, etc.).
Add a derived column transformation.
In the derived column, add a new column (or override an existing column). In Expression, add this command: regexExtract(Phone,'(\\d{3})'). It will select the first 3 digits. Since my data has dashes in it, it makes more sense to replace all characters that are not digits using the regexReplace method: regexReplace(Phone,'[^0-9]', '').
Add a sink.
Please check the Microsoft docs:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions
REGEXP_EXTRACT is not available in T-SQL, so we can get similar functionality using the SUBSTRING/LEFT/RIGHT functions along with the PATINDEX function:
SELECT input='789A',
extract= SUBSTRING('789A', PATINDEX('[0-9][0-9][0-9]', '789A'),4);
Refer to the Microsoft documentation for PATINDEX (T-SQL) and SUBSTRING (T-SQL) for additional information.

AWS Glue not detecting header in CSV

Hi, I have a bunch of CSVs located in S3 and a crawler set up via AWS Glue. The crawler builds about 10 tables as it scans 10 folders, and only 1 of them has headers that are not being detected. The structure of the CSV is the same as all the others. Advice please?
The AWS Glue crawler interprets headers based on multiple rules. If the first line in your file doesn't satisfy those rules, the crawler won't detect the first line as a header and you will need to do that manually. It's a very common problem, and we integrated a fix for this within our code as part of our data pipeline.
Excerpt from the AWS documentation:
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
You can create the table yourself, and instead of pointing the crawler at an S3 path, crawl based on the existing table. This is the approach to use when a crawler is not detecting the schema correctly, especially the column headings.
Also check whether skip.header.line.count=1 is being added automatically; if not, you can add it manually and update the schema to the one you require. On subsequent runs of your crawler, you can change its properties so that it ignores schema updates and only performs partition updates to your table.
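If you would rather set that table property with boto3 than in the console, a minimal sketch (the database and table names are placeholders; the key filtering is needed because get_table returns fields that update_table does not accept):

import boto3

glue = boto3.client("glue")

# Placeholders: replace with your actual database and table names.
database = "my_database"
table_name = "my_table"

# Fetch the current table definition from the Data Catalog.
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

# update_table only accepts TableInput fields, so keep just those.
allowed = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters",
}
table_input = {k: v for k, v in table.items() if k in allowed}

# Tell Athena/Hive to skip the header row when reading the data.
table_input.setdefault("Parameters", {})["skip.header.line.count"] = "1"

glue.update_table(DatabaseName=database, TableInput=table_input)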
You could use a custom classifier on your crawler to solve this problem: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Normally, choosing Has headings in the Column Headings section of the classifier options will do the trick; if not, it may be necessary to enter a list of headings in the text box provided for that purpose.
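The same classifier can also be created programmatically; a minimal sketch with boto3, where the classifier name and the column names are placeholders:

import boto3

glue = boto3.client("glue")

# Create a CSV classifier that treats the first row as a header.
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-with-header",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
        # Optional: spell the column names out explicitly.
        "Header": ["col_a", "col_b", "col_c"],
    }
)

Attach it to the crawler via its list of custom classifiers so it is applied before the built-in CSV classifier.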
Because your columns are all classified as strings, it's likely that the columns violate the rules. In my case, I had a column name that was longer than 150 characters, so Glue read the first row as data rather than as a header, and then assumed all columns were strings.

AWS glue: ignoring spaces in JSON properties

I have a dataset with JSON files in it. Some of the keys in these JSONs have spaces in them, like
{
'propertyOne': 'something',
'property Two': 'something'
}
I've had this data set crawled by several different crawlers to try and get the schema I want. For some reason, on one of my crawls the spaces were removed, but on trying to replicate the process I cannot get the spaces to be removed, and when querying in Athena I get this error:
HIVE_METASTORE_ERROR: : expected at position x in 'some string' but ' ' found instead.
Position x is the position of the space between 'property' and 'Two' in the JSON entry.
I would like to be able to exclude this field or have the space removed when crawled, but I'm not sure how. I can't change the JSON format. Any help is appreciated.
This is actually a bug with the AWS Glue JSON classifier: it doesn't play nicely with nested properties that have spaces in them. The syntax error is in the schema generated by the crawler, not in the JSON. It generates something like this:
struct<propertyOne:string, property Two:string>
The space in "property Two" should have been escaped by the crawler. At this point, generating the DDL for the table does not work either. We are also facing this issue and are looking for workarounds.
I believe your only option, in this case, would be to create your own custom JSON classifier to select only those attributes you want the Crawler to add to the Data Catalog.
I.e. if you want to retrieve only propertyOne, you can specify the JSONPath expression as $.propertyOne.
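A minimal sketch of creating such a classifier with boto3, where the classifier name is a placeholder:

import boto3

glue = boto3.client("glue")

# Custom JSON classifier that only picks up the propertyOne attribute.
glue.create_classifier(
    JsonClassifier={
        "Name": "property-one-only",
        "JsonPath": "$.propertyOne",
    }
)

Remember to add the classifier to the crawler's list of custom classifiers and re-run the crawl.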
Note also that your JSON should use double quotes; the single quotes could also be causing issues when parsing the data.

How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe?

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying whether a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() to convert the CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
Default separator is ,
Default quoteChar is "
If you wish to change them, check https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
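A rough sketch of the whole conversion, where the S3 paths are placeholders and the file is assumed to be quoted with " as in the record above:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder S3 paths.
src = "s3://my-bucket/csv-input/"
dst = "s3://my-bucket/parquet-output/"

# Read the CSV, honouring commas embedded inside quoted fields.
df = glue_context.create_dynamic_frame_from_options(
    "s3",
    {"paths": [src]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"'},
)

# Write it back out as Parquet; point the crawler at this path afterwards.
glue_context.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": dst},
    format="parquet",
)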

Parse JSON as key value in Dataflow job

How do I parse JSON data in Apache Beam and store it in a BigQuery table?
For example: JSON data
[{ "name":"stack"},{"id":"100"}].
How do I parse the JSON data and convert it to a key-value PCollection that can be stored in a BQ table?
Appreciate your help!!
Typically you would use a built-in JSON parser in the programming language you are writing the pipeline in (are you using the Java or the Python SDK?). Then create a TableRow object and use that for the PCollection which you are passing to the BQ table.
Note: Some JSON parsers disallow JSON which starts with a root list, as you have shown in your example. They tend to prefer something like this, with a root map. I believe this is the case in python's json library.
{"name":"stack", "id":"100"}
Please see this example pipeline for how to create the PCollection and use BigQueryIO.
You may also want to consider using one of the X to BigQuery template pipelines.
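As a rough illustration only, with placeholder project, dataset, and table names and the input hardcoded to mirror the example above, a Python SDK pipeline could look like this:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder table reference and schema.
TABLE = "my-project:my_dataset.my_table"
SCHEMA = "name:STRING,id:STRING"

def parse_record(raw):
    # Parse one JSON string into a dict that maps directly to a table row.
    return json.loads(raw)

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "CreateInput" >> beam.Create(['{"name": "stack", "id": "100"}'])
        | "ParseJson" >> beam.Map(parse_record)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            TABLE,
            schema=SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )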