How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe? - amazon-web-services

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
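To illustrate the difference, here is a minimal sketch that uses plain Python as a stand-in for the two parsing strategies. It does not run either SerDe; it only contrasts splitting on every delimiter with quote-aware parsing of the record above.

```python
import csv
import io

line = '"blablabla","1","Freeman,Morgan","bla bla bla"'

# Naive delimiter splitting: every comma is a field boundary, quotes are ignored.
naive_fields = line.split(",")

# Quote-aware parsing: commas inside double quotes stay part of the field.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields[2])    # 5 "Freeman
print(len(quoted_fields), quoted_fields[2])  # 4 Freeman,Morgan
```

The first result mirrors the broken column; the second is what a quote-aware reader should produce.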
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I feel that would largely defeat the purpose, but it's a tenable solution.

This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two
columns and two rows of data. The CSV classifier uses a number of
heuristics to determine whether a header is present in a given file.
If the classifier can't determine a header from the first row of data,
column headers are displayed as col1, col2, col3, and so on. The
built-in CSV classifier determines whether to infer a header by
evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing
delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than
STRING type. If all columns are of type STRING, then the first row of
data is not sufficiently different from subsequent rows to be used as
the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.

Use glueContext.create_dynamic_frame_from_options() in a Glue job to read the CSV, convert it to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
Default separator is ,
Default quoteChar is "
To change either of these, see the format options at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html

Related

Amazon Redshift: Check column names during COPY

Can I check column names during a COPY from S3 to Redshift?
For example, I have "good" CSV:
name ,sur_name
BOB , FISCHER
And I have "wrong" CSV:
sur_name,name
FISCHER , BOB
Can I check names of columns during copy command?
I don't want to use AWS Glue or AWS Lambda for checks because I don't want to open/load/save the same file many times.
(The same problem applies to other files with column names.)
This is a very simple check, so Redshift should allow it, but I can't find any information about it.
If this is not possible, can you give me some idea of how to do it without reading the whole file?
(For example, a Lambda function that reads only the header without fetching the whole file.)
From Column mapping options - Amazon Redshift:
You can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
Therefore, the only way to read such files would be to specify the column names in their correct order. This requires you to look inside the file to determine the order of the columns.
When reading an object from Amazon S3, it is possible to specify the range of bytes to be read. So, instead of reading the entire file, it could read just the first 200 bytes (or whatever size would be sufficient to include the header row). An AWS Lambda function could read these bytes, extract the column names, then generate a COPY command that would import the columns in the correct order (without having to read the entire file first).
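The approach above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested Lambda: s3_client is assumed to be boto3.client("s3"), and the table name, expected column set, and IAM role are hypothetical placeholders. It also assumes a plain comma-separated header with no quoted names.

```python
def read_header_columns(s3_client, bucket, key, max_bytes=200):
    """Fetch only the first max_bytes of the object and parse the header row."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{max_bytes - 1}"
    )
    first_chunk = response["Body"].read().decode("utf-8", errors="replace")
    if "\n" not in first_chunk:
        # The header is longer than the requested range; ask for more bytes.
        raise ValueError("header row did not fit in the requested byte range")
    header_line = first_chunk.splitlines()[0]
    return [name.strip() for name in header_line.split(",")]


def build_copy_statement(table, columns, expected_columns, bucket, key, iam_role):
    """Build a COPY statement listing the columns in the file's own order."""
    if set(columns) != set(expected_columns) or len(columns) != len(set(columns)):
        # Reject unknown, missing, or duplicated names before building any SQL.
        raise ValueError(f"unexpected header columns: {columns}")
    column_list = ", ".join(columns)
    return (
        f"COPY {table} ({column_list}) "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "CSV IGNOREHEADER 1;"
    )
```

Checking the header against a known set of names also keeps arbitrary file content out of the generated SQL.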

What is AWS S3 dataset?

Looking at documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks like when dataset=True, I can't specify the file name, and AWS autogenerates the names of the files that are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?
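With dataset=True, awswrangler writes the output as a dataset, that is, a set of files under an S3 prefix that can optionally be partitioned and registered in the AWS Glue Data Catalog, rather than as a single file that you name yourself.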
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all the extra arguments that become available when you write a dataset. Options such as columns_comments, concurrent_partitioning, and projection_values have no effect when you write an ordinary CSV or Parquet file with dataset=False. On the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.

AWS Glue not detecting header in CSV

Hi, I have a bunch of CSVs located in S3 and a crawler set up via AWS Glue. The crawler builds about 10 tables as it scans 10 folders, and for only 1 of them the headers are not being detected. The structure of that CSV is the same as all the others. Advice please?
The AWS Glue crawler identifies a header based on multiple rules. If the first line in your file doesn't satisfy those rules, the crawler won't detect the first line as a header and you will need to set it manually. It's a very common problem, and we integrated a fix for it into our code as part of our data pipeline.
Excerpt from the AWS documentation:
To be classified as CSV, the table schema must have at least two
columns and two rows of data. The CSV classifier uses a number of
heuristics to determine whether a header is present in a given file.
If the classifier can't determine a header from the first row of data,
column headers are displayed as col1, col2, col3, and so on. The
built-in CSV classifier determines whether to infer a header by
evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has
content that is fewer than 150 characters. To allow for a trailing
delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex
requirements for a column name.
The header row must be sufficiently different from the data rows. To
determine this, one or more of the rows must parse as other than
STRING type. If all columns are of type STRING, then the first row of
data is not sufficiently different from subsequent rows to be used as
the header.
You can create the table yourself and, instead of pointing the crawler at an S3 path, have it crawl based on the existing table. This is the approach to use when a crawler is not detecting the schema, especially the column headings.
Also check whether skip.header.line.count=1 is being added automatically. If not, you can add it manually and update the schema to the one you require. On subsequent crawler runs, you can change the crawler properties so that it ignores schema updates and only performs partition updates on your table.
You could use a custom classifier on your crawler to solve this problem: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Normally, choosing Has headings in the Column Headings section of the classifier options will do the trick. If not, it may be necessary to enter a list of headings in the text box provided for that purpose.
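If the crawler is managed from code rather than the console, the same setting can be expressed with boto3. This is a minimal sketch rather than a tested deployment: glue_client is assumed to be boto3.client("glue"), and the classifier name and header list are hypothetical.

```python
def csv_classifier_request(name, header=None, delimiter=",", quote_symbol='"'):
    """Build keyword arguments for the Glue create_classifier call."""
    classifier = {
        "Name": name,
        "Delimiter": delimiter,
        "QuoteSymbol": quote_symbol,
        # Equivalent of choosing "Has headings" in the console.
        "ContainsHeader": "PRESENT",
    }
    if header:
        # Optional explicit column names, for when detection still fails.
        classifier["Header"] = list(header)
    return {"CsvClassifier": classifier}


def create_csv_classifier(glue_client, name, header=None):
    """Create the classifier; glue_client is assumed to be boto3.client("glue")."""
    glue_client.create_classifier(**csv_classifier_request(name, header))
```

The classifier still has to be attached to the crawler before it takes effect.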
Because your columns are all classified as strings, it's likely that the header violates one of the rules. In my case, I had a column name that was longer than 150 characters, so Glue read the first row as data rather than as a header, and then assumed all columns were strings.
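To find out which rule a file trips over, a rough local check along these lines may help. It is only an approximation of the rules quoted above, not Glue's actual classifier: the type detection is a simplification, and the column-name regex requirement is not checked at all.

```python
import csv
import io


def _is_string(value):
    """Rough test: True if the value does not look like a number or boolean."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return False
    try:
        float(text)
        return False
    except ValueError:
        return True


def header_problems(csv_text, max_len=150):
    """Return a list of quoted rules that the first row appears to violate."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2 or len(rows[0]) < 2:
        return ["fewer than two columns or two rows"]
    header, data = rows[0], rows[1:]
    problems = []
    if not all(_is_string(cell) for cell in header):
        problems.append("a header cell parses as a non-string type")
    if any(len(cell) >= max_len for cell in header[:-1]):
        problems.append(f"a header cell has {max_len} or more characters")
    if all(_is_string(cell) for row in data for cell in row):
        problems.append("all data values are strings, so the header is not distinct")
    return problems
```

An empty list means none of the checked rules were violated, which does not guarantee that Glue will detect the header.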

Athena shows no value against boolean column, table created using glue crawler

I am using an AWS Glue CSV crawler to crawl an S3 directory containing CSV files. The crawler works fine in the sense that it creates the schema with correct data types for each column. However, when I query the data from Athena, it doesn't show any value under the boolean column.
A csv looks like this:
"val","ts","cond"
"1.2841974","15/05/2017 15:31:59","True"
"0.556974","15/05/2017 15:40:59","True"
"1.654111","15/05/2017 15:41:59","True"
And the table created by crawler is:
Column name Data type
val string
ts string
cond boolean
However, when I run say select * from <table_name> limit 10 it returns:
val ts cond
1 "1.2841974" "15/05/2017 15:31:59"
2 "0.556974" "15/05/2017 15:40:59"
3 "1.654111" "15/05/2017 15:41:59"
Does anyone have any idea what the reason might be?
I forgot to add: if I change the data type of the cond column to string, it does show the data as a string, e.g. "True" or "False".
I don't know why Glue classifies the cond column as boolean, because Athena will not understand that value as a boolean. I think this is a bug in Glue, or an artefact of it not targeting Athena exclusively. Athena expects boolean values to be either true or false. I don't remember if that includes different capitalizations of the strings or not, but either way yours will fail because they are quoted. The actual bug is that Glue has not configured your table so that it strips the quotes from the strings, and therefore Athena sees a boolean column containing "True" with quotes and all, and that is not a supported boolean value. Instead you get NULL values.
You could try changing your tables to use the OpenCSVSerDe instead; it supports quoted values.
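One way to make that change on a table the crawler has already created is through the Glue API. The following is a sketch under assumptions, not a verified recipe: glue_client is assumed to be boto3.client("glue"), the database and table names are whatever your crawler produced, and the allow-list reflects the TableInput fields I know update_table accepts.

```python
import copy

# Fields that update_table accepts in TableInput; get_table returns extra
# read-only fields (such as CreateTime) that must be left out.
_TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_open_csv_serde(table, separator=",", quote='"', escape="\\"):
    """Return a TableInput copy of the table that uses the OpenCSV SerDe."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in _TABLE_INPUT_KEYS
    }
    table_input["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
        "Parameters": {
            "separatorChar": separator,
            "quoteChar": quote,
            "escapeChar": escape,
        },
    }
    return table_input


def switch_table_to_open_csv(glue_client, database, table_name):
    """Read the table definition, swap the SerDe, and write it back."""
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue_client.update_table(
        DatabaseName=database, TableInput=with_open_csv_serde(table)
    )
```

A later crawler run may overwrite this unless the crawler is configured not to update the table schema.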
It's surprising that Glue continues to stumble on basic things like this. Glue is unfortunately rarely worth the effort over writing some basic scripts yourself.

Athena Query Results: Are they always strings?

I'm in the process of building new "ETL" pipelines with CTAS. Unfortunately, quite often the CTAS query is too intensive, which causes Athena to time out. As such, I use CTAS to create the initial table and populate it with a small sample. I then write a script that queries the same table the CTAS was generated from (which is in Parquet format) for the remaining days that the CTAS couldn’t handle upfront. I write the output of these queries to the same directory that holds the results of the CTAS query before repairing the table (to pick up new data). However, it seems to be a pretty clunky process for a number of reasons:
1) Query results written out with standard SQL statements all end up being strings. For example, when I write out the number of DAUs (which is a count cast to an int), the CSV output is a string, i.e. wrapped in “”.
Is it possible to write out Athena "query_results" (not the CTAS) as anything other than a string when in CSV format? The main problem is that the output can't be read back into the table produced by the CTAS, since these columns expect a bigint. This can, of course, be resolved with a Lambda function, but that seems like a big overhead for something that should be trivial.
2) Can you put query results (not from CTAS) directly into parquet instead of CSV?
3) Is there any way to prevent metadata being generated with the query_results (not from CTAS). Again, it can be cleaned up with a lambda function, but it's just additional nonsense I need to handle.
Thanks in advance!
The data type of the result depends on the SQL used to create it and also on how you consume it. Based on your question I'm going to assume that you're creating a table using CTAS and that the output is CSV, and that you're then looking at the CSV data directly.
That CSV is going to have quotes in it, but that doesn't mean that it's not possible to read integer values as integers, and so on. Athena uses a schema-on-read approach, and as long as the serde can interpret a value as a particular type, that type will work as the type of the column.
If you query the table created by your CTAS operation you should get back integers for the integer columns.
Using CTAS you can also create output of different types, like JSON, Avro, Parquet, and ORC, that keep the type information. Just use the format property to select the output type.
I am a bit confused about what you mean by your third question. With a normal query you get two files on S3, the data file and the metadata file, and they are written to the output location given in the StartQueryExecution API call. With a CTAS query, the output data goes to a different location (given in the SQL) than the metadata file.
Are you actually using CTAS, or are you talking about the regular query result files?
Update after the question got clarified:
1) Athena is unfortunately unable to properly read its own output in many situations. It really surprises me that this was never considered before launch. You might be able to set up a table that uses the regex serde.
2) No, unfortunately the only output of a regular query is CSV at this time.
3) No, the metadata is always written to the same prefix as the output.
I think your best bet is running multiple CTAS queries that select subsets of your source data, if there is a date column for example you could make one CTAS per month or some other time range that works. After the CTAS queries have completed you can move the result files into the same directory on S3 and create a final table that has that directory as its location.
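As a rough illustration of the one-CTAS-per-month idea, the statements could be generated like this. The table names, S3 prefix, and date column are hypothetical, and the sketch assumes the column holds DATE values. Each statement gets its own external_location, so the result files still have to be moved into one directory afterwards, as described above.

```python
from datetime import date


def month_ranges(start, end):
    """Yield (first_day, first_day_of_next_month) for each month up to end."""
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        yield current, following
        current = following


def ctas_statements(source_table, target_prefix, s3_prefix, date_column, start, end):
    """Build one Parquet CTAS statement per calendar month."""
    statements = []
    for first, following in month_ranges(start, end):
        suffix = first.strftime("%Y_%m")
        statements.append(
            f"CREATE TABLE {target_prefix}_{suffix} "
            f"WITH (format = 'PARQUET', "
            f"external_location = '{s3_prefix}/{suffix}/') AS "
            f"SELECT * FROM {source_table} "
            f"WHERE {date_column} >= DATE '{first.isoformat()}' "
            f"AND {date_column} < DATE '{following.isoformat()}'"
        )
    return statements
```

Note that the ranges cover whole months, so the first one begins on the first day of the starting month even if start falls later.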