AWS Glue: ignoring spaces in JSON properties

I have a dataset with JSON files in it. Some of the property names in these JSON files contain spaces, like:
{
'propertyOne': 'something',
'property Two': 'something'
}
I've had this dataset crawled by several different crawlers to try to get the schema I want. For some reason, on one of my crawls the spaces were removed, but on trying to replicate the process I cannot get the spaces to be removed, and when querying in Athena I get this error:
HIVE_METASTORE_ERROR: : expected at position x in 'some string' but ' ' found instead.
Position x is the position of the space between 'property' and 'Two' in the JSON entry.
I would like to be able to either exclude this field or have the space removed when crawled, but I'm not sure how. I can't change the JSON format. Any help is appreciated.

This is actually a bug in the AWS Glue JSON classifier: it doesn't play nice with nested properties that have spaces in them. The syntax error is in the schema generated by the crawler, not in the JSON. It generates something like this:
struct<propertyOne:string, property Two:string>
The space in "property Two" should have been escaped by the crawler. At this point, generating the DDL for the table is not working either. We are also facing this issue and looking for workarounds.

I believe your only option, in this case, would be to create your own custom JSON classifier to select only those attributes you want the Crawler to add to the Data Catalog.
E.g. if you want to retrieve only propertyOne, you can specify the JSONPath expression $.propertyOne.
Note also that your JSON should use double quotes; the single quotes could also be causing issues when the data is parsed.
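For reference, here is a minimal boto3 sketch of registering such a classifier; the classifier name is made up, and only the JSONPath comes from the answer above:

import boto3

glue = boto3.client("glue")

# Keep only the attribute you want cataloged; this sidesteps the
# property name containing a space.
glue.create_classifier(
    JsonClassifier={
        "Name": "select-property-one",   # hypothetical name, pick your own
        "JsonPath": "$.propertyOne",
    }
)

You would then list the classifier's name in the crawler's Classifiers setting so it is tried before the built-in JSON classifier.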

Related

AWS DynamoDB fails with "The provided key element does not match the schema" because of CSV format

I am working on importing data from an S3 bucket into DynamoDB using Data Pipeline. The data is in CSV format. I have been struggling with this for a week now, and finally came to know the real problem.
I have some fields; the important ones are id (partition key) and username (sort key).
Now, in one of the entries, the username has been set to a value containing a comma. For example: {"username": "someuser,name"}. The irony of a CSV (comma-separated) file is that, when mapping to DynamoDB, it takes the comma as the start of a new column. And so it fails with the error The provided key element does not match the schema, which is of course correct.
Is there any way I can overcome this issue? Thanks in advance for your suggestions.
EDIT:
The CSV entry looks like this, as an example:
1234567,"user,name",$123$,some#email.de,2002-05-28 14:07:04.0,2013-07-19 14:17:05.0,2020-02-19 15:32:18.611,2014-02-27 14:49:19.0,,,,
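As an illustration of what is going wrong, a proper CSV parser keeps the quoted "user,name" together as one field, while naive splitting on commas (which is effectively what the failing mapping does) tears it apart. A quick Python sketch using the row above:

import csv

row = '1234567,"user,name",$123$,some#email.de,2002-05-28 14:07:04.0,2013-07-19 14:17:05.0,2020-02-19 15:32:18.611,2014-02-27 14:49:19.0,,,,'

# A real CSV parser honors the double quotes around "user,name" ...
print(next(csv.reader([row]))[1])   # user,name

# ... while naive comma-splitting breaks the field in two.
print(row.split(',')[1])            # "user

So any workaround has to make the import treat quoted fields as single values, or pre-process the file into a quote/escape scheme the importer understands.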

Glue custom classifiers for CSV with a non-standard delimiter

I am trying to use AWS Glue to crawl a data set and make it available to query in Athena. My data set is a delimited text file using ^ to separate columns. Glue is not able to infer the schema for this data, as the CSV classifier only recognises comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Is there a way of updating this classifier to include non-standard delimiters? The option to build custom classifiers only seems to support Grok, JSON or XML, which are not applicable in this case.
You will need to create a custom classifier using a custom Grok pattern and use that in the crawler. Suppose your data looks like the following, with four fields:
qwe^123^22.3^2019-09-02
To process the above data, your classification pattern would look like the following (the carets are escaped, since a bare ^ is a regex anchor in Grok):
%{NOTSPACE:name}\^%{INT:class_num}\^%{BASE10NUM:balance}\^%{CUSTOMDATE:balance_date}
Note that CUSTOMDATE is not a built-in Grok pattern, so you also need to define it in the classifier's custom patterns box, for example:
CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}
Please let me know if that worked for you.

Amazon Athena: no viable alternative at input

While creating a table in Athena, it gives me the following exception:
no viable alternative at input
Hyphens are not allowed in table names (though the wizard allows them). Just remove the hyphen and it works like a charm.
Unfortunately, at the moment the syntax validation error messages in Athena are not very descriptive; this error may mean almost any possible syntax error in the CREATE TABLE statement.
Annoying as that is, for the moment you will need to check that the syntax follows the CREATE TABLE documentation.
Some examples are:
Backticks not in place (as already pointed out)
Missing/extra commas (remember that the last column doesn't need a comma after its definition)
Missing spaces
More ..
This error generally occurs when the DDL syntax has some silly errors. There are several answers here that explain different errors depending on their cause. The simple solution to this problem is to patiently look through the DDL and verify the following points line by line:
Check for missing commas
Unbalanced backticks (`)
Incompatible data types not supported by Hive (see the Hive data types reference)
Unbalanced commas
Hyphens in the table name
In my case, it was because of a trailing comma after the last column in the table. For example:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';
After I removed the comma at the end of two STRING, it worked fine.
My case: it was an external table and the location had a typo (hence it didn't exist).
A couple of tips:
Click the "Format query" button so you can spot errors easily
Use the example at the bottom of the documentation - it works - and modify it with your parameters: https://docs.aws.amazon.com/athena/latest/ug/create-table.html
Backslashes. Mine was backslashes. I had the DDL from Athena, saved as a Python string.
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')
was changed to
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='"',
'separatorChar'=',')
And everything fell apart.
Had to make it:
WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')
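For anyone curious why the quadrupling is needed: in a regular (non-raw) Python string literal, every \\ collapses to a single backslash before the DDL ever reaches Athena. A small sketch, plain Python, no AWS involved:

# Each \\ in the literal becomes one backslash in the actual string,
# so four backslashes are needed to send two.
ddl = "WITH SERDEPROPERTIES ('escapeChar'='\\\\', 'quoteChar'='\\\"')"
print(ddl)   # WITH SERDEPROPERTIES ('escapeChar'='\\', 'quoteChar'='\"')

# A raw string is another way to avoid the double-escaping.
ddl_raw = r"WITH SERDEPROPERTIES ('escapeChar'='\\', 'quoteChar'='\"')"
print(ddl_raw)   # identical output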
In my case, it was an extra comma in the PARTITIONED BY section.
In my case, I was missing the single quotes around the S3 URL.
In my case, it was that one of the table column names was enclosed in single quotes, as per the AWS documentation :( ('bucket')
As other users have noted, the standard syntax validation error message that Athena provides is not particularly helpful. Thoroughly checking the required DDL syntax (see the Hive data types reference) that other users have mentioned can be pretty tedious, since it is fairly extensive.
So, an additional troubleshooting trick is to let AWS's own data parsing engine (AWS Glue) give you a hint about where your DDL may be off. The idea here is to let AWS Glue parse the data using its own internal rules and then show you where you may have made your mistake.
Specifically, here are the steps that worked for me to troubleshoot my DDL statement, which was giving me lots of trouble:
create a data crawler in AWS Glue (see the boto3 sketch after these steps); AWS and lots of other places go through the very detailed steps this requires, so I won't repeat them here
point the crawler to the same data that you wanted (but failed) to upload into Athena
set the crawler output to a table (in an Athena database you've already created)
run the crawler and wait for the table with populated data to be created
find the newly created table in the Athena Query Editor tab, click on the three vertical dots (...), and select "Generate Create Table DDL"
this will make Athena create the DDL for this table that is guaranteed to be valid (since the table was already created using that DDL)
take a look at this DDL and see if/where/how it differs from the DDL that you originally wrote. Naturally, this automatically generated DDL will not have the exact choices for the data types that you may find useful, but at least you will know that it is 100% valid
finally, update your DDL based on this new Glue/Athena-generated DDL, adjusting the column/field names and data types for your particular use case
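If you would rather script the crawler than click through the console, here is a minimal boto3 sketch; the crawler name, role ARN, database name, and S3 path are all placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler pointed at the same S3 data that failed in Athena.
glue.create_crawler(
    Name="ddl-debug-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="my_athena_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/some/path/"}]},
)

# Run it; once it finishes, the table appears in the Athena database.
glue.start_crawler(Name="ddl-debug-crawler")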
After searching and following all the good answers here: my issue was that, working in Node.js, I needed to remove the optional ESCAPED BY '\' used in the row settings to get my query to work. Hope this helps others.
Something that wasn't obvious to me the first time I used the UI is that if you get an error in the create table wizard, you can cancel, and the failed query should appear in a new query window for you to edit and fix.
My database name had a hyphen, so I added backticks in the query and reran it.
This happened to me due to having comments in the query.
I realized this was a possibility when I tried the "Format Query" button and it turned the entire thing into almost 1 line, mostly commented out. My guess is that the query parser runs this formatter before sending the query to Athena.
Removed the comments, ran the query, and an angel got its wings!

AWS CloudSearch Filter on non-indexed field

I'm trying to do a structured query with a lot of (potentially) dynamic fields in the search pattern. So far everything is good, except I want to be able to filter the results by a field that is not indexed. Is this possible?
The test search console is showing this error: "Syntax Error in query: field (fieldname) is not searchable"
All index fields that you intend to use for filtering should be marked as searchable.
I figured this out by checking the CloudSearch console's network payload.
If you are filtering by a facet field that is not marked searchable, you have to add a second query parameter: fq=<facetFieldName>&facet.<facetFieldName>={}. This seems to filter the results as expected.
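For completeness, a sketch of the same filter issued from code with boto3; the endpoint URL and the field name are placeholders, and note that the facet options are passed as a JSON string:

import boto3

# The search endpoint comes from your CloudSearch domain's console page.
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com",
)

# filterQuery corresponds to the fq parameter; the empty {} requests the
# facet with default options, mirroring what the console sends.
resp = client.search(
    query="matchall",
    queryParser="structured",
    filterQuery="myfacetfield:'some-value'",
    facet='{"myfacetfield": {}}',
)
print(resp["hits"]["found"])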

RegEx for a JSON string

I'm storing a person object as JSON in my SQLite database. The table will have a few thousand records of person objects. What I need is to query persons based on the "name" attribute.
After investigating, I figured I could use SQLite's GLOB operator to perform a regex-like search on the JSON elements.
My sample JSON is something like this:
{"name":"john","age":"22","father-name":"jackson"}
Now I want a regex matcher to get me all the records whose name attribute contains a given substring. And it should be case-insensitive too.
For example: searching "ohn" should fetch john's record.
While you can store JSON and search against it using regexes (which are rather limited in SQLite), it does not mean you should.
Instead, you should really consider splitting your JSON into fields and storing them in a normal SQLite table. Doing so will not only allow you to perform easier searches without having to painfully parse the data every single time; searches will also be much faster (if you add the necessary indexes).
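A minimal sketch of that approach with Python's built-in sqlite3 module (the table and column names are made up): parse the JSON once at insert time, store the attributes as real columns, and then LIKE, which is case-insensitive for ASCII by default (unlike GLOB), does the substring search without any regex machinery:

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER, father_name TEXT)")

# Parse the JSON once, on insert, instead of on every search.
record = '{"name":"john","age":"22","father-name":"jackson"}'
p = json.loads(record)
conn.execute(
    "INSERT INTO person VALUES (?, ?, ?)",
    (p["name"], int(p["age"]), p["father-name"]),
)

# "ohn" finds "john"; an index on name would speed this up further.
rows = conn.execute("SELECT * FROM person WHERE name LIKE ?", ("%ohn%",)).fetchall()
print(rows)   # [('john', 22, 'jackson')]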
If you do want to go down the regex route, the following will extract the record:
/\{"name":"\w*ohn\w*[^\}]+\}/i
This will match each of these:
{"name":"john","age":"22","father-name":"jackson"}
{"name":"john","age":"22","father-name":"johnson"}
{"name":"johnny","age":"22","father-name":"smith"}
but not:
{"name":"fred","age":"22","father-name":"hall"},
{"name":"mike","age":"22","father-name":"johnson"}
{"name":"bob","age":"22","father-name":"todd"}