Amazon Athena: no viable alternative at input

While creating a table in Athena, it gives me the following exception:
no viable alternative at input

Hyphens are not allowed in table names (even though the wizard allows them). Just remove the hyphen and it works like a charm.
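For illustration, a minimal sketch (the table name, column, and S3 path are placeholders):
-- Fails with "no viable alternative at input" because of the hyphen:
-- CREATE EXTERNAL TABLE my-table (id string) LOCATION 's3://my-bucket/some/path/';
-- Works once the hyphen is replaced with an underscore:
CREATE EXTERNAL TABLE my_table (id string) LOCATION 's3://my-bucket/some/path/';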

Unfortunately, at the moment the syntax validation error messages are not very descriptive in Athena; this error can mean almost any possible syntax error in the CREATE TABLE statement.
Annoying as that is, for now you will need to check that the syntax follows the CREATE TABLE documentation.
Some examples are:
Backticks not in place (as already pointed out)
Missing/extra commas (remember that the last column doesn't need a comma after its definition)
Missing spaces
More...

This error generally occurs when the DDL syntax has some silly error. There are several answers here that explain different errors, depending on their cause. The simple solution to this problem is to patiently look through the DDL and verify the following points line by line:
Check for missing commas
Unbalanced backticks (`)
A data type that is not supported by Hive (see the Hive data types reference)
Extra or unbalanced commas
Hyphens in the table name

In my case, it was because of a trailing comma after the last column in the table. For example:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';
After I removed the comma at the end of two STRING, it worked fine.
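For reference, the same statement with the trailing comma removed:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING
) LOCATION 's3://my-bucket/some/path';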

My case: it was an external table, and the location had a typo (hence it didn't exist).
A couple of tips:
Click the "Format query" button so you can spot errors easily
Use the example at the bottom of the documentation - it works - and modify it with your parameters: https://docs.aws.amazon.com/athena/latest/ug/create-table.html

Slashes. Mine was slashes. I had the DDL from Athena saved as a Python string.
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')
was changed to
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='"',
'separatorChar'=',')
And everything fell apart, because Python's own string escaping had consumed one level of backslashes.
Had to make it:
WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')

In my case, it was an extra comma in the PARTITIONED BY section.

In my case, I was missing the single quotes around the S3 URL.

In my case, it was that one of the table column names was enclosed in single quotes ('bucket'), as per the AWS documentation :(
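In Athena DDL, identifiers are quoted with backticks rather than single quotes; a minimal sketch of the difference (names and path are placeholders):
-- Fails: 'bucket' is parsed as a string literal, not a column name
-- CREATE EXTERNAL TABLE my_table ('bucket' string) LOCATION 's3://my-bucket/some/path/';
-- Works: quote the column name with backticks (or leave it unquoted)
CREATE EXTERNAL TABLE my_table (`bucket` string) LOCATION 's3://my-bucket/some/path/';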

As other users have noted, the standard syntax validation error message that Athena provides is not particularly helpful. Thoroughly checking the required DDL syntax (see the Hive data types reference) that other users have mentioned can be pretty tedious, since it is fairly extensive.
So, an additional troubleshooting trick is to let AWS's own data parsing engine (AWS Glue) give you a hint about where your DDL may be off. The idea here is to let AWS Glue parse the data using its own internal rules and then show you where you may have made your mistake.
Specifically, here are the steps that worked for me to troubleshoot my DDL statement, which was giving me lots of trouble:
create a data crawler in AWS Glue; AWS and lots of other places go through the very detailed steps this requires so I won't repeat it here
point the crawler to the same data that you wanted (but failed) to upload into Athena
set the crawler output to a table (in an Athena database you've already created)
run the crawler and wait for the table with populated data to be created
find the newly-created table in the Athena Query Editor tab, click on the three vertical dots (...), and select "Generate Create Table DDL":
this will make Athena create the DDL for this table that is guaranteed to be valid (since the table was already created using that DDL)
take a look at this DDL and see if/where/how it differs from the DDL that you originally wrote. Naturally, this automatically-generated DDL will not have the exact choices for the data types that you may find useful, but at least you will know that it is 100% valid
finally, update your DDL based on this new Glue/Athena-generated DDL, adjusting the column/field names and data types for your particular use case
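As a shortcut for the "Generate Create Table DDL" step, the same known-valid DDL can also be pulled out with a query in the editor (a sketch; the database and table names are placeholders):
-- Prints the CREATE TABLE statement for the crawler-built table,
-- which you can then diff against your own hand-written DDL.
SHOW CREATE TABLE my_database.my_crawler_table;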

After searching and following all the good answers here: my issue was that, working in Node.js, I needed to remove the optional
ESCAPED BY '\' used in the row format settings to get my query to work. Hope this helps others.

Something that wasn't obvious to me the first time I used the UI is that if you get an error in the create table 'wizard', you can then cancel, and the failed query should appear in a new query window for you to edit and fix.
My database name had a hyphen, so I added backticks in the query and reran it.

This happened to me due to having comments in the query.
I realized this was a possibility when I tried the "Format Query" button and it turned the entire thing into almost 1 line, mostly commented out. My guess is that the query parser runs this formatter before sending the query to Athena.
Removed the comments, ran the query, and an angel got its wings!

Related

Google Bigquery: Join of two external tables fails if one of them is empty

I have two external tables in BigQuery, created on top of JSON files on Google Cloud Storage. The first one is a fact table, the second is errors data - and it might or might not be empty.
I can query each table separately just fine, even an empty one - here is an empty table query result example.
I'm also able to left join them if both of them are not empty.
However, if the errors table is empty, my query fails with the following error:
The query specified one or more federated data sources but not all of them were scanned. It usually indicates incorrect uri specification or a 'limit' clause over a union of federated data sources that was satisfied without having to read all sources.
This situation isn't covered anywhere in the docs, and it's not related to this versioning issue - Reading BigQuery federated table as source in Dataflow throws an error
I'd rather avoid converting either of these tables to native, since they are used in just one step of the ETL process, and this data is dropped afterwards. One of them being empty doesn't look like an exceptional situation, since a plain select works just fine.
Is some workaround possible?
UPD: raised an issue with Google, waiting for response - https://issuetracker.google.com/issues/145230326
It feels like a bug. One workaround is to use scripting to avoid querying the empty table:
DECLARE is_external_table_empty BOOL DEFAULT
(SELECT 0 = (SELECT COUNT(*) FROM your_external_table));
-- do things differently when is_external_table_empty is true
IF is_external_table_empty = true
THEN ...
ELSE ...
END IF
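For a fuller picture, a sketch with the branches filled in (fact_table, error_table, and the join columns are hypothetical names standing in for your own):
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM error_table));

IF is_external_table_empty THEN
  -- the errors table is empty: read the fact table on its own
  SELECT f.* FROM fact_table AS f;
ELSE
  -- both tables have data, so the LEFT JOIN over the two federated sources is safe
  SELECT f.*, e.error_message
  FROM fact_table AS f
  LEFT JOIN error_table AS e ON e.fact_id = f.id;
END IF;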

How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe?

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() while converting CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
The default separator is , and the default quoteChar is ". If you wish to change them, check https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html

In Redshift, how do you combine CTAS with the "if not exists" clause?

I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running into a limitation in Redshift.
Here's what I want to do:
I have data that I need to move between schema, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near as (for the record, I did try removing "as" and then it says there is an error near select)
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user-defined functions... Redshift doesn't support UDFs. They did recently implement a Python-based UDF system, but my understanding is that it's in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
Indeed, this combination of CREATE TABLE ... AS SELECT and IF NOT EXISTS is not possible in Redshift (per the documentation). In PostgreSQL, it has been possible since version 9.5.
On SO, this is discussed here: PostgreSQL: Create table if not exists AS. The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.
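A sketch of one such pattern, using the table names from the question: plain CREATE TABLE accepts both IF NOT EXISTS and a LIKE clause, so you can create the empty shell first and then load it in a separate statement.
create table if not exists temp_2 (like temp_table);
insert into temp_2 select * from temp_table; -- add a WHERE clause if you only want part of the data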

Redshift COPY command delimiter not found

I'm trying to load some text files to Redshift. They are tab delimited, except for after the final row value. That's causing a delimiter not found error. I only see a way to set the field delimiter in the COPY statement, not a way to set a row delimiter. Any ideas that don't involve processing all my files to add a tab to the end of each row?
Thanks
I don't think the problem is with a missing <tab> at the end of lines. Are you sure that ALL lines have the correct number of fields?
Run the query:
select le.starttime, d.query, d.line_number, d.colname, d.value,
le.raw_line, le.err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
order by le.starttime desc
limit 100
to get the full error report. It will show the filename with errors, incorrect line number, and error details.
This will help to find where the problem lies.
You can get the delimiter not found error if your row has fewer columns than expected. Some CSV generators may just output a single quote at the end if the last columns are null.
To solve this you can use the FILLRECORD parameter in the Redshift COPY options.
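For example, a sketch of a tab-delimited COPY with FILLRECORD added (the schema, table, bucket, and IAM role names are placeholders):
COPY my_schema.my_table
FROM 's3://my-bucket/my-file.txt'
IAM_ROLE 'arn:aws:iam::123456789012:role/my_redshift_role'
DELIMITER '\t'
FILLRECORD;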
From my understanding, the Delimiter not found error message may also be caused by not specifying the COPY command correctly, in particular by not specifying the data format parameters: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
In my case I was trying to load Parquet data with this expression:
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::my_role:role/my_redshift_role'
REGION 'my-region-1';
and I received the Delimiter not found error message when looking into the system table stl_load_errors. But specifying that I'm dealing with Parquet data in the expression, in this way:
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::my_role:role/my_redshift_role'
FORMAT AS PARQUET;
solved my problem and I was able to correctly load the data.
I know this was answered, but I just dealt with the same error and I had a simple solution, so I'll share it.
This error can also be solved by stating the specific columns of the table that are copied from the S3 files (if you know which columns are in the data on S3).
In my case the data had fewer columns than the number of columns in the table.
Madahava's answer with the FILLRECORD option DID solve the issue for me, but then I noticed that a column that was supposed to be filled with default values remained null.
COPY <table> (col1, col2, col3) from 's3://somebucket/file' ...
This may not be directly related to the OP's question but I received the same Delimiter not found error which was caused by newline characters within one of the fields.
For any field that you think may have newline characters you can remove them with:
replace(my_field, chr(10), '')
When you send fewer fields than expected to the destination table, it will also throw this error.
I'm sure there are multiple scenarios that would return this error. I just came across one that I don't see mentioned in the other answers while I was debugging someone else's code. The COPY had the EXPLICIT_IDS option listed, the table it was trying to import into had a column with a data type of identity(1,1), but the file it was trying to import into Redshift did not have an ID field. It made sense for me to add the identity field to the file. But, I imagine removing the EXPLICIT_IDS option would also have fixed the issue.
So recently I came across this Delimiter not found error in Redshift SQL while loading data with the COPY command. In my case, the problem was with the column count.
I had created a table with 20 columns but I was loading a file with 21 columns.
I corrected it by making it 21 columns in the table, then reloaded the data and boom, it worked.
Hope it will be helpful to those who are facing the same kind of problem.
Ta-da
Sometimes this pops up when you don't specify the file type, for example CSV.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
copy "dev"."my"."table" from 's3://bucket/myfile_upload.csv' credentials 'aws_iam_role=arn:aws:iam::2112277888:role/RedshiftAccessRole' IGNOREHEADER 1 csv;

How can I encode/escape a varchar to be more secure without using cfqueryparam?

How can I encode/escape a varchar to be more secure without using cfqueryparam? I want to implement the same behaviour without using <cfqueryparam>, to get around the "Too many parameters were provided in this RPC request. The maximum is 2100" problem. See: http://www.bennadel.com/blog/1112-Incoming-Tabular-Data-Stream-Remote-Procedure-Call-Is-Incorrect.htm
Update:
I want the validation / security part, without generating a prepared-statement.
What's the strongest encode/escape I can do to a varchar inside <cfquery>?
Something similar to mysql_real_escape_string() maybe?
As others have said, that length-related error originates at a deeper level, not within the queryparam tag. And it offers some valuable protection and therefore exists for a reason.
You could always either insert those values into a temporary table and join against that one (see the sketch after the example below), or use the list functions to split that huge list into several smaller lists, which are then used separately.
SELECT name ,
..... ,
createDate
FROM somewhere
WHERE (someColumn IN (a,b,c,d,e)
OR someColumn IN (f,g,h,i,j)
OR someColumn IN (.........));
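And a sketch of the temporary-table route mentioned above (assuming SQL Server, which is where the 2100-parameter limit comes from; the temp table name, columns, and values are illustrative):
-- Load the long value list once, then join against it instead of building a huge IN list.
CREATE TABLE #lookup_values (val VARCHAR(100));

INSERT INTO #lookup_values (val)
VALUES ('a'), ('b'), ('c');   -- batch these inserts for very long lists

SELECT s.name, s.createDate
FROM somewhere AS s
INNER JOIN #lookup_values AS lv ON lv.val = s.someColumn;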
cfqueryparam performs multiple functions.
It verifies the datatype. If you say integer, it makes sure there is an integer, and if not, it does not allow it to pass.
It separates the data of a SQL script from the executable code (this is where you get protection from SQL injection). Anything passed as a param cannot be executed.
It creates bind variables at the DB engine level to help improve performance.
That is how I understand cfqueryparam to work. Did you look into the option of making several small calls vs one large one?
It is a security issue. Stops SQL injections
Adobe recommends that you use the cfqueryparam tag within every cfquery tag, to help secure your databases from unauthorized users. For more information, see Security Bulletin ASB99-04, "Multiple SQL Statements in Dynamic Queries," at www.adobe.com/devnet/security/security_zone/asb99-04.html, and "Accessing and Retrieving Data" in the ColdFusion Developer's Guide.
The first thing I'd be asking myself is "how the heck did I end up with more than 2100 params in a single query?". Because that in itself should be a very very big red flag to you.
However if you're stuck with that (either due to it being outwith your control, or outwith your motivation levels to address ;-), then I'd consider:
the temporary table idea mentioned earlier
for values over a certain length just chop 'em in half and join 'em back together with a string concatenator, eg:
SELECT *
FROM tbl
WHERE col IN ('a', ';DROP DATABAS'+'E all_my_data', 'good', 'etc' [...])
That's a bit grim, but then again your entire query sounds grim, so that might not be such a concern.
screening param values that are over a certain length or have stop words in them, or something. This is also quite a grim suggestion.
SERIOUSLY go back over your requirement and see if there's a way to not need 2100+ params. What is it you're actually needing to do that requires all this???
The problem does not reside with cfqueryparam, but with MSSQL itself:
Every SQL batch has to fit in the Batch Size Limit: 65,536 * Network Packet Size.
Maximum size for a SQL Server Query? IN clause? Is there a Better Approach
And
http://msdn.microsoft.com/en-us/library/ms143432.aspx
The few times that I have come across this problem I have been able to rewrite the query using subselects and/or table joins. I suggest trying to rewrite the query like this in order to avoid the parameter max.
If it is impossible to rewrite (e.g. all of the multiple parameters are coming from an external source) you will need to validate the data yourself. I have used the following regex in order to perform a safe validation:
<cfif ReFindNoCase("[^a-z0-9_\ \,\.]",arguments.InputText) IS NOT 0>
<cfthrow type="Application" message="Invalid characters detected">
</cfif>
The code will force an error if any character other than a letter, number, space, comma, underscore, or period is found in the text string. (You may want to handle the situation more cleanly than just throwing an error.) I suggest you modify this as necessary based on the expected or allowed values in the fields you are validating. If you are validating a string of comma-separated integers, you may switch to a more limiting regex like "[^0-9\ \,]", which will only allow numbers, commas, and spaces.
This answer will not escape the characters, it will not allow them in the first place. It should be used on any data that you will not use with <cfqueryparam>. Personally, I have only found a need for this when I use a dynamic sort field; not all databases will allow you to use bind variables with the ORDER BY clause.