Google BigQuery: Join of two external tables fails if one of them is empty

I have 2 external tables in BigQuery, created on top of JSON files on Google Cloud Storage. The first one is a fact table, the second holds error data - and it might or might not be empty.
I can query each table separately just fine, even an empty one.
I'm also able to left join them if both of them are not empty.
However, if the errors table is empty, my query fails with the following error:
The query specified one or more federated data sources but not all of them were scanned. It usually indicates incorrect uri specification or a 'limit' clause over a union of federated data sources that was satisfied without having to read all sources.
This situation isn't covered anywhere in the docs, and it's not related to the versioning issue described in "Reading BigQuery federated table as source in Dataflow throws an error".
I'd rather avoid converting either of these tables to native, since they are used in just one step of the ETL process, and this data is dropped afterwards. One of them being empty doesn't look like an exceptional situation, since a plain select works just fine.
Is some workaround possible?
UPD: raised an issue with Google, waiting for response - https://issuetracker.google.com/issues/145230326

It feels like a bug. One workaround is to use scripting to avoid querying the empty table:
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM your_external_table));
-- do things differently when is_external_table_empty is true
IF is_external_table_empty THEN
  ...
ELSE
  ...
END IF;

Related

HIVE_PARTITION_SCHEMA_MISMATCH in Athena due to different order in struct?

Full error
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.
The types are incompatible and cannot be coerced. The column 'ein_verification' in table 'dynamodb_etl_dev.widget_user_snapshots' is declared as type
'struct<status:string,unlocktimestamp:bigint,message:string,lastverifiedtimestamp:bigint,datelastverified:bigint>', but partition 'snapshot_time=2022-08-03T18%3A41' declared column 'ein_verification' as type
'struct<status:string,unlocktimestamp:bigint,lastverifiedtimestamp:bigint,message:string,datelastverified:bigint>'.
It looks like the only difference is the order message:string,lastverifiedtimestamp:bigint is reversed but they are otherwise the same.
I know there are settings for updating the table definition and updating existing partitions with metadata from the table, but I'd like to understand why this is happening and possibly prevent it from happening at all.
Also, it appears Athena is not trying to query the latest partition, as there is a new partition with a more recent timestamp in this S3 bucket. I'm stuck on how to proceed: I can run this job once, get a single partition, and it works fine. But every time so far that I run it a second time, I get the error with the struct out of order.
Found the answer to the partition error here. In particular, there's a comment on how to do it in Terraform that helped me get it running.
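To double-check the observation that the two declarations in the error differ only in field order, a small sketch like the following can compare them. It is plain Python, not anything Athena runs; the helper names are made up for illustration, and it assumes flat structs without nested types, which holds for the two types quoted in the error.

```python
def parse_struct_fields(type_string):
    """Split a flat 'struct<name:type,...>' declaration into (name, type) pairs.

    Assumption: no nested structs or arrays inside the declaration.
    """
    prefix, suffix = "struct<", ">"
    if not (type_string.startswith(prefix) and type_string.endswith(suffix)):
        raise ValueError("not a struct type: " + type_string)
    body = type_string[len(prefix):-len(suffix)]
    return [tuple(field.split(":", 1)) for field in body.split(",")]


def differs_only_in_order(table_type, partition_type):
    """True when both structs have the same fields but in a different order."""
    table_fields = parse_struct_fields(table_type)
    partition_fields = parse_struct_fields(partition_type)
    return (table_fields != partition_fields
            and sorted(table_fields) == sorted(partition_fields))


# The two declarations copied from the error message.
table_type = ("struct<status:string,unlocktimestamp:bigint,message:string,"
              "lastverifiedtimestamp:bigint,datelastverified:bigint>")
partition_type = ("struct<status:string,unlocktimestamp:bigint,"
                  "lastverifiedtimestamp:bigint,message:string,datelastverified:bigint>")
```

Calling differs_only_in_order(table_type, partition_type) on the types above confirms that only the order of message and lastverifiedtimestamp differs.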

How does AWS Athena react to schema changes in S3 files?

What happens when after creating the table in AWS Athena for files on S3, the structure of the files on S3 change?
For example:
If the files previously had 5 columns when the table was created and later the new files started getting 1 more column:
a) at the end?
b) in between?
What happens when some columns are not available in new files?
What happens when the columns remain the same but the column order changes?
Can we alter Athena tables to adjust to these changes?
1 - Athena is not a NoSQL solution, and it does not have a dynamic schema either. If you change the schema, all your files in a particular folder should reflect that change. Athena won't magically update to include it.
2 - Then it'll be a problem and it'll break. You should include NULL or ,, to force it to be okay.
3 - Athena picks it up by column order. Not by name, really. If your column orders change, it'll probably break (different types).
4 - Yes. You can always recreate an Athena table by dropping it and creating a new one.
If you have variable length files, then you should insert them into different folders so that each folder represents one consistent schema. You can then unify this later on in Athena with a union or similar to create a condensed, simplified table that you can apply the consistent schema to.
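As a rough illustration of point 3, here is what position-based mapping does when the column order changes between files. This is plain Python with the csv module, not Athena itself; the column names and file contents are hypothetical.

```python
import csv
import io

# Hypothetical table definition: columns are matched to the file by position.
TABLE_COLUMNS = ["id", "name", "amount"]

old_file = "1,alice,10\n"   # written in the order the table expects
new_file = "1,10,alice\n"   # same data, but columns 2 and 3 swapped


def read_by_position(text, columns):
    """Map each value to a column purely by its position in the row."""
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(text))]


rows_old = read_by_position(old_file, TABLE_COLUMNS)
rows_new = read_by_position(new_file, TABLE_COLUMNS)
```

In rows_new the value "10" lands in name and "alice" lands in amount, which is the kind of mismatch that surfaces as wrong data or a type error once the columns have different types.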
It depends on the file format you are using and the setup (whether the schema is resolved by field order or by field name). All the details are here: https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
Note especially that if the data is nested or in arrays, schema changes will completely break your data. To quote from that page:
Schema updates described in this section do not work on tables with complex or nested data types, such as arrays and structs.

Does Cloud Spanner support a TRUNCATE TABLE command?

I want to clear all the values from a table. It has a few secondary indexes. I tried to do this via committing a transaction with Mutation.delete("MyTable", KeySet.all()) (see docs here). But I got an error:
error:INVALID_ARGUMENT: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: The transaction contains too many mutations.
How can I efficiently clear my table contents?
Cloud Spanner does not support such a truncate command. If your table had no secondary indexes, then you could specify KeySet.all() in your Delete as specified above, but this can fail if your table has secondary indexes and is large.
The best way to do what you want is to issue an updateDdl RPC including the following statements:
1) For each secondary index on MyTable, include a corresponding DROP INDEX statement
2) DROP TABLE MyTable
3) If necessary, re-create your table and indexes via the CREATE TABLE and CREATE INDEX statements, respectively.
Note that you are allowed and encouraged to include all of these statements in a single updateDdl RPC. The advantage of this is that it gives you atomic ("all-or-nothing") semantics.
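The three steps above can be sketched as a helper that only assembles the ordered list of DDL statements. It does not call Cloud Spanner; the list would be handed to whichever client call wraps the updateDdl RPC. The function name, the index name MyTableByName and the CREATE statements are hypothetical placeholders.

```python
def build_truncate_ddl(table, index_names, create_statements):
    """Return DDL statements in the order described above.

    1) DROP INDEX for every secondary index on the table
    2) DROP TABLE
    3) the caller-supplied CREATE TABLE / CREATE INDEX statements
    """
    statements = [f"DROP INDEX {name}" for name in index_names]
    statements.append(f"DROP TABLE {table}")
    statements.extend(create_statements)
    return statements


# Hypothetical schema, for illustration only.
ddl = build_truncate_ddl(
    "MyTable",
    ["MyTableByName"],
    [
        "CREATE TABLE MyTable (Id INT64 NOT NULL, Name STRING(MAX)) PRIMARY KEY (Id)",
        "CREATE INDEX MyTableByName ON MyTable(Name)",
    ],
)
```

Keeping the statements in one list makes it easy to send them in a single request, as the answer recommends.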

Amazon Athena: no viable alternative at input

While creating a table in Athena; it gives me following exception:
no viable alternative at input
Hyphens are not allowed in the table name (though the wizard allows them). Just remove the hyphen and it works like a charm.
Unfortunately, at the moment the syntax validation error messages in Athena are not very descriptive. This error can mean almost any syntax error in the CREATE TABLE statement.
Although this is annoying, for now you will need to check whether the syntax follows the CREATE TABLE documentation.
Some examples are:
Backticks not in place (as already pointed out)
Missing/extra commas (remember that the last column doesn't need a comma after the column definition)
Missing spaces
More ..
This error generally occurs when the DDL contains some silly syntax error. There are several answers here that explain different errors depending on their situation. The simple solution is to patiently go through the DDL and verify the following points line by line:
Check for missing commas
Unbalanced ` (backtick)
Incompatible datatype not supported by HIVE(HIVE DATA TYPES REFERENCE)
Unbalanced comma
Hyphen in table name
In my case, it was because of a trailing comma after the last column in the table. For example:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';
After I removed the comma at the end of two STRING, it worked fine.
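Since the error message does not say what is wrong, a quick pre-check can catch a couple of the causes listed in this thread before the DDL is sent to Athena. The sketch below is only a regex heuristic in Python, not a real parser; the function name is made up, and it assumes the table name is followed by whitespace and that a backtick-quoted name is acceptable.

```python
import re


def find_ddl_problems(ddl):
    """Return a list of likely problems found in a CREATE TABLE statement."""
    problems = []

    # A comma followed only by whitespace and then ")" is a trailing comma.
    if re.search(r",\s*\)", ddl):
        problems.append("trailing comma before ')'")

    # Grab the token right after CREATE [EXTERNAL] TABLE [IF NOT EXISTS].
    match = re.search(
        r"CREATE\s+(?:EXTERNAL\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)",
        ddl,
        re.IGNORECASE,
    )
    if match:
        name = match.group(1)
        quoted = name.startswith("`") and name.endswith("`")
        if "-" in name and not quoted:
            problems.append("hyphen in unquoted table name")

    return problems


bad_ddl = """CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';"""
```

find_ddl_problems(bad_ddl) reports the trailing comma from the example above; an empty list only means these two checks passed, not that the DDL is valid.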
My case: it was an external table and the location had a typo (hence didn't exist)
Couple of tips:
Click the "Format query" button so you can spot errors easily
Use the example at the bottom of the documentation - it works - and modify it with your parameters: https://docs.aws.amazon.com/athena/latest/ug/create-table.html
Slashes. Mine was slashes. I had the DDL from Athena, saved as a python string.
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')
was changed to
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='"',
'separatorChar'=',')
And everything fell apart.
Had to make it:
WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')
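An alternative to doubling every backslash is a raw string, which tells Python to leave backslashes alone. A minimal sketch (the variable names are arbitrary) showing that the raw form, typed exactly as Athena produced it, equals the manually escaped form:

```python
# Raw triple-quoted string: backslashes are kept exactly as typed.
serde_raw = r"""WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')"""

# Regular string with every backslash doubled by hand.
serde_escaped = """WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')"""

# Both spellings produce the same text to send to Athena.
same_text = serde_raw == serde_escaped
```

One caveat: a raw string literal cannot end with an odd number of backslashes, so this works here only because the backslashes sit in the middle of the text.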
In my case, it was an extra comma in the PARTITIONED BY section.
In my case, I was missing the single quotes around the S3 URL.
In my case, it was that one of the table column names was enclosed in single quotes, as per the AWS documentation :( ('bucket')
As other users have noted, the standard syntax validation error message that Athena provides is not particularly helpful. Thoroughly checking the required DDL syntax (see HIVE data types reference) that other users have mentioned can be pretty tedious since it is fairly extensive.
So, an additional troubleshooting trick is to let AWS's own data parsing engine (AWS Glue) give you a hint about where your DDL may be off. The idea here is to let AWS Glue parse the data using its own internal rules and then show you where you may have made your mistake.
Specifically, here are the steps that worked for me to troubleshoot my DDL statement, which was giving me lots of trouble:
create a data crawler in AWS Glue; AWS and lots of other places go through the very detailed steps this requires so I won't repeat it here
point the crawler to the same data that you wanted (but failed) to upload into Athena
set the crawler output to a table (in an Athena database you've already created)
run the crawler and wait for the table with populated data to be created
find the newly-created table in the Athena Query Editor tab, click on the three vertical dots (...), and select "Generate Create Table DDL":
this will make Athena create the DDL for this table that is guaranteed to be valid (since the table was already created using that DDL)
take a look at this DDL and see if/where/how it differs from the DDL that you originally wrote. Naturally, this automatically-generated DDL will not have the exact choices for the data types that you may find useful, but at least you will know that it is 100% valid
finally, update your DDL based on this new Glue/Athena-generated DDL, adjusting the column/field names and data types for your particular use case
After searching and following all the good answers here, my issue turned out to be that, working in Node.js, I needed to remove the optional ESCAPED BY '\' used in the row format settings to get my query to work. Hope this helps others.
Something that wasn't obvious to me the first time I used the UI: if you get an error in the create table 'wizard', you can cancel, and the query that failed should be written in a new query window for you to edit and fix.
My database had a hyphen, so I added backticks in the query and reran it.
This happened to me due to having comments in the query.
I realized this was a possibility when I tried the "Format Query" button and it turned the entire thing into almost 1 line, mostly commented out. My guess is that the query parser runs this formatter before sending the query to Athena.
Removed the comments, ran the query, and an angel got its wings!

In Redshift, how do you combine CTAS with the "if not exists" clause?

I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running into a limitation in Redshift.
Here's what I want to do:
I have data that I need to move between schemas, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near "as" (for the record, I did try removing "as", and then it says there is an error near "select").
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user-defined functions, and Redshift doesn't support UDFs. They did recently implement a Python-based UDF system, but my understanding is that it's in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
"I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?"
Indeed, combining CREATE TABLE ... AS SELECT with IF NOT EXISTS is not possible in Redshift (per the documentation). In PostgreSQL, it has been possible since version 9.5.
On SO, this is discussed here: "PostgreSQL: Create table if not exists AS". The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.
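One way to split the work into separate statements, as the question already suggests, is sketched below. It is not taken from the linked answer: it assumes Redshift accepts CREATE TABLE IF NOT EXISTS together with a (LIKE ...) clause, which should be checked against the CREATE TABLE documentation, and the Python helper name is made up. It only builds the SQL text.

```python
def create_if_missing_statements(dest, source, copy_rows=False):
    """Build SQL that creates dest with source's structure only if it is missing.

    Assumption: CREATE TABLE IF NOT EXISTS ... (LIKE ...) is accepted by Redshift.
    Note that LIKE may also copy table attributes that CTAS would not.
    """
    statements = [f"CREATE TABLE IF NOT EXISTS {dest} (LIKE {source});"]
    if copy_rows:
        # Runs even when dest already existed, so rerunning duplicates rows.
        statements.append(f"INSERT INTO {dest} SELECT * FROM {source};")
    return statements


# Same table names as in the question; no rows are copied by default,
# which matches the "where 1=2" intent of the original CTAS.
statements = create_if_missing_statements("temp_2", "temp_table")
```

Passing copy_rows=True adds the populate step, which is best guarded by whatever logic decides that the destination is actually new.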