How do I execute the SHOW PARTITIONS command on an Athena table?

I'm using AWS Athena with AWS Glue for the first time, with S3 providing a 'folder' structure which maps to the partitions in my data - I'm getting into the concepts so please excuse any mistaken description!
I'm looking at what happens when I add data to my S3 bucket, and I see that new folders are ignored. Digging deeper, I came across the 'SHOW PARTITIONS' command, as described here: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html. I'm trying to execute this against my test tables using the Athena query editor, with a mind that I'll go on to use the 'ALTER TABLE ADD PARTITION' command to add the new S3 folders.
I'm trying to execute the 'SHOW PARTITIONS' command in the AWS Athena Console Query Editor:
SHOW PARTITIONS "froms3_customer-files"."unit"
but when I try to execute it I see this message:
line 1:17: missing {'from', 'in'} at '"froms3_customer-files"' (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: c0c0c351-2d42-4da4-b1f3-223b1733db65)
I'm struggling to understand what this is telling me. Can anyone help me here?

Athena does not support hyphens in database names.
Athena table, view, database, and column names cannot contain special characters, other than underscore (_).
Also, remove the double quotes from the SHOW PARTITIONS command.
SHOW PARTITIONS froms3_customer_files.unit
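Once SHOW PARTITIONS works, a minimal sketch of the follow-up 'ALTER TABLE ADD PARTITION' step mentioned in the question could look like the statement below (the dt partition key and the S3 location are assumptions; substitute your own partition column and bucket path):
-- hypothetical: 'dt' and the bucket path are assumptions
ALTER TABLE froms3_customer_files.unit ADD IF NOT EXISTS
PARTITION (dt = '2019-01-01') LOCATION 's3://your-bucket/unit/dt=2019-01-01/';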
References:
Athena table and database naming convention
Athena show partitions

If you want to see all the partitions that have been created so far, you can use the following command:
SHOW PARTITIONS DB_NAME.TABLE_NAME
If you want to view the keys by which the table is partitioned, you can see them through the UI in the following way:
1. Click on the table menu options.
2. Click on Show Properties.
3. Click on Partitions to see the partition keys.
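The same partition list is also available from the Glue Data Catalog via the CLI, along these lines (db_name and table_name are placeholders):
# db_name and table_name are placeholders
aws glue get-partitions --database-name db_name --table-name table_name --query 'Partitions[].Values' --output text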

Related

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the query below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (the TTL), based on the official documentation.
If I use the above command, the column is still there in the schema as well as in the table Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and act on them, so I need more visibility into the table schema in BigQuery. The least I need is:
Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS storing a flag column that indicates 'deleted' or 'TTL: 7 days'.
My questions are:
How can I fetch the correct schema in Spanner which reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch logs, though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob, since an ALTER TABLE is considered an InsertJob, and filters them on the actual query text containing "drop". The regex I used is not strict (for the sake of example); I suggest updating the regex to be stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the command above (the column PADDING is dropped in the captured query).
If you have options other than Bash, I suggest that you create a BQ sink for your logging; you can then run queries there and get this information. You can also use client libraries (Python, Node.js, etc.) to either query the sink or query Cloud Logging directly.
As per this SO answer, you can use the time travel feature of BQ to query the deleted column. That answer also explains BQ's behavior of retaining the deleted column for 7 days, and a workaround to delete the column instantly. See the linked answer for the actual query used to retrieve the deleted column and for the workaround for deleting a column.
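For illustration only, a rough sketch of the kind of time-travel query the linked answer describes (the dataset name and the one-hour interval are assumptions, not taken from that answer):
-- hypothetical dataset name and interval
SELECT * FROM `mydataset.User` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
Because time travel returns the table as it was at that timestamp, the result should still include the recently dropped column.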

Cannot export a partitioned-by-date table from BigQuery to GCS

I'm trying to export a partitioned-by-day table to a GCS bucket.
My main goal is to create daily files in GCS.
Here is my simple first test for the extraction:
bq extract --destination_format=CSV test.test_table$20210719 gs://test_partition/part_col=20210716/test_*.csv
test_table is a partitioned table, and I do have 7k rows for the date 2021-07-19.
However, when I run this line, I get this
BigQuery error in extract operation: Error processing job 'test-net:bqjob_1234456789: Not found: Table test-net:temp.test_partitiontime0210719 was not found in location EU
As you can see above, the table name is temp.test_partitiontime0210719; there is no 2 at the beginning of it. Whenever the $ sign is included in the table name, the first character after it is removed.
I've tried to apply this GCP documentation.
Also, I have tried this comment, but it hasn't worked for me.
How can I extract the partitioned table to GCS for a specific date?
It's a Linux shell side effect. If you write $something, the shell searches for a variable with the key something and substitutes its value; here $2 is expanded (to nothing), which is why the 2 disappears.
To skip that, you can use single quotes (') to tell the shell "Hey, do not evaluate this variable token", so do this:
bq extract --destination_format=CSV 'test.test_table$20210719' gs://test_partition/part_col=20210716/test_*.csv
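Since the stated goal is daily files, a small wrapper along these lines could loop over dates while keeping the $ decorator out of the shell's reach (the dates and paths below are made-up placeholders):
# placeholder dates and paths; the quoting keeps the $ decorator intact
for d in 20210717 20210718 20210719; do
  bq extract --destination_format=CSV 'test.test_table$'"$d" "gs://test_partition/part_col=$d/test_*.csv"
done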

Query Athena tables & output column for 'S3 source' path

Currently using information_schema.tables to list all tables in my catalog.
What I am missing, is a column to tell me which S3 path each table (external) is pointing to.
Looked in all the information_schema tables, but cannot see this info.
The only place I've seen this via 'sql' is with the 'SHOW CREATE TABLE' command, which doesn't give the result in a proper recordset.
Failing that... is there another way to keep tabs on all of your tables and their sources?
Many Thanks.
So, as above, I could find no way of doing this from the database.
Actual solution below for interest (& in case anyone finds a better way)
From CLI:
Call aws glue get-tables and output the JSON to a file (a sketch of this step follows the list)
Sync file to S3
ETL job to convert multi-line json into single-line json and place in new bucket
Crawl new bucket
Now query/unnest in Athena
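For reference, a rough version of that first step might be the command below (the database name and output file are assumptions; the S3 path for each table is returned under TableList[].StorageDescriptor.Location):
# database name and output file are placeholders
aws glue get-tables --database-name my_database --output json > glue_tables.json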
'Convoluted' is a word that comes to mind!
At least it gets the data I need where I need it.
Again, if anyone finds an easier way...?

AWS Glue crawler needs to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claim to solve many problems, but in fact solve few. If you're slightly outside the scope of what they were designed for, you're out of luck. There might be a way to configure a crawler to do what you want, but in my experience trying to make Glue crawlers do things that aren't perfectly aligned with their design is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case, Glue crawlers also provide very little value. You probably have a better idea of what the schema should look like than Glue will ever be able to figure out.
I suggest that you manually create the table, and write a one-off script that lists all the partition locations on S3 that you want to include in the table and generates ALTER TABLE ADD PARTITION … SQL, or Glue API calls, to add those partitions to the table.
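As an illustration, the kind of statement such a one-off script could emit might look like this (the table name, partition key, and S3 locations are invented for the example):
-- table name, partition key, and locations are invented for illustration
ALTER TABLE my_csv_table ADD IF NOT EXISTS
PARTITION (dt = '2020-02-03') LOCATION 's3://my-bucket/data/dt=2020-02-03/'
PARTITION (dt = '2020-02-04') LOCATION 's3://my-bucket/data/dt=2020-02-04/';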
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
One way to do what you want is to use just one of the tables created by the crawler as an example, and create a similar table manually (in AWS Glue->Tables->Add tables, or in Athena itself, with
CREATE EXTERNAL TABLE `tablename`(
`column1` string,
`column2` string, ...
using an existing table as an example). You can see the query used to create that table in Athena: go to Database -> select your database from the Glue Data Catalog, then click the 3 dots in front of the "automatically created by crawler" table that you chose as an example, and click the "Generate Create Table DDL" option. It will generate a big query for you; modify it as necessary (I believe you mostly need to look at the LOCATION and TBLPROPERTIES parts).
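For CSV data, the relevant parts of such a generated DDL typically look something like the sketch below (the column names, the dt partition column, delimiter, location, and table properties are all assumptions to adapt):
-- all names, the delimiter, location, and properties below are assumptions
CREATE EXTERNAL TABLE `tablename`(
  `column1` string,
  `column2` string)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/data/'
TBLPROPERTIES ('skip.header.line.count'='1');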
When you run this modified query in Athena, a new table will appear in the Glue Data Catalog. But it will not have any information about your S3 files and partitions, and the crawler most likely will not update the metastore info for you. So in Athena you can run the "MSCK REPAIR TABLE tablename;" query (it's not very efficient, but it works for me), and it will add the missing partition information. In the Results tab you will see something like this (in case you use partitions on S3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.

AWS Athena: HIVE_UNKNOWN_ERROR: Unable to create input format

I've crawled a couple of XML files on S3 using AWS Glue, using a simple XML classifier.
However, when I try running any query on that data using AWS Athena, I get the following error (note that it's the simplest possible query I'm doing here):
HIVE_UNKNOWN_ERROR: Unable to create input format
Note that Athena can see my tables and it can see the columns; it just can't query them.
I noticed that there is someone with the same problem on the AWS Discussion forums: Athena XML Query Give HIVE Unknown Error but it got no love from anyone.
I know there is a similar question here about this error, but the query in that question targeted an RDS database, not an S3 bucket like I have here.
Has anyone got a solution for this?
Sadly, at this time (12/2018) Athena cannot query XML input, which is hard to understand when you may hear that Athena along with AWS Glue can query XML.
The output you are seeing from the AWS crawler is correct, though; it's just not doing what you think it's doing! For example, after your crawler has run you see the tables, but cannot execute any Athena queries. Go into your AWS Glue Catalog, click Tables on the right, click your table, and edit its properties.
Notice how the input format is null? If you have any other tables you can look at their properties, or refer back to the Athena documentation on supported input formats. This is why you receive the error.
Solutions:
Convert your data to text/JSON/Avro/another supported format prior to upload.
Create an AWS Glue job which converts the source XML to an Athena-supported target format (hopefully compressed, with ORC/Parquet).