Running multiple MSCK REPAIR TABLE statements in AWS Athena

So I'm trying to execute the following in AWS Athena, which allows only one statement to be run at a time:
MSCK REPAIR TABLE some_database.some_table_001;
MSCK REPAIR TABLE some_database.some_table_002;
MSCK REPAIR TABLE some_database.some_table_003;
The problem is, I don't have just three statements; I have 700+ similar statements and would like to run all of them in one go as a batch.
So, using the AWS CLI in CloudShell, I tried running the following:
aws athena start-query-execution --query-string "MSCK REPAIR TABLE `some_table_001`;" --work-group "primary" \
--query-execution-context Database=some_database \
--result-configuration "OutputLocation=s3://some_bucket/some_folder"
...hoping I could use Excel to generate 700+ statements like this and run them as a batch,
...but I keep getting this error:
An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:1: mismatched input 'MSCK'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'PREPARE', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'UNLOAD', 'UPDATE', 'USE', <query>
Not sure what I'm doing wrong, as the same MSCK command seems to run fine in the Athena console. I know Athena is finicky when it comes to
`some_table_001`
versus
'some_table_001'
(backticks versus straight single quotes); I tried both but couldn't get it to work.
Any thoughts on a possible solution?

In Amazon Athena, special characters other than the underscore (_) are not supported in table names and table column names.
Since your table name doesn't contain special characters, you don't need to wrap it in quotes or backticks. However, had your table name contained special characters, you would have had to use backticks. Refer to the Amazon Athena documentation for more details.
As for running multiple MSCK REPAIR TABLE statements against Amazon Athena, you can also use an SDK such as boto3 (Python); see the sketch below.
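A minimal boto3 sketch, assuming the database, workgroup, output bucket, and table-name pattern shown in the question (all of these are placeholders to substitute with your own values):
import boto3

athena = boto3.client("athena")

# Hypothetical naming pattern for the 700+ tables.
tables = [f"some_table_{i:03d}" for i in range(1, 701)]

for table in tables:
    response = athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table}",
        QueryExecutionContext={"Database": "some_database"},
        ResultConfiguration={"OutputLocation": "s3://some_bucket/some_folder/"},
        WorkGroup="primary",
    )
    print(table, response["QueryExecutionId"])
Note that start_query_execution only submits each query; if you need to know when a repair has finished, poll get_query_execution with the returned QueryExecutionId, and consider throttling the loop to stay within Athena's concurrency limits.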

Related

Accessing datacatalog table in Glue properly

I created a table in Athena, without a crawler, from an S3 source. It shows up in my Data Catalog. However, when I try to access it through a Python job in Glue ETL, it shows no columns and no data. The following error pops up when accessing a column: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.
I am trying to access the dynamic frame the Glue way:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource"
)
print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")
The above logs Count: 0 and Schema: StructType([], {}), whereas the Athena table shows around 800,000 rows.
Side notes:
The ETL job in question has AWSGlueServiceRole attached.
I tried the Glue Visual Editor as well; it showed the Data Catalog database/table in question but, sadly, gave the same error.
It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders, you need to pass additional_options = {"recurse": True} to your from_catalog() call. This lets Glue read records recursively from the S3 files; see the sketch below.
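A minimal sketch of the snippet from the question with that flag added (the database and table names are the same hypothetical ones used above):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource",
    additional_options={"recurse": True}  # read nested S3 folders recursively
)
print(f"Count: {datasource.count()}")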

Athena - reserved words and table that cannot be queried

I'm putting JSON data files into S3 and using AWS Glue to build the table definition. I have about 120 fields per JSON "row". One of the fields is called "timestamp", in lower case. I have thousands of large files and would hate to change them all.
Here (https://docs.aws.amazon.com/athena/latest/ug/reserved-words.html) I see that TIMESTAMP is a reserved word in DDL. Does that mean I won't be able to read those JSON files from Athena?
I'm getting the error below, which led me to the above as a potential reason.
I clicked the three dots to the right of the table name and clicked "Preview Table", which built and ran this select statement:
SELECT * FROM "relatixcurrdayjson"."table_currday" limit 10;
That led to an error which seems wrong or misleading:
Your query has the following error(s):
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause
This query ran against the "relatixcurrdayjson" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c448f0ea-5086-4436-9107-2b60dab0f04f.
If I click the option that says "Generate Create Table DDL", it builds this line to execute:
SHOW CREATE TABLE table_currday;
and results in this error:
Your query has the following error(s):
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException
This query ran against the "relatixcurrdayjson" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 6ac5d90f-8d52-4e3e-8f16-cd42e1edcfa3.
This is the AWS Glue Log:
UPDATE #1:
I used Athena a couple of weeks ago with CSV and it worked great.
This time I'm using JSON.
I created a new folder with one file containing the following and ran the Glue crawler:
[
{"firstName": "Neal",
"lastName": "Walters",
"city": "Irving",
"state", "TX"
}
{"firstName": "Fred",
"lastName": "Flintstone",
"city": "Bedrock",
"state", "TX"
}
{"firstName": "Barney",
"lastName": "Rubble",
"city": "Stillwater",
"state", "OK"
}
]
and this SQL gives the same error as above:
SELECT * FROM "relatixcurrdayjson"."tbeasyeasytest" limit 10;
It's very easy to get Glue crawlers to create tables that don't work in Athena, which is surprising given that this is the primary purpose they were designed for.
If the JSON you posted is exactly what you ran your crawler against, the problem is that Athena does not support multi-line JSON documents. Your files must have exactly one JSON document per line; see the reformatting sketch below. See also Dealing with multi-line JSON? (And, bonus points, CRLF), Multi-line JSON file querying in hive, and Create Table in Athena From Nested JSON.
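A hedged sketch of rewriting such a file as one JSON document per line (file names are hypothetical, and the snippet assumes the input has first been fixed into a valid JSON array, since the example in the question has syntax errors such as "state", "TX"):
import json

# Read the multi-line JSON array and rewrite it as JSON Lines:
# exactly one document per line, which is the layout Athena expects.
with open("people.json") as src:
    records = json.load(src)

with open("people.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")
The rewritten file can then be uploaded to S3 and crawled or queried as usual.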

AWS CLI DynamoDB Called From Powershell Put-Item fails when a value contains a space

So, let's say I'm trying to post this JSON via the command line (not from a file, because I'm not going to write a file for every invocation of this script) to a DynamoDB table:
{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One_Space\"},\"Environment\":{\"S\":\"cte\"},\"StartDate\":{\"S\":\"null\"},\"EndDate\":{\"S\":\"null\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someones user\"},\"EmailDistributionList\":{\"S\":\"test#test.com\"},\"RemedyGroup\":{\"S\":\"OneSpace\"},\"ScomSubscriptionId\":{\"S\":\"guid-ab22-2345\"},\"ZabbixActionId\":{\"S\":\"11\"},\"SnsTopic\":{\"M\":{\"TopicName\":{\"S\":\"ATopicName\"},\"TopicArn\":{\"S\":\"AtopicArn1234\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someones user\"}}}}
Then the result from the CLI is an error like this:
Unknown options: Space"},"ScomSubscriptionId":{"S":"guid-ab22-2345"},"ZabbixActionId":{"S":"11"},"SnsTopic":{"M":{"TopicName":{"S":"ATopicName"},"TopicArn":{"S":"AtopicArn1234"},"CreatedDate":{"S":"today"},"CreatedBy":{"S":"someones, user"}}}}, user"},"EmailDistributionList":{"S":"test#test.com"},"RemedyGroup":{"S":"One
As you can see, it fails on the TeamName property, which in the above example is "One Space". If I change that value to "OneSpace", it instead starts to fail on the CreatedBy property, which is populated with "someones user"; but if I remove all spaces from all properties, I can suddenly pass this JSON to DynamoDB successfully.
In a working example the json looks like this:
{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One_Space\"},\"Environment\":{\"S\":\"cte\"},\"StartDate\":{\"S\":\"null\"},\"EndDate\":{\"S\":\"null\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someonesuser\"},\"EmailDistributionList\":{\"S\":\"test#test.com\"},\"RemedyGroup\":{\"S\":\"OneSpace\"},\"ScomSubscriptionId\":{\"S\":\"guid-ab22-2345\"},\"ZabbixActionId\":{\"S\":\"11\"},\"SnsTopic\":{\"M\":{\"TopicName\":{\"S\":\"ATopicName\"},\"TopicArn\":{\"S\":\"AtopicArn1234\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someonesuser\"}}}}
I can't find any documentation that tells me I can't have spaces. If I read this in from a file, it posts with the spaces, so what gives? If anyone has any advice on this matter, I'd certainly appreciate it.
For what it's worth, in PowerShell the execution currently looks like this (though I've tried various combinations of quoting the $dbTeamTableEntry variable):
$dbEntry = aws.exe dynamodb put-item --region $region --table-name $table --item "$($dbTeamTableEntry)"

Automate AWS Athena partition loading [duplicate]

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. I would also like to visualize the data in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda function could be written like this:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'

    client = boto3.client('athena')

    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query execution parameters (named query_context here to avoid
    # shadowing the Lambda handler's context argument)
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
You would then configure a trigger to execute your Lambda function when new data is added under the DATA/ prefix in your bucket, along the lines of the sketch below.
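A hedged boto3 sketch of that trigger setup (the bucket name and function ARN are placeholders; the function also needs a resource-based policy, added via lambda add-permission, that allows S3 to invoke it):
import boto3

s3 = boto3.client("s3")

# Invoke the Lambda function for every object created under the DATA/ prefix.
s3.put_bucket_notification_configuration(
    Bucket="some_bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:repair-table",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "DATA/"}]}
                },
            }
        ]
    },
)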
Ultimately, explicitly rebuilding the partitions after you run your Spark job using a job scheduler has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
which adds the newly created partition from your S3 location; a fuller example is sketched below.
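A hedged boto3 sketch of that approach, assuming the YEAR/MONTH/DATE layout from the question (database, table, partition values, and S3 locations are placeholders):
import boto3

athena = boto3.client('athena')

# Register just the newly written partition instead of rescanning the whole table.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE some_table ADD IF NOT EXISTS "
        "PARTITION (YEAR=2021, MONTH=4, DATE=17) "
        "LOCATION 's3://DATA/YEAR=2021/MONTH=4/DATE=17/'"
    ),
    QueryExecutionContext={'Database': 'some_database'},
    ResultConfiguration={'OutputLocation': 's3://SOMEPLACE/'}
)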
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define them in the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition the data.
There are multiple ways to solve the issue and get the table updated:
Call MSCK REPAIR TABLE. This will scan ALL data. It's costly, as every file is read in full (at least it's fully charged by AWS). It's also painfully slow. In short: don't do it!
Create partitions on your own by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, as yours is a nicely organised S3 key pattern). There are also downsides to this approach: A) it's hard to maintain, and B) all partitions have to be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they all need to be read out and passed to Athena and EMR's Hadoop infrastructure.
Use partition projection. There are two different styles you might want to evaluate. Here's the variant that creates the partitions for Hadoop at query time. This means no Glue catalog entries are sent over the network, so large numbers of partitions can be handled more quickly. The downside is that you might 'hit' some partitions that don't exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, whether they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
  ...
)
PARTITIONED BY (
  `YEAR` int,
  `MONTH` int,
  `DATE` int)
...
LOCATION
  's3://DATA/'
TBLPROPERTIES(
  "projection.enabled" = "true",
  "projection.YEAR.type" = "integer",
  "projection.YEAR.range" = "2020,2025",
  "projection.MONTH.type" = "integer",
  "projection.MONTH.range" = "1,12",
  "projection.DATE.type" = "integer",
  "projection.DATE.range" = "1,31",
  "storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
Just to list all options: you can also use Glue crawlers. But that doesn't seem to be a favourable approach, as it's not as flexible as advertised.
You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts that do the preparation work to set up your table; a sketch follows.
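A hedged boto3 sketch of that direct Data Catalog approach (database, table, partition values, and S3 location are placeholders; in practice the copied storage descriptor may need some table-specific fields pruned):
import boto3

glue = boto3.client("glue")

# Reuse the table's storage descriptor and point it at the new partition's S3 prefix.
table = glue.get_table(DatabaseName="some_database", Name="some_table")["Table"]
sd = dict(table["StorageDescriptor"])
sd["Location"] = "s3://DATA/YEAR=2021/MONTH=4/DATE=17/"

glue.create_partition(
    DatabaseName="some_database",
    TableName="some_table",
    PartitionInput={
        "Values": ["2021", "4", "17"],  # one value per partition key, in order
        "StorageDescriptor": sd,
    },
)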
In short:
If your application is SQL-centric and you like the leanest approach with no scripts, use partition projection.
If you have many partitions, use partition projection.
If you have only a few partitions, or your partitions do not follow a generic pattern, use approach #2.
If you're script-heavy, your scripts do most of the work anyway, and they are easier for you to handle, consider approach #5.
If you're confused and have no clue where to start, try partition projection first! It should fit 95% of the use cases.

AWS SimpleDB CLI: How to use the 'select' command?

I'm trying to use the select command of AWS SimpleDB from the AWS CLI.
The required call is as follows:
select
--select-expression <value>
with select-expression being described as follows: --select-expression (string) The expression used to query the domain.
The select is supposed to be similar to the SQL SELECT statement; however, I keep getting errors about the syntax, e.g.:
aws sdb select --select-expression "select * from my-domain"
An error occurred (InvalidQueryExpression) when calling the Select operation: The specified query expression syntax is not valid.
I can't find any documentation or example about the right syntax to use, either.
I found the solution: it turns out I needed to use single quotes around the query and backticks around the table name:
aws sdb select --select-expression 'select * from `my-domain`'