Ways to back up AWS Athena views - amazon-web-services

In an AWS Athena instance we have several user-created views.
We would like to back up the views.
I have been experimenting with the AWS CLI:
aws athena start-query-execution --query-string "show views...
and, for each view,
aws athena start-query-execution --query-string "show create view...
and then
aws athena get-query-execution --query-execution-id...
to get the S3 location of the CREATE VIEW code.
I am looking for ways to get the view definitions backed up. If the AWS CLI is the best suggestion, then I will create a Lambda to do the backup.

I think SHOW VIEWS is the best option.
Then you can get the Data Definition Language (DDL) with SHOW CREATE VIEW.
There are a couple of ways to back the views up. You could use Git (AWS offers CodeCommit), and you could drive CodeCommit from a Lambda function using Boto3.
In fact, just by fetching the DDL you are already backing it up to S3, because Athena writes every query result there.
Consider the following DDL:
CREATE EXTERNAL TABLE default.dogs (
  `breed_id` int,
  `breed_name` string,
  `category` string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
LOCATION
  's3://stack-exchange/48836509'
TBLPROPERTIES ('skip.header.line.count'='1')
and the following view based on it.
CREATE VIEW default.vdogs AS SELECT * FROM default.dogs;
When we show the DDL:
$ aws athena start-query-execution --query-string "SHOW CREATE VIEW default.vdogs" --result-configuration OutputLocation=s3://stack-exchange/66620228/
{
"QueryExecutionId": "ab21599f-d2f3-49ce-89fb-c1327245129e"
}
The result is written to S3 (just like any other Athena query result):
$ cat ab21599f-d2f3-49ce-89fb-c1327245129e.txt
CREATE VIEW default.vdogs AS
SELECT *
FROM
default.dogs
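If the CLI approach works for you, the same sequence can be wrapped in a Lambda-friendly boto3 script. The following is a minimal sketch, not a drop-in solution: the database name, backup bucket, and result location below are assumptions, and a production version would need error handling and pagination.
import time
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

DATABASE = "default"                      # assumed database whose views we back up
OUTPUT = "s3://stack-exchange/66620228/"  # assumed Athena query-result location
BACKUP_BUCKET = "my-view-backups"         # assumed bucket for the DDL backups

def run_query(sql):
    # Start the query and poll until Athena has finished writing the result.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(1)

def backup_views(event=None, context=None):
    # SHOW VIEWS lists the views; SHOW CREATE VIEW returns one DDL line per result row.
    qid, _ = run_query("SHOW VIEWS")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    views = [row["Data"][0]["VarCharValue"] for row in rows]
    for view in views:
        qid, state = run_query(f"SHOW CREATE VIEW {DATABASE}.{view}")
        if state != "SUCCEEDED":
            continue
        ddl_rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
        ddl = "\n".join(row["Data"][0]["VarCharValue"] for row in ddl_rows)
        s3.put_object(Bucket=BACKUP_BUCKET, Key=f"athena-views/{view}.sql", Body=ddl.encode())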

Related

AWS Glue Crawler breaks table data structure in Athena query

Problem statement: CSV data is currently stored in S3 (extracted from a PostgreSQL RDS instance), and I need to query this S3 data using Athena. To achieve this, I created an AWS Glue database and ran a crawler on the S3 bucket, but the data returned by the Athena query is broken (starting from columns with large text content). I tried changing the data type in the Glue table schema from string to varchar(1000) and recrawling, but it still breaks.
Data stored in the S3 bucket:
Data coming out of an Athena query on the same bucket (using SELECT *); note the missing row:
I also tested loading the S3 data in a Jupyter notebook in AWS Glue Studio with this code snippet, and the output data looks correct there:
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://..."]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
Any help on this would be greatly appreciated!

Restrict AppSync permissions in AWS CDK

I am trying to build an AppSync API connected to a DynamoDB table in AWS using the CDK in Python. I want this to be a read-only API with no create, delete, or update operations. In my stack I add the AppSync API:
# AppSync API for data and data catalogue queries
api = _appsync.GraphqlApi(self,
    'DataQueryAPI',
    name='dataqueryapi',
    log_config=_appsync.LogConfig(field_log_level=_appsync.FieldLogLevel.ALL),
    schema=_appsync.Schema.from_asset('graphql/schema.graphql')
)
I then add the DynamoDB table as a data source as follows:
# Data Catalogue DynamoDB Table as AppSync Data Source
data_catalogue_api_ds = api.add_dynamo_db_data_source(
    'DataCatalogueDS',
    data_catalogue_dynamodb
)
I later add some resolvers with mapping templates, but even with just the above, running cdk diff shows permission changes that appear to grant AppSync full access to the DynamoDB table.
I only want this to be a read-only API, so the question is: how can I restrict permissions so that the AppSync API can only read from the table?
What I tried was adding a role that explicitly grants query permissions, in the hope that this would prevent the wider set of permissions from being created, but it didn't have that effect, and I'm not really sure it was on the right track:
role = _iam.Role(self,
    "Role",
    assumed_by=_iam.ServicePrincipal("appsync.amazonaws.com")
)
api.grant_query(role, "getData")
Following a comment on this question, I have swapped add_dynamo_db_data_source for DynamoDbDataSource, since it has a read_only_access parameter. So I am now using:
data_catalogue_api_ds = _appsync.DynamoDbDataSource(self,
    'DataCatalogueDS',
    table=data_catalogue_dynamodb,
    read_only_access=True,
    api=api
)
This then seems to give me just read permissions.
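If you want even tighter control than read_only_access=True, another option (a sketch only, assuming CDK v1 aws_appsync as in the snippets above and that data_catalogue_dynamodb is a dynamodb.Table) is to create the data source's service role yourself, grant it read access explicitly, and pass it to the data source:
# Role that AppSync assumes when this data source calls DynamoDB.
ds_role = _iam.Role(self,
    'DataCatalogueDSRole',
    assumed_by=_iam.ServicePrincipal('appsync.amazonaws.com')
)

# Grant only the read actions (GetItem, Query, Scan, BatchGetItem, ...).
# Redundant if read_only_access=True already does this, but it makes the intent explicit.
data_catalogue_dynamodb.grant_read_data(ds_role)

# Pass the role in so the CDK does not create a broader one on our behalf.
data_catalogue_api_ds = _appsync.DynamoDbDataSource(self,
    'DataCatalogueDS',
    api=api,
    table=data_catalogue_dynamodb,
    read_only_access=True,
    service_role=ds_role
)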

AWS Glue Column Level access Control

How can I give column-level access to particular roles in the Glue Catalog? I want to give Role_A permissions to only column_1 and column_2 of table XYZ, and Role_B access to all columns of table XYZ.
AWS Glue offers fine-grained access control only at the database/table level. If you want to restrict users to only a few columns, you have to use AWS Lake Formation; the Lake Formation documentation has examples.
For example, if you want to give access to only two columns, prodcode and location, you can do so as shown below:
aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/datalake_user1 --permissions "SELECT" --resource '{ "TableWithColumns": {"DatabaseName":"retail", "Name":"inventory", "ColumnNames": ["prodcode","location"]}}'
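If you prefer to script the grants, a rough boto3 equivalent for the Role_A / Role_B scenario could look like this (the account ID, database name, and role ARNs are placeholders, not values from the question):
import boto3

lf = boto3.client('lakeformation')

# Role_A: SELECT on only column_1 and column_2 of table xyz.
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/Role_A'},
    Resource={'TableWithColumns': {
        'DatabaseName': 'my_database',
        'Name': 'xyz',
        'ColumnNames': ['column_1', 'column_2'],
    }},
    Permissions=['SELECT'],
)

# Role_B: SELECT on all columns of table xyz.
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/Role_B'},
    Resource={'Table': {'DatabaseName': 'my_database', 'Name': 'xyz'}},
    Permissions=['SELECT'],
)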

"aws dynamodb list-tables" not showing the tables present

When I use:
aws dynamodb list-tables
I am getting:
{
"TableNames": []
}
I used the default region, the same one I set during aws configure.
I also tried with a specific region name.
When I check in the AWS console I also don't see any DynamoDB tables, yet I am able to access the table programmatically, and I can add and modify items as well.
But there is no result when I run aws dynamodb list-tables, and no tables appear when I check the console.
This is clearly a result of the commands looking in the wrong place.
DynamoDB tables are stored in an account within a region. So, if there is definitely a table but none are showing, then the credentials being used either belong to a different AWS Account or the command is being sent to the wrong region.
You can specify a region like this:
aws dynamodb list-tables --region ap-southeast-2
If you are able to access the table programmatically, then make sure the same credentials being used by your program are also being used for the AWS CLI.
We need to specify the endpoint in the command for it to work, as the DynamoDB above is used programmatically as part of a web app.
This command will work:
aws dynamodb list-tables --endpoint-url http://localhost:8080 --region us-west-2
Check the region you set up in your AWS configuration against what is displayed at the top of the AWS console. I had my app configured for us-east-2 but the AWS console had us-east-1 as the default. I was able to view my table once the correct region was selected in the console.
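A quick way to see which account and region your program is actually hitting, and compare that with the CLI, is a small boto3 check like the one below (it only assumes your default credentials; swap in region_name or endpoint_url to test others):
import boto3

session = boto3.Session()
sts = session.client('sts')

# Which account and region the current credentials resolve to.
print('Account:', sts.get_caller_identity()['Account'])
print('Region: ', session.region_name)

# List tables in that region; pass region_name=... or endpoint_url=... here to compare.
dynamodb = session.client('dynamodb')
print('Tables: ', dynamodb.list_tables()['TableNames'])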

Is there a simple way to clone a glue job, but change the database connections?

I have a large number of clients who supply data in the same format, and I need their data loaded into identical tables in different databases. I have set up a job for them in Glue, but now I have to do the same thing another 20 times.
Is there any way I can take an existing job and copy it, but with changes to the S3 filepath and the JDBC connection?
I haven't been able to find much online regarding scripting in AWS Glue. Would this be achievable through the AWS command line interface?
The quickest way would be to use the aws cli.
aws glue get-job --job-name <value>
where value is the specific job that you are trying to replicate. You can then alter the S3 path and JDBC connection info in the JSON that the above command returns. Also, you'll need to give it a new unique name. Once you've done that, you can pass it in to:
aws glue create-job --cli-input-json <value>
where value is the updated JSON that you are trying to create a new job from.
See the AWS command line reference for more info on the glue command line.
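If you would rather not hand-edit JSON, the same clone-and-tweak can be done with boto3; here is a rough sketch (the new job name, S3 path argument, and connection name are placeholders, and the exact keys you need to change depend on how your job is defined):
import boto3

glue = boto3.client('glue')

# Fetch the existing job definition.
job = glue.get_job(JobName='existing_job')['Job']

# Drop read-only and mutually exclusive fields that create_job will reject.
for key in ('Name', 'CreatedOn', 'LastModifiedOn', 'AllocatedCapacity', 'MaxCapacity'):
    job.pop(key, None)

# Point the copy at a different S3 path and JDBC connection (placeholder names).
job.setdefault('DefaultArguments', {})['--s3_source_path'] = 's3://other-client-bucket/path/'
job['Connections'] = {'Connections': ['other-client-jdbc-connection']}

# Create the clone under a new, unique name.
glue.create_job(Name='existing_job_client_2', **job)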
Use the command
aws glue create-job --generate-cli-skeleton
to generate the skeleton JSON.
Use the below command to get the existing job's definition
aws glue get-job --job-name <value>
Copy the values from the output of the existing job's definition into the skeleton.
Remove the newline characters and pass it as input to the command below:
aws glue create-job --cli-input-json <framed_JSON>
Here is the complete AWS CLI reference for create-job:
https://docs.aws.amazon.com/cli/latest/reference/glue/create-job.html
PS: don't change the order of the elements in the JSON (generated in the skeleton); only update the connection and name.
--cli-input-json (string) Performs service operation based on the JSON string provided. The JSON string follows the format provided by --generate-cli-skeleton. If other arguments are provided on the command line, the CLI values will override the JSON-provided values. It is not possible to pass arbitrary binary values using a JSON-provided value as the string will be taken literally.
--generate-cli-skeleton (string) Prints a JSON skeleton to standard output without sending an API request. If provided with no value or the value input, prints a sample input JSON that can be used as an argument for --cli-input-json. If provided with the value output, it validates the command inputs and returns a sample output JSON for that command.
Thanks to the great answers here, you already know that the AWS CLI comes to the rescue.
Tip: if you don't want to install or update the AWS CLI, just use the AWS CloudShell!
I've tested the commands here using version:
$ aws --version
aws-cli/1.19.14 Python/3.8.5 Linux/5.4.0-65-generic botocore/1.20.14
If you want to create a new job from scratch, you'll want a template first, which you can get with:
aws glue create-job --generate-cli-skeleton > job_template.json
Then use your favourite editor (I like vim) to fill out the details in job_template.json (or whatever you call it).
But if DuckDuckGo or other engine sent you here, there's probably an existing job that you would like to clone and tweak. We'll call it "perfect_job" in this guide.
Let's get a list of all the jobs, just to check we're in the right place.
aws glue list-jobs --region us-east-1
The output shows us two jobs:
{
"JobNames": [
"perfect_job",
"sunshine"
]
}
View our job:
aws glue get-job --job-name perfect_job --region us-east-1
The JSON output looks right, let's put it in a file so we can edit it:
aws glue get-job --job-name perfect_job --region us-east-1 > perfect_job.json
Let's cp that to a new file, say super_perfect_job.json. Now you can edit it to change the fields as desired. The first thing of course is to change the Name!
Two things to note:
Remove the outer level of the JSON; we need the value of Job, not the Job key itself. If you look at job_template.json created above, you'll see that it must start with Name, so it's a small edit to match the format requirement.
There's no CreatedOn or LastModifiedOn in job_template.json either, so let's delete those lines too. Don't worry, if you forget to delete them, the creation will fail with a helpful message like 'Parameter validation failed: Unknown parameter in input: "LastModifiedOn"'.
Now we're ready to create the job! The following example will add Glue job "super_perfect_job" in the Cape Town region:
aws glue create-job --cli-input-json file://super_perfect_job.json --region af-south-1
But that didn't work:
An error occurred (InvalidInputException) when calling the CreateJob
operation: Please set only Allocated Capacity or Max Capacity.
I delete MaxCapacity and try again. Still not happy:
An error occurred (InvalidInputException) when calling the CreateJob
operation: Please do not set Allocated Capacity if using Worker Type
and Number of Workers.
Fine. I delete AllocatedCapacity and have another go. This time the output is:
{
    "Name": "super_perfect_job"
}
Which means, success! You can confirm by running list-jobs again. It's even more rewarding to open the AWS Console and see it pop up in the web UI.
We can't wait to run this job, so we'll use the CLI as well, and we'll pass three additional parameters, --fruit, --vegetable and --nut, which our script expects. But the -- prefixes would confuse the AWS CLI, so let's store these in a file called args.json containing:
{
  "--fruit": "tomato",
  "--vegetable": "cucumber",
  "--nut": "almond"
}
And call our job like so:
aws glue start-job-run --job-name super_perfect_job --arguments file://args.json --region af-south-1
Or like this:
aws glue start-job-run --job-name super_perfect_job --arguments '{"--fruit": "tomato","--vegetable": "cucumber"}'
And you can view the status of job runs with:
aws glue get-job-runs --job-name super_perfect_job --region us-east-1
As you can see, the AWS Glue API accessed by the AWS CLI is pretty powerful, being not only convenient, but allowing automation in Continuous Integration (CI) servers like Jenkins, for example. Run aws glue help for more commands and quick help or see the online documentation for more details.
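For that kind of automation, the boto3 equivalents are just as short; a minimal sketch using the same job name and arguments as above:
import boto3

glue = boto3.client('glue', region_name='af-south-1')

# Start a run, passing the same job arguments as the CLI example.
run_id = glue.start_job_run(
    JobName='super_perfect_job',
    Arguments={'--fruit': 'tomato', '--vegetable': 'cucumber', '--nut': 'almond'},
)['JobRunId']

# Check the run state (STARTING, RUNNING, SUCCEEDED, FAILED, ...).
run = glue.get_job_run(JobName='super_perfect_job', RunId=run_id)['JobRun']
print(run_id, run['JobRunState'])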
For creating or managing permanent infrastructure, it's preferable to use Infrastructure as Code tools, such as CloudFormation or Terraform.