how to get a list of dynamo db tables in aws that have a creationDate less than a certain value - amazon-web-services

I wish to use an aws cli commad that will return a list of dynamodb tables in my aws account where the CreationDateTime is less than a certain value. CreationDateTime is a property that is shown with the describe-table command, but the list-tables command only returns the names of the table. Is there any way I could possibly use a query to filter the names returned by list-tables in accordance with their corresponding CreationDateTime?

As noted, the answer is simply no, the AWS CLI can not do what you want in one query.
You need to resort to shell scripting to chain the necessary aws commands, or,
if you are willing to give up the hard requirement that it must be the AWS CLI, an alternative solution (and probably faster than forking off aws over and over) is to use a 5 line python script to get the result you want. You can still run this from the command-line, but it won't be the AWS CLI specifically.
Just perform the time arithmetic that suits your specific selection criteria.
#!/usr/bin/env python3
import boto3
from datetime import datetime, timedelta, timezone
ddb = boto3.resource('dynamodb')
for t in sorted(filter(lambda table: datetime.now(timezone.utc) - table.creation_date_time < timedelta(days=90), ddb.tables.all()), key=lambda table: table.creation_date_time):
print(f"{t.name}, {str(t.creation_date_time)}") # tables created in the last 90 days

No - As you noted, ListTables only lists the names of tables, and there is no request that provides additional information for each table, let alone filter on such information. You'll need to use ListTables and then DescribeTable on each of those tables. You can run all of those DescribeTable requests in parallel to make the whole thing much faster (although there is a practical snag - to do it in parallel, you'll need to have opened a bunch of connections to the server).

Related

Export DynamoDB table to S3 automatically

The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to a S3 file automatically on an everyday basis as well, so I'm able to use services like QuickSight, Athena, Forecast on the data.
If possible and easily implementable, I'd like to only have one S3 file that gets added with the most recent data of the day, because an extra file everyday seems kinda pricey. If that's not possible, an extra file everyday would also be fine.
What's the best way to go about doing so without using CLI (because I'm not allowed to install programs to my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at DataPipeline. This is a use case and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting into DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.

Can AWS Glue write to DynamoDB?

I need to do some grouping job from a Source DynamoDB table, then write each resulting Item to another Target DynamoDB table (or a secondary index of the Source one).
Here I see that DynamoDB can be used as a Source (as well as reported in Connection Types).
However, it's not clear to me if a DynamoDB table can be used as Target as well.
Note: each resulting grouping item must be written into a separate DynamoDB Item (i.e., if there are X objects resulting from grouping, X Items must be written to Target DynamoDB table).
Glue can now read and write to DynamoDB. The option to write is not available via the console, but can be done by editing the script.
Example:
Datasink1 = glueContext.write_dynamic_frame_from_options(
frame=ApplyMapping_Frame1,
connection_type="dynamodb",
connection_options={
"dynamodb.output.tableName": "myDDBTable",
"dynamodb.throughput.write.percent": "1.0"
}
)
As per:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#etl-connect-dynamodb-as-sink
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-dynamo-db-cross-account.html
The Glue Job scripts can be customized to write to any datasource. If you are using the auto generated scripts, you can add boto3 library to write to DynamoDb tables.
If you want to test the scripts easily, you can create a Dev endpoint through AWS console & launch a jupyter notebook to write and test your glue job scripts.

Daily AWS Lambda not creating Athena partition, however commands runs successfully

I have an Athena database set up pointing at an S3 bucket containing ALB logs, and it all works correctly. I partition the table by a column called datetime and the idea is that it has the format YYYY/MM/DD.
I can manually create partitions through the Athena console, using the following command:
ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'
I have created a lambda to run daily to create a new partition, however this doesn't seem to work. I use the boto3 python client and execute the following:
result = athena.start_query_execution(
QueryString = "ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'",
QueryExecutionContext = {
'Database': 'web'
},
ResultConfiguration = {
"OutputLocation" : "s3://aws-athena-query-results-093305704519-eu-west-1/Unsaved/"
}
)
This appears to run successfully without any errors and the query execution even returns a QueryExecutionId as it should. However if I run SHOW PARTITIONS web.alb_logs; via the Athena console it hasn't created the partition.
I have a feeling it could be down to permissions, however I have given the lambda execution role full permissions to all resources on S3 and full permissions to all resources on Athena and it still doesn't seem to work.
Since Athena query execution is asynchronous your Lambda function never sees the result of the query execution, it just gets the result of starting the query.
I would be very surprised if this wasn't a permissions issue, but because of the above the error will not appear in the Lambda logs. What you can do is to log the query execution ID and look it up with the GetQueryExecution API call to see that the query succeeded.
Even better would be to rewrite your code to use the Glue APIs directly to add the partitions. Adding a partition is a quick and synchronous operation in Glue, which means you can make the API call and get a status in the same Lambda execution. Have a look at the APIs for working with partitions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html

Automate aws Athena partition loading [duplicate]

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. Also, I would like to visualize them in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or using an AWS Data pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3
def lambda_handler(event, context):
bucket_name = 'some_bucket'
client = boto3.client('athena')
config = {
'OutputLocation': 's3://' + bucket_name + '/',
'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
}
# Query Execution Parameters
sql = 'MSCK REPAIR TABLE some_database.some_table'
context = {'Database': 'some_database'}
client.start_query_execution(QueryString = sql,
QueryExecutionContext = context,
ResultConfiguration = config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
Which adds a the newly created partition from your S3 location
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
There's multiple ways to solve the issue and get the table updated:
Call MSCK REPAIR TABLE. This will scan ALL data. It's costly as every file is read in full (at least it's fully charged by AWS). Also it's painfully slow. In short: Don't do it!
Create partitions by your own by calling ALTER TABLE ADD PARTITION abc .... This is good in a sense no data is scanned and costs are low. Also the query is fast, so no problems here. It's also a good choice if you have very cluttered file structure without any common pattern (which doesn't seem it's your case as it's a nicely organised S3 key pattern). There's also downsides to this approach: A) It's hard to maintain B) All partitions will to be stored in GLUE catalog. This can become an issue when you have a lot of partitions as they need to be read out and passed to Athena and EMRs Hadoop infrastructure.
Use partition projection. There's two different styles you might want to evaluate. Here's the variant with does create the partitions for Hadoop at query time. This means there's no GLUE catalog entries send over the network and thus large amounts of partitions can be handled quicker. The downside is you might 'hit' some partitions that might not exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated - no matter if they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire and forget approach as there's no updates needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
...
)
PARTITIONED BY (
`YEAR` int,
`MONTH` int,
`DATE` int)
...
LOCATION
's3://DATA/'
TBLPROPERTIES(
"projection.enabled" = "true",
"projection.account.type" = "integer",
"projection.account.range" = "1,50",
"projection.YEAR.type" = "integer",
"projection.YEAR.range" = "2020,2025",
"projection.MONTH.type" = "integer",
"projection.MONTH.range" = "1,12",
"projection.DATE.type" = "integer",
"projection.DATE.range" = "1,31",
"storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
Just to list all options: You can also use GLUE crawlers. But it doesn't seemed to be a favourable approach as it's not as flexible as advertised.
You get more control on GLUE using Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts
that do the preparation work to setup your table.
In short:
If your application is SQL centric, you like the leanest approach with no scripts, use partition projection
If you have many partitions, use partition projection
If you have a few partitions or partitions do not have a generic pattern, use approach #2
If you're script heavy and scripts do most of the work anyway and are easier to handle for you, consider approach #5
If you're confused and have no clue where to start - try partition projection first! It should fit 95% of the use cases.

AWS Elastic Mapreduce optimizing Pig job

I am using boto 2.8.0 to create EMR jobflows over large log file stored in S3. I am relatively new to Elastic Mapreduce and am getting the feel for how to properly handle jobflows from this issue.
The logfiles in question are stored in s3 with keys that correspond to the dates they are emitted from the logging server, eg: /2013/03/01/access.log. These files are very, very large. My mapreduce job runs an Apache Pig script that simply examines some of the uri paths stored in the log files and outputs generalized counts that correspond to our business logic.
My client code in boto takes date times as input on cli and schedules a jobflow with a PigStep instance for every date needed. Thus, passing something like python script.py 2013-02-01 2013-03-01 would iterate over 29 days worth of datetime objects and create pigsteps with the respective input keys for s3. This means that the resulting jobflow could have many, many steps, one for each day in the timedelta between the from_date and to_date.
My problem is that my EMR jobflow is exceedingly slow, almost absurdly so. It's been running for a night now and hasn't made it even halfway through that example set. Is there something wrong I am doing creating many jobflow steps like this? Should I attempt to generalize the pig script for the different keys instead, rather than preprocessing it in the client code and creating a step for each date? Is this a feasible place to look for an optimization on Elastic Mapreduce? It's worth mentioning that a similar job for a months worth of comparable data passed to the AWS elastic-mapreduce cli ruby client took about 15 minutes to execute (this job was fueled by the same pig script.)
EDIT
Neglected to mention, job was scheduled for two instances of type m1.small, which admittedly may in itself be the problem.