I have a cloud setup with a DynamoDB table whose entries each have a corresponding SQS queue (created by a Lambda when entries are added to the DB).
A number of external systems need access to one or more entries, which means that they should read messages from the SQS queue linked to the entries in the database.
I'm looking for the smartest way to do the following:
1. Create database entries (based on an input JSON string or file reference, etc.) - this will automatically generate the needed queues
2. Create an IAM user
3. Gather the ARNs of all queues generated in (1) and generate a matching set of permissions, so the user created in (2) can read/delete/purge these queues.
4. Output the newly created user credentials and all related items in the database.
Of course I could make a Lambda and a JavaScript script that does all of it, but I'm looking for a smarter way to do this, hopefully using the AWS CLI only?
I ended up doing it like this:
Created user groups in CloudFormation
Made a shell script for creating users and assigning them to the proper groups
Made another shell script for creating resources in the database table, adding the created user name as part of the data model
Made a Lambda (triggered by DB events) that sets permissions on the queues related to the database entries, using the ARN of the user stored in the database as the principal (see the sketch below).
Works so far!
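For the last step, here is a minimal sketch (in Python/boto3, though the same is doable from the AWS CLI) of what the queue-permission Lambda could do; the queue URL/ARN and user ARN are placeholders, not the actual data model:

```python
import json
import boto3

sqs = boto3.client("sqs")

def allow_user_on_queue(queue_url: str, queue_arn: str, user_arn: str) -> None:
    """Attach a queue policy that lets the given IAM user read/delete/purge this queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowEntryOwner",
            "Effect": "Allow",
            "Principal": {"AWS": user_arn},   # the user ARN stored on the DB entry
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:PurgeQueue",
                "sqs:GetQueueAttributes",
            ],
            "Resource": queue_arn,
        }],
    }
    sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})
```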
I have a React app using Amplify with auth enabled. The app has many users, all of whom are members of exactly one "client".
I would like to limit access to the data in a Glue table to users who are members of that client, using IAM, so that I have a security layer as close to the data layer as possible.
I have a 'clientid' partition in the table. The table is backed by an s3 bucket, with each client's data stored in their own 'clientid=xxxxxx' folder. The table was created by a Glue job with the following option in the "write_dynamic_frame" method at the end, which created the folders.
{"partitionKeys": ["clientid"]},
My first idea was to bake the user's client ID into the query in the front end to select just their partition, but that is clearly open to abuse.
Then I tried to use a Glue crawler to scan the existing table's S3 bucket, in the hope it would create one table per folder if I unchecked the "Create a single schema for each S3 path" option. However, the crawler 'sees' the folders as partitions (presumably at least in part due to the Hive partitioning structure) and I just get a single table again.
There are tens of thousands of clients and terabytes of data, so moving/renaming data and manually creating tables is not feasible.
Please help!
I assume you already have a mechanism in place to assign an IAM role (individual or per client) to each user on the front end; otherwise that's a big topic that should probably be its own question.
The most basic way to solve your problem is to make sure that the IAM roles only have s3:GetObject permission to the prefix of the partition(s) that the user is allowed to access. This would mean that users can only access their own data and will receive an error if they try accessing other users' data. They could potentially fish for what client IDs are valid, though, by trying different combinations and observing the difference between the query not hitting any partition (which would be allowed since no files would be accessed), and the query hitting a partition (which would not be allowed).
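As a rough illustration of that (the bucket name, prefix layout, and role naming are assumptions, and this is shown with boto3 rather than whatever provisioning you already use):

```python
import json
import boto3

iam = boto3.client("iam")

CLIENT_ID = "123456"                      # placeholder client ID
BUCKET = "my-data-bucket"                 # placeholder bucket backing the Glue table
PREFIX = f"table/clientid={CLIENT_ID}/"   # the client's hive-style partition folder

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
    }],
}

# Attach as an inline policy on the client's role (role name is also a placeholder).
iam.put_role_policy(
    RoleName=f"client-{CLIENT_ID}-role",
    PolicyName="client-partition-read",
    PolicyDocument=json.dumps(policy),
)
```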
I think it would be better to create tables, or even databases, per client; that would allow you to put permissions at the Glue Data Catalog level too, not allowing queries at all against databases/tables other than the user's own. Glue Crawlers won't help you with that, unfortunately; they're too limited in what they can do, and will try to be helpful in unhelpful ways. You can create these tables easily with the Glue Data Catalog API, and you won't have to move any data; just point the tables' locations at the locations of the current partitions.
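A minimal sketch of that per-client table creation with boto3 (database name, bucket, columns, and the Parquet format are assumptions); the key point is that the table's Location points at the existing clientid=xxxxxx prefix, so no data has to move:

```python
import boto3

glue = boto3.client("glue")

CLIENT_ID = "123456"        # placeholder
BUCKET = "my-data-bucket"   # placeholder

# Assumes the per-client database already exists (use glue.create_database otherwise).
glue.create_table(
    DatabaseName=f"client_{CLIENT_ID}",
    TableInput={
        "Name": "events",                      # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            # Point straight at the existing partition folder; nothing is copied.
            "Location": f"s3://{BUCKET}/table/clientid={CLIENT_ID}/",
            "Columns": [
                {"Name": "event_time", "Type": "timestamp"},  # placeholder columns
                {"Name": "payload", "Type": "string"},
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```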
I am looking for a better way to replicate data from a DynamoDB table in one AWS account to another account.
I know this can be done using Lambda triggers and streams.
Is there something like Global Tables in AWS that we can use for replication across accounts?
I think the best way to migrate data between accounts is AWS Data Pipeline. This process essentially takes a backup (export) of your DynamoDB table in account A to an S3 bucket in account B via Data Pipeline. Then one more Data Pipeline job in account B imports the data from S3 into the target DynamoDB table.
The step-by-step guide is given in this document:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb.html.
You will also need cross-account access to the S3 bucket used to store the DynamoDB table data from account A, so the bucket (or the files) must be shared between your account (A) and the destination account (B) until the migration is complete.
Refer to this doc for the required permissions: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
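For example, a bucket policy along these lines on the account B bucket would let account A write the exported files (the account ID and bucket name are placeholders; account A's own IAM principals still need matching S3 permissions on their side):

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "ddb-migration-bucket"   # placeholder bucket owned by account B
ACCOUNT_A = "111111111111"        # placeholder source account ID

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowAccountAExport",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_A}:root"},
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{BUCKET}",      # for ListBucket
            f"arn:aws:s3:::{BUCKET}/*",    # for object-level actions
        ],
    }],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(bucket_policy))
```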
Another approach you can take is a script. There is no direct API for migration, so you will have to use two clients, one for each account: one client scans the data and the other writes it into the table in the other account.
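A bare-bones sketch of that approach with boto3 (profile names, region, and table names are placeholders); it paginates a Scan against account A and batch-writes into account B:

```python
import boto3

# One session per account; the profile names are placeholders.
src = boto3.Session(profile_name="account-a").resource("dynamodb", region_name="eu-west-1")
dst = boto3.Session(profile_name="account-b").resource("dynamodb", region_name="eu-west-1")

src_table = src.Table("source-table")         # placeholder table names
dst_table = dst.Table("destination-table")

scan_kwargs = {}
while True:
    page = src_table.scan(**scan_kwargs)
    # batch_writer() groups puts into BatchWriteItem calls and retries unprocessed items.
    with dst_table.batch_writer() as batch:
        for item in page["Items"]:
            batch.put_item(Item=item)
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```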
They also have an import-export tool in the AWSLabs repo, although I have never tried it:
https://github.com/awslabs/dynamodb-import-export-tool
I want to give a customer access to data in my account, but building a table around my S3 bucket requires additional overhead as the number of partitions is increasing over time. In my account, I have a Lambda that automatically handles this by dropping/creating Athena tables with the necessary partition projections. I would like to have this same Lambda create/drop an Athena table on another AWS account, not owned by me; is this possible? I know I could give each team access to my Athena table, but then all querying costs are accrued by me.
I want to give a customer access to data in my account, but building a table around my S3 bucket requires additional overhead as the number of partitions is increasing over time.
If I'm not mistaken, you won't have to manage partitions of your Athena/Glue table if you have Partition Projection set up. So if your customer has the right permissions for accessing objects in your S3 bucket, they should be able to run queries via a table defined in their account.
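For reference, partition projection is just a set of table properties on the Glue table; here is a hedged sketch of what the customer-side table could carry (the partition key 'dt', its date range, and the path template are assumptions, not your actual schema):

```python
# These go into TableInput["Parameters"] when the table is created or updated.
# The 'dt' partition key, its range, and the bucket path are assumptions.
projection_parameters = {
    "projection.enabled": "true",
    "projection.dt.type": "date",
    "projection.dt.range": "2020/01/01,NOW",
    "projection.dt.format": "yyyy/MM/dd",
    # Maps projected partition values to objects in the data owner's bucket.
    "storage.location.template": "s3://my-data-bucket/table/${dt}/",
}
```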
I would like to have this same lambda create/drop an Athena table on another AWS account, not owned by me, is this possible?
Yes, this is certainly possible as long as the correct permissions for creating/deleting tables in Glue are granted.
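One hedged way to do that from the Lambda is to assume a role that the customer's account has set up for this purpose (the role ARN and session name below are placeholders), then use a Glue client built from those credentials:

```python
import boto3

sts = boto3.client("sts")

# Role in the customer's account that trusts this Lambda's execution role and
# allows glue:CreateTable / glue:DeleteTable (the ARN is a placeholder).
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/partner-glue-admin",
    RoleSessionName="cross-account-table-maintenance",
)["Credentials"]

remote_glue = boto3.client(
    "glue",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# remote_glue.create_table(...) / remote_glue.delete_table(...) now act in the other account.
```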
Based on feedback I received on the issues I was encountering, I ended up not needing to create/drop tables on another AWS account and instead used Athena's ALTER TABLE SET TBLPROPERTIES query to change the partition projection properties of my Athena table (via the execution of a Lambda). Customers are required to do the same for their tables that rely on our S3 data.
Note: If you choose to go this route, make sure the code that runs the ALTER TABLE SET handles the race condition that comes from setting the table properties rather than appending to them. For example, if a property value is XYZ and the code responsible for executing the query runs in parallel (via multiple Lambda invocations or some other parallelism), process A may attempt to add value T to XYZ while process B attempts to add value C, resulting in either XYZT or XYZC instead of the expected XYZTC.
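For illustration, running the statement from the Lambda could look roughly like this (the table, database, property name, and output location are assumptions); note that the full, already-merged value is written in one statement, which is exactly where the race described above comes from:

```python
import boto3

athena = boto3.client("athena")

# Placeholder table and projection property; the merged value list must be built
# before this runs, so concurrent invocations can overwrite each other's merges.
query = (
    "ALTER TABLE my_table "
    "SET TBLPROPERTIES ('projection.region.values' = 'us-east-1,eu-west-1')"
)

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```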
I have multiple folders inside a bucket; each folder is named with a unique GUID and always contains a single file.
I need to fetch only those files which have never been read before. If I fetch all the objects at once and do client-side filtering, it could introduce latency in the near future, since hundreds of new folders may be added every day.
Initially I tried to list objects by specifying StartAfter, but I soon realized it only works against the alphabetically sorted key list.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
I am using the AWS C# SDK. Can someone please give me some idea of the best approach?
Thanks
Amazon S3 does not maintain a concept of "objects that have not been accessed".
However, there is a different approach to process each object only once:
Create an Amazon S3 Event notification that triggers when an object is created
The event can then:
Invoke an AWS Lambda function, or
Send a message to an Amazon SQS queue, or
Send a message to an Amazon SNS topic
You could therefore trigger your custom code via one of these methods, and you will never actually need to "search" for new objects.
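For example (shown as a Python Lambda handler for brevity, but the same event shape applies with the C# SDK), each notification carries the bucket and key of exactly one new object, so there is nothing to list or filter:

```python
# Lambda handler invoked by an S3 "ObjectCreated" event notification.
def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Process the single file in its GUID-named folder here.
        print(f"New object: s3://{bucket}/{key}")
```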
I created a test Redshift cluster and enabled audit logging on the database. This produces connection logs, user logs and user activity logs (details about the logs are available here), written to an S3 bucket at the following location:
s3://bucket_name/AWSLogs/123456789012/redshift/<region>/<year>/<month>/<date>/*_<log_type>_<timestamp>.gz
Next I created a Glue Crawler and pointed the data store to s3://bucket_name/AWSLogs/123456789012/redshift and left the remaining options as the default values.
When I run the Crawler, it creates a separate table for every log item. Instead, I expect it to create 3 tables (one each for user log, user activity log and connection log).
Following are some things I tried with no success:
Updated the data store to point to a prefix further inside the bucket, like s3://bucket_name/AWSLogs/123456789012/redshift/<region>.
Grouping behavior: create a single schema for each S3 path
Configuration options: add new columns only
Am I missing something here? Thank you.
You can't keep all 3 schema types under one folder. They should be in separate folders before running the crawler at the root folder (see the sketch below).
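A rough sketch of that reorganisation (the bucket name and destination prefixes are placeholders): copy each log file into a per-type prefix based on its filename, then point the crawler at the new root:

```python
import boto3

s3 = boto3.resource("s3")

BUCKET = "bucket_name"                            # placeholder
SOURCE_PREFIX = "AWSLogs/123456789012/redshift/"
DEST_PREFIX = "redshift-logs-by-type/"            # new root for the crawler

# Filename fragments that identify each of the three log types.
LOG_TYPES = ["connectionlog", "userlog", "useractivitylog"]

bucket = s3.Bucket(BUCKET)
for obj in bucket.objects.filter(Prefix=SOURCE_PREFIX):
    for log_type in LOG_TYPES:
        if f"_{log_type}_" in obj.key:
            dest_key = f"{DEST_PREFIX}{log_type}/{obj.key.rsplit('/', 1)[-1]}"
            bucket.copy({"Bucket": BUCKET, "Key": obj.key}, dest_key)
            break
```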