TL;DR
We are looking for a way to let a service account inherit BQ read permissions from more than one other service account. Impersonation only works with a single target account.
The scenario
Our company follows a data mesh approach in which our product teams are responsible for integrating their data into BigQuery. The product owner is also considered the owner of the data, so it is the product owner who decides who gets access to the data.
Working in an analytical team, we usually combine data from multiple source systems in our BigQuery queries. Our ETL processes run on a Kubernetes cluster, and each process uses a separate service account. This gives us fine-grained access control, since each process can access only the objects it really needs. This design also helps us with debugging and cost control. On the other hand, it leads to an issue on the source side:
The problem
Every time we design a new process, we need to ask the data owner for permission. They have already agreed that our product team / system as a whole may access their data, so this authorization process is quite cumbersome and confuses the data owner.
We'd prefer to have just one "proxy" service account for each source object that holds the necessary BQ read permissions. The processes' service accounts would then be set up to inherit the rights from the proxy service accounts of those BQ sources they need to access.
Impersonation only helps if there is just a single source system, but our queries often use more than one.
Using Google Groups does not help
We discussed a solution in which we set up a Google group for each source system we want to read from. The BigQuery Data Reader role would then be assigned to this group, and in turn, service accounts that require those rights would be added to the group. However, company policy does not allow adding service accounts to Google groups. Also, Google groups cannot be created by our product teams themselves, so this approach lacks flexibility.
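For illustration, the dataset-level grant we had in mind would roughly look like this with the google-cloud-bigquery client (a sketch only; the project, dataset, and group names are made up):

    from google.cloud import bigquery

    client = bigquery.Client(project="source-project")              # hypothetical source project
    dataset = client.get_dataset("source-project.sales_data")       # hypothetical dataset

    # Grant dataset-level READER (the dataset equivalent of BigQuery Data Reader) to the group.
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="groupByEmail",
            entity_id="sales-data-readers@example.com",              # hypothetical group
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])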
Implementing a coarse-grained approach
One approach is to use more coarse-grained access control, i.e. just one service account for all ETL processes. We could add the process name as a label to each query to cover the debugging and cost control part. However, if possible, we'd prefer an approach in which each process can access as few data objects as possible.
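For the labeling part, this is a minimal sketch of what we have in mind with the google-cloud-bigquery client (the process name and table are made up); the label should then be visible in the INFORMATION_SCHEMA.JOBS views and the billing export:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(
        labels={"process": "orders_daily_load"}        # hypothetical process name
    )
    query_job = client.query(
        "SELECT order_id, amount FROM `source-project.sales_data.orders`",   # hypothetical table
        job_config=job_config,
    )
    rows = query_job.result()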
There is no easy solution.
Data governance is in place to control the quality, the source, and the access to the data. It's normal to have to ask them for access to the data.
Special groups could have access to all the data sources (after a request to the data governance team of each data mesh instance).
However, as you note, groups containing service accounts aren't allowed in your case.
The only solution I see is to use a single service account that is authorized on all the data mesh instances, and to impersonate it to access all the sources.
It's not perfect for traceability, but I don't see any other good solution for this.
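As a rough sketch, impersonating that single authorized service account from a process's own credentials could look like this (the account and project names are made up; the calling account needs the Service Account Token Creator role on the target):

    from google.auth import default, impersonated_credentials
    from google.cloud import bigquery

    # The process's own service account credentials (e.g. from the Kubernetes workload).
    source_credentials, _ = default()

    # Impersonate the one account that is authorized on all the data mesh instances.
    target_credentials = impersonated_credentials.Credentials(
        source_credentials=source_credentials,
        target_principal="bq-reader-proxy@analytics-project.iam.gserviceaccount.com",  # made up
        target_scopes=["https://www.googleapis.com/auth/bigquery.readonly"],
    )

    client = bigquery.Client(credentials=target_credentials, project="analytics-project")
    rows = client.query("SELECT 1").result()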
Related
Context: I am a new intern at a firm, and one of my responsibilities is to maintain a clean and ordered list of QuickSight analyses and datasets.
There are a lot of existing analysis reports and dashboards on the firm's Amazon QuickSight account, dating back several years. There is a concern about deleting the old reports and their supporting datasets, which take up a lot of SPICE storage, because someone might still be using or accessing them. Is there a way to see stats for each report, such as how many people accessed it and how many times it was used over the last month, which could help decide which analysis reports/datasets can be deleted? Please help.
This AWS blog post -- Using administrative dashboards for a centralized view of Amazon QuickSight objects -- discusses how BI administrators can use the QuickSight dashboard, Lambda functions, and other AWS services to create a centralized view of groups, users, and object access permissions, and to audit abnormal access.
It is mainly security focused, but it shows how to find the relevant information about access to QuickSight objects in AWS CloudTrail events.
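For a quick look without building the whole dashboard, you can also query CloudTrail directly with boto3; a sketch (it assumes dashboard views show up as GetDashboard events, and lookup_events only reaches back 90 days):

    import boto3
    from collections import Counter
    from datetime import datetime, timedelta, timezone

    cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

    end = datetime.now(timezone.utc)
    start = end - timedelta(days=30)

    # Count dashboard views per user over the last 30 days.
    views = Counter()
    paginator = cloudtrail.get_paginator("lookup_events")
    for page in paginator.paginate(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetDashboard"}],
        StartTime=start,
        EndTime=end,
    ):
        for event in page["Events"]:
            views[event.get("Username", "unknown")] += 1

    print(views.most_common())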
I'm creating a platform whereby users upload data to us. We check the data to make sure it's safe and correctly formatted, and then store the data in buckets, tagging by user.
The size of the data upload is normally around 100MB. This is large enough to be concerning.
I'm worried about cases where certain users may try to store an unreasonable amount of data on the platform, e.g. by making thousands of transactions within a short period of time.
How do cloud service providers allow site admins to monitor the amount of data stored per user?
How is this actively monitored?
Any direction/insight appreciated. Thank you.
Amazon S3 does not have a mechanism for limiting data storage per "user".
Instead, your application will be responsible for managing storage and for defining the concept of an "application user".
This is best done by tracking data files in a database, including filename, owner, access permissions, metadata and lifecycle rules (eg "delete after 90 days"). Many applications allow files to be shared between users, such as a photo sharing website where you can grant view-access of your photos to another user. This is outside the scope of Amazon S3 as a storage service, and should be handled within your own application.
If such data is maintained in a database, it would be trivial to run a query to identify "data stored per user". I would recommend against storing such information as metadata against the individual objects in Amazon S3 because there is no easy way to query metadata across all objects (eg list all objects associated with a particular user).
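To illustrate the idea (a sketch only, with SQLite standing in for whatever database your application already uses), tracking each upload in a table makes "data stored per user" a single query:

    import sqlite3

    db = sqlite3.connect("uploads.db")
    db.execute(
        """CREATE TABLE IF NOT EXISTS uploads (
               s3_key      TEXT PRIMARY KEY,
               user_id     TEXT NOT NULL,
               size_bytes  INTEGER NOT NULL,
               uploaded_at TEXT NOT NULL
           )"""
    )

    def record_upload(s3_key, user_id, size_bytes, uploaded_at):
        """Call this after a successful upload to S3."""
        db.execute(
            "INSERT OR REPLACE INTO uploads VALUES (?, ?, ?, ?)",
            (s3_key, user_id, size_bytes, uploaded_at),
        )
        db.commit()

    def storage_per_user():
        """Total bytes stored per application user."""
        return db.execute(
            "SELECT user_id, SUM(size_bytes) FROM uploads GROUP BY user_id"
        ).fetchall()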
Bottom line: Amazon S3 will store your data and keep it secure. However, it is the responsibility of your application to manage activities related to users.
I have an application that queries some of my AWS accounts every few hours. Is it safe (from a memory and number-of-connections perspective) to create a new client object for every request? As we need to sync almost all of the resource types for almost all of the regions, we end up with hundreds of clients (number of regions multiplied by number of resource types) per service run.
In general, creating AWS clients is pretty cheap, and it is fine to create them and quickly dispose of them. The one area I would be careful with when it comes to performance is when the SDK has to resolve credentials, such as assuming IAM roles to get credentials. It sounds like in your case you are iterating through a bunch of accounts, so I'm guessing you are explicitly setting credentials, and that will be okay.
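If it helps, here is a sketch of the pattern: assume the role once per account, then fan out the (cheap) clients from that single session. The role ARN, regions, and services below are made up:

    import boto3

    def clients_for_account(role_arn, regions, services):
        """Assume the role once, then create many cheap client objects from one session."""
        sts = boto3.client("sts")
        creds = sts.assume_role(RoleArn=role_arn, RoleSessionName="resource-sync")["Credentials"]

        session = boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
        return {
            (service, region): session.client(service, region_name=region)
            for region in regions
            for service in services
        }

    clients = clients_for_account(
        "arn:aws:iam::123456789012:role/ReadOnlySync",   # made-up role ARN
        regions=["us-east-1", "eu-west-1"],
        services=["ec2", "s3", "rds"],
    )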
I'm working on a client-side SDK for my product (based on AWS). The workflow is as follows:
The SDK user somehow uploads data to some S3 bucket
The user somehow saves a command to some queue in SQS
One of the workers on EC2 polls the queue, executes the operation, and sends a notification via SNS. This point seems to be clear.
As you might have noticed, there are quite a few unclear points about access management here. Is there any common practice for providing access to AWS services (S3 and SQS in this case) to third-party users of such an SDK?
Options which I see at the moment:
We create an IAM user for the users of the SDK that has access to some S3 resources and write permission on SQS.
We create an additional server/layer between AWS and the SDK that writes messages to SQS on behalf of users and provides a one-time short-lived link for the SDK to write data directly to S3 (see the sketch below).
The first one seems OK; however, I'm worried that I'm missing some obvious issues here. The second one seems to have an availability/scalability problem: if this layer goes down, the whole system won't work.
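For reference, the short-lived link in the second option would presumably be an S3 pre-signed URL, roughly like this sketch (the bucket and key are made up):

    import boto3

    s3 = boto3.client("s3")

    # A URL the SDK user can HTTP PUT the file to, without holding any AWS credentials.
    upload_url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "my-upload-bucket", "Key": "uploads/user-42/data.csv"},  # made up
        ExpiresIn=300,   # 5 minutes
    )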
P.S.
I tried my best to explain the situation, but I'm afraid the question might still lack some context. If you want more clarification, don't hesitate to write a comment.
I recommend you look closely at Temporary Security Credentials in order to limit customer access to only what they need, when they need it.
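A minimal sketch of handing out such scoped, short-lived credentials with STS (the bucket, queue, and customer identifiers are made up):

    import json
    import boto3

    sts = boto3.client("sts")

    # Credentials that expire in 15 minutes and only cover this customer's prefix and queue.
    response = sts.get_federation_token(
        Name="customer-42",
        DurationSeconds=900,
        Policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["s3:PutObject"],
                    "Resource": "arn:aws:s3:::my-upload-bucket/customer-42/*",
                },
                {
                    "Effect": "Allow",
                    "Action": ["sqs:SendMessage"],
                    "Resource": "arn:aws:sqs:us-east-1:123456789012:commands",
                },
            ],
        }),
    )
    temporary_credentials = response["Credentials"]   # AccessKeyId, SecretAccessKey, SessionToken, Expiration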
Keep in mind with any solution to this kind of problem, it depends on your scale, your customers, and what you are ok exposing to your customers.
With your first option, letting the customer directly use IAM or temporary credentials reveals to them that AWS is under the hood (since they can easily see requests leaving their system). It also gives them the potential to make their own AWS requests using those credentials, beyond what your code can validate and control.
Your second option is better since it addresses this: by making your server the only point of contact for AWS, it allows you to perform input validation etc. before sending customer-provided data to AWS. It also lets you replace the implementation easily without affecting customers. As for the availability/scalability concerns, that's what EC2 (and similar services) are for.
Again, all of this depends on your scale and your customers. For a toy application where you have a very small set of customers, simpler may be better for the purposes of getting something working sooner (rather than building & paying for a whole lot of infrastructure for something that may not be used).
I'm developing a platform where users will in effect have their own site within a directory of my own. Each user site will consist of a package of PHP scripts and the template/image files for their site's custom layout. Each user site will be connected to its own Amazon RDS instance. I need to be able to track the resource usage of each directory so that I can bill each user for the resources they have used. Would it be possible to set up custom metrics with CloudWatch so that I can calculate costs?
You should be able to use CloudWatch to do this; however, it might not be the most efficient place to put this information if you are going to bill or report on it. I think you are better off computing the data yourself and then storing it in a database of your own. That way you have easy access to the data, and you can do things with it that may not work well in the context of CloudWatch.
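If you do try the CloudWatch route first, publishing a per-site custom metric is a single call; a sketch (the namespace, dimension, and value are made up):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_data(
        Namespace="UserSites",
        MetricData=[
            {
                "MetricName": "StorageBytes",
                "Dimensions": [{"Name": "SiteDirectory", "Value": "user-42"}],
                "Value": 123456789,
                "Unit": "Bytes",
            }
        ],
    )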