How to assign column-level restriction on BigQuery table in asia-east1 location

I want to restrict access to certain PII columns of my BigQuery tables. My tables are in location asia-east1. The BigQuery 'Policy Tag' feature can create policy tags for enforcing column restrictions only in the 'US' and 'EU' regions. When I try to assign these policy tags to my asia-east1 tables, it fails with the error:
BigQuery error in update operation: Policy tag reference projects/project-id/locations/us/taxonomies/taxonomy-id/policyTags/policytag-id should contain a location that is in the same region as the dataset.
Any idea on how I can implement this column level restriction for my asia-east1 BigQuery tables?

Summarising our discussion from the comment section.
According to the documentation, BigQuery provides fine-grained access to sensitive data based on the type or classification of the data. In order to achieve this, you can use Data Catalog to create a taxonomy and policy tags for your data.
Regarding the location of the policy tags, asia-east1: currently, this feature is in Beta. This is a launch stage where the product is available for broader testing and use, and new features/updates might still be taking place. For this reason, Data Catalog locations are limited to the ones listed here. As shown in the link, the asia-east1 endpoint has Taiwan as its region.
As additional information, here is a how-to guide to implement Policy Tags in BigQuery.
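To make that concrete, here is a minimal sketch of creating a taxonomy and policy tag in asia-east1 with the Data Catalog Python client and attaching the tag to a table column. The project, dataset, table, and column names are placeholders I made up for illustration, and the exact calls may differ slightly between client library versions.

```python
from google.cloud import bigquery, datacatalog_v1

PROJECT = "my-project"      # placeholder project ID
LOCATION = "asia-east1"

# 1. Create a taxonomy and a policy tag in the same region as the dataset.
ptm = datacatalog_v1.PolicyTagManagerClient()
taxonomy = ptm.create_taxonomy(
    parent=f"projects/{PROJECT}/locations/{LOCATION}",
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="pii",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)
pii_tag = ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="email"),
)

# 2. Attach the policy tag to the PII column of an existing table.
bq = bigquery.Client(project=PROJECT)
table = bq.get_table(f"{PROJECT}.my_dataset.customers")  # placeholder table
new_schema = []
for field in table.schema:
    if field.name == "email":  # placeholder PII column
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[pii_tag.name]),
        )
    new_schema.append(field)
table.schema = new_schema
bq.update_table(table, ["schema"])
```

Access to the tagged column is then controlled by granting the Fine-Grained Reader role on the policy tag rather than on the table itself.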

Related

How to share an Athena Iceberg table with another account

I've recently been looking into the Apache Iceberg table format to reduce Athena query times on a Glue table with a large number of partitions; the additional features would be a bonus (transactions, row-level updates/deletes, time-travel queries, etc.). I've successfully built the tables and confirmed that they address the issue at hand, but I'd now like to be able to share the table with another AWS account. We've done this previously using Lake Formation cross-account grants and also the method described here, but both approaches raise errors in the other account when trying to query the shared table. I've also tried using a bucket policy and registering a duplicate Glue table in the other account, which doesn't throw an error, but no rows are found when querying.
Is this currently possible? I'm aware that I could achieve this by providing role access into the account with the Iceberg table, but this complicates interaction with the table from other services in the other account. Any ideas appreciated.
Edit: When querying the Lake Formation table I see 'Generic internal error - access denied'; it's documented that Iceberg tables don't work with Lake Formation, so this is expected. When querying the table shared via the cross-account data catalog I see 'HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info' when running a SELECT query, and 'FAILED: SemanticException Unable to fetch table XXXXXXXXX. Unable to get table: java.lang.NullPointerException' when running SHOW CREATE TABLE or DESCRIBE. I can successfully run SHOW TBLPROPERTIES.
As of now, Apache Iceberg's Lake Formation integration is not supported:
Lake Formation – Integration with AWS Lake Formation is not supported.
https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html

How to secure access to an Athena/Glue table by partition using native AWS features

I have a React app using Amplify with auth enabled. The app has many users, all of whom are members of exactly one "client".
I would like to be able to limit access to the data in a Glue table to users that are members of the client, using IAM, so that I have a security layer as close to the data layer as possible.
I have a 'clientid' partition in the table. The table is backed by an S3 bucket, with each client's data stored in its own 'clientid=xxxxxx' folder. The table was created by a Glue job with the following option passed to the "write_dynamic_frame" call at the end, which created the folders:
{"partitionKeys": ["clientid"]},
My first idea was to use the clientid in the front end to bake the user's client ID into the query so it selects just their partition but, clearly, that is open to abuse.
Then I tried using a Glue crawler to scan the existing table's S3 bucket, in the hope that it would create one table per folder if I unchecked the "Create a single schema for each S3 path" option. However, the crawler 'sees' the folders as partitions (presumably, at least in part, due to the Hive partitioning structure) and I just get a single table again.
There are tens of thousands of clients and TBs of data, so moving/renaming data around and manually creating tables is not feasible.
Please help!
I assume you already have a mechanism in place to assign an IAM role (individual or per client) to each user on the front end; otherwise, that's a big topic that should probably be its own question.
The most basic way to solve your problem is to make sure that the IAM roles only have s3:GetObject permission to the prefix of the partition(s) that the user is allowed to access. This would mean that users can only access their own data and will receive an error if they try accessing other users' data. They could potentially fish for what client IDs are valid, though, by trying different combinations and observing the difference between the query not hitting any partition (which would be allowed since no files would be accessed), and the query hitting a partition (which would not be allowed).
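As an illustration only, a per-client role's inline policy could scope object reads to that client's partition prefix. The bucket name, data prefix, role name, and client ID below are placeholders, and the role would still need the usual Athena/Glue permissions plus access to the query-results bucket.

```python
import json
import boto3

iam = boto3.client("iam")

client_id = "123456"         # placeholder client ID
bucket = "my-data-bucket"    # placeholder bucket backing the Glue table

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow reading objects only under this client's partition prefix.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/data/clientid={client_id}/*",
        },
        {
            # Athena also needs to list the bucket; restrict listing to the same prefix.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {
                "StringLike": {"s3:prefix": [f"data/clientid={client_id}/*"]}
            },
        },
    ],
}

iam.put_role_policy(
    RoleName=f"client-{client_id}-role",   # placeholder role name
    PolicyName="client-partition-read",
    PolicyDocument=json.dumps(policy),
)
```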
I think it would be better to create tables, or even databases, per client; that would also allow you to put permissions at the Glue Data Catalog level, not allowing queries at all against databases/tables other than the user's own. Glue crawlers won't help you with that, unfortunately; they're too limited in what they can do and will try to be helpful in unhelpful ways. You can create these tables easily with the Glue Data Catalog API and you won't have to move any data, just point the tables' locations at the locations of the current partitions.
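As a rough sketch (not a drop-in solution), registering such a per-client table against the client's existing partition prefix might look like the following. The database, table, bucket, data prefix, and column names are placeholders, and Parquet-formatted data is assumed.

```python
import boto3

glue = boto3.client("glue")

client_id = "123456"         # placeholder client ID
bucket = "my-data-bucket"    # placeholder bucket backing the existing table

database = f"client_{client_id}"
glue.create_database(DatabaseInput={"Name": database})

glue.create_table(
    DatabaseName=database,
    TableInput={
        "Name": "events",                    # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            # Point the table straight at the client's existing partition prefix;
            # no data has to be moved or copied.
            "Location": f"s3://{bucket}/data/clientid={client_id}/",
            "Columns": [
                {"Name": "event_time", "Type": "timestamp"},  # placeholder columns
                {"Name": "payload", "Type": "string"},
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```

Combined with the IAM-level S3 restriction above, each client's role can then be granted catalog permissions only on its own database.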

What does an AWS Glue Crawler do

I've read the AWS Glue docs on crawlers here: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html but I'm still unclear on what exactly the Glue crawler does. Does a crawler go through your S3 buckets and create pointers to those buckets?
When the docs say "The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog" what is the purpose of these metadata tables?
The crawler creates the metadata that allows Glue and services such as Athena to view the information in S3 as a database with tables; that is, it lets you build the Glue Data Catalog. This way you can see the data sitting in S3 as a database composed of several tables.
For example, if you want to create a crawler you must specify the following fields:
Database --> name of the database
Service role --> service-role/AWSGlueServiceRole
Selected classifiers --> specify a classifier
Include path --> S3 location
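Those same fields map onto the Glue API. A hedged sketch of creating and starting a crawler with boto3 (the crawler name, database, role, and S3 path below are placeholders) might look like this:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="my-crawler",                            # placeholder crawler name
    Role="service-role/AWSGlueServiceRole",       # Service role from the example above
    DatabaseName="my_database",                   # Database --> name of the database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},  # Include path
)

# Run the crawler; when it finishes, the discovered tables appear in the
# Data Catalog and can be queried from Athena or loaded in Glue jobs.
glue.start_crawler(Name="my-crawler")
```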
Crawlers are needed to analyze the data in the specified S3 location and generate/update the Glue Data Catalog, which is basically a metastore for the actual data (similar to the Hive metastore). In other words, it persists information about the physical location of the data, its schema, format and partitions, which makes it possible to query the actual data via Athena or to load it in Glue jobs.
I would suggest reading this documentation to understand Glue crawlers better and, of course, making some experiments.

Is it possible to see a history of interaction with tables in a Redshift schema?

Ultimately, I would like to obtain a list of tables in a particular schema that haven't been queried in the last two weeks (say).
I know that there are many system tables that track various things about how the Redshift cluster is functioning, but I have yet to find one that I could use to obtain the above.
Is what I want to do possible?
Please have a look at our "Unscanned Tables" query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/unscanned_table_summary.sql
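In the same spirit, a stripped-down version of that check can be run directly against the system tables. This is just a sketch (not the awslabs script): the connection details are placeholders, and keep in mind that STL tables retain only a few days of history, so a full two-week window generally requires audit logs or a persisted copy of the scan history.

```python
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="admin", password="...",
)

# Tables in a schema with no scans recorded in STL_SCAN.
# Note: STL tables keep only a few days of history, so this under-approximates
# "not queried in the last two weeks" unless scan history is persisted elsewhere.
SQL = """
SELECT ti."schema", ti."table"
FROM svv_table_info ti
LEFT JOIN (
    SELECT tbl, MAX(starttime) AS last_scan
    FROM stl_scan
    GROUP BY tbl
) s ON s.tbl = ti.table_id
WHERE ti."schema" = %s
  AND (s.last_scan IS NULL OR s.last_scan < DATEADD(day, -14, GETDATE()))
ORDER BY ti."table";
"""

with conn, conn.cursor() as cur:
    cur.execute(SQL, ("my_schema",))
    for schema, table in cur.fetchall():
        print(f"{schema}.{table}")
```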
If you have enabled audit logging for the cluster, activity data is stored in the S3 bucket you configured when enabling logging.
According to AWS Documentation, audit log bucket structure is as follows.
AWSLogs/AccountID/ServiceName/Region/Year/Month/Day/AccountID_ServiceName_Region_ClusterName_LogType_Timestamp.gz
For example: AWSLogs/123456789012/redshift/us-east-1/2013/10/29/123456789012_redshift_us-east-1_mycluster_userlog_2013-10-29T18:01.gz
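As a hedged sketch, the log files for a given day can be listed with boto3 following that key structure; the bucket name, account ID, region, and date below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

bucket = "my-audit-log-bucket"   # placeholder: bucket configured for audit logging
prefix = "AWSLogs/123456789012/redshift/us-east-1/2013/10/29/"  # placeholder path

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    # User log / user activity log files carry the log type in the key, as in
    # the example above.
    if "userlog" in obj["Key"] or "useractivitylog" in obj["Key"]:
        print(obj["Key"])
```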

DynamoDB Table Missing?

I had created a simple table in DynamoDB called userId. I could view it in the AWS console and query it through some Java on my local machine. This morning, however, I could no longer see the table in the DynamoDB dashboard, but I could still query it through the Java code. The dashboard showed no tables at all (I only had one, the missing 'userId'). I then created a new table using the dashboard, called it userId and populated it. However, when I now run my Java code to query it, it returns the items from the missing 'userId' table, not this new one! Any ideas what is going on?
OK, that's strange. I thought DynamoDB tables were not tied to a region, but I noticed that once I created this new version of 'userId' it was viewable under the eu-west region, and then I could see the different (previously missing!) 'userId' table in the us-east region. They both had the same table name but contained different items. I didn't think this was possible?
Most AWS services are regional. The main exceptions are Route 53 (DNS), IAM, and CloudFront (CDN). The reason is that you want to control the location of your data, mainly for regulatory reasons; many times your data can't leave the US, Europe, or some other region.
It is possible to achieve high availability for your services within a single region by using Availability Zones. This is how highly available services such as DynamoDB or S3 provide that functionality: by replicating data between Availability Zones, but within a single region.
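To see this in practice, listing tables with region-specific DynamoDB clients (a small sketch; the two regions are just the ones mentioned above) shows two independent table namespaces:

```python
import boto3

# The same table name can exist independently in each region,
# because DynamoDB tables are scoped to a single region.
for region in ("us-east-1", "eu-west-1"):
    dynamodb = boto3.client("dynamodb", region_name=region)
    tables = dynamodb.list_tables()["TableNames"]
    print(region, tables)
```

The Java code was most likely configured with the endpoint of the original region, which is why it kept returning items from the "missing" table.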