I've gone with the approach of using a kms_aead_key template to generate a DEK, with the key in KMS acting as the KEK, and then writing the encrypted key out for use in BigQuery as a variable.
I can't seem to decrypt the DEK in BigQuery. It always fails with an error asking if I am referencing the right key, which I'm sure I am.
I’ve reproduced the problem in the script below. This script uses the same key_uri to encrypt in Python and decrypt in BigQuery, but I still get that error. I think the issue could be the format in which I am writing the key out and passing that to BQ, but I’m a little lost.
To get the script to run I had to download the roots.pem file from https://github.com/grpc/grpc/blob/master/etc/roots.pem
I placed the roots.pem file in the same directory as my script and then set the environment variable GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=roots.pem
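For example, one way to set it from inside the script (a sketch, assuming it runs before any gRPC channel is created):
import os
# Point gRPC at the downloaded CA bundle before the KMS client opens a channel.
os.environ['GRPC_DEFAULT_SSL_ROOTS_FILE_PATH'] = 'roots.pem'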
If you can figure out what is wrong I would be very grateful.
Thanks
import io
import tink
from tink import aead, cleartext_keyset_handle
from tink.integration import gcpkms
from google.cloud import bigquery
key_uri = 'gcp-kms://projects/PROJECT/locations/europe-west2/keyRings/KEY_RING/cryptoKeys/KEY_NAME'
aead.register()
gcpkms.GcpKmsClient.register_client(key_uri=key_uri, credentials_path="")
template = aead.aead_key_templates.create_kms_aead_key_template(key_uri=key_uri)
keyset_handle = tink.KeysetHandle.generate_new(template)
kms_aead_primitive = keyset_handle.primitive(aead.Aead)
# Encrypt
encrypted_value = kms_aead_primitive.encrypt(
    plaintext='encrypt_me'.encode('utf-8'),
    associated_data='test'.encode('utf-8')
)
out = io.BytesIO()
writer = tink.BinaryKeysetWriter(out)
cleartext_keyset_handle.write(writer, keyset_handle)
out.seek(0)
key = out.read()
# Decrypt in BigQuery
bq_client = bigquery.Client(location='europe-west2')
sql = f"""
DECLARE kms_resource_name STRING;
DECLARE first_level_keyset BYTES;
DECLARE associated_data STRING;
SET kms_resource_name = '{key_uri}';
SET first_level_keyset = {key};
SET associated_data = 'test';
SELECT
  AEAD.DECRYPT_STRING(
    KEYS.KEYSET_CHAIN(kms_resource_name, first_level_keyset),
    {encrypted_value},
    associated_data
  ) as Decrypt,
"""
job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH
)
job = bq_client.query(sql, job_config=job_config)
result = job.result()
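In case it helps narrow things down, here is a minimal sketch of an alternative way to pass the bytes into the SQL: base64-encoding them in Python and decoding with FROM_BASE64 in BigQuery, rather than interpolating the raw Python bytes reprs (same key_uri, keyset and ciphertext as above assumed):
import base64
# Sketch: pass the keyset and ciphertext as base64 strings instead of raw bytes reprs.
key_b64 = base64.b64encode(key).decode('ascii')
ciphertext_b64 = base64.b64encode(encrypted_value).decode('ascii')
sql = f"""
DECLARE kms_resource_name STRING DEFAULT '{key_uri}';
DECLARE first_level_keyset BYTES DEFAULT FROM_BASE64('{key_b64}');
DECLARE associated_data STRING DEFAULT 'test';
SELECT
  AEAD.DECRYPT_STRING(
    KEYS.KEYSET_CHAIN(kms_resource_name, first_level_keyset),
    FROM_BASE64('{ciphertext_b64}'),
    associated_data
  ) AS Decrypt
"""
The query can then be submitted exactly as in the script above.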
We've set up AWS SecretsManager as a secrets backend for Airflow (AWS MWAA) as described in their documentation. Unfortunately, the documentation doesn't explain where the secrets are then to be found or how they are to be used. When I supply a conn_id to a task in a DAG, we can see two errors in the task logs: ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined. What's even more surprising is that retrieving variables stored the same way with Variable.get('my_variable_id') works just fine.
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Btw, neither the connection nor the variable show up in the Airflow UI.
Having stored a secret named airflow/connections/redshift_conn (and, alongside it, one named airflow/variables/my_variable_id), I expect the connection to be found and used when constructing RedshiftSQLOperator(task_id='mytask', redshift_conn_id='redshift_conn', sql='SELECT 1'). But this results in the errors above.
I am able to retrieve the redshift connection manually in a DAG with a separate task, but I think that is not how SecretsManager is supposed to be used in this case.
The example DAG is below:
from airflow import DAG, settings, secrets
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.models.baseoperator import chain
from airflow.models import Connection, Variable
from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
from datetime import timedelta
sm_secret_id_name = f'airflow/connections/redshift_conn'
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'retries': 1,
}
def read_from_aws_sm_fn(**kwargs): # from AWS example code
    ### set up Secrets Manager
    hook = AwsBaseHook(client_type='secretsmanager')
    client = hook.get_client_type('secretsmanager')
    response = client.get_secret_value(SecretId=sm_secret_id_name)
    myConnSecretString = response["SecretString"]
    print(myConnSecretString[:15])
    return myConnSecretString
def get_variable(**kwargs):
    my_var_value = Variable.get('my_test_variable')
    print('variable:')
    print(my_var_value)
    return my_var_value
with DAG(
    dag_id=f'redshift_test_dag',
    default_args=default_args,
    dagrun_timeout=timedelta(minutes=10),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['example']
) as dag:
    read_from_aws_sm_task = PythonOperator(
        task_id="read_from_aws_sm",
        python_callable=read_from_aws_sm_fn,
        provide_context=True
    )  # works fine
    query_redshift = RedshiftSQLOperator(
        task_id='query_redshift',
        redshift_conn_id='redshift_conn',
        sql='SELECT 1;'
    )  # results in above errors :-(
    try_to_get_variable_value = PythonOperator(
        task_id='get_variable',
        python_callable=get_variable,
        provide_context=True
    )  # works fine!
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Using Secrets Manager as a backend, you don't need to change the way you use connections or variables. They work the same way: when looking up a connection/variable, Airflow follows a search path.
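For example, the same lookup an operator performs can be triggered manually to verify the backend (a quick sketch, Airflow 2.x assumed):
from airflow.hooks.base import BaseHook
# Resolves the connection through the configured search path (secrets backend included).
conn = BaseHook.get_connection('redshift_conn')
print(conn.conn_type, conn.host)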
Btw, neither the connection nor the variable show up in the Airflow UI.
The connection/variable will not show up in the UI.
ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined
The 1st error is related to the secret and the 2nd error is due to the connection not being found by Airflow.
There are two formats for storing connections in Secrets Manager (depending on the AWS provider version installed); the IPv6 URL error could mean the connection is not being parsed correctly. Here is a link to the provider docs.
The first step is defining the prefixes for connections and variables; if they are not defined, your secrets backend will not check for the secret:
secrets.backend_kwargs : {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}
Then for the secrets/connections, you should store them in those prefixes, respecting the required fields for the connection.
For example, for the connection my_postgress_conn:
{
    "conn_type": "postgresql",
    "login": "user",
    "password": "pass",
    "host": "host",
    "extra": '{"key": "val"}',
}
You should store it in the path airflow/connections/my_postgress_conn, with the json dict as string.
And for the variables, you just need to store them in airflow/variables/<var_name>.
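For illustration, a minimal sketch of storing such a connection with boto3 (the secret name and field values are placeholders):
import json
import boto3
secretsmanager = boto3.client('secretsmanager')
# Store the connection JSON (as a string) under the configured connections prefix + conn_id.
secretsmanager.create_secret(
    Name='airflow/connections/my_postgress_conn',
    SecretString=json.dumps({
        'conn_type': 'postgresql',
        'login': 'user',
        'password': 'pass',
        'host': 'host',
        'extra': json.dumps({'key': 'val'}),
    }),
)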
I used the wildcard (*) export to export a large BigQuery table into separate files in GCS, using the code sample provided in GCP's docs:
from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'bucket'
project = "project"
dataset_id = "dataset"
table_id = "table"
destination_uri = "gs://{}/{}".format(bucket_name, "table*.parquet")
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table(table_id)
extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
)  # API request
extract_job.result()  # Waits for job to complete.
print(
    "Exported {}:{}.{} to {}".format(project, dataset_id, table_id, destination_uri)
)
This generated 19 different files in my storage bucket, named like mytable000000000000.parquet, mytable000000000001.parquet, and so on (up to 0000000000019).
It would be nice to have an automatic way to get a list of these file names so that I can either compose them together or loop over them to do something else. Is there an easy way to edit the code above to do this?
You don't get an explicit list when using a wildcard, but take a look at the destinationUriFileCounts field in the extract job statistics. It tells you how many files are present. In Python, this is exposed on the ExtractJob object.
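A short sketch of reading it from the job used above (the shard-naming pattern below is inferred from the observed file names, so treat it as an assumption):
# After extract_job.result() has returned, the job statistics are populated.
num_files = extract_job.destination_uri_file_counts[0]  # one count per wildcard URI
# Reconstruct the shard names, assuming the 12-digit zero-padded numbering seen above.
file_names = ["table{:012d}.parquet".format(i) for i in range(num_files)]
print(file_names)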
If you want stronger validation, you could also leverage the Cloud Storage libraries and list objects with the same pattern(s) you supplied as part of the extract configuration.
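For example, with the wildcard URI above, something like this (a sketch using the google-cloud-storage client; bucket_name and the "table" prefix come from the question's code):
from google.cloud import storage
# List the exported objects directly, using the same prefix as the wildcard URI.
storage_client = storage.Client()
blobs = storage_client.list_blobs(bucket_name, prefix="table")
exported_files = [blob.name for blob in blobs if blob.name.endswith(".parquet")]
print(exported_files)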
I have developed different Athena Workgroups for different teams so that I can separate their queries and their query results. The users would like to query the tables available to them from their notebook instances (JupyterLab). I am having difficulty finding code which successfully covers the requirement of querying a table from the user's specific workgroup. I have only found code that will query the table from the primary workgroup.
The code I have currently used is added below.
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
               region_name='<YOUR REGION, for example, us-west-2>')
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
df
This code does not work because the users only have access to run queries from their specific workgroups, so they get errors when it is run. It also does not cover the requirement of separating users' queries into user-specific workgroups.
Any suggestions on how I can alter the code so that I can run the queries within a specific workgroup from the notebook instance?
pyathena's documentation is not super extensive, but after looking into the source code we can see that connect simply creates an instance of the Connection class.
def connect(*args, **kwargs):
    from pyathena.connection import Connection
    return Connection(*args, **kwargs)
Now, after looking at the signature of Connection.__init__ on GitHub, we can see the parameter work_group=None, which is named the same way as one of the parameters of start_query_execution from the official AWS Python API boto3. Here is what their documentation says about it:
WorkGroup (string) -- The name of the workgroup in which the query is being started.
After following the usages and imports in Connection, we end up at the BaseCursor class, which under the hood makes a call to start_query_execution while unpacking a dictionary of parameters assembled by the BaseCursor._build_start_query_execution_request method. That is exactly where we can see the familiar syntax for submitting queries to AWS Athena, in particular the following part:
if self._work_group or work_group:
    request.update({
        'WorkGroup': work_group if work_group else self._work_group
    })
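For comparison, this is roughly what the underlying boto3 call looks like when a workgroup is set (a sketch with placeholder values, not pyathena code):
import boto3
athena = boto3.client('athena')
# Submit a query directly to Athena with an explicit workgroup.
response = athena.start_query_execution(
    QueryString='SELECT 1',
    QueryExecutionContext={'Database': '<DATABASE-NAME>'},
    ResultConfiguration={'OutputLocation': '<ATHENA QUERY RESULTS LOCATION>'},
    WorkGroup='<USER SPECIFIC WORKGROUP>',
)
print(response['QueryExecutionId'])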
So this should do the trick for your case:
import pandas as pd
from pyathena import connect
conn = connect(
    s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
    region_name='<YOUR REGION, for example, us-west-2>',
    work_group='<USER SPECIFIC WORKGROUP>'
)
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
I implemented this and it worked for me.
!pip install pyathena
from pyathena import connect
from pyathena.pandas.util import as_pandas
import boto3
query = """
Select * from "s3-prod-db"."CustomerTransaction" ct where date(partitiondate) >= date('2022-09-30') limit 10
"""
query
cursor = connect(s3_staging_dir='s3://s3-temp-analytics-prod2/',
                 region_name=boto3.session.Session().region_name,
                 work_group='data-scientist').cursor()
df = cursor.execute(query)
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.output_location)
df = as_pandas(cursor)
print(df)
If we don't pass the work_group parameter, pyathena will use "primary" as the default workgroup.
If we pass s3_staging_dir='s3://s3-temp-analytics-prod2/' and that S3 bucket does not exist, it will create the bucket.
But if the role running the script does not have the privilege to create buckets, it will throw an exception.
Can someone provide a complete example of the following: Use boto3 and Python (2.7) to upload a file from a desktop computer to DigitalOcean Spaces, such that the uploaded file would be publicly readable from Spaces.
DigitalOcean says their Spaces API is the same as the S3 API. I don't know if this is 100% true, but here is their API: https://developers.digitalocean.com/documentation/spaces
I can do the file upload with the code below, but I can't find an example of how to specify that the file be publicly readable. I do not want to make the entire Space (= S3 bucket) readable -- only the object.
import boto3
boto_session = boto3.session.Session()
boto_client = boto_session.client('s3',
                                  region_name='nyc3',
                                  endpoint_url='https://nyc3.digitaloceanspaces.com',
                                  aws_access_key_id='MY_SECRET_ACCESS_KEY_ID',
                                  aws_secret_access_key='MY_SECRET_ACCESS_KEY')
boto_client.upload_file( FILE_PATHNAME, BUCKETNAME, OBJECT_KEYNAME )
Changing the last statement to the following did not work:
boto_client.upload_file( FILE_PATHNAME, BUCKETNAME, OBJECT_KEYNAME,
                         ExtraArgs={ 'x-amz-acl': 'public-read', } )
Thank you.
Thanks to #Vorsprung for his comment above. The following works:
import boto3
boto_session = boto3.session.Session()
boto_client = boto_session.client('s3',
                                  region_name='nyc3',
                                  endpoint_url='https://nyc3.digitaloceanspaces.com',
                                  aws_access_key_id='MY_SECRET_ACCESS_KEY_ID',
                                  aws_secret_access_key='MY_SECRET_ACCESS_KEY')
boto_client.upload_file( FILE_PATHNAME, BUCKETNAME, OBJECT_KEYNAME )
boto_client.put_object_acl( ACL='public-read', Bucket=BUCKETNAME, Key=OBJECT_KEYNAME )
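For reference, the ACL can likely also be set in the same call via upload_file's ExtraArgs, using the 'ACL' key rather than the raw 'x-amz-acl' header name (a sketch, assuming Spaces honours it the same way S3 does):
# Set the canned ACL during upload instead of a separate put_object_acl call.
boto_client.upload_file( FILE_PATHNAME, BUCKETNAME, OBJECT_KEYNAME,
                         ExtraArgs={ 'ACL': 'public-read' } )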
Is there a way at all to query on a global secondary index of DynamoDB using boto3? I don't find any online tutorials or resources.
You need to provide an IndexName parameter for the query function.
This is the name of the index, which is usually different from the name of the index attribute (the name of the index has an -index suffix by default, although you can change it during table creation). For example, if your index attribute is called video_id, your index name is probably video_id-index.
import boto3
from boto3.dynamodb.conditions import Key
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('videos')
video_id = 25
response = table.query(
    IndexName='video_id-index',
    KeyConditionExpression=Key('video_id').eq(video_id)
)
To check the index name, go to the Indexes tab of the table on the web interface of AWS. You'll need a value from the Name column.
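The index names can also be listed programmatically (a quick sketch using the low-level client; the table name is a placeholder):
import boto3
# Describe the table and print the names of its global secondary indexes.
client = boto3.client('dynamodb')
desc = client.describe_table(TableName='videos')
for gsi in desc['Table'].get('GlobalSecondaryIndexes', []):
    print(gsi['IndexName'])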
For anyone using the boto3 client, the example below should work:
import boto3
# for production
client = boto3.client('dynamodb')
# for local development if running local dynamodb server
client = boto3.client(
    'dynamodb',
    region_name='localhost',
    endpoint_url='http://localhost:8000'
)
resp = client.query(
    TableName='UsersTable',
    IndexName='MySecondaryIndexName',
    ExpressionAttributeValues={
        ':v1': {
            'S': 'some#email.com',
        },
    },
    KeyConditionExpression='emailField = :v1',
)
# will always return list
items = resp.get('Items')
first_item = items[0]
Adding the updated technique:
import boto3
from boto3.dynamodb.conditions import Key, Attr
dynamodb = boto3.resource(
    'dynamodb',
    region_name='localhost',
    endpoint_url='http://localhost:8000'
)
table = dynamodb.Table('userTable')
attributes = table.query(
    IndexName='UserName',
    KeyConditionExpression=Key('username').eq('jdoe')
)
if 'Items' in attributes and len(attributes['Items']) == 1:
    attributes = attributes['Items'][0]
There are so many questions like this because calling DynamoDB through boto3 is not intuitive. I use the dynamof library to make calls like this a lot simpler. Using dynamof, the call looks like this.
from dynamof.operations import query
from dynamof.conditions import attr
query(
    table_name='users',
    conditions=attr('role').equals('admin'),
    index_name='role_lookup_index')
https://github.com/rayepps/dynamof
disclaimer: I wrote dynamof