Google Cloud Storage to BigQuery policy tags don't work - google-cloud-platform

Context
On Airflow, I'm using the GoogleCloudStorageToBigQueryOperator to load files from Google Cloud Storage into BigQuery.
The schema is defined as per the BigQuery table schema documentation.
Policy tags are implemented as per the documentation and tested manually via the UI, where they work as expected.
Blocker
The policy tags are not applied when the load completes, even though they are specified in the schema fields. The other schema fields work as expected.
import airflow
from airflow import DAG
from google.cloud import bigquery
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

with DAG(
        'gcs_to_bq',
        catchup=False,
        default_args=default_args,
        schedule_interval=None) as dag:

    DATASET_NAME = "temp"
    TABLE_NAME = "table"

    gcs_to_bq_load = GoogleCloudStorageToBigQueryOperator(
        task_id='gcs_to_bq_load',
        bucket="temp-bucket",
        source_objects=['dummy_data/data.csv'],
        source_format='CSV',
        skip_leading_rows=1,
        write_disposition='WRITE_TRUNCATE',
        destination_project_dataset_table=f"{DATASET_NAME}.{TABLE_NAME}",
        schema_fields=[
            {
                "name": "id",
                "mode": "NULLABLE",
                "type": "INT64",
                "fields": []
            },
            {
                "name": "email",
                "mode": "REQUIRED",
                "type": "STRING",
                "description": "test policy tags",
                "policyTags": {
                    "names": ["projects/project-id/locations/location/taxonomies/taxonomy-id/policyTags/policytag-id"]
                }
            },
            {
                "name": "created_at",
                "mode": "NULLABLE",
                "type": "DATE",
                "fields": []
            }
        ],
        dag=dag)

    gcs_to_bq_load
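For what it's worth, a quick way to confirm whether any policy tags actually made it onto the destination table is to inspect its schema with the BigQuery Python client after the load. This is only a diagnostic sketch; the table reference is a placeholder built from the values above.

from google.cloud import bigquery

# Diagnostic sketch (not part of the DAG): print each column's policy tags
# after the load finishes. The table reference is a placeholder.
client = bigquery.Client()
table = client.get_table("temp.table")
for field in table.schema:
    # field.policy_tags is None when no policy tag is attached to the column
    print(field.name, field.policy_tags)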

Related

django.request logger to find "Synchronous middleware … adapted" for Django async

I've set up a trial async view in my Django app but the view continues to render in sync. As per Django docs, I'm checking that my Middleware isn't causing the issue:
Middleware can be built to support both sync and async contexts. Some of Django’s middleware is built like this, but not all. To see what middleware Django has to adapt, you can turn on debug logging for the django.request logger and look for log messages about “Synchronous middleware … adapted”.
There has already been a Stack Overflow question elaborating on using the logger to find which middleware is preventing async from working, but the answer is incomplete.
This is what I have in my settings, as per the above Stack Overflow answer:
settings.py
import os  # needed for os.getenv below

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        'django.request': {
            'handlers': ['console'],
            'level': os.getenv('DJANGO_LOG_LEVEL', 'DEBUG'),
            'propagate': False,
        },
    },
}
And my view:
views.py
from time import sleep
import asyncio
import logging

logger = logging.getLogger("info")

# --- Views
from django.db import transaction
from django.http import HttpResponse  # needed for the response returned below


@transaction.non_atomic_requests
async def async0(request):
    # loop = asyncio.get_event_loop()
    # loop.create_task(lets_sleep())
    await asyncio.sleep(1)
    logger.debug('in index')
    logger.info('something')
    return HttpResponse("Hello, async Django!")
I've restarted the built-in Django server, but I don't see any log output anywhere. Where should I be looking?
Although you posted your view, the log message is about middleware, so you need a sync-only middleware in place to get the log you want. After that, you also need to add a root logger:
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
        },
    },
    "root": {
        "level": "DEBUG",
    },
    "loggers": {
        "django.request": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}
This was enough for me to get the "Synchronous middleware someMiddlewareOfYours adapted" log message.
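For reference, a sync-only middleware is simply one that declares no async support; a minimal hypothetical example looks like this, and when Django adapts it around an async view you get the debug message above:

# Hypothetical sync-only middleware: it declares no async support, so Django
# adapts it for async request paths and logs "Synchronous middleware ... adapted"
# at DEBUG level on the django.request logger.
class SyncOnlyMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Runs synchronously before and after the view.
        return self.get_response(request)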

How to transform data before loading into BigQuery in Apache Airflow?

I am new to Apache Airflow. My task is to read data from Google Cloud Storage, transform it, and upload the transformed data into a BigQuery table. I'm able to get data from the Cloud Storage bucket and store it directly in BigQuery, but I'm not sure how to include the transform function in this pipeline.
Here's my code:
# Import libraries needed for the operation
import airflow
from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# Default Arguments
default_args = {
    'owner': <OWNER_NAME>,
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=2),
}

# DAG Definition
dag = DAG('load_from_bucket_to_bq',
          schedule_interval='0 * * * *',
          default_args=default_args)

# Variable Configurations
BQ_CONN_ID = <CONN_ID>
BQ_PROJECT = <PROJECT_ID>
BQ_DATASET = <DATASET_ID>

with dag:
    # Tasks
    start = DummyOperator(
        task_id='start'
    )
    upload = GoogleCloudStorageToBigQueryOperator(
        task_id='load_from_bucket_to_bigquery',
        bucket=<BUCKET_NAME>,
        source_objects=['*.csv'],
        schema_fields=[
            {'name': 'Active_Cases', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Country', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Last_Update', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'New_Cases', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'New_Deaths', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Total_Cases', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Total_Deaths', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Total_Recovered', 'type': 'STRING', 'mode': 'NULLABLE'},
        ],
        destination_project_dataset_table=BQ_PROJECT + '.' + BQ_DATASET + '.' + <TABLE_NAME>,
        write_disposition='WRITE_TRUNCATE',
        google_cloud_storage_conn_id=BQ_CONN_ID,
        bigquery_conn_id=BQ_CONN_ID,
        dag=dag
    )
    end = DummyOperator(
        task_id='end'
    )
    # Setting Dependencies
    start >> upload >> end
Any help on how to proceed is appreciated. Thanks.
Posting the conversation with @sachinmb27 as an answer. The transform can be placed in a Python function, and a PythonOperator can be used to call that function at runtime. More details on which operators are available in Airflow are in the Airflow operator docs.
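Roughly, that could look like the sketch below. The object paths, /tmp file names, and the transformation itself are made up for illustration; the hook method names follow the Airflow 1.10 contrib GoogleCloudStorageHook and may differ in other versions.

from airflow.operators.python_operator import PythonOperator
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook


def transform_csv():
    # Download the raw file, transform it locally, and upload the result back to GCS.
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id=BQ_CONN_ID)
    hook.download(bucket=<BUCKET_NAME>, object='raw/data.csv', filename='/tmp/data.csv')
    with open('/tmp/data.csv') as src, open('/tmp/data_transformed.csv', 'w') as dst:
        for line in src:
            dst.write(line.upper())  # placeholder transformation
    hook.upload(bucket=<BUCKET_NAME>, object='transformed/data.csv',
                filename='/tmp/data_transformed.csv')


transform = PythonOperator(
    task_id='transform_csv',
    python_callable=transform_csv,
    dag=dag,
)

# The load task would then point source_objects at 'transformed/*.csv' and the
# dependency chain becomes: start >> transform >> upload >> end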

How to get the Cognito users list in JSON format

I'm going to back up my Cognito users with Lambda, but I can't get the Cognito users list in JSON format with boto3. I do:
import boto3
import os
import json
from botocore.exceptions import ClientError

COGNITO_POOL_ID = os.getenv('POOL_ID')
S3_BUCKET = os.getenv('BACKUP_BUCKET')
ENV_NAME = os.getenv('ENV_NAME')
filename = ENV_NAME + "-cognito-backup.json"
REGION = os.getenv('REGION')

cognito = boto3.client('cognito-idp', region_name=REGION)
s3 = boto3.resource('s3')


def lambda_handler(event, context):
    try:
        response = (cognito.list_users(UserPoolId=COGNITO_POOL_ID, AttributesToGet=['email_verified', 'email']))['Users']
        data = json.dumps(str(response)).encode('UTF-8')
        s3object = s3.Object(S3_BUCKET, filename)
        s3object.put(Body=(bytes(data)))
    except ClientError as error:
        print(error)
But I get one string, and I'm not sure it's JSON at all:
[{'Username': 'user1', 'Attributes': [{'Name': 'email_verified', 'Value': 'true'}, {'Name': 'email', 'Value': 'user1@xxxx.com'}], 'UserCreateDate': datetime.datetime(2020, 2, 10, 13, 13, 34, 457000, tzinfo=tzlocal()), 'UserLastModifiedDate': datetime.datetime(2020, 2, 10, 13, 13, 34, 457000, tzinfo=tzlocal()), 'Enabled': True, 'UserStatus': 'FORCE_CHANGE_PASSWORD'}]
I need something like this:
[
    {
        "Username": "user1",
        "Attributes": [
            {
                "Name": "email_verified",
                "Value": "true"
            },
            {
                "Name": "email",
                "Value": "user1@xxxx.com"
            }
        ],
        "Enabled": "true",
        "UserStatus": "CONFIRMED"
    }
]
Try this:
import ast
import json
print(ast.literal_eval(json.dumps(response)))
This is for the dict response from the SDK.
Edit: I just realized that since the list_users response also includes a UserCreateDate object, json.dumps will complain about the conversion due to the datetime value of the UserCreateDate key. If you take that out, this will work without the ast module:
import json

data = {'Username': 'Google_11761250', 'Attributes': [{'Name': 'email', 'Value': 'abc@gmail.com'}], 'Enabled': True, 'UserStatus': 'EXTERNAL_PROVIDER'}
print(json.dumps(data))
> {"Username": "Google_11761250", "Attributes": [{"Name": "email", "Value": "abc@gmail.com"}], "Enabled": true, "UserStatus": "EXTERNAL_PROVIDER"}
You can check the output type by using
type(output)
I guess it will be a list, so you can convert it to JSON and pretty-print it using:
print(json.dumps(output, indent=4))
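Note that json.dumps will still fail on the datetime fields (UserCreateDate, UserLastModifiedDate) in the list_users response. A common workaround, sketched below under that assumption, is to pass default=str so those values are stringified instead of raising a TypeError:

import json

# Sketch: serialize the boto3 list_users response; default=str converts the
# datetime fields to strings instead of raising TypeError.
pretty = json.dumps(response, indent=4, default=str)
print(pretty)

# The resulting JSON string can then be written to S3 as the backup body:
# s3object.put(Body=pretty.encode('UTF-8'))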

IntegrityError when loading Django fixtures with OneToOneField using SQLite

When attempting to load initial data via the syncdb command, Django throws the following error:
django.db.utils.IntegrityError: Problem installing fixtures: The row in table 'user_profile_userprofile' with primary key '1' has an invalid foreign key: user_profile_userprofile.user_id contains a value '1' that does not have a corresponding value in user_customuser.id.
There is a one-to-one relationship between the UserProfile model and CustomUser:
class UserProfile(TimeStampedModel):
    user = models.OneToOneField(settings.AUTH_USER_MODEL, null=True, blank=True)
Running ./manage.py dumpdata user --format=json --indent=4 --natural-foreign produces the following:
CustomUser Model Data
[
    {
        "fields": {
            "first_name": "Test",
            "last_name": "Test",
            "is_active": true,
            "is_superuser": true,
            "is_staff": true,
            "last_login": "2014-10-21T11:33:42Z",
            "groups": [],
            "user_permissions": [],
            "password": "pbkdf2_sha256$12000$Wqd4ekGdmySy$Vzd/tIFIoSABP9J0GyDRwCgVh5+Zafn9lOiTGin9/+8=",
            "email": "test@test.com",
            "date_joined": "2014-10-21T11:22:58Z"
        },
        "model": "user.customuser",
        "pk": 1
    }
]
Running ./manage.py dumpdata user_profile --format=json --indent=4 --natural-foreign produces the following:
Profile Model
[
    {
        "fields": {
            "weight": 75.0,
            "created": "2014-10-21T11:23:35.536Z",
            "modified": "2014-10-21T11:23:35.560Z",
            "height": 175,
            "user": 1
        },
        "model": "user_profile.userprofile",
        "pk": 1
    }
]
Loading just the CustomUser model's initial data and then following up with UserProfile via loaddata works great, which suggests to me that syncdb is attempting to load UserProfile before CustomUser has been loaded.
If the simplest solution is to force the load order, what would be the simplest way to do that?
I guess you should use migrations (https://docs.djangoproject.com/en/1.7/topics/migrations/), since they are ordered. But if you are using a Django version older than 1.7, install South (https://south.readthedocs.org/en/latest/).
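If you only need to force the fixture load order for now, fixtures are installed in the order they are given to loaddata, so one option (the fixture file names below are hypothetical) is to load the user fixture before the profile fixture:

# Force the load order by listing fixtures explicitly; file names are hypothetical.
# Shell equivalent: ./manage.py loaddata customuser.json userprofile.json
from django.core.management import call_command

call_command('loaddata', 'customuser.json')   # users first
call_command('loaddata', 'userprofile.json')  # then the profiles that reference them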

django-social-auth: logging in from a unit test client

I use django-social-auth as my authentication mechanism and I need to test my app with logged in users. I'm trying:
from django.test import Client

c = Client()
c.login(username='myfacebook@username.com', password='myfacebookpassword')
The user that is trying to log in succeeds in logging in from a browser. The app is already allowed to access the user's data.
Any ideas how to log in from a unit test when using django-social-auth as the authentication mechanism?
Thanks
Create a fixture with User instances
{
    "pk": 15,
    "model": "auth.user",
    "fields": {
        "username": "user",
        "first_name": "user",
        "last_name": "userov",
        "is_active": true,
        "is_superuser": false,
        "is_staff": false,
        "last_login": "2012-07-20 15:37:03",
        "groups": [],
        "user_permissions": [],
        "password": "!",
        "email": "",
        "date_joined": "2012-07-18 13:29:53"
    }
}
Create a fixture with UserSocialAuth instances like this:
{
    "pk": 7,
    "model": "social_auth.usersocialauth",
    "fields": {
        "uid": "1234567",
        "extra_data": "%some account social data%",
        "user": 15,
        "provider": "linkedin"
    }
}
So you will get a user who behaves the same as a real user and has all the social data you need.
Set a new password, and then you can use the standard auth mechanism to log this user in:
...
user.set_password('password')
user.save()
logged_in = self.client.login(username='user', password='password')
and then just call the view with login required
self.client.get("some/url")
Don't forget that django.contrib.auth.backends.ModelBackend is needed, and that django.contrib.sessions should be in your INSTALLED_APPS tuple.
Also, the advantage of using standard auth is that you don't need to make a server request to get OAuth tokens and so on.
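In settings terms that means something like the sketch below; the Facebook backend path is just an example for django-social-auth, so use whichever provider backends you actually have configured:

# settings.py sketch; the social backend path is an example and depends on the
# provider(s) configured for django-social-auth.
AUTHENTICATION_BACKENDS = (
    'social_auth.backends.facebook.FacebookBackend',  # example social backend
    'django.contrib.auth.backends.ModelBackend',      # lets self.client.login() work in tests
)

INSTALLED_APPS = [
    # ...
    'django.contrib.sessions',
    'social_auth',
]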