Google Dataprep copy flows from one project to another - google-cloud-platform

I have two Google projects: dev and prod. I also import data from different storage buckets located in these projects: dev-bucket and prod-bucket.
After I have made and tested changes in the dev environment, how can I smoothly apply (deploy/copy) the changes to prod as well?
What I do now is export the flow from dev and then re-import it into prod. However, each time I need to manually do the following in the prod flows:
Change the datasets that serve as inputs in the flow
Replace the manual and scheduled destinations for the right BigQuery dataset (dev-dataset-bigquery and prod-dataset-bigquery)
How can this be done more smoothly?

If you want to copy data between Google Cloud Storage (GCS) buckets dev-bucket and prod-bucket, Google provides a Storage Transfer Service with this functionality. https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console You can either manually trigger data to be copied from one bucket to another or have it run on a schedule.
For the second part, it sounds like both dev-dataset-bigquery and prod-dataset-bigquery are loaded from files in GCS? If this is the case, the BigQuery Transfer Service may be of use. https://cloud.google.com/bigquery/docs/cloud-storage-transfer You can trigger a transfer job manually, or have it run on a schedule.
As others have said in the comments, if you need to verify data before initiating transfers from dev to prod, a CI system such as Spinnaker may help. If the verification can be automated, a system such as Apache Airflow (running on Cloud Composer, if you want a hosted version) provides more flexibility than the transfer services.

Follow the procedure below to move a plan from one environment to another using the API and to update the input dataset and output for the new environment (a Python sketch tying these calls together follows the steps).
1) Export a plan
GET https://api.clouddataprep.com/v4/plans/<plan_id>/package
2) Import the plan
POST https://api.clouddataprep.com/v4/plans/package
3) Update the input dataset
PUT https://api.clouddataprep.com/v4/importedDatasets/<dataset_id>
{
  "name": "<new_dataset_name>",
  "bucket": "<bucket_name>",
  "path": "<bucket_file_name>"
}
4) Update the output
PATCH https://api.clouddataprep.com/v4/outputObjects/<output_id>
{
  "publications": [
    {
      "path": ["<project_name>", "<dataset_name>"],
      "tableName": "<table_name>",
      "targetType": "bigquery",
      "action": "create"
    }
  ]
}
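For reference, a minimal Python sketch tying the four calls together with the requests library might look like the following. The access token, the placeholder IDs, and the package-upload format are assumptions on my part; check the Dataprep API documentation for the exact request shapes.

import requests

BASE = "https://api.clouddataprep.com/v4"
HEADERS = {"Authorization": "Bearer <access_token>"}  # hypothetical token

# 1) Export the plan package from the dev workspace
resp = requests.get(f"{BASE}/plans/<plan_id>/package", headers=HEADERS)
resp.raise_for_status()
with open("plan_package.zip", "wb") as f:
    f.write(resp.content)

# 2) Import the package into the prod workspace
#    (assumed to be a multipart file upload; verify against the API docs)
with open("plan_package.zip", "rb") as f:
    resp = requests.post(f"{BASE}/plans/package", headers=HEADERS,
                         files={"file": f})
resp.raise_for_status()

# 3) Point the imported dataset at the prod bucket
requests.put(f"{BASE}/importedDatasets/<dataset_id>", headers=HEADERS,
             json={"name": "<new_dataset_name>",
                   "bucket": "prod-bucket",
                   "path": "<bucket_file_name>"}).raise_for_status()

# 4) Repoint the output at the prod BigQuery dataset
requests.patch(f"{BASE}/outputObjects/<output_id>", headers=HEADERS,
               json={"publications": [{
                   "path": ["<project_name>", "prod-dataset-bigquery"],
                   "tableName": "<table_name>",
                   "targetType": "bigquery",
                   "action": "create"}]}).raise_for_status()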

Related

What precautions do I need to take when sharing an AWS Amplify project publicly?

I'm creating a security camera IoT project that uploads images to S3 and will soon offer a UI to review those images. AWS Amplify is being used to make this happen quickly.
As I get started on the Amplify side of things, I'm noticing a config file that has very specifically named attributes and values. The team-provider-info.json file in particular, which isn't gitignored, is very specific:
{
  "dev": {
    "awscloudformation": {
      "AuthRoleName": "amplify-twintigersecurityweb-dev-123456-authRole",
      "UnauthRoleArn": "arn:aws:iam::111164163333:role/amplify-twintigersecurityweb-dev-123456-unauthRole",
      "AuthRoleArn": "arn:aws:iam::111164163333:role/amplify-twintigersecurityweb-dev-123456-authRole",
      "Region": "us-east-1",
      "DeploymentBucketName": "amplify-twintigersecurityweb-dev-123456-deployment",
      "UnauthRoleName": "amplify-twintigersecurityweb-dev-123456-unauthRole",
      "StackName": "amplify-twintigersecurityweb-dev-123456",
      "StackId": "arn:aws:cloudformation:us-east-1:111164163333:stack/amplify-twintigersecurityweb-dev-123456/88888888-8888-8888-8888-888838f58888",
      "AmplifyAppId": "dddd7dx2zipppp"
    }
  }
}
May I post this to my public repository without worry? Is there a chance for conflict in naming? How would one pull this in for use in their new project?
Per AWS Amplify documentation:
If you want to share a project publicly and open source your serverless infrastructure, you should remove or put the amplify/team-provider-info.json file in gitignore file.
At a glance, everything else generated by amplify init that is NOT in the .gitignore file is OK to share, e.g. project-config.json and backend-config.json.
Add this to .gitignore:
# not to share if public
amplify/team-provider-info.json

Export result from Bigquery to a google bucket

I have to send the results from a BigQuery query to a Google Storage bucket. I'm used to sending them to tables like this:
{
  "schedule": null,
  "owner": "agf#jdfgdfgs.es",
  "email": ["l1o3t0y2h5o3v6o3#jggvgfvf.com"],
  "task_config": {
    "orders": {
      "destination_dataset_table": "international_reporting.orders"
    }
  }
}
I write this in a JSON file inside a GitHub repository, and that repository is then read by Airflow. Those GitHub repos almost always contain a config JSON and the SQL queries to execute. I don't know how to point to the bucket in Google Cloud Storage. I would prefer to do it this way in order to keep the same style as the others, i.e. I can't use Python.
Can you help me?
Please and thank you

Is there a way to create Quicksight analysis purely through code (boto3)?

What I currently have in my Quicksight account is a Data Source (Redshift), some datasets (some Redshift views) and an analysis (graphs and charts that use the datasets). I can view all of these on the AWS Quicksight Console. But when I use boto3 to create a data source and datasets, nothing shows up on the console. They do however show up when I use the list_data_sources and list_data_sets calls.
After this, I need to create through code all the graphs that I created manually. I can't currently find an option to do this through code. There is a 'create_template' API call which is supposed to create a template from an existing QuickSight analysis, but it requires the ARN of the analysis, which I can't find.
Any suggestions on what to do?
Note: this only answers why the data sets/sources do not appear in the console. As for the other question, I assume mjgpy3 was of some help.
Summary
Add the permissions at the bottom of this post to your data set and data source in order for them to appear in the console. Make sure to fill in the principal arn with your details.
Details
In order for data sets and data sources to appear in the console when created via the API, you must ensure that the correct permissions have been added to them. Without adding the correct permissions, it is true that the CLI lists them whereas the console does not.
If you have created data sets/sources via the console, you can use the CLI (aws quicksight describe-data-set-permissions and aws quicksight describe-data-source-permissions) to view what permissions AWS gives them so that your account can interact with them.
I've tested this and these are what AWS assigns them as of 25/03/2020.
Data Set permissions:
"permissions": [
{
"Principal": "arn:aws:quicksight:<region>:<aws_account_id>:user/default/{IAM user name}",
"Actions": [
"quicksight:UpdateDataSetPermissions",
"quicksight:DescribeDataSet",
"quicksight:DescribeDataSetPermissions",
"quicksight:PassDataSet",
"quicksight:DescribeIngestion",
"quicksight:ListIngestions",
"quicksight:UpdateDataSet",
"quicksight:DeleteDataSet",
"quicksight:CreateIngestion",
"quicksight:CancelIngestion"
]
}
]
Data Source permissions:
"permissions": [
{
"Principal": "arn:aws:quicksight:<region>:<aws_account_id>:user/default/{IAM user name}",
"Actions": [
"quicksight:UpdateDataSourcePermissions",
"quicksight:DescribeDataSource",
"quicksight:DescribeDataSourcePermissions",
"quicksight:PassDataSource",
"quicksight:UpdateDataSource",
"quicksight:DeleteDataSource"
]
}
]
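As an illustration, here is a minimal boto3 sketch that grants the data set permissions above to your console user after a data set has been created via the API; the account ID, region, user name, and data set ID are placeholders. create_data_set and create_data_source also accept the same structure directly through their Permissions parameter.

import boto3

quicksight = boto3.client("quicksight")

account_id = "111122223333"  # placeholder AWS account ID
principal = (f"arn:aws:quicksight:us-east-1:{account_id}"
             ":user/default/my-iam-user")  # placeholder QuickSight user

data_set_actions = [
    "quicksight:UpdateDataSetPermissions",
    "quicksight:DescribeDataSet",
    "quicksight:DescribeDataSetPermissions",
    "quicksight:PassDataSet",
    "quicksight:DescribeIngestion",
    "quicksight:ListIngestions",
    "quicksight:UpdateDataSet",
    "quicksight:DeleteDataSet",
    "quicksight:CreateIngestion",
    "quicksight:CancelIngestion",
]

# Grant the console user access to a data set that was created via the API
quicksight.update_data_set_permissions(
    AwsAccountId=account_id,
    DataSetId="my-data-set-id",  # placeholder data set ID
    GrantPermissions=[{"Principal": principal, "Actions": data_set_actions}],
)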
It sounds like your smaller question is regarding the ARN of the analysis.
The format of analysis ARNs is
arn:aws:quicksight:$AWS_REGION:$AWS_ACCOUNT_ID:analysis/$ANALYSIS_ID
Where
$AWS_REGION is replaced with the region in which the analysis lives
$AWS_ACCOUNT_ID is replaced with your AWS account ID
$ANALYSIS_ID is replaced with the analysis ID
If you're looking for the $ANALYSIS_ID, it's the GUID-looking string at the end of the analysis URL in QuickSight.
So, if you were on an analysis at the URL
https://quicksight.aws.amazon.com/sn/analyses/018ef6393-2c71-4842-9798-1aa2f0902804
the analysis ID would be 018ef6393-2c71-4842-9798-1aa2f0902804 (this is a fake ID I injected for this example).
Your larger question seems to be whether you can use the create_template API to duplicate your analysis. The answer at this moment (12/16/19) is, unfortunately, no.
You can use the create_dashboard API to publish a Dashboard from a template made with create_template but you can't create an Analysis from a template.
I'm answering this bit just to clarify since you may actually be okay with creating a dashboard (basically the published version of an analysis) rather than another analysis.
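To make that concrete, a hedged boto3 sketch of going analysis -> template -> dashboard might look like the following; all IDs, names, and data set references are placeholders you would replace with your own.

import boto3

quicksight = boto3.client("quicksight")
account_id = "111122223333"  # placeholder AWS account ID

# ARN of the existing analysis (ID taken from the analysis URL)
analysis_arn = (f"arn:aws:quicksight:us-east-1:{account_id}"
                ":analysis/018ef6393-2c71-4842-9798-1aa2f0902804")

# Each data set used by the analysis needs a placeholder name and its ARN
data_set_refs = [{
    "DataSetPlaceholder": "main",  # placeholder
    "DataSetArn": (f"arn:aws:quicksight:us-east-1:{account_id}"
                   ":dataset/my-data-set-id"),  # placeholder
}]

# Create a template from the existing analysis
template = quicksight.create_template(
    AwsAccountId=account_id,
    TemplateId="my-template-id",
    Name="my-template",
    SourceEntity={"SourceAnalysis": {"Arn": analysis_arn,
                                     "DataSetReferences": data_set_refs}},
)
# Template creation is asynchronous; poll describe_template until it succeeds.

# Publish a dashboard (not an analysis) from the template
quicksight.create_dashboard(
    AwsAccountId=account_id,
    DashboardId="my-dashboard-id",
    Name="my-dashboard",
    SourceEntity={"SourceTemplate": {"Arn": template["Arn"],
                                     "DataSetReferences": data_set_refs}},
)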
There are multiple ways to find the associated analysis ID. Use any of the following (a short boto3 sketch follows this list).
A dashboard URL has the dashboard ID included. Use this ID with the describe-dashboard API call and you will see the analysis ARN in the source entity.
Click the "Save as" option on the dashboard and it will take you to the associated analysis. [You might not see this option if the dashboard was created from a template.]
A dashboard ID can also be found by using the list_dashboards API call. Print all the dashboard IDs and names, and match the ID with the given dashboard name. Look at the whole list, because a dashboard ID is unique but the dashboard name is not; you can have multiple dashboards with the same name.
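For example, a small boto3 sketch of the first and third options; the account ID and dashboard ID are placeholders, and the source analysis ARN is reported in the dashboard's version information.

import boto3

quicksight = boto3.client("quicksight")
account_id = "111122223333"  # placeholder AWS account ID

# Option 3: list dashboards and match on the (non-unique) name
for summary in quicksight.list_dashboards(
        AwsAccountId=account_id)["DashboardSummaryList"]:
    print(summary["DashboardId"], summary["Name"])

# Option 1: describe a dashboard to find the analysis ARN it was published from
dashboard = quicksight.describe_dashboard(
    AwsAccountId=account_id,
    DashboardId="my-dashboard-id")["Dashboard"]  # placeholder dashboard ID
print(dashboard["Version"]["SourceEntityArn"])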
Yes, you can create a Lambda function and trigger it using a cron job:
import boto3

quicksight = boto3.client('quicksight')
response = quicksight.create_ingestion(AwsAccountId='XXXXXXX',
                                       DataSetId='YYYY',
                                       IngestionId='ZZZZ')
https://aws.amazon.com/blogs/big-data/automate-dataset-monitoring-in-amazon-quicksight/
https://aws.amazon.com/blogs/big-data/event-driven-refresh-of-spice-datasets-in-amazon-quicksight/
I've been playing with this as well and ran into the same issue. Make sure that your permissions are set up properly for the data source and the data set by referencing the quicksight user as follows:
arn:aws:quicksight:{region}:xxxxxxxxxx:user/default/{user}
I would include all the quicksight permissions found in the docs to start with and shave down from there. If nothing else, create the data source/set from the console, and then use the describe-* CLI call to see what they use.
It's kind of wonky.

Permissions Issue with Google Cloud Data Fusion

I'm following the instructions in the Cloud Data Fusion sample tutorial and everything seems to work fine, until I try to run the pipeline right at the end. Cloud Data Fusion Service API permissions are set for the Google managed Service account as per the instructions. The pipeline preview function works without any issues.
However, when I deploy and run the pipeline it fails after a couple of minutes. Shortly after the status changes from provisioning to running the pipeline stops with the following permissions error:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "xxxxxxxxxxx-compute#developer.gserviceaccount.com does not have storage.buckets.create access to project X.",
    "reason" : "forbidden"
  } ],
  "message" : "xxxxxxxxxxx-compute#developer.gserviceaccount.com does not have storage.buckets.create access to project X."
}
xxxxxxxxxxx-compute#developer.gserviceaccount.com is the default Compute Engine service account for my project.
"Project X" is not one of mine though, I've no idea why the pipeline startup code is trying to create a bucket there, it does successfully create temporary buckets ( one called df-xxx and one called dataproc-xxx) in my project before it fails.
I've tried this with two separate accounts and get the same error in both places. I had tried adding storage/admin roles to the various service accounts to no avail but that was before I realized it was attempting to access a different project entirely.
I believe I was able to reproduce this. What's happening is that the BigQuery Source plugin first creates a temporary working GCS bucket to export the data to, and I suspect it is attempting to create it in the Dataset Project ID by default, instead of your own project as it should.
As a workaround, create a GCS bucket in your account, and then in the BigQuery Source configuration of your pipeline, set the "Temporary Bucket Name" configuration to "gs://<your-bucket-name>"
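If you would rather script that workaround, a minimal google-cloud-storage sketch for creating the bucket might look like this; the project ID, bucket name, and location are placeholders.

from google.cloud import storage

# Create a temporary working bucket in your own project
client = storage.Client(project="my-project-id")  # placeholder project
bucket = client.create_bucket("my-bq-temp-bucket", location="US")  # placeholders

# Use this value for the "Temporary Bucket Name" setting in the BigQuery Source
print(f"gs://{bucket.name}")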
You are missing the permission setup steps after you create an instance. The instructions for giving your service account the right permissions are on this page: https://cloud.google.com/data-fusion/docs/how-to/create-instance

Changing Storage class from Multi-Regional to Coldline in Google Cloud Platform

I just finished my 1 year free trial with Google Cloud Platform and I am now being billed.
When I set my first project up, it looks like I set it up as Multi-Regional. I would only use Google Cloud Storage in the event of a catastrophic failure in my home where I lose data on both internal and external hard drives (i.e. fire, etc.). I believe for this type of backup I only need Coldline storage. I did change my project over to Coldline, but it looks like it only changes new data, not the originally stored data, because I am still being charged for Multi-Regional storage.
From what I understand, I have to change the Object Storage Class either by overwriting the data using "gsutil rewrite -s [STORAGE_CLASS] gs://[PATH_TO_OBJECT]" or by Object Lifecycle Management. I could not figure out how to do either, so I need help doing this (I am not even sure where to type these commands or which approach to use (I am not a programmer!!)).
I also saw in another post that my gsutil command needs to be version 4.22 or higher. How do I check this? I also saw in that post that the [PATH_TO_OBJECT] is My Bucket. I see a Project Name, Project ID, and Project Number. Which of these (if any) are used in that field for My Bucket?
Thank you for any help
I also saw in another post that my gsutil command needs to be version 4.22 or higher. How do I check this?
Get the gsutil version:
gsutil version
Update the Cloud SDK which includes gsutil:
Windows:
Open a command prompt with Administrator rights
gcloud components update
Linux:
gcloud components update
I see a Project Name, Project ID, and Project number. Which of these
(if any) are used in that field for My Bucket.
Use the PROJECT_ID. To get a list of the projects that you have access to, use the following command, which lists each project:
gcloud projects list
To see which is your default project:
gcloud config list project
If the default project is blank or the wrong one, use the following command.
To set the default project:
gcloud config set project [PROJECT_ID]
From what I understand, I have to change the Object Storage Class either by overwriting the data
Assuming your bucket name is mybucket.
STEP 1: Change the default storage class for the bucket:
gsutil defstorageclass set coldline gs://mybucket
STEP 2: Change the storage class for each object manually. This is an option if you want to just select a few files.
gsutil rewrite -s coldline gs://mybucket/objectname
STEP 3: Verify the existing lifecycle policy. Change step 4 accordingly if an existing policy exists.
gsutil lifecycle get gs://mybucket
STEP 4: Change the lifecycle of the bucket. This policy will move all files older than 7 days to coldline storage.
POLICY (write to lifecycle.json):
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "COLDLINE"
        },
        "condition": {
          "age": 7,
          "matchesStorageClass": [
            "MULTI_REGIONAL",
            "STANDARD",
            "DURABLE_REDUCED_AVAILABILITY"
          ]
        }
      }
    ]
  }
}
Command:
gsutil lifecycle set lifecycle.json gs://mybucket
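If you later want to script the same changes instead of using gsutil, a hedged sketch with the google-cloud-storage Python client might look like this; the bucket name is a placeholder.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("mybucket")  # placeholder bucket name

# Default storage class for newly written objects
bucket.storage_class = "COLDLINE"

# Lifecycle rule: move objects older than 7 days to Coldline
bucket.add_lifecycle_set_storage_class_rule(
    "COLDLINE",
    age=7,
    matches_storage_class=["MULTI_REGIONAL", "STANDARD",
                           "DURABLE_REDUCED_AVAILABILITY"],
)
bucket.patch()  # push both changes to the bucket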