Confused about GCP Dataproc softwareConfig values - google-cloud-platform

I'm attempting to modify Airflow's Dataproc operator to add Anaconda and Jupyter to the cluster.
I'm overriding DataprocClusterCreateOperator to include optionalComponents.
After reading the Google docs, I understand that I need to pass an enum.
Every time I try to run this task I encounter invalid value errors, or TypeError: Object of type 'EnumMeta' is not JSON serializable.
I'd really appreciate it if someone could tell me how to correctly pass in this field.
cluster_data = {
    'projectId': self.project_id,
    'clusterName': self.cluster_name,
    'config': {
        'gceClusterConfig': {
        },
        'masterConfig': {
            'numInstances': self.num_masters,
            'machineTypeUri': master_type_uri,
            'diskConfig': {
                'bootDiskType': self.master_disk_type,
                'bootDiskSizeGb': self.master_disk_size
            }
        },
        'workerConfig': {
            'numInstances': self.num_workers,
            'machineTypeUri': worker_type_uri,
            'diskConfig': {
                'bootDiskType': self.worker_disk_type,
                'bootDiskSizeGb': self.worker_disk_size
            }
        },
        'secondaryWorkerConfig': {},
        'softwareConfig': {
            # I've tried the following:
            'optionalComponents': 'ANACONDA,JUPYTER'
            # ...and also (with: from google.cloud.dataproc_v1 import enums):
            # 'optionalComponents': [enums.Component.ANACONDA.value]
        },
    }
}

You want to use a JSON list there: ['ANACONDA', 'JUPYTER'].
As general guidance for figuring out how to structure things, you can create a cluster with gcloud and then run:
gcloud dataproc clusters describe my-cluster --format json
That --format json is the key. The result should be directly copy-pastable.
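For example, here is a minimal corrected sketch of the softwareConfig block from the dict above. The REST API expects the components as a JSON list of the enum name strings, which is also why passing the enum object itself fails JSON serialization:
'softwareConfig': {
    # Pass the component names as a list of strings, not a comma-joined
    # string and not the enum members themselves.
    'optionalComponents': ['ANACONDA', 'JUPYTER']
},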

Related

Cloud Function to import data into Cloud SQL from a Cloud Storage bucket, but getting a "schema already exists" error

I'm trying to import data into a Cloud SQL instance from a Cloud Storage bucket using a Cloud Function.
How can I delete the schemas before importing the data, using a single Cloud Function?
I am using Node.js in the Cloud Function.
error:
error: exit status 3 stdout(capped at 100k bytes): SET SET SET SET SET set_config ------------ (1 row) SET SET SET SET stderr: ERROR: schema "< >" already exists
https://cloud.google.com/sql/docs/mysql/admin-api/rest/v1beta4/instances/import
In the code below, where do I need to put the logic that deletes all existing schemas apart from the public schema?
Entry point : importDatabase
index.js
const {google} = require('googleapis');
const {auth} = require("google-auth-library");

var sqlAdmin = google.sqladmin('v1beta4');

exports.importDatabase = (_req, res) => {
    async function doIt() {
        const authRes = await auth.getApplicationDefault();
        let authClient = authRes.credential;
        var request = {
            project: 'my-project', // TODO: Update placeholder value.
            instance: 'my-instance', // TODO: Update placeholder value.
            resource: {
                importContext: {
                    kind: "sql#importContext",
                    fileType: "SQL", // CSV
                    uri: <bucket path>,
                    database: <database-name>
                    // Options for importing data as SQL statements.
                    // sqlimportOptions: {
                    // /**
                }
            },
            auth: authClient,
        };
        sqlAdmin.instances.import(request, function(err, result) {
            if (err) {
                console.log(err);
            } else {
                console.log(result);
            }
            res.status(200).send("Command completed", err, result);
        });
    }
    doIt();
};
package.json
{
    "name": "import-database",
    "version": "0.0.1",
    "dependencies": {
        "googleapis": "^39.2.0",
        "google-auth-library": "3.1.2"
    }
}
The error appears to occur because a previous, aborted import managed to create the "schema_name" schema, and this subsequent import was then run without first re-initializing the DB. See the helpful documentation on Cloud SQL import.
One way to prevent this issue is to change the create statements in the SQL file from:
CREATE SCHEMA schema_name;
to
CREATE SCHEMA IF NOT EXISTS schema_name;
As far as removing the already-created schemas is concerned: by default, only users or service accounts with the Cloud SQL Admin (roles/cloudsql.admin) or Owner (roles/owner) role have permission to delete a Cloud SQL instance. Please check the helpful documentation on cloudsql.instances.delete to understand the next steps. You can also define an IAM custom role for the user or service account that includes the cloudsql.instances.delete permission; this permission is supported in IAM custom roles.
As a best practice for import/export operations, we recommend that you adopt the principle of least privilege, which in this case means creating a custom role, adding that specific permission, and assigning it to your service account. Alternatively, the service account could be given the "Cloud SQL Admin" role or the "Cloud Composer API Service Agent" role, both of which include this permission and would therefore allow you to execute this command.
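For instance, a hedged sketch of granting that role to the function's service account with gcloud (the project ID and service account e-mail below are placeholders, not values from your setup):
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-function@my-project.iam.gserviceaccount.com" \
    --role="roles/cloudsql.admin"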
NOTE: It is recommended that you revalidate any delete actions before performing them, as they may lead to loss of useful data.

How to create an AWS SSM Document Package using Terraform

Using Terraform, I am trying to create an AWS SSM Document Package for Chrome so I can install it on various EC2 instances I have. I define these steps via Terraform:
Upload zip containing Chrome installer plus install and uninstall powershell scripts.
Add that ZIP to an SSM package.
However, when I execute terraform apply I receive the following error...
Error updating SSM document: InvalidParameterValueException:
AttachmentSource not provided in the input request.
status code: 400, request id: 8d89da70-64de-4edb-95cd-b5f52207794c
The contents of my main.tf are as follows:
# 1. Add package zip to S3
resource "aws_s3_bucket_object" "windows_chrome_executable" {
  bucket = "mybucket"
  key    = "ssm_document_packages/GoogleChromeStandaloneEnterprise64.msi.zip"
  source = "./software-packages/GoogleChromeStandaloneEnterprise64.msi.zip"
  # filemd5() hashes the file contents; md5() alone would hash the path string.
  etag   = filemd5("./software-packages/GoogleChromeStandaloneEnterprise64.msi.zip")
}
# 2. Create AWS SSM Document Package using zip.
resource "aws_ssm_document" "ssm_document_package_windows_chrome" {
  name          = "windows_chrome"
  document_type = "Package"

  attachments_source {
    key    = "SourceUrl"
    values = ["/path/to/mybucket"]
  }

  content = <<DOC
{
  "schemaVersion": "2.0",
  "version": "1.0.0",
  "packages": {
    "windows": {
      "_any": {
        "x86_64": {
          "file": "GoogleChromeStandaloneEnterprise64.msi.zip"
        }
      }
    }
  },
  "files": {
    "GoogleChromeStandaloneEnterprise64.msi.zip": {
      "checksums": {
        "sha256": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
      }
    }
  }
}
DOC
}
If I change the file from a zip to a vanilla msi I do not receive the error message; however, when I navigate to the package in the AWS console it tells me that the install.ps1 and uninstall.ps1 files are missing (since obviously they weren't included).
Has anyone experienced the above error and do you know how to resolve it? Or does anyone have reference to a detailed example of how to do this?
Thank you.
I ran into this same problem. In order to fix it, I added a trailing slash to the SourceUrl value parameter:
attachments_source {
  key    = "SourceUrl"
  values = ["/path/to/mybucket/"]
}
My best guess is that it appends the filename from the package spec to the value provided in attachments_source, so it needs the trailing slash to build a valid path to the actual file.
This is the way it should be set up for an attachment in S3:
attachments_source {
  key    = "S3FileUrl"
  values = ["s3://packer-bucket/packer_1.7.0_linux_amd64.zip"]
  name   = "packer_1.7.0_linux_amd64.zip"
}
I realized that in the above example there was no way Terraform could identify a dependency between the two resources, i.e. the S3 object needs to be created before the aws_ssm_document. Thus, I added the following explicit dependency inside the aws_ssm_document:
depends_on = [
  aws_s3_bucket_object.windows_chrome_executable
]

Idle delete configuration for PySpark Cluster on GCP

I am trying to define a create cluster function to create a cluster on Cloud Dataproc. While going through the reference material I came across an idle delete parameter (idleDeleteTtl) which would auto-delete the cluster if it is not in use for the amount of time defined. When I try to include it in the cluster config, it gives me: ValueError: Protocol message ClusterConfig has no "lifecycleConfig" field.
The create cluster function for reference:
def create_cluster(dataproc, project, zone, region, cluster_name, pip_packages):
    """Create the cluster."""
    print('Creating cluster...')
    zone_uri = \
        'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(
            project, zone)
    cluster_data = {
        'project_id': project,
        'cluster_name': cluster_name,
        'config': {
            'initialization_actions': [{
                'executable_file': 'gs://<some_path>/python/pip-install.sh'
            }],
            'gce_cluster_config': {
                'zone_uri': zone_uri,
                'metadata': {
                    'PIP_PACKAGES': pip_packages
                }
            },
            'master_config': {
                'num_instances': 1,
                'machine_type_uri': 'n1-standard-1'
            },
            'worker_config': {
                'num_instances': 2,
                'machine_type_uri': 'n1-standard-1'
            },
            'lifecycleConfig': {  #### PROBLEM AREA ####
                'idleDeleteTtl': '30m'
            }
        }
    }
    cluster = dataproc.create_cluster(project, region, cluster_data)
    cluster.add_done_callback(callback)
    global waiting_callback
    waiting_callback = True
I want similar functionality, if not in this same function. I already have a manual delete function defined, but I want to add the ability to auto-delete clusters when they are not in use.
You are calling the v1 API but passing a parameter that is part of the v1beta2 API.
Change your endpoint from:
https://www.googleapis.com/compute/v1/projects/{}/zones/{}
To this:
https://www.googleapis.com/compute/v1beta2/projects/{}/zones/{}
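For reference, here is a minimal sketch of passing the idle-delete setting through the v1beta2 Python client. The helper name is illustrative, and the snake_case field names plus the Duration-style TTL dict are assumptions about that client's dict format, not a verified drop-in for your library version:
# Hedged sketch: same cluster creation, but via the v1beta2 client surface,
# which exposes lifecycle_config / idle_delete_ttl.
from google.cloud import dataproc_v1beta2

def create_cluster_with_idle_delete(project, region, cluster_name):
    dataproc = dataproc_v1beta2.ClusterControllerClient()
    cluster_data = {
        'project_id': project,
        'cluster_name': cluster_name,
        'config': {
            'master_config': {'num_instances': 1, 'machine_type_uri': 'n1-standard-1'},
            'worker_config': {'num_instances': 2, 'machine_type_uri': 'n1-standard-1'},
            'lifecycle_config': {
                # Duration message: 1800 seconds = 30 minutes idle before auto-delete.
                'idle_delete_ttl': {'seconds': 1800}
            }
        }
    }
    # Older client versions take (project_id, region, cluster) positionally,
    # matching the dataproc.create_cluster(...) call in the question.
    return dataproc.create_cluster(project, region, cluster_data)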

Mapping geo_point data when importing data to AWS Elasticsearch

I have a set of data inside DynamoDB that I am importing to AWS Elasticsearch using this tutorial: https://medium.com/@vladyslavhoncharenko/how-to-index-new-and-existing-amazon-dynamodb-content-with-amazon-elasticsearch-service-30c1bbc91365
I need to change the mapping of a part of that data to geo_point.
I have tried creating the mapping before importing the data with:
PUT user
{
  "mappings": {
    "_doc": {
      "properties": {
        "grower_location": {
          "type": "geo_point"
        }
      }
    }
  }
}
When I do this the data doesn't import, although I don't receive an error.
If I import the data first, I am able to search it, although the grower_location: { lat: #, lon: # } object is mapped as an integer and I am unable to run geo_distance queries.
Please help.
I was able to fix this by importing the data once with the Python script in the tutorial.
Then running:
GET user/_mappings
copying the auto-generated mappings to the clipboard, then:
DELETE user/
and then pasting the copied mapping into a new mapping and changing the type for the geo_point data.
PUT user/
{
  "mappings": {
    "user_type": {
      "properties": {
        ...
        "grower_location": {
          "type": "geo_point"
        }
        ...
      }
    }
  }
}
Then re-importing the data using the python script in the tutorial.
Everything is imported and ready to be searched using geo_point!
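For example, a geo_distance query against that field should now work; the distance and coordinates below are placeholder values, not from the original data:
GET user/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "50km",
          "grower_location": {
            "lat": 40.7,
            "lon": -74.0
          }
        }
      }
    }
  }
}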

Enabling regex support on AWS Managed ElasticSearch in painless scripts

I am trying to upload templates to my AWS managed ElasticSearch.
Elasticsearch responds with a 500 error complaining that I need to set script.painless.regex.enabled to true. I know that you cannot edit the elasticsearch.yml file directly, but is there any way to allow support for regex in painless scripts on AWS managed ES?
There is no way yet to use regex on an AWS-managed ES cluster.
You can try to use StringTokenizer instead, as in the following example:
example value:
doc['your_str_field.keyword'].value = '{"xxx":"123213","yyy":"123213","zzz":"123213"}'
Painless script:
{
  "script": {
    "lang": "painless",
    "inline": "String xxx = doc['your_str_field.keyword'].value; xxx = xxx.replace('{','').replace('}','').replace('\"','').replace(' ','');StringTokenizer tokenizer = new StringTokenizer(xxx, ',');tokenizer.nextToken();tokenizer.nextToken();StringTokenizer tokenizer_v = new StringTokenizer(tokenizer.nextToken(),':');tokenizer_v.nextToken();return tokenizer_v.nextToken();"
  }
}
I also needed to increase max_compilations_rate:
PUT /_cluster/settings
{
  "transient": {
    "script.max_compilations_rate": "500/1m"
  }
}