Google Cloud Dataproc: cluster create errors (debconf DbDriver config.dat locked) - google-cloud-platform

Recently, I have experienced occasional errors while attempting to create Dataproc clusters in GCP. The creation command is similar to:
gcloud dataproc clusters create ${CLUSTER_NAME} \
--zone "us-east1-b" \
--master-machine-type "n1-standard-16" \
--master-boot-disk-size 150 \
--num-workers ${WORKER_NODE_COUNT:-9} \
--worker-machine-type "n1-standard-16" \
--worker-boot-disk-size 25 \
--project ${PROJECT_NAME} \
--properties 'yarn:yarn.log-aggregation-enable=true'
Very intermittently, the error I receive is:
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/PROJECT/regions/global/operations/UUID] failed: Multiple Errors:
- Failed to initialize node random-name-m. See output in: gs://dataproc-UUID-us/google-cloud-dataproc-metainfo/UUID/random-name-m/dataproc-startup-script_output
- Failed to initialize node random-name-w-0. See output in: gs://dataproc-UUID-us/google-cloud-dataproc-metainfo/UUID/random-name-w-0/dataproc-startup-script_output
- Failed to initialize node random-name-w-1. See output in: gs://dataproc-UUID-us/google-cloud-dataproc-metainfo/UUID/random-name-w-1/dataproc-startup-script_output
- Worker random-name-w-8 unable to register with master random-name-m. This could be because it is offline, or network is misconfigured..
And the last lines of the Google Storage bucket output file (dataproc-startup-script_output) are:
+ debconf-set-selections
debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process: Resource temporarily unavailable
++ logstacktrace
++ local err=1
++ local code=1
++ set +o xtrace
ERROR: 'debconf-set-selections' exited with status 1
Call tree:
0: /usr/local/share/google/dataproc/startup-script-cloud_datarefinery_image_20180803_nightly-RC04.sh:490 main
Exiting with status 1
This one is really starting to annoy me! Any ideas/thoughts/resolutions are much appreciated!

A fix for this issue will be rolling out over the course of next week's release.
You can check the release notes to see when the fix has rolled out here:
https://cloud.google.com/dataproc/docs/release-notes
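In the meantime, if a cluster creation fails this way, the per-node startup output referenced in the error can be pulled down for inspection with gsutil. (The bucket and node names below are the placeholders from the error message above; substitute the exact gs:// path printed in your own error.)

```shell
# Inspect the tail of the startup-script output for the failed node
# (use the exact gs:// path printed in your create error):
gsutil cat gs://dataproc-UUID-us/google-cloud-dataproc-metainfo/UUID/random-name-m/dataproc-startup-script_output \
  | tail -n 20
```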

Related

Unable to run mongoexport on AWS EC2 instance

I have data in a DocumentDB database that I would like to export to an S3 bucket. However, when I try to run the mongoexport command:
mongoexport --uri="my_cluster_address/database_to_use" --collection=my_collection --out=some_file.json
I get this error:
could not connect to server: server selection error: server selection timeout, current topology:
{ Type: Single, Servers: [{ Addr: docdb_cluster_address, Type: Unknown, State: Connected, Average RTT: 0, Last error:
connection() : connection(docdb_cluster_address[-13]) incomplete read of message header: read tcp port_numbers->port_numbers: i/o timeout }, ] }
I am able to ssh into the cluster and do all sorts of transformations and really anything else related to database work, but when I exit the mongo shell and try to run the mongoexport command, it does not work. I have already downloaded the mongoexport tools to the EC2 instance and added them to the PATH in .bash_profile. I do not think it is a networking issue, because if that were the case I wouldn't be able to ssh into the cluster, so I think I am good on that part. I am not sure what I could be missing here. Any ideas?
When working with DocumentDB, mongoexport does not take the same parameters as it normally would when exporting/importing/restoring/dumping from/to MongoDB.
Below is the command that worked for me, along with a link to the documentation:
https://docs.aws.amazon.com/documentdb/latest/developerguide/backup_restore-dump_restore_import_export_data.html
mongoexport --ssl \
--host="tutorialCluster.node.us-east-1.docdb.amazonaws.com:27017" \
--collection=restaurants \
--db=business \
--out=restaurant2.json \
--username=<yourUsername> \
--password=<yourPassword> \
--sslCAFile rds-combined-ca-bundle.pem
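If the rds-combined-ca-bundle.pem file isn't already on your instance, it can be fetched from AWS first. (This is the legacy combined bundle path from the AWS docs linked above; check the current documentation, since AWS now also publishes region-specific bundles.)

```shell
# Fetch the combined CA bundle used to verify the DocumentDB TLS endpoint
wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
```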
And below is the documentation for how it would normally work if you were working with MongoDB:
https://docs.mongodb.com/database-tools/mongoexport/

Google Cloud Function fail to build

I'm trying to update a Cloud Function that has been working for over a week now.
But when I try to update the function today, I get a BUILD FAILED: BUILD HAS TIMED OUT error.
Build fail error
I am using the Google Cloud console to deploy the Python function, not Cloud Shell. I even tried to make a new copy of the function, and that fails too.
Looking at the logs, it says INVALID_ARGUMENT. But I'm just using the console and haven't changed anything apart from the Python code, compared to the previous build that I successfully deployed last week.
Error logs
{
insertId: "fjw53vd2r9o"
logName: " my log name "
operation: {…}
protoPayload: {
#type: "type.googleapis.com/google.cloud.audit.AuditLog"
authenticationInfo: {…}
methodName: "google.cloud.functions.v1.CloudFunctionsService.UpdateFunction"
requestMetadata: {…}
resourceName: " my function name"
serviceName: "cloudfunctions.googleapis.com"
status: {
code: 3
message: "INVALID_ARGUMENT"
}
}
receiveTimestamp: "2020-02-05T18:04:18.269557510Z"
resource: {…}
severity: "ERROR"
timestamp: "2020-02-05T18:04:18.241Z"
}
I even tried increasing the timeout parameter to 540 seconds, and I still get the build error.
Timeout parameter setting
Can someone help, please?
In future, please copy and paste the text from errors and logs rather than referencing screenshots; it's easier to parse and possibly more permanent.
It's possible that there's an intermittent issue with the service (in your region) that is causing you problems. Does this issue continue?
You can check the status dashboard for service issues (though there is no entry specifically for Functions):
https://status.cloud.google.com/
I just deployed and updated a Golang Function in us-central1 without issues.
Which language/runtime are you using?
Which region?
Are you confident that your updates to the Function are correct?
A more effective albeit dramatic way to test this would be to create a new (temporary) project and try to deploy the function there (possibly to a different region too).
NB The timeout setting applies to the Function's invocations, not to the deployment.
Example (using gcloud)
PROJECT=[[YOUR-PROJECT]]
BILLING=[[YOUR-BILLING]]
gcloud projects create ${PROJECT}
gcloud beta billing projects link ${PROJECT} --billing-account=${BILLING}
gcloud services enable cloudfunctions.googleapis.com --project=${PROJECT}
touch function.go go.mod
# Deploy
gcloud functions deploy fred \
--region=us-central1 \
--allow-unauthenticated \
--entry-point=HelloFreddie \
--trigger-http \
--source=${PWD} \
--project=${PROJECT} \
--runtime=go113
# Update
gcloud functions deploy fred \
--region=us-central1 \
--allow-unauthenticated \
--entry-point=HelloFreddie \
--trigger-http \
--source=${PWD} \
--project=${PROJECT} \
--runtime=go113
# Test
curl \
--request GET \
$(\
gcloud functions describe fred \
--region=us-central1 \
--project=${PROJECT} \
--format="value(httpsTrigger.url)")
Hello Freddie
Logs:
gcloud logging read "resource.type=\"cloud_function\" resource.labels.function_name=\"fred\" resource.labels.region=\"us-central1\" protoPayload.methodName=(\"google.cloud.functions.v1.CloudFunctionsService.CreateFunction\" OR \"google.cloud.functions.v1.CloudFunctionsService.UpdateFunction\")" \
--project=${PROJECT} \
--format="json(protoPayload.methodName,protoPayload.status)"
[
{
"protoPayload": {
"methodName": "google.cloud.functions.v1.CloudFunctionsService.CreateFunction"
}
},
{
"protoPayload": {
"methodName": "google.cloud.functions.v1.CloudFunctionsService.CreateFunction",
"status": {}
}
},
{
"protoPayload": {
"methodName": "google.cloud.functions.v1.CloudFunctionsService.UpdateFunction"
}
},
{
"protoPayload": {
"methodName": "google.cloud.functions.v1.CloudFunctionsService.UpdateFunction",
"status": {}
}
}
]

Connect local node to kops AWS cluster

I want to connect a local node (not a cloud one) to a kops-created cluster on AWS. I followed the suggested approach in
Kubernetes: Combining a Kops cluster to an on-premise Kubeadm cluster. Below are my kubelet options:
DAEMON_ARGS="\
--allow-privileged=true \
--cgroup-root=/ \
--cgroup-driver=systemd \
--cluster-dns=${CLUSTER_DNS} \
--cluster-domain=cluster.local \
--enable-debugging-handlers=true \
--eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% \
--feature-gates=DevicePlugins=true,ExperimentalCriticalPodAnnotation=true \
--hostname-override=my-node \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--network-plugin-mtu=9001 \
--network-plugin=kubenet \
--node-labels=KubernetesCluster=${CLUSTER_NAME},kubernetes.io/cluster/${CLUSTER_NAME}=owned,kubernetes.io/role=node,node-role.kubernetes.io/node= \
--node-ip=${NODE_IP} \
--non-masquerade-cidr=${NON_MASQUERADE_CIDR} \
--pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-schedulable=true \
--v=2 \
--cni-bin-dir=/opt/cni/bin/ \
--cni-conf-dir=/etc/cni/net.d/"
When I start kubelet on my local node, it successfully connects to the kube-apiserver and registers the node. However, it then repeatedly fails on updating node status:
E0121 23:08:35.135858 18352 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "my-node": nodes "my-node" not found
E0121 23:08:35.191611 18352 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "my-node": nodes "my-node" not found
...
E0121 23:08:35.359480 18352 kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
E0121 23:08:35.823944 18352 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "my-node" not found
Upon checking the kube-controller-manager logs (/var/log/kube-controller-manager.log on the master node), I found that kube-controller-manager is deleting my-node because it can't be found within the cloud provider, aws:
I0121 23:08:25.722214 1 node_controller.go:769] Deleting node (no longer present in cloud provider): my-node
I0121 23:08:25.722248 1 controller_utils.go:197] Recording Deleting Node my-node because it's not present according to cloud provider event message for node my-node
Is there a way to disable this cloud-provider check for my-node so it doesn't get deleted by kube-controller-manager? I still want to be able to run some nodes in AWS, so I don't want to clear the cloud-provider flag in kube-controller-manager.

dsub: google cloud error ("exit status 141")

I was trying to run some whole-genome sequencing samples on Google Cloud using dsub. The dsub commands work OK for some samples, but not others. I have tried reducing the number of parallel threads and increasing the memory and disk, but it still fails. Since each run takes about 2 days, the trial-and-error approach is pretty expensive! Any help/tips would be highly appreciated!
My command is:
dsub \
--project "${MY_PROJECT}" \
--zones "us-central1-a" \
--logging "${LOGGING}" \
--vars-include-wildcards \
--disk-size 800 \
--min-ram 60 \
--image "us.gcr.io/xxx-yyy-zzz/data" \
--tasks "${SCRIPT_DIR}"/tBOWTIE2.tsv \
--command 'bismark --bowtie2 --bam --parallel 2 "${GENOME_REFERENCE}" -1 "${INPUT_FORWARD}" -2 "${INPUT_REVERSE}" -o "${OUTPUT_DIR}"' \
--wait
The dstat command with the '--full' option shows the error as:
status: FAILURE
status-detail: "11: Docker run failed"
The last line in the log file, on google cloud, just states "(exit status 141)".
many thanks!

Kubernetes 1.0.1 External Load Balancer on GCE with CoreOS

Using a previous version of Kubernetes (0.16.x) I was able to create a cluster of CoreOS based VMs on GCE that were capable of generating external network load balancers for services. With the release of v1 of Kubernetes the configuration necessary for this functionality seems to have changed. Could anyone offer any advice or point me in the direction of some documentation that might help me out with this issue?
I suspect that the problem has to do with IPs/naming, as I was previously using kube-register to handle this, and that component no longer seems necessary. My current configuration will create internal service load balancers without issue, and will even create external service load balancers, but they are only viewable through the gcloud UI and are not registered or displayed in kubectl output. Unfortunately, the external IPs generated do not actually proxy the traffic through either.
The kube-controller-manager service log looks like this:
Aug 05 12:15:42 europe-west1-b-k8s-master.c.staging-infrastructure.internal hyperkube[1604]: I0805 12:15:42.516360 1604 gce.go:515] Firewall doesn't exist, moving on to deleting target pool.
Aug 05 12:15:42 europe-west1-b-k8s-master.c.staging-infrastructure.internal hyperkube[1604]: E0805 12:15:42.516492 1604 servicecontroller.go:171] Failed to process service delta. Retrying: googleapi: Error 404: The resource 'projects/staging-infrastructure/global/firewalls/k8s-fw-a4db9328c3b6b11e5ab9f42010af0397' was not found, notFound
Aug 05 12:15:42 europe-west1-b-k8s-master.c.staging-infrastructure.internal hyperkube[1604]: I0805 12:15:42.516539 1604 servicecontroller.go:601] Successfully updated 2 out of 2 external load balancers to direct traffic to the updated set of nodes
Aug 05 12:16:07 europe-west1-b-k8s-master.c.staging-infrastructure.internal hyperkube[1604]: E0805 12:16:07.620094 1604 servicecontroller.go:171] Failed to process service delta. Retrying: failed to create external load balancer for service default/autobot-cache-graph: googleapi: Error 400: Invalid value for field 'resource.targetTags[0]': 'europe-west1-b-k8s-node-0.c.staging-infrastructure.int'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)', invalid
Aug 05 12:16:12 europe-west1-b-k8s-master.c.staging-infrastructure.internal hyperkube[1604]: I0805 12:16:12.804512 1604 servicecontroller.go:275] Deleting old LB for previously uncached service default/autobot-cache-graph whose endpoint &{[{146.148.114.97 }]} doesn't match the service's desired IPs []
Here is the config I am using (download chmod etc omitted for clarity).
On the master:
- name: kube-apiserver.service
command: start
content: |
[Unit]
Description=Kubernetes API Server
Requires=setup-network-environment.service etcd.service generate-serviceaccount-key.service
After=setup-network-environment.service etcd.service generate-serviceaccount-key.service
[Service]
EnvironmentFile=/etc/network-environment
ExecStart=/opt/bin/hyperkube apiserver \
--cloud-provider=gce \
--service_account_key_file=/opt/bin/kube-serviceaccount.key \
--service_account_lookup=false \
--admission_control=NamespaceLifecycle,NamespaceAutoProvision,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota \
--runtime_config=api/v1 \
--allow_privileged=true \
--insecure_bind_address=0.0.0.0 \
--insecure_port=8080 \
--kubelet_https=true \
--secure_port=6443 \
--service-cluster-ip-range=10.100.0.0/16 \
--etcd_servers=http://127.0.0.1:2379 \
--bind-address=${DEFAULT_IPV4} \
--logtostderr=true
Restart=always
RestartSec=10
- name: kube-controller-manager.service
command: start
content: |
[Unit]
Description=Kubernetes Controller Manager
Requires=kube-apiserver.service
After=kube-apiserver.service
[Service]
ExecStart=/opt/bin/hyperkube controller-manager \
--cloud-provider=gce \
--service_account_private_key_file=/opt/bin/kube-serviceaccount.key \
--master=127.0.0.1:8080 \
--logtostderr=true
Restart=always
RestartSec=10
- name: kube-scheduler.service
command: start
content: |
[Unit]
Description=Kubernetes Scheduler
Requires=kube-apiserver.service
After=kube-apiserver.service
[Service]
ExecStart=/opt/bin/hyperkube scheduler --master=127.0.0.1:8080
Restart=always
RestartSec=10
And on the node:
- name: kubelet.service
command: start
content: |
[Unit]
Description=Kubernetes Kubelet
Requires=setup-network-environment.service
After=setup-network-environment.service
[Service]
EnvironmentFile=/etc/network-environment
WorkingDirectory=/root
ExecStart=/opt/bin/hyperkube kubelet \
--cloud-provider=gce \
--address=0.0.0.0 \
--port=10250 \
--api_servers=<master_ip>:8080 \
--allow_privileged=true \
--logtostderr=true \
--cadvisor_port=4194 \
--healthz_bind_address=0.0.0.0 \
--healthz_port=10248
Restart=always
RestartSec=10
- name: kube-proxy.service
command: start
content: |
[Unit]
Description=Kubernetes Proxy
Requires=setup-network-environment.service
After=setup-network-environment.service
[Service]
ExecStart=/opt/bin/hyperkube proxy \
--master=<master_ip>:8080 \
--logtostderr=true
Restart=always
RestartSec=10
To me it looks like a mismatch in naming and ip, but I'm not sure how to adjust my config to resolve. Any guidance greatly appreciated.
How did you create the nodes in your cluster? We've seen another instance of this issue caused by bugs in the cluster bootstrapping script that didn't apply the expected node names and tags.
If you recreate your cluster using the following two commands, as recommended on the issue linked to above, creating load balancers should work for you:
export OS_DISTRIBUTION=coreos
cluster/kube-up.sh
Otherwise, you may need to wait for the issues to be fixed.
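As a quick sanity check, a candidate node name can be tested against the tag regex shown in the error above ('(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)') before bootstrapping. valid_gce_tag here is a hypothetical helper, just a sketch of the check:

```shell
# Returns success iff the argument would be a valid GCE target tag:
# a lowercase letter first, then up to 62 more chars of [-a-z0-9],
# ending in a letter or digit -- no dots, so FQDN-style names always fail.
valid_gce_tag() {
  printf '%s' "$1" | grep -Eq '^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$'
}

valid_gce_tag "k8s-node-0" && echo "k8s-node-0: valid"
valid_gce_tag "europe-west1-b-k8s-node-0.c.staging-infrastructure.int" \
  || echo "FQDN-style name: invalid"
```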