Enabling HA namenodes on a secure cluster in Cloudera Manager fails - hdfs

I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode joinjava.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0 at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147) at
org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480) at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589) at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140) at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?

It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager free edition, you must have had to add the appropriate configs to the safety valves to enable security. As part of that, you need to configure the dfs.https.port.
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.

Related

ElastiCache for Redis - redirections for cluster mode disabled?

Refer to the documentation of ElastiCache for Redis -> Getting Started -> Step 4: Connect to the cluster's mode:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/GettingStarted.ConnectToCacheNode.html
Under the section Connecting to a cluster mode disabled unencrypted-cluster, the docs ask you to run the following command:
$ src/redis-cli -h cluster-endpoint -c -p port number
Then, it gives an example of some redis commands:
set x Hi
-> Redirected to slot [16287] located at 172.31.28.122:6379
OK
set y Hello
OK
get y
"Hello"
set z Bye
-> Redirected to slot [8157] located at 172.31.9.201:6379
OK
get z
"Bye"
get x
-> Redirected to slot [16287] located at 172.31.28.122:6379
"Hi"
What I don't understand is: when we're talking about a "cluster mode disabled" ElastiCache cluster, it means that there's only one shard, as stated in the docs: Components and Features.
If so, how is it that the requests sent in the example above got redirected to other nodes? If there's only one shard, it means that all the data is written in the primary node. The primary node may be replicated to replica nodes, but that's another thing..
Is it a mistake in the docs or am I missing something?
A Redis cluster is a logical grouping of one or more ElastiCache for Redis shards. In the example it talks about interacting with a "cluster mode disabled" redis, however replication is turned on as you see in the screenshot there are 1 primary node and 2 replicas.
Initially I thought the redirections are due to replica, but I tested on my Redis cluster mode disabled with same replication setup and I do not get the ASK and MOVED redirections. I also tested this against the read only directly. (I connected with --verbose mode and -c)
I was not able to generate the redirection events you see in the documentation.
Therefore I can say with strong certainty that the author of the document has pasted in output from a cluster mode enabled Redis cluster, which is possibly causing you the confusion.
You are right. The title for this section in the documentation is incorrect and it describes how to connect to a cluster mode enabled unecrypted cluster.
You can leave feedback on documentation inaccuracies by clicking on the feedback icon in the navigation pane on the upper right part of the screen or clicking on "Provide feedback" at the bottom left in the footer of the page.

"Kafka Timed out waiting for a node assignment." on MSK

Specs:
The serverless Amazon MSK that's in preview.
t2.xlarge EC2 instance with Amazon Linux 2
Installed Kafka from https://dlcdn.apache.org/kafka/3.0.0/kafka_2.13-3.0.0.tgz
openjdk version "11.0.13" 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
sharing)
Gradle 7.3.3
https://github.com/aws/aws-msk-iam-auth, successfully built.
I also tried adding IAM authentication information, as recommended by the Amazon MSK Library for AWS Identity and Access Management. It says to add the following in config/client.properties:
# Sets up TLS for encryption and SASL for authN.
security.protocol = SASL_SSL
# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM
# Binds SASL client implementation.
# sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;
# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
# Binds SASL client implementation. Uses the specified profile name to look for credentials.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsProfileName="kafka-client";
And kafka-client is the IAM role attached to the EC2 instance as an instance profile.
Networking: I used VPC Reachability Analyzer to confirm that the security groups are configured correctly and the EC2 instance I'm using as a Producer can reach the serverless MSK cluster.
What I'm trying to do: create a topic.
How I'm trying: bin/kafka-topics.sh --create --partitions 1 --replication-factor 1 --topic quickstart-events --bootstrap-server boot-zclcyva3.c2.kafka-serverless.us-east-2.amazonaws.com:9098
Result:
Error while executing topic command : Timed out waiting for a node assignment. Call: createTopics
[2022-01-17 01:46:59,753] ERROR org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createTopics
(kafka.admin.TopicCommand$)
I'm also trying: with the plaintext port of 9092. (9098 is the IAM-authentication port in MSK, and serverless MSK uses IAM authentication by default.)
All the other posts I found on SO about this node assignment error didn't include MSK. I tried suggestions like uncommenting the listener setting in server.properties, but that didn't change anything.
Installing kcat for troubleshooting didn't work for me, since there's no out-of-the box installation for the yum package manager, which Amazon Linux 2 uses, and since these instructions failed for me at checking for libcurl (by compile)... failed (fail).
The Question: Any other tips on solving this "node assignment" error?
The documentation has been updated recently, I was able to follow it end to end without any issue (The IAM policy is now correct)
https://docs.aws.amazon.com/msk/latest/developerguide/serverless-getting-started.html
The created properties file is not automatically used; your command needs to include --command-config client.properties, where this properties file is documented at the MSK docs on the linked IAM page.
Extract...
ssl.truststore.location=<PATH_TO_TRUST_STORE_FILE>
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
Alternatively, if the plaintext port didn't work, then you have other networking issues
Beyond these steps, I suggest reaching out to MSK support, and telling them to update the "Create a Topic" page to no longer use Zookeeper, keeping in mind that Kafka 3.0 is not (yet) supported

eks fluent-bit to elasticsearch timeout

So I had a working configuration with fluent-bit on eks and elasticsearch on AWS that was pointing on the AWS elasticsearch service but for cost saving purpose, we deleted that elasticsearch and created an instance with a solo elasticsearch, enough for dev purpose. And the aws service doesn't manage well with only one instance.
The issue is that during this migration the fluent-bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of those configuration, if I comment the kubernetes filter I don't have the errors anymore but I'm loosing the fields in the indices...
I tried tweeking some parameters in fluent-bit to no avail, if anyone has a suggestion?
So, the previous logs did not indicate anything, but I finaly found something when activating trace_error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Did someone get that error before and knows how to solve it?
So, after looking into the logs and finding the mapping issue I ssem to have resolved the issue. The logs are now corretly parsed and send to the elasticsearch.
To resolve it I had to augment the limit of output retry and add the Replace_Dots option.
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent, because of that the error seemed to have continued after the changed until a new index was created making me think that the error was still not resolved.

Greengrass_HelloWorld lambda doesn't publish to Amazon IoT console

I have been following the documentation in every step, and I didn't face any errors. Configured, deployed and made a subscription to hello/world topic just as the documentation detailed. However, when I arrived at the testing step here: https://docs.aws.amazon.com/greengrass/latest/developerguide/lambda-check.html
No messages were showing up on the IoT console (subscription view hello/world)! I am using Greengrass core daemon which runs on my Ubuntu machine, it is active and listens to port 8000. I don't think there is anything wrong with my local device because the group was deployed successfully and because I see the communications going both ways on Wireshark.
I have these logs on my machine: /home/##/Desktop/greengrass/ggc/var/log/system/runtime.log:
[2019-09-28T06:57:42.492-07:00][INFO]-===========================================
[2019-09-28T06:57:42.492-07:00][INFO]-Greengrass Version: 1.9.3-RC3
[2019-09-28T06:57:42.492-07:00][INFO]-Greengrass Root: /home/##/Desktop/greengrass
[2019-09-28T06:57:42.492-07:00][INFO]-Greengrass Write Directory: /home/##/Desktop/greengrass/ggc
[2019-09-28T06:57:42.492-07:00][INFO]-Group File Directory: /home/##/Desktop/greengrass/ggc/deployment/group
[2019-09-28T06:57:42.492-07:00][INFO]-Default Lambda UID: 122
[2019-09-28T06:57:42.492-07:00][INFO]-Default Lambda GID: 127
[2019-09-28T06:57:42.492-07:00][INFO]-===========================================
[2019-09-28T06:57:42.492-07:00][INFO]-The current core is using the AWS IoT certificates with fingerprint. {"fingerprint": "90##4d"}
[2019-09-28T06:57:42.492-07:00][INFO]-Will persist worker process info. {"dir": "/home/##/Desktop/greengrass/ggc/ggc/core/var/worker/processes"}
[2019-09-28T06:57:42.493-07:00][INFO]-Will persist worker process info. {"dir": "/home/##/Desktop/greengrass/ggc/ggc/core/var/worker/processes"}
[2019-09-28T06:57:42.494-07:00][INFO]-No proxy URL found.
[2019-09-28T06:57:42.495-07:00][INFO]-Started Deployment Agent to listen for updates. [2019-09-28T06:57:42.495-07:00][INFO]-Connecting with MQTT. {"endpoint": "a6##ws-ats.iot.us-east-2.amazonaws.com:8883", "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.497-07:00][INFO]-The current core is using the AWS IoT certificates with fingerprint. {"fingerprint": "90##4d"}
[2019-09-28T06:57:42.685-07:00][INFO]-MQTT connection successful. {"attemptId": "GVko", "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-MQTT connection established. {"endpoint": "a6##ws-ats.iot.us-east-2.amazonaws.com:8883", "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-MQTT connection connected. Start subscribing. {"clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-Deployment agent connected to cloud.
[2019-09-28T06:57:42.685-07:00][INFO]-Start subscribing. {"numOfTopics": 2, "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-Trying to subscribe to topic $aws/things/simulators_gg_Core-gda/shadow/update/delta
[2019-09-28T06:57:42.727-07:00][INFO]-Trying to subscribe to topic $aws/things/simulators_gg_Core-gda/shadow/get/accepted
[2019-09-28T06:57:42.814-07:00][INFO]-All topics subscribed. {"clientId": "simulators_gg_Core"}
[2019-09-28T06:58:57.888-07:00][INFO]-Daemon received signal: terminated. [2019-09-28T06:58:57.888-07:00][INFO]-Shutting down daemon.
[2019-09-28T06:58:57.888-07:00][INFO]-Stopping all workers.
[2019-09-28T06:58:57.888-07:00][INFO]-Lifecycle manager is stopped.
[2019-09-28T06:58:57.888-07:00][INFO]-IPC server stopped.
/home/##/Desktop/greengrass/ggc/var/log/system/localwatch/localwatch.log:
[2019-09-28T06:57:42.491-07:00][DEBUG]-will keep the log files for the following lambdas {"readingPath": "/home/##/Desktop/greengrass/ggc/var/log/user", "lambdas": "map[]"}
[2019-09-28T06:57:42.492-07:00][WARN]-failed to list the user log directory {"path": "/home/##/Desktop/greengrass/ggc/var/log/user"}
Thanks in advance.
I had a similar issue on another platform (Jetson Nano). I could not get a response after going through the AWS instructions for setting up a simple Lambda using IOT Greengrass. In my search for answers I discovered that AWS has a qualification test script for any device you connect.
It goes through an automated process of deploying and testing a lambda function(as well as other functionality) and reports results for each step and docs provide troubleshooting info for failures.
By going through those tests I was able to narrow down the issues with my setup, installation, and configuration. The testing docs give pointers to troubleshoot test results. Here is a link to the test: https://docs.aws.amazon.com/greengrass/latest/developerguide/device-tester-for-greengrass-ug.html
If you follow the 'Next Topic' links, it will take you through the complete test. Let me warn you that its extensive, and will take some time, but for me it gave a lot of detailed insight that a hello world does not.

Azure VM, your credentials did not work on remote desktop

I've just had a bit of fun trying to connect to a new VM I'd created, I've found loads of posts from people with the same problem, the answer details the points I've found
(1) For me it worked with
<VMName>\Username
Password
e.g.
Windows8VM\MyUserName
SomePassword#1
(2) Some people have just needed to use a leading '\', i.e.
\Username
Password
Your credentials did not work Azure VM
(3) You can now reset the username/password from the app portal. There are powershell scripts which will also allow you to do this but that shouldn't be necessary anymore.
(4) You can also try redeploying the VM, you can do this from the app portal
(5) This blog says that "Password cannot contain the username or part of username", but that must be out of date as I tried that once I got it working and it worked fine
https://blogs.msdn.microsoft.com/narahari/2011/08/29/your-credentials-did-not-work-error-when-connecting-to-windows-azure-vms/
(6) You may find links such as the below which mention Get-AzureVM, that seems to be for classic VMs, there seem to be equivalents for the resource manager VMs such as Get-AzureRMVM
https://blogs.msdn.microsoft.com/mast/2014/03/06/enable-rdp-or-reset-password-with-the-vm-agent/
For complete novices to powershell, if you do want to go down that road here's the basics you may need. In the end I don't believe I needed this, just point 1
unInstall-Module AzureRM
Install-Module AzureRM -allowclobber
Import-Module AzureRM
Login-AzureRmAccount (this will open a window which takes you through the usual logon process)
Add-AzureAccount (not sure why you need both, but I couldn’t log on without this)
Select-AzureSubscription -SubscriptionId <the guid for your subscription>
Set-AzureRmVMAccessExtension -ResourceGroupName "<your RG name>" -VMName "Windows8VM" -Name "myVMAccess" -Location "northeurope" -username <username> -password <password>
(7) You can connect to a VM in a scale set as by default the Load Balancer will have Nat Rules mapping from port onwards 50000, i.e. just remote desktop to the IP address:port. You can also do it from a VM that isn't in the scale set. Go to the scale set's overview, click on the "virtual network/subnet", that'll give you the internal IP address. Remote desktop from the other one
Ran into similar issues. It seems to need domain by default. Here is what worked for me:
localhost\username
Other option can be vmname\username
Some more guides to help:
https://learn.microsoft.com/en-us/azure/virtual-machines/windows/quick-create-portal#connect-to-virtual-machine
https://learn.microsoft.com/en-us/azure/virtual-machines/windows/connect-logon
In April 2022 "Password cannot contain the username or part of username" was the issue.
During the creation of VM in Azure, everything was alright but wasn't able to connect via RDP.
Same in Nov 2022, you will be allowed to create a password that contains the user name but during login it will display the credential error. Removing the user name from the password fixed it.