strongSwan configuration and tunnel traffic problem, IKEv2 - CentOS 7

I'm new to this area. I am trying to configure a strongSwan site-to-site VPN between two CentOS 7 instances (in different regions) on Google Cloud Platform. I followed these guides:
https://blog.ruanbekker.com/blog/2018/02/11/setup-a-site-to-site-ipsec-vpn-with-strongswan-and-preshared-key-authentication/
https://www.tecmint.com/setup-ipsec-vpn-with-strongswan-on-centos-rhel-8/
https://medium.com/@georgeswizzalonge/how-to-setup-a-site-to-site-vpn-connection-with-strongswan-32d4ed034ae2
This ipsec.conf comes from site A:
config setup
    charondebug="all"
    strictcrlpolicy=no
    uniqueids = yes

conn sg-to-jkt
    authby=secret
    left=%defaultroute
    leftid=34.xx.xx.xxx
    leftsubnet=10.xxx.x.xx/24
    right=34.xxx.xxx.xxx
    rightsubnet=10.xxx.x.x/24
    ike=aes256-sha2_256-modp1024!
    esp=aes256-sha2_256!
    keyingtries=0
    ikelifetime=1h
    lifetime=8h
    dpddelay=30
    dpdtimeout=120
    dpdaction=restart
    auto=start
The ipsec.secrets file on site A:
site-A site-B : PSK "someencryptedkey"
This is the ipsec.conf on site B:
config setup
    charondebug="all"
    strictcrlpolicy=no
    uniqueids = yes

conn jkt-to-sg
    authby=secret
    left=%defaultroute
    leftid=34.xxx.xxx.xxx
    leftsubnet=10.xxx.x.x/24
    right=34.xx.xx.xxx
    rightsubnet=10.xxx.x.xx/24
    ike=aes256-sha2_256-modp1024!
    esp=aes256-sha2_256!
    keyingtries=0
    ikelifetime=1h
    lifetime=8h
    dpddelay=30
    dpdtimeout=120
    dpdaction=restart
    auto=start
The ipsec.secrets file on site B:
site-B site-A : PSK "someencryptedkey"
My questions are:
Why does the strongswan service (systemctl status strongswan) become dead/inactive every time I restart strongSwan (strongswan restart)? (Note: the strongSwan tunnel stays up.)
● strongswan.service - strongSwan IPsec IKEv1/IKEv2 daemon using ipsec.conf
Loaded: loaded (/usr/lib/systemd/system/strongswan.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Sun 2020-10-11 16:37:06 UTC; 32min ago
There is no traffic on the ESP protocol: tcpdump esp does not show anything, yet the strongSwan tunnel is up. I also noticed that the status output differs from the examples: it reports ESP in UDP SPIs instead of ESP SPIs. Does that difference matter, or is something else wrong?
Thank you for your help and advice.

You might check your systemd service file strongswan.service and change the Type= option.
By default you have Type=simple, which works for many systemd services, but it does not work when the script in ExecStart launches another process and then exits. Consider explicitly specifying Type=forking in the [Service] section so that systemd knows to track the spawned process rather than the initial one.
From man systemd.service:
If set to forking, it is expected that the process configured with ExecStart= will call fork() as part of its start-up. The parent process is expected to exit when start-up is complete and all communication channels are set up. The child continues to run as the main daemon process. This is the behavior of traditional UNIX daemons. If this setting is used, it is recommended to also use the PIDFile= option, so that systemd can identify the main process of the daemon. systemd will proceed with starting follow-up units as soon as the parent process exits.
Additionally, I found another thread on Stack Overflow with a similar issue.
In any case, see man systemd.service to choose the appropriate type.
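As a minimal sketch, assuming you keep the stock ExecStart, a drop-in override (the path below is the one systemctl edit strongswan would create) could look like this:

# /etc/systemd/system/strongswan.service.d/override.conf
[Service]
Type=forking

After saving it, run systemctl daemon-reload and then systemctl restart strongswan so the new type takes effect.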
For your second question, you might check your firewall; I found another similar case at this link.
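On the missing ESP traffic: ESP in UDP SPIs in the status output means NAT traversal is active, which is expected on Google Cloud because the instances sit behind one-to-one NAT. The ESP packets are then encapsulated in UDP on port 4500, so a plain tcpdump esp shows nothing. A quick check, assuming the external interface is eth0:

# NAT-T encapsulates ESP in UDP/4500, so capture that instead of the ESP protocol
tcpdump -ni eth0 udp port 4500

Also make sure the VPC firewall rules on both sides allow UDP 500 and 4500.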

Related

Chaincode (invoke) is not able to endorse on remote cluster with all three orgs, org1 succeeds but org2 and org3 don't. What could be wrong?

I have a Kubernetes cluster configured that builds perfectly when run via Docker Desktop, including invoking with successful endorsement from all three chaincode containers in the network.
On the remote side, I'm using AWS EKS to deploy my nodes and I have more recently followed this guide on deploying a production ready peer. I already had EFS set up and in use as a k8s Persistent Volume, and this is populated each time I spool up a network with all the config. This means all the crypto materials, connection profiles, etc are mounted to the relevant containers and as per best practice the reference to these TLS certs is in this directory.
This all works as expected... my admin pods can communicate with my peers, the orderers connect, etcetera. I'm able to fully install chaincode, approve it and commit it to all three of my peers successfully.
When it comes to invoking the chaincode, my org1 container always succeeds, and successfully communicates with the peer in its organization.
I'm aware of the core.yaml setting localMspId and this is being overridden by the environment variable CORE_PEER_LOCALMSPID for each set of peers, such that in my org1 peer the value is Org1MSP, in org2 it's Org2MSP, etc.
When running peer chaincode invoke, the first container (org1) succeeds very quickly, the other two try to contact their peers and hang for the timeout period set in the default gRPC settings (110000ms wait). I also have set the env var of CORE_PEER_ADDRESS_AUTODETECT: "true" on my peer in order to ensure it doesn't try to resolve using the hostnames like peer0.org1 (this clearly works for org1 but not the other two).
The environment variables set for TLS in each of the containers corresponds to the contents of the ones I am passing (in correct order) with my invoke command:
peer chaincode invoke --ctor '${CC_INIT_ARGS}' --channelID ${CHANNEL_ID} --name ${CC_NAME} --cafile \$ORDERER_TLS_ROOTCERT_FILE \
--tls true -o orderer.${ORG}:7050 \
--peerAddresses peer0.org1:7051 \
--peerAddresses peer0.org2:7051 \
--peerAddresses peer0.org3:7051 \
--tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem \
--tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org2-cert.pem \
--tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org3-cert.pem >&invoke-log.txt
cat invoke-log.txt
That command is executed inside my container, and as mentioned, I have manually confirmed this by inspecting all three containers and cat-ing the contents of the files referenced by the environment variables versus the above paths, and they match exactly. That is to say, the contents of /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem are equivalent to the CORE_PEER_TLS_ROOTCERT_FILE setting in org1, and so on per organization.
Example org1 chaincode container logs:
2022-02-23T13:47:07.255Z debug [c-api:lib/handler.js] [allorgs-5e707801] Calling chaincode Invoke(), response status: 200
2022-02-23T13:47:07.256Z info [c-api:lib/handler.js] [allorgs-5e707801] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
For org2 and org3 containers, once it finally finishes the timeout, it outputs:
2022-02-23T12:24:05.045Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: No connection established\n at Object.callErrorFromStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/@grpc/grpc-js/build/src/call-stream.js:182:78\n at processTicksAndRejections (internal/process/task_queues.js:79:11)"
2022-02-23T12:24:05.045Z debug [c-api:lib/handler.js] Chat stream ending
I have also enabled DEBUG logs on everything and I'm gleaning nothing useful from it. Any help or suggestions would be greatly appreciated!
The three peers share the same port. Is that even possible?
Also, when running invoke from the command line, I would normally use the following pattern, repeated for each peer.
--peerAddresses localhost:6051 --tlsRootCertFiles <path to peer on port 6051>
--peerAddresses localhost:6052 --tlsRootCertFiles <path to peer on port 6052>
not the three peers followed by the three TLS cert file paths.
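For illustration, a sketch of that interleaved style adapted to the hostnames in the question. The distinct ports are only an assumption for the case where the peers really do share a host; if each peer is its own pod listening on 7051, keep 7051 throughout:

peer chaincode invoke --ctor '${CC_INIT_ARGS}' --channelID ${CHANNEL_ID} --name ${CC_NAME} \
  --tls true -o orderer.${ORG}:7050 --cafile $ORDERER_TLS_ROOTCERT_FILE \
  --peerAddresses peer0.org1:7051 --tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem \
  --peerAddresses peer0.org2:8051 --tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org2-cert.pem \
  --peerAddresses peer0.org3:9051 --tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org3-cert.pem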

How do I add a label to a Kubernetes pod when it has finished running its startup script?

I have a pod that runs a script on startup. Once that script has finished executing I want to add a label to the pod so that it becomes available as part of a service. What is the best way to do this?
Instead of a label, you can use a readiness probe. If a readiness probe is set on a pod, then that probe will be continuously executed to detect whether that pod is considered "ready" and therefore should become available as part of a service. If the probe returns false or error, the pod will not be put into service. Probes can hit an HTTP endpoint or run a script inside the container.
In your case, your script could drop a file somewhere to indicate that the pod is ready. A separate script in your container (specified in the readiness probe) could return true only if that file is present.
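A minimal sketch of that idea, assuming the startup script touches /tmp/ready when it finishes (the pod and image names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: my-app:latest
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]   # succeeds only once the startup script has created the file
      initialDelaySeconds: 5
      periodSeconds: 10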
More information on readiness checks can be found here.
(If you are still interested in adding a label to the pod, you can learn about service accounts which allow pods to communicate with the apiserver.)
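If you do still want the label, the last line of your startup script could be something like the following. This assumes kubectl is available in the image, POD_NAME and POD_NAMESPACE are injected via the downward API, and the pod's service account has RBAC permission to patch pods:

# hypothetical: mark this pod once the startup work is done
kubectl -n "$POD_NAMESPACE" label pod "$POD_NAME" startup=complete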

Spark - Remote Akka Client Disassociated

I am setting up Spark 0.9 on AWS and am finding that when launching the interactive Pyspark shell, my executors / remote workers are first being registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-xx-xx-xxx-xxx.ec2.internal:54110/user/
Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected,
so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same.
It started to work for me once the workers were connecting to the master using the very same name the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.121.127:7078] -> [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
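As a minimal sketch of that fix, using the hostname from the log excerpt above (standalone mode; the script path and port are the defaults):

# conf/spark-env.sh on the master: pin the address the master advertises
export SPARK_MASTER_IP=idc1-hrm1.heylinux.com

# on each worker, connect using exactly that URL
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://idc1-hrm1.heylinux.com:7077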

Yarn in Amazon EC2 with Whirr

I am trying to configure YARN 2.2.0 with Whirr in Amazon EC2; however, I am having some problems. I have modified the Whirr services to support YARN 2.2.0, and as a result I am able to start the jobs and run them successfully. However, I am facing an issue in tracking the job progress.
mapreduce.Job (Job.java:monitorAndPrintJob(1317)) - Running job: job_1397996350238_0001
2014-04-20 21:57:24,544 INFO [main] mapred.ClientServiceDelegate (ClientServiceDelegate.java:getProxy(270)) - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
java.io.IOException: Job status not available
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:599)
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1327)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
at com.zetaris.hadoop.seek.preprocess.PreProcessorDriver.executeJobs(PreProcessorDriver.java:112)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executePreProcessJob(JobToJobMatchingDriver.java:143)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executeJobs(JobToJobMatchingDriver.java:78)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executeJobs(JobToJobMatchingDriver.java:43)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.main(JobToJobMatchingDriver.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212
I tried debugging; the problem is with the ApplicationMaster. It has a hostname and RPC port, where the hostname is the internal hostname that can only be resolved from within the Amazon network. Ideally it should have been a public Amazon DNS name, but I couldn't set that yet. I tried setting parameters like
yarn.nodemanager.hostname
yarn.nodemanager.address
But I couldn't find any change in the ApplicationMaster's hostname or port; they are still the private Amazon internal hostname. Am I missing anything? Or should I change /etc/hosts on all NodeManager nodes so that the NodeManagers start with the public address?
But that would be overkill, right? Or is there any way I can configure the ApplicationMaster to take the public IP, so that I can remotely track the progress?
I am doing all this because I need to submit the jobs remotely, and I am not willing to compromise on this feature. Can anyone out there guide me?
I was successful in configuring the history server, and I am able to access it from the remote client. I used the following configuration to do it:
mapreduce.jobhistory.webapp.address
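For reference, a sketch of that property as it might look in mapred-site.xml (the DNS name is a placeholder; 19888 is the default history server web UI port):

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>ec2-xx-xx-xx-xx.compute-1.amazonaws.com:19888</value>
</property>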
When I debugged, I found the
MRClientProtocol MRClientProxy = null;
try {
  MRClientProxy = getProxy();
  return methodOb.invoke(MRClientProxy, args);
} catch (InvocationTargetException e) {
  // Will not throw out YarnException anymore
  LOG.debug("Failed to contact AM/History for job " + jobId +
      " retrying..", e.getTargetException());
  // Force reconnection by setting the proxy to null.
  realProxy = null;
proxy failing to connect because of the private address. The code snippet above is from ClientServiceDelegate.
I was able to avoid the issue rather than solve it. The problem is with resolving the IP from outside the cloud environment.
Initially I tried updating the whirr-yarn source to use the public IP for the configurations rather than the private IP, but there were still issues, so I gave up on that.
What I finally did was to start the job from within the cloud environment itself, rather than from a host outside the cloud infrastructure. I hope somebody finds a better way.
I had the same problem. I solved it by adding the following lines to mapred-site.xml. It moves your staging directory from the default tmp directory to your home directory, where you have permission.
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
In addition to this, you need to create a history directory on hdfs:
hdfs dfs -mkdir -p /user/history
hdfs dfs -chmod -R 1777 /user/history
hdfs dfs -chown mapred:hadoop /user/history
I found this link quite useful for configuring a Hadoop cluster. The job history settings can also be set programmatically on the client side:
conf.set("mapreduce.jobhistory.address", "hadoop3.hwdomain:10020");
conf.set("mapreduce.jobhistory.intermediate-done-dir", "/mr-history/tmp");
conf.set("mapreduce.jobhistory.done-dir", "/mr-history/done");

Enabling HA namenodes on a secure cluster in Cloudera Manager fails

I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode join
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0 at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147) at
org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480) at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589) at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140) at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?
It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager Free Edition you must have added the appropriate configs to the safety valves to enable security; as part of that, you also need to set dfs.https.port, as sketched below.
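A sketch of that safety-valve entry, assuming the stock secure port (50470 is the usual default secure HTTP port for the NameNode in CDH4; adjust if yours differs):

<property>
  <name>dfs.https.port</name>
  <value>50470</value>
</property>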
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.