ElasticSearch cloud-aws plugin not able to join cluster - amazon-web-services

So I have been attempting to use the Elasticsearch "cloud-aws" plugin to join Elasticsearch nodes to my single master. I have been through a few online guides and tried a few settings from various sources, but I still can't get the new nodes to join the existing master.
I have configured IAM roles and tags for EC2, and this is my elasticsearch.yml file on one node (the others are similar):
node.name: Thor
node.client: "true"
network.host: localhost
cloud.aws.access_key: foobar
cloud.aws.secret_key: barfoo
cloud.aws.region: eu-west-1
discovery.type: ec2
discovery.ec2.tag.elasticsearch: Ubuntu-ElasticNode
The logging from Elasticsearch is sparse, and even in DEBUG mode not much is offered up.
[2016-03-15 23:01:05,440][INFO ][node ] [Thor] version[2.2.0], pid[1550], build[8ff36d1/2016-01-27T13:32:39Z]
[2016-03-15 23:01:05,447][INFO ][node ] [Thor] initializing ...
[2016-03-15 23:01:06,685][INFO ][plugins ] [Thor] modules [lang-expression, lang-groovy], plugins [cloud-aws], sites []
[2016-03-15 23:01:10,016][INFO ][node ] [Thor] initialized
[2016-03-15 23:01:10,017][INFO ][node ] [Thor] starting ...
[2016-03-15 23:01:10,106][INFO ][transport ] [Thor] publish_address {localhost/127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2016-03-15 23:01:10,115][INFO ][discovery ] [Thor] elasticsearch/9PmYq5tXQcaPUPqDh4VTSQ
[2016-03-15 23:01:40,116][WARN ][discovery ] [Thor] waited for 30s and no initial state was set by the discovery
[2016-03-15 23:01:40,155][INFO ][http ] [Thor] publish_address {localhost/127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2016-03-15 23:01:40,155][INFO ][node ] [Thor] started
[2016-03-15 23:54:40,863][DEBUG][action.admin.cluster.health] [Thor] no known master node, scheduling a retry
[2016-03-15 23:55:10,864][DEBUG][action.admin.cluster.health] [Thor] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2016-03-15 23:55:10,874][INFO ][rest.suppressed ] /_cluster/health Params: {pretty=}
MasterNotDiscoveredException[null]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:205)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:239)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:794)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have the port range 9200-9400 open between the Elasticsearch servers, but the log seems to indicate that discovery is still timing out. I set "discovery.ec2.tag.*" to speed up the discovery process, but this hasn't helped.
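For reference, the tag that the discovery.ec2.tag.elasticsearch filter is supposed to match was applied with something along these lines (the instance ID is a placeholder):
aws ec2 create-tags --resources i-0123456789abcdef0 --tags Key=elasticsearch,Value=Ubuntu-ElasticNode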
Does anyone have any idea how this plugin needs to be configured? I have read a few guides, and many use even fewer configuration options than I do and are still able to join nodes to the master.

I'm running ElasticSearch 2.2. Here's an example of my working config:
plugin.mandatory: cloud-aws
cluster.name: mynewcluster
cloud.aws.access_key: mykey
cloud.aws.secret_key: mysecret
cloud.aws.region: us-east-1
discovery.type: ec2
discovery.ec2.tag.elasticsearch: mynewcluster
I think you need to look at your network config, specifically network.host. From the docs:
Elasticsearch binds to localhost only by default. This is sufficient for you to run a local development server (or even a development cluster, if you start multiple nodes on the same machine), but you will need to configure some basic network settings in order to run a real production cluster across multiple servers.
I don't have the network.host setting in my elasticsearch.yml, which leads me to suggest taking it out altogether. However, since the docs say that it binds to localhost by default, I would also suggest that you try setting it to the public hostname or IP of the node.
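For example, something along these lines in elasticsearch.yml (the cloud-aws plugin also understands special values like _ec2:privateIp_, if I remember correctly; the explicit address below is just a placeholder):
network.host: _ec2:privateIp_
# or explicitly, e.g.:
# network.host: 10.0.1.23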
All of this assumes that you have correctly set up IAM and Security Groups according to: https://github.com/elastic/elasticsearch-cloud-aws
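For what it's worth, the IAM permission the plugin needs for discovery is essentially just ec2:DescribeInstances; a minimal policy sketch based on that README looks like:
{
  "Statement": [
    {
      "Action": ["ec2:DescribeInstances"],
      "Effect": "Allow",
      "Resource": ["*"]
    }
  ],
  "Version": "2012-10-17"
}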

So after having this chat in the es forums: https://discuss.elastic.co/t/cloud-aws-plugin-not-able-to-join-cluster/44703/3
I decided to rebuild the node cleanly, as I suspected that a Java downgrade from 8 to 7 (done to get the cloud-aws plugin working) might be causing the issue, and I had tried too many failed fixes. I also (following advice in the linked thread) installed the marvel-agent and license plugins, but I haven't seen anyone else do this to get discovery working, so I am not sure it is a requirement. I also made sure to hold ES package upgrades, because the Marvel plugin complained a bit when ES upgraded (although the plugin could also have been upgraded, so that's just a personal preference really).
Discovery is now working very well.

Related

Kernel panic during EC2 launch

During deployment of a new app version on a new EC2 instance, launched from a template based on an AMI image, I got the following error:
[ 1.670047] No filesystem could mount root, tried:
[ 1.670048]
[ 1.677170] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 1.685026] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.181-140.257.amzn2.x86_64 #1
[ 1.692532] Hardware name: Amazon EC2 t3a.nano/, BIOS 1.0 10/16/2017
Previously it was working like a charm - I haven't changed any AMI or template setting, just added new SSL certs to nginx.
I found this troubleshooting article but cannot make progress with it, since I cannot access the instance terminal.
I've tried to boot a new instance with a different kernel ID, but got this error:
You cannot launch multiple AMIs with different virtualization styles at the same time.
I believe the kernel somehow changed during AMI creation and that this is causing the problem - is that possible?
What can cause this issue, and how can I solve it?

Unable to connect to neo4j server on my aws ec2 instance - port 7474

After installing Neo4j on my AWS EC2 instance, the following seems to indicate that the server is up.
# bin/neo4j console
Active database: graph.db
Directories in use:
home: /usr/local/share/neo4j-community-3.3.1
config: /usr/local/share/neo4j-community-3.3.1/conf
logs: /usr/local/share/neo4j-community-3.3.1/logs
plugins: /usr/local/share/neo4j-community-3.3.1/plugins
import: /usr/local/share/neo4j-community-3.3.1/import
data: /usr/local/share/neo4j-community-3.3.1/data
certificates: /usr/local/share/neo4j-community-3.3.1/certificates
run: /usr/local/share/neo4j-community-3.3.1/run
Starting Neo4j.
WARNING: Max 1024 open files allowed, minimum of 40000 recommended.
See the Neo4j manual.
2017-12-01 16:03:04.380+0000 INFO ======== Neo4j 3.3.1 ========
2017-12-01 16:03:04.447+0000 INFO Starting...
2017-12-01 16:03:05.986+0000 INFO Bolt enabled on 127.0.0.1:7687.
2017-12-01 16:03:11.206+0000 INFO Started.
2017-12-01 16:03:12.860+0000 INFO Remote interface available at
http://localhost:7474/
At this point I am not able to connect. I have opened up ports 7474 and 7687, and I can access port 80, plus SSH into the instance, etc.
Is this a Neo4j or AWS problem?
Any help is appreciated.
Colin Goldberg
Set dbms.connectors.default_listen_address to 0.0.0.0, then only open the SSL port on 7473 using Amazon's EC2 security groups. Don't use 7474 if you don't have to.
It looks like Neo4j is only listening on the localhost interface. If you run netstat -a | grep 7474, you want to see something like *:7474. If you see something like localhost:7474, then you won't be able to connect to the port from outside.
Take a look at Configuring Neo4j connectors. I believe you want dbms.connectors.default_listen_address set to 0.0.0.0.
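A minimal sketch of that change in neo4j.conf (using the Neo4j 3.x setting name from the docs above):
# neo4j.conf - listen on all interfaces instead of just localhost
dbms.connectors.default_listen_address=0.0.0.0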
And now a warning - you are opening your Neo4j to the entire planet if you do this. That may be ok but it seems unlikely that this is what you want to do. The defaults are there for a reason - you don't want the entire planet being able to try to hack into your database. Use caution if you enable this.

fail to insert data into local Google App Engine datastore

I am following the example from Google's Mobile Shopping Assistant sample, which asks me to import data according to this link.
I tried the steps according to the example (all the sample code is vanilla; I didn't change anything except to fix the warning to use the latest Gradle build version).
I believe that I am missing an essential step that is not stated in the example. Can someone provide some insight into which step I did wrong?
The following are the steps I did:
Started the local Google App Engine app "backend".
Ran the command: appcfg.py upload_data --config_file bulkloader.yaml --url=http://localhost:8080/remote_api --filename places.csv --kind=Place -e myEmailAddress@gmail.com
This command is supposed to insert 2 rows into the datastore (places.csv contains two entries).
This gave me the following readout:
10:07 AM Uploading data records.
[INFO ] Logging to bulkloader-log-20151020.100728
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
Error 302: --- begin server output ---
--- end server output ---
I then go to "http://localhost:8080/admin/buildsearchindex" which displays "MaintenanceTasks completed."
Next I go to "http://localhost:8080/_ah/admin" but it displays
Datastore has no entities in the Empty namespace. You need to add data
programatically before you can use this tool to view and edit it.
I had the same problem, although not with the local development server but with the deployed version. After nearly going insane, I found a workaround to upload the data using appcfg. In my case, here is what I noticed when trying the following.
Input not working for me:
gcloud auth login
appcfg.py upload_data --config_file bulkloader.yaml --url=http://<yourproject>.appspot.com/remote_api --filename places.csv --kind=Place --email=<you@gmail.com>
Output of error:
11:10 AM Uploading data records.
[INFO ] Logging to bulkloader-log-20160108.111007
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
Error 302: --- begin server output ---
--- end server output ---
As expected, I was not asked to authenticate myself again during the second command, but apparently appcfg could still not authenticate my account. I am using Windows 7 with Python 2.7, the Python Google App Engine SDK (which includes appcfg.py), and gcloud from the Google Cloud SDK, if I get it right.
However, https://cloud.google.com/container-registry/docs/auth shows that you can print out the access token using the gcloud command and then insert it manually into your appcfg command, which worked for me.
Input working for me:
gcloud auth login
gcloud auth print-access-token
This prints out the access token, which you can then use with appcfg:
appcfg.py upload_data --oauth2_access_token=<oauth2_access_token> --config_file bulkloader.yaml --url=http://<yourproject>.appspot.com/remote_api --filename places.csv --kind=Place --email=<you@gmail.com>
Output of successful data upload:
10:42 AM Uploading data records.
[INFO ] Logging to bulkloader-log-20160108.104246
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20160108.104246.sql3
[INFO ] Connecting to <yourproject>.appspot.com/remote_api
[INFO ] Starting import; maximum 10 entities per post
.
[INFO ] 3 entities total, 0 previously transferred
[INFO ] 3 entities (4099 bytes) transferred in 1.7 seconds
[INFO ] All entities successfully transferred
I hope this helps anybody trying to solve this problem. To me, it is not clear what the source of the problem is, but this workaround works for me.
BTW, I observed the same problem on a Mac.
So here is what I found through testing. I went through the same steps initially and got the same error, but what is worthy of note in the output is the entry INFO client.py:669 access_token is expired:
MobileAssistant-Data> appcfg.py upload_data --config_file bulkloader.yaml --url=http://localhost:8080/remote_api --filename places.csv --kind=Place -e myEmailAddress@gmail.com
05:12 PM Uploading data records.
[INFO ] Logging to bulkloader-log-20151112.171238
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
2015-11-12 17:12:38,466 INFO client.py:669 access_token is expired. Now: 2015-11-12 22:12:38.466000, token_expiry: 2015-11-06 01:33:21
Error 302: --- begin server output ---
This looked somewhat like an issue I saw in the Remote API handler for the dev server that surfaced after ClientLogin was deprecated (but in the Python SDK). Just to test, I edited build.gradle to use the latest SDK version (1.9.28, up from 1.9.18) and ran it again:
MobileAssistant-Data> appcfg.py upload_data --config_file bulkloader.yaml --url=http://localhost:8080/remote_api --filename places.csv --kind=Place -e myEmailAddress@gmail.com
05:16 PM Uploading data records.
[INFO ] Logging to bulkloader-log-20151112.171615
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
2015-11-12 17:16:15,177 INFO client.py:669 access_token is expired. Now: 2015-11-12 22:16:15.177000, token_expiry: 2015-11-06 01:33:21
2015-11-12 17:16:15,565 INFO client.py:669 access_token is expired. Now: 2015-11-12 22:16:15.565000, token_expiry: 2015-11-06 01:33:21
2015-11-12 17:16:15,573 INFO client.py:571 Refreshing due to a 401 (attempt 1/2)
2015-11-12 17:16:15,575 INFO client.py:797 Refreshing access_token
2015-11-12 17:16:16,039 INFO client.py:571 Refreshing due to a 401 (attempt 2/2)
2015-11-12 17:16:16,040 INFO client.py:797 Refreshing access_token
... (ad infinitum)
This output now mirrors the Python Remote API issue exactly. So it seems that the same issue exists with the Java Remote API (the auth check has not been properly updated to use the new auth scheme).
The workaround in Python was to manually edit the SDK source code and stub out the auth check. I suspect a similar hack would be necessary for the Java SDK. It's not as straightforward, though, as the SDK would need to be rebuilt from source.
If I come across anything else I will update this answer with my findings. Note that this should work perfectly fine with a deployed application - it's only the dev server that is affected.
Update:
The culprit is the admin check in com/google/apphosting/utils/remoteapi/RemoteApiServlet.java as with the same issue in the Python SDK noted above. Unfortunately you cannot trivially rebuild the SDK from source, as the build target in build.xml only includes 'jsr107cache' and the rest of the build is done from pre-generated binaries. It looks like we'll have to just wait until this is fixed in a future release, but for now I will update the public bug.
For now I would recommend sticking to the documentation and only using the deployed app version for remote_api uploads.
Better to use the new UI in the Google Developer Console. URL: https://console.developers.google.com/project/<YOUR_PROJECT_ID>/datastore
You can see the Query and Indexes subsections there for your kinds and indexes (you can also use GQL).
NOTE: I also noticed that a particular kind does not appear in the Query section unless some data has been added to it (GQL also returns empty results). I do see that kind in the Indexes section, though.
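For example, a quick GQL query in that console (using the Place kind from the question) to check whether the uploaded entities are actually there might be:
SELECT * FROM Place LIMIT 10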

Spark 0.90 Stand alone connection refused

I am using Spark 0.9.0 in standalone mode.
When I try a streaming application in standalone mode, I get a connection refused exception.
I added the hostname in /etc/hosts and also tried with the IP alone. In both cases the worker got registered with the master without any issues.
Is there a way to solve this issue?
14/02/28 07:15:01 INFO Master: akka.tcp://driverClient#127.0.0.1:55891 got disassociated, removing it.
14/02/28 07:15:04 INFO Master: Registering app Twitter Streaming
14/02/28 07:15:04 INFO Master: Registered app Twitter Streaming with ID app-20140228071504-0000
14/02/28 07:34:42 INFO Master: akka.tcp://spark#127.0.0.1:33688 got disassociated, removing it.
14/02/28 07:34:42 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.165.35.96%3A38903-6#-1146558090] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/28 07:34:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#10.165.35.96:8910] -> [akka.tcp://spark#127.0.0.1:33688]: Error [Association failed with [akka.tcp://spark#127.0.0.1:33688]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark#127.0.0.1:33688]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /127.0.0.1:33688
I had a similar issue when running Spark in cluster mode. My problem was that the master was started with the hostname 'fluentd:7077' and not the FQDN. I edited
/sbin/start-master.sh
so that the master is launched with the address my remote nodes use to connect, via the --ip flag:
/usr/lib/jvm/jdk1.7.0_51/bin/java -cp :/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/conf:/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip fluentd.alex.dev --port 7077 --webui-port 8080
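An alternative that should amount to the same thing (a sketch, assuming the spark-env.sh mechanism these old standalone releases used) is to set the master address once in conf/spark-env.sh so the start scripts pick it up:
# conf/spark-env.sh - bind the standalone master to the FQDN that workers use to reach it
SPARK_MASTER_IP=fluentd.alex.dev
SPARK_MASTER_PORT=7077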
Hope this helps.

SaltStack: Getting Up and Running Minion on EC2

I am working through the SaltStack walkthrough to set up Salt on my EC2 cluster. I just edited /etc/salt/minion and added the public DNS of my Salt master.
master: ec2-54-201-153-192.us-west-2.compute.amazonaws.com
Then I restarted the minion. In debug mode, this put out the following:
$ sudo salt-minion -l debug
[DEBUG ] Reading configuration from /etc/salt/minion
[INFO ] Using cached minion ID: localhost.localdomain
[DEBUG ] loading log_handlers in ['/var/cache/salt/minion/extmods/log_handlers', '/usr/lib/python2.6/site-packages/salt/log/handlers']
[DEBUG ] Skipping /var/cache/salt/minion/extmods/log_handlers, it is not a directory
[DEBUG ] None of the required configuration sections, 'logstash_udp_handler' and 'logstash_zmq_handler', were found the in the configuration. Not loading the Logstash logging handlers module.
[DEBUG ] Configuration file path: /etc/salt/minion
[INFO ] Setting up the Salt Minion "localhost.localdomain"
[DEBUG ] Created pidfile: /var/run/salt-minion.pid
[DEBUG ] Chowned pidfile: /var/run/salt-minion.pid to user: root
[DEBUG ] Reading configuration from /etc/salt/minion
[DEBUG ] loading grain in ['/var/cache/salt/minion/extmods/grains', '/usr/lib/python2.6/site-packages/salt/grains']
[DEBUG ] Skipping /var/cache/salt/minion/extmods/grains, it is not a directory
[DEBUG ] Attempting to authenticate with the Salt Master at 172.31.21.27
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
Sure enough, 172.31.21.27 is the private IP of the master. So far this looks OK. According to the walkthrough, the next step is to accept the minion's key on the master:
"Now that the minion is started it will generate cryptographic keys and attempt to
connect to the master. The next step is to venture back to the master server and
accept the new minion's public key."
However, when I go to the master node and look for new keys, I don't see any pending requests.
$ sudo salt-key -L
Accepted Keys:
Unaccepted Keys:
Rejected Keys:
And the ping test does not see the minion either:
$ sudo salt '*' test.ping
This is where I'm stuck. What should I do next to get up and running?
Turn off iptables and run salt-key -L to check whether the key shows up. If it does, then you need to open ports 4505 and 4506 on the master for the minion to be able to connect to it. You could run lokkit -p tcp:4505 -p tcp:4506 to open these ports.
You likely need to add rules for 4505/4506 between the Salt master's and the minions' security groups. The Salt master needs these ports to be able to communicate with the minions.
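A rough example of what such a rule might look like with the AWS CLI (the group IDs are placeholders: the first is the master's security group, the second is the minions' group):
aws ec2 authorize-security-group-ingress --group-id sg-saltmaster --protocol tcp --port 4505-4506 --source-group sg-saltminions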