aerospike cluster crashed after index creation - amazon-web-services

We have a cluster on AWS of 4 t2.micro machines (1 CPU, 1 GB RAM, 15 GB SSD) that we were using to test Aerospike.
We used the AWS Marketplace AMI to install Aerospike v3 Community Edition, and changed only the aerospike.conf file to add a namespace stored on disk.
We had one namespace with two sets, totaling 18M records, occupying 2 GB of RAM and approximately 40 GB of disk space.
After creating an index on a 12M-record set, the system crashed.
Some info:
aql on the instance:
[ec2-user@ip-172-XX-XX-XXX ~]$ aql
2015-09-16 18:44:37 WARN AEROSPIKE_ERR_CLIENT Socket write error: 111
Error -1: Failed to seed cluster
Tail of the log (it keeps repeating only these lines):
Sep 16 2015 19:08:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:08:46 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:06 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
asmonitor:
$ asmonitor -h 54.XX.XXX.XX
request to 54.XX.XXX.XX : 3000 returned error
skipping 54.XX.XXX.XX:3000
***failed to connect to any hosts
asadm:
$ asadm -h 54.XXX.XXX.XX -p 3000
Aerospike Interactive Shell, version 0.0.10-6-gdd6fb61
Found 1 nodes
Offline: 54.207.67.238:3000
We tried restarting the instances; one of them is back but working as a standalone node, while the rest remain in the state described above.
The instances themselves are running, but the Aerospike service is not.

There is a guide dedicated to using Aerospike on Amazon EC2 and you probably want to follow it closely to get started.
When you see an AEROSPIKE_ERR_CLIENT "Failed to seed cluster" error, it means that your client cannot connect to any seed node in the cluster. A seed node is the first node the client connects to; from it the client learns the cluster's partition table and the addresses of the other nodes. You are running aql with the default host (127.0.0.1) and port (3000). Try passing -h and -p, or use --help for information on the flags.
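For example, pointing aql at one of the nodes directly (the address below is a placeholder for a node's private IP):
aql -h 172.31.0.10 -p 3000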
There are many details you're not including. Are these nodes all in the same Availability Zone of the same EC2 region? Did you configure your /etc/aerospike/aerospike.conf with mesh heartbeat configuration (the mode needed on Amazon EC2)? Simply put, can your nodes see each other? You're using what looks like a public IP, but your nodes need to reach each other through their local IP addresses; they have no idea what their public IP is unless you configure it. At the same time, clients may be connecting from other AZs, so you will need to set access-address correctly. See this discussion forum post on the topic: https://discuss.aerospike.com/t/problems-configuring-clustering-on-aws-ec2-with-3-db-instances/1676
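As a reference point, here is a minimal sketch of the network stanza that mesh clustering typically needs in aerospike.conf on EC2. The IP addresses are placeholders for your nodes' private IPs, and depending on the server version the seed directive is mesh-seed-address-port (older 3.x builds use mesh-address/mesh-port instead):
network {
    service {
        address any
        port 3000
        access-address 172.31.0.10            # this node's private IP (placeholder)
    }
    heartbeat {
        mode mesh                             # multicast is not available in EC2
        port 3002
        mesh-seed-address-port 172.31.0.11 3002   # another node's private IP (placeholder)
        mesh-seed-address-port 172.31.0.12 3002
        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
After editing, restart the aerospike service on each node and check that asadm sees all four nodes.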

Related

How to solve Memory issues with Paketo buildpack used to build a spring-boot app?

I am building a Docker image with the spring-boot-maven-plugin and deploying it to AWS Beanstalk. I use the plugin through the 2.4.3 Spring Boot starter dependency.
However, when the container is started, I get the error below.
I am a bit new to buildpacks, but I tried to solve it by playing with the buildpack environment variables as described on the website. However, that had absolutely no effect on the values shown in the error log below.
I found this GitHub issue, but I'm not sure whether it's relevant or how to use it.
I am using an AWS micro instance with 1 GB of total RAM. Deployment performs a rolling update, so while the new image is starting the old one is still running; at container start there may therefore be only about 300 MB available, although more memory is free during normal operation.
Why do I need this memory calculation? Can't I just disable it? When I build a Docker image of the app.jar myself and deploy it to AWS Beanstalk, it works well without any memory settings:
docker build . --build-arg JAR_FILE=./target/app.jar -t $APPLICATION_NAME
But I would love to use the image built by the spring-boot-maven-plugin.
Could someone advise on how to solve this?
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <configuration>
        <image>
            <name>${image.name}</name>
            <env>
                <tag>${project.version}</tag>
                <!-- <BPE_APPEND_JAVA_TOOL_OPTIONS>-XX:MaxDirectMemorySize=1M</BPE_APPEND_JAVA_TOOL_OPTIONS> -->
                <BPE_JAVA_TOOL_OPTIONS>-Xms1024m -Xmx3048m</BPE_JAVA_TOOL_OPTIONS>
            </env>
        </image>
    </configuration>
</plugin>
The AWS Beanstalk error during deployment:
Tue May 18 2021 18:07:14 GMT+0000 (UTC) INFO Successfully built aws_beanstalk/staging-app
Tue May 18 2021 18:07:22 GMT+0000 (UTC) ERROR Docker container quit unexpectedly after launch: 0M, -Xss1M * 250 threads
ERROR: failed to launch: exec.d: failed to execute exec.d file at path '/layers/paketo-buildpacks_bellsoft-liberica/helper/exec.d/memory-calculator': exit status 1. Check snapshot logs for details.
Tue May 18 2021 18:07:24 GMT+0000 (UTC) ERROR [Instance: i-0dc33dcb517e89ef9] Command failed on instance. Return code: 1 Output: (TRUNCATED)...pectedly after launch: 0M, -Xss1M * 250 threads
ERROR: failed to launch: exec.d: failed to execute exec.d file at path '/layers/paketo-buildpacks_bellsoft-liberica/helper/exec.d/memory-calculator': exit status 1. Check snapshot logs for details.
Hook /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
Tue May 18 2021 18:07:24 GMT+0000 (UTC) INFO Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
Tue May 18 2021 18:07:24 GMT+0000 (UTC) ERROR Unsuccessful command execution on instance id(s) 'i-0dc33dcb517e89ef9'. Aborting the operation.
Tue May 18 2021 18:07:24 GMT+0000 (UTC) ERROR Failed to deploy application.
Tue May 18 2021 18:07:24 GMT+0000 (UTC) ERROR During an aborted deployment, some instances may have deployed the new application version. To ensure all instances are running the same version, re-deploy the appropriate application version.
##[error]Error: Error deploy application version to Elastic Beanstalk
The Docker error log downloaded from AWS Beanstalk:
Docker container quit unexpectedly on Tue May 18 18:07:21 UTC 2021:
Setting Active Processor Count to 1
Calculating JVM memory based on 274300K available memory
unable to calculate memory configuration
fixed memory regions require 662096K which is greater than 274300K available for allocation: -XX:MaxDirectMemorySize=10M, -XX:MaxMetaspaceSize=150096K, -XX:ReservedCodeCacheSize=240M, -Xss1M * 250 threads
ERROR: failed to launch: exec.d: failed to execute exec.d file at path '/layers/paketo-buildpacks_bellsoft-liberica/helper/exec.d/memory-calculator': exit status 1
OK, so here's what this is telling us:
Calculating JVM memory based on 274300K available memory
The memory calculator is detecting a maximum amount of memory available in the container as 274300KB, or about 274M.
fixed memory regions require 662096K which is greater than 274300K available for allocation: -XX:MaxDirectMemorySize=10M, -XX:MaxMetaspaceSize=150096K, -XX:ReservedCodeCacheSize=240M, -Xss1M * 250 threads
This message is saying that the memory calculator needs at least 662096KB or 662M in its present configuration.
It's also breaking down why it needs/wants that much:
10M for direct memory
150096K for metaspace
240M for reserved code cache
250M for threads (thread stack specifically)
That's not counting the heap, which will require more (you seem to want at least 1G for the heap). The fixed regions alone add up to the reported total: 10240K + 150096K + 245760K + 256000K = 662096K.
This leaves two possibilities:
The container is not provisioned large enough. You need to give it more memory.
The memory calculator is not correctly detecting the memory limit.
If you suspect #2, look at the following. The memory calculator selects its max memory limit (i.e. the 274M in the example above) by looking in these places, in this order (a quick way to check both values is sketched after the list):
Check the configured container memory limit by looking at /sys/fs/cgroup/memory/memory.limit_in_bytes inside the container.
Check the system's max available memory by looking at /proc/meminfo and the MemAvailable metric, again, from inside the container.
If all else fails, it'll end up with a 1G fallback.
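A quick way to see the same numbers the calculator sees is to read them from inside a running container. A sketch, assuming the image is tagged my-app and includes a shell (the cgroup v1 path above is the one the calculator reads):
docker run --rm --entrypoint sh my-app -c 'cat /sys/fs/cgroup/memory/memory.limit_in_bytes; grep MemAvailable /proc/meminfo'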
If it's truly not working as described above, please open a bug and provide as much detail as you can.
Alternatively, you may tune the memory calculator. You can instruct it to give less memory to specific regions such that you reduce the total memory required to be less than the max available memory.
You can do that by setting the JVM memory flags in the JAVA_TOOL_OPTIONS environment variable (you have BPE_JAVA_TOOL_OPTIONS, which isn't right). See https://paketo.io/docs/buildpacks/language-family-buildpacks/java/#runtime-jvm-configuration.
For example, if you want to override the heap size then set -Xmx in JAVA_TOOL_OPTIONS to something custom. The memory calculator will see what you've set and adjust the remaining memory settings accordingly. Override as many as necessary.
To get things down to fit within 274M of RAM, you'd have to go really small: something like -Xss256K -XX:ReservedCodeCacheSize=64M -XX:MaxMetaspaceSize=64M -Xmx64M. I didn't test to confirm, but this shows the idea of what you need to do: reduce the memory settings so that the sum total fits within your container's memory limit.
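One way to sanity-check settings like these before redeploying is to run the buildpack-built image locally, passing the flags as a plain runtime environment variable and capping the container at roughly the memory Beanstalk leaves available. A sketch; the image tag and the 300m cap are placeholders:
docker run --rm -m 300m -e JAVA_TOOL_OPTIONS='-Xss256K -XX:ReservedCodeCacheSize=64M -XX:MaxMetaspaceSize=64M -Xmx64M' my-app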
This also does not take into account whether your application will actually be able to run within such small limits. If you go too small, you may at some point see OutOfMemoryErrors or StackOverflowErrors and your app will crash. You can also hurt performance by reducing the code cache size too much, since that is where the JIT stores bytecode it has optimized to native code. You could even cause GC issues, or degraded performance from excessive GC, if the heap isn't sized right. In short, be very careful if you're going to do this.

AWS EC2: cannot get bare metal instance

I have tried several times in the last two weeks to log on to a c5.metal instance. Each time I get "Initializing" in the status checks field, but after 10 minutes it is still "Initializing" and I'm not able to log on. I have had success with c5.metal before, but not any more.
Today I also tried to get an m5.metal instance. This time the instance successfully initialized after 10 minutes, but I was not able to log on with PuTTY. I stopped the instance; after about 30 minutes I tried again, and this time I did not get past "Initializing" in the status check field, so I stopped it after 15 minutes.
I get billed for the 10 to 15 minute bare metal wait periods, even when initialization doesn't complete. I have no problems with AWS virtual instances.
Thanks for any ideas on what I can do to get the bare metal instances to work.
To reproduce your situation, I did the following:
Launched an Amazon EC2 instance in Ohio:
Instance Type: c5.metal
AMI: Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
Network: In my Default VPC so that it uses a Public Subnet
Security Group: Default settings, which grants port 22 access from the Internet
Instance entered running state very quickly, Status Checks showed as Initializing
It took about 8 minutes until the status checks were showing 2/2 checks (it might have been faster, but I was testing other things in the meantime).
I was able to successfully login to the instance:
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-1065-aws x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Sat Jun 6 23:21:18 UTC 2020
System load: 0.02 Processes: 924
Usage of /: 13.7% of 7.69GB Users logged in: 0
Memory usage: 0% IP address for enp125s0: 172.31.9.77
Swap usage: 0%
0 packages can be updated.
0 updates are security updates.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
ubuntu@ip-172-31-9-77:~$
(Actually, I first tried to log in as ec2-user, and it took me a while to realize this was an Ubuntu AMI, so I connected as ubuntu.)
It is possible that the slow startup is due to the Operating System or hardware checking the 192GB of RAM that is allocated to the instance.
I booted another instance using an Amazon Linux 2 AMI and it required approximately 7 minutes before I could connect.
I also noticed that the c5.metal instances did not provide anything for "Get System Log" or "Get Instance Screenshot". This might be a result of using a bare-metal instance.
I joined John Rotenstein's twitch.tv channel and he showed how he got a c5.metal instance. What I learned is that if a metal instance does not come up in the Availability Zone you chose, try launching a new instance in a different Availability Zone. For example, I had a c5.metal instance in us-east-2a. Following John's directions, I launched an instance in us-east-2c, and after about 8 minutes the instance was ready for use.
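If you want to retry the launch in a specific Availability Zone from the CLI, you can pin the placement at launch time. A sketch with placeholder AMI, key pair, and security group IDs (in a non-default VPC, pass --subnet-id for a subnet in the target AZ instead):
aws ec2 run-instances --instance-type c5.metal --image-id ami-0123456789abcdef0 --key-name my-key --security-group-ids sg-0123456789abcdef0 --placement AvailabilityZone=us-east-2c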

Orderer disconnections in a Hyperledger Fabric application

We have a Hyperledger Fabric application. The main application is hosted on AWS VMs, and the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable, and availability was restored in approximately 8 minutes. Per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The ZooKeeper VM was also redeployed at the same time.
The day after this event occurred, we started noticing that an orderer goes offline and comes back online after a few seconds. This disconnection/reconnection recurs regularly, with a gap of 12 hours and 10 minutes between occurrences.
We have noticed two things.
In the log we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
Another thing we noticed is that in logs prior to the VM failure event there were 3 Kafka brokers, but we can see only 2 Kafka brokers in the logs after this event.
Can someone guide me on this? How do I resolve this problem?
Additional information: we went through the Kafka logs from the day after the VM was redeployed and noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
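As a side note on that exception: the rejected size is suspiciously large because it is four ASCII bytes being interpreted as a length prefix, which usually means something sent a plain HTTP request to the Kafka listener port rather than a Kafka-protocol message. The decoding can be checked with:
printf '%x\n' 1195725856    # prints 47455420, the ASCII bytes for "GET "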
It seems that we have a solution but it needs to be validated. Once the solution is validated, I will post it on this site.

EC2 Security Group not connecting to my IP

Seems like a basic job, but for some reason it is not working for me. I wish to access my EC2 instances from my office IP only.
I went into my security group and added an SSH rule with the source restricted to my IP only.
But this does not seem to be working for me at all. I get connection denied when I try to connect via WinSCP or from a terminal.
Everything works if I change my source to Everywhere (0.0.0.0/0).
Does anyone have any pointers for me, please?
Log in to the EC2 instance using the method that works and issue the command:
who am i
It will say something like:
ec2-user pts/0 2016-02-29 15:06 (104.66.242.192)
Use the IP address shown for you (not the one above) in the security group rule.
Although "who am i" work fine. However I'd like to add two more solutions.
both are very easy.
Solution 1:
Step 1: Open security group for all IP's (0.0.0.0/0) for a while.
Step 2: Make ssh connection to your server.
Step 3: run "w" command and check the output in FROM column.
ubuntu@ip-172-31-39-228:~$ w
23:20:09 up 5 min, 1 user, load average: 0.08, 0.08, 0.04
USER TTY FROM LOGIN# IDLE JCPU PCPU WHAT
ubuntu pts/0 52.95.75.17 23:20 0.00s 0.01s 0.00s w
Step 4: In the security group, replace 0.0.0.0/0 with this IP as a /32 (e.g. 52.95.75.17/32).
Solution 2:
Step 1: Open the security group to all IPs (0.0.0.0/0) for a while.
Step 2: Make an SSH connection to your server.
Step 3: Check the last-login info in the welcome message.
For example:
Learn more about enabling ESM Apps service at https://ubuntu.com/esm
Last login: Thu Feb 9 23:21:42 2023 from 52.95.75.17
ubuntu@ip-172-31-39-228:~$
ubuntu@ip-172-31-39-228:~$
Step 4 (optional): If the IP address is not shown in the welcome message, run the "last" command.
ubuntu@ip-172-31-39-228:~$
ubuntu@ip-172-31-39-228:~$ last
ubuntu pts/2 52.95.75.17 Thu Feb 9 23:33 still logged in
ubuntu pts/1 52.95.75.17 Thu Feb 9 23:21 still logged in
Step 5: In the security group, replace 0.0.0.0/0 with this IP as a /32 (e.g. 52.95.75.17/32).
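If you prefer to make the final change from the CLI once you know your office IP, something like the following should work; the group ID is a placeholder and the IP is the one you found above:
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 52.95.75.17/32
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0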
Feel free to use my PowerShell script for this.
The script detects your public IP and adds it to the inbound rules of dedicated RDP and SSH security groups.
If these groups do not exist, the script will create them and attach them to the appropriate instances.
https://github.com/manuelh2410/public/blob/1/AWSIP_Linux_Win.ps1

AWS VPN issue routing to 2nd ip block

I've just setup a VPN link between our local network and an Amazon VPC.
Our local network has two ip blocks of interest:
192.168.0.0/16 - block local-A
10.1.1.0/24 - block local-B
The AWS VPC has an IP block of:
172.31.0.0/16 - block AWS-A
I have setup the VPN connection with static routes to the local-A and local-B ip blocks.
I can connect from:
local-A to AWS-A and
AWS-A to local-A.
I can't connect from:
AWS-A to local-B (e.g. 172.31.17.135 to 10.1.1.251)
From the 172.31.17.135 server, 10.1.1.251 seems to resolve to an Amazon server rather than a server on our local network. This is despite setting up local-B (10.1.1.0/24) as a route on the VPN. See the tracert output...
tracert 10.1.1.251
Tracing route to ip-10-1-1-251.ap-southeast-2.compute.internal [10.1.1.251]
over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 169.254.247.13
2 <1 ms <1 ms <1 ms 169.254.247.21
3 33 ms 35 ms 36 ms 169.254.247.22
4 34 ms 35 ms 33 ms ip-10-1-1-251.ap-southeast-2.compute.internal [10.1.1.251]
How can I change the config so that the 10.1.1.251 address resolves to the server on my local network when accessed from within AWS?
Disclosure: I've asked this same question on Server Fault too.
Edit: tracert to the other subnet
tracert -d 192.168.0.250
Tracing route to 192.168.0.250 over a maximum of 30 hops
1 <1 ms <1 ms <1 ms 169.254.247.13
2 <1 ms <1 ms <1 ms 169.254.247.21
3 36 ms 35 ms * 169.254.247.22
4 35 ms 36 ms 34 ms 192.168.0.250
Edit #2: tracetcp to the bad and good destinations:
>tracetcp 10.1.1.251 -n
Tracing route to 10.1.1.251 on port 80
Over a maximum of 30 hops.
1 0 ms 0 ms 0 ms 169.254.247.13
2 0 ms 0 ms 0 ms 169.254.247.21
3 31 ms 31 ms 16 ms 169.254.247.22
4 * * * Request timed out.
5 ^C
>tracetcp 192.168.0.250 -n
Tracing route to 192.168.0.250 on port 80
Over a maximum of 30 hops.
1 0 ms 0 ms 0 ms 169.254.247.13
2 0 ms 0 ms 0 ms 169.254.247.21
3 47 ms 46 ms 32 ms 169.254.247.22
4 Destination Reached in 31 ms. Connection established to 192.168.0.250
Trace Complete.
Hop #3 is the VPN endpoint on the local side. We are getting that far, and then something is failing to route past it. So yes, the problem is on our side; the AWS setup is fine. I'll update when we get to the bottom of it.
Edit #3
Fixed! The problem was the Barracuda firewall that we run internally. We had created incoming and outgoing holes in the firewall, and there were no logs showing blocked requests to the destination, so we had ruled the firewall out as the cause early on. In desperation we briefly switched the firewall off and the connection went through OK. Since then we have updated the firmware on the firewall, which fixed the issue, and the firewall now allows the traffic through.
tracert and tracetcp were helpful diagnostics, but the firewall never showed up in the list of hops (even now that it's fixed), so it was hard to tell that the firewall was blocking the traffic. I would have expected all traffic to pass through the firewall and for the firewall to appear in the hop list. Thanks for the help.
You appear to be blurring the distinction between routing and name resolution. These two things are completely unrelated.
Looking at the round-trip times, it certainly doesn't look like you're hitting an AWS server. I would speculate that you're routing fine, but Amazon's internal DNS service is showing you the wrong name.
Have you actually tried connecting to your 10.1.1.251 machine? Intuition tells me it's fine, you're just seeing a hostname you don't expect.
The fix, of course, is to configure your EC2 VPC instances to use your corporate DNS servers instead of Amazon's. Or you can just use tracert -d so that you don't see this. :)
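If you do go the corporate-DNS route, the usual mechanism is a custom DHCP options set for the VPC; existing instances pick it up when their DHCP lease renews or on reboot. A sketch with placeholder DNS server, options set, and VPC IDs:
aws ec2 create-dhcp-options --dhcp-configurations "Key=domain-name-servers,Values=10.1.1.10"
aws ec2 associate-dhcp-options --dhcp-options-id dopt-0123456789abcdef0 --vpc-id vpc-0123456789abcdef0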
Okay, update. Your references to resolution and the latency involved made me think otherwise. I'm still inclined to think you actually have connectivity at some level, though, because 36 milliseconds seems inexplicably high for a hop that stays inside AWS. What's the round-trip time to the other subnet?
Can you pull the Ethernet cable from 10.1.1.251 and see if it stops responding?
If you trace from a 10.1.1.x machine on your LAN toward the VPC or try to connect to a VPC machine from 10.1.1.x... what happens there?