SolrCloud: round-robin replicas not working when using Collections API

I've followed the "Getting Started Guide", the Two-Shards / Two-Replicas scenario, and everything worked perfectly.
Then I started using the Collections API, which is the preferred way of managing collections, shards and replication.
I launched two instances locally (and later on AWS, with the same problem).
I created a new collection with two shards using the following command:
/admin/collections?action=CREATE&name=collection1&numShards=2&collection.configName=collection
This successfully created two shards, one on each instance.
Then I launched another instance, expecting it to automatically set itself as a replica for the first shard, just like in the example. That didn't happen.
Is there something I'm missing?
There were two ways I was able to achieve this:
Manually, using the Collections API, I added a replica to shard1 and then another to shard2.
This is not good enough, as I need this to happen automatically with Auto Scaling; otherwise I'd have to micro-manage each server's "role" - which replicas of which shards of which collections it handles - which complicates things a lot. (A minimal sketch of the manual call is shown below.)
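A minimal sketch of that manual step, assuming a Solr version whose Collections API supports the ADDREPLICA action; localhost:8080 and the collection/shard names are placeholders taken from the example above:

# Hedged sketch: add one replica per shard via the Collections API.
# Assumes Solr is reachable at this host/port and that ADDREPLICA is
# available in the Solr version in use.
import requests

SOLR = "http://localhost:8080/solr"
for shard in ("shard1", "shard2"):
    resp = requests.get(
        f"{SOLR}/admin/collections",
        params={"action": "ADDREPLICA", "collection": "collection1", "shard": shard},
    )
    resp.raise_for_status()
    print(shard, resp.json()["responseHeader"])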
The second way, which I couldn't find any documentation for, is to launch an instance with a folder named "collectionX" containing a file named core.properties with the following line:
collection=collection1
Each instance I launched that way was automatically added as a replica in a round-robin way.
(This also works with several collections.)
That's actually not a bad way at all, as I can pass parameters when I launch an AMI/instance in AWS.
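For example, a minimal sketch with boto3, assuming the AMI already has Solr baked in and uses /opt/solr/home as its Solr home; the AMI ID, instance type and paths are placeholders:

# Hedged sketch: launch an instance whose user data drops a core.properties
# file so the node registers itself as a replica of collection1 on startup.
# ami-12345678, t2.micro and /opt/solr/home are placeholders.
import boto3

user_data = """#!/bin/bash
mkdir -p /opt/solr/home/collectionX
echo "collection=collection1" > /opt/solr/home/collectionX/core.properties
"""

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-12345678",   # placeholder AMI with Solr pre-installed
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
)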
Thanks everyone.
Amir

1) You are running the wrong command; the complete command is as follows:
curl 'http://localhost:8080/solr/admin/collections?action=CREATE&name=corename&numShards=2&replicationFactor=2&maxShardsPerNode=10'
Here I have specified the replication factor, which is what causes replicas to be created for your shards.

Related

How to sync data between two Elasticache Redis Instances

I have two AWS ElastiCache instances. One of the instances (let's say instance A) has very important data sets and connections on it; downtime is unacceptable. Because of this, instead of doing a normal migration (like preventing the source from new writes, taking a dump of it, and restoring it to the new instance), I'm trying to sync instance A's data to another ElastiCache instance (let's say instance B). As I said, this process should be downtime-free. In order to do that, I tried RedisShake, but because AWS restricts users from running certain commands (bgsave, config, replicaof, slaveof, sync, etc.), RedisShake does not work with AWS ElastiCache. It gives the error below.
2022/04/04 11:58:42 [PANIC] invalid psync response, continue, ERR unknown command `psync`, with args beginning with: `?`, `-1`,
[stack]:
2 github.com/alibaba/RedisShake/redis-shake/common/utils.go:252
github.com/alibaba/RedisShake/redis-shake/common.SendPSyncContinue
1 github.com/alibaba/RedisShake/redis-shake/dbSync/syncBegin.go:51
github.com/alibaba/RedisShake/redis-shake/dbSync.(*DbSyncer).sendPSyncCmd
0 github.com/alibaba/RedisShake/redis-shake/dbSync/dbSyncer.go:113
github.com/alibaba/RedisShake/redis-shake/dbSync.(*DbSyncer).Sync
... ...
I've tried rump for this as well, but it isn't stable enough to handle any important process. First of all, it doesn't run as a background process; when the first sync finishes it exits with signal: exit done, so it won't pick up ongoing changes after that first pass.
Second, it only recognizes created/modified keys/values in each run. For example, in the first run the key apple equals pear and is synced to the destination as-is, but when I deleted the key apple and its value in the source and ran the rump sync script again, it was not deleted in the destination. So it isn't truly syncing the source and the destination. Plus, the last commit to the rump GitHub repo is about 3 years old; it seems like a somewhat outdated project to me.
After all this information and these attempts, my question is: is there a way to sync two ElastiCache for Redis instances? As I said, there is no room for downtime in my case. If anyone with this kind of experience has a bulletproof suggestion, I would much appreciate it; I searched but unfortunately didn't find one.
Thank you very much,
Best Regards.
If those two ElastiCache Redis clusters exist in the same account but in different regions, you can consider using AWS ElastiCache Global Datastore.
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastores-Console.html
It has some restrictions on regions and node types, and both clusters must have the same configuration in terms of number of nodes, etc.
Limitations - https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastores-Getting-Started.html
Otherwise, there's a simple brute-force mechanism, and I believe you could code it yourself.
Create a client EC2 instance (let's call it Sync-er) that subscribes to a pub-sub channel from your EC Redis instance A.
Whenever there is new data, Sync-er makes the corresponding WRITE commands on EC Redis instance B (a rough sketch follows below).
NOTE - You'll have to make sure that the clusters are in connectable VPCs.
ElastiCache is only reachable from resources within its VPC. If your instance A and instance B are in different VPCs, you'll have to peer them or connect them via Transit Gateway.
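A minimal sketch of that Sync-er loop with redis-py, assuming keyspace event notifications have been enabled on instance A through its parameter group (notify-keyspace-events, since CONFIG SET is restricted on ElastiCache) and that only simple string keys need mirroring; the hostnames are placeholders:

# Hedged sketch: mirror SET/DEL events from instance A to instance B.
# Assumes notify-keyspace-events (e.g. "KEA") is enabled via the ElastiCache
# parameter group, because CONFIG SET is blocked on ElastiCache.
import redis

source = redis.Redis(host="instance-a.example.cache.amazonaws.com", port=6379)
target = redis.Redis(host="instance-b.example.cache.amazonaws.com", port=6379)

pubsub = source.pubsub()
pubsub.psubscribe("__keyevent@0__:set", "__keyevent@0__:del")

for message in pubsub.listen():
    if message["type"] != "pmessage":
        continue
    event = message["channel"].decode().rsplit(":", 1)[-1]  # "set" or "del"
    key = message["data"]                                    # the affected key
    if event == "set":
        value = source.get(key)
        if value is not None:
            target.set(key, value)
    elif event == "del":
        target.delete(key)

This only covers plain SET/DEL on string keys and ignores TTLs, other data types and race conditions, so treat it as a starting point rather than a bulletproof solution.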

Automating copy of Google Cloud Compute instances between projects

I need to move more than 50 compute instances from a Google Cloud project to another one, and I was wondering if there's some tool that can take care of this.
Ideally, the needed steps could be the following (I'm omitting regions and zones for the sake of simplicity):
Get all instances in source project
For each instance get machine sizing and the list of attached disks
For each disk create a disk-image
Create a new instance, of type machine sizing, in target project using the first disk-image as source
Attach remaining disk-images to new instance (in the same order they were created)
I've been checking on both Terraform and Ansible, but I have the feeling that neither of them supports creating disk images, meaning that I could only use them for the last 2 steps.
I'd like to avoid writing a shell script because it doesn't seem like a robust option, but I can't find tools that can help me do the whole process either.
Just as a side note, I'm doing this because I need to change the subnet for all my machines, and it seems you can't do that on already-created machines; you need to clone them to change the network.
There is no GCP tool to migrate instances from one project to another.
I was, however, able to find an Ansible module to create images.
In Ansible:
You can specify the "source_disk" when creating a "gcp_compute_image", as mentioned here.
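If you'd rather script that step directly instead of (or alongside) Ansible, a rough sketch with the google-cloud-compute Python client could look like this; the project, zone and disk names are placeholders and error handling is omitted:

# Hedged sketch: create an image from an existing disk in the source project.
# Project, zone and disk names are placeholders.
from google.cloud import compute_v1

project = "source-project"
zone = "europe-west1-b"
disk_name = "instance-1-boot-disk"

image = compute_v1.Image(
    name=f"{disk_name}-image",
    source_disk=f"projects/{project}/zones/{zone}/disks/{disk_name}",
)

client = compute_v1.ImagesClient()
operation = client.insert(project=project, image_resource=image)
operation.result()  # wait for the image to be created
print("created image", image.name)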
Frederic

Passing variables between EC2 instances in multi-step AWS data pipeline

I have a pipeline setup wherein I have 3 main stages:
1) take input from a zipped file and unzip it in S3; run some basic verification on each file to guarantee its integrity, then move to step 2
2) kick off 2 simultaneous processing tasks on separate EC2 instances (parallelizing this step saves us a lot of time, so we need it for efficiency's sake). Each EC2 instance will run data processing steps on some of the files in S3 which were unzipped in step 1; the files required are different for each instance.
3) after the 2 simultaneous processes are both done, spin up another EC2 instance to do the final data processing. Once this is done, run a cleanup job to remove the unzipped files from s3, leaving only the original zip file in its place.
So, one of the problems we're running into is that we have 4 EC2 instances that run this pipeline process, but there are some global parameters we would like each EC2 instance to have access to. If we were running on a single instance, we could of course use shell variables to accomplish this, but we really need the separate instances for efficiency. Currently our best idea is to store a flat file in the S3 bucket that contains these global variables, read them on initialization, and write back to the file if they need to change. This is ugly, and it seems like there should be a better way, but we can't figure one out yet. I saw there's a way to set parameters that can be accessed from any part of the pipeline, but it looks like you can only set these at the pipeline level, not at the granularity of each run of the pipeline. Does anyone have any resources that could help here? Much appreciated.
We were able to solve this by using DynamoDB to keep track of variables/state. The pipeline itself doesn't have any mechanism to do this other than parameter values, which unfortunately only work per pipeline, not per job. You'll need to set up a DynamoDB table and then use the pipeline job ID to keep track of state, connecting via the CLI tools or an SDK.
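A minimal sketch of that pattern with boto3, assuming a table named pipeline_state keyed by job_id already exists; the table, key and attribute names are illustrative:

# Hedged sketch: read/write shared pipeline variables keyed by the job ID.
# The table name "pipeline_state" and its key schema are assumptions.
import boto3

table = boto3.resource("dynamodb").Table("pipeline_state")

def write_state(job_id, **variables):
    # Store/overwrite the global variables for this pipeline run.
    table.put_item(Item={"job_id": job_id, **variables})

def read_state(job_id):
    # Fetch the variables on instance startup.
    return table.get_item(Key={"job_id": job_id}).get("Item", {})

# Example: step 1 records which files were unzipped; later steps read it back.
write_state("df-0123456789", unzipped_prefix="s3://my-bucket/run-42/")
print(read_state("df-0123456789"))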

AWS - Keeping AMIs updated

Let's say I've created an AMI from one of my EC2 instances. Now, I can add this manually to the LB or let the Auto Scaling group do it for me (based on the conditions I've provided). Up to this point everything is fine.
Now, let's say my developers have added a new functionality and I pull the new code on the existing instances. Note that the AMI is not updated at this point and still has the old code. My question is about how I should handle this situation so that when the autoscaling group creates a new instance from my AMI it'll be with the latest code.
Two ways come into my mind, please let me know if you have any other solutions:
a) keep AMIs updated all the time; meaning that whenever there's a pull-request, the old AMI should be removed (deleted) and replaced with the new one.
b) have a start-up script (cloud-init) on AMIs that will pull the latest code from repository on initial launch. (by storing the repository credentials on the instance and pulling the code directly from git)
Which one of these methods is better? And if neither is good, then what's the best practice to achieve this goal?
Given that almost anything in AWS can be automated using the API, it again comes down to the specific use case at hand.
At the outset, people would recommend having a base AMI with the necessary packages installed and configured, plus an init script that downloads the source code so it is always the latest. The important factor to account for here is the time taken to check out or pull the code, configure the instance, and make it ready to put to work. If that time is long, it would be a bad idea to use that strategy for auto scaling, as the warm-up time combined with Auto Scaling and CloudWatch statistics could produce a different reality than expected [maybe / maybe not - but the probability is not zero]. This is when you might consider baking a new AMI frequently. That would let you minimize the time instances need to prepare themselves for the war against the traffic.
I would recommend measuring and seeing which is convenient and cost effective. It costs real money to pull down an instance and relaunch it from the AMI; however, that's the tradeoff you need to make.
I have answered somewhat open-endedly, because the question is also somewhat open-ended.
People have started using Chef, Ansible and Puppet, which perform configuration management. These tools add a different level of automation altogether; you may want to explore that option as well. A similar approach is using Docker or other containers.
a) keep AMIs updated all the time; meaning that whenever there's a
pull-request, the old AMI should be removed (deleted) and replaced
with the new one.
You shouldn't store your source code in the AMI. That introduces a maintenance nightmare and issues with autoscaling as you have identified.
b) have a start-up script (cloud-init) on AMIs that will pull the
latest code from repository on initial launch. (by storing the
repository credentials on the instance and pulling the code directly
from git)
Which one of these methods are better? and if both are not good, then
what's the best practice to achieve this goal?
Your second item, downloading the source on server startup, is the correct way to go about this.
Other options would be the use of Amazon CodeDeploy or some other deployment service to deploy updates. A deployment service could also be used to deploy updates to existing instances while allowing new instances to download the latest code automatically at startup.
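As one hedged sketch of that second option, the Auto Scaling group could point at a launch template whose user data pulls the latest code at boot; the AMI ID, instance type, repository URL and paths below are placeholders, and in practice you'd use a deploy key or instance-profile-based access rather than credentials baked into the script:

# Hedged sketch: a launch template whose user data pulls the latest code on boot.
# AMI ID, instance type, repo URL and target path are placeholders.
import base64
import boto3

user_data = """#!/bin/bash
set -e
git clone --depth 1 https://example.com/myorg/myapp.git /opt/myapp
systemctl restart myapp
"""

ec2 = boto3.client("ec2")
ec2.create_launch_template(
    LaunchTemplateName="app-latest-code",
    LaunchTemplateData={
        "ImageId": "ami-12345678",       # base AMI without source code
        "InstanceType": "t3.micro",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)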

Can I run 1000 AWS micro instances simultaneously?

I have a simple pure C program, that takes an integer as input, runs for a while (let's say an hour) and then returns me a text file. I want to run this program 1000 times with input integers from 1 to 1000.
Currently I'm running this program in parallel (4 processors), which takes 250 hours. The program is such that it fits on an AWS micro instance (I've tested it). Would it be possible to use 1000 micro instances in AWS to do the whole job in about an hour (at a cost of ~$20 - $0.02/instance)?
If it is possible, does anybody have some guidelines on how to do that?
If it is not, does anybody have an also low-budget alternative to that?
Thanks a lot!
In order to achieve this, you will need to:
Create an S3 bucket to store your bootstrapping scripts, application, input data and output data
Custom lightweight AMI: you might want to create a custom lightweight AMI which knows how to download the bootstrapping script
Bootstrapping script: downloads your software from your S3 bucket, parses a custom instance tag which will contain the integer [1..1000] and downloads any additional data
Your application: does the processing work
End-of-processing script: uploads the result to another result S3 bucket and terminates the instance; you might also want to send an SNS notification to communicate the end-of-processing status
If you need result consolidation, you might want to create another instance and use it as a coordinator, waiting for all "end of processing" notifications in order to finish the processing. In that case you might also consider using Amazon's Hadoop MapReduce engine in the future, since it will do almost all of this heavy lifting for free. A rough launch-side sketch follows below.
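A minimal sketch of the launch side of that plan with boto3, tagging each instance with its input integer so the bootstrapping script can read it back; the AMI ID and tag name are assumptions, and batching plus account instance limits (see the last answer below) are ignored for brevity:

# Hedged sketch: launch one tagged micro instance per input integer.
# ami-12345678 and the "JobInput" tag name are placeholders.
import boto3

ec2 = boto3.client("ec2")
for i in range(1, 1001):
    ec2.run_instances(
        ImageId="ami-12345678",        # custom AMI with the bootstrapping script
        InstanceType="t1.micro",
        MinCount=1,
        MaxCount=1,
        InstanceInitiatedShutdownBehavior="terminate",
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "JobInput", "Value": str(i)}],
        }],
    )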
You didn't specify what language you'd like to do this in, and since the app you use to deploy your program doesn't have to be in the same language, I would recommend C#. There are a number of examples around the web on how to programmatically spawn new Amazon instances using the SDK. For example:
How to start an Amazon EC2 instance programmatically in .NET
You can create your own AMIs with the program already present, but that might end up being a pain if you want to make adjustments, since it will require you to recreate the entire AMI. I'd recommend creating an extra instance, or simply hosting the program in a location accessible from a public URL. Then I would create some kind of service, installed on the AMI ahead of time, that lets me specify the URL to download the app from along with whatever command-line parameters I want for that particular instance.
Hope this helps.
Be careful: by default you can only spawn 20 instances per region. You need to ask Amazon to raise the limit in order to use more instances.
If you want low cost and don't mind possible delays, you should use Spot Instances.
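If you go the Spot route, a hedged sketch of asking for Spot capacity with boto3's run_instances; the AMI ID and maximum price are placeholders:

# Hedged sketch: request the same instance as Spot capacity to cut cost.
# The AMI ID and MaxPrice are placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-12345678",
    InstanceType="t1.micro",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"MaxPrice": "0.01"},  # pay at most $0.01/hour
    },
)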