Why does a 2-node Solandra cluster perform indexing worse than a single-node cluster? - solrj

In indexing tests using SolrJ, 2 Solandra nodes are performing worse than 1.
Each node runs with -Xms1G -Xmx12G.
There is a single index of ~10M docs; each doc is ~4KB with a unique id. I built the index up to about 6M docs on a single node, then added a new node to the ring and used nodetool "move" to assign new tokens and balance the ring.
Using all Solandra default configs - for example: solandra.maximum.docs.per.shard = 1048576, solandra.index.id.reserve.size = 16384, solandra.shards.at.once = 4
Nodetool ring shows:
node1  Up  Normal  35.11 GB  50.00%  0
node2  Up  Normal  54.5 GB   50.00%  85070591730234615865843651857942052864
Indexing performance:
single node: 166 docs/s
2 nodes (sending to a single node): 111 docs/s
2 nodes (sending to both in parallel): 55 docs/s (see note below)
(note) I was sending batches of 100K docs (a batch is built by accumulating a list of SolrInputDocuments and then committing the whole list); when I switched to batches of 10K, performance improved somewhat, to 98 docs/s
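For reference, the batch building looks roughly like this (a simplified SolrJ sketch; the endpoint URL and field names are placeholders, and older SolrJ versions use CommonsHttpSolrServer instead of HttpSolrServer):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // placeholder endpoint: one Solandra node exposing the index as a Solr core
        SolrServer server = new HttpSolrServer("http://node1:8983/solandra/myindex");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {               // 10K docs per batch
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));      // unique id per doc
            doc.addField("body", "...roughly 4KB of text...");
            batch.add(doc);
        }
        server.add(batch);  // send the whole list in one request
        server.commit();    // one commit per batch
    }
}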
Some questions:
In general, for both indexing and searching, how can I get Solandra to perform better with more than 1 node?
Why does indexing performance degrade with 2 nodes versus 1? When should I expect a performance improvement?
What is the recommended way to index documents with Solandra: sending to a single node in the ring or to multiple nodes?
What is the recommended way to query with Solandra: sending query requests to a single node or to multiple nodes?
Sending all query requests to a single node in a 2-node cluster performs roughly the same as a single-node cluster; any ideas why?

Related

Is multi-node SageMaker training batched per-node or shared?

I am using TensorFlow, and I am noticing that individual steps are slower with multiple nodes than with one, so I am a bit confused as to what constitutes a step across multiple training nodes on SageMaker.
If my batch size is 10 and I have 5 training nodes, is a "step" 2 from each node or 10 from each node?
What if I have a batch size of 1 and 5 nodes?
Note: a 'node' here is an individual training instance; the instance count comes from train_instance_count=5.
Please look at this notebook for an example of distributed training with TF: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb
"Each instance will predict a batch of the dataset, calculate loss and minimize the optimizer. One entire loop of this process is called training step.
A global step is a global variable shared between the instances. It's necessary for distributed training, so the optimizer will keep track of the number of training steps between runs:
train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())
That is the only required change for distributed training!"

Different types of shards in Elasticsearch

In my AWS Elasticsearch Service domain, the different types of shards and their counts are shown as follows:
What do the different shard types mean here? I want to create an index of my own in a new domain and specify its number of shards. I am thinking of using this API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html
What does the number of shards in the API request object correspond to in the picture?
Also, since I understand that the number of shards cannot be increased later, is there any disadvantage to specifying a large number of shards, e.g. 40 shards for 20 nodes?
My guess:
You have 1 index with the default of 5 shards and 1 replica.
You have a .kibana index (with the Kibana configurations) with the default of 1 shard and 1 replica.
Replicas will only be allocated on different instances, since there is no value in having multiple copies of a shard on the same node. If you only have 1 node, your replicas will not be allocated; those are your 6 unassigned shards.
Initializing is when new shards are being created; relocating is when existing ones are currently being moved around.
If you create another index, your shards will be created according to your definition (or the default of 5 shards and 1 replica).
Every shard has a specific overhead in terms of memory and file handles. Also the search results of each shard need to be combined. So having dozens of shards per node is fine, but avoid having hundreds or even thousands. And remember: Every index will add 10 shards by default (5 primary, 5 replica), so the number of indices will make your number of shards grow over time.
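For example, creating an index with explicit shard and replica counts through that create-index API could look like the sketch below (using the Elasticsearch low-level Java REST client; the host, index name, and counts are placeholders, and an AWS domain would additionally need an access policy or request signing):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        // placeholder endpoint; point this at your own cluster or domain
        RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build();

        // number_of_shards is fixed at index creation time,
        // number_of_replicas can still be changed later
        Request request = new Request("PUT", "/my-index");
        request.setJsonEntity(
                "{ \"settings\": { \"number_of_shards\": 5, \"number_of_replicas\": 1 } }");

        Response response = client.performRequest(request);
        System.out.println(response.getStatusLine());
        client.close();
    }
}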

MapReduce: how does it work?

What is MapReduce and how does it work?
I have tried reading some links but couldn't clearly understand the concept.
Can anyone please explain it in simple terms? Any help would be appreciated.
I am going to explain with an example.
Suppose you have weather temperature data for the last 100 years and you want to know the highest temperature for each year. Suppose the total size of the data is 100 PB. How will you tackle this problem? We cannot process data at that scale in a SQL database like Oracle or MySQL.
In Hadoop there are mainly two components:
Hadoop Distributed File System (HDFS)
MapReduce
HDFS is used to store the data in a distributed environment, so HDFS will store your 100 PB of data across the cluster, which may be 2 machines or 100. By default the data is divided into 64 MB blocks and stored on different machines in the cluster.
Now we move on to processing the data. For processing data in a Hadoop cluster we have the MapReduce framework, in which we write the processing logic. We need to write MapReduce code to find the maximum temperature.
Structure of the MapReduce code (a simplified skeleton of the standard Hadoop Mapper and Reducer classes):

public class MaxTemperature {

    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) {
            // processing logic for the mapper:
            // parse one input line and emit (year, temperature)
        }
    }

    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) {
            // processing logic for the reducer:
            // emit (year, maximum temperature for that year)
        }
    }
}
Whatever you write in the map() method is run by the data nodes in parallel, each on its own 64 MB chunk of data, and generates output.
The output of all the mapper instances is then shuffled and sorted, and passed to the reduce() method as input.
The reducer generates the final output.
In our example, suppose Hadoop starts the 3 mappers below:
64 MB chunk of data -> mapper 1 -> (year, temperature): (1901,45), (1902,34), (1903,44)
64 MB chunk of data -> mapper 2 -> (year, temperature): (1901,55), (1902,24), (1904,44)
64 MB chunk of data -> mapper 3 -> (year, temperature): (1901,65), (1902,24), (1903,46)
The output of all mappers is passed to the reducer:
output of all mappers -> reducer -> (1901,65), (1902,34), (1903,46), (1904,44)
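To make the reducer's role concrete, here is a minimal sketch of such a reducer, assuming the standard org.apache.hadoop.mapreduce API (class and field names are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text year, Iterable<IntWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        // after shuffle and sort, all values the mappers emitted for this year arrive together
        int max = Integer.MIN_VALUE;
        for (IntWritable temperature : temperatures) {
            max = Math.max(max, temperature.get());
        }
        // e.g. for 1901 the mappers emitted 45, 55 and 65, so (1901,65) is written out
        context.write(year, new IntWritable(max));
    }
}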
A MapReduce job has a Mapper and a Reducer.
Map is a common functional programming tool which applies a single operation to multiple data items. For example, if we have the array
arr = [1,2,3,4,5]
and invoke
map(arr,*2)
it will multiply each element of the array by 2, so the result is:
[2,4,6,8,10]
Reduce is a bit counter-intuitive in my opinion, but it is not as complicated as one would expect.
Assume you have the mapped array above and would like to run a reducer on it. A reducer takes an array, a binary operator, and an initial element.
What it does is simple. Given the mapped array above, the binary operator '+', and the initial element '0', the reducer applies the operator again and again in the following order:
0 + 2 = 2
2 + 4 = 6
6 + 6 = 12
12 + 8 = 20
20 + 10 = 30.
It takes the last result and the next array element and applies the binary operator to them. In this case, we get the sum of the array.
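The same idea can be written directly with Java streams, where map applies the function element-wise and reduce folds the result with the binary operator and the initial element:

import java.util.Arrays;
import java.util.List;

public class MapReduceDemo {
    public static void main(String[] args) {
        List<Integer> arr = Arrays.asList(1, 2, 3, 4, 5);

        int sum = arr.stream()
                     .map(x -> x * 2)             // [2, 4, 6, 8, 10]
                     .reduce(0, (a, b) -> a + b); // 0+2=2, 2+4=6, 6+6=12, 12+8=20, 20+10=30

        System.out.println(sum); // prints 30
    }
}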

Pig killing data nodes while loading a lot of files

I have a script that tries to get the times that users start/end their days based on log files. The job always fails before it completes and seems to knock 2 data nodes down every time.
The load portion of the script:
log = LOAD '$data' USING SieveLoader('#source_host', 'node', 'uid', 'long_timestamp', 'type');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL AND $0#'type'=='USER_AUTH';
There are about 6500 files that we are reading from, so the job seems to spawn about that many map tasks. SieveLoader is a custom UDF that loads a line, passes it to an existing method that parses fields from the line, and returns them in a map. The parameters passed in limit the map to only the fields we care about.
Our cluster has 5 data nodes. Each node has a quad-core CPU and allows 3 map/reduce slots, for a total of 15. Any advice would be greatly appreciated!

How are map and reduce workers in cluster selected?

For a MapReduce job we need to specify the partitioning of the input data (the count of map processes, M) and the count of reduce processes (R). The MapReduce paper gives an example of typical settings: a cluster with 2,000 workers and M = 200,000, R = 5,000. Workers are tagged as map-workers or reduce-workers. I wonder how these workers are selected in the cluster.
Is a fixed count of map-workers and a fixed count of reduce-workers chosen, so that data stored on the reduce-worker nodes then has to be sent to the map-workers?
Or does the map phase run on every node in the cluster, with some number of nodes then selected as reduce-workers?
Or is it done in another way?
Thanks for your answers.
The number of map workers (mappers) depends on the number of input splits of the input file.
For example: 200 input splits (they are logical) = 200 mappers.
How is the mapper node selected?
The mapper runs on the node that holds the data locally; if that is not possible, the data is transferred to a free node and the mapper is invoked on that node.
The number of reducers can be set by the user (job.setNumReduceTasks(int)); otherwise it falls back to the job's configured default.
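For reference, setting the reducer count explicitly on a Hadoop job looks like the sketch below (a minimal driver; the class name is a placeholder and the mapper/reducer wiring is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my job");
        job.setJarByClass(MyJobDriver.class);

        // ask for 5 reduce tasks; the number of map tasks is still driven by the input splits
        job.setNumReduceTasks(5);

        // mapper/reducer classes, input/output formats and paths omitted for brevity
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}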
Answers to the other questions:
Q1> So can one node run, for example, 10 mappers in parallel at one time, or are these mappers processed sequentially?
Ans: a node can run several mappers in parallel, up to the number of map slots (or containers) configured on that node; any mappers beyond that wait and run in later waves.
Q2> How are the nodes where the reducers are invoked chosen?
Ans:
Intermediate key-value pairs are stored on the mapper's local file system, not in HDFS, and are then copied over the network to the reducer node.
A single mapper feeds data to multiple reducers, so data locality is out of the question, because the data for a particular reducer comes from many nodes, if not all of them.
So the reducer node is (or at least should be) selected based on the available bandwidth of a node, keeping in mind all the points above.
Q3> If we need a reducer count bigger than the overall node count (for example 90 reducers in a 50-node cluster), are the reducers on one node processed in parallel or sequentially?
Ans: a node can run multiple reducers in parallel, up to its number of reduce slots (or containers); any reducers beyond that run in later waves.