Optimize a single core of a Solr server with multiple cores - solrj

My Solr server has multiple cores. Is it possible to optimize a single core of it using SolrJ?

I believe this is what you're looking for, on SolrClient:
public UpdateResponse optimize(String collection)
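For example, a minimal sketch against a standalone Solr instance, where the collection argument is simply the core name (the base URL and the core name "core1" are illustrative):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.UpdateResponse;

public class OptimizeSingleCore {
    public static void main(String[] args) throws Exception {
        // Point the client at the Solr instance itself, not at a specific core.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr");
        try {
            // Only the named core is optimized; the other cores are left untouched.
            UpdateResponse response = client.optimize("core1");
            System.out.println("optimize returned status " + response.getStatus());
        } finally {
            client.close();
        }
    }
}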
For reference, a collection is defined as:
Collection: A complete logical index in a SolrCloud cluster. It is associated with a config set and is made up of one or more shards. If
the number of shards is more than one, it is a distributed index, but
SolrCloud lets you refer to it by the collection name and not worry
about the shards parameter that is normally required for
DistributedSearch.
And a core:
Solr Core: Also referred to as just a "Core". This is a running
instance of a Lucene index along with all the Solr configuration
(SolrConfigXml, SchemaXml, etc...) required to use it. A single Solr
application can contain 0 or more cores which are run largely in
isolation but can communicate with each other if necessary via the
CoreContainer. From a historical perspective: Solr initially only
supported one index, and the SolrCore class was a singleton for
coordinating the low-level functionality at the "core" of Solr. When
support was added for creating and managing multiple Cores on the fly,
the class was refactored to no longer be a Singleton, but the name
stuck.
You can find more information regarding this method and Solr terminology here:
https://lucene.apache.org/solr/5_2_1/solr-solrj/index.html?overview-summary.html
https://wiki.apache.org/solr/SolrTerminology
Hope this helps!

Related

Divide in-memory data between service instances

Recently in a system design interview I was asked a question where cities were divided into zones and data for around 100 zones was available. An API took the zone ID as input and returned all the restaurants for that zone in the response. The response time for the API was 50 ms, so the zone data was kept in memory to avoid delays.
If the zone data is approximately 25 GB, then scaling the service to, say, 5 instances would need 125 GB of RAM.
Now the requirement is to run 5 instances but use only 25 GB of RAM in total, with the data split between the instances.
I believe that to achieve this we would need a second application acting as a config manager, tracking which instance holds which zone data. The instances can ask the config manager on startup which zones to own. But what I am not able to figure out is how we redirect the request for a zone to the correct instance that holds its data, especially if we use Kubernetes. Also, if an instance holding partial data restarts, how do we track which zone data it was holding?
Splitting dataset over several nodes: sounds like sharding.
In-memory: the interviewer might be asking about redis or something similar.
Maybe this: https://redis.io/topics/partitioning#different-implementations-of-partitioning
Redis Cluster might fit -- keep in mind that when the docs mention "client-side partitioning", the client is some Redis client library, loaded by your backends, which in turn respond to HTTP client/end-user requests.
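To make "client-side partitioning" concrete, here is a hypothetical sketch (not the Redis client API): every front-end applies the same deterministic mapping from zone ID to owning instance, so any of them can route a request without a central lookup.

import java.util.List;

// Hypothetical sketch of deterministic zone-to-instance routing.
// All front-ends share the same instance list, so a given zone ID
// always resolves to the same instance.
public class ZoneRouter {
    private final List<String> instanceUrls; // e.g. stable service names of the 5 instances

    public ZoneRouter(List<String> instanceUrls) {
        this.instanceUrls = instanceUrls;
    }

    public String instanceFor(String zoneId) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int idx = Math.floorMod(zoneId.hashCode(), instanceUrls.size());
        return instanceUrls.get(idx);
    }
}

Redis Cluster does essentially this for you, using hash slots rather than a naive modulo, and lets you move slots between nodes as the cluster changes.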
Answering your comment: then, I'm not sure what they were looking for.
Comparing Java hashmaps to a redis cluster isn't completely fair, considering one is bound to your JVM, while the other is actually distributed / sharded, implying at least inter-process communications and most likely network/non-local queries.
Then again, if the question is to scale an ever-growing JVM: at some point, we need to address the elephant in the room: how do you guarantee data consistency, proper replication/sharding, what do you do when a member goes down, ...?
A distributed hashmap using Hazelcast may be more relevant. Some (Hazelcast) would argue it is safer under heavy write load; others say that migrating from Hazelcast to Redis helped them improve service reliability. I don't have enough background in Java myself, so I wouldn't know.
As a general rule: when asked about Java, you could argue that speed and reliability rely very much on your developers understanding what they're doing, which in Java leaves a large margin for error. Then again, if they're asking such questions, they probably have some good devs on their payroll.
Distributed databases (in-memory or on disk, SQL or NoSQL), on the other hand, are quite a complicated topic that you would need to master (on top of Java) to get right.
The broad approach they're describing was described by Adya in 2019 as a LInK store. Linked In-memory Key-value stores allow for application objects supporting rich operations to be sharded across a cluster of instances.
I would tend to approach this by implementing a stateful application using Akka (disclaimer: I am at this writing employed by Lightbend, which employs the majority of the developers of Akka and offers support and consulting services to clients using Akka; as my SO history indicates, I would have the same approach even multiple years before I was employed by Lightbend) along these lines.
Akka Cluster to allow a set of JVMs running an application to form a cluster in a peer-to-peer manner and manage/track changes in the membership (including detecting instances which have crashed or are isolated by a network partition)
Akka Cluster Sharding to allow stateful objects keyed by ID to be distributed approximately evenly across a cluster and rebalanced in response to membership changes
These stateful objects are implemented as actors: they can update their state in response to messages and (since they process messages one at a time) without needing elaborate synchronization.
Cluster sharding implies that the actor responsible for an ID might exist on different instances, so that implies some persistence of the state of the zone outside of the cluster. For simplicity*, when an actor responsible for a given zone starts, it initializes itself from datastore (could be S3, could be Dynamo or Cassandra or whatever): after this its state is in memory so reads can be served directly from the actor's state instead of going to an underlying datastore.
By directing all writes through cluster sharding, the in-memory representation is, by definition, kept in sync with the writes. To some extent, we can say that the application is the cache: the backing datastore only exists to allow the cache to survive operational issues (and because it's only in response to issues of that sort that the datastore needs to be read, we can optimize the data store for writes vs. reads).
Cluster sharding relies on a conflict-free replicated data type (CRDT) to broadcast changes in the shard allocation to the nodes of the cluster. This allows, for instance, any instance to handle an HTTP request for any shard: it simply forwards a representation of the important parts of the request as a message to the shard which will distribute it to the correct actor.
From Kubernetes' perspective, the instances are stateless: no StatefulSet or similar is needed. The pods can query the Kubernetes API to find the other pods and attempt to join the cluster.
*: I have a fairly strong prior that event sourcing would be a better persistence approach, but I'll set that aside for now.
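A minimal sketch of what the sharded zone entities might look like with Akka Cluster Sharding (Akka Typed, Java DSL; the message type and payloads are purely illustrative):

import akka.actor.typed.ActorRef;
import akka.actor.typed.ActorSystem;
import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;
import akka.cluster.sharding.typed.javadsl.ClusterSharding;
import akka.cluster.sharding.typed.javadsl.Entity;
import akka.cluster.sharding.typed.javadsl.EntityRef;
import akka.cluster.sharding.typed.javadsl.EntityTypeKey;

public class ZoneSharding {
    // Messages a zone entity understands (only a read is shown).
    public interface Command {}
    public static final class GetRestaurants implements Command {
        public final ActorRef<String> replyTo;
        public GetRestaurants(ActorRef<String> replyTo) { this.replyTo = replyTo; }
    }

    public static final EntityTypeKey<Command> TYPE_KEY =
        EntityTypeKey.create(Command.class, "Zone");

    // One zone's behavior: a real implementation would load its state from the
    // backing datastore on start and then serve reads from memory.
    private static Behavior<Command> zone(String zoneId) {
        return Behaviors.receive(Command.class)
            .onMessage(GetRestaurants.class, msg -> {
                msg.replyTo.tell("restaurants for zone " + zoneId); // placeholder payload
                return Behaviors.same();
            })
            .build();
    }

    // Register the entity type; Akka distributes the zone entities across the
    // cluster and rebalances them when membership changes.
    public static void init(ActorSystem<?> system) {
        ClusterSharding.get(system).init(
            Entity.of(TYPE_KEY, ctx -> zone(ctx.getEntityId())));
    }

    // Any node can obtain a reference to a zone and send it messages,
    // regardless of which instance currently hosts it.
    public static EntityRef<Command> zoneRef(ActorSystem<?> system, String zoneId) {
        return ClusterSharding.get(system).entityRefFor(TYPE_KEY, zoneId);
    }
}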

Why can't we share a Cassandra session initialised in the parent process with a child process (Python driver)?

I am developing a multi-process application using Cassandra. I have a single session opened at the beginning of the server and I want to share that session with the other processes. I just want to know whether this is possible in Cassandra (Python driver), and if not, why?
Yes, it is possible, and it is recommended to use one session.
4 simple rules when using the DataStax drivers for Cassandra
When using one of the DataStax drivers for Cassandra, whether it's C#, Python, or Java, there are 4 simple rules that should clear up the majority of questions and that will also make your code efficient:
Use one Cluster instance per (physical) cluster (per application lifetime)
Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a PreparedStatement
You can reduce the number of network roundtrips and also have atomic operations by using Batches
Source http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra
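For concreteness, a sketch of the first three rules with the DataStax Java driver (3.x-era API; the question is about the Python driver, but the rules are the same across drivers, and the contact point, keyspace, table, and column here are made up):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class FourRulesSketch {
    public static void main(String[] args) {
        // Rule 1: one Cluster per physical cluster, kept for the application lifetime.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

        // Rule 2: a single Session, with the keyspace given explicitly in the query.
        Session session = cluster.connect();

        // Rule 3: prepare statements that are executed more than once.
        PreparedStatement byId = session.prepare(
            "SELECT name FROM demo_keyspace.users WHERE id = ?");
        BoundStatement bound = byId.bind(42);
        for (Row row : session.execute(bound)) {
            System.out.println(row.getString("name"));
        }

        cluster.close(); // closes the session's connections too
    }
}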
No, it's not recommended across processes.
Quoting the official datastax documentation:
Be sure to never share any Cluster, Session, or ResponseFuture objects across multiple processes. These objects should all be created after forking the process, not before.
Source: https://datastax.github.io/python-driver/performance.html#multiprocessing

Using & Scaling Titan Graph Database

I am figuring out my options for storing hierarchical data (parent - child relationships).
Since a tree is a graph and a forest (of trees) is also technically a graph, a graph database seems to fit the bill much better than an RDBMS, especially since I am concerned with optimizing both read and write operations.
Optimizing writes implies changes in hierarchy require minimal writes.
Optimizing reads implies that materializing the full path to a particular node consumes minimal read operations.
My use case is:
A tree per user. Should I store and use one graph across the user space or one graph per user?
Path queries starting at any node and back to root of tree for a user.
Child nodes store links to parent nodes
Since all of my resources are in AWS, being able to use the Titan DynamoDB backend seems ideal.
My real problem is in understanding how to scale and manage Titan though.
Do I need a Gremlin Server instance? In other words, do I need to stand up an EC2 instance with Gremlin Server in order to do anything with Titan? Or can I use the Titan Java API to work with graph data directly?
Do I need to explicitly shard the data? In other words, do I need to stand up more Gremlin Servers as usage increases and the amount of data and operations grows? When the number of servers scales out, do I need to use consistent hashing across those servers from the client in order to perform operations?
Do I need to set up an Elasticsearch cluster to be able to start traversals from any node? Or is using vertices to represent objects and edges to represent parent relationships sufficient at this point? I can guarantee that vertex IDs are unique across the user space; I can also decorate each vertex with the unique user ID as well. In that case, do I need Elasticsearch? My hope is that Elasticsearch is for free-form or more complex search-type queries and not for exact queries!
As the number of front-ends increase, can each front-end open the graph (single graph across user space)? If a graph per user, then since front-ends have no affinity, the same graph may be opened for each user; is that OK?
I wasn't able to find much documentation on any of this. Thank you!
I will try to answer your questions in order:
Both solutions are possible; it highly depends on your application whether to choose Gremlin Server or a customized data access layer with customized queries through other secondary data stores. Although I would prefer a customized data access layer, it is possible to serve all Gremlin query requirements through Gremlin Server.
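For illustration, a minimal sketch of the embedded approach, talking to the graph directly through the Titan Java API (Titan 1.0-style API; the storage settings and property names are made up, and the DynamoDB backend has its own configuration keys):

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class TitanEmbeddedSketch {
    public static void main(String[] args) throws Exception {
        // A single application-wide graph instance.
        TitanGraph graph = TitanFactory.build()
            .set("storage.backend", "berkeleyje")
            .set("storage.directory", "/tmp/titan")
            .open();
        try {
            // One user's tree: children keep an edge pointing at their parent.
            Vertex root  = graph.addVertex("userId", "u1", "name", "root");
            Vertex child = graph.addVertex("userId", "u1", "name", "child");
            child.addEdge("childOf", root);
            graph.tx().commit();

            // Path query from a node back up to the root of its tree.
            graph.traversal().V(child.id())
                .repeat(__.out("childOf")).emit()
                .values("name")
                .forEachRemaining(System.out::println);
        } finally {
            graph.close();
        }
    }
}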
Gremlin Server is just an interface between your application and the data stores, and due to its caching mechanism it is memory-intensive. Data can be stored on different machines, for example a cluster of DynamoDB machines. It depends on the number of concurrent users, but I think vertical scaling is more than enough for most applications. If you are going to use Titan in a highly concurrent environment, beyond the resources of a single machine, you will probably have to create different Gremlin Servers on different machines and handle the load balancing yourself. The problem is that, for cache efficiency, you have to route requests so that similar queries hit the same Gremlin Server.
Yes, an indexing backend is only useful for more complicated queries than simple retrieval. A secondary index backend like Solr, Elasticsearch, or Lucene is useful if you want conditional search or text search by similarity. This is because an indexer like Lucene can provide an inverted index structure that helps with similarity searches. If you are going to search for all parents/children with "foo" in their names, you have to use an indexing backend; the same goes for searching for all parents/children with age less than 40 (see the sketch after the links below).
More information about indexing backends can be found at these links:
http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html
http://s3.thinkaurelius.com/docs/titan/1.0.0/index-parameters.html
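As a rough sketch of those two example queries with the Titan 1.0 Java API (the property names, the index name, and the "search" backend name are made up; textContains is the predicate helper from Titan's Text class):

import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.attribute.Text;
import com.thinkaurelius.titan.core.schema.Mapping;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class TitanIndexSketch {

    // Schema definition (run once): a mixed index backed by the indexing backend
    // configured under the name "search" (e.g. Elasticsearch).
    static void defineIndex(TitanGraph graph) {
        TitanManagement mgmt = graph.openManagement();
        PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
        PropertyKey age  = mgmt.makePropertyKey("age").dataType(Integer.class).make();
        mgmt.buildIndex("peopleSearch", Vertex.class)
            .addKey(name, Mapping.TEXT.asParameter())
            .addKey(age)
            .buildMixedIndex("search");
        mgmt.commit();
    }

    // The two example queries from above; both need the indexing backend.
    static void exampleQueries(TitanGraph graph) {
        graph.traversal().V().has("name", Text.textContains("foo")).toList(); // text search
        graph.traversal().V().has("age", P.lt(40)).toList();                  // range condition
    }
}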
It is highly recommended to limit the number of open graph instances to one for the entire application. Titan uses caching mechanisms that favour a single graph instance across the application for the sake of performance. Since uncommitted data is only visible within a single graph instance and transaction, if you want a real-time application it is suggested that you use a single graph instance and a single transaction. However, using more than one graph instance in the application for read-only transactions is not wrong, it is just not efficient.
You can find lots of information about Titan graph database in the following links:
Main Titan documentation: http://s3.thinkaurelius.com/docs/titan/1.0.0/
An old but really useful document about how Titan works: https://github.com/elffersj/delftswa-aurelius-titan/tree/master/SA-doc

Does getting a session through a Singleton class have an impact on parallel query execution?

We are working on changing the database of our application from Oracle to Cassandra. We are using the DataStax C/C++ driver to connect to the Cassandra database. In the application, multiple queries will be executed at the same time on tables of the same keyspace. We are planning to define a singleton class to get the session for a cluster and use it across the application. But will this have an impact if multiple queries are executed by the same session at the same time? Can anyone please suggest a good way to achieve this? Thanks in advance.
It will not. In fact, it is recommended that you use only one session per keyspace. As referenced in 4 simple rules when using the DataStax drivers for Cassandra, the first two rules apply here:
Use one Cluster instance per (physical) cluster (per application lifetime)
Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries.
A Session comprises a set of connections to your Cassandra hosts. If you have multiple queries in different threads sharing this session, that is OK, as it is thread safe and the connections will be shared. As referenced in the Architecture doc for the cpp driver:
CassSession is designed to be used concurrently from multiple threads. It is best practice to create a single session per keyspace. CassFuture is also thread safe. Other than these exceptions, in general, functions that modify an object's state are not thread safe. Objects that are immutable (marked const) can be read safely by multiple threads. None of the free functions can be called concurrently on the same instance of an object.
Creating additional Sessions won't really provide more value. If you are running into issues where you are overwhelming your connections, you can configure more connections through the driver's configuration.
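For illustration (the question is about the C/C++ driver, but the pattern is the same across drivers), a hypothetical application-wide singleton with the DataStax Java driver; the contact point and keyspace are made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hypothetical singleton holder: one Cluster and one Session for the whole
// application. Session is thread safe, so worker threads can run queries on
// it concurrently without any extra locking.
public final class CassandraSessions {
    private static final Cluster CLUSTER =
        Cluster.builder().addContactPoint("10.0.0.1").build();
    private static final Session SESSION = CLUSTER.connect("demo_keyspace");

    private CassandraSessions() {}

    public static Session get() {
        return SESSION;
    }
}

Worker threads can then call CassandraSessions.get().execute(...) or executeAsync(...) in parallel; the driver multiplexes those requests over its pooled connections.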

Django MongoDB automatic failover from Primary (master) to Secondary (slave)

I've developed a web app in Django and used MongoDB for the backend.
I'm not sure how to do an automatic failover for the database.
My requirement is that, when the Primary node of MongoDB is down, Django should automatically connect to a Secondary node.
How can this be achieved?
I found this library, https://github.com/brianjaystanley/django-failover,
which is for Django 1.3, but I want it for Django 1.5.
What settings do I need to change, or is there any library available to the rescue? Any solutions on the floor?
Thanks
You should not need to set up anything in your application to handle this, and the library you linked is not appropriate for use with MongoDB, as it is a relational back-end solution.
The first question here is: do you actually have a Replica Set configuration for MongoDB? I can only answer presuming that you do, but the link is worthwhile reading, as from your question you probably do not have a core understanding of MongoDB replication concepts.
What will be explained there is that there is no Secondary for your application to fail over to; what actually happens is that the Replica Set itself elects, amongst its members, which node will become the Primary.
Going on with the answer, you configure your application to handle the failover by setting up your Connection String for the driver. Read through that documentation and you will find that, among other useful things, you are basically providing a list of hostnames which will be members of the Replica Set. You don't need all the members, just enough to act as a seed list so that the other nodes can be discovered. That would happen anyway with the correct options, but it is good practice to have more than one host to contact even to get that information. Here's a sample:
mongodb://<Primary>,<Secondary>/<database>
You may possibly want to take a look at MongoEngine, considering you probably have experience with Django, and it uses modelling concepts you will be familiar with whilst still allowing access to MongoDB features. From memory, there is some documentation there on setting up Replica Set connections.