How to maintain two connections to different ElasticSearch hosts using Elastisch? - clojure

I’m using Elastisch, and the rest/connect function returns an endpoint, but I can’t see how to reuse this endpoint when calling other functions. I need to transfer some documents from one index to another on different hosts, using a scroll on the first one and bulk indexing on the second one.

Elastisch also offers connect (without the !), which returns the connection to you instead of storing it in a local var. You can call it twice and then use binding to bind the appropriate connection for each call.
(let [client1 (connect ...)
      client2 (connect ...)
      data (binding [clojurewerkz.elastisch.native/*client* client1] ...)]
  (binding [clojurewerkz.elastisch.native/*client* client2] ... put stuff))
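To see the binding pattern in isolation, here is a minimal sketch that does not touch Elastisch at all. The dynamic var *client* and the functions fetch-docs, store-docs and transfer are hypothetical stand-ins, and the "clients" are plain keywords. One thing worth noting: binding only applies while its body is executing, so lazy results should be realized (for example with doall) before leaving the binding form.

```clojure
;; Hypothetical stand-in for a library's dynamic connection var (not Elastisch itself).
(def ^:dynamic *client* nil)

(defn fetch-docs
  "Pretends to read documents using whichever client is currently bound."
  []
  (map (fn [id] {:id id :from *client*}) [1 2 3]))

(defn store-docs
  "Pretends to bulk-index documents using whichever client is currently bound."
  [docs]
  {:to *client* :count (count docs)})

(defn transfer
  "Reads with the source client bound, then writes with the target client bound."
  [source target]
  (let [;; doall forces the lazy seq while *client* is still bound to source
        docs   (binding [*client* source] (doall (fetch-docs)))
        result (binding [*client* target] (store-docs docs))]
    {:sources (mapv :from docs) :result result}))
```

Calling (transfer :host-a :host-b) reads every document under the first binding and writes under the second. Without the doall, the map would be realized after the binding had ended and would see the root value of *client* instead.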

Related

Connection Pooling in Clojure

I am unable to understand the use of pool-db and connection function
in this connection pooling guide.
(defn- get-pool
  "Creates Database connection pool to be used in queries"
  [{:keys [host-port db-name username password]}]
  (let [pool (doto (ComboPooledDataSource.)
               (.setDriverClass "com.mysql.cj.jdbc.Driver")
               (.setJdbcUrl (str "jdbc:mysql://"
                                 host-port
                                 "/" db-name))
               (.setUser username)
               (.setPassword password)
               ;; expire excess connections after 30 minutes of inactivity:
               (.setMaxIdleTimeExcessConnections (* 30 60))
               ;; expire connections after 3 hours of inactivity:
               (.setMaxIdleTime (* 3 60 60)))]
    {:datasource pool}))
(def pool-db (delay (get-pool db-spec)))
(defn connection [] @pool-db)
; usage in code
(jdbc/query (connection) ["Select SUM(1, 2, 3)"])
Why can't we simply do this?
(def connection (get-pool db-spec))
; usage in code
(jdbc/query connection ["SELECT SUM(1, 2, 3)"])
The delay ensures that you create the connection pool the first time you try to use it, rather than when the namespace is loaded.
This is a good idea because your connection pool may fail to be created for any one of a number of reasons, and if it fails during namespace load you will get some odd behaviour - any defs after your failing connection pool creation will not be evaluated, for example.
In general, top level var definitions should be constructed so they cannot fail at runtime.
Bear in mind they may also be evaluated during the AOT compile process, as amalloy notes below.
In your application, you want to create the pool just once and reuse it. For this reason, delay wraps the (get-pool db-spec) call so that it is invoked only the first time the delay is forced with deref/@; the resulting pool is cached and returned on subsequent force calls.
The difference is that in the delay version a pool will be created only if it is called (which might not be the case if everything was cached), but the non-delay version will instantiate a pool no matter what, i.e. always, even if a database connection is not used.
delay runs only if deref is called and does nothing otherwise.
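A small sketch makes the difference visible without a real database. Here make-pool is a hypothetical stand-in for get-pool that just counts how often it runs, and the returned map is a placeholder, not a real datasource.

```clojure
;; Counts how many times the (fake) pool constructor has run.
(def init-count (atom 0))

(defn make-pool
  "Hypothetical stand-in for get-pool; records each invocation."
  []
  (swap! init-count inc)
  {:datasource :fake-pool})

;; Nothing runs here: the body of delay is only evaluated when first forced.
(def pool (delay (make-pool)))

(defn connection
  "Forces the delay; the first call builds the pool, later calls reuse it."
  []
  @pool)
```

Right after loading, (realized? pool) is false and @init-count is 0. After any number of calls to (connection), @init-count stays at 1 and every call returns the same cached map. With a plain (def pool (make-pool)), the counter would already be 1 as soon as the namespace loaded.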
I would suggest you use an existing library to handle connection pooling, something like hikari-cp, which is highly configurable and works across many implementations of SQL.

Persisting State from a DRPC Spout in Trident

I'm experimenting with Storm and Trident for this project, and I'm using Clojure and Marceline to do so. I'm trying to expand the wordcount example given on the Marceline page, such that the sentence spout comes from a DRPC call rather than from a local spout. I'm having problems which I think stem from the fact that the DRPC stream needs to have a result to return to the client, but I would like the DRPC call to effectively return null, and simply update the persisted data.
(defn build-topology
  ([]
   (let [trident-topology (TridentTopology.)]
     (let [;; ### Two alternatives here ###
           ;collect-stream (t/new-stream trident-topology "words" (mk-fixed-batch-spout 3))
           collect-stream (t/drpc-stream trident-topology "words")]
       (-> collect-stream
           (t/group-by ["args"])
           (t/persistent-aggregate (MemoryMapState$Factory.)
                                   ["args"]
                                   count-words
                                   ["count"]))
       (.build trident-topology)))))
There are two alternatives in the code - the one using a fixed batch spout loads with no problem, but when I try to load the code using a DRPC stream instead, I get this error:
InvalidTopologyException(msg:Component: [b-2] subscribes from non-existent component [$mastercoord-bg0])
I believe this error comes from the fact that the DRPC stream must be trying to subscribe to an output in order to have something to return to the client - but persistent-aggregate doesn't offer any such outputs to subscribe to.
So how can I set up my topology so that a DRPC stream leads to my persisted data being updated?
Minor update: Looks like this might not be possible :( https://issues.apache.org/jira/browse/STORM-38

c++ driver mongodb connection options

It seems that the C++ driver doesn't accept the MongoDB connection URI format.
There's no documentation on how I should create the connection string. Any ideas?
I need to connect to a replica set with 3 servers, and set readPreference options.
Create a connection to a replica set in MongoDB C++ client
Until the problems explained in @acm's answer are resolved, I have found a workaround to the bad Connection Strings of the C++ driver. You can create a DBClientReplicaSet using a vector of hosts and ports this way:
// First create a vector of hosts
// (you can omit port numbers if yours are the default)
std::vector<mongo::HostAndPort> hosts;
hosts.push_back(mongo::HostAndPort("YourHost1.com:portNumber1"));
hosts.push_back(mongo::HostAndPort("YourHost2.com:portNumber2"));
hosts.push_back(mongo::HostAndPort("YourHost3.com:portNumber3"));
//Then create a Replica Set DB Client:
mongo::DBClientReplicaSet connection("YourReplicaSetName",hosts,0);
//Connect to it now:
connection.connect();
//Authenticate to the database(s) if needed
std::string errmsg;
connection.auth("DB1Name","UserForDB1","pass1",errmsg);
connection.auth("DB2Name","UserForDB2","pass2",errmsg);
Now, you can use insert, update, etc. just as you did with DBClientConnection. For a quick fix, you can replace your references to DBClientConnection with DBClientBase (which is a parent to both DBClientConnection and DBClientReplicaSet)
Last pitfall: if you are using getLastError(), you must call it with the target database name like this:
connection.getLastError(std::string("DBName"));
because otherwise it will always return "command failed: must log in" as described in this JIRA ticket.
Set the read preferences for every request
You have two ways to do that:
SlaveOK option
It lets your read queries be directed to secondary servers.
It takes place in the query options, which are at the end of the parameters of DBClientReplicaSet.query(). The options are listed in Mongo's official documentation
The one you would look for is mongo::QueryOption_SlaveOk, which will allow you to have reads made on secondary instances.
This is how you should call query():
connection.query("Database.Collection",
QUERY("_id" << id),
n,
m,
BSON("SomeField" << 1),
QueryOption_SlaveOk);
where n is the number of documents to return (0 if you don't want any limit), m the number to skip (defaults to 0), the next field is your projection and the last your query option.
To use several query options, combine them with bitwise OR (|) like this:
connection.query("Database.Collection",
QUERY("_id" << id),
n,
m,
BSON("SomeField" << 1),
QueryOption_SlaveOk | QueryOption_NoCursorTimeout | QueryOption_Exhaust);
Query::readPref option
The Query object has a readPref method which sets the read preference for a specific query. It should be called for each query.
You can pass different arguments for more control. They are listed here.
So here's what you should do (I have not tested this one because I can't right now, but it should work just fine):
/* you should pass an array for the tags. Not sure if this is required.
Anyway, let's create an empty array using the builder. */
BSONArrayBuilder bab;
/* if any, add your tags here */
connection.query("Database.Collection",
QUERY("_id" << id).readPref(ReadPreference_SecondaryPreferred, bab.arr()),
n,
m,
BSON("SomeField" << 1),
QueryOption_NoCursorTimeout | QueryOption_Exhaust);
Note: if any readPref option is used, it should override the slaveOk option.
Hope this helped.
Please see the connection string documentation for details on the connection string format.
(code links below are to 2.2.3 files)
To use a connection string with the C++ driver, you should use the ConnectionString class. You first call the ConnectionString::parse static method with a connection string to obtain a ConnectionString object. You then call ConnectionString::connect to obtain a DBClientBase object which you can then use to send queries.
As for read preference, at the moment I do not see a way to set the read preference in the connection string for the C++ driver, which would preclude a per-connection setting.
However, calling ConnectionString::connect on a connection string that identifies a replica set returns an instance of DBClientReplicaSet (as a DBClientBase). That class honors $readPreference in queries, so you can set your read preference on a per-query basis.
Since the current C++ drivers still do not accept the standard mongodb connection URIs, I've opened a ticket:
https://jira.mongodb.org/browse/CXX-2
Please vote for it to help get this fixed.
It seems like you can set the read preference before sending a read request by calling the "readPref" method of your Query object. I've not found a way to set the read preference on a mongo collection object yet.

Clojure: architecture advice needed

I'm writing a little clojure pub/sub interface. It's very barebones, only two methods that will actually be used: do-pub and sub-listen. sub-listen takes a string (a sub name) and do-pub takes two strings (a sub name and a value).
I'm still fairly new at clojure and am having some trouble coming up with a workable way to do this. My first thought (and indeed my first implementation) uses a single agent which holds a hash:
{ subname (promise1 promise2 etc) }
When a thread wants to sub it conj's a promise object to the list associated with the sub it wants, then immediately tries to de-reference that promise (therefore blocking).
When a pub happens it goes through every item in the list for the sub and delivers the value to that item (the promise). It then dissoc's that subname from the map and returns it to the agent.
In this way I got a simple pub sub implementation working. However, the problem comes when someone subs, doesn't receive a pub for a certain amount of time, then gets killed due to timeout. In this scenario there will be a worthless promise in the agent that doesn't need to be, and moreover this will be a source of a memory leak if that sub never gets pub'd.
Does anyone have any thoughts on how to solve this? Or if there is a better way to do what I'm trying to do overall (I'm trying to avoid using any external pre-cooked pubsub libraries, this is a pet project not a work one)?
You can do something like this:
Create an atom.
The publish function updates the atom's value to the value passed to it.
Subscribers use add-watch on the atom to be notified when its value changes, i.e. when the publish function is called.
Use remove-watch to remove the subscription.
This way you will have a very basic pub-sub system.
I have marked Ankur's answer as the solution but I wanted to expand on it a bit. What I ended up doing is having a central atom that all client threads do an add-watch on. When a pub is done the atom's value is changed to a vector containing the name of the sub and the value being pub'd.
The function the clients pass to add-watch is a partial function which looks like
(partial (fn [sub prom key ref _old new] ...) sub prom)
where prom is a promise previously generated. The client then blocks while waiting on that promise. The partial function checks whether the sub in new is the same as sub; if so, it removes the watch and delivers the value from new to the promise.
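For anyone who wants to try this, here is a minimal sketch of the approach under some assumptions of mine: the names bus and subscribe!, the timeout argument to sub-listen, and the :timeout return value are illustrative and not the code I actually ended up with. It uses a closure rather than partial, which amounts to the same thing. The finally clause removes the watch when the wait times out, which covers the leak described in the question.

```clojure
;; Central atom; its value is the most recent [sub-name value] pair.
(def bus (atom nil))

(defn do-pub
  "Publishes value under the subscription name sub."
  [sub value]
  (reset! bus [sub value])
  nil)

(defn subscribe!
  "Registers a watch for sub and returns the watch key and a promise
  that is delivered with the first value published under that name."
  [sub]
  (let [prom (promise)
        k    (Object.)] ; unique key per subscriber
    (add-watch bus k
               (fn [watch-key ref _old [pub-sub value]]
                 (when (= pub-sub sub)
                   (remove-watch ref watch-key)
                   (deliver prom value))))
    {:key k :prom prom}))

(defn sub-listen
  "Blocks until something is published under sub, or returns :timeout
  after timeout-ms. The watch is always removed before returning."
  [sub timeout-ms]
  (let [{k :key prom :prom} (subscribe! sub)]
    (try
      (deref prom timeout-ms :timeout)
      (finally
        (remove-watch bus k)))))
```

A subscriber thread would call something like (sub-listen "news" 5000) while another thread calls (do-pub "news" 42). Because every subscriber watches the same atom, each publish invokes every watch function, so this is only suited to a small number of subscribers.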

Setting Connection Parameters via ADO for SQL Server

Is it possible to set a connection parameter on a connection to SQL Server and have that variable persist throughout the life of the connection? The parameter must be usable by subsequent queries.
We have some old Access reports that use a handful of VBScript functions in the SQL queries (let's call them GetStartDate and GetEndDate) that return global variables. Our application would set these before invoking the query and then the queries can return information between date ranges specified in our application.
We are looking at changing to a ReportViewer control running in local mode, but I don't see any convenient way to use these custom functions in straight T-SQL.
I have two concept solutions (not tested yet), but I would like to know if there is a better way. Below is some pseudo code.
Set all variables before running Recordset.OpenForward
Connection->Execute("SET @GetStartDate = ...");
Connection->Execute("SET @GetEndDate = ...");
// Repeat for all parameters
Will these variables persist to later calls of Recordset->OpenForward? Can anything reset the variables aside from another SET/SELECT @variable statement?
Create an ADOCommand "factory" that automatically adds parameters to each ADOCommand object I will use to execute SQL
// Command has previously been created
ADOParameter *Parameter1 = Command->CreateParameter("GetStartDate");
ADOParameter *Parameter2 = Command->CreateParameter("GetEndDate");
// Set values and attach etc...
What I would like to know is whether there is something like:
Connection->SetParameter("GetStartDate", "20090101");
Connection->SetParameter("GetEndDate", "20100101");
And these will persist for the lifetime of the connection, and the SQL can do something like @GetStartDate to access them. This may be exactly solution #1, if the variables persist throughout the lifetime of the connection.
Since no one has ventured an answer I'm guessing there isn't an elegant solution, that said:
Global cursors persist for the duration of the connection and can be accessed from any SQL or stored proc so you could execute this once on the connection:
DECLARE KludgeKursor CURSOR GLOBAL STATIC FOR
SELECT StartDate = '2010-01-01', EndDate = '2010-04-30'
OPEN KludgeKursor
and in your stored procedures:
--get the values
DECLARE @StartDate datetime, @EndDate datetime
FETCH FIRST FROM GLOBAL KludgeKursor
INTO @StartDate, @EndDate
--go crazy
SELECT @StartDate, @EndDate
Each connection would only see its own values, so the same stored procs can be used for different connections/values. The global cursor is automatically deallocated when the connection ends.