I have found a Java library for logging to fluentd, but I can't find one for Clojure. Is there any Clojure library for logging to fluentd?
At the moment the answer is, unfortunately, no. I do use fluentd from Clojure, though, both by sending messages over TCP and by using log4j to write to a log file and then having fluentd tail that file. I found the tailing approach much more convenient, though it has a significant limitation: all events from a single log file get the same tag in fluentd, whereas when you send them over a network socket each message can have its own tag.
If you can live with all events from your Clojure service having the same tag in fluentd, then go with a tailing appender. Otherwise you get to use the Java library or roll your own. We made one in-house and it was really not very hard: you basically build a vector that looks like this:
[tag (long (/ (System/currentTimeMillis) 1000)) your-json-message]
and pack it with MessagePack (the wire encoding fluentd's forward input expects), then ship it over the socket. If I were going to start that project again, I would choose the Java library.
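If you do go with the Java library, calling it from Clojure through interop takes only a few lines. Here is a minimal sketch, assuming the fluent-logger-java artifact is on the classpath and a fluentd in_forward input is listening on the default port; the tag, host, port, and event data are all placeholders:

(ns example.fluent
  ;; assumes org.fluentd/fluent-logger (fluent-logger-java) is on the classpath
  (:import [org.fluentd.logger FluentLogger]))

;; Events will show up in fluentd tagged as "myapp.<label>".
(def logger (FluentLogger/getLogger "myapp" "127.0.0.1" 24224))

(defn log-event
  "Ship a map of data to fluentd under the given label."
  [label data]
  ;; Clojure maps implement java.util.Map, so they can be passed directly;
  ;; keep the keys as strings.
  (.log logger ^String label ^java.util.Map data))

(comment
  (log-event "login" {"user" "alice" "status" "ok"}))

Each call becomes one tagged event, so you keep the per-message tagging that the tailing approach loses.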
Related
Why can't we simply convert Clojure code to a string, send it over TCP, and evaluate it on the other side (nREPL)?
For example: this is a hash map {"foo" "bar", 1 "spam"} whose bencode encoding is d3:foo3:bari1e4:spame.
Why not convert it to a string -> {\"foo\" \"bar\", 1 \"spam\"}
and evaluate it on the other side, instead of using bencode, as shown below?
(eval (read-string "{\"foo\" \"bar\", 1 \"spam\"}"))
; ⇒ {"foo" "bar", 1 "spam"}
I am new to the Clojure world, so this might be a stupid question, but anyway.
nREPL's maintainer here. There are several reasons why nREPL uses bencode by default:
We needed a data format that supports data streaming easily (you won't find many streaming JSON parsers)
We needed a data format that could easily be supported by many clients (support for formats like JSON and EDN is tricky in editors like Emacs and vim). I can tell you CIDER would not have existed if 8 years ago we had to deal with JSON there. :-)
Bencode is so simple that you usually don't even need a third-party library for it (many clients/servers have their own implementation of the encoding/decoding in less than 100 lines of code; see the sketch after this list) - this means clients/servers have one less third-party dependency. Tools like nREPL can't really have runtime deps, as they would conflict with the user application's deps.
EDN didn't exist back when nREPL was created
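To give a sense of how small bencode really is (the sketch promised above), here is a toy encoder in a dozen lines of Clojure. It is only an illustration for the example map from the question, not nREPL's actual implementation: a strict encoder would also require dictionary keys to be sorted byte strings and would work on streams rather than strings.

;; Toy bencode encoder -- illustration only, not nREPL's implementation.
(defn bencode [x]
  (cond
    (integer? x)    (str "i" x "e")
    (string? x)     (str (count (.getBytes ^String x "UTF-8")) ":" x)
    (map? x)        (str "d" (apply str (mapcat #(map bencode %) x)) "e")
    (sequential? x) (str "l" (apply str (map bencode x)) "e")))

(bencode {"foo" "bar", 1 "spam"})
;; => "d3:foo3:bari1e4:spame"  (the encoding from the question; small
;;    Clojure maps preserve insertion order, so the keys come out foo, 1)

Decoding is about as short, which is why so many editors could grow an nREPL client without pulling in a JSON or EDN parser.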
Btw, these days nREPL supports EDN and JSON (via the fastlane library) as well, but I think that bencode is still the best transport in most cases.
For people looking for the answer, read the "Motivation" section in https://github.com/clojure/tools.nrepl/blob/master/src/main/clojure/clojure/tools/nrepl/bencode.clj
This is very well written.
This is my first time implementing stream-processing infrastructure, and my poison of choice is Storm 1.0.1, Kafka 0.9.0, and Clojure 1.5.
I have a background working with a messaging system (RabbitMQ), and I liked it for a couple of reasons:
Simple to install and maintain
Nice frontend web portal
Persistent message state is maintained, so I can start a consumer and it knows which messages have not yet been consumed, i.e. "exactly once"
However it cannot achieve the throughput I desire.
Now, having gone through Kafka, I see that it depends heavily on manually maintaining offsets (internally in the Kafka broker, in ZooKeeper, or externally).
I at long last managed to create a spout in Clojure with the Kafka broker as the source, which was a nightmare.
As in most scenarios, what I desire is "exactly-once messaging", and the Kafka documentation states:
So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
What does this translate to for a Clojure Kafka spout? I am finding it hard to conceptualize.
I may have several bolts along the way, but the endpoint is a Postgres cluster. Am I to store the offset in the database (sounds like a race hazard waiting to happen) and, on initialization of my Storm cluster, fetch the offset from Postgres?
Also, is there any danger in setting the parallelism for the Kafka spout to a number greater than one?
I generally used this as a starting point, as examples for many things are just not available in Clojure, with a few minor tweaks for the versions I am using (my messages don't quite come out as I expect, but at least I can see them):
;; Kafka spout config. Assumes SpoutConfig, ZkHosts, StringScheme,
;; SchemeAsMultiScheme (from the storm-kafka/storm-core jars) and java.util.UUID are imported.
(def ^{:private true
       :doc "kafka spout config definition"}
  spout-config
  (let [cfg (SpoutConfig. (ZkHosts. "127.0.0.1:2181")       ; ZooKeeper connect string
                          "test"                            ; Kafka topic
                          "/broker"                         ; ZooKeeper root for offset storage
                          (.toString (UUID/randomUUID)))]   ; unique consumer id
    ;; (set! (. cfg scheme) (StringScheme.)) is deprecated; wrap it in a multi-scheme:
    (set! (. cfg scheme) (SchemeAsMultiScheme. (StringScheme.)))
    ;; (.forceStartOffsetTime cfg -2) ; uncomment to replay from the earliest offset
    cfg))
(defn mk-topology []
  (topology
   {;;"1" (spout-spec sentence-spout)
    "1" (spout-spec my-kafka-spout :p 1)
    "2" (spout-spec (sentence-spout-parameterized
                     ["the cat jumped over the door"
                      "greetings from a faraway land"])
                    :p 2)}
   {"3" (bolt-spec {"1" :shuffle}
                   split-sentence
                   :p 5)
    "4" (bolt-spec {"3" ["word"]}
                   word-count
                   :p 1)}))
With any distributed system it's impossible to guarantee that a piece of work will be worked on exactly once. At some point something will fail, and it will either need to be retried (this is called "at-least-once" processing) or not retried (this is called "at-most-once" processing); you can't have exactly the middle of that and get true "exactly-once" processing. What you can get is something very close to exactly-once processing.
The trick is, at the end of your process, to throw out the second copy if you find that the work was done twice. This is where the index (the Kafka offset) comes in. When you are saving the result into the database, look to see whether work with a later index than this work's index has already been saved. If that later work exists, throw the current work out and don't save it. As for the documentation, that's the kind of explanation that's only "straight-forward" to people who have done it many times...
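To make that concrete for the Postgres endpoint in the question, here is a hedged sketch in Clojure using clojure.java.jdbc. The table and column names (kafka_offsets, word_counts, and so on) are invented for illustration; the point is only that the result and the offset are committed in one transaction, and anything at or below the already-stored offset is dropped.

;; Sketch only: commit the result and the Kafka offset atomically, and
;; drop duplicates by comparing offsets. Table/column names are made up.
(require '[clojure.java.jdbc :as jdbc])

(def db {:dbtype "postgresql" :dbname "streaming"
         :user "storm" :password "storm"})   ; placeholder credentials

(defn process-message!
  "Apply one Kafka message to Postgres, effectively exactly once per partition."
  [{:keys [topic partition-id offset payload]}]
  (jdbc/with-db-transaction [tx db]
    (let [[{last-offset :last_offset}]
          (jdbc/query tx ["SELECT last_offset FROM kafka_offsets
                           WHERE topic = ? AND partition_id = ? FOR UPDATE"
                          topic partition-id])]
      (when (or (nil? last-offset) (> offset last-offset))
        ;; The real write would be whatever your final bolt produces.
        (jdbc/insert! tx :word_counts {:word payload :kafka_offset offset})
        (jdbc/execute! tx ["INSERT INTO kafka_offsets (topic, partition_id, last_offset)
                            VALUES (?, ?, ?)
                            ON CONFLICT (topic, partition_id)
                            DO UPDATE SET last_offset = EXCLUDED.last_offset"
                           topic partition-id offset])))))

On startup (or after a crash) the spout can read kafka_offsets back and resume from there; the row lock (FOR UPDATE) is what keeps two parallel spout/bolt instances for the same partition from racing each other.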
Reading through the doc for Boost.Log, it explains how to "fan out" into multiple files/sinks pretty well from one application, and how to get multiple threads working together to log to one place, but is there any documentation on how to get multiple processes logging to a single log file?
What I imagine is that every process would log to its own "private" log file, but in addition, any messages above a certain severity would also go to a "common" log file. Is this possible with Boost.Log? Is there some configuration of the sinks that makes this easy?
I understand that I will likely have the same "timestamp out of order" problem described in the FAQ here, but that's OK, as long as the timestamps are correct I can work with that. This is all on one machine, so no remote filesystem problems either.
My expectation is that the Boost.Log backends that write log files directly will keep those files open between writes.
That will cause problems when several processes use the same log file: either the OS won't allow a second writer to open the file, or the interleaved writes will garble each other's entries.
There are a few Boost.Log backends that can be used to have all the logging end up in one place.
These are the syslog and Windows eventlog backends. Of these, the syslog backend is probably the easiest to use.
I have a server program that I am writing. In this program I log a lot. Is it customary in logging (for a server) to overwrite the log of previous runs, to append to the file with some sort of new-run header, or to create a new log file (it won't be restarted too often)?
Which of these solutions is the way of doing things under Linux/Unix/MacOS?
Also, can anyone suggest a logging library for C++/C? I need one, regardless of the answer to the above question.
Take a look in /var/log/ ... you'll see that files are structured like:
serverlog
serverlog.1
serverlog.2
This is done by logrotate, which is called from a cron job. Everything is simply in chronological order within the files, so you should just append to the same log file each time and let logrotate split it up if needed.
You can also add a configuration file to /etc/logrotate.d/ to control how a particular log is rotated. Depending on how big your log files are, it might be a good idea to add an entry for your logs there; you can look at the other files in that directory to see the syntax.
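For example, a logrotate entry for a hypothetical server whose logs live under /var/log/myserver/ might look like this (the path, file name, and limits are placeholders; see man logrotate for the full set of directives):

# /etc/logrotate.d/myserver -- hypothetical example
# Rotate daily, keep a week of compressed history, and tolerate
# missing or empty logs. copytruncate lets the server keep writing
# to the same open file descriptor.
/var/log/myserver/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

copytruncate is the lazy option if your server never reopens its log file; the cleaner alternative is to reopen the file from a postrotate script (e.g. on a signal).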
This is a rather complex issue. I don't think that there is a silver bullet that will kill all your concerns in one go.
The first step in deciding what policy to follow would be to set your requirements. Why is each entry logged? What is its purpose? In most cases this will result in some rather concrete facts, such as:
You need to be able to compare the current log with past logs. Even when an error message is self-evident, the process that led to it can be determined much faster by playing spot-the-difference, rather than puzzling through the server execution flow diagram - or, worse, its source code. This means that you need at least one log from a past run - overwriting blindly is a definite No.
You need to be able to find and parse the logs without going out of your way. That means using whatever facilities and policies are already established. On Linux it would mean using the syslog facility for important messages, to allow them to appear in the usual places.
There is also some good advice to heed:
Time is important. Not only because there's never enough of it, but also because log files without proper timestamps for each entry are practically useless. Make sure that each entry has a timestamp - most system-wide logging facilities will do that for you. Also make sure that the clocks on all your computers are as accurate as possible - using NTP is a good way to do that.
Log entries should be as self-contained as possible, with minimal cruft. You don't need to have a special header with colors, bells and whistles to announce that your server is starting - a simple MyServer (PID=XXX) starting at port YYYYY would be enough for grep (or the search function of any decent log viewer) to find.
You need to determine the granularity of each logging channel. Sending several GB of debugging log data to the system logging daemon is not a good idea. A good approach might be to use separate log files for each logging level and facility, so that e.g. user activity is not mixed up with low-level data that is only useful when debugging the code.
Make sure your log files are in one place, preferably separated from other applications. A directory with the name of your application is a good start.
Stay within the norm. Sure you may have devised a new nifty logfile naming scheme, but if it breaks the conventions in your system it could easily confuse even the most experienced operators. Most people will have to look through your more detailed logs in a critical situation - don't make it harder for them.
Use the system log-handling facilities. E.g. on Linux that would mean appending to the same file and letting an external daemon like logrotate handle the log files. Not only is it less work for you, it also keeps your logs consistent with whatever general logging policy applies to the system as a whole.
Finally: always copy important log data to the system log as well. Operators watch the system logs. Please, please, please don't make them look in other places just to find out that your application is about to launch the ICBMs...
https://stackoverflow.com/questions/696321/best-logging-framework-for-native-c
For the logging, I would suggest creating a new log file and cleaning it out at some frequency to keep it from growing too large. Overwriting the logs of previous runs is usually a bad idea.
I'm trying to write a chat client for a popular network. The original client is proprietary, and is about 15 GB larger than I would like. (To be fair, others call it a game.)
There is absolutely no documentation available for the protocol on the internet, and most search results only come back with the client's scripting interface. I can understand that, since used in the wrong way, it could lead to ruining other people's experience.
I've downloaded the source code of a couple of alternative servers, including the one I want to connect to, but those
contain no documentation other than install instructions
are poorly commented (based on a superficial browse)
are HUGE (the src folder of the target server contains 12 MB worth of .cpp and .h files), and grep didn't find anything related
I've also tried searching their forums and contacting the maintainers of the server, but so far, no luck.
Packet sniffing isn't likely to help, as the protocol relies heavily on encryption.
At this point, all my hope is my ability to chew through an ungodly amount of code. How do I start?
Edit: A related question.
If the code encrypts its traffic with some well-known library like OpenSSL or Crypto++, it might be useful to write your own wrappers for the main entry points of those libraries, delegating the calls to the actual library. If you make that substitution and build the project successfully, you will be able to trace everything that goes out, in plain text.
If the project does not use a third-party encryption library, hopefully it is still possible to substitute the encryption routines with wrappers that trace their input and then delegate the encryption to the actual code.
Your best bet is that encryption is usually implemented in a separate, relatively small set of source files, so it should be easier for you to track the input/output in those files.
Good luck!
I'd say
find the call that is used to send data through the socket (it depends on the network library used)
find references to this call and work backwards from there; if you can modify and recompile the server code, it might help
On the way, you will be able to log decrypted (or, more likely, not yet encrypted) network activity.
IMO, the best answer is to read the source code of the alternative server. Try using a good C++ IDE to help you. It will make a lot of difference.
It is likely that the protocol related material you need to understand will be limited to a subset of the files. These will contain references to network sockets and things. Start from there and work outwards as far as you need to.
A viable approach is to tackle this as a crypto challenge. That makes it easy, because you control so much.
For instance, you can use a current client to send a known message to the server, and then search the server's memory for that string. Once you've found out which object the string ends up in, it also becomes possible to trace its ancestry through the code. Set a breakpoint on any non-const method of that object and collect the stack traces. This gives you a live view of how messages arrive at the server and a list of core functions essential to message processing. You can then find related functions (callers/callees of the functions on your list).