See if part of data is lazy in Clojure

Is there a function in clojure that checks whether data contains some lazy part?
Background:
I'm building a small server in Clojure. Each connection has a state, an input-stream, and an output-stream.
The server reads a byte from the input-stream and, based on its value, calls one of several functions (with the state and the input and output streams as parameters). The functions can decide to read more from the input-stream, write a reply to the output-stream, and return a state. This part loops.
This all works fine as long as the state doesn't contain any lazy parts. If there is some lazy part in the state, it may, when it is eventually evaluated (later, during another function), start reading from the input-stream and writing to the output-stream.
So basically I want to add a post-condition to all of these functions, stating that no part of the returned state may be lazy. Is there any function that checks for lazy sequences? I think it would be easy to check whether the state itself is a lazy sequence, but I also want to catch, for instance, a state that holds a vector containing a hash-map, one of whose values is lazy.

It's easier to ensure that the state is not lazy by forcing evaluation with doall.
I had this problem in a stream-processing crypto app a couple of years back and tried several approaches until I finally accepted my lazy side and wrapped the input streams in a lazy sequence that closed the input streams when no more data was available, effectively separating the concern of closing the streams from the concern of what the streams contained. The state you are tracking sounds a little more sophisticated than open vs. closed, though you may be able to separate it in a similar manner.

You could certainly force evaluation with doall as Arthur wisely suggests.
However I would recommend instead refactoring to solve the real problem, which is that your handler function has side effects (reading from input, writing to output).
You could instead turn this into a pure function if you did the following:
Wrap the input stream as a lazy sequence
Use [input-sequence state] as input to your handler function
Use [list-of-writes new-state rest-of-input-sequence] as output, where list-of-writes is whatever needs to be subsequently written to the output stream
If you do this, your handler function is pure; you just need to run it in a simple loop (sending the list-of-writes to the output stream on each iteration) until all input is consumed and/or some other termination condition is reached.

Related

Why think in terms of buffers and not lines? And can fgets() read multiple lines at one time?

Recently I have been reading Unix Network Programming, Vol. 1. In Section 3.9, the last two paragraphs above Figure 3.18 say, and here I quote:
...But our advice is to think in terms of buffers and not lines. Write your code to read buffers of data, and if a line is expected, check the buffer to see if it contains that line.
And in the next paragraph, the authors gave a more specific example, here I quote:
...as we'll see in Section 6.3. System functions like select still won't know about readline's internal buffer, so a carelessly written program could easily find itself waiting in select for data already received and stored in readline's buffers.
In section 6.5, the actual problem is "mixing of stdio and select()", which would make the program, here I quote the book, "error-prone". But how?
I know that the authors gave the answer later in the same section, and according to my understanding of the book, it is because the data is hidden from select(), so select() cannot know whether the data that has already been read has been consumed or not.
The answer is literally there, but the first problem is that I really have a hard time getting it: I cannot imagine what damage it would do to the program. Maybe I need a demo program that suffers from the problem to help me understand it.
Still in section 6.5, the authors tried to explain the problem further by giving, here I quote:
... Consider the case when several lines of input are available from the standard input.
select will cause the code at line 20 to read the input using fgets and that, in turn, will read the available lines into a buffer used by stdio. But, fgets only returns a single line and leaves any remaining data sitting in the stdio buffer ...
The "line 20" mentioned above is:
if (Fgets(sendline, MAXLINE, fp) == NULL)
where sendline is an array of char and fp is a pointer to FILE. I looked into the detailed implementation of Fgets, and it just wraps fgets() with some extra error-handling logic and nothing more.
And here comes my second question: how does fgets manage to, here I quote again, read the available lines? I mean, I looked up the man page of fgets; it says fgets normally stops at the first newline character. Doesn't this mean that only one line is read by fgets? More specifically, if I type one line in the terminal and press the enter key, fgets reads that exact line. I do this again, and the next line is read by fgets; the point is, one line at a time.
Thanks for your patience in reading all the descriptions, and looking forward to your answers.
One of the main reasons to think about buffers rather than lines (when it comes to network programming) is because TCP is a streaming protocol, where data is just a stream of bytes beginning with a connection and ending with a disconnection.
There are no message boundaries, and there are no "lines", except what the application-level protocol on top of TCP has decided.
That makes it impossible to read a "line" from a TCP connection; there are no such primitive functions for it. You must read using buffers. And because of the streaming and the lack of any kind of boundaries, a single call to receive data may give your application less than you ask for, and it may be a partial application-level message. Or you might get more than a single message, including a partial message at the end.
Another important note is that sockets are blocking by default, so a socket that doesn't have any data ready to be received will cause any read call to block and wait until there is data. The select call only tells you whether the read call won't block right now. If you do the read call multiple times in a loop, it can (and ultimately will) block once the data to receive is exhausted.
All this makes it really hard to use high-level functions like fgets (after an fdopen call, of course) to read data from TCP sockets, as it can block at any time if you use a blocking socket. Or it can return with a failure if you use non-blocking sockets and the read call reports that it would block (yes, that is returned as an error).
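To illustrate that last point, a small sketch (assuming POSIX; the -2 return value is only a convention made up for this sketch):

#include <errno.h>
#include <sys/socket.h>

// Returns the byte count read, 0 on orderly shutdown, -1 on a real error,
// and -2 when the non-blocking socket simply has nothing to read yet.
ssize_t try_recv(int fd, char *buf, size_t len) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return -2;   // "would block" is reported as an error, not as success
    return n;
}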
If you use your own buffering, you can use select in the same loop as read or recv to make sure that the call won't block. Or, if you use non-blocking sockets, you can gather data (appending to your buffer) with single read calls and add detection for when you have a full message (either by knowing its length or by detecting a message terminator or separator, like a newline).
As for fgets reading "multiple lines", it can cause the underlying reads to fill the buffers with multiple lines, but the fgets function itself will only fill your supplied buffer with a single line.
fgets will never give you multiple lines.
select is a Linux kernel call. It will tell you if the Linux kernel has data that your process hasn't received yet.
fgets is a C library call. To reduce the number of Linux kernel calls (which are typically slower) the C library uses buffering. It will try to read a big chunk of data from the Linux kernel (typically something like 4096 bytes) and then return just the part you asked for. Next time you call it, it will see if it already read the part you asked for, and then it won't need to read it from the kernel. For example, if it's able to read 5 lines at once from the kernel, it will return the first line, and the other 4 will be stored in the C library and returned in the next 4 calls.
When fgets reads 5 lines, returns 1, and stores 4, the Linux kernel will see that all the data has been read. It doesn't know your program is using the C library to read the data. Therefore, select will say there is no data to read, and your program will get stuck waiting for the next line, even though there already is one.
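If you want a small demo of getting stuck, something like this (a minimal sketch, not from the book) shows it: feed it several lines at once (paste them, or pipe a file in); fgets returns only the first line, and the next select call blocks even though the rest of the input is already sitting in stdio's buffer.

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void) {
    char line[4096];
    for (;;) {
        fd_set rset;
        FD_ZERO(&rset);
        FD_SET(STDIN_FILENO, &rset);
        // Can block here even when stdio has already buffered whole lines.
        if (select(STDIN_FILENO + 1, &rset, NULL, NULL, NULL) < 0)
            break;
        if (FD_ISSET(STDIN_FILENO, &rset)) {
            if (fgets(line, sizeof line, stdin) == NULL)
                break;                    // EOF or error
            printf("got: %s", line);      // only one line per select wakeup
        }
    }
    return 0;
}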
So how do you resolve this? You basically have two options: don't do buffering at all, or do your own buffering so you get to control how it works.
Option 1 means you read 1 byte at a time until you get a \n and then you stop reading. The kernel knows exactly how much data you have read, and it will be able to accurately tell you whether there's more data. However, making a kernel call for every single byte is relatively slow (measure it) and also, the computer on the other end of the connection could cause your program to freeze simply by not sending a \n at all.
I want to point out that option 1 is completely viable if you are just making a prototype. It does work, it's just not very good. If you try to fix the problems with option 1, you will find the only way to fix them is to do option 2.
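For illustration, option 1 in code might look like this (a sketch assuming POSIX; read_line_unbuffered is just a name made up here):

#include <unistd.h>

// Option 1: one read() per byte, stopping at '\n'. select() stays accurate
// because the kernel sees every byte you consume, but you pay one system
// call per byte, and a peer that never sends '\n' keeps you blocked here.
// Returns the line length, or -1 on EOF/error before a newline arrives.
ssize_t read_line_unbuffered(int fd, char *buf, size_t cap) {
    size_t used = 0;
    while (used + 1 < cap) {
        char c;
        ssize_t n = read(fd, &c, 1);
        if (n <= 0)
            return -1;
        buf[used++] = c;
        if (c == '\n')
            break;
    }
    buf[used] = '\0';
    return (ssize_t)used;
}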
Option 2 means doing your own buffering. You keep an array of say 4096 bytes per connection. Whenever select says there is data, you try to fill up the array as much as possible, and you check whether there is a \n in the array. If so, you process that line, remove the line from the array*, and repeat. This means you minimize kernel calls, and you also won't freeze if the other computer doesn't send a \n since the unfinished line will just stay in the array. If all 4096 bytes are used, and there is still no \n, you can either choose to process it as a big line (if this makes sense, e.g. in a chat program) or you can disconnect the connection, since the other computer is breaking the rules. Of course you can choose to use a bigger number than 4096.
* Extra for experts: "removing the line from the array" can be fast if you implement a "circular buffer" data structure.
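A rough sketch of option 2 over a socket (assuming POSIX; struct conn and handle_line are names made up for this sketch):

#include <string.h>
#include <sys/socket.h>

#define BUFSZ 4096

struct conn {
    int    fd;
    char   buf[BUFSZ];   // bytes received but not yet consumed
    size_t used;
};

// Call this when select() reports c->fd readable. handle_line() is whatever
// processes one complete line. Returns -1 on EOF/error or an oversized line,
// 0 otherwise.
int on_readable(struct conn *c, void (*handle_line)(const char *, size_t)) {
    ssize_t n = recv(c->fd, c->buf + c->used, BUFSZ - c->used, 0);
    if (n <= 0)
        return -1;
    c->used += (size_t)n;

    for (;;) {
        char *nl = (char *)memchr(c->buf, '\n', c->used);
        if (nl == NULL)
            break;
        size_t len = (size_t)(nl - c->buf) + 1;          // include the '\n'
        handle_line(c->buf, len);
        memmove(c->buf, c->buf + len, c->used - len);    // drop that line
        c->used -= len;
    }
    // All BUFSZ bytes full with no '\n': oversized line; the caller decides
    // whether to treat it as one big line or drop the connection.
    return (c->used == BUFSZ) ? -1 : 0;
}

The memmove is the "removing the line from the array" step the footnote refers to; a circular buffer avoids that copy.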

What does completing function do in Clojure?

I've come across the completing function on clojuredocs, but there is no doc at the moment.
Could you provide some examples?
completing is used to augment a binary reducing function that may not have a unary overload, adding a unary "completion" arity.
The official transducers reference page hosted at clojure.org explains the purpose of the nullary, unary and binary overloads of transducing functions and includes a good example of when the unary "completion" arity is useful in the section called "Creating Transducers" (the example used is partition-all, which uses completion to produce the final block of output).
In short, the completion arity is used after all input is consumed and gives the transducing function the opportunity to perform any work to flush any buffers that the transducing process might maintain (as in the case of partition-all), apply any final transformations to the output (see below for an example of that) etc. Here by "transducing function" I mean the transducing function actually passed to transduce (or eduction or any similar function that sets up a transducing process) together with all the wrapping transducers.
For an interesting example of completing used with a non-trivial completion function, have a look at Christophe Grand's xforms: net.cgrand.xforms.rfs/str is a transduce-friendly version of clojure.core/str that will build up a string in linear time when used in a transduce call. (In contrast, clojure.core/str, if used in reduce/transduce, would create a new string at each step, and consequently run in O(n²) time.[1]) It uses completing to convert the StringBuilder that it uses under the hood to a string once it consumes all input. Here's a stable link to its definition as of the current tip of the master branch.
[1] Note, however, that clojure.core/str runs in linear time if used with apply – it uses a StringBuilder under the hood just like net.cgrand.xforms.rfs/str. It's still convenient from time to time to have a transduce-friendly version (for use with transducers, or in higher-order contexts, or possibly for performance reasons when dealing with a large collection that can be reduced more efficiently than via the first/next looping that str uses).

Use seekoff to detect a fire hose istream?

When processing input, it can be useful to know if you are being sent data in a firehose-like manner; that is, you get to see the data once and it's forever lost once it goes through the stream.
There are lots of examples of firehoses, such as console input, signal capture devices, data streams from network sockets, or named pipes.
C++ streams and streambufs are supposed to encapsulate input and output behavior, but I'm not sure there's a standard non-destructive way to detect whether the associated sequence lurking behind a streambuf you've been handed is seekable or not.
What if you called streambuf::pubseekoff(0, cur, std::ios_base::in)? Can you rely on that?
Does the standard (or at least design sense) require such a call to seekoff to return streampos(-1) if the associated sequence is not physically seekable? Would it set failbit? Is it undefined behavior?
These streams don't have a viable concept of "absolute position" and so a seekoff probably should return -1 even if the seek does nothing.
But if that behavior is not required, some library designer might have decided to return something else from seekoff in such a case. Perhaps the number of characters fed through the firehose so far. Or gptr()-eback() (useless). Or a random number.
The streambuf type I have the most uncertainty about is basic_filebuf. You should certainly be able to seek within a disk file, but the OS can also encapsulate firehose-type streams within the "file" concept.
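For what it's worth, the probe being discussed would look roughly like this (looks_seekable is just an illustrative name):

#include <iostream>

bool looks_seekable(std::istream& in) {
    std::streambuf* sb = in.rdbuf();
    if (!sb)
        return false;
    // Seek by zero from the current position: a no-op if seeking works at all.
    std::streambuf::pos_type pos =
        sb->pubseekoff(0, std::ios_base::cur, std::ios_base::in);
    return pos != std::streambuf::pos_type(std::streambuf::off_type(-1));
}

The default basic_streambuf::seekoff is specified to return pos_type(off_type(-1)), so a source that simply doesn't override seekoff will fail this probe; the uncertainty raised above is about streambufs that do override it.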

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read some characters that satisfy a certain condition from stdin with the iostream library, while leaving those that do not satisfy the condition in stdin so that the skipped characters can be read later. Is it possible?
For example, I want characters in a-c only, and the input stream is abdddcxa.
First read all the characters in a-c, i.e. abca; after this input is finished, start reading the remaining characters, dddx. (These two reads can't happen simultaneously; they might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
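For instance, a rough sketch with std::regex, reusing the a-c example from the question (the real condition would differ):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input = "abdddcxa";
    // One pass keeps the a-c characters, another keeps everything else.
    std::string wanted = std::regex_replace(input, std::regex("[^a-c]"), "");
    std::string rest   = std::regex_replace(input, std::regex("[a-c]"), "");
    std::cout << wanted << '\n';   // abca
    std::cout << rest   << '\n';   // dddx
    return 0;
}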
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter out the part that you don't want to keep, and put it back into the stream. So for instance:
string myString;
char putbackBuf[64];   // characters to put back, in reverse order, NUL-terminated
char *pPutbackBuf = &putbackBuf[0];
cin >> myString;
// Do stuff to fill putbackBuf[] with the characters to be put back, in reverse order
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
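A rough sketch of that alternative, again with the a-c example (the filtering condition is just a placeholder):

#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string input, wanted, rest;
    std::getline(std::cin, input);                     // e.g. "abdddcxa"
    for (char c : input)
        (c >= 'a' && c <= 'c' ? wanted : rest) += c;

    std::cout << "first pass:  " << wanted << '\n';    // "abca"
    std::istringstream remaining(rest);                // "dddx", readable later
    char c;
    while (remaining >> c)
        std::cout << "second pass: " << c << '\n';
    return 0;
}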
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think it is, but iostreams gurus may differ).

Stream or Iterator to generate all strings that match a regular expression?

This is a follow-up to my previous question.
Suppose I want to generate all strings that match a given (simplified) regular expression.
It is just a coding exercise and I do not have any additional requirements (e.g. how many strings are generated actually). So the main requirement is to produce nice, clean, and simple code.
I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?
The solution to this question requires too much code to be practical to include here, but the outline goes as follows.
First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,
List(
  Constant("abc"),
  ZeroOrOne(Constant("d")),
  Constant("efg"),
  OneOf(Constant("h"), List(Constant("ij"), ZeroOrOne(Constant("klmnop")))),
  Constant("qrs"),
  AnyChar()
)
Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.
Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or however many Unicode code points there are?), and you may wish to restrict those also.
Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.