Data format safety in clojure

Data format safety in clojure - clojure

Coming from a Java background, I'm quite fond of static type safety and wonder how clojure programmers deal with the problem of data format definitions (perhaps not just types but general invariants, because types are just a special case of that.)
This is similar to an existing question "Type Safety in Clojure", but that one focuses more on the aspect of how to check types at compile time, while I'm more interested in how the problem is pragmatically addressed.
As a practical example I'm considering an editor application which handles a particular document format. Each document consists of elements that come in several different varieties (graphics elements, font elements etc.) There would be editors for the different element types, and also of course functions to transform a document from/to a byte stream in its native on-disk format.
The basic problem I am interested in is that the editors and the read/write functions have to agree on a common data format. In Java, I would model the document's data as an object graph, e.g. with one class representing a document and one class for each element variety. This way, I get a compile-time guarantee about what the structure of my data looks like, and that the field "width" of a graphics element is an integer and not a float. It does not guarantee that width is positive - but using a getter/setter interface would allow the corresponding class to add invariant guarantees like that.
Being able to rely on this makes the code dealing with this data simpler, and format violations can be caught at compile-time or early at runtime (where some code attempts to modify data that would violate invariants).
How can you achieve a similar "data format reliability" in Clojure? As far as I know, there is no way to perform compile-time checking and hiding domain data behind a function interface seems to be discouraged as non-idiomatic (or maybe I misunderstand?), so what do Clojure developers do to feel safe about the format of data handed into their functions? How do you get your code to error out as quickly as possible, and not after the user edited for 20 more minutes and tries to save to disk, when the save function notices that there is a graphics element in the list of fonts due to an editor bug?
Please note that I'm interested in Clojure and learning, but didn't write any actual software with it yet, so it's possible that I'm just confused and the answer is very simple - if so, sorry for wasting your time :).

I don't see anything wrong or unidiomatic about using a validating API to construct and manipulate your data as in the following.
(defn text-box [text height width]
{:pre [(string? text) (integer? height) (integer? width)]}
{:type 'text-box :text text :height height :width width})
(defn colorize [thing color]
{:pre [(valid-color? color)]}
(assoc thing :color color))
... (colorize (text-box "Hi!" 20 30) :green) ...
In addition, references (vars, refs, atoms, agents) can have an associated validator function that can be used to ensure a valid state at all times.

Good question - I also find that moving from a statically typed language to a dynamic one requires a bit more care about type safety. Fortunately TDD techniques help a huge amount here.
I typically write a "validate" function which checks all your assumptions about the data structure. I often do this in Java too for invariant assumptions, but in Clojure it's more important because you need to check thinks like types as well.
You can then use the validate function in several ways:
As a quick check at the REPL: (validate foo)
In unit tests: (is (validate (new-foo-from-template a b c)))
As a run-time check for key functions, e.g. checking that (read-foo some-foo-input-stream) is valid
If you have a complex data structure which is a tree of multiple different component types, you can write a validate function for each component type and have the validate function for the whole document call validate for each sub-component recursively. A nice trick is to use either protocols or multimethods to make the validate function polymorphic for each component type.

Related

What is the name of the data structure for and-or-lists (or and-or-trees) and where can I read about it?

I recently needed to make a data structure which was a nested list of and/or questions. Since most every interesting thing has been discovered by someone else previously, I’m looking for the name of this data structure. it looks something like this.
‘((a b c) (b d e) (c (a b) (f a)))
The interpretation is I want to find abc or bde or caf or caa or cbf or cba and the list encapsulates that. At the top level each item is or’ed together and sub-lists of the top level are and’ed together and sub-lists of sub-lists are or’ed again sub-lists of those are and’ed and sub-lists of those or’ed ad infinitum. Note that in my example, all the lists are the same length, in my real application the lists vary in length.
The code to walk such a “tree” is relatively simple, but I’m assuming that there is a name for that type of tree and there is stuff I can read about it.
These lists are equivalent to fixed length regular expressions (which I've seen referred to as "network expressions", but I am particularly interested in this data structure and representation thereof.

In general (in the very high level of abstraction) it is:
Context free grammar -Wiki
If you allow it to be infinitely nested, then it is not a regular expression because of presence of parentheses (left and right should match).
If you consider, that expressions inside parentheses are ordered. I mean that a and b and c is equivalent to (a and b) and c. You get then Binary expression tree -Wiki
But for your particular case, it is probably: Disjunctive normal form -Wiki
I am not sure, but my intuition says that it is regular expression again because you have only 2 levels of nesting (1st - for 'or-ed' and 2nd - for 'and-ed' parts)

The trees are also a subset of DAWGS - directed acyclic word graphs and one could construct them the same way.
In my case, I have a very small set that I have built by hand and I don't worry about getting the minimal set, but instead just want something that I can easily write down but deals with the types of simple variations I see. Basically, I have different ways of finding where I keep my .el files based upon the different directory structures of various OSes I use. (E.g. when I was working at Google, the /usr/local/emacs/site-lisp directory was actually more like /usr/local/Google/emacs/site-lisp.)
I don't need a full regex, but there are about a dozen variations, some having quite long lists of nested sub-directories (c:\users\cfclark\appData\roaming\emacs.emacs.d or some other awful thing) that I wanted to write down (and then have emacs make an automated search to find the one that is appropriate to this particular installation). And every time I go to a new job, I can simply add to the list a description of where they are in that setup.
Anyway, as that code has evolved, I found that I had I was doing (nested or's and and's and realized that the structure generalized to the alternating or/and/or/and/... case). So, my assumption is that someone must have discovered this before. I had hints of it myself several years ago, but didn't set down to implement it. The Disjunctive Normal Form link mpasko256 gave is also particularly relevant. I don't normalize to that level, I still keep nested and's and or's rather than flattening to 2, but I do have a distinct structure, or's at the top, then and's, then or's....

clojure code modification preserving reader macros

In clojure, read-string followed by str will not return the original string, but the string for which the reader macros have been expanded:
(str (read-string "(def foo [] #(bar))"))
;"(def foo [] (fn* [] (bar)))"
This is problematic if we want to manipulate a small part of the code, far away from any reader macros, and get back a string representation that preserves the reader macros. Is there a work around?

The purpose of read is to build an AST of the code and as such the function does not preserve all the properties of the original text. Otherwise, it should keep track of the original code-layout (e.g. location of parenthesis, newlines, indentation (tabs/spaces), comments, and so on). If you have a look at LispReader.java, you can see that the reader macros are unconditionally applied (*read-eval* does not influence all reader macros).
Here is what I would recommend:
You can take inspiration of the existing LispReader and implement your own reader. Maybe it is sufficient to change the dispatch-table so that macro characters are directed to you own reader. You would also need to build runtime representations of the quoted forms and provide adequate printer functions for those objects.
You can process your original file with Emacs lisp, which can easily navigate the structure of your code and edit it as you wish.
Remark: you must know that what you are trying to achieve smells fishy. You might have a good reason to want to do this, but this looks needlessly complex without knowing why you want to work at the syntax level. It would help if you could provide more details.

How would you idiomatically extend arithmetric functions for other datatypes in Clojure?

So I want to use java.awt.Color for something, and I'd like to be able to write code like this:
(use 'java.awt.Color)
(= Color/BLUE (- Color/WHITE Color/RED Color/GREEN))
Looking at the core implementation of -, it talks specifically about clojure.lang.Numbers, which to me implies that there is nothing I do to 'hook' into the core implementation and extend it.
Looking around on the Internet, there seems to be two different things people do:
Write their own defn - function, which only knows about the data type they're interested in. To use you'd probably end up prefixing a namespace, so something like:
(= Color/BLUE (scdf.color/- Color/WHITE Color/RED Color/GREEN))
Or alternatively useing the namespace and use clojure.core/- when you want number math.
Code a special case into your - implementation that passes through to clojure.core/- when your implementation is passed a Number.
Unfortunately, I don't like either of these. The first is probably the cleanest, as the second makes the presumption that the only things you care about doing maths on is their new datatype and numbers.
I'm new to Clojure, but shouldn't we be able to use Protocols or Multimethods here, so that when people create / use custom types they can 'extend' these functions so they work seemlessly? Is there a reason that +,- etc doesn't support this? (or do they? They don't seem to from my reading of the code, but maybe I'm reading it wrong).
If I want to write my own extensions to common existing functions such as + for other datatypes, how should I do it so it plays nicely with existing functions and potentially other datatypes?

It wasn't exactly designed for this, but core.matrix might be of interest to you here, for a few reasons:
The source code provides examples of how to use protocols to define operations that work with with various different types. For example, (+ [1 2] [3 4]) => [4 6]). It's worth studying how this is done: basically the operators are regular functions that call a protocol, and each data type provides an implementation of the protocol via extend-protocol
You might be interested in making java.awt.Color work as a core.matrix implementation (i.e. as a 4D RGBA vector). I did something simiilar with BufferedImage here: https://github.com/clojure-numerics/image-matrix. If you implement the basic core.matrix protocols, then you will get the whole core.matrix API to work with Color objects. Which will save you a lot of work implementing different operations.

The probable reason for not making arithmetic operation in core based on protocols (and making them only work of numbers) is performance. A protocol implementation require an additional lookup for choosing the correct implementation of the desired function. Although from design point of view it may feel nice to have protocol based implementations and extend them whenever required, but when you have a tight loop that does these operations many times (and this is very common use case with arithmetic operations) you will start feeling the performance issues because of the additional lookup on each operation that happen at runtime.
If you have separate implementation for your own data types (ex: color/-) in their own namespace then it will be more performant due to a direct call to that function and it also make things more explicit and customizable for specific cases.
Another issue with these functions will be their variadic nature (i.e they can take any number of arguments). This is a serious issue in providing a protocol implementation as protocol extended type check only works on first parameter.

You can have a look at algo.generic.arithmetic in algo.generic. It uses multimethods.

Named parameter string formatting in C++

I'm wondering if there is a library like Boost Format, but which supports named parameters rather than positional ones. This is a common idiom in e.g. Python, where you have a context to format strings with that may or may not use all available arguments, e.g.
mouse_state = {}
mouse_state['button'] = 0
mouse_state['x'] = 50
mouse_state['y'] = 30
#...
"You clicked %(button)s at %(x)d,%(y)d." % mouse_state
"Targeting %(x)d, %(y)d." % mouse_state
Are there any libraries that offer the functionality of those last two lines? I would expect it to offer a API something like:
PrintFMap(string format, map<string, string> args);
In Googling I have found many libraries offering variations of positional parameters, but none that support named ones. Ideally the library has few dependencies so I can drop it easily into my code. C++ won't be quite as idiomatic for collecting named arguments, but probably someone out there has thought more about it than me.
Performance is important, in particular I'd like to keep memory allocations down (always tricky in C++), since this may be run on devices without virtual memory. But having even a slow one to start from will probably be faster than writing it from scratch myself.

The fmt library supports named arguments:
print("You clicked {button} at {x},{y}.",
arg("button", "b1"), arg("x", 50), arg("y", 30));
And as a syntactic sugar you can even (ab)use user-defined literals to pass arguments:
print("You clicked {button} at {x},{y}.",
"button"_a="b1", "x"_a=50, "y"_a=30);
For brevity the namespace fmt is omitted in the above examples.
Disclaimer: I'm the author of this library.

I've always been critic with C++ I/O (especially formatting) because in my opinion is a step backward in respect to C. Formats needs to be dynamic, and makes perfect sense for example to load them from an external resource as a file or a parameter.
I've never tried before however to actually implement an alternative and your question made me making an attempt investing some weekend hours on this idea.
Sure the problem was more complex than I thought (for example just the integer formatting routine is 200+ lines), but I think that this approach (dynamic format strings) is more usable.
You can download my experiment from this link (it's just a .h file) and a test program from this link (test is probably not the correct term, I used it just to see if I was able to compile).
The following is an example
#include "format.h"
#include <iostream>
using format::FormatString;
using format::FormatDict;
int main()
{
std::cout << FormatString("The answer is %{x}") % FormatDict()("x", 42);
return 0;
}
It is different from boost.format approach because uses named parameters and because
the format string and format dictionary are meant to be built separately (and for
example passed around). Also I think that formatting options should be part of the
string (like printf) and not in the code.
FormatDict uses a trick for keeping the syntax reasonable:
FormatDict fd;
fd("x", 12)
("y", 3.141592654)
("z", "A string");
FormatString is instead just parsed from a const std::string& (I decided to preparse format strings but a slower but probably acceptable approach would be just passing the string and reparsing it each time).
The formatting can be extended for user defined types by specializing a conversion function template; for example
struct P2d
{
int x, y;
P2d(int x, int y)
: x(x), y(y)
{
}
};
namespace format {
template<>
std::string toString<P2d>(const P2d& p, const std::string& parms)
{
return FormatString("P2d(%{x}; %{y})") % FormatDict()
("x", p.x)
("y", p.y);
}
}
after that a P2d instance can be simply placed in a formatting dictionary.
Also it's possible to pass parameters to a formatting function by placing them between % and {.
For now I only implemented an integer formatting specialization that supports
Fixed size with left/right/center alignment
Custom filling char
Generic base (2-36), lower or uppercase
Digit separator (with both custom char and count)
Overflow char
Sign display
I've also added some shortcuts for common cases, for example
"%08x{hexdata}"
is an hex number with 8 digits padded with '0's.
"%026/2,8:{bindata}"
is a 24-bit binary number (as required by "/2") with digit separator ":" every 8 bits (as required by ",8:").
Note that the code is just an idea, and for example for now I just prevented copies when probably it's reasonable to allow storing both format strings and dictionaries (for dictionaries it's however important to give the ability to avoid copying an object just because it needs to be added to a FormatDict, and while IMO this is possible it's also something that raises non-trivial problems about lifetimes).
UPDATE
I've made a few changes to the initial approach:
Format strings can now be copied
Formatting for custom types is done using template classes instead of functions (this allows partial specialization)
I've added a formatter for sequences (two iterators). Syntax is still crude.
I've created a github project for it, with boost licensing.

The answer appears to be, no, there is not a C++ library that does this, and C++ programmers apparently do not even see the need for one, based on the comments I have received. I will have to write my own yet again.

Well I'll add my own answer as well, not that I know (or have coded) such a library, but to answer to the "keep the memory allocation down" bit.
As always I can envision some kind of speed / memory trade-off.
On the one hand, you can parse "Just In Time":
class Formater:
def __init__(self, format): self._string = format
def compute(self):
for k,v in context:
while self.__contains(k):
left, variable, right = self.__extract(k)
self._string = left + self.__replace(variable, v) + right
This way you don't keep a "parsed" structure at hand, and hopefully most of the time you'll just insert the new data in place (unlike Python, C++ strings are not immutable).
However it's far from being efficient...
On the other hand, you can build a fully constructed tree representing the parsed format. You will have several classes like: Constant, String, Integer, Real, etc... and probably some subclasses / decorators as well for the formatting itself.
I think however than the most efficient approach would be to have some kind of a mix of the two.
explode the format string into a list of Constant, Variable
index the variables in another structure (a hash table with open-addressing would do nicely, or something akin to Loki::AssocVector).
There you are: you're done with only 2 dynamically allocated arrays (basically). If you want to allow a same key to be repeated multiple times, simply use a std::vector<size_t> as a value of the index: good implementations should not allocate any memory dynamically for small sized vectors (VC++ 2010 doesn't for less than 16 bytes worth of data).
When evaluating the context itself, look up the instances. You then parse the formatter "just in time", check it agaisnt the current type of the value with which to replace it, and process the format.
Pros and cons:
- Just In Time: you scan the string again and again
- One Parse: requires a lot of dedicated classes, possibly many allocations, but the format is validated on input. Like Boost it may be reused.
- Mix: more efficient, especially if you don't replace some values (allow some kind of "null" value), but delaying the parsing of the format delays the reporting of errors.
Personally I would go for the One Parse scheme, trying to keep the allocations down using boost::variant and the Strategy Pattern as much I could.

Given that Python it's self is written in C and that formatting is such a commonly used feature, you might be able (ignoring copy write issues) to rip the relevant code from the python interpreter and port it to use STL maps rather than Pythons native dicts.

I've writen a library for this puporse, check it out on GitHub.
Contributions are wellcome.

calculating user defined formulas (with c++)

We would like to have user defined formulas in our c++ program.
e.g. The value v = x + ( y - (z - 2)) / 2. Later in the program the user would define x,y and z -> the program should return the result of the calculation. Somewhen later the formula may get changed, so the next time the program should parse the formula and add the new values. Any ideas / hints how to do something like this ? So far I just came to the solution to write a parser to calculate these formulas - maybe any ideas about that ?

If it will be used frequently and if it will be extended in the future, I would almost recommend adding either Python or Lua into your code. Lua is a very lightweight scripting language which you can hook into and provide new functions, operators etc. If you want to do more robust and complicated things, use Python instead.

You can represent your formula as a tree of operations and sub-expressions. You may want to define types or constants for Operation types and Variables.
You can then easily enough write a method that recurses through the tree, applying the appropriate operations to whatever values you pass in.

Building your own parser for this should be a straight-forward operation:
) convert the equation from infix to postfix notation (a typical compsci assignment) (I'd use a stack)
) wait to get the values you want
) pop the stack of infix items, dropping the value for the variable in where needed
) display results

Using Spirit (for example) to parse (and the 'semantic actions' it provides to construct an expression tree that you can then manipulate, e.g., evaluate) seems like quite a simple solution. You can find a grammar for arithmetic expressions there for example, if needed... (it's quite simple to come up with your own).
Note: Spirit is very simple to learn, and quite adapted for such tasks.

There's generally two ways of doing it, with three possible implementations:
as you've touched on yourself, a library to evaluate formulas
compiling the formula into code
The second option here is usually done either by compiling something that can be loaded in as a kind of plugin, or it can be compiled into a separate program that is then invoked and produces the necessary output.
For C++ I would guess that a library for evaluation would probably exist somewhere so that's where I would start.

If you want to write your own, search for "formal automata" and/or "finite state machine grammar"
In general what you will do is parse the string, pushing characters on a stack as you go. Then start popping the characters off and perform tasks based on what is popped. It's easier to code if you force equations to reverse-polish notation.

To make your life easier, I think getting this kind of input is best done through a GUI where users are restricted in what they can type in.
If you plan on doing it from the command line (that is the impression I get from your post), then you should probably define a strict set of allowable inputs (e.g. only single letter variables, no whitespace, and only certain mathematical symbols: ()+-*/ etc.).
Then, you will need to:
Read in the input char array
Parse it in order to build up a list of variables and actions
Carry out those actions - in BOMDAS order

With ANTLR you can create a parser/compiler that will interpret the user input, then execute the calculations using the Visitor pattern. A good example is here, but it is in C#. You should be able to adapt it quickly to your needs and remain using C++ as your development platform.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js