Why build a Clojure string using literal characters?


I was reading some code just now, and I ran across this line:
(str cache \, \space lru \, \space tick \, \space limit)
This is odd to me. Consecutive literal characters are used, instead of a string containing those characters. I would expect something more like this:
(str cache ", " lru ", " tick ", " limit)
But this is in a core library written by some venerable Clojure veterans, which makes me think that maybe there's a reason. What's the reason? Performance? Or what?

There is no good reason. It probably performs worse than the version with strings, in addition to being uglier. You could also write (join ", " [cache lru tick limit]), which would be my preference, but I can understand choosing to write it out longhand. Using characters instead of strings is inexplicable, though. I can't track down the commit that added this line, because the repo has been destroyed and rebuilt at least twice as it switched names, but I'd wager there was no good reason. Maybe some misguided performance ideas.

Related

Would it be more efficient to have the \n character in the same string literal than to have an extra set of "<<"?

I was wondering if in code such as this:
std::cout << "Hello!" << '\n';
Would it be more efficient to have the \n character in the same string literal, as in this case, "Hello!"? Such as the code would then look like this:
std::cout << "Hello!\n";
I am aware that differences like this are so minuscule (especially in this case) that it just doesn't matter which one is more efficient, but it has been boggling my mind.
My reasoning:
If I am not mistaken, having the \n character in the same string literal would be more efficient. With the extra insertion operator (operator<<), that function, and whatever is in its Standard Library implementation, has to be called once more just for a single character (\n). If I append the character to the end of the string literal instead, only one call to operator<< is needed.
I guess this is a very simple question but I am not 100% sure about the answer and I love to spend time knowing the details of a language and how little things work better than others.
Update:
I am working to find the answer myself, since for questions like this it is better to try to find the answer oneself. I haven't worked with any assembly code in the past, so this will be my first time trying to.

Yes; but all of the ostreams are so inefficient in practice that if you care much about efficiency you shouldn't be using them.
Write code to be clear and maintainable. Making code faster takes work, and the simpler and clearer your code is the easier to make it faster.
Identify what your bottleneck is, then work on optimizing that by actually working out what is taking time. (This only fails when global behaviour causes global slowdowns, like fragmentation or messing with caches.)
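For reference, here is a minimal sketch of the two forms being compared. It writes to a std::ostringstream instead of std::cout so the results can be compared, and the helper names with_two_inserts and with_one_insert are made up for this example. It only shows that both forms produce the same output; it does not measure anything.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: two calls to operator<<, one for the literal
// and one for the single character.
std::string with_two_inserts() {
    std::ostringstream out;
    out << "Hello!" << '\n';
    return out.str();
}

// Hypothetical helper: one call to operator<<, with the newline
// placed inside the string literal.
std::string with_one_insert() {
    std::ostringstream out;
    out << "Hello!\n";
    return out.str();
}
```

Since the observable result is the same, any real difference would have to come out of a profiler run on the actual program.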

How to manipulate regex replacement strings in Elixir

I found myself wanting to do this in Elixir:
re_sentence_frag = %r/(\w([^\.]|\.(?!\s|$))*)(?=\.(\s|$))/
Regex.replace(re_sentence_frag, " oh. a DOG. woOf. ", String.capitalize("\\1"))
Of course, that has no effect. (It capitalizes the string "\\1" just once.) What I really meant is to apply String.capitalize/1 to every match found by the replace function. But the 3rd parameter can't take a function reference, so passing &(String.capitalize("\\1")) also doesn't work.
This seems so fundamental that I'm surprised it's not possible. Is there another approach that would as neatly express this kind of manipulation? It looks like the underlying Erlang libraries would not immediately support passing a function reference as the 3rd parameter, so this may not be completely trivial to fix in Elixir.
How would you program manipulation of each matched string?

Here is one solution based on split:
" oh. a DOG. woOf. pi is 3.14159. try version 7.a." |>
String.split(%r/(^|\.)(\s+|$)/) |>
Enum.map_join(&String.capitalize/1)
I guess it's not much more clumsy than my original attempt. The regex is considerably simpler, as it only needs to find the bits between sentences.

Encode/decode certain text sequences in Qt

I have a QTextEdit where the user can insert arbitrary text. In this text, there may be some special sequences of characters which I wish to translate automatically. And from the translated version, I wish I could go back to the sequences.
Take for instance this:
QMessageBox::information(0, "Foo", MAGIC_TRANSLATE(myTextEdit->text()));
If the user wrote, inside myTextEdit's text, the sequence \n, I would like that MAGIC_TRANSLATE converted the string \n to an actual new line character.
In the same way, if I give a text with a new line inside it, a MAGIC_UNTRANSLATE will convert the newline with a \n string.
Now, of course I can implement these two functions by myself, but what I am asking is if there is something already made, easy to use, in Qt, which allows me to specify a dictionary and it does the rest for me.
Note that sequences with common prefix can create some conflicts, for example converting:
\foo -> FOO
\foobar -> FOOBAR
can give rise to issues when translating the text asd \foobar lol, because if \foo is searched and replaced before \foobar, then the resulting text will be asd FOObar lol instead of the (more natural) asd FOOBAR lol.
I hope to have made clear my needs. I believe that this may be a common task, so I hope there is a Qt solution which takes into account this kind of issues when having conflicting prefixes.
I am sorry if this is a trivial topic (as I think it may be), but I am not at all familiar with encoding techniques and issues, and my knowledge of Qt encoding covers only very simple Unicode-related issues.
EDIT:
Btw, in my case a data-oriented approach, based on resources or external files or anything that does not require recompilation, would be great.

It sounds like your question is, "I want to run a sequence of regular expression or simple string replacements to map between two encodings of some text".
First you need to work out your mapping, exactly. As you say, if your escape sequences like \foo and \foobar are fiddly, you might find that you don't have a bidirectional, lossless mapping. No library in the world can help you if your design or encoding is flawed.
When you end up with a precise design (which we can't help you on given the complete lack of information provided on the purpose of this function), you'll probably find that a sequence of string replacements is fine. If it really is more complicated, then some QRegExps should be enough.

It is always a bit ugly to self-answer questions, but... Maybe this solution is useful to someone.
As suggested by Nicholas in his answer, a good strategy is to use replacement. It is simple and effective in most cases, for example in the plain C/C++ escaping:
\n \r \t etc
This works because they are all different. It will always work with a replacement if the sequences are all different and, in particular, if no sequence is a prefix to another sequence.
For example, if your sequences are the ones above plus some Greek letters, you will not like the \nu sequence, which should be translated to ν.
Instead, if the replacing function tests for \n before \nu, the result is wrong.
Assuming that both sequences will be translated into two completely different entities, there are two solutions: add a closing character to each sequence, for example \nu;, or just replace from longest to shortest strings. This ensures that any sequence which is a prefix of another one is not replaced before it.
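The longest-to-shortest idea can be sketched as follows. This is only an illustration under my own assumptions: it uses plain std::string rather than QString, the names Mapping and translate_longest_first are invented here, and it scans the text once, trying the longest sequence first at each position, instead of running one replace call per sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical dictionary type: (sequence, replacement) pairs.
using Mapping = std::vector<std::pair<std::string, std::string>>;

// Scan the text once; at each position try the sequences from longest
// to shortest, so a prefix such as \foo never wins over \foobar.
std::string translate_longest_first(const std::string& text, Mapping rules) {
    std::sort(rules.begin(), rules.end(),
              [](const std::pair<std::string, std::string>& a,
                 const std::pair<std::string, std::string>& b) {
                  return a.first.size() > b.first.size();
              });
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        for (const std::pair<std::string, std::string>& rule : rules) {
            const std::string& from = rule.first;
            if (!from.empty() && text.compare(pos, from.size(), from) == 0) {
                out += rule.second;   // emit the replacement
                pos += from.size();   // skip the matched sequence
                matched = true;
                break;
            }
        }
        if (!matched) out += text[pos++];  // copy an ordinary character
    }
    return out;
}
```

With the rules \foo -> FOO and \foobar -> FOOBAR, the text asd \foobar lol comes out as asd FOOBAR lol. Because replaced text is never rescanned, a replacement that happens to contain another sequence is left alone.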
For various reasons, I tried another way: using a trie, which is a tree of all the prefixes of a dictionary of words. Long story short: it works fairly well and probably works faster than (most) regexes and replacements.
Regexes are state machines, and it is not rare for them to re-process the input. With a trie, you avoid matching the same characters twice, so you go pretty fast.
Code for tries is pretty easy to find on the internet, and the modifications to do efficient matching are trivial, so I will not write the code here.
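For readers who want a starting point anyway, here is a small sketch of what such a trie could look like. It is not the code I actually used: the names TrieNode and SequenceTrie are invented for this example, it works on std::string rather than QString, and it uses std::map for the children for brevity rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical node type: one node per prefix of the dictionary.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool terminal = false;      // true if a whole sequence ends here
    std::string replacement;    // only meaningful when terminal is true
};

class SequenceTrie {
public:
    void add(const std::string& sequence, const std::string& replacement) {
        TrieNode* node = &root_;
        for (char c : sequence) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->terminal = true;
        node->replacement = replacement;
    }

    // Walk the trie from each position and keep the longest sequence found.
    std::string translate(const std::string& text) const {
        std::string out;
        std::size_t pos = 0;
        while (pos < text.size()) {
            const TrieNode* node = &root_;
            const TrieNode* best = nullptr;
            std::size_t best_len = 0;
            for (std::size_t i = pos; i < text.size(); ++i) {
                auto it = node->children.find(text[i]);
                if (it == node->children.end()) break;
                node = it->second.get();
                if (node->terminal) {
                    best = node;
                    best_len = i - pos + 1;
                }
            }
            if (best) {
                out += best->replacement;
                pos += best_len;
            } else {
                out += text[pos++];  // no sequence starts here
            }
        }
        return out;
    }

private:
    TrieNode root_;
};
```

After adding \n and \nu to the trie, translating a text that contains \nu picks the longer sequence, while a plain \n followed by any other character is still recognized.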

Parse many small strings or a single big string - which is faster?

In a scenario where a large number of strings must be parsed with regular expressions, considering the same RegEx needle will be used for all tests, which would be faster:
To test each string in an array, individually, or;
To concatenate everything into a single big string and test just once?
I believed number 2 would be best instead of having to fire up the RegEx engine multiple times for processing an array of strings. However, after some testing in PHP (PCRE), it seemed untrue.
Benchmark
I've made a simple benchmark in PHP 5.3 (source code) and got the following results:
122185 interactions in 5 seconds testing multiple smaller strings inside an array
26853 interactions in 5 seconds doing the single big-string test
Therefore, I must conclude the first method is up to 5 times faster. However, I'd like to ask for an authoritative answer confirming this; I could be assuming things mistakenly due to some PHP optimization I'm unaware of.
Is it always a more optimized solution to fragment large strings before testing them with regular expressions, not specifically in PCRE?
preg_grep()
I don't think this function should be considered here. It's a benchmark test, not an optimization issue. Not to mention that function is a PHP-specific method. Also, preg_match_all returns all matched substrings whereas preg_grep just indicates which array elements matched.

Your benchmark is flawed. Look at this piece of your code:
while(time() - $TimeStart < 5)
{
    for($i = 0; $i < $Length; $i++, $Iterations++)
    {
        preg_match_all($RegEx, $Input[$i], $m);
    }
}
The $Iterations counter should only be incremented in the while loop, not inside the for loop. Dividing the former result by the array length gives:
24437 iterations using array
26853 iterations using big string
You shouldn't use time() for time measurements, microtime() would be more suitable to gain accuracy.
Lastly, this benchmark isn't complete, because to obtain the same results for both tests, the array method needs to perform array_merge() after every iteration. Also, somewhere a big string needs to be transformed into an array, and that takes time too.
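The counting and timing points can be sketched like this. Note that this is a C++ analogue and not the PHP benchmark from the question: the names BenchResult and count_passes are invented, std::regex stands in for PCRE, and std::chrono::steady_clock plays the role that microtime() would play in PHP.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct BenchResult {
    std::size_t passes;    // full passes over the input: the figure to compare
    std::size_t searches;  // individual regex searches performed
};

// Run whole passes over the inputs until the time budget is used up.
BenchResult count_passes(const std::vector<std::string>& inputs,
                         const std::regex& pattern,
                         std::chrono::milliseconds budget) {
    using Clock = std::chrono::steady_clock;  // monotonic, sub-second resolution
    const Clock::time_point start = Clock::now();
    BenchResult result{0, 0};
    while (Clock::now() - start < budget) {
        for (const std::string& input : inputs) {
            std::smatch m;
            std::regex_search(input, m, pattern);
            ++result.searches;  // counted per element, like the flawed counter
        }
        ++result.passes;        // counted once per pass, like the fixed counter
    }
    return result;
}
```

Comparing passes (not searches) against the number of big-string tests is what puts the two methods on the same footing.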

You definitely should not merge all the target strings into one. For one thing, it will break a lot of regexes that work okay on the shorter strings. Anchors like ^ and $, \A and \z, would suddenly find themselves with nothing to match. Also, regexes that rely heavily on things like .* or .*?, which work on shorter strings despite their inherent inefficiency, would tend to become catastrophically slow when used on the Frankenstring.
But even if the concatenated version turned out to be faster, would it matter? Have you tried the array version and found it to be too slow? This is a pretty drastic solution (if solution it is); I'd hold off on implementing it until I had a problem it could solve, if I were you.
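To make the point about anchors concrete, here is a small sketch using C++ std::regex instead of PCRE (an assumption made purely for illustration; the helper names count_matching and join_with_space are invented). An anchored pattern such as ^[0-9]+$ matches the all-digit strings one by one, but finds nothing once the strings are joined.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Count how many inputs contain a match, testing each string separately.
std::size_t count_matching(const std::vector<std::string>& inputs,
                           const std::regex& pattern) {
    std::size_t n = 0;
    for (const std::string& s : inputs) {
        if (std::regex_search(s, pattern)) ++n;
    }
    return n;
}

// Build the "Frankenstring": all inputs joined with a single space.
std::string join_with_space(const std::vector<std::string>& inputs) {
    std::string joined;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (i != 0) joined += ' ';
        joined += inputs[i];
    }
    return joined;
}
```

For the inputs 123, 456 and abc, the per-string test reports two matches, while a search over the joined string 123 456 abc reports none, because ^ and $ now refer to the ends of the whole string.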

Why must C/C++ string literal declarations be single-line?

Is there any particular reason that multi-line string literals such as the following are not permitted in C++?
string script =
"
Some
Formatted
String Literal
";
I know that multi-line string literals may be created by putting a backslash before each newline.
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
Is there any technical reason for avoiding this kind of string literal? Otherwise I would have to use a python-like string literal with a triple quote (which I don't want to do):
string script =
"""
Some
Formatted
String Literal
""";
Why must C/C++ string literal declarations be single-line?

The terse answer is "because the grammar prohibits multiline string literals." I don't know whether there is a good reason for this other than historical reasons.
There are, of course, ways around this. You can use line splicing:
const char* script = "\
Some\n\
Formatted\n\
String Literal\n\
";
If the \ appears as the last character on the line, the newline will be removed during preprocessing.
Or, you can use string literal concatenation:
const char* script =
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated in an early translation phase, right after preprocessing, so these will end up as a single string literal at compile-time.
Using either technique, the string literal ends up as if it were written:
const char* script = " Some\n Formatted\n String Literal\n";
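As a quick check, the following sketch compares the two techniques with std::strcmp. The variable names spliced and concatenated are made up for this example, and unlike the snippets above there is no leading space inside the literals.

```cpp
#include <cassert>
#include <cstring>

// Line splicing: a backslash at the very end of a line joins it with the
// next one. The continuation lines start in column 0 on purpose, because
// any indentation there would become part of the string.
const char* spliced = "\
Some\n\
Formatted\n\
String Literal\n\
";

// Adjacent string literals are concatenated into a single literal.
// Indentation outside the quotes does not matter here.
const char* concatenated =
    "Some\n"
    "Formatted\n"
    "String Literal\n";
```

Both pointers refer to the same text, so std::strcmp(spliced, concatenated) returns 0. The sensitivity of the spliced form to indentation is one reason the concatenated form is usually easier to maintain.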

One has to consider that C was not written to be an "Applications" programming language but a systems programming language. It would not be inaccurate to say it was designed expressly to rewrite Unix. With that in mind, there was no EMACS or VIM and your user interfaces were serial terminals. Multiline string declarations would seem a bit pointless on a system that did not have a multiline text editor. Furthermore, string manipulation would not be a primary concern for someone looking to write an OS at that particular point in time. The traditional set of UNIX scripting tools such as AWK and SED (amongst MANY others) are a testament to the fact they weren't using C to do significant string manipulation.
Additional considerations: it was not uncommon in the early 70s (when C was written) to submit your programs on PUNCH CARDS and come back the next day to get them. Would it have eaten up extra processing time to compile a program with multiline string literals? Not really. It can actually be less work for the compiler. But you were going to come back for it the next day anyhow in most cases. And nobody who was filling out a punch card was going to put large amounts of text that wasn't needed into their programs.
In a modern environment, there is probably no reason not to include multiline string literals other than designer's preference. Grammatically speaking, it's probably simpler because you don't have to take linefeeds into consideration when parsing the string literal.

In addition to the existing answers, you can work around this using C++11's raw string literals, e.g.:
#include <iostream>
#include <string>
int main() {
std::string str = R"(a
b)";
std::cout << str;
}
/* Output:
a
b
*/
[n3290: 2.14.5/4]: [ Note: A source-file new-line in a raw string
literal results in a new-line in the resulting execution
string-literal. Assuming no whitespace at the beginning of lines in
the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
Though non-normative, this note and the example that follows it in [n3290: 2.14.5/5] serve to complement the indication in the grammar that the production r-char-sequence may contain newlines (whereas the production s-char-sequence, used for normal string literals, may not).

Others have mentioned some excellent workarounds, I just wanted to address the reason.
The reason is simply that C was created at a time when processing was at a premium and compilers had to be simple and as fast as possible. These days, if C were to be updated (I'm looking at you, C1X), it's quite possible to do exactly what you want. It's unlikely, however. Mostly for historical reasons; such a change could require extensive rewrites of compilers, and so will likely be rejected.

The C preprocessor works on a line-by-line basis, but with lexical tokens. That means that the preprocessor understands that "foo" is a token. If C were to allow multi-line literals, however, the preprocessor would be in trouble. Consider:
"foo
#ifdef BAR
bar
#endif
baz"
The preprocessor isn't able to mess with the inside of a token - but it's operating line-by-line. So how is it supposed to handle this case? The easy solution is to simply forbid multiline strings entirely.

Actually, you can break it up thus:
string script =
"\n"
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated by the compiler.

Strings can span multiple lines, but each line has to be quoted individually:
string script =
" \n"
" Some \n"
" Formatted \n"
" String Literal ";

"I am writing a programming language (similar to C) and would like to let write multi-line strings easily (like in above example)."
There is no reason why you couldn't create a programming language that allows multi-line strings.
For example, Vedit Macro Language (which is C-like scripting language for VEDIT text editor) allows multi-line strings, for example:
Reg_Set(1,"
Some
Formatted
String Literal
")
It is up to you how you define your language syntax.

You can also do:
string useMultiple = "this "
"is "
"a string in C.";
Place one literal after another without any special chars.

Literal declarations don't have to be single-line.
GPUImage inlines multiline shader code. Check out its SHADER_STRING macro.
