What are the uses of symbols like '$ % &' in EDN? - clojure

I am new to EDN and am going through the EDN spec - https://github.com/edn-format/edn
What is the use of EDN symbols like '$ % &', and how can I make use of them while reading EDN in Clojure?

The spec mentions Symbols, but it doesn't mean what is referred to as symbols on, say, a keyboard.
It's a bit confusing, so let me reframe it. On a keyboard, for example, someone might say that there are symbols, and those would be @, #, $, %, ^, &, among others. Let's call these character symbols.
Now in EDN, you have Symbols, but the term does not refer to a character symbol. It refers to a data type. What's even more confusing is that the spec mentions that an EDN Symbol can contain a certain set of character symbols, but it is not itself a character symbol.
So what are EDN symbols? Here's some:
hello
abc
+
person/name
this$is#insane
Each of these is a valid EDN Symbol. It helps to contrast them with other types to understand them, so here are a bunch of EDN Strings:
"hello"
"abc"
"+"
"person/name"
"this$is#insane"
And here's a bunch of EDN keywords:
:hello
:abc
:+
:person/name
:this$is#insane
So what distinguishes these? EDN Symbols, Strings and Keywords are all just sequences of characters. The allowed characters differ depending on whether it is a Symbol, a String or a Keyword, and that's why the EDN spec says that a Symbol can contain certain characters like $ and ?. But it does not allow all characters. For example, ^ is not allowed in a Symbol, but it is in a String:
hello^john ; Not a valid EDN symbol
"hello^john" ; A valid EDN string
Also, you can see that an EDN string must have its characters enclosed between opening and closing double quotes "". A keyword, on the other hand, must start with a colon :. And a symbol doesn't need any marker: any continuous sequence of valid characters is a symbol, as long as it doesn't begin with : and isn't enclosed in double quotes.
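As a rough illustration of the distinction above, here is a minimal JavaScript sketch that classifies a single token by its marker and its characters. It is a simplification written for this answer, not a real EDN reader: the function name classifyEdnToken is hypothetical, the character set is taken from the EDN spec's description of symbols, and edge cases such as numbers, nil, true and false are ignored.

```javascript
// Simplified classifier for one EDN token (a sketch, not a full EDN reader).
// Characters allowed at the start of a symbol, per the EDN spec's symbol rules.
const FIRST = "A-Za-z.*+!\\-_?$%&=<>";
// Digits, ':' and '#' are allowed only after the first character.
const REST = FIRST + "0-9:#";
const NAME = `[${FIRST}][${REST}]*`;
// Either "/" alone, or a name with at most one "/" separating prefix and name.
const SYMBOL = new RegExp(`^(?:/|${NAME}(?:/${NAME})?)$`);

function classifyEdnToken(token) {
  // Strings are enclosed in double quotes and may contain almost anything.
  if (/^"(?:[^"\\]|\\.)*"$/.test(token)) return "string";
  // Keywords start with a colon and otherwise follow the symbol rules.
  if (token.startsWith(":")) {
    return token !== ":/" && SYMBOL.test(token.slice(1)) ? "keyword" : "invalid";
  }
  // A leading +, - or . followed by a digit would read as a number, not a symbol.
  if (/^[+\-.][0-9]/.test(token)) return "invalid";
  return SYMBOL.test(token) ? "symbol" : "invalid";
}

console.log(classifyEdnToken("this$is#insane"));  // symbol
console.log(classifyEdnToken(":person/name"));    // keyword
console.log(classifyEdnToken('"hello^john"'));    // string
console.log(classifyEdnToken("hello^john"));      // invalid
```

In Clojure itself you would not write this by hand; an EDN reader does the classification for you and hands back symbol, keyword and string values.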
Now the second thing to understand is: what are they for? This is more nebulous. They are for whatever you want to use them for when you model your data as EDN. You could use EDN strings or EDN keywords instead, and vice versa. Any time you have a set of characters that contains only allowed symbol characters, you could choose to use a symbol to represent them in EDN.
In general, people use keywords for keys of maps or for tagging, such as saying that the type of animal is :monkey:
{:animal-type :monkey}
And in general, string is used to represent free-form text. Text entered by a user, or needing to be displayed to a user.
{:animal-type :monkey
 :animal-name "Bruno the monkey"}
Finally, Symbols are normally used to refer to other objects within the language itself. Such as referring to a function, another piece of data, etc.
{:animal-type :monkey
 :animal-name "Bruno the monkey"
 :transform-fn animal/add-owner-info}

Related

Regex matching behavior with multi-character unicode symbol

I am having trouble understanding some behavior I observed with multi-character Unicode symbols.
Take, as an example, the string 🤚🏾🇺🇸🤚🏾🇺🇸🤚🏾 and the regex (🤚🏾|🇺🇸)(?![🏾]). I get three matches: both flags and the last hand. Expected: 5 matches, each symbol once.
Since both 🤚🏾 and 🇺🇸 are 2-character symbols, I tried writing a non-Unicode example. With the string abcdabcdab and the regex (ab|cd)(?![b]), I get the expected 5 matches, each pair of ab and cd once.
Thinking that there might be some interaction between 🤚🏾 and 🏾, I used a different Unicode character, giving me the regex (🤚🏾|🇺🇸)(?![🇹]). Here I get the same result that I got in the first example.
Since both 🇹 and 🏾 are usually not used individually, I tried using "normal" Unicode or ASCII characters instead of 🇹. In my example, I used 🐰 and a, which gave me the expected result of 5 matches, each symbol once.
Is someone able to explain this behavior, or is this a bug?
This behavior only happened in PCRE and the JavaScript regex engine. I used this site to test it: https://regex101.com/
You should not put a character that needs a surrogate pair inside a character class, as in (?![🏾]). Inside the character class it is "decomposed" into two UTF-16 code units, \uD83C and \uDFFE, and the class matches either one of them, not the sequence. The hand emoji is the sequence \uD83E\uDD1A\uD83C\uDFFE, and each flag also begins with \uD83C (🇺🇸 is \uD83C\uDDFA\uD83C\uDDF8), so the lookahead is triggered for every hand that is directly followed by a flag. That is why only the two flags and the last hand match.
To solve the problem, you just need to remove the brackets and use (🤚🏾|🇺🇸)(?!🏾), so that 🏾 is treated as a sequence of code units rather than as a choice between them.
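The difference can be reproduced in JavaScript with a short sketch. The emoji are written as UTF-16 escapes so the individual code units are visible; the variable names are only illustrative. The third pattern uses the u flag, which is not part of the answer above but is another way to make the engine treat the character class as a whole code point.

```javascript
// Emoji written as UTF-16 escapes so the code units are explicit.
const HAND = "\uD83E\uDD1A\uD83C\uDFFE"; // raised back of hand + skin tone modifier
const FLAG = "\uD83C\uDDFA\uD83C\uDDF8"; // regional indicators U + S
const text = HAND + FLAG + HAND + FLAG + HAND;

// Character class, no u flag: the class matches \uD83C OR \uDFFE as single code units,
// so any hand followed by a flag (which starts with \uD83C) is rejected.
const withClass = /(\uD83E\uDD1A\uD83C\uDFFE|\uD83C\uDDFA\uD83C\uDDF8)(?![\uD83C\uDFFE])/g;

// No brackets: the lookahead only rejects the full two-unit sequence.
const withoutClass = /(\uD83E\uDD1A\uD83C\uDFFE|\uD83C\uDDFA\uD83C\uDDF8)(?!\uD83C\uDFFE)/g;

// Same class, but the u flag makes the engine work on whole code points.
const withUnicodeFlag = /(\uD83E\uDD1A\uD83C\uDFFE|\uD83C\uDDFA\uD83C\uDDF8)(?![\uD83C\uDFFE])/gu;

const countMatches = (re) => (text.match(re) || []).length;

console.log(countMatches(withClass));       // 3
console.log(countMatches(withoutClass));    // 5
console.log(countMatches(withUnicodeFlag)); // 5
```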

Is that because Clojure is limited by JVM so this code can't evaluate?

Code that includes something like '(1+2) causes a java.lang.RuntimeException in Clojure, with the error message "Unmatched delimiter: )".
But in every other Lisp dialect I've used, such as Emacs Lisp or Racket, '(1+2) just returns a list. That is what I would expect, because with the special form quote nothing in the list should be evaluated.
So I wonder: is it a limitation of the JVM that keeps this code from behaving the way it does in other dialects? Is it a bug in Clojure? Or is the definition of quote in Clojure different from the one in other Lisp dialects?
These are artifacts of the way tokenizers are set up in different languages. In Clojure, if a token starts with a digit, it is consumed until the next reader macro character (which includes parentheses, among other things), whitespace (which includes the comma) or end of file. What is consumed must be a valid number, which means an integer, a float or a rational. So when you feed '(1+2) to the reader, it consumes 1+2 as one token, which then fails to match the integer, float or rational number patterns. After that, the reader tries to recover, which resets its state. In this state, the ) is unmatched.
Try entering '(1 + 2) instead (mind the spaces around +), and you will see exactly what you expect.
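The rule described above can be modeled with a small JavaScript sketch. This is a hypothetical, simplified model and not Clojure's actual reader: the delimiter set and the three number patterns are assumptions chosen to mirror the description, and the function name readNumberToken is made up for the example.

```javascript
// Hypothetical, simplified model of the tokenizing rule; not Clojure's real reader.
// Characters that end a token: whitespace, comma, and a few reader macro characters.
const DELIMITERS = new Set([..." \t\n\r,()[]{}\";"]);

// Simplified number shapes (no radix, hex, or BigInt/BigDecimal suffixes).
const NUMBER_PATTERNS = [
  /^[+-]?\d+$/,                        // integer
  /^[+-]?\d+\.\d*(?:[eE][+-]?\d+)?$/,  // float
  /^[+-]?\d+\/\d+$/,                   // ratio
];

// Consume from `start` up to the next delimiter, then check the token as a number.
function readNumberToken(source, start) {
  let end = start;
  while (end < source.length && !DELIMITERS.has(source[end])) end++;
  const token = source.slice(start, end);
  const valid = NUMBER_PATTERNS.some((pattern) => pattern.test(token));
  return { token, valid, end };
}

// In "(1+2)" the whole "1+2" is one token and is not a valid number.
console.log(readNumberToken("(1+2)", 1));   // { token: '1+2', valid: false, end: 4 }
// In "(1 + 2)" the space ends the token after "1".
console.log(readNumberToken("(1 + 2)", 1)); // { token: '1', valid: true, end: 2 }
```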

Distinguish among "symbol-constituent characters", "symbol-constituents", and "word constituents"

The regexp part of Emacs manual seems confusing w.r.t. the above three concepts.
I list out my interpretations of the explanations below first:
"symbol-constituents" is mutually exclusive with "word constituents";
"symbol-constituent characters" includes both "symbol-constituents" and "word constituents"
Is this understanding correct?
And below are the relevant quotes from the manual:
-quote 1:
Word constituents: ‘w’:
Parts of words in human languages. These are typically used in variable and command names in programs. All upper- and lower-case letters, and the digits, are typically word constituents.
-quote 2:
Symbol constituents: ‘_’:
Extra characters used in variable and command names along with word constituents. Examples include the characters ‘$&*+-_<>’ in Lisp mode, which may be part of a symbol name even though they are not part of English words. In standard C, the only non-word-constituent character that is valid in symbols is underscore (‘_’).
quote 1 and 2
-quote 3:
\_<:
matches the empty string, but only at the beginning of a symbol. A symbol is a sequence of one or more symbol-constituent characters. A symbol-constituent character is a character whose syntax is either ‘w’ or ‘_’. ‘\_<’ matches at the beginning of the buffer only if a symbol-constituent character follows.
quote 3
My understanding is that "symbol-constituent characters" should only be used to mean characters which are themselves symbol-constituents (and therefore, as you correctly understand, not word-constituent).
Your quote three is indeed confusing, but that wording has since been fixed. In my Emacs (from trunk, about three months ago) it reads:
`\_<'
matches the empty string, but only at the beginning of a symbol. A
symbol is a sequence of one or more word or symbol constituent
characters. `\_<' matches at the beginning of the buffer (or
string) only if a symbol-constituent character follows.
`\_>'
matches the empty string, but only at the end of a symbol. `\_>'
matches at the end of the buffer (or string) only if the contents
end with a symbol-constituent character.

What are the allowed characters in a Clojure keyword?

I am looking for a list of the allowed characters in a Clojure keyword. Specifically, I am interested to know if any of the following characters are allowed: - _ /.
I am not a Java programmer, so I would not know the underlying ramifications, if any. I don't know if a Clojure keyword is mapped to a Java keyword, if there is such a thing.
Edit:
When I initially composed this answer, I was probably a little too heavily invested in the question of "what can you get away with?" In fairness to myself though, the keyword admissibility issue appears to be unsettled still. So:
First, a little about keywords, for new readers:
Keywords come in two flavours, qualified and unqualified. Unqualified keywords, like :foo, have no namespace component. Qualified keywords look like :foo/bar where the part prior to the slash is the namespace, ostensibly. Keywords can't be referred, and can be given a non-existent namespace, so their namespace behaviour is different from other Clojure objects.
Keywords can be created either by literals to the reader, like :foo, or by the keyword function, which is (keyword name-str) or (keyword ns name).
Keywords evaluate to themselves only, unlike symbols which point to vars. Note that keywords are not symbols.
What is officially permitted?
According to the reader documentation, a single slash is permitted, no periods are permitted in the name, and all the rules for symbols apply.
What is actually permitted?
More or less anything but spaces seems to be permitted in the reader. For instance,
user> :-_./asdfgse/aser/se
:-_./asdfgse/aser/se
appears to be legal. The namespace for the above keyword is:
user> (namespace :-_./asdfgse/aser/se)
"-_./asdfgse/aser"
So the namespace appears to consist of everything prior to the last forward slash.
The keyword function is even more permissive:
user> (keyword "////+" "/////")
:////+//////
user> (namespace (keyword "////+" "/////"))
"////+"
And similarly, spaces are fine too if you use the keyword function. I'm not sure exactly what limitations are placed on Unicode characters, but the REPL doesn't appear to complain when I put in arbitrary characters.
What's likely to happen in the future:
There have been some rumblings about validating keywords as they are interned. Supposedly one of the longest-open Clojure tickets is concerned with validation of keywords. So the keyword function may cease to be so permissive in the future, though that seems to be up in the air. See the Assembla ticket and the Google Group discussion.
The "correct" answer is documented:
Symbols begin with a non-numeric character and can contain alphanumeric characters and *, +, !, -, _, and ? (other characters will be allowed eventually, but not all macro characters have been determined). '/' has special meaning, it can be used once in the middle of a symbol to separate the namespace from the name, e.g. my-namespace/foo. '/' by itself names the division function. '.' has special meaning - it can be used one or more times in the middle of a symbol to designate a fully-qualified class name, e.g. java.util.BitSet, or in namespace names. Symbols beginning or ending with '.' are reserved by Clojure. Symbols containing / or . are said to be 'qualified'. Symbols beginning or ending with ':' are reserved by Clojure. A symbol can contain one or more non-repeating ':'s.
Edit: And further with respect to keywords:
Keywords are like symbols, except:
* They can and must begin with a colon, e.g. :fred.
* They cannot contain '.' or name classes.
* A keyword that begins with two colons is resolved in the current namespace
From that list, the reader certainly allows - and _, but / has a special meaning as the delimiter between namespaces and symbol names. Period (which you didn't ask about) is problematic inside symbol names as well, since it is used in fully-qualified Java class names.
As far as Clojure idiom goes, - is your best friend in symbol names. It takes the place of camel case in Java or the underscore in Ruby.
Starting in 1.3 you can use ' anywhere except at the start of a keyword, so :arthur's-keyword is allowed now :)
I use the keywords :-P and :-D to spice up my code occasionally (as synonyms for true and false)

How to do string concatenation in gdb/ada

According to the manual, string concatenation isn't implemented in gdb. I need it however, so is there a way to achieve this, perhaps using array functions?
I don't have a copy of gdb around to try this on, but perhaps this line from later in the Ada section of the document will help you?
Rather than use catenation and
symbolic character names to introduce
special characters into strings, one
may instead use a special bracket
notation, which is also used to print
strings. A sequence of characters of
the form `["XX"]' within a string or
character literal denotes the (single)
character whose numeric encoding is XX
in hexadecimal. The sequence of
characters `["""]' also denotes a
single quotation mark in strings. For
example,
"One line.["0a"]Next line.["0a"]"
contains an ASCII newline character
(Ada.Characters.Latin_1.LF) after each
period.
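To make the bracket notation concrete, here is a small JavaScript sketch of how such a string could be decoded. It only illustrates the notation as quoted above and is not how gdb implements it; the function name decodeBracketNotation is hypothetical, and it assumes exactly two hexadecimal digits, as in the `["XX"]' form.

```javascript
// Illustrative decoder for the bracket notation quoted above; not gdb's own code.
// ["XX"] becomes the character with hexadecimal code XX, and ["""] becomes a quote.
function decodeBracketNotation(text) {
  return text.replace(
    /\["(?:(")|([0-9a-fA-F]{2}))"\]/g,
    (match, quote, hex) => (quote ? '"' : String.fromCharCode(parseInt(hex, 16)))
  );
}

// The manual's example: a newline (hex 0a) after each period.
const decoded = decodeBracketNotation('One line.["0a"]Next line.["0a"]');
console.log(JSON.stringify(decoded)); // "One line.\nNext line.\n"
```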
For Objective-C:
[@"asd" stringByAppendingString:@"zxc"]
[@"ID: " stringByAppendingString:(NSString*) [aTaskDict valueForKey:@"ID"]]