character 0: character set expected - regex

I want to define a table name by regular expression defined here such that:
Always begin a name with a letter, an underscore character (_), or a
backslash (\). Use letters, numbers, periods, and underscore
characters for the rest of the name.
Exceptions: You can’t use "C", "c", "R", or "r" for the name, because
they’re already designated as a shortcut for selecting the column or
row for the active cell when you enter them in the Name or Go To box.
let lex_valid_characters_0 = ['a'-'z' 'A'-'Z' '_' '\x5C'] ['a'-'z' 'A'-'Z' '0'-'9' '.' '_']+
let haha = ['C' 'c' 'R' 'r']
let lex_table_name = lex_valid_characters_0 # haha
But it gives me the error "character 0: character set expected". Could anyone help?

Here is the description of # from the manual:
regexp1 # regexp2
(difference of character sets) Regular expressions regexp1 and regexp2 must be character sets defined with [… ] (or a single character expression or underscore _). Match the difference of the two specified character sets.
The description says the two sets must be character sets defined with [ ... ] but your definition of lex_valid_characters_0 is far more complex than that.
The idea of # is that it defines a pattern that matches exactly one character from a set specified as the difference of two one-character patterns. So it doesn't make sense to apply it to lex_valid_characters_0, which matches strings of arbitrary length.
Update
Here is my thinking on the problem, for what it's worth. There are no extra restrictions on names that are 2 or more characters long (as I read the spec). So it shouldn't be too difficult to specify a regular expression for these names. And it also wouldn't be that hard to come up with a regular expression that defines all the valid 1-character names. The full set of names is the union of these two sets.
You could also use the fact that the longest, first match is the one that applies for ocamllex. I.e., you could have rules for the 4 special cases before the general rule.
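To make the union idea concrete, here is a minimal sketch written with Go's regexp package rather than ocamllex, so the syntax differs from a lexer definition. The helper name isValidTableName and the sample inputs are made up for illustration. The first alternative covers names of two or more characters, the second covers one-character names with C, c, R and r left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// tableName is the union of two alternatives:
//   - a start character followed by one or more "rest" characters (length >= 2)
//   - a single start character other than C, c, R, r (length 1)
var tableName = regexp.MustCompile(`^(?:[A-Za-z_\\][A-Za-z0-9._]+|[ABD-QS-Zabd-qs-z_\\])$`)

// isValidTableName is a hypothetical helper, not part of any library.
func isValidTableName(s string) bool {
	return tableName.MatchString(s)
}

func main() {
	for _, s := range []string{"Table1", "x", "C", "r", "Cc", "1abc"} {
		fmt.Printf("%q -> %v\n", s, isValidTableName(s))
	}
}
```

In ocamllex the same two alternatives could be joined with | in a single let definition, which avoids the # operator entirely.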

Related

Regex for string representation of a method call

I have a string that follows a specific pattern like so
operator(field,value)
and I'd like to use regex to extract out all three of operator, field and value. I'm struggling to come up with the syntax for how to capture these. In this case value can be alphanumeric as well, for example
"contains(name, Joe)"
or "lt(quantity, 2.5)"
Use something like this to capture the groups. You may want to limit the accepted characters with []. Note the backtick-quoted raw string and the \ escaping of the literal ( and ) within the regexp:
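package main
import ("fmt"; "regexp")
// tests holds the sample inputs that produce the output shown below.
var tests = []string{"contains(field, value)", "contains(name, Joe)", "lt(quantity, 2.5)", "plus(no,44)"}
func main() {
	// Three capture groups: operator, field, value; \( and \) match literal parentheses.
	re := regexp.MustCompile(`(.+)\((.+),\s?(.+)\)`)
	for _, t := range tests {
		fmt.Println("result", re.FindStringSubmatch(t))
	}
}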
https://play.golang.org/p/43YLTafgQt
output:
result [contains(field, value) contains field value]
result [contains(name, Joe) contains name Joe]
result [lt(quantity, 2.5) lt quantity 2.5]
result [plus(no,44) plus no 44]
Depending on how strict you want to be you could use [a-z]+ or similar instead of .+ to match only certain characters but if you are not worried about bogus values this would probably be fine.
I don't know golang, but I do know regex's, so I'll do what I can here.
You probably want a group each for the "operator", "field", and "value". I'm going to assume for now that each of these can be represented as any combination of alphabetic, numeric, or underscore characters, with length of at least one character. In regex, we have a shortcut for that: \w represents a single alpha-numeric or underscore character, and the + modifier means "one or more". So \w+ means one or more such character in a row. If you want a more complex definition of what these fields can be named, I'll let you specify that in your question.
You say that you want to support "operator(field,value)". I'll start without whitespace anywhere, because it's simpler and you can easily remove all whitespace yourself before running the regex. We'll later add some whitespace support to the regex if you want it, but it'll make life difficult.
To do this, we want three groups, "1(2,3)" where 1 is the operator name, 2 is the field name, and 3 is the value name. Each of these, as given above, will be \w+ in our regex. We'll want to match the open and close parentheses as well as the comma, but we'll throw them away because they're really just delimiters. The parentheses will need to be escaped in the regex, since regex's have a special meaning for parentheses. The result looks like:
(\w+)\((\w+),(\w+)\)
\ 1 /  \ 2 / \ 3 /
Where the second line shows you where the groups are each defined.
If you want to support some whitespace, you'll need to add \s* in all such locations. This gets hairy, but you can do it as such:
(\w+)\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)
\ 1 /        \ 2 /       \ 3 /
You give an example of wanting to support floating point values, and I presume other kinds of values too. You can accomplish this using the "or" pipe, |. For example, group 3, instead of just being \w+, could be defined as
[a-zA-Z_]\w*|\d+\.?|\d*\.\d+
This string will support alphanumeric+underscore strings where the first character must be alphabetic or underscore, OR integers, OR floating point (defined as an integer string with a period at the beginning, middle, or end). Clearly, this can go on and on to support more complex string values, but you get the idea.
So the final regex might look like:
(\w+)\s*\(\s*(\w+)\s*,\s*([a-zA-Z_]\w*|\d+\.?|\d*\.\d+)\s*\)
Sorry for not giving any golang help, I hope someone else can edit my answer and fill in that major gap.
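To fill in the Go side, here is a minimal sketch that feeds the final regex to Go's regexp package. The helper name parseCall is made up for illustration, and the ^ and $ anchors are an addition so that the whole input has to match. Go's \w, \s and \d are ASCII-only, which matches the assumptions made above.

```go
package main

import (
	"fmt"
	"regexp"
)

// callPattern is the final regex from the answer, anchored with ^ and $
// (the anchors are an addition, not part of the original pattern).
var callPattern = regexp.MustCompile(`^(\w+)\s*\(\s*(\w+)\s*,\s*([a-zA-Z_]\w*|\d+\.?|\d*\.\d+)\s*\)$`)

// parseCall is a hypothetical helper: it returns the operator, field and
// value, and ok=false when the input does not match.
func parseCall(s string) (op, field, value string, ok bool) {
	m := callPattern.FindStringSubmatch(s)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

func main() {
	for _, s := range []string{"contains(name, Joe)", "lt(quantity, 2.5)", "plus(no,44)"} {
		op, field, value, ok := parseCall(s)
		fmt.Println(op, field, value, ok)
	}
}
```

FindStringSubmatch returns nil when nothing matches, so checking for nil before indexing avoids a panic on malformed input.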

Issues with Regular Expressions

I understand the concepts of repetition 0 or more times (*) and grouping '()' on their own, but I'm having trouble understanding them in the practice examples.
For example, (yes)* contains both the empty string and the word 'yes', but not y or ss. I assume that it doesn't contain those words because of grouping, but would that mean the word 'yesyes' is also valid, as the group has been repeated?
In contrast, I assume with the Regular Expression 'yes*', any character can be repeated. For example 'y', 'ye' 'es' 'yes', 'yy'. However the solutions we have been provided with state that the word 'y' isn't contained. I'm confused.
Your understanding of (yes)* is correct ...
(yes)* matches the string "yes" (exactly - no shorter, no longer) 0 or more times - i.e. the empty string, or yes, yesyes, yesyesyesyesyesyes, etc.
But your understanding of yes* is NOT correct ...
yes* matches the string "ye" followed by 0 or more "s" characters - i.e. ye, yes, yess, yessssssss
The "zero or more" * modifier applies only to the character or group immediately preceding it.
In the first example, we have the group (yes)* - this will match '', 'yes', 'yesyes', etc.
In the second example, yes*, the modifier applies only to the letter s. It will match 'ye', 'yes', 'yess', etc.
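The difference can be checked directly. Below is a minimal sketch using Go's regexp package; the variable names and sample strings are chosen for illustration, and both patterns are anchored with ^ and $ so that the whole string has to match.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are anchored so that the whole string has to match.
var (
	grouped   = regexp.MustCompile(`^(yes)*$`) // * applies to the group "yes"
	ungrouped = regexp.MustCompile(`^yes*$`)   // * applies only to the letter "s"
)

func main() {
	for _, s := range []string{"", "y", "ye", "yes", "yess", "yesyes"} {
		fmt.Printf("%q  (yes)*: %v  yes*: %v\n", s, grouped.MatchString(s), ungrouped.MatchString(s))
	}
}
```

Of the sample strings, only "yes" is accepted by both patterns; "y" is accepted by neither, because yes* always requires the leading "ye".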
If this is not clear then perhaps you can elaborate a little on the source of your confusion.

RegEx for company name variations

I have a requirement to accept string values ONLY where they meet the following criteria :
1) Can start with special character if required
2) Must start with capital letter ( Even if the first character is a special character )
3) The string value must not have 2 special characters in a row ( consecutive )
4) The string value must not have 2 spaces in a row ( consecutive )
5) Accented characters are allowed ( eg: Faddas )
6) Enclosed values at the start of the string or at the end are valid but must be inside parenthesis ( ie: (Ltd) )
7) Numerics are allowed anywhere in the string value
I have the following regex value : ^(\(([^)]+)\))?[\#\#\$\%\&\*\(\)\-\_\+\]\[\'\;\:\?\.\,\!]?\p{Lu}+[\s'-]?\p{L}+(?:[\s'-]\p{L}+)+(\(([^)]+)\))*$
This works ok for the following tested values :
Éast-Shipping-ltd
Éast-Shipping(LTD)
But fails the next example :
Éast-123Shipping(LTD)
Is there any way to allow numerics mid-string?
I have tried [0-9] variations, [A-Za-z09] variations and \p{N} variations, but to no avail.
Many thanks for your time.
This is a REALLY nasty pattern, but I was able to simplify it a bit and do what you wanted:
^(\(([^)]+)\))?[[:punct:]]?\p{Lu}+(?:[\s'-]?[\p{L}\d]+)+(\(([^)]+)\))*$
There are lots of useful shorthand character classes, including [[:punct:]], which I used to replace your massive punctuation character class. To allow numbers, I put \p{L} in a character class together with the \d token, which matches any digit (in any script, when the Unicode flag is set).
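As a quick check, here is a minimal sketch that runs the simplified pattern against the question's sample values using Go's regexp package. The helper name isCompanyName and the two rejected inputs are made up for illustration. One assumption to be aware of: in Go, \d and [[:punct:]] are ASCII-only, while \p{Lu} and \p{L} are Unicode-aware.

```go
package main

import (
	"fmt"
	"regexp"
)

// companyName is the simplified pattern from the answer, unchanged.
var companyName = regexp.MustCompile(`^(\(([^)]+)\))?[[:punct:]]?\p{Lu}+(?:[\s'-]?[\p{L}\d]+)+(\(([^)]+)\))*$`)

// isCompanyName is a hypothetical helper, not part of any library.
func isCompanyName(s string) bool {
	return companyName.MatchString(s)
}

func main() {
	samples := []string{
		"Éast-Shipping-ltd",
		"Éast-Shipping(LTD)",
		"Éast-123Shipping(LTD)", // digits mid-string
		"east-Shipping",         // no leading capital
		"Éast--Shipping",        // two special characters in a row
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, isCompanyName(s))
	}
}
```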
Here we have some characters acceptable for company names
^[0-9A-Za-zÀ-ÿ\s,._+;()*~'##!?&-]+$

XSL - Remove non breaking space

In my XSL implementation (2.0), I tried using the statement below to remove all spaces and non-breaking spaces within a text node. It works for ordinary spaces but not for non-breaking spaces and the other Unicode space characters. I am using the Saxon processor for execution.
Current XSL code:
translate(normalize-space($text-nodes[1]), ' ', '')
How can I have them removed? Please share your thoughts.
Those codes are Unicode, not ASCII (for the most part), so you should probably use the replace function with a regex containing the Unicode separator character class:
replace($text-nodes[1], '\p{Z}+', '')
In more detail:
The regex \p{Z}+ matches one or more characters that are in the "separator" category in Unicode. \p{} is the category escape sequence, which matches a single character in the category specified within the curly braces. Z specifies the "separator" category (which includes various kinds of whitespace). + means "match the preceding regex one or more times". The replace function returns a version of its first argument with all non-overlapping substrings matching its second argument replaced with its third argument. So this returns a version of $text-nodes[1] with all sequences of separator characters replaced with the empty string, i.e. removed.
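The behaviour of \p{Z}+ can be tried outside XSLT as well. The following is a minimal sketch using Go's regexp package, which also supports the \p{Z} category escape; it is not Saxon or XPath code, and the helper name stripSeparators and the sample string are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// separators matches one or more characters in the Unicode "separator" category.
var separators = regexp.MustCompile(`\p{Z}+`)

// stripSeparators is a hypothetical helper that removes every run of
// separator characters, mirroring replace($text, '\p{Z}+', '').
func stripSeparators(s string) string {
	return separators.ReplaceAllString(s, "")
}

func main() {
	// U+00A0 is the non-breaking space, U+2003 is the em space.
	in := "a\u00a0b\u2003c d"
	fmt.Printf("%q -> %q\n", in, stripSeparators(in))
}
```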

A regular expression about folder name

I want a regular expression to test that a string meets the following rules:
It must not begin with a dot (.).
It must not end with a dot (.).
It must not include special characters like !##$%^&*, but it can include a dot (.).
It must not include two dots (..) side by side.
Sample valid input:
na.me (single dot in middle)
Sample invalid input:
.name (begins with dot)
name. (ends with dot)
na..me (includes two dots side-by-side)
$name (special character not allowed in any position)
name# (likewise)
na#me (likewise)
I believe this should work:
^(\w+\.?)*\w+$
If not in ECMAScript, then replace the \w's with [a-zA-Z_0-9].
The approach here is that instead of citing what's NOT acceptable, it's easier to cite what's acceptable.
Translation of the expression is:
Start with one or more word characters (letters, digits, or underscore), optionally followed by one period (.)
That whole group can occur zero or more times
End with at least one word character
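The expression can be checked against the sample inputs from the question. This is a minimal sketch using Go's regexp package, where \w is the ASCII class [0-9A-Za-z_]; the helper name isValidFolderName is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// folderName is the expression from the answer, unchanged.
var folderName = regexp.MustCompile(`^(\w+\.?)*\w+$`)

// isValidFolderName is a hypothetical helper, not part of any library.
func isValidFolderName(s string) bool {
	return folderName.MatchString(s)
}

func main() {
	samples := []string{"na.me", ".name", "name.", "na..me", "$name", "name#", "na#me"}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, isValidFolderName(s))
	}
}
```

Only na.me is accepted; each of the other samples breaks one of the rules and is rejected.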