Distinguish among "symbol-constituent characters", "symbol-constituents", and "word constituents"

The regexp part of the Emacs manual seems confusing with respect to the above three concepts.
My interpretation of the explanations quoted below is:
"symbol-constituents" are mutually exclusive with "word constituents";
"symbol-constituent characters" include both "symbol-constituents" and "word constituents".
Is this understanding correct?
Below are the relevant quotes from the manual:
-quote 1:
Word constituents: ‘w’:
Parts of words in human languages. These are typically used in variable and command names in programs. All upper- and lower-case letters, and the digits, are typically word constituents.
-quote 2:
Symbol constituents: ‘_’:
Extra characters used in variable and command names along with word constituents. Examples include the characters ‘$&*+-_<>’ in Lisp mode, which may be part of a symbol name even though they are not part of English words. In standard C, the only non-word-constituent character that is valid in symbols is underscore (‘_’).
-quote 3:
\_<:
matches the empty string, but only at the beginning of a symbol. A symbol is a sequence of one or more symbol-constituent characters. A symbol-constituent character is a character whose syntax is either ‘w’ or ‘_’. ‘\_<’ matches at the beginning of the buffer only if a symbol-constituent character follows.

My understanding is that "symbol-constituent characters" should only be used to mean characters which are themselves symbol-constituents (and therefore, as you correctly understand, not word-constituent).
Your quote three is indeed confusing, but that wording has since been fixed. In my Emacs (from trunk, about three months ago) it reads:
`\_<'
matches the empty string, but only at the beginning of a symbol. A
symbol is a sequence of one or more word or symbol constituent
characters. `\_<' matches at the beginning of the buffer (or
string) only if a symbol-constituent character follows.
`\_>'
matches the empty string, but only at the end of a symbol. `\_>'
matches at the end of the buffer (or string) only if the contents
end with a symbol-constituent character.
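
To make the distinction concrete, here is a rough Python sketch that emulates \_< and \_> with lookarounds. The two character sets are assumptions loosely modelled on the Lisp-mode examples quoted above, not Emacs's actual syntax table, and the helper name count_as_symbol is hypothetical.

```python
import re

# Assumed character sets, loosely modelled on Lisp mode (not Emacs's real syntax table).
WORD_CHARS = "A-Za-z0-9"   # syntax class 'w' (word constituents)
SYMBOL_CHARS = "$&*+<>_-"  # syntax class '_' (symbol constituents); '-' last so it stays literal

# A symbol is a run of characters taken from EITHER class.
IN_SYMBOL = f"[{WORD_CHARS}{SYMBOL_CHARS}]"
SYMBOL_START = f"(?<!{IN_SYMBOL})(?={IN_SYMBOL})"  # rough stand-in for \_<
SYMBOL_END = f"(?<={IN_SYMBOL})(?!{IN_SYMBOL})"    # rough stand-in for \_>


def count_as_symbol(name: str, text: str) -> int:
    """Count occurrences of `name` that form a whole symbol (hypothetical helper)."""
    pattern = SYMBOL_START + re.escape(name) + SYMBOL_END
    return len(re.findall(pattern, text))


text = "(foo-bar foo +foo+ foo)"
symbol_hits = count_as_symbol("foo", text)     # only the two free-standing foo's
word_hits = len(re.findall(r"\bfoo\b", text))  # word boundaries also hit foo-bar and +foo+
print(symbol_hits, word_hits)  # prints: 2 4
```

The word-boundary search finds foo inside foo-bar and +foo+ because - and + are not word constituents, while the symbol-boundary search treats them as part of the surrounding symbol.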

C++ regex for properly matching strings that contain c-style escape characters (ECMAScript style, no look-behind)

I'm a regex noob attempting to match either the contents or the entirety of a quoted segment of text without breaking on escaped quotation marks.
Put another way, I need a regex that, between two quotation marks, will match all characters that are not quotation marks, and also any quotation mark that has an odd number of consecutive backslashes preceding it. It has to be an odd number of backslashes because a pair of backslashes escapes to a single backslash.
I've successfully created a regex that does this, but it relies on look-behind. Because this project is in C++, and the regex implementation of standard C++ does not have look-behind functionality, I could not use said regex.
Here is the regex with look-behind that I came up with: "(((?<!\\)(\\\\)*\\"|[^"])*)"
The following text should produce 8 matches:
"Woah. Look. A tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed."
Here's a link to my (probably naive) look-behind version of this example: https://regex101.com/r/uOxqWl/1
If this is impossible to do without look-behind, please let me know.
Also, if there is a well-regarded C++ regex library that allows regex look-behind, please let me know (It doesn't have to be ECMAScript, though I would slightly prefer that).

Let's derive a garden variety regular expression for C-style strings from an English description.
A string is a quotation mark, followed by a sequence of string-characters, followed by another quotation mark.
std::regex stringMatcher ( R"("<string-character>*")" );
Obviously this doesn't work as we didn't define the string-character yet. We can do so piece by piece.
Firstly, a string character could be any character except a quotation mark and a backslash.
R"([^\\"])"
Secondly, a string character could be an escape sequence consisting of a backslash and a single other character from a fixed set.
R"(\\[abfnrtv'"\\?])"
Thirdly, it can be an octal escape sequence that consists of a backslash and three octal digits
R"(\\[0-7][0-7][0-7])"
(We simplify here a bit because the real C standard allows 1, 2 or 3 octal digits. This is easy to add.)
Fourthly, it can be a hexadecimal escape sequence that consists of a backslash, a letter x, and a hexadecimal number. The range of the number is implementation-defined, so we need to accept any number of hexadecimal digits (at least one).
R"(\\x[0-9a-fA-F][0-9a-fA-F]*)"
We omit universal character names; they could be added in exactly the same way. There are none in the given test example.
So, to bring this all together:
std::regex stringMatcher ( R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")" );
// collapsed the leading backslashes of all the escape sequence types together
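
As a quick sanity check, the same pattern can be run against the question's sample text. The sketch below uses Python's re module instead of std::regex (an assumption made only to keep the example short and self-contained); the pattern uses no look-behind, but its behaviour under std::regex is not verified here.

```python
import re

# Same structure as the combined pattern: a quote, any number of
# string-characters (plain characters or escape sequences), a closing quote.
STRING_RE = re.compile(
    r'''"(?:[^\\"]|\\(?:[abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*"'''
)

# The sample text from the question, kept verbatim in a raw string.
SAMPLE = r'''"Woah. Look. A tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed."
'''

# The pattern has no capturing groups, so findall returns whole matches.
matches = STRING_RE.findall(SAMPLE)
print(len(matches))
for m in matches:
    print(repr(m))
```

Because the negated class [^\\"] also matches newlines, the multi-line string is handled without any extra flags.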

RegEx: Non-repeating patterns?

I'm wrestling with how to write a specific regex, and thought I'd come here for a little guidance.
What I'm looking for is an expression that does the following:
Character length of 7 or more
Any single character is one of four patterns (uppercase letters, lowercase letters, numbers and a specific set of special characters. Let's say #$%#).
(Now, here's where I'm having problems):
Another single character would also match with one of the patterns described above EXCEPT for the pattern that was already matched. So, if the first pattern matched is an uppercase letter, the second character match should be a lowercase letter, number or special character from the pattern.
To give you an example, the string AAAAAA# would match, as would the string AAAAAAa. However, the string AAAAAAA would not match, nor would the string AAAAAA& (as the ampersand was not part of the special character pattern).
Any ideas? Thanks!

If you only need two different kinds of characters, you can use the possessive quantifier feature (available in Objective C):
^(?:[a-z]++|[A-Z]++|[0-9]++|[#$%#]++)[a-zA-Z0-9#$%#]+$
or more concise with an atomic group:
^(?>[a-z]+|[A-Z]+|[0-9]+|[#$%#]+)[a-zA-Z0-9#$%#]+$
Since each branch of the alternation is a character class with a possessive quantifier, you can be sure that the first character matched by [a-zA-Z0-9#$%#]+ is from a different class.
As for the string size, check it separately first with the appropriate length function. If the string is too short, you avoid the cost of a regex check altogether.

First you need to do a negative lookahead to make sure the entire string doesn't consist of characters from a single group:
(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)
Then check that it does contain at least 7 characters from the list of legal characters (and nothing else):
^[a-zA-Z0-9#$%#]{7,}$
Combining them (thanks to Shlomo for pointing that out):
^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$
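
A small sketch of the combined pattern checked against the question's own examples, using Python's re module (an assumption; the question does not name a regex flavor). The set #$%# is the question's placeholder for the allowed special characters, and is_valid is a hypothetical helper name.

```python
import re

# Negative lookahead rejects strings drawn from a single group;
# the rest requires 7 or more characters from the allowed set.
PATTERN = re.compile(r"^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$")


def is_valid(candidate: str) -> bool:
    """Return True if the candidate satisfies the combined pattern."""
    return PATTERN.match(candidate) is not None


for candidate in ["AAAAAA#", "AAAAAAa", "AAAAAAA", "AAAAAA&", "Aa1#"]:
    print(candidate, is_valid(candidate))
```

The first two candidates are accepted. AAAAAAA fails the lookahead, AAAAAA& contains a character outside the allowed set, and Aa1# is too short.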

Why do escape characters in regex mismatch?

If I want to match the dot symbol (.) I have to write this regex:
/\./
Escape character is needed to match the symbol itself.
If I want to match the 'd' symbol I have to write this one:
/d/
Escape character is not needed to match the symbol itself.
And if I want to match any character (/./) or any digit character (/\d/) it's vice versa.
It seems to me that this approach is not very consistent. What is the reasoning that stands behind it?
Thank you.

The . character is a reserved regular expression metacharacter; d isn't. You need to include the escape character when you match a period to explicitly tell the regex engine that you want the period treated as a normal, literal character. d by itself isn't reserved, so you don't need to escape it, but \d is a reserved sequence (it matches any digit).
I can see how this can look a little odd to someone coming to regex, but the . wildcard is used so often (and I can't think of a time I've really needed to match periods) that it just makes more sense to have it be one character without the backslash.
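
The four cases from the question, side by side, as a minimal sketch in Python's re module (an assumption; the question's /.../ literals suggest another flavor, but these four patterns behave the same way in most engines):

```python
import re

text = "d.1"

literal_dot = re.findall(r"\.", text)  # escaped: only the period itself
any_char = re.findall(r".", text)      # unescaped: any character (except newline)
literal_d = re.findall(r"d", text)     # unescaped: only the letter d
any_digit = re.findall(r"\d", text)    # escaped: any digit

print(literal_dot, any_char, literal_d, any_digit)
# prints: ['.'] ['d', '.', '1'] ['d'] ['1']
```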

How to include special chars in this regex

First of all I am a total noob to regular expressions, so this may be optimized further, and if so, please tell me what to do. Anyway, after reading several articles about regex, I wrote a little regex for my password matching needs:
(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(^[A-Z]+[a-z0-9]).{8,20}
What I am trying to do is: it must start with an uppercase letter, must contain a lowercase letter, must contain at least one number, must contain at least one special character, and must be between 8 and 20 characters in length.
The above somehow works, but it doesn't force special chars (. seems to match any character, but I don't know how to use it with the positive lookahead) and the min length seems to be 10 instead of 8. What am I doing wrong?
PS: I am using http://gskinner.com/RegExr/ to test this.

Let's strip away the assertions and just look at your base pattern alone:
(^[A-Z]+[a-z0-9]).{8,20}
This will match one or more uppercase Latin letters, followed by a single lowercase Latin letter or decimal digit, followed by 8 to 20 of any character. So yes, at minimum this will require 10 characters, but there's no maximum number of characters it will match (e.g. it will allow 100 uppercase letters at the start of the string). Furthermore, since there's no end anchor ($), this pattern would allow any trailing characters after the matched substring.
I'd recommend a pattern like this:
^(?=.*[a-z])(?=.*[0-9])(?=.*[!##$])[A-Z][A-Za-z0-9!##$]{7,19}$
Where !##$ is a placeholder for whatever special characters you want to allow. Don't forget to escape special characters if necessary (\, ], ^ at the beginning of the character class, and - in the middle).
Using POSIX character classes, it might look like this:
^(?=.*[[:lower:]])(?=.*[[:digit:]])(?=.*[[:punct:]])[[:upper:]][[:alnum:][:punct:]]{7,19}$
Or using Unicode character classes, it might look like this:
^(?=.*[\p{Ll}])(?=.*\d)(?=.*[\p{P}\p{S}])[\p{Lu}][\p{L}\d\p{P}\p{S}]{7,19}$
Note: each of these considers a different set of 'special characters', so they aren't identical to the first pattern.

The following should work:
^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$
I removed the (?=.*[A-Z]) because the requirement that you must start with an uppercase character already covers that. I added (?=.*[^a-zA-Z0-9]) for the special characters; this will only match if there is at least one character that is not a letter or a digit. I also tweaked the length checking a little bit. The first step was to remove the + after the [A-Z] so that we know exactly one character has been matched so far; the second was to change the .{8,20} to .{7,19} (we can only match between 7 and 19 more characters if we already matched 1).
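
A short sketch exercising this pattern at the boundaries, written with Python's re module (an assumption; the question was tested in RegExr). The helper name is_valid_password and the sample passwords are made up for illustration.

```python
import re

PASSWORD_RE = re.compile(r"^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$")


def is_valid_password(candidate: str) -> bool:
    """Return True if the candidate satisfies the pattern above."""
    return PASSWORD_RE.match(candidate) is not None


samples = [
    "Passw0rd!",            # valid: 9 characters, all four requirements met
    "passw0rd!",            # invalid: does not start with an uppercase letter
    "Password1",            # invalid: no special character
    "Pa1!",                 # invalid: too short
    "Pa1!" + "x" * 16,      # valid: exactly 20 characters
    "Pa1!" + "x" * 17,      # invalid: 21 characters
]
for sample in samples:
    print(len(sample), sample, is_valid_password(sample))
```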

Well, here is how I would write it, if I had such requirements - excepting situations where it's absolutely not possible or practical, I prefer to break up complex regular expressions. Note that this is English-specific, so a Unicode or POSIX character class (where supported) may make more sense:
/^[A-Z]/ && /[a-z]/ && /[0-9]/ && /[whatever special]/ && ofCorrectLength(x)
That is, I would avoid trying to incorporate all the rules at once.
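
The same idea as a minimal Python sketch, with one simple check per rule. The SPECIALS set is a placeholder (substitute whatever characters you actually allow), and check_password is a hypothetical name.

```python
import re

SPECIALS = "!#$"  # placeholder set of allowed special characters (assumption)


def check_password(pw: str) -> bool:
    """Apply each rule as its own simple check instead of one large regex."""
    return bool(
        re.search(r"^[A-Z]", pw)                        # starts with an uppercase letter
        and re.search(r"[a-z]", pw)                     # contains a lowercase letter
        and re.search(r"[0-9]", pw)                     # contains a digit
        and re.search(f"[{re.escape(SPECIALS)}]", pw)   # contains a special character
        and 8 <= len(pw) <= 20                          # length is within bounds
    )


print(check_password("Passw0rd!"), check_password("Passw0rd"))
```

Each rule can then be reported or changed on its own, at the cost of a few more lines than the single-pattern version.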

How do dashes work in regex?

I'm curious about the algorithm for deciding which characters to include in a regex when using a -...
Example: [a-zA-Z0-9]
This matches any letter of either case, a through z, and the digits 0 through 9.
I had originally thought that they were used sort of like macros, for example, a-z translates to a,b,c,d,e etc. But then I saw the following in an open source project,
text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')
my paradigm on regexes has changed entirely, because these are not your typical characters, so how the heck did this work correctly, I thought to myself.
My theory is that the - literally means
Any ASCII value between the left character, and the right character. (e.g. a-z [97-122])
Could anybody confirm if my theory is correct? Does the regex pattern in fact calculate using the character codes, between any two characters?
Furthermore, if it IS correct, could you perform a regex match like,
A-z
because A is 65, and z is 122 so theoretically, it should also match all characters between those values.

From MSDN - Character Classes in Regular Expressions (bold is mine):
The syntax for specifying a range of characters is as follows:
[firstCharacter-lastCharacter]
where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.
So your assumption is correct, but the effect is, in fact, wider: Unicode character codes, not just ASCII.

Both of your assumptions are correct. (Therefore, technically you could do [#-~] and it would still be valid, matching uppercase letters, lowercase letters, numbers, and certain symbols.)
You can also do this with Unicode, like [\u0000-\u1000].
You should not do [A-z], however, because there are some characters between the uppercase and lowercase letters (specifically [, \, ], ^, _, `).
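
A quick way to see the code-point behaviour for yourself, sketched with Python's re module (an assumption; the tr call in the question is Ruby, but the range rule illustrated here is the same idea):

```python
import re

# A range covers every code point between its endpoints, inclusive,
# so [A-z] also picks up the punctuation that sits between 'Z' and 'a'.
a_to_z = re.findall(r"[A-z]", "A[\\]^_`z{")
print(a_to_z)  # '{' (code 123) falls just outside the range

# The circled capital letters are contiguous too, which is why Ⓐ-Ⓩ works.
circled_span = ord("Ⓩ") - ord("Ⓐ")
circled_hits = re.findall(r"[Ⓐ-Ⓩ]", "ⒶⒷⓐⓏ")
print(circled_span, circled_hits)  # the lowercase ⓐ lies outside Ⓐ-Ⓩ
```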
