Regular expression with $> symbol - regex

/\/\*[ \t]*\./ /import/i /[ \t\w\/\.\=\-;\[\]\$>"']+\*\/[ \t]*[\n\r]{1,2}/
In the above regular expression, I don't understand what [ \t\w\/\.\=\-;\[\]\$>"']+ means
or what kind of data it is meant to handle.
Can anyone please explain it with some example data?

Your characters are inside a character class, which means:
[ \t\w\/\.\=\-;\[\]\$>"']+
Any character of:
' ', \t (tab)
word characters (a-z, A-Z, 0-9, _)
\/, \., \=, \-, ;, \[, \], \$, >, ", '
(1 or more times)
In a regular expression, characters that should be interpreted literally rather than as metacharacters can be escaped by preceding them with a backslash (\). So if you want to use any of these characters as a literal, escape it with a backslash.
For PCRE, and most other Perl-compatible flavors, escape these inside of character classes:
^]\-
And escape these outside of character classes:
^.*+?$|()[{\
Note: The hyphen does not need to be escaped if it is the first or last character inside the character class, because it cannot form a range there.
So basically, this could be simplified to the following.
[ \t\w\/.=;[\]$>"'-]+
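As a rough illustration of what this class accepts, here is a small Python sketch. The sample strings, the file path, and the way the three parts of the original expression are joined into one pattern are assumptions made for the example, since the question does not show the real data. Python's re module is Perl-like rather than PCRE, but it treats these escapes the same way.

```python
import re

# The class exactly as written in the question, and the simplified form.
ORIGINAL = re.compile(r"""[ \t\w\/\.\=\-;\[\]\$>"']+""")
SIMPLIFIED = re.compile(r"""[ \t\w\/.=;[\]$>"'-]+""")

# Hypothetical text that uses only characters from the class.
sample = """ path/to/file-1.js; $opts["k"]='v' > out """
both_accept = bool(ORIGINAL.fullmatch(sample)) and bool(SIMPLIFIED.fullmatch(sample))

# A character outside the class, such as "(", ends the run.
prefix = ORIGINAL.match("abc(def").group()

# Assumption: the three parts are applied back to back, "import" ignoring case.
FULL = re.compile(
    r"/\*[ \t]*\."                       # "/*", optional blanks, a dot
    r"(?i:import)"                       # the word "import", any case
    r"""[ \t\w\/\.\=\-;\[\]\$>"']+"""    # the class under discussion
    r"\*/[ \t]*[\n\r]{1,2}"              # "*/", optional blanks, line break(s)
)
comment = "/* .IMPORT path/to/file-1.js; */\n"
full_match = FULL.fullmatch(comment) is not None
```

Under these assumptions the expression would recognise a comment-style import directive, with the class covering the path and any simple options up to the closing */.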

To escape a character means to not use its common role, but its special role (if it has one).
For example, the common role for the letter "w" is a simple character "w" inside or outside a character class.
If the character "w" is escaped by putting a \ character before it, \w will have a special role and means any "word" character (letters, digits and _ character) inside or outside a character class.
The common role for the character "]" is not the simple character "]", but it has the role of ending a character class.
If the character "]" is escaped by placing a \ before it, \] will have a special role: this time it means a simple "]" character, inside or outside a character class.
Outside a character class some characters like "$", "*", "?", "+" have another role than their simple character values, so when you want to specify a plus sign for example, you need to escape it as \+ because otherwise its common role will be to mean "the previous character one or more times".
Inside a character class however, some of the characters are always used as common characters, so they don't need to be escaped. So for example you don't need to use \= \* \+ \? inside a character class, but only = * + ?.
Inside a character class you need to escape however some characters like "]" because otherwise it will mean the end of the character class.
You also need to escape the character "-" because otherwise it will not be treated as a simple dash, but it will create a range between previous and next characters.
The alternative is to always place the "-" character as the first or the last character in the character class, and in that case it doesn't need to be escaped.
It may look complicated, but actually it is not.
You need to think logically. What happens if you don't escape the "+" character when it appears in a character class? Can it mean that the previous character may appear one or more times? Such a thing would make no sense in a character class, so you don't need to escape it. The "=" character doesn't have any special role inside or outside a character class, so you don't need to escape it either.
The simple dot "." outside a character class means any character (but not \n unless the /s modifier is used), but in a character class its common meaning is a simple dot (.), so you don't need to escape it either.
These are not all details regarding the common and special meanings of all characters but I gave them only as examples to show what escaping means.
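The roles described above can be checked with a few one-line patterns. This is only a sketch in Python's re module, which follows the same conventions for these particular characters; the helper name full is made up for the example.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    # "w" is an ordinary letter; "\w" is any word character.
    "plain w": full(r"w", "w") and not full(r"w", "x"),
    "escaped w": full(r"\w", "x") and full(r"\w", "_"),
    # Outside a class "+" is a quantifier; "\+" is a literal plus.
    "quantifier +": full(r"a+", "aaa"),
    "literal +": full(r"a\+", "a+") and not full(r"a\+", "aa"),
    # Inside a class "+" and "." are already literal.
    "class +": full(r"[a+]", "+"),
    "class .": full(r"[.]", ".") and not full(r"[.]", "x"),
    # "-" between two characters makes a range unless escaped or placed last.
    "range": full(r"[a-c]", "b") and not full(r"[a-c]", "-"),
    "escaped -": full(r"[a\-c]", "-") and not full(r"[a\-c]", "b"),
    "last -": full(r"[ac-]", "-"),
    # "]" must be escaped to be a member of the class.
    "escaped ]": full(r"[\]x]", "]"),
}
```

Every entry in checks evaluates to True, one per rule mentioned in the answer.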

Related

How to capture a literal in antlr4?

I am looking to make a rule for a regex character class that is of the form:
character_range
: '[' literal '-' literal ']'
;
For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how I would define a "literal" in antlr4. Normally I would do [a-zA-Z], but then this is just ascii and won't include something such as é. So how would I do that?
Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.
In the parser:
character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;
And in the lexer:
OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
DASH: '-';
LETTER: [\p{L}];
The character class used in the LETTER lexer rule is Unicode Letters as described in the Unicode description file of ANTLR. Other possible character classes are listed in the UAX #44 Annex of the Unicode Character DB. You may need others like Numbers, Punctuation or Separators for all possible regex character classes.
You can also define a range of unicode characters. Try something like this in your lexer rules:
fragment LETTER: [a-zA-Z];
fragment LETTER_UNICODE: [\u0080-\uFFFF];
UTF8CHAR: ( LETTER | LETTER_UNICODE );
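For comparison, the same shape can be recognised by hand outside ANTLR. The sketch below is a Python analogue, not generated parser code: the function name is hypothetical, and str.isalpha() stands in for the LETTER rule because it is true for characters in the Unicode letter categories.

```python
def parse_character_range(text):
    """Return (low, high) for input shaped like "[x-y]", else None.

    Hypothetical helper: str.isalpha() plays the role of the LETTER rule.
    """
    if len(text) != 5:
        return None
    open_bracket, low, dash, high, close_bracket = text
    if (open_bracket, dash, close_bracket) != ("[", "-", "]"):
        return None
    if not (low.isalpha() and high.isalpha()):
        return None
    return low, high
```

Like the LETTER-only grammar, this rejects [1-5], which is why other categories such as Numbers would be needed to cover every regex character class.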

Printing "\" character in C++

This question may be silly, but it would be great to understand the behavior.
I try to print
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
using a simple program
char testme [] ="\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\0";
cout<<"testme:"<<testme<<endl;
The output in this case is
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
I intended to print 64 "\" characters, but the output is 32 "\" characters.
There seems to be something I am missing, since the output is exactly half.
Edit: The reason I am asking is that I have to XOR (^) "\" with another char for HMAC encryption and I see some weird things.
In C++11 you can do this:
char testme [] =R"(\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\0)";
cout<<"testme:"<<testme<<endl;
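R"(...)" is a raw string literal.
To represent a backslash (\) in an ordinary string literal, we have to precede it with another backslash. To avoid errors caused by too many backslashes, C++11 provides raw string literals, in which a backslash is just a backslash. Note that this also applies to the trailing \0 above: inside a raw string it is printed as the two characters \ and 0, so it should be left out.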
This is called escaping and is a mechanism to insert certain characters into a string. For example, if you want to insert a citation mark into a string, you need to escape it.
char testme [] ="I am a so called \"programmer\".";
There's also \n, \t and other codes. This also applies to \ itself, since you might want a string that says \n without it being converted into a newline character.
char testme [] ="This is a backslash followed by the letter n: \\n";
\\ is used to denote a single backslash: \. This is because \ is used in string literals to denote other symbols like \t for a tab, \n for a newline and \" for a quotation character.
So \\ gives you one backslash, \\\\ gives you two and so on.
On printing a \, the standard states:
C11; 6.4.4.4 Character constants:
The double-quote " and question-mark ? are representable either by themselves or by the
escape sequences \" and \?, respectively, but the single-quote ' and the backslash \
shall be represented, respectively, by the escape sequences \' and \\. [1]
That means that to print one \ you need an extra backslash. To print two you need four (\\\\), and hence for 64 backslashes you need 128.
[1] Emphasis is mine.
\ is a special character known as the escape character. For example, \n means the newline character. So if you want to print a single \, you have to write \\. The first \ tells the compiler not to treat the next \ as an escape character.
If it is C++, why not use string:
string testme(64, '\\');
cout << testme << endl;
The backslash \ is a very widespread escape character, and C++ also uses it like that. This means it's used to express special meaning (usually nonprintable characters). For example, to encode a line-feed character (ASCII 10) into a string, you express it as \n in the string literal. Another example, putting a single backslash at the end of a line (that is, before the line's terminating newline character) escapes the newline - so this way, you can continue a macro definition or //-style comment across several source file lines, and they will still count as one logical line.
This of course means that to get a literal backslash character, you have to escape the backslash itself to remove its "escape character" status. So typing \\ into a string literal yields a literal \ character.
That's why you get only half the amount of backslashes output - the C++ source code parser consumes two to produce one.
Didn't you notice one thing?
You wrote 64 '\' but it printed only 32 of them.
Did you try 60, or 54, or some odd count, say 33?
In C, '\' is the escape character. You must have used '\n' for a newline before; didn't you notice then that the '\' is not printed?
To print '\' you must use '\\'.
A question for you:
Try printing 64 '%'. See what you get. Try to understand the reason for the output.

Regex Check Whether a string contains characters other than specified

How to check whether a string contains character other than:
Alphabets (lower-case/upper-case)
digits
Space
Comma(,)
Period (.)
Bracket ( )
&
'
$
+(plus) minus(-) (*) (=) arithmetic operator
/
using regular expression in ColdFusion?
I want to make sure a string doesn't contain even single character other than the specified.
You can find if there are any invalid characters like this:
<cfif refind( "[^a-zA-Z0-9 ,.&'$()\-+*=/]" , Input ) >
<!--- invalid character found --->
</cfif>
Where the [...] is a character class (match any single char from within), and the ^ at the start means "NOT" - i.e. if it finds anything that is not an accepted char, refind returns its position, a non-zero value that cfif treats as true.
I don't understand "Small Bracket(opening closing)", but guess you mean < and > there? If you want () or {} just swap them over. For [] you need to escape them as \[\]
Character Class Escaping
Inside a character class, only a handful of characters need escaping with a backslash, these are:
\ - if you want a literal backslash, escape it.
^ - a caret must be escaped if it's the first character, otherwise it negates the class.
- - a dash creates a range. It must be escaped unless first/last (but recommended always to be)
[ and ] - both brackets should be escaped.
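The same negated-class check can be sketched in Python for anyone who wants to try it outside ColdFusion. The function name is made up, and returning a 1-based position with 0 for "nothing found" is a choice made here to resemble refind.

```python
import re

# Same negated class as the refind() call; "-" is escaped, the rest is literal.
INVALID_CHAR = re.compile(r"[^a-zA-Z0-9 ,.&'$()\-+*=/]")

def first_invalid_position(text):
    """Return the 1-based position of the first disallowed character, or 0."""
    found = INVALID_CHAR.search(text)
    return found.start() + 1 if found else 0
```

For instance, first_invalid_position("Price: $5") is 6 because the colon is not in the list, while a string built only from the listed characters gives 0.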
ColdFusion uses Java's engine to parse regular expressions. Anyway, to make sure a string contains only the characters you mentioned, try:
^(?![a-zA-Z0-9 ,.&$']*[^a-zA-Z0-9 ,.&$']).*$
The above expression would only work if you are parsing the file line by line. If you want to apply this to text which contains multiple lines then you need to use the global modifier and the multi-line modifier and change the expression a bit like this:
^(?![a-zA-Z0-9 ,.&$']*[^a-zA-Z0-9 ,.&$'\r\n]).*$
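A quick way to see the multi-line variant at work is the Python sketch below. The sample text is invented, re.MULTILINE plays the role of the multi-line modifier, and findall stands in for the global modifier.

```python
import re

# The multi-line expression from above, unchanged.
ALLOWED_LINE = re.compile(
    r"^(?![a-zA-Z0-9 ,.&$']*[^a-zA-Z0-9 ,.&$'\r\n]).*$",
    re.MULTILINE,
)

# Hypothetical input: the middle line contains "#", which is not allowed.
text = "good line\nbad#line\nok"

# Only lines made up entirely of allowed characters are returned.
clean_lines = ALLOWED_LINE.findall(text)
```

Here clean_lines holds the first and last lines only, because the negative lookahead rejects any line in which a disallowed character can be reached.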
The regular expression:
[^][a-zA-Z0-9 ,.&'$]
will match if the string contains any characters other than the ones in your list.

c++ regexp for not preceded by backslash and preceded by backslash

I can only find negative lookbehind for this, something like (?<!\\).
But this won't compile in C++ and flex. It seems that neither regex.h nor flex supports this?
I am trying to implement a shell which has to treat special chars like >, < or | as normal argument strings if preceded by a backslash. In other words, only treat a special char as special if it is preceded by 0 or an even number of '\'.
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.
In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\<
LITERAL_GREATERTHAN \\>
LITERAL_VERTICALBAR \\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\\" (i.e. a literal backslash)
\\> matches any instance of "\>" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match " \n" (i.e. eat your trailing command terminator character!).
Hope that helps!
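To see how the shortened ARGUMENT rule splits the three commands from the question, here is a minimal Python sketch. It is not Flex: the OPERATOR and BLANK branches and the tokenize function are assumptions added so that a whole line can be walked left to right.

```python
import re

# ARGUMENT is the shortened rule above; OPERATOR and the blank-skipping
# branch are assumptions added so the sketch can walk a whole line.
TOKEN = re.compile(
    r"(?P<ARGUMENT>(?:[^ \t\\><|]|\\[^ \t\r\n])+)"
    r"|(?P<OPERATOR>[><|])"
    r"|(?P<BLANK>[ \t]+)"
)

def tokenize(line):
    """Split a command line into (kind, text) pairs, dropping blanks."""
    return [
        (m.lastgroup, m.group())
        for m in TOKEN.finditer(line)
        if m.lastgroup != "BLANK"
    ]
```

With this, echo \>a yields the arguments echo and \>a, while echo \\>a yields echo, \\, the operator > and a. Anything no branch can match, such as a lone trailing backslash, is silently skipped by finditer, so a real lexer would need an error rule for it.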
You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
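The expression can be tried out in Python, where a raw string keeps the backslashes readable. One caveat worth hedging: an unanchored search may start in the middle of a run of backslashes, so the sketch also shows a variant with a lookbehind. That variant is an assumption for illustration only, since the question says lookbehind is not available in flex or regex.h.

```python
import re

# The expression from above: backslash pairs, one more backslash, then <, > or |.
ESCAPED_SPECIAL = re.compile(r"(\\\\)*\\[<>|]")

# Assumption: a lookbehind keeps a match from starting inside a backslash run.
ANCHORED = re.compile(r"(?<!\\)(?:\\\\)*\\[<>|]")

def is_escaped_special(token):
    """True when the whole token is an odd run of backslashes plus <, > or |."""
    return ESCAPED_SPECIAL.fullmatch(token) is not None

# Two backslashes then ">": the ">" is NOT escaped, yet a plain search still
# finds a match beginning at the second backslash.
unanchored_hit = ESCAPED_SPECIAL.search(r"a\\>b") is not None
anchored_hit = ANCHORED.search(r"a\\>b") is not None
```

When the whole token is tested, or the match is anchored, a > preceded by two backslashes is correctly left alone. This supports the advice above to scan left to right rather than hunt for single characters.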

Vim regex not matching spaces in a character class

I'm using vim to do a search and replace with this command:
%s/lambda\s*{\([\n\s\S]\)*//gc
I'm trying to match for all word, endline and whitespace characters after a {. For instance, the entirety of this line should match:
lambda {
FactoryGirl.create ...
Instead, it only matches up to the newline and no spaces before FactoryGirl. I've tried manually replacing all the spaces before, just in case there were tab characters instead, but no dice. Can anyone explain why this doesn't work?
The \s is an atom for whitespace; \n, though it looks similar, syntactically is an escape sequence for a newline character. Inside the collection atom [...], you cannot include other atoms, only characters (including some special ones like \n). From :help /[]:
The following translations are accepted when the 'l' flag is not
included in 'cpoptions' {not in Vi}:
\e <Esc>
\t <Tab>
\r <CR> (NOT end-of-line!)
\b <BS>
\n line break, see above |/[\n]|
\d123 decimal number of character
\o40 octal number of character up to 0377
\x20 hexadecimal number of character up to 0xff
\u20AC hex. number of multibyte character up to 0xffff
\U1234 hex. number of multibyte character up to 0xffffffff
NOTE: The other backslash codes mentioned above do not work inside
[]!
So, either specify the whitespace characters literally [ \t\n...], use the corresponding character class expression [[:space:]...], or combine the atom with the collection via logical or \%(\s\|[...]\).
Vim interprets characters inside [ ... ] character classes differently. They are not taken literally, since otherwise that regex would fully match lambda {sss or lambda {\\\, and it doesn't. What \s and \S are interpreted as... I still can't explain.
However, I was able to achieve nearly what I wanted with:
%s/lambda\s*{\([\n a-zA-Z]\)*//gc
That ignores punctuation, which I wanted. This works, but is dangerous:
%s/lambda\s*{\([\n a-zA-Z]\|.\)*//gc
Because adding a character after the last one, such as }, causes vim to hang while matching. So my solution was to add the punctuation I needed into the character class.