What does the regex "/\\*{2,}/" mean? [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I'm kinda new to regex, and specifically, I don't understand there are 2 backslashes? I mean, I know the second one is to escape the character "*", but what does the first backslash do?
Well I'm passing this regex expression to the php function preg_match(), and I'm trying to find strings that include 2 or more consecutive "*".

That regex is invalid syntax.
You have this piece:
*{2,}
Which basically would read: match n-times, 2 or more times.
The following regex:
/\\*.{2,}/
Is the simplest and closest regex to the one you have, which would read as:
match 0 or more '\' and 2 or more characters that aren't newlines
If you are talking about the string itself, is may be interpreted as 2 things:
/\\*{2,}/
Read as: match a single \ and another \ n-times 2 times or more
This is invalid syntax
/\*{2,}\
Read as match 2 or more *
This is valid syntax
It all varies, depending on the escape character.
Edit:
Since the question was updated to show which language and engine it is being used, I've updated to add the following information:
You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).
Both are very similar, except that single quotes ('') only support the following escape sequences:
\' - Produces '
\\- Produces \
Double-quoted strings are treated differently in PHP. And they support almost any escape sequence, like:
\" - Produces "
\' - Produces '
\\ - Produces \
\x<2-digit hex number> - Same as chr(0x<2-digit hex number>)
\0 - Produces a null char
\1 - Produces a control char (same as chr(1))
\u<4-digit hex number> - Produces an UTF-8 character
\r - Produces a newline on old OSX
\n - Produces a newline on Linux/newer OSX/Windows (when writting a file without b)
\t - Produces a tab
\<number> or \0<number> - Same as \x, but the numbers are in octal (e.g.: "\75" and "\075" produce =)
... (some more that I probably forgot) ...
\<anything> - Produces <anything>
Read more about this on https://php.net/manual/en/language.types.string.php

Depending on the platrofm you're using, "/\\*{2,}/" may actually be a representation of a /\*{2,'}/ string - this is because languages like Java treat \ as an escape character, so to actually put that character within regex, you need to escape the character in regex string.
So, we have /\*{2'}/ regex. \*' matches the star character, and{2,}` means at least two times. Your regex will match any two or more consecutive star characters.

Is it a string literal written in a program and if so which one? The double backslash may be to escape the escape char so that this regex matches at least 2 * star characters.
In JavaScript for example you need to escape the \ so that your string literal can express it as data before you transform it into a regular expression when using the RegExp constructor. Why do regex constructors need to be double escaped?

For PHP what you have with that regex is to repeat literally a * 2 or more times. You can easily see with with below diagram:
But when you have to code it in PHP you have to escape the backslash (with a backslash) to use it in string. For instance:
$re = "/\\*{2,}/";
$str = "...";
preg_match($re, $str, $matches);

Related

How to exclude part of string using regex and change add this part and the and of string?

I've got a little problem with regex.
I got few strings in one file looking like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define correct regex and change those string name for:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
Actually, I got my regex:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in file
But how should look like the changing regex? I need to have the numbers from a string at the and of new string name.
.D%%ODATE.SYSCOP.# - this is just string, no regex and It didn't work
Any idea?
Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python
Demo
You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
See the regex demo
Each \X in the replacement pattern refers to the Nth parentheses in the pattern, thus, you may rearrange the match value as per your needs.
Note that . must be escaped to match a literal dot.
Please mind the raw string literal, the r prefix before the string literals helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2', you may print the string literals and see for yourself. In short, inside raw string literals, string escape sequences like \a, \f, \n or \r are not recognized, and the backslash is treated as a literal backslash, just the one that is used to build regex escape sequences (note that r'\n' and '\n' both match a newline since the first one is a regex escape sequence matching a newline and the second is a literal LF symbol.)

Why do I need to write \\d instead of \d in a C++ regex?

I'mm starting to learn about Regular Expressions and I have written code in c++
my task is : Implement a function that replaces each digit in the given string with a '#' character.
For my example, the inputstring = "12 points".
I know I need to use \d for matches a digit. I tried to use this : std::regex_replace(input,std::regex("\d"),"#");
but it is not working: the output is still "12 points";
Then I searched the internet and the result is:
std::regex_replace(input,std::regex("\\d"),"#");
with the output is "## points".
Can anyone help me to understand what is "\\d" ?
\d means decimal, however, in the regular expression, the \ is a special character, which needs to be escaped on its own as well, hence in \\d you escape the \ to mark it to be used as a regular character instead of its special meaning.
When you use "\d" in a C++ application, the \ is an escape character in C++. So it doesn't treat the following d as a d.
Regex then gets a string that doesn't have \d in it, but most likely an empty string (since \d doesn't evaluate to anything in C++ to my knowledge).
When you use "\d" you are escaping the . So C++ reads the string as "\d" as you intended.
An example of when you'd use an escape character, is when you want to output a quote. "\"" would output a single double quote.

Flex regular expression String [duplicate]

This question already has answers here:
Regular expression for a string literal in flex/lex
(6 answers)
Closed 7 years ago.
I've got a regular expression that matches strings opening with " and closing with " and can contain \".
The regular expression is this \"".*[^\\]"\".
I don't understand what's the " that is followed after \" and after the [^\\].
Also this regular expression works when I have a \n inside a string but the . rule on flex doesn't match a \n.
I just tested for example the string "aaaaa\naaa\naaaa".
It matched it with no problem.
I made a regex for flex that matches what I need. It's this one \"(([^\\\"])|([\\\"]))*\". I understand how this works though.
Also I just tested my solutions against an "" an empty string. It doesn't work. Also the answers from all those that answered have been tested and don't work as well.
The pattern is a little naive and even indeed false. It doesn't handle correctly escaped quotes because it assumes that the closing quote is the first one that is not preceded by a backslash. This is a false assumption.
The closing quote can be preceded by a literal backslash (a backslash that is escaped with an other backslash, so the second backslash is no longer escaping the quote), example: "abcde\\" (so the content of this string is abcde\)
This is the pattern to deal with all cases:
\"[^"\\]*(?s:\\.[^"\\]*)*\"
or perhaps (I don't know exactly where you need to escape literal quotes in a flex pattern):
\"[^\"\\]*(?s:\\.[^\"\\]*)*\"
Note that the s modifier allows the dot to match newlines inside the non capturing group.
I just figured out everything :P
This \"".*[^\\]"\" works because in flex it means: I want to match something that starts with " and ends with ". Inside these quotes there will be another matching pattern(that's why there are the unexplained ", as I was pondering their existence in my question) that can be any set of any characters, but CANNOT end with \.
What confused me more was the use of ., cause in flex it means that it will match any character except a new line \n. So I was mistakenly thinking that it won't match a string such as "aaa\naaa".
But the reality is it will match it, because when flex reads it will read first \ and then n.
The TRUE newline would be, something like this:
"something
like
this"
But compilers in -ansi C for example(haven't tested it on other versions other than ansi) do not let you declare a string using in different lines.
I hope my answer is clear enough. Cheers.
Your pattern does not match "hello" but it matches ""hello"".
if you want to match anything that is in quotes and may contain \" try something like:
/(\"[\na-zA-Z\\"]*\")/gs

In postgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!
I think you have tested it the wrong way because I'm getting the opposite results that you got.
select ' w' ~ '^\s\w$';
Is returning 1 in my case. Which actually makes sense because it is matching the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
Is returning 0 and it makes sense too. Here you're trying to match a backslash at the beginning of the text followed by an s and then, by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'
Check the fiddle here.
The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example the right hand constant is interpreted as '^\sw*$' where \\s escapes the \
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.
Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgresSQL does it. PostgresSQL is translating the backslash escaping to arrive at a string value, and then feeding that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgresSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the lvalue because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the #"string literal" feature to disable backslash escaping in strings they intend to use as regexes.

How is \\n and \\\n interpreted by the expanded regular expression?

Within an ERE, a backslash character (\, \a, \b, \f, \n,
\r, \t, \v) is considered to begin an escape sequence.
Then I see \\n and [\\\n], I can guess though both \\n and [\\\n] here means \ followed by new line, but I'm confused by the exact process to interpret such sequence as how many \s are required at all?
UPDATE
I don't have problem understanding regex in programing languages so please make the context within the lexer.
[root# ]# echo "test\
> hi"
This is dependent on the programming language and on its string handling options.
For example, in Java strings, if you need a literal backslash in a string, you need to double it. So the regex \n must be written as "\\n". If you plan to match a backslash using a regex, then you need to escape it twice - once for Java's string handler, and once for the regex engine. So, to match \, the regex is \\, and the corresponding Java string is "\\\\".
Many programming languages have special "verbatim" or "raw" strings where you don't need to escape backslashes. So the regex \n can be written as a normal Python string as "\\n" or as a Python raw string as r"\n". The Python string "\n" is the actual newline character.
This can becoming confusing, because sometimes not escaping the backslash happens to work. For example the Python string "\d\n" happens to work as a regex that's intended to match a digit, followed by a newline. This is because \d isn't a recognized character escape sequence in Python strings, so it's kept as a literal \d and fed that way to the regex engine. The \n is translated to an actual newline, but that happens to match the newline in the string that the regex is tested against.
However, if you forget to escape a backslash where the resulting sequence is a valid character escape sequence, bad things happen. For example, the regex \bfoo\b matches an entire word foo (but it doesn't match the foo in foobar). If you write the regex string as "\bfoo\b", the \bs are translated into backspace characters by the string processor, so the regex engine is told to match <backspace>foo<backspace> which obviously will fail.
Solution: Always use verbatim strings where you have them (e. g. Python's r"...", .NET's #"...") or use regex literals where you have them (e. g. JavaScript's and Ruby's /.../). Or use RegexBuddy to automatically translate the regex for you into your language's special format.
To get back to your examples:
\\n as a regex means "Match a backslash, followed by n"
[\\\n] as a regex means "Match either a backslash or a newline character".
Actually regex string specified by string literal is processed by two compilers: programming language compiler and regexp compiler:
Original Compiled Regex compiled
"\n" NL NL
"\\n" '\'+'n' NL
"\\\n" '\'+NL NL
"\\\\n" '\'+'\'+'n' '\'+'n'
So you must use the shortest format "\n".
Code examples:
JavaScript:
'a\nb'.replace(RegExp("\n"),'<br>')
'a\nb'.replace(RegExp("\\n"),'<br>')
'a\nb'.replace(RegExp("\\\n"),'<br>')
but not:
'a\nb'.replace(/\\\n/,'<br>')
Java:
System.out.println("a\nb".replaceAll("\n","<br>"));
System.out.println("a\nb".replaceAll("\\n","<br>"));
System.out.println("a\nb".replaceAll("\\\n","<br>"));
Python:
str.join('<br>',regex.split('\n','a\nb'))
str.join('<br>',regex.split('\\n','a\nb'))
str.join('<br>',regex.split('\\\n','a\nb'))