Parsing regex with escaped pipe delimiter [duplicate] - regex

This question already has answers here:
regular expression to match pipe separated strings with pipe escaping
(4 answers)
Closed 3 years ago.
I'm trying to parse
|123|create|item|1497359166334|Sport|Some League|\|Team\| vs \|Team\||1497359216693|
With regex (https://regex101.com/r/KLzIOa/1/)
I currently have
[^|]++
Which is parsing everything correctly except \|Team\| vs \|Team\|
I would expect this to be parsed as |Team| vs |Team|
If I change the regex to
[^\\|]++
It parses the Teams separately instead of together with the escaped pipe
Basically I want to parse the fields between the pipes; however, if a field contains escaped pipes, I would like to capture them as part of that field. So with my example I would expect
["123", "create", "item", "1497359166334", "Sport", "Some League", "|Team| vs |Team|", "1497359216693"]

You can alternate between:
\\. - A literal backslash followed by anything, or
[^|\\]+ - Anything but a pipe or backslash
(?:\\.|[^|\\]+)+
https://regex101.com/r/KLzIOa/2
Note that there's no need for the possessive quantifier, because no backtracking will occur.
If you also want to replace \|s with |s, then do that afterwards: match \\\| and replace with |.
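As a rough illustration of the match-then-unescape approach above, here is a minimal Python sketch. The names FIELD, UNESCAPE and parse_fields are made up for this example, and it assumes empty fields (two adjacent pipes) can simply be skipped, since the pattern needs at least one character to match.

```python
import re

# One field: a run of "backslash + any char" pairs or characters that are
# neither a pipe nor a backslash.
FIELD = re.compile(r'(?:\\.|[^|\\]+)+')
# Drops the backslash from an escaped character, e.g. \| becomes |.
UNESCAPE = re.compile(r'\\(.)')

def parse_fields(line):
    """Return the pipe-delimited fields of line with escapes removed."""
    return [UNESCAPE.sub(r'\1', field) for field in FIELD.findall(line)]

line = r'|123|create|item|1497359166334|Sport|Some League|\|Team\| vs \|Team\||1497359216693|'
fields = parse_fields(line)
print(fields)
```

Doing the unescaping in a second pass keeps the matching pattern simple, at the cost of walking each field twice.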

To handle escaping, you should match a backslash and the character after it as a single "item".
(?:\\.|[^|])++
This conveniently also works for escaping the backslashes themselves!
To then remove the backslashes from the results, use a simple replacement:
Replace: \\(.)
With: $1

Use:
(?:\\\||[^|])+

Related

Regex divide string by commas ignoring function syntax [duplicate]

This question already has answers here:
Split string delimited by comma without respect to commas in brackets
(3 answers)
Closed 4 years ago.
I need a regex that replaces the commas in a string, except for the commas inside function-call brackets.
For example the string:
str1 = "a,b,12,func(a,b),8,bob,func(1,2))"
should be transformed as following:
str1_transformed = "a;b;12;func(a,b);8;bob;func(1,2))"
I cannot substitute every "," with a ";" because it will look like:
str1_wrong = "a;b;12;func(a;b);8;bob;func(1;2))"
How can I deal with it?
I looked at the following threads without success:
How can I Split(',') a string while ignore commas in between quotes?
Regular Expression for Comma Based Splitting Ignoring Commas inside Quotes
If you know that you won't have unbalanced or escaped brackets, the regex below works well:
,(?![^()]*\))
Breakdown:
, Match a comma
(?! Start of negative lookahead
[^()]*\) The comma must not be followed by a closing bracket unless an opening bracket comes first, i.e. the comma is not inside a pair of brackets
) End of lookahead
C# code:
// requires: using System.Text.RegularExpressions;
Regex regex = new Regex(@",(?![^()]*\))");
string result = regex.Replace(@"a,b,12,func(a,b),8,bob,func(1,2))", @";");
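The same lookahead can be tried out in Python; this is only a sketch under the same assumption as above (no nested or escaped brackets), and the name TOP_LEVEL_COMMA is made up for the example.

```python
import re

# A comma that is NOT followed by ")" without a bracket in between,
# i.e. a comma that is not inside a (non-nested) pair of brackets.
TOP_LEVEL_COMMA = re.compile(r',(?![^()]*\))')

str1 = "a,b,12,func(a,b),8,bob,func(1,2))"
str1_transformed = TOP_LEVEL_COMMA.sub(';', str1)
print(str1_transformed)
```

With nested calls such as func(a,g(b),c) the lookahead is no longer enough, and a small bracket-depth counter is probably the more robust choice.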

\w doesn't work in vim search replace but a-zA-Z does? [duplicate]

This question already has answers here:
Vim regex with metacharacters inside bracket
(3 answers)
Closed 4 years ago.
tldr
[a-zA-Z\.-] works in Vim regex search replace, but [\w\.-] does not.
The text I'm searching:
1 string.here blah blah
24 another-string.here blah.
1523 another-string.goes.here. blah123
Desired output
string.here
another-string.here
another-string.goes.here
My Question
Why does this work:
:%s/\v^\d+\s+([a-zA-Z\.-]+)\s+.*/\1/g
But this does not:
:%s/\v^\d+\s+([\w\.-]+)\s+.*/\1/g
E486: Pattern not found :%s/\v^\d+\s+([\w\.-]+)\s+.*/\1/g
The only difference between the two is a-zA-Z vs \w inside square brackets. But doesn't \w equal a-zA-Z (plus some other non-whitespace characters not in this example text)?
I'm using default vim. Unmodified. Whatever comes with Ubuntu.
Non-vim platforms
When I try with the atom text editor instead of vim, both expressions work.
Search: ^\d+\s+([a-zA-Z\.-]+)\s+.*
Replace: $1
When I try with RegExr both expressions work. (Although I have to add the multiline tag)
Other things I've tried
My understanding is that \v is necessary for avoiding escaping hell. I've tried without it:
:%s/^\d\+\s\+\([a-zA-Z\.-]\+\)\s\+.*/\1/g
works
:%s/^\d\+\s\+\([\w\.-]\+\)\s\+.*/\1/g
does not work. ("Pattern not found")
I've also tried adding the m flag (so the end is /gm) but that didn't work
E488: Trailing characters
I've also tried without the ^.
:%s/\d\+\s\+\([\w\.-]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\([\w\.-]\+\)\s\+.*
I've also tried using \\w instead of \w.
:%s/\d\+\s\+\([\\w\.-]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\([\\w\.-]\+\)\s\+.*
I've also tried using \[ \] instead of [ ].
:%s/\d\+\s\+\(\[\\w\.-\]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\(\[\\w\.-\]\+\)\s\+.*
[a-zA-Z\.-] works in Vim regex search replace, but [\w\.-] does not.
[a-zA-Z\.-] is a collection of characters containing:
every character from a to z,
every character from A to Z,
the character .,
and the character -.
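:help /collection is regrettably not very prominent about this, but shorthand classes like \w are not available inside [ ]. There, a backslash followed by a character such as w is taken literally, so [\w\.-] is really a collection containing:
the character \,
the character w,
the character .,
and the character -.
None of these covers the letters in your text, hence "Pattern not found". To combine \w with other characters, use a bracket class such as [[:alnum:]_.-] or an alternation such as (\w|[.-])+ (with \v).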
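For contrast with Vim, engines such as Python's re do keep the meaning of \w inside a bracket expression, which is consistent with the question's observation that other tools accept both patterns. A small sketch using the sample text from the question (the variable names are made up for the example):

```python
import re

text = """1 string.here blah blah
24 another-string.here blah.
1523 another-string.goes.here. blah123"""

# In Python's re module, \w inside [...] still means "word character".
pattern = re.compile(r'^\d+\s+([\w.-]+)\s+.*$', re.MULTILINE)
result = pattern.sub(r'\1', text)
print(result)
```

Note that, as with the original pattern, the trailing dot of the third line is kept, because . is part of the character class.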

Prettier auto "correct" regex escaping backslash `\` [duplicate]

This question already has answers here:
Why do regex constructors need to be double escaped?
(5 answers)
Closed 5 years ago.
pattern: '^131\.[0-9]{6}$',
Prettier changes it to pattern: '^131.[0-9]{6}$',. Is there a way to make it ignore a line or a file?
Assuming JavaScript (as you're using Prettier), the '^131\.[0-9]{6}$' is just a string, not a regex. Prettier removes unnecessary escape characters when reformatting. As \. isn't a meaningful escape in a string literal, it's the same as just having . on its own in string context.
Your aim is to get \. into a regex, which I assume you're going to create using the new RegExp() constructor; in that case you want to escape the backslash:
pattern: '^131\\.[0-9]{6}$'
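To see why the lost backslash matters, here is a small Python sketch (Python rather than JavaScript, so only the regex behaviour carries over, not Prettier's string handling). The sample inputs are made up for the example.

```python
import re

# What the pattern becomes once the backslash is gone: "." matches any character.
unescaped = re.compile('^131.[0-9]{6}$')
# Doubling the backslash in the string literal puts a real \. into the regex.
escaped = re.compile('^131\\.[0-9]{6}$')

loose_match = unescaped.match('131X123456') is not None   # accepted by mistake
strict_reject = escaped.match('131X123456') is None       # correctly rejected
strict_match = escaped.match('131.123456') is not None    # literal dot accepted
print(loose_match, strict_reject, strict_match)
```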

What does the regex "/\\*{2,}/" mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I'm kinda new to regex, and specifically, I don't understand why there are 2 backslashes. I mean, I know the second one is to escape the character "*", but what does the first backslash do?
Well, I'm passing this regex to the PHP function preg_match(), and I'm trying to find strings that include 2 or more consecutive "*".
That regex is invalid syntax.
You have this piece:
*{2,}
Which basically would read: match n-times, 2 or more times.
The following regex:
/\\*.{2,}/
Is the simplest and closest regex to the one you have, which would read as:
match 0 or more '\' and 2 or more characters that aren't newlines
If you are talking about the string itself, it may be interpreted in 2 ways:
/\\*{2,}/
Read as: match a literal \ zero or more times, and then repeat that 2 times or more
This is invalid syntax
/\*{2,}/
Read as: match 2 or more *
This is valid syntax
It all varies, depending on the escape character.
Edit:
Since the question was updated to show which language and engine it is being used, I've updated to add the following information:
You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).
Both produce the same pattern, but single-quoted strings ('') only support the following escape sequences:
\' - Produces '
\\ - Produces \
Double-quoted strings are treated differently in PHP. They support many more escape sequences, like:
\" - Produces "
\\ - Produces \
\$ - Produces $
\x<1-2 hex digits> - Same as chr(0x<hex digits>)
\0 - Produces a null char
\1 - Produces a control char (same as chr(1))
\u{<hex digits>} - Produces the UTF-8 encoding of that Unicode code point (PHP 7.0 and later)
\r - Produces a carriage return (the newline on classic Mac OS)
\n - Produces a line feed (the newline on Linux and macOS; Windows uses \r\n)
\t - Produces a tab
\<1-3 octal digits> - Same as \x, but the number is in octal (e.g.: "\75" and "\075" produce =)
... (plus \v, \e and \f) ...
\<anything else> - Stays as it is, backslash included (so "\*" is the two characters \*, and \' is not an escape inside double quotes)
Read more about this on https://php.net/manual/en/language.types.string.php
Depending on the platform you're using, "/\\*{2,}/" may actually be a representation of a /\*{2,}/ string - this is because languages like Java treat \ as an escape character, so to actually put that character within the regex, you need to escape it in the regex string.
So, we have the /\*{2,}/ regex. \* matches the star character, and {2,} means at least two times. Your regex will match any two or more consecutive star characters.
Is it a string literal written in a program and if so which one? The double backslash may be to escape the escape char so that this regex matches at least 2 * star characters.
In JavaScript for example you need to escape the \ so that your string literal can express it as data before you transform it into a regular expression when using the RegExp constructor. Why do regex constructors need to be double escaped?
For PHP, what you have with that regex is a literal * repeated 2 or more times.
But when you have to code it in PHP, you have to escape the backslash (with a backslash) to use it in a string. For instance:
$re = "/\\*{2,}/";
$str = "...";
preg_match($re, $str, $matches);
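The two layers of escaping (string literal first, regex second) can be checked in Python, which treats backslashes in ordinary string literals much like the double-quoted PHP string above. This is only a sketch; the delimiters / are omitted because Python's re does not use them, and the names are made up for the example.

```python
import re

# String-literal layer: '\\*' is the two characters \ and *.
pattern_text = '\\*{2,}'
same_as_raw = (pattern_text == r'\*{2,}')

# Regex layer: \* is a literal star, {2,} means "two or more times".
stars = re.compile(pattern_text)
found = stars.search('a**b')
found_text = found.group() if found else None
single_star = stars.search('a*b')
print(same_as_raw, found_text, single_star)
```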

Flex regular expression String [duplicate]

This question already has answers here:
Regular expression for a string literal in flex/lex
(6 answers)
Closed 7 years ago.
I've got a regular expression that matches strings opening with " and closing with " and can contain \".
The regular expression is this \"".*[^\\]"\".
I don't understand what's the " that is followed after \" and after the [^\\].
Also this regular expression works when I have a \n inside a string but the . rule on flex doesn't match a \n.
I just tested for example the string "aaaaa\naaa\naaaa".
It matched it with no problem.
I made a regex for flex that matches what I need. It's this one \"(([^\\\"])|([\\\"]))*\". I understand how this works though.
Also I just tested my solution against "", an empty string. It doesn't work. The patterns from all the answers have also been tested and don't work either.
The pattern is naive and in fact wrong. It doesn't handle escaped quotes correctly because it assumes that the closing quote is the first one that is not preceded by a backslash. This is a false assumption.
The closing quote can be preceded by a literal backslash (a backslash that is escaped with another backslash, so the second backslash is no longer escaping the quote), example: "abcde\\" (so the content of this string is abcde\)
This is the pattern to deal with all cases:
\"[^"\\]*(?s:\\.[^"\\]*)*\"
or perhaps (I don't know exactly where you need to escape literal quotes in a flex pattern):
\"[^\"\\]*(?s:\\.[^\"\\]*)*\"
Note that the s modifier allows the dot to match newlines inside the non capturing group.
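The behaviour of this pattern can be checked outside flex; below is a minimal Python sketch (Python's re also supports the scoped (?s:...) group). Quoting rules in flex differ, so this only exercises the regex logic, and the sample strings are made up for the example.

```python
import re

# Opening quote, plain characters, then any number of
# "escaped character + plain characters" runs, then the closing quote.
STRING_LITERAL = re.compile(r'"[^"\\]*(?s:\\.[^"\\]*)*"')

valid = ['""', r'"abcde\\"', r'"say \"hi\""', '"aaa\\naaa"']
unterminated = r'"abc\"'  # the final quote is escaped, so the string never closes

valid_results = [STRING_LITERAL.fullmatch(s) is not None for s in valid]
unterminated_result = STRING_LITERAL.fullmatch(unterminated)
print(valid_results, unterminated_result)
```

Because a backslash is always consumed together with the character after it, an escaped backslash before the closing quote is handled without any lookbehind.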
I just figured out everything :P
This \"".*[^\\]"\" works because in flex it means: I want to match something that starts with " and ends with ". Inside these quotes there will be another matching pattern(that's why there are the unexplained ", as I was pondering their existence in my question) that can be any set of any characters, but CANNOT end with \.
What confused me more was the use of ., because in flex it matches any character except a newline \n. So I was mistakenly thinking that it wouldn't match a string such as "aaa\naaa".
But the reality is it will match it, because when flex reads it will read first \ and then n.
The TRUE newline would be, something like this:
"something
like
this"
But C compilers in -ansi mode, for example (I haven't tested modes other than ansi), do not let you declare a string literal that spans several lines.
I hope my answer is clear enough. Cheers.
Your pattern does not match "hello" but it matches ""hello"".
If you want to match anything that is in quotes and may contain \", try something like:
/(\"[\na-zA-Z\\"]*\")/gs