I have the following regular expression:
"\[(\d+)\].?\s+(\S+)\s+(\/+?)\\r\n"
I am pretty new to regex. I have this regexp and a string that I am trying to see if it matches or not. I believe it should match it but my program says it doesn't, and an online analyser says they do not match. I am pretty sure I am missing something small. Here is my string:
[1]+ Stopped sleep 60
However, both my program and an online regex tester report that they do not match. Why doesn't the string above match the regexp? Any ideas?
You appear to have escaped the \ before the \r, so the pattern looks for a literal backslash followed by the letter r rather than a carriage return.
RegExp interpretation and allowed characters vary slightly with implementation, so you should give your execution context, but this is probably generic enough.
Decomposing your regexp gives
\[ - an open bracket character.
(\d+) - one or more digits; save this as capture group 1 ($1).
\] - a close bracket character.
.? - 0 or 1 character, of any kind
\s+ - 1 or more spaces.
(\S+) - 1 or more non-space characters; save this as $2
\s+ - 1 or more spaces
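(\/+?) - 1 or more forward-slash characters, matched lazily (as few as possible); save this as $3. Your sample string contains no slash, so this part cannot match.
\\r\n" - a literal backslash, the letter r and a newline: an (incorrectly specified) end of line sequence, I think.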
First of all, if you want to match the end of a line, use $, not \r\n. That should match the end of a line in most contexts. ^ matches the beginning of a line.
Second, I can't tell from your regexp what you are trying to capture after the "Stopped" word, so I'm going to assume you want the rest as one block, including internal spaces. A reg-exp basically the same as yours will do it.
"\[(\d+)\].?\s+(\S+)\s+(.+)\s*$"
This captures
$1 = 1,
$2 = Stopped
$3 = sleep 60
This is basically the same as yours except for the end, which grabs everything after "stopped" up to the end of the line as a single capture group, $3, except for leading and trailing blanks. If you want to do additional parsing, replace the (.+) as appropriate. Note that there must be at least 1 non-blank character after "stopped " for this to match. If you want it to match even if there is no string $3, use \s*(.*)\s*$ instead of \s+(.+)\s*$
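To check the two patterns side by side, here is a minimal sketch in Python. Python's re module is an assumption on my part, since the question never says which engine is used, and the variable names are made up for illustration. It shows the original pattern failing on the sample line and the revised one capturing three groups.

```python
import re

line = "[1]+ Stopped sleep 60"

# Original pattern: demands one or more slashes, then a literal backslash + "r".
original = re.compile(r"\[(\d+)\].?\s+(\S+)\s+(\/+?)\\r\n")
# Revised pattern from the answer: capture the rest of the line instead.
revised = re.compile(r"\[(\d+)\].?\s+(\S+)\s+(.+)\s*$")

original_match = original.search(line)
revised_match = revised.search(line)

print(original_match)          # None
print(revised_match.groups())  # ('1', 'Stopped', 'sleep 60')
```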
Try to use this pattern:
\[\d+\]\+\s*\w+\s*\w+\s*\d+
The regex s/\A\s*\n// removes every all-whitespace line from the beginning of a string.
It leaves everything else alone, including any whitespace that might begin the first visible line.
By "visible line," I mean a line that satisfies /\S/.
The code below demonstrates this.
But how does it work?
\A anchors the start of the string
\s* greedily grabs all whitespace. But without the (?s) modifier, it should stop at the end of the first line, should it not?
See
https://perldoc.perl.org/perlre.
Suppose that without the (?s) modifier it nevertheless "treats the string as a single line".
Then I would expect the greedy \s* to grab every whitespace character it sees,
including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Nevertheless, the code does exactly what I want. Since I can't explain it, it's like a kludge, something that happens to work, discovered through trial and error. What is the reason it works?
#!/usr/bin/env perl
use strict; use warnings;
print $^V; print "\n";
my @strs=(
join('',"\n", "\t", ' ', "\n", "\t", ' dogs',),
join('',
"\n",
"\n\t\t\x20",
"\n\t\t\x20",
'......so what?',
"\n\t\t\x20",
),
);
my $count=0;
for my $onestring(@strs)
{
$count++;
print "\n$count ------------------------------------------\n";
print "|$onestring|\n";
(my $try1=$onestring)=~s/\A\s*\n//;
print "|$try1|\n";
}
But how does it work?
...
I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Correct -- the \s* at first grabs everything up to the d (in dogs), and with that the match would fail ... so it backs up, a character at a time, shortening that greedy grab so as to give the following pattern, here \n, a chance to match.
And that works! So \s* matches up to (the last!) \n, that one is matched by the following \n in the pattern, and all is well. That's removed and we are left with "\t dogs", which is printed.
This is called backtracking. See also perlretut about it. Backtracking can be suppressed, most notably by possessive forms (like \w++ etc.), or by the extended construct (?>...).
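The same backtracking can be observed outside Perl. The sketch below uses Python's re module (my choice for illustration, not something the question uses) on the first test string from the question; \A and \s behave the same way there.

```python
import re

# First test string from the question: "\n", "\t", " ", "\n", "\t", " dogs"
s = "\n" + "\t" + " " + "\n" + "\t" + " dogs"

m = re.match(r"\A\s*\n", s)
# \s* first swallows all six leading whitespace characters, then gives them
# back one at a time until the \n in the pattern finds a line feed to match.
matched = m.group(0)
stripped = re.sub(r"\A\s*\n", "", s, count=1)

print(repr(matched))   # '\n\t \n'
print(repr(stripped))  # '\t dogs'
```

So the match ends at the last line feed before the first visible line, and the indentation of that line survives.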
But without the (?s) modifier, it should stop at the end of the first line, should it not?
Here you may be confusing \s with ., which indeed does not match \n (without /s)
There are two questions here.
The first is about the interaction of \s and (lack of) (?s). Quite simply, there is no interaction.
\s matches whitespace characters, which includes Line Feed (LF). It's not affected by (?s) whatsoever.
(?s) exclusively affects the . metacharacter.
(?-s) causes . to match all characters except LF. [Default]
(?s) causes . to match all characters.
If one wanted to match whitespace on the current line, one could use \h instead of \s. It only matches horizontal whitespace, thus excluding CR and LF (among others).
Alternatively, (?[ \s - \n ])[1], [^\S\n][2] and \s(?<!\n)[3] all match whitespace characters other than LF.
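For a quick cross-check in another engine, here is a small Python sketch (an assumption: Python's re.DOTALL plays the role of (?s); Python's re has neither \h nor (?[ ... ]), so only the [^\S\n] form is shown).

```python
import re

text = "a\n\t b"

# \s matches the line feed whether or not DOTALL is set.
ws_default = re.findall(r"\s", text)
ws_dotall = re.findall(r"\s", text, flags=re.DOTALL)

# Only "." changes behaviour with DOTALL.
dot_default = re.fullmatch(r".", "\n")
dot_dotall = re.fullmatch(r".", "\n", flags=re.DOTALL)

# The double-negative class keeps whitespace but drops the line feed.
ws_not_lf = re.findall(r"[^\S\n]", text)

print(ws_default, ws_dotall)              # ['\n', '\t', ' '] twice
print(dot_default, dot_dotall is not None)  # None True
print(ws_not_lf)                          # ['\t', ' ']
```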
The second is about a misconception of what greediness means.
Greediness or lack thereof doesn't affect if a pattern can match, just what it matches. For example, for a given input, /a+/ and /a+?/ will both match, or neither will match. It's impossible for one to match and not the other.
"aaaa" =~ /a+/ # Matches 4 characters at position 0.
"aaaa" =~ /a+?/ # Matches 1 character at position 0.
"bbbb" =~ /a+/ # Doesn't match.
"bbbb" =~ /a+?/ # Doesn't match.
When something is greedy, it means it will match the most possible at the current position that allows the entire pattern to match. Take the following for example:
"ccccd" =~ /.*d/
This pattern can match by having .* match only cccc instead of ccccd, and thus does so. This is achieved through backtracking. .* initially matches ccccd, then it discovers that d doesn't match, so .* tries matching only cccc. This allows the d and thus the entire pattern to match.
You'll find backtracking used outside of greediness too. "efg" =~ /^(e|.f)g/ matches because it tries the second alternative when it's unable to match g when using the first alternative.
In the same way as .* avoids matching the d in the earlier example, the \s* avoids matching the LF and tab before dogs in your example.
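The examples above translate directly to other backtracking engines. This is a sketch in Python's re (chosen only for illustration) that records what each pattern actually matched.

```python
import re

# Greedy and lazy quantifiers succeed or fail together; only the length differs.
greedy_len = len(re.search(r"a+", "aaaa").group(0))
lazy_len = len(re.search(r"a+?", "aaaa").group(0))
greedy_miss = re.search(r"a+", "bbbb")
lazy_miss = re.search(r"a+?", "bbbb")

# .* gives back the final "d" so that the literal d can match.
dot_star = re.search(r"(.*)d", "ccccd").group(1)

# Backtracking into an alternation: "e" fails before g, so ".f" is tried.
alternative = re.search(r"^(e|.f)g", "efg").group(1)

print(greedy_len, lazy_len)    # 4 1
print(greedy_miss, lazy_miss)  # None None
print(dot_star, alternative)   # cccc ef
```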
[1] Requires use experimental qw( regex_sets ); before 5.36, but it has been safe to use since 5.18, as it was accepted without change from its introduction as an experimental feature.
[2] Less clear because it uses double negatives.
[^\S\n] = A char that's ( not( not(\s) or LF ) )
= A char that's ( not(not(\s)) and not(LF) )
= A char that's ( \s and not LF )
[3] Less efficient, and far from as pretty as the regex set.
I'm working in notepad++, and using its find-replace dialog box.
NP++ documentation states: Notepad++ regular expressions use the Boost regular expression library v1.70, which is based on PCRE (Perl Compatible Regular Expression) syntax. ref: https://npp-user-manual.org/docs/searching
What I'm trying to do should be simple, but I'm a regex novice, and after 2-3 hrs of web searches and playing with online regex testers, I give up.
I want to replace all single quotes ' with double quote " , but if and only if the ' is to the RIGHT of one or more #, ie inside a python comment.
For example,
list1 = ['apple','banana','pear'] # All 'single quotes' to LEFT of # remained unchanged.
list2 = ['tomato','carrot'] # All 'single quotes' to RIGHT of one or more # are replaced
# # with "double quotes", like this.
The np++ file is over 800 lines, manual replacement would be tedious & error prone. Advice appreciated.
This regex should do what you want:
(^[^#]*#|(?<!^)\G)[^'\n]*\K'
It looks for a ' which is preceded by either
^[^#]*# : start of line and some number of non-# characters followed by a #; or
(?<!^)\G : the start of line or the end of the previous match (\G), with a negative lookbehind for start of line (?<!^), meaning that it only matches at the end of the previous match
and then some number of characters other than ' or newline, [^'\n]* (the newline is excluded so that a match cannot run on past the end of the line).
We then use \K to reset the match, so that everything before that is discarded from the match, and the regex only matches the '.
That can then be replaced with ".
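If the file can also be processed outside Notepad++, the same effect can be had without \G and \K, which Python's re module does not support. The sketch below swaps in a different technique (splitting each line at its first #) and carries the same assumption as the regex above: no # appears inside a string literal. The function name requote_comment is made up for illustration.

```python
def requote_comment(line: str) -> str:
    """Replace ' with " only in the part of the line after the first #."""
    code, hash_mark, comment = line.partition("#")
    # partition() returns empty strings for hash_mark and comment when the
    # line contains no "#", so such lines pass through unchanged.
    return code + hash_mark + comment.replace("'", '"')


lines = [
    "list1 = ['apple','banana','pear']  # All 'single quotes' here",
    "no_comment = ['a','b']",
]
converted = [requote_comment(line) for line in lines]

for line in converted:
    print(line)
```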
Update
You can avoid matching apostrophes within words by only matching ones that are either preceded or followed by a non-word character:
(^[^#]*#|(?<!^)\G)[^'\n]*\K('(?=\W)|(?<=\W)')
Update 2
You can also deal with the case where there are # characters in strings by qualifying the first part of the regex with the requirement for there to be matched pairs of quotes beforehand:
(?:^[^'#]*(?:'[^']*'[^#']*)*[^'#]*#|(?<!^)\G)[^'\n]*\K(?:'(?=\W)|(?<=\W)')
I have some obfuscated code that calls functions, like this:
getAny([["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"], ...]);
setAny([["other text with \"()[],.;\""], ...]);...
Arguments contain random text. Functions follow each other without a new line.
How can I extract the arguments of getAny, setAny and other functions using a set of regular expressions?
I need this result:
regex1 result: [["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"], ...]
regex2 result: [["other text with \"()[],.;\""], ...]
...
I tried writing regex1:
getAny\((.*)\)
but the match also contains the setAny call:
[["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"], ...]);setAny([["other text with \"()[],.;\""], ...]
When I tried:
getAny\((.*?)\)
the match cuts the argument string short:
[["text with symbols \"(
I can't split on ; or ); because the text in the arguments can contain ; or );
Maybe it is impossible to do this with a regex?
Your regex needs to be \(.*?\); since your code is obfuscated (and presumably on one line).
Note that this will fail if one of your arguments contains ); inside of it.
Explanation (From Regex101.com):
/\((.*?)\);/g
\( matches the character ( literally
1st Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\) matches the character ) literally
; matches the character ; literally
g modifier: global. All matches (don't return on first match)
The main problem with your regex is that you never specified ; to end a match, so it went ahead and grabbed up until the last ) it saw because you used .*, which is greedy (grabs everything) unless followed by ?.
I'm not sure I understand your question, but if I do, you could maybe use a group and collect the allowed characters in it.
Your regex could be: \( ( ) " [ ],\.; a-zA-Z \)
outer brackets enclose the group
If I understand your pattern correctly, your function argument will always start with [[" and end with "]].
Regex:
/getAny\((\[\[".*?[^\\]"\]\])\);/
Demo: http://regex101.com/r/jC3vX5/2
Note the lazy .*?, and [^\\] to make sure the matching quote is not escaped.
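Here is a rough sketch of that idea in Python (an assumption; the question does not name a language). The input string is hypothetical: the question elides part of each argument with "...", so I assume each call takes exactly one [["..."]] argument. I also replaced the fixed name getAny with \w+ so that one pattern picks up every call.

```python
import re

# Hypothetical input: each call takes exactly one [["..."]] argument.
code = r'''getAny([["text \"()[],.;\" and 'Fn'"]]);setAny([["other \"()[],.;\""]]);'''

# Lazy .*? plus [^\\] so that the closing quote is not an escaped one.
call_re = re.compile(r'(\w+)\((\[\[".*?[^\\]"\]\])\);')
calls = call_re.findall(code)

for name, arg in calls:
    print(name, "->", arg)
```

As with the original, this breaks if the text itself contains an unescaped "]]); sequence.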
I have regular expression:
BEGIN\s+\[([\s\S]*?)END\s+ID=(.*)\]
which selects multiline text and an ID from the text below. I would like to select only IDs with the prefix X_, but if I change ID=(.*) to ID=(X_.*), the match begins at the second BEGIN instead of the third, as I need. Could someone help me get the correct expression, please?
text example:
BEGIN [
text a
END ID=X_1]
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
text xxx
BEGIN [
text bbb
END ID=X_3]
It isn't the .* that's gobbling everything up as people keep saying, it's the [\s\S]*?. .* can't do it because (as the OP said) the dot doesn't match newlines.
When the END\s+ID=(X_.*)\] part of your regex fails to match the last line of the second block, you're expecting it to abandon that block and start over with the third one. That's what it would have to do to make the shortest match.
In reality, it backtracks to the beginning of the line and lets [\s\S]*? consume it instead. And it keeps on consuming until it finds a place where END\s+ID=(X_.*)\] can match, which happens to be the last line of the third block.
The following regex avoids that problem by matching line by line, checking each one to see if it starts with END. This effectively confines the match to one block at a time.
(?m)^BEGIN\s+\[[\r\n]+((?:(?!END).*[\r\n]+)*)END\s+ID=(X_.*)\]
Note that I used ^ to anchor each match to the beginning of a line, so I used (?m) to turn on multiline mode. But I did not--and you should not--turn on single-line/DOTALL mode.
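To see the line-by-line regex at work, here is a small sketch in Python (my choice for illustration; the question does not say which engine is in use). The sample text is the one from the question, and the regex is used unchanged.

```python
import re

text = "\n".join([
    "BEGIN [", "text a", "END ID=X_1]",
    "BEGIN [", "text b", "text c", "END ID=Y_1]",
    "text aaa", "text bbb",
    "BEGIN [", "text d", "text e", "END ID=X_2]",
    "text xxx",
    "BEGIN [", "text bbb", "END ID=X_3]",
])

block_re = re.compile(
    r"(?m)^BEGIN\s+\[[\r\n]+((?:(?!END).*[\r\n]+)*)END\s+ID=(X_.*)\]"
)

# Pair each matching ID with the body lines of its own block.
found = [(m.group(2), m.group(1)) for m in block_re.finditer(text)]

for block_id, body in found:
    print(block_id, repr(body))
```

The Y_1 block is skipped entirely, and X_2 is paired with its own two lines rather than with the text of the block before it.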
Assuming there aren't any empty lines inside a block and the BEGIN/END statements are the first non-space of their line, I'd write the regex like this (Perl notation; change the delimiters and remove the comments, whitespace and the /x modifier if you use a different engine)
m{
\n \s* BEGIN \s+ \[ # match the beginning
( (?!\n\s*\n) .)*? # match anything that isn't an empty line
# checking with a negative look-ahead (?!PATTERN)
\n \s* END \s+ ID=X_[^\]]* \] # the ID may not contain "]"
}sx # /x: use extended syntax, /s: "." matches newlines
If the content may be anything, it might be best to create a list of all blocks, and then grep through them. This regex matches any block:
m{ (
BEGIN \s+ \[
.*? # non-greedy matching is important here
END \s+ ID=[^\]]* \] # greedy matching is safe here
) }xs
(add newlines if wanted)
Then only keep those matches that match this regex:
/ID = X_[^\]]* \] $/x # anchor at end of line
If we don't do this, backtracking may prevent a correct match ([\s\S]*? can contain END ID=X_). Your regex would put anything inside the blocks until it sees a X_.*.
So using BEGIN\s+\[([\s\S]*?)END\s+ID=(.*?)\] (note the extra question mark), one match would be:
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
…instead of failing at the Y_. A greedy match (your unchanged regex) should result in the whole file being matched: Your (.*) eats up all characters (until the end of file) and then goes back until it finds a ].
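The two-step approach described above (collect every block first, then keep only the ones whose ID starts with X_) can be sketched like this in Python. The language and the variable names are my assumptions; the two patterns are the ones given above with the /x whitespace removed.

```python
import re

text = "\n".join([
    "BEGIN [", "text a", "END ID=X_1]",
    "BEGIN [", "text b", "text c", "END ID=Y_1]",
    "text aaa", "text bbb",
    "BEGIN [", "text d", "text e", "END ID=X_2]",
    "text xxx",
    "BEGIN [", "text bbb", "END ID=X_3]",
])

# Step 1: match any block; DOTALL stands in for Perl's /s.
any_block = re.compile(r"BEGIN\s+\[.*?END\s+ID=[^\]]*\]", re.DOTALL)
# Step 2: keep only blocks that end in an X_ id.
wanted_id = re.compile(r"ID=X_[^\]]*\]$")

blocks = any_block.findall(text)
x_blocks = [b for b in blocks if wanted_id.search(b)]

print(len(blocks), len(x_blocks))  # 4 3
```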
EDIT:
Should you be using Perl's regex engine, we can use the (*FAIL) verb:
/BEGIN\s+\[(.*?)END\s+ID=(X_[^\]]*|(*FAIL))\]/s
"Either have an ID starting with X_ or the match fails". However, this does not solve the problem with END ID=X_1]-like statements inside your data.
Change your .* to a [^\]]* (i.e. match non-]s), so that your matches can't spill over past an END block, giving you something like BEGIN\s+\[([^\]]*?)END\s+ID=(X_[^\]]*)\]
I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word characters
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
use Data::Dumper;
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) {
print Dumper( \@matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But digits such as '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array; you can then sort through the array to see if it contains the data you want.
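The same behaviour shows up in other engines too. Below is a sketch in Python (an assumption on my part, since the answer itself uses Perl), where re.findall plays the role of a /g match in list context and [A-Za-z] stands in for \p{Alpha}.

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# A repeated group only remembers its last iteration.
m = re.match(r"(\d+/(\w+),?)*!", s)
whole, last_pair, last_word = m.group(0), m.group(1), m.group(2)

# A global search returns every capture instead.
all_words = re.findall(r"\d+/(\w+)", s)
alpha_words = re.findall(r"\d+/([A-Za-z]\w*)", s)

print(whole)        # 1/temperatoA,2/CelcieusB!
print(last_pair)    # 2/CelcieusB
print(all_words)    # ['temperatoA', 'CelcieusB', '33', '66']
print(alpha_words)  # ['temperatoA', 'CelcieusB']
```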
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string - i. e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E. g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
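For comparison, here is the same pattern in Python (an assumption; the question mentions no language). A raw string avoids the doubled backslashes, and the two-character lookbehind is fixed-width, so Python's re accepts it.

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# Lookbehind: digit and slash just before; lookahead: a "!" still to come.
words = re.findall(r"(?<=\d/)\w+(?=.*!)", s)

print(words)  # ['temperatoA', 'CelcieusB']
```

The numbers after the ! are rejected because the lookahead can no longer find a ! ahead of them.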
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.