Perl not matching regex?


I'm trying to remove all the comments in a bunch of SGF files, and have come up with the following Perl command:
perl -pi -e 's/P?C\[(?:[^\]\\]++|\\.)*+\]//gm' *.sgf
I want to match and remove a C or PC followed by a left bracket, then any characters that aren't right brackets (a right bracket inside the comment has to be escaped with a \), and then a closing right bracket.
I'm trying to match the following examples:
C[HelloBot9 [-\]: GTP Engine for HelloBot9 (white): HelloBot version 0.6.26.08]
PC[IA [-\]: GTP Engine for IA (black): GNU Go version 3.7.11
]
C[person [-\]: \\\]]
C[AyaMC [3k\]: GTP Engine for AyaMC (black): Aya version 6.61 : If you pass, AyaMC
will pass. When AyaMC does not, please remove all dead stones.]
And some examples that shouldn't be matched:
XYZ[Other stuff \]]
C[stuff\]
PC[stuff\\\]
The regex works in several online regex testers (including a few that state they are perl regex testers), but for some reason doesn't work on the command line. Help is appreciated.

You need to run perl with the -0777 option, which makes Perl read each file as a single string instead of line by line, so that matches spanning several lines can be found. Using perl -0777pi -e instead of perl -pi -e will solve the issue.
I would also suggest optimizing the pattern a bit by unrolling the alternation group, which makes the matching process "linear":
s/P?C\[[^]\\]*(?:\\.[^]\\]*+)*]//sg
Note that if PC should be matched as a whole word, add \b before P.
Pattern details:
P?C\[ - either PC[ or C[ literal char sequence
[^]\\]* - zero or more chars other than \ and ]
(?:\\.[^]\\]*+)* - zero or more sequences of:
\\. - a literal \ and then any char (.)
[^]\\]*+ - 0+ chars other than ] and \ (matched possessively, no backtracking into the pattern)
] - a literal ] symbol (note it does not have to be escaped outside the character class to denote a literal closing bracket)
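
To see why reading the whole file matters, here is a minimal sketch in Python (used purely for illustration, not Perl itself). It applies the unrolled pattern once to the whole text and once line by line. The function names and the sample string are made up for this example, and the possessive quantifier is left out because it needs Python 3.11+; the pattern should not depend on it for correctness.

```python
import re

# Unrolled pattern from the answer, minus the possessive quantifier.
# A negated character class matches newlines too, so comments may span lines.
COMMENT = re.compile(r'P?C\[[^\]\\]*(?:\\.[^\]\\]*)*\]', re.DOTALL)


def strip_comments_slurp(text):
    """Substitute over the whole text at once (comparable to perl -0777 -p)."""
    return COMMENT.sub('', text)


def strip_comments_per_line(text):
    """Substitute over each line separately (comparable to plain perl -p)."""
    lines = text.splitlines(keepends=True)
    return ''.join(COMMENT.sub('', line) for line in lines)


if __name__ == '__main__':
    # Hypothetical SGF fragment: a comment that ends on the next line.
    sample = "PC[IA [-\\]: GNU Go version 3.7.11\n]B[aa]"
    print(repr(strip_comments_slurp(sample)))     # comment removed
    print(repr(strip_comments_per_line(sample)))  # comment left in place
```

In the per-line version the closing ] sits on a different line from the opening PC[, so no single line contains a complete match.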

Related

How to match strings with at most n free characters between two well-defined patterns?

Let's say I have the text from a bunch of articles. I want to be able to grep for patterns related to COVID-19. How would I search for such a thing, considering that some people call it Cov2, CoV-2, COVID-2, COVID19, COVID-19, COVID 19, etc.?
Basically, that pattern I have so far is
grep "[Cc][Oo][Vv].{0,3}2\|[Cc][Oo][Vv].{0,3]19" file.txt
But this isn't working. I'm pretty sure the problem is the ".{0,3}" part. I'm not sure how to tell the computer to match up to 3 free characters, followed by 2 or 19, and preceded by [Cc][Oo][Vv]

Assuming you have GNU grep, your pattern contains two mistakes:
{0,3} - in a POSIX BRE pattern, a range quantifier is defined with a pair of escaped braces, \{0,3\}
{0,3] - same comment, and the closing brace got replaced with ].
You can use
grep -i -E "COV.{0,3}(2|19)" file
Or, a bit more precise:
grep -i -E "COV(ID)?[-[:space:]]?(2|19)" file
Details
-i - case insensitive mode
-E - POSIX ERE syntax enabled (to avoid extra \ symbols in the regex pattern)
COV.{0,3}(2|19) - COV substring (case insensitive), then any zero to three chars, and then either 2 or 19
(ID)?[-[:space:]]? - matches an optional ID substring, and then an optional - or a whitespace char.
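
As a rough cross-check, the two patterns can be tried against the spellings from the question. This is a Python sketch, not grep: re.IGNORECASE stands in for -i and \s for [[:space:]], and the "cover 2 pages" string is an invented example of a false positive.

```python
import re

# Python equivalents of the two ERE patterns above (assumed translation).
LOOSE = re.compile(r"COV.{0,3}(2|19)", re.IGNORECASE)
PRECISE = re.compile(r"COV(ID)?[-\s]?(2|19)", re.IGNORECASE)

VARIANTS = ["Cov2", "CoV-2", "COVID-2", "COVID19", "COVID-19", "COVID 19"]


def matches(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in VARIANTS + ["cover 2 pages"]:
        print(f"{text!r}: loose={matches(LOOSE, text)} precise={matches(PRECISE, text)}")
```

The loose pattern accepts any three characters after COV, so it also fires on unrelated text such as "cover 2 pages"; the more precise one does not.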

search string preceded by either a space or a slash

I have files with the content below:
76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356
417435aceb671e41213697055b86d860d9a9a61c remotes/origin/feature/ASC-122356-3762
ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master
I want to check whether the origin/master pattern is present in the file.
I'm currently using grep -e '^\S\+ origin/master$', but it doesn't match. How can I do it?

The following works with grep: one or more non-space characters, followed by a space, followed by a possibly empty sequence of non-space characters, followed by the expected string.
grep -P '\S+ \S*origin/master$' test
This can be improved to make sure origin is either at the beginning of the second column or preceded by a /, which eliminates strings like remotes/backup-origin/master:
grep -P '^\S+ (|\S*/)origin/master$' test
Note that these expressions require -P (Perl-compatible regexes).
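
The difference between the two expressions can be sketched in Python, whose re syntax is close enough to PCRE for these patterns. The function names are invented for the example, and the backup-origin line is a made-up test case built from the string mentioned above.

```python
import re

# Same patterns as the two grep -P commands above.
LOOSE = re.compile(r"\S+ \S*origin/master$")
STRICT = re.compile(r"^\S+ (|\S*/)origin/master$")


def has_ref(pattern, line):
    """Return True if the line matches the given pattern."""
    return pattern.search(line) is not None


LINES = [
    "76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356",
    "417435aceb671e41213697055b86d860d9a9a61c remotes/origin/feature/ASC-122356-3762",
    "ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master",
]

if __name__ == "__main__":
    for line in LINES:
        print(has_ref(STRICT, line), line)
```

Only the third line matches either pattern; a line ending in remotes/backup-origin/master would match the loose pattern but not the strict one.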

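Your pattern uses '^\S\+ ' to require that everything before origin/master is a run of non-space characters plus a single space, starting at the beginning of the line (because of the '^'), so the remotes/ prefix makes it fail.
Consider using a similar version, which only asks for ONE space followed by non-space characters: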
grep -e ' \S\+origin/master$'

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advise.
EDIT:
Basically, I'm inside a loop and only sometimes get files with this pattern. So currently I'm using grep "$i" test.txt, which works in all cases except when I encounter such patterns.
And I'm actually grepping for the exact file number and file sequence. So if it says 123_29887, it will be 123_29887. Thanks.

You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
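
For a quick check outside grep, the same expression can be tried in Python, which accepts this syntax unchanged. This is only a sketch: the variable and function names are invented, and the sample line is assembled from the file name in the question.

```python
import re

# Same expression as the grep -P command above.
PATTERN = re.compile(r"(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+")

NAME = "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887"
LINE = "NNN" + NAME + "_20140628.csv"


def find_name(line):
    """Return the matched portion of the line, or None if nothing matches."""
    match = PATTERN.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(find_name(LINE))
```

The match stops after the second group of digits, so the trailing _20140628.csv is not part of it.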
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.
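
The same idea, escaping a literal before using it as a regex, looks like this in Python, where re.escape plays the role of \Q...\E. This is an illustrative sketch, not the grep command itself, and find_literal is a hypothetical helper name.

```python
import re


def find_literal(needle, haystack):
    """Search for needle as literal text by escaping all regex metacharacters."""
    return re.search(re.escape(needle), haystack) is not None


NEEDLE = "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887"
LINE = "NNN" + NEEDLE + "_20140628.csv"

if __name__ == "__main__":
    print(find_literal(NEEDLE, LINE))
```

When the string is always a pure literal, a plain substring test (Python's in operator, or grep's -F fixed-string option) avoids regex syntax altogether.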

grep and sed regular expressions meaning - extracting urls from a web page

grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
After trawling the internet for an answer to my homework question, I finally got the above. But I don't completely understand the meaning of the two regular expressions used with sed and grep. Can somebody please shed some light on this? Thanks in advance.

The grep command looks for any lines that include a match to
'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'
which is
<a - the characters <a
[^>] - a character that is not a closing '>'
\+ - the previous item one or more times, so [^>]\+ is "one or more characters other than '>'"; this allows other attributes between <a and href while staying inside the tag
href - followed by the string 'href'
[ ]* - followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
= - followed by the equals sign
[ \t]* - followed by zero or more spaces or tabs ("white space"); note that inside a POSIX bracket expression grep treats \t as a literal backslash and t, so [[:blank:]] is the portable way to write this
" - followed by an opening quote (but only a double quote...)
\( - open bracket (grouping)
ht - characters 'ht'
\| - or
f - character f
\) - close group (of the either-or)
tp - characters 'tp'
s\? - optionally followed by s
Note - the last few lines combined mean 'http or https or ftp or ftps'
: - character :
[^"]\+ - one or more characters that are not a double quote; this is "everything until the next quote"
Does that get you started? You can do the same for the next bit...
Note to confuse you - the backslash is used to change the meaning of some special characters like ()+; just to keep everyone on their toes, whether these have special meaning with or without the backslash is not something that is defined by the regular expression syntax, but rather by the command in which you use it (and its options). For example, sed changes the meaning of things depending on whether you use the -E flag.
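
To see both stages of the pipeline working together, here is a rough Python approximation. It is a sketch under assumptions: the patterns are translated from BRE to Python syntax, [ \t] is written the way the author evidently intended it (space or tab), and extract_urls plus the sample HTML are invented for the example.

```python
import re

# Stage 1 (grep -i -o): find the opening part of an <a ...> tag up to the
# closing quote of an http(s)/ftp(s) href value.
ANCHOR = re.compile(r'<a[^>]+href[ ]*=[ \t]*"(?:ht|f)tps?:[^"]+"', re.IGNORECASE)

# Stage 2 (sed): keep only the last double-quoted string in each match.
QUOTED = re.compile(r'^.*"([^"]+)".*$')


def extract_urls(html):
    """Return the URLs found in href attributes, working line by line like grep."""
    urls = []
    for line in html.splitlines():
        for match in ANCHOR.finditer(line):
            inner = QUOTED.match(match.group(0))
            if inner:
                urls.append(inner.group(1))
    return urls


if __name__ == "__main__":
    page = ('<p><a class="x" href="https://example.com/a">A</a> '
            '<a href = "ftp://example.org/f">F</a> '
            '<a href="mailto:x@y.z">M</a></p>')
    print(extract_urls(page))
```

The greedy .* in the second pattern is what makes it pick the last quoted string, which is the URL even when other quoted attributes come before href.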

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?

There are several issues here.
Parens in Vim regexes do not capture by default; you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0+ ( and then a literal )". You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (\+ means "1 or more of the previous").
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
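
The two readings of the pattern can be compared in Python, where unescaped parentheses capture (as in Vim's "very magic" mode) and escaped ones are literal (as bare parentheses are in Vim's default mode). This is only an analogy, not Vim itself; the function name and sample strings are made up.

```python
import re

# What the original Vim pattern shop_(*) effectively asks for:
# zero or more literal "(" followed by a literal ")".
LITERAL_PARENS = re.compile(r"shop_\(*\)")


def add_wp_copy(line):
    """Python counterpart of :s/shop_\\(.\\+\\)/shop_\\1 wp_\\1/ on one line."""
    return re.sub(r"shop_(.+)", r"shop_\1 wp_\1", line, count=1)


if __name__ == "__main__":
    print(LITERAL_PARENS.search("shop_orders"))  # None: no parentheses in the text
    print(add_wp_copy("shop_orders"))
```

Because .+ is greedy, the capture runs to the end of the line, which is the same behaviour as the Vim command above.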

If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)

If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(.*)/shop_\1 wp_\1/

@Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.
