How to check for one Perl regex pattern while excluding another - regex

As an example, I need to filter the following text and search for the word example, but omit results in which the line begins with a hash (#):
# filler text example
example
example 2
# test example 3
I've tried a few different combinations, but can't seem to get this right.
Update
I've tried /^[^#].*example/g and /^(?!#).*example.*/g, but neither seemed to return any results.

It is strangely common to attempt to bundle far too much functionality into a single regex, while people don't seem to do the same thing with any other operator.
There is nothing wrong with writing
if ( /example/ and not /^#/ ) {
print;
}
and it is far clearer than any single equivalent regular expression.
You could change this to multiple statements if you wish; something like
while (<>) {
next if /^#/;
print if /example/;
}
Or you could allow comments to start in the middle of a line by creating a temporary variable that contains the text with all characters from the hash # onwards removed, and process that instead
while (<>) {
my ($trimmed) = /^([^#]*)/;
print if $trimmed =~ /example/;
}
Note that if you are hoping to process Perl code using this, there are cases that need special treatment because a hash doesn't denote the start of a comment, such as the $#array construct or an alternative pattern delimiter like m#example#.

^((?!#).*example.*)$ will work better with your regex101 tester. Also use the flags gm instead of just g, so that ^ and $ match at the start and end of each line rather than only at the ends of the whole text. The tester is processing the sample text as a single string, but devnull's answer works if you're processing the text line by line.
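A rough Python sketch of both approaches discussed above, assuming the four sample lines from the question as input. Python's re module stands in for Perl here, and the variable names are only illustrative:

```python
import re

# Sample text from the question, held as one multi-line string.
text = "# filler text example\nexample\nexample 2\n# test example 3\n"

# Approach 1: two simple tests per line instead of one combined regex.
two_tests = [
    line for line in text.splitlines()
    if re.search(r"example", line) and not re.match(r"#", line)
]

# Approach 2: a single regex over the whole string. re.MULTILINE plays the
# role of the m flag, so ^ and $ match at the start and end of every line.
single_regex = re.findall(r"^(?!#).*example.*$", text, flags=re.MULTILINE)

print(two_tests)
print(single_regex)
```

Both lists should hold only the two lines that do not start with a hash.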

Related

Powershell Regex - Replace between string A and string B only if contains string C

I have a file which looks like this
ABC01|01
Random data here 2131233154542542542
More random data
STRING-C
A bit more random stuff
&(%+
ABC02|01
Random data here 88888888
More random data 22222
STRING-D
A bit more random stuff
&(%+
I'm trying to make a script to Find everything between ABC01 and &(%+ ONLY if it contains STRING-C
I came up with this for regex ABC([\s\S]*?)STRING-C(?s)(.*?)&\(%\+
I'm getting this content from a text file with get-content.
$bad_content = gc $bad_file -raw
I want to do something like $bad_content.replace($pattern,"") to remove the regex match.
How can I replace my matches in the file with nothing? I'm not even sure if my regex is correct but on regex101 it seems to find the strings I'm needing.
Your regex works with the sample input given, but not robustly, because if the order of blocks were reversed, it would mistakenly match across the blocks and remove both.
Tim Biegeleisen's helpful answer shows a regex that fixes the problem, via a negative lookahead assertion ((?!...)).
Let me show how to make it work from PowerShell:
You need to use the regex-based -replace operator, not the literal-substring-based .Replace() method,[1] to apply it.
To read the input string from a file, use Get-Content's -Raw switch to ensure that the file is read as a single, multi-line string; by default, Get-Content returns an array (stream) of lines, which would cause the -replace operation to be applied to each line individually.
(Get-Content -Raw file.txt) -replace '(?s)ABC01(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+'
Not specifying replacement text (as the optional 2nd RHS operand to -replace) replaces the match with the empty string and therefore effectively removes what was matched.
The regex borrowed from Tim's answer is simplified a bit, by using the inline method of specifying matching options to turn on the single-line option ((?s)) at the start of the expression, which makes subsequent . instances match newlines too (a shorter and more efficient alternative to [\s\S]).
[1] See this answer for the juxtaposition of the two, including guidance on when to use which.
We can use a tempered dot trick when matching between the two markers to ensure that we don't cross the ending marker before matching STRING-C:
ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+
Here is an explanation of the regex pattern:
ABC01 match the starting marker
(?:(?!&\(%\+)[\s\S])*? without crossing the ending marker
STRING-C match the nearest STRING-C marker
[\s\S]*? then match all content, across lines, until reaching
&\(%\+ the ending marker
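As a hedged illustration, here is the same tempered-dot pattern applied in Python to the sample data from the question. The second input (tricky) is a made-up case, not from the question, in which the ABC01 block lacks STRING-C and a later block has it:

```python
import re

# Tempered-dot pattern from the answer above, unchanged.
pattern = re.compile(r"ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+")

block_c = ("ABC01|01\nRandom data here 2131233154542542542\n"
           "More random data\nSTRING-C\nA bit more random stuff\n&(%+\n")
block_d = ("ABC02|01\nRandom data here 88888888\n"
           "More random data 22222\nSTRING-D\nA bit more random stuff\n&(%+\n")

# The block containing STRING-C is removed; the other block is left alone.
cleaned = pattern.sub("", block_c + block_d)

# Hypothetical input: the ABC01 block has no STRING-C, but a later block does.
# The lookahead stops the match from running past the first &(%+,
# so nothing is removed.
tricky = "ABC01|01\nSTRING-D\n&(%+\nABC02|01\nSTRING-C\n&(%+\n"
untouched = pattern.sub("", tricky)

print(cleaned)
```

Only the newline that followed the removed block's end marker is left behind in cleaned.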

Force first letter of regex matched value to uppercase

I am trying to get better at regular expressions. I am using regex101.com. I have a regular expression that has two capturing groups. I am then using substitution to incorporate my captured values into another location.
For example I have a list of values:
fat dogs
thin cats
skinny cows
purple salamanders
etc...
and this captures them into two variables:
^([^\s]+)\s+([^\s;]+)?.*
which I then substitute into new sentences using $1 and $2. For example:
$1 animals like $2 are a result of poor genetics.
(obviously this is a silly example)
This works and I get my sentences made but I'm stumped trying to force $1 to have an uppercase first letter. I can see all sorts of examples on MATCHING uppercase or lowercase but not transforming to uppercase.
It seems I need to do some sort of "function" processing. I need to pass $1 to something that will then break it into two pieces...first letter and all the other letters....transform piece one to uppercase...then smash back together and return the result.
Add to that error checking...and while it is unlikely $1 will have numeric values we should still do a safety check of some sort.
So if someone can just point me to the reading material I would appreciate it.
A regular expression will only match what is there. What you are doing is essentially:
Match item
Display matches
but what you want to be doing is:
Match item
Modify matches
Display modified matches
A regular expression doesn't do any 'processing' on the matches, it is just a syntax for finding the matches in the first place.
Most languages have string processing. For instance, if you had your matches in the variables $1 and $2 as above, you would want to do something along the lines of:
$1 = upper(substring($1, 0, 1)) + substring($1, 1)
assuming upper() is your language's string uppercasing function, and substring() returns a sub-string (zero indexed).
Put very simply, regex can only replace from what is in your original string. There is no capital F in fat dogs so you can't get Fat dogs as your output.
This is possible in Perl, however, but only because Perl processes the text after the regex substitution has finished; it is not a feature of the regex itself. The following is a short Perl program (sans regex) that performs case transformation if run from the command line:
#!/usr/bin/perl -w
use strict;
print "fat dogs\n"; # fat dogs
print "\ufat dogs\n"; # Fat dogs
print "\Ufat dogs\n"; # FAT DOGS
The same escape sequences work in regexes too:
#!/usr/bin/perl -w
use strict;
my $animal = "fat dogs";
$animal =~ s/(\w+) (\w+)/\u$1 \U$2/;
print $animal; # Fat DOGS
Let me repeat though, it is Perl doing this, not the regex.
Depending on your real world example you may not have to change the case of the letter. If your input is Fat dogs then you will get the desired result. Otherwise, you will have to process $1 yourself.
In PHP you can use preg_replace_callback() to process the entire match, including captured groups, before returning the substitution string. Here is a similar PHP program:
<?php
$animal = "fat dogs";
print(preg_replace_callback('/(\w+) (\w+)/', 'my_callback', $animal)); // Fat DOGS
function my_callback($match) {
return ucfirst($match[1]) . ' ' . strtoupper($match[2]);
}
?>
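For comparison, a minimal Python sketch of the same callback idea, using re.sub with a function as the replacement. The helper name fix_case is made up for this example:

```python
import re

def fix_case(match):
    """Uppercase the first letter of group 1 and all of group 2."""
    first, second = match.group(1), match.group(2)
    # Slicing avoids str.capitalize(), which would also lowercase the rest.
    return first[:1].upper() + first[1:] + " " + second.upper()

result = re.sub(r"(\w+) (\w+)", fix_case, "fat dogs")
print(result)
```

As in the PHP version, the regex only finds the pieces; the case change happens in ordinary string code inside the callback.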
I think it can be very simple, depending on your language of choice. You can first loop over the list of values and find your match, then put the groups into your string, using a capitalize method on the first match:
import re
my_list = ["fat dogs", "thin cats"]
for val in my_list:
    m = re.match(r'^([^\s]+)\s+([^\s;]+)?.*', val)
    print("%s animals like %s are a result of poor genetics." % (m.group(1).capitalize(), m.group(2)))
But if you want to do it all with regex, it's very unlikely to be possible, because you need to modify your string and this is generally not a suitable task for regex.
So in the end the answer is that you CAN'T use regex to transform...that's not its job. Thanks to the input by others I was able to adjust my approach and still accomplish the objective of this self inflicted academic assignment.
First from the OP you'll recall that I had a list and I was capturing two words from that list into regex variables. Well I modified that regex capture to get three capture groups. So for example:
^(\S)(\S+)\s+(\S+)?.*
//would turn fat dogs into
//$1 = f, $2 = at, $3 = dogs
So then using Notepad++ I then replaced with this:
\u$1$2 animals like $3 are a result of poor genetics.
In this way I was able to transform the first letter to uppercase, but as others pointed out this is NOT regex doing the transform but another process (in this case Notepad++, but it could be your C#, Perl, etc.).
Thank You everyone for helping the newbie.

Inquiry about Perl Regexes

This question is related to one I asked yesterday. I'm new to Perl and am still getting the hang of things*. In the code, I am trying to replace right single quotation marks with apostrophes. However, I do not want to replace the right single quotation on singly quoted words. An example being:
He said the movie was 'magnificent.'
Here's the code I'm currently working with:
#!/usr/bin/perl
use strict;
use warnings;
# Subroutine prototype
sub problem_character();
my $previousPosition=0;
my $currentPosition=0;
#Locates problematic apostrophes and replaces them with properly encoded apostrophes
sub problem_character(){
while($_[0]=~m/\x{2019}/g){
$currentPosition=pos($_[0]);
pos($_[0])=$previousPosition;
unless(....){
$_[0]=~s/\x{2019}/\x{0027}/g;
}
$previousPosition=$currentPosition;
}
}
First off, I'm not sure what I would put in the unless check. I want to be able to check if the matched right single quote is part of a singly quoted word. Also, the Perl documentation says that the pos function returns the offset where the last m//g search left off. Does the replacement search also fall under this category? Finally, is there a simpler way of writing this type of code? Thanks.
*Does anyone know of a good book I could pick up that explains Perl in detail? I found the online resources to be quite confusing.
You posted you have the following:
He said the movie was 'magnificent.'
But you said you were trying to replace ’ which aren't present in that string. Do you actually have the following?
He said the movie was ‘magnificent.’
If so, the simple solution would be to replace all ’ that aren't matched by a preceding ‘. It's a bit tricky to implement, though.
s{
\G
(?: [^\x{2018}\x{2019}]++
| \x{2018} [^\x{2018}\x{2019}]*+ \x{2019}?+
)*+
\K
\x{2019}
}{'}xg;
Simpler (but a little less efficient) implementation:
$_ = reverse($_);
s/\x{2019}(?![^\x{2018}\x{2019}]*\x{2018})/'/g;
$_ = reverse($_);
By the way, you can actually use the characters ‘ and ’ in the regex pattern if you want. Just make sure to encode your file using UTF-8 and tell Perl you did that using use utf8;
use utf8; # Source code is encoded using UTF-8.
$_ = reverse($_);
s/’(?![^‘’]*‘)/'/g;
$_ = reverse($_);
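The reverse-and-lookahead trick can be sketched in Python as well. This is only an illustration: the function name fix_apostrophes and the sample sentence (with an added apostrophe in "wasn’t") are assumptions, not taken from the question:

```python
import re

LEFT, RIGHT = "\u2018", "\u2019"  # the left and right single quotation marks

def fix_apostrophes(text):
    """Replace each right quote that has no preceding left quote with '."""
    reversed_text = text[::-1]
    # In the reversed string, a closing quote is followed by its opening
    # quote with no other quote in between; the lookahead skips those.
    pattern = rf"{RIGHT}(?![^{LEFT}{RIGHT}]*{LEFT})"
    return re.sub(pattern, "'", reversed_text)[::-1]

sample = "He said the movie wasn\u2019t \u2018magnificent.\u2019"
print(fix_apostrophes(sample))
```

The string is reversed so that a fixed-position lookahead can do the job of a variable-length lookbehind, which Python's re module does not support.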

Regex for matching last two parts of a URL

I am trying to figure out the best regex to simply match only the last two strings in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue I have is that some strings can have a large number of periods, for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com and will return yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
under the conditions above?
You don't have to use regex, instead you can use a simple explode function.
So you're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
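The same split-and-take-the-last-two idea, sketched in Python for comparison. The function name last_two_labels is hypothetical:

```python
def last_two_labels(host):
    """Return the last two dot-separated parts of a host name."""
    return ".".join(host.split(".")[-2:])

print(last_two_labels("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
print(last_two_labels("www.stackoverflow.com"))
```

A host that already has only two parts is returned unchanged.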
I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to match till the end of the string. This way you'll be sure your regex engine won't catch the match from the very beginning.
Use grouping inside (...). It means the following: match a word that contains at least one word character, then a dot (backslashed because a dot has a special meaning in regex and we want it 'as is'), and then again a series of at least one word character.
Use reluctant search in the beginning of the pattern, because otherwise it will match everything in a greedy manner, for example, if your text is :
abc.def.gh
the greedy match will give f.gh in your group, and it's not what you want.
I assumed that you can have only letters in your host (\w matches word characters: letters, digits, and the underscore; your real data may need something more complicated).
I post here a working groovy example, you didn't specify the language you use but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
If you need a solution written as a Perl-compatible regular expression, which will work in a number of languages, you can use something like this (the example is in PHP):
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the domain name plus a two- or three-letter top-level domain from the end of the URL. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as an output, and with www.stackoverflow.com (with or without preceding triple w) it gives you
stackoverflow.com
as a result
A shorter version
/(\.[^\.]+){2}$/
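Note that this shorter pattern also captures the dot in front of the second-to-last part, and it needs at least two dots in the input to match at all. A hedged Python sketch comparing it with a small variation (my own, not from the answers above) that avoids both issues:

```python
import re

def last_two_parts(host):
    """Variation: two dot-free runs joined by one dot, anchored at the end."""
    m = re.search(r"[^.]+\.[^.]+$", host)
    return m.group(0) if m else None

# The shorter pattern from the answer, for comparison.
short = re.search(r"(\.[^\.]+){2}$", "www.stackoverflow.com").group(0)
no_www = re.search(r"(\.[^\.]+){2}$", "stackoverflow.com")

print(short)   # includes the leading dot
print(no_www)  # no match when there is only one dot
print(last_two_parts("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
```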

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
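For readers more at home in Python, a small sketch of the same ideas using re.findall on the sample string from the question. The variable names are illustrative only:

```python
import re

text = "text ###token1### text text ###token2### text text"

# Non-greedy with a capture group: only the text between the delimiters.
inner = re.findall(r"###(.+?)###", text)

# Non-greedy without a group: the delimiters are included in each match.
with_delims = re.findall(r"###.+?###", text)

# Greedy: one match that runs from the first ### to the last ###.
greedy = re.findall(r"###.+###", text)

print(inner)
print(with_delims)
print(greedy)
```

The greedy result shows why the ? matters: without it, both tokens and the text between them are swallowed as a single match.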
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 in the replacement expression (\1 for the first set, \2 for the second), assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
Well, when you are using delimiters such as this, you basically grab the opening delimiter, then anything that does not match the ending delimiter, followed by the ending delimiter. One caution: in cases like the example above, [^#] alone would not work as the check that the end delimiter is not there, since a single # inside the token would cause the regex to fail (i.e. "###foo#bar###"). In the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
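A quick Python check of this pattern, as a sketch. The only change is that the group is made non-capturing, so that re.findall returns whole matches, and the sample string is made up to include a token with a single # inside it:

```python
import re

# Same alternation as above, with a non-capturing group (?:...).
pattern = re.compile(r"###(?:[^#]|#[^#]|##[^#])*###")

# Hypothetical sample: the second token contains a lone # character.
sample = "text ###token1### text ###foo#bar### text"

tokens = pattern.findall(sample)
print(tokens)
```

Each alternative consumes either a non-# character or one or two # characters followed by a non-#, so the repetition can never swallow the closing ###.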