Perl RegExp optional search grouping - regex

Why doesn't this regexp fill the variable $1? And how can I make it do so?
my $txt = "abc def ghi jkl mnop";
$txt =~ /(def)?/;
$1 is undef
my $txt = "abc def ghi jkl mnop";
$txt =~ /(abc)?/;
This works instead.
It only works once a non-optional part exists, like "\s(def)?", but then the first part is not hit.
The thing is that I need a regexp which always returns true and also fills $1 and so on.
EDIT:
Thank you very much for your support. I'll give you a deeper insight into the problem.
In my code, $txt and the RegExp are user input. It should be possible to pick out individual words or pairs, regardless of their order.
My idea was to split the RegExp into order-independent parts and then test each one by itself.
Example:
/(ghi) jkl (def)? (ABC)?/
should be successful. So it is split into parts, and then several tests are run:
/(ghi) jkl/ && /(def)?/ && /(abc)?/
For each test, the captured scalars are added to an array. For this reason (order independence), it was natural for this completely optional RegExp to arise.
Please excuse my english.

The pattern (def)? matches either of two things: "def" or the empty string. It will try "def" first, but if it doesn't find it, it will succeed and match nothing.
It's possible to match the empty string at any location in any string, which means that it's possible to match the empty string at the very first position in your string, which means that there's no reason for the engine to look at any later position to see if it can find a "def" instead.
Without a better example of what you're trying to do it's hard to give advice, but you need to either modify the pattern so that it doesn't match the empty string, or provide some context to force it to attempt the match of "def" in the correct position.

Per a comment, you want the regex to always be true but capture the word if it exists.
This would be:
/^(?:.*?(def))?/s
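To make the two behaviours above concrete, here is a minimal sketch in Python rather than Perl (the group semantics are the same for this pattern). The helper name optional_capture is hypothetical and only for illustration.

```python
import re

txt = "abc def ghi jkl mnop"

# (def)? succeeds immediately with an empty match at position 0,
# so the engine never moves on to look for "def" later in the string.
empty = re.search(r"(def)?", txt)
print(empty.span(), empty.group(1))   # (0, 0) None

def optional_capture(word, text):
    """Always matches; returns the word if it is present, else None."""
    pattern = r"^(?:.*?(" + re.escape(word) + r"))?"
    return re.search(pattern, text, flags=re.DOTALL).group(1)

print(optional_capture("def", txt))   # def
print(optional_capture("xyz", txt))   # None
```

The lazy .*? inside the optional group forces the engine to walk forward looking for the word before it falls back to matching nothing.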

The regex used does not cover all the expected scenarios. '(def)?' matches 'def' only at the position where the engine is currently trying,
whereas you are trying to match 'def' wherever it is present.
The code below matches 'def' in the string irrespective of its position (note that, unlike an all-optional pattern, it fails when 'def' is absent).
$txt =~ /(.*)?(def)(.*)?/;
print "$2";

Related

Force first letter of regex matched value to uppercase

I am trying to get better at regular expressions. I am using regex101.com. I have a regular expression that has two capturing groups. I am then using substitution to incorporate my captured values into another location.
For example I have a list of values:
fat dogs
thin cats
skinny cows
purple salamanders
etc...
and this captures them into two variables:
^([^\s]+)\s+([^\s;]+)?.*
which I then substitute into new sentences using $1 and $2. For example:
$1 animals like $2 are a result of poor genetics.
(obviously this is a silly example)
This works and I get my sentences made but I'm stumped trying to force $1 to have an uppercase first letter. I can see all sorts of examples on MATCHING uppercase or lowercase but not transforming to uppercase.
It seems I need to do some sort of "function" processing. I need to pass $1 to something that will then break it into two pieces...first letter and all the other letters....transform piece one to uppercase...then smash back together and return the result.
Add to that error checking...and while it is unlikely $1 will have numeric values we should still do a safety check of some sort.
So if someone can just point me to the reading material I would appreciate it.
A regular expression will only match what is there. What you are doing is essentially:
Match item
Display matches
but what you want to be doing is:
Match item
Modify matches
Display modified matches
A regular expression doesn't do any 'processing' on the matches, it is just a syntax for finding the matches in the first place.
Most languages have string processing. For instance, if you had your matches in the variables $1 and $2 as above, you would want to do something along the lines of:
$1 = upper(substring($1, 0, 1)) + substring($1, 1)
assuming upper() is your language's string uppercasing function, and substring() returns a substring (zero-indexed).
Put very simply, regex can only replace from what is in your original string. There is no capital F in fat dogs so you can't get Fat dogs as your output.
This is possible in Perl, but only because Perl processes the text after the regex substitution has finished; it is not a feature of the regex itself. The following is a short Perl program (sans regex) that performs case transformation if run from the command line:
#!/usr/bin/perl -w
use strict;
print "fat dogs\n"; # fat dogs
print "\ufat dogs\n"; # Fat dogs
print "\Ufat dogs\n"; # FAT DOGS
The same escape sequences work in regexes too:
#!/usr/bin/perl -w
use strict;
my $animal = "fat dogs";
$animal =~ s/(\w+) (\w+)/\u$1 \U$2/;
print $animal; # Fat DOGS
Let me repeat though, it is Perl doing this, not the regex.
Depending on your real world example you may not have to change the case of the letter. If your input is Fat dogs then you will get the desired result. Otherwise, you will have to process $1 yourself.
In PHP you can use preg_replace_callback() to process the entire match, including captured groups, before returning the substitution string. Here is a similar PHP program:
<?php
$animal = "fat dogs";
print(preg_replace_callback('/(\w+) (\w+)/', 'my_callback', $animal)); // Fat DOGS
function my_callback($match) {
return ucfirst($match[1]) . ' ' . strtoupper($match[2]);
}
?>
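The same callback idea can be sketched in Python, where re.sub accepts a function as the replacement. This is only an illustration of the approach; the function name capitalize_groups is made up for the example.

```python
import re

def capitalize_groups(match):
    """Uppercase the first letter of group 1 and all of group 2."""
    first = match.group(1)
    second = match.group(2)
    return first[:1].upper() + first[1:] + " " + second.upper()

animal = "fat dogs"
result = re.sub(r"(\w+) (\w+)", capitalize_groups, animal)
print(result)  # Fat DOGS
```

As with the Perl and PHP versions, the case change happens in the host language, not in the regex itself.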
I think it can be very simple depending on your language of choice. You can first loop over the list of values and find your match, then put the groups into your string, using a capitalize method on the first matched group:
import re
for val in my_list:
    m = re.match(r'^([^\s]+)\s+([^\s;]+)?.*', val)
    print("%s animals like %s are a result of poor genetics." % (m.group(1).capitalize(), m.group(2)))
But if you want to do it all with regex, it's very unlikely to be possible, because you need to modify your string and this is generally not a suitable task for regex.
So in the end the answer is that you CAN'T use regex to transform...that's not its job. Thanks to the input by others I was able to adjust my approach and still accomplish the objective of this self-inflicted academic assignment.
First from the OP you'll recall that I had a list and I was capturing two words from that list into regex variables. Well I modified that regex capture to get three capture groups. So for example:
^(\S)(\S+)\s+(\S+)?.*
//would turn fat dogs into
//$1 = f, $2 = at, $3 = dogs
Then, using Notepad++, I replaced with this:
\u$1$2 animals like $3 are a result of poor genetics.
In this way I was able to transform the first letter to uppercase, but as others pointed out this is NOT regex doing the transform but another process (in this case Notepad++, but it could be your C#, Perl, etc.).
Thank You everyone for helping the newbie.

adding regex pattern in string in perl

I want to compare a string from one file with a string from another. The second file may contain some elements (tags), and such an element can occur anywhere and any number of times.
Note: these tags need to be retained in the final output.
For example:
I want to compare the word 'scripting'. The <match> tag indicates the word to be matched in $str2.
$str1 = "perl is an <match>scripting</match> language";
$str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age";
Output required :
perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
I am adding a pattern after each character:
$str1 =~ s{(.)}
{
'$&(?:(?:<?...?>|\n)+)?'
}esgi;
This works for a few cases, but for others it keeps running and never finishes. Please suggest a fix.
(?:(?:<?...?>|\n)+)? is the same as (?:<?...?>|\n)*.
Also, you don't want to add the pattern after each character, only in between the characters of the matched part of $str1. So no pattern before the first character and none after the last. Otherwise the replacement will also wrap around those tags, whereas you want the <match> tags to sit directly at the front and back of the word.
My guess is that if you are running that first replace command over all of $str1, you may end up with quite a large string.
Also see my answer to a related question.
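A rough sketch of that advice, written in Python for illustration: the tag pattern is joined only between the characters of the word, so nothing is added before the first or after the last character. The exact tag syntax is an assumption here (a literal "<?", any run of non-"?" characters, then "?>"), and wrap_match is a hypothetical helper name.

```python
import re

# Assumption: a tag looks like <? ... ?>; newlines may also be interleaved.
TAG = r"(?:<\?[^?]*\?>|\n)*"

def wrap_match(word, text):
    """Wrap the word in <match> tags even when tags split its characters."""
    # Join the escaped characters with the optional tag pattern in between.
    pattern = TAG.join(re.escape(ch) for ch in word)
    return re.sub(pattern,
                  lambda m: "<match>" + m.group(0) + "</match>",
                  text,
                  flags=re.IGNORECASE)

str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age"
print(wrap_match("scripting", str2))
# perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
```

Building the pattern only for the word being searched, instead of for every character of the whole string, also keeps the generated regex small.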

Match everything except every given combination

Given a string, for example abbbabf,
and a given piece, for example ab,
I need to remove all characters except every occurrence of the piece; that is, from abbbabf I must get the result abab.
What regex pattern would do this?
Edit
Let's take PHP as an example.
It is simple to remove everything except the piece if the piece is just one symbol; that is, if the piece is a, I can do
$str = "abbbabf";
echo preg_replace("#[^a]#", "", $str);
and the result is aa.
But I have no idea how to do this when the piece is more than one symbol...
Please don't give solutions such as:
preg_match_all("#ab#", $str, $a);
echo implode($a[0]);
Thanks
PS. I need to do this in an Oracle database, and it would be cool to find a solution (one pattern) without procedure handling.
The following can do it using capture groups rather than assertions:
$str = "helloababblolobbbabf";
//           ^^^^        ^^
echo preg_replace("#.*?(ab|$)#", "$1", $str);
// Output: ababab
Since you say you're actually working in Oracle, you can use REGEXP_REPLACE:
REGEXP_REPLACE(input, '.*?(ab|$)', '\1')
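For comparison, the same capture-group trick can be sketched in Python. The function name keep_only is hypothetical; the pattern is the one from this answer with the piece escaped and substituted in.

```python
import re

def keep_only(piece, text):
    """Remove everything from text except occurrences of piece."""
    # Lazily skip characters up to the next piece (kept via group 1)
    # or up to the end of the string (group 1 is then empty).
    pattern = r".*?(" + re.escape(piece) + r"|$)"
    return re.sub(pattern, r"\1", text, flags=re.DOTALL)

print(keep_only("ab", "abbbabf"))               # abab
print(keep_only("ab", "helloababblolobbbabf"))  # ababab
```

Each match consumes the junk together with the following piece, and the replacement puts back only the piece.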
The expression you need to use is this:
((?<=ab|^).*?(?=ab|$))
From the string abbbabfasdfsdfsdfab, ababab is returned.
See it in action: http://regex101.com/r/nT8mC1
Caveat: as Bart points out in a comment, Oracle doesn't implement much of the PCRE standard, so this simply won't work there. You'll have to look at implementing some sort of capture set where you can capture the string you want and rebuild it with implode (which you apparently don't want to do).
Edit: added suggestion for conditional from comments.

Regex to match suffixes to english words

I'm searching for the word "move" and I want to match "moved" as well when I print.
The way I'm going about this is:
if ($sentence =~ /($search_key)d$/i) {
$search_key = $search_keyd;
}
$subsentences[$i] =~ s/$search_key/ **$search_key** /i;
$subsentences[$i] =~ s/\b$parsewords[1]_\w+/ --$parsewords[1]--/i;
print "MATCH #$count\n",split(/_\S+/,$subsentences[$i]), "\n";
$count++;
This is part of a longer code so if anything is unclear let me know. The _ is because the words in the sentence are tagged (ex. I_NN move_VB to_PREP ....).
Where $search_keyd will be $search_key."d", which worked!
A nice addition would be to check if the word ended in e and therefore only a d would need to be appended. I'd guess it'd look something like this: e?$/d$
Even a general answer will suffice.
I'm new to Perl. So sorry if this is elementary. Thanks in advance!!!
If I understand you correctly, you want to search for "move" and add a highlight, but also include any variation of the basic word, such as "moves" "moved".
When you are replacing words in a text like this, you usually want to replace all the words, and then you need the /g modifier on the regex, like so:
$subsentences[$i] =~ s/$search_key/ **$search_key** /ig;
Also, you should make sure not to match parts of words. E.g. you want to match "move", but perhaps not "remove". For this, you can use \b to mark a word boundary:
$subsentences[$i] =~ s/\b$search_key/ **$search_key** /ig;
In order to match certain suffixes, you need a character class of valid characters, or an alternation of character combinations. move[sd] will find "moves" and "moved". However, for a word like "jump", you would need to be a bit more specific: "jump(s|ed)". Note that [sd] can be replaced with (s|d). So barring any bad spelling in your text, you can get away with:
$subsentences[$i] =~ s/\b$search_key(s|d|ed)/ **$search_key$1** /ig;
Note that $1 holds whatever was matched inside the first set of capturing parentheses.
To find the number of matching words, use the return value of the substitution:
my $matches = $subsentences[$i] =~ s/\b$search_key(s|d|ed)/ **$search_key$1** /ig;
If you want to be more specific with the suffixes, i.e. make it not match badly spelled words like "moveed", you'd need to do some special matching. Something like:
my $suffix;
if ($search_key =~ /e$/i) { $suffix = '(s|d)' }
else { $suffix = '(s|ed)' }
my $matches = $subsentences[$i] =~ s/\b$search_key$suffix/ **$search_key$1** /ig;
It can probably become very complicated the more search words you add.
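As a rough illustration of the suffix selection above, here is a Python sketch. It departs from the Perl in two labeled ways: the suffix is made optional so the bare word is highlighted too, and a (?![A-Za-z]) lookahead is added (my assumption, not part of the answer) so that "movers" is not half-matched while tagged words like moved_VB still are. The name highlight is hypothetical.

```python
import re

def highlight(word, sentence):
    """Return (new_sentence, count) with word and its suffixed forms starred."""
    # Words ending in "e" take "s" or "d"; other words take "s" or "ed".
    suffix = r"(?:s|d)?" if word.lower().endswith("e") else r"(?:s|ed)?"
    pattern = r"\b(" + re.escape(word) + suffix + r")(?![A-Za-z])"
    # re.subn returns the new string and the number of substitutions made.
    return re.subn(pattern, r" **\1** ", sentence, flags=re.IGNORECASE)

text, count = highlight("move", "I_NN moved_VB and_CC they_NN remove_VB")
print(count)  # 1
print(text)
```

The count returned by re.subn plays the same role as the return value of s///g in the Perl version.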
Some help about regexes here
If what you want is to match all complete words which begin with your search term, i.e. 'move' matches 'move', 'moved', 'movers', etc, then you want to use a character class to detect the end of the word.
So, instead of:
if ($sentence =~ /($search_key)d$/i)
Try using:
if ($sentence =~ /($search_key\w*)\W$/i)
The \w* will match any number of standard word characters and the \W should prevent you from including other characters, such as whitespace or punctuation.

regex string does not contain substring

I am trying to match a string which does not contain a substring.
My string always starts with "http://www.domain.com/".
The substring I want to exclude from matches is ".a/", which comes right after that prefix (a folder name under the domain).
There will be characters in the string after the substring I want to exclude.
For example:
"http://www.domain.com/.a/test.jpg" should not be matched
But "http://www.domain.com/test.jpg" should be
Use a negative lookahead assertion:
^http://www\.domain\.com/(?!\.a/).*$
The part (?!\.a/) fails the match if the domain prefix is immediately followed by the string .a/.
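A small Python sketch of the lookahead pattern, using the example URLs from the question; the function name is_allowed is only for illustration.

```python
import re

# Same pattern as above: the prefix must not be followed directly by ".a/".
URL_RE = re.compile(r"^http://www\.domain\.com/(?!\.a/).*$")

def is_allowed(url):
    """True if the URL starts with the domain and is not under /.a/."""
    return URL_RE.match(url) is not None

print(is_allowed("http://www.domain.com/test.jpg"))     # True
print(is_allowed("http://www.domain.com/.a/test.jpg"))  # False
```

Because the lookahead consumes no characters, the trailing .* still matches the whole path when the check passes.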
My advice in such cases is not to construct overly complicated regexes with negative lookahead assertions or similar.
Keep it simple and stupid!
Do 2 matches, one for the positives, and sort out later the negatives (or the other way around). Most of the time, the regexes become easier, if not trivial.
And your program gets clearer.
For example, to extract all lines with foo, but not foobar, I use:
grep foo | grep -v foobar
I would try:
^http:\/\/www\.domain\.com\/([^.]|\.[^a]).*$
You want to match your domain, plus everything that does not continue with a ., and everything that does continue with a . but is not then followed by an a. (You can add your / afterwards if needed.)
If you don't use lookahead but just simple regexes, you can simply say: it matches your domain but doesn't match your domain followed by .a/
<?php
function foo($s) {
$regexDomain = '{^http://www\.domain\.com/}';
$regexDomainBadPath = '{^http://www\.domain\.com/\.a/}';
return preg_match($regexDomain, $s) && !preg_match($regexDomainBadPath, $s);
}
var_dump(foo('http://www.domain.com/'));
var_dump(foo('http://www.otherdomain.com/'));
var_dump(foo('http://www.domain.com/hello'));
var_dump(foo('http://www.domain.com/hello.html'));
var_dump(foo('http://www.domain.com/.a'));
var_dump(foo('http://www.domain.com/.a/hello'));
var_dump(foo('http://www.domain.com/.b/hello'));
var_dump(foo('http://www.domain.com/da/hello'));
?>
Note that http://www.domain.com/.a will pass the test, because it doesn't end with /.