Find words in between two words in regular expressions - regex

I have a file where I want to match a certain word between keywords using regular expressions. For example, let's say I want to match every occurrence of the word "dog" AFTER the keyword "start" and BEFORE the keyword "end".
dog horse animal cat dog // <-- don't match
random text dog // <-- don't match
start
brown dog
black dog
cat horse animals
end
dog cat // <-- don't match
good dog // <-- don't match
Maybe regex has a pipe feature where I can get the text after the word "start" and before the word "end", then pipe it into a new regular expression? Then I could just search for "dog" in the second regular expression. I am new to regular expressions and have been struggling to come up with a solution. Thanks

When you are matching "globally" (i.e. collecting several non-contiguous matches) and you add a stipulation such as "matches must all exist in a container" (in this case, between "start" and "end"), this generally calls for a construct such as PCRE's '\G', which matches only at the position where the previous match ended (or at the start of the subject on the first attempt):
(?:\G(?!\A)|start)(?:(?!end).)*?\Kdog
See it in action at: https://regex101.com/r/uV7EjE/1
It's important to note that this uses some constructs that are not universally supported, and one specific to PCRE ('\K'). An explanation of each part:
/(?:
\G(?!\A) # Match only where the previous match ended, but not at the very start of the string. Regex normally attempts a match at every position; this ensures we continue only from immediately after the last valid "dog".
|start # Or match "start".
)
(?:(?!end).)*? # Match as few characters as possible, making sure we don't encounter "end".
\K # Reset the consumption counter so everything before this isn't matched.
dog # Match what we want.
/gmsx
If instead you need something with wider support for more basic regex engines, then you do indeed need to pipe a simpler expression, for instance start.*?end to match a complete group, then check its contents for all occurrences of "dog".
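The two-step approach described above can be sketched in Python. This is only a minimal sketch, assuming the keywords appear literally and that start/end blocks do not nest; find_between is a hypothetical helper name, not something from the question.

```python
import re

# Sample input from the question.
text = """dog horse animal cat dog
random text dog
start
brown dog
black dog
cat horse animals
end
dog cat
good dog
"""

def find_between(text, word, start="start", end="end"):
    """Return every occurrence of `word` found between `start` and `end`."""
    # Step 1: grab each start...end block (lazy, so blocks are not merged).
    block_re = re.escape(start) + r"(.*?)" + re.escape(end)
    blocks = re.findall(block_re, text, flags=re.DOTALL)
    # Step 2: search each block for the word itself.
    word_re = r"\b" + re.escape(word) + r"\b"
    hits = []
    for block in blocks:
        hits.extend(re.findall(word_re, block))
    return hits

print(len(find_between(text, "dog")))  # 2
```

Only the two "dog" lines inside the block are counted; the four occurrences outside it are ignored.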

Update:
start(.?)(dog)+(.?)end
Test on the below link, here is a screen:
previous:
(please, note this might not answer exactly your case because it heavily depends on what language you are working)
It also depends on the language you are developing in, as the other comments are saying. If you can let me know what language you are using, I might be able to give you a better answer.
Also you can use this to debug https://regex101.com/

I know you're asking for regex, but if you're using a certain language there may be more apt solutions. For example, in PHP this function would work:
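function getStringBetween($string, $start, $end) {
    // strpos() returns false when the marker is missing, so compare strictly.
    $ini = strpos($string, $start);
    if ($ini === false) return "";
    $ini += strlen($start);
    // Look for the end marker only after the start marker.
    $stop = strpos($string, $end, $ini);
    if ($stop === false) return "";
    return substr($string, $ini, $stop - $ini);
}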

Related

Force first letter of regex matched value to uppercase

I am trying to get better at regular expressions. I am using regex101.com. I have a regular expression that has two capturing groups. I am then using substitution to incorporate my captured values into another location.
For example I have a list of values:
fat dogs
thin cats
skinny cows
purple salamanders
etc...
and this captures them into two variables:
^([^\s]+)\s+([^\s;]+)?.*
which I then substitute into new sentences using $1 and $2. For example:
$1 animals like $2 are a result of poor genetics.
(obviously this is a silly example)
This works and I get my sentences made but I'm stumped trying to force $1 to have an uppercase first letter. I can see all sorts of examples on MATCHING uppercase or lowercase but not transforming to uppercase.
It seems I need to do some sort of "function" processing. I need to pass $1 to something that will then break it into two pieces...first letter and all the other letters....transform piece one to uppercase...then smash back together and return the result.
Add to that error checking...and while it is unlikely $1 will have numeric values we should still do a safety check of some sort.
So if someone can just point me to the reading material I would appreciate it.
A regular expression will only match what is there. What you are doing is essentially:
Match item
Display matches
but what you want to be doing is:
Match item
Modify matches
Display modified matches
A regular expression doesn't do any 'processing' on the matches, it is just a syntax for finding the matches in the first place.
Most languages have string processing. For instance, if you had your matches in the variables $1 and $2 as above, you would want to do something along the lines of:
$1 = upper(substring($1, 0, 1)) + substring($1, 1)
assuming upper() is your language's string uppercasing function, and substring() returns a sub-string (zero indexed).
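As one concrete illustration of the "match, modify, display" idea, here is a small Python sketch. It assumes a slightly simplified version of the asker's pattern and the sentence template from the question; build_sentence is a hypothetical name chosen for this example.

```python
import re

# Simplified form of the pattern from the question: first word, second word.
PATTERN = re.compile(r"^(\S+)\s+(\S+)?.*")

def build_sentence(line):
    """Match, modify the first group, then build the output sentence."""
    m = PATTERN.match(line)
    if m is None:
        return None  # line did not contain two whitespace-separated parts
    first = m.group(1)
    second = m.group(2) or ""
    # The modification step happens in the host language, not in the regex.
    first = first[:1].upper() + first[1:]
    return f"{first} animals like {second} are a result of poor genetics."

print(build_sentence("fat dogs"))
# Fat animals like dogs are a result of poor genetics.
```

Slicing is used instead of str.capitalize() because capitalize() would also lowercase the remaining letters.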
Put very simply, regex can only replace from what is in your original string. There is no capital F in fat dogs so you can't get Fat dogs as your output.
This is possible in Perl, however, but only because Perl processes the text after the regex substitution has finished, it is not a feature of the regex itself. The following is a short Perl program (sans regex) that performs case transformation if run from the command line:
#!/usr/bin/perl -w
use strict;
print "fat dogs\n"; # fat dogs
print "\ufat dogs\n"; # Fat dogs
print "\Ufat dogs\n"; # FAT DOGS
The same escape sequences work in regex substitutions too:
#!/usr/bin/perl -w
use strict;
my $animal = "fat dogs";
$animal =~ s/(\w+) (\w+)/\u$1 \U$2/;
print $animal; # Fat DOGS
Let me repeat though, it is Perl doing this, not the regex.
Depending on your real world example you may not have to change the case of the letter. If your input is Fat dogs then you will get the desired result. Otherwise, you will have to process $1 yourself.
In PHP you can use preg_replace_callback() to process the entire match, including captured groups, before returning the substitution string. Here is a similar PHP program:
<?php
$animal = "fat dogs";
print(preg_replace_callback('/(\w+) (\w+)/', 'my_callback', $animal)); // Fat DOGS
function my_callback($match) {
return ucfirst($match[1]) . ' ' . strtoupper($match[2]);
}
?>
I think it can be very simple, depending on your language of choice. You can first loop over the list of values and find your match, then put the groups into your string, using a capitalize method on the first group:
import re
# my_list is assumed to hold the input values, e.g. ['fat dogs', 'thin cats']
for val in my_list:
    m = re.match(r'^([^\s]+)\s+([^\s;]+)?.*', val)
    print("%s animals like %s are a result of poor genetics." % (m.group(1).capitalize(), m.group(2)))
But if you want to do it all with regex, it's very unlikely to be possible, because you need to modify your string and this is generally not a suitable task for regex.
So in the end the answer is that you CAN'T use regex to transform...that's not its job. Thanks to the input by others I was able to adjust my approach and still accomplish the objective of this self inflicted academic assignment.
First from the OP you'll recall that I had a list and I was capturing two words from that list into regex variables. Well I modified that regex capture to get three capture groups. So for example:
^(\S)(\S+)\s+(\S+)?.*
//would turn fat dogs into
//$1 = f, $2 = at, $3 = dogs
So then using Notepad++ I then replaced with this:
\u$1$2 animals like $3 are a result of poor genetics.
In this way I was able to transform the first letter to uppercase...but as others pointed out this is NOT regex doing the transform but another process. (In this case Notepad++, but it could be your C#, Perl, etc.)
Thank You everyone for helping the newbie.

RegEx: Word immediately before the last opened parenthesis

I have a little knowledge of RegEx, but at the moment this is far beyond my abilities.
I need help finding the text before the last open parenthesis that doesn't have a matching close parenthesis.
(It is for CallTip of a open source software in development.)
Below some examples:
----------------------------------
Text                      I need
----------------------------------
aaa(                      aaa
aaa(x)                    ''
aaa(bbb(                  bbb
aaa(y=bbb(                bbb
aaa(y=bbb()               aaa
aaa(y <- bbb()            aaa
aaa(bbb(x)                aaa
aaa(bbb(ccc(              ccc
aaa(bbb(x), ccc(          ccc
aaa(bbb(x), ccc()         aaa
aaa(bbb(x), ccc())        ''
----------------------------------
Is it possible to write a RegEx (PCRE) for these situations?
The best I got was \([^\(]+$ but, it is not good and it is the opposite of what I need.
Anyone can help please?
Take a look at this JavaScript function
var recreg = function(x) {
var r = /[a-zA-Z]+\([^()]*\)/;
while(x.match(r)) x = x.replace(r,'');
return x
}
After applying this you are left with all the unmatched parts, which have no closing parenthesis, and we just need the last alphabetic word.
var lastpart = function(y) { return y.match(/([a-zA-Z]+)\([^(]*$/); }
The idea is to use it like
lastpart(recreg('aaa(y <- bbb()'))
Then check if the result is null, or else take the matching group, which will be result[1]. Most regex engines don't support the (?R) construct, which is needed for recursive regex matching.
Note that this is a sample JavaScript representation which simulated recursive regex.
Read http://www.catonmat.net/blog/recursive-regular-expressions/
This works correctly on all your sample strings:
\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)
The most interesting part is this:
(?:[^()]*\([^()]*\))*
It matches zero or more balanced pairs of parentheses along with the non-paren characters before and between them (like the y=bbb() and bbb(x), ccc() in your sample strings). When that part is done, the final [^()]*$ ensures that there are no more parens before the end of the string.
Be aware, though, that this regex is based on the assumption that there will never be more than one level of nesting. In other words, it assumes these are valid:
aaa()
aaa(bbb())
aaa(bbb(), ccc())
...but this isn't:
aaa(bbb(ccc()))
The string aaa(bbb(ccc( in your samples seems to imply that multi-level nesting is indeed permitted. If that's the case, you won't be able to solve your problem with regex alone. (Sure, some regex flavors support recursive patterns, but the syntax is hideous even by regex standards. I guarantee you won't be able to read your own regex a week after you write it.)
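For readers who want to check the lookahead pattern against the sample table, a small Python harness might look like this. It is a sketch only: the SAMPLES dictionary is copied from the question, and word_before_open_paren is a hypothetical helper name.

```python
import re

# The single-regex answer, unchanged.
PATTERN = re.compile(r"\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)")

# Sample inputs and expected results from the question ('' means no match).
SAMPLES = {
    "aaa(": "aaa",
    "aaa(x)": "",
    "aaa(bbb(": "bbb",
    "aaa(y=bbb(": "bbb",
    "aaa(y=bbb()": "aaa",
    "aaa(y <- bbb()": "aaa",
    "aaa(bbb(x)": "aaa",
    "aaa(bbb(ccc(": "ccc",
    "aaa(bbb(x), ccc(": "ccc",
    "aaa(bbb(x), ccc()": "aaa",
    "aaa(bbb(x), ccc())": "",
}

def word_before_open_paren(text):
    """Return the word before the last unclosed '(' or '' if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else ""

for text, expected in SAMPLES.items():
    assert word_before_open_paren(text) == expected, text
```

The one-level nesting limitation still applies; inputs such as aaa(bbb(ccc())) are outside what this pattern is meant to handle.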
A partial solution - this is assuming that your regex is called from within a programming language that can loop.
1) prune the input: find matching parentheses, and remove them with everything in between. Keep going until there is no match. The regex would look for \([^()]*\) - open parenthesis, any number of non-parenthesis characters, close parenthesis. It has to be part of a "find and replace with nothing" loop. This trims "from the inside out".
2) after the pruning you have either no parentheses left, or only unmatched ones. Now you have to find a word just before an open parenthesis. This requires a regex like \w+\(. But that won't pick the right one if there are multiple unclosed parentheses. Taking the last one could be done with a greedy match (with grouping around the last word): ^.*\b(\w+)\( "as many characters as you can up to a word before a parenthesis" - this will find the last one.
I am saying "partial" solution because, depending on the environment you are using, how you refer to "this matching group" and whether you need to put a backslash before the ( and ) varies. I could not check those details as it's hard to test on my iPhone.
I hope this inspires you or others to come up with a complete solution.
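The prune-then-search idea above could be sketched in Python roughly as follows. This is a minimal sketch under the assumption that the input contains no stray closing parentheses; calltip_word is a hypothetical name, and the final pattern is one possible way to pick the last unclosed call.

```python
import re

# Innermost balanced pair: "(", no parentheses inside, ")".
BALANCED = re.compile(r"\([^()]*\)")
# A word directly before a "(" that has no further "(" after it.
LAST_OPEN = re.compile(r"(\w+)\s*\([^(]*$")

def calltip_word(text):
    """Return the word before the last unclosed '(' or '' if none exists."""
    # Step 1: prune from the inside out until nothing more can be removed.
    while True:
        pruned = BALANCED.sub("", text)
        if pruned == text:
            break
        text = pruned
    # Step 2: find the word before the last remaining open parenthesis.
    m = LAST_OPEN.search(text)
    return m.group(1) if m else ""

print(calltip_word("aaa(bbb(x), ccc("))   # ccc
print(calltip_word("aaa(bbb(x), ccc()"))  # aaa
```

Because the loop removes pairs at any depth, this variant does not share the one-level nesting limit of a single regex.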
Not sure which regex language/platform you're using for this, and I don't know if subpatterns are allowed in your platform or not. However, the following 2-step PHP code will work for all the cases you listed above:
$str = 'aaa(bbb(x), ccc()'; // your original string
// find and replace all balanced parentheses with blank
$repl = preg_replace('/ ( \( (?: [^()]* | (?1) )* \) ) /x', '', $str);
$matched = '';
// find word just before opening parenthesis in replaced string
if (preg_match('/\w+(?=[^\w(]*\([^(]*$)/', $repl, $arr))
    $matched = $arr[0];
echo "*** Matched: [$matched]\n";
Live Demo: http://ideone.com/evXQYt

Regex for matching last two parts of a URL

I am trying to figure out the best regex to simply match only the last two strings in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue i have is some strings can have a large number of periods for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com and yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com under the conditions above?
You don't have to use regex, instead you can use a simple explode function.
So you're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to match till the end of the string. This way you'll be sure your regex engine won't catch the match from the very beginning.
Use grouping inside (...). In effect it means the following: match a word of at least one character, then a dot (backslashed because the dot has a special meaning in regex and we want it 'as is'), and then again a word of at least one character.
Use reluctant search in the beginning of the pattern, because otherwise it will match everything in a greedy manner, for example, if your text is :
abc.def.gh
the greedy match will give f.gh in your group, and it's not what you want.
I assumed that you can have only word characters in your host (\w matches letters, digits and underscore; if the last two parts can contain hyphens you will need a wider character class).
I post here a working groovy example, you didn't specify the language you use but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
if you needed a solution in a Perl Regular Expression compatible way that will work in a number of languages, you can use something like that - the example is in PHP
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the last part of the URL plus the domain name, as long as the last part is two or three letters long. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as an output, and with www.stackoverflow.com (with or without preceding triple w) it gives you
stackoverflow.com
as a result
A shorter version
/(\.[^\.]+){2}$/
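Both ideas from this thread (splitting on dots, and anchoring a regex at the end of the string) can be sketched in Python. This is only an illustration under the asker's stated assumption that the input is a bare host name with no path; the function names are hypothetical. Note that the shorter pattern above, as written, also includes the dot in front of the second-to-last part, so the sketch uses a variant without it.

```python
import re

def last_two_labels_split(host):
    """Split on dots and join the last two parts."""
    return ".".join(host.split(".")[-2:])

# Two dot-free runs separated by one dot, anchored at the end of the string.
LAST_TWO = re.compile(r"[^.]+\.[^.]+$")

def last_two_labels_regex(host):
    """Regex variant; returns the input unchanged if it has no dot."""
    m = LAST_TWO.search(host)
    return m.group(0) if m else host

for h in ("www.stackoverflow.com",
          "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"):
    print(last_two_labels_split(h), last_two_labels_regex(h))
```

Neither version knows about multi-part suffixes such as co.uk; handling those needs a suffix list rather than a pattern.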

Regex with exception of particular words

I have problem with regex.
I need to make regex with an exception of a set of specified words, for example: apple, orange, juice.
and given these words, it will match everything except those words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)
If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/
I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So - lets build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> - _ or
// between a-z 0-9 A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word or the end of string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when not preceded/followed by another
// 'word' character. Just negate the result of the test to obtain your desired
// result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...
This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)
\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines
^(?!apple$|juice$|orange$).*$
will also work.
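The negative-lookahead approach translates directly to Python. The sketch below assumes the forbidden words are plain literals and checks the samples from the question; FORBIDDEN and is_allowed are names made up for this example.

```python
import re

FORBIDDEN = ("apple", "orange", "juice")

# \A and \Z anchor to the whole string in Python, so a string is rejected
# only when it consists of exactly one forbidden word.
PATTERN = re.compile(
    r"\A(?!(?:%s)\Z).*\Z" % "|".join(re.escape(w) for w in FORBIDDEN),
    re.DOTALL,
)

def is_allowed(text):
    """True unless `text` is exactly one of the forbidden words."""
    return PATTERN.match(text) is not None

for s in ("applejuice", "orange-apple-juice", "juice-juice-juice", "apple"):
    print(s, is_allowed(s))
```

Only the last sample prints False; the compound strings pass because the lookahead requires the forbidden word to be followed immediately by the end of the string.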
Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex (atomic groups need Python 3.11 or later in the re module, or the third-party regex module)
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means the engine is not allowed to backtrack into it: if the group matches once but something after it fails (here, the lookbehind), the engine cannot go back and try the group's other alternative, so the match attempt fails. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers.
Note that the atomic group first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because it's looking for the word "always" and then checks whether the four characters just before that point are "fail", which they never are (they are "ways").
Because the conditional fails and the group is atomic, the engine is not allowed to backtrack (to try the normal pattern), because it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject any pattern, such as any word containing the sub-string "apple", "orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Something like (PHP)
$input = "The orange apple gave juice";
if (preg_match("your regex for validating", $input) && !preg_match("/^(apple|orange|juice)$/", $input))
{
    // it's ok;
}
else
{
    // throw validation error
}

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like to have a regex that matches all <![FruitName]!>, between these <![FruitName]!>, there may be some garbage text, my first attempt is like this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
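To see the lazy quantifier in action, here is a short Python sketch using the sample string from the question. The closing bracket is escaped here purely for readability; the variable names are illustrative.

```python
import re

text = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Lazy .*? stops at the first "]!>", so names may contain ], ! or > alone.
FRUIT = re.compile(r"<!\[(.*?)\]!>", re.DOTALL)

# Fruit plus the text that follows it, up to the next tag or end of string.
FRUIT_AND_TEXT = re.compile(r"<!\[(.*?)\]!>(.*?)(?=<!\[|$)", re.DOTALL)

fruits = FRUIT.findall(text)
pairs = FRUIT_AND_TEXT.findall(text)

print(fruits)    # ['Apple', 'Banana', 'Orange', 'Pear', 'Pineapple']
print(pairs[0])  # ('Apple', 'some garbage text may be here')
```

A name such as Star]fruit! is still captured whole, because only the full three-character sequence ]!> ends the match.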
Depending on what language you are using, you can use the string methods your language provides and do simple splitting (with a simpler regex that is more understandable). Split your string using "!>" as separator. Go through each field and check for <!. If found, remove all characters from the front up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language
eg gawk
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for (i = 1; i <= NF; i++) {
        # check if there is <!
        if ($i ~ /<!/) {
            # if <! is found, substitute from front till <!
            gsub(/.*<!/, "", $i)
            # print result (only for fields that held a tag)
            print $i
        }
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.