Match everything except every given combination - regex

Given a string, for example abbbabf,
and a piece, for example ab,
I need to remove all characters except the occurrences of the piece; that is, from abbbabf I must get the result abab.
What should the regex pattern for this be?
Edit
Let's take PHP as an example.
It is simple to remove everything except the piece if the piece is just one symbol. That is, if the piece is a, I can do:
$str = "abbbabf";
echo preg_replace("#[^a]#", "", $str);
and the result is aa.
But I have no idea how to do this when the piece is more than one symbol.
Please don't give solutions such as:
preg_match_all("#ab#", $str, $a);
echo implode($a[0]);
Thanks
PS. I need to do this in an Oracle database, and it would be great to find a solution (one pattern) that needs no procedural handling.

The following can do it using capture groups rather than assertions:
$str = "helloababblolobbbabf";
//           ^^^^        ^^
echo preg_replace("#.*?(ab|$)#", "$1", $str);
// Output: ababab
Since you say you're actually working in Oracle, you can use REGEXP_REPLACE:
REGEXP_REPLACE(input, '.*?(ab|$)', '\1')
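For comparison, the same capture-group idea can be sketched in Python. This is only an illustration of the pattern above; the helper name keep_only is made up for this example, and the piece is escaped so that it is treated literally.

```python
import re


def keep_only(piece: str, text: str) -> str:
    """Remove everything from text except occurrences of piece."""
    # Lazily consume characters up to the next piece (or the end of the
    # string) and put back only what the group captured.
    pattern = ".*?(" + re.escape(piece) + "|$)"
    return re.sub(pattern, r"\1", text)


print(keep_only("ab", "abbbabf"))               # abab
print(keep_only("ab", "helloababblolobbbabf"))  # ababab
```

Note that . does not match a newline by default, so this sketch is meant for single-line input.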

The expression you need to use is this:
((?<=ab|^).*?(?=ab|$))
From the string abbbabfasdfsdfsdfab, ababab is returned.
See it in action: http://regex101.com/r/nT8mC1
Caveat: as Bart points out in a comment, Oracle doesn't implement much of the PCRE standard, so this simply won't work there. You'll have to look at some sort of capture approach where you capture the strings you want and rebuild the result with implode (which you apparently don't want to do).
Edit added suggestion for conditional from comments.

Related

Regex - skip over expressions and parse the rest

I use regular expressions for sorting data into groups. The lines look somewhat like:
testword test
test testword
tes.w. tes.
tes tes.w.
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
The word test is to be found as well as othertexttobefound and sometexttobefound.
Now I am trying to tell my parser that it is supposed to plainly ignore testword and its derivatives while searching and focus on the rest of my data entries. The "good words" and the "bad words" can be anywhere in each line.
I have tried [^w] which is fine for the beginning of strings, but in my versions not for the other cases. Also (?:w) didn't do the trick. I cannot use lookarounds as these would keep the whole line from being detected.
After long searches on the internet I am hoping for help here!
After much appreciated help from Naxos84, I am adding some German real life examples:
sozialabgabe sozialarbeiter
soz.abg. sozialarbeiter
sozarbeiter soz.abg.
sozialarbeiter otherirrelevantstuff
otherirrelevantstuff soz abg
otherirrelevantstuff sozabg
otherirrelevantstuff sozialabgabe
If I search with:
sozial["^\ab"]|soz["^\ab"]|sometexttobefound|othertexttobefound
Lines 6 and 7 get marked as well, but I don't want those.
What am I doing wrong?
To find all the matches you want (any occurrence of "test", "sometexttobefound" and "othertexttobefound"), you can try the following regex:
test[^\w]|sometexttobefound|othertexttobefound
This regex means:
Find every "test" that is followed by a non-word character, OR sometexttobefound, OR othertexttobefound.
I tried this regex with the following text (I added a few 'test's):
testword test
test testword
tes.w. testtes.
tes tes.w. test
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
at regexr (when using the global flag)
If you also want to find things like "tes" I guess you should add it. (I'm not a regex expert)
Like:
test[^\w]|tes[^\w]|sometexttobefound|othertexttobefound
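One caveat of test[^\w] is that it needs an actual character after "test", so a "test" at the very end of the input is missed. The sketch below compares it in Python with a variant that uses the word boundary \b instead; that variant is my substitution for illustration, not part of the answer above.

```python
import re

lines = [
    "testword test",
    "test testword",
    "tes.w othertexttobefound",
    "sometexttobefound testword somemoretextwhichdoesnotmatter",
]

# Pattern from the answer: "test" must be followed by a non-word character.
original = re.compile(r"test[^\w]|sometexttobefound|othertexttobefound")
# Variant (an assumption): a word boundary also succeeds at the end of input.
with_boundary = re.compile(r"\btest\b|sometexttobefound|othertexttobefound")

for line in lines:
    print(repr(line), original.findall(line), with_boundary.findall(line))
```

With \b the match is just "test", without the trailing character that [^\w] consumes.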
If you want to get all words from the text except from some special words, you could use:
@words = grep { $_ ne 'testword' } split /\P{L}+/, $str;
(if $str is your complete string)
See perl docs for \P{...}. Instead of \P{L}, you could also use \W, but those are locale-dependent.
But if you need to use regexps only, then you could use
@words = $str =~ /\b(?!testword)\p{L}+\b/g;
But \b is locale-dependent as well, so you might want to use \b{...} or rebuild the word-boundary matches with \p{L}:
@words = $str =~ /
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
(?!testword)\p{L}+
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
/gx;
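The split-and-filter idea carries over to other languages. Below is a rough Python sketch; Python's built-in re module has no \p{L}, so the class [^\W\d_] (word characters minus digits and underscore) stands in for "letter" here. The function name words_except and its banned parameter are made up for this example.

```python
import re


def words_except(text, banned=("testword",)):
    """Return all letter-only words in text that are not in banned."""
    # [^\W\d_]+ matches runs of letters; it approximates Perl's \p{L}+.
    words = re.findall(r"[^\W\d_]+", text)
    return [word for word in words if word not in banned]


print(words_except("testword test, tes.w"))  # ['test', 'tes', 'w']
```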

Regex for matching last two parts of a URL

I am trying to figure out the best regex to match only the last two parts of a URL.
For instance, with www.stackoverflow.com I just want to match stackoverflow.com
The issue I have is that some strings can have a large number of periods. For instance,
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLs I am working with does not have any path information, so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com and yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
under the conditions above?
You don't have to use regex, instead you can use a simple explode function.
So you're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
// this will echo the second-to-last element, yimg
echo $url_split[count($url_split)-2];
// this will echo the period
echo ".";
// this will echo the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to match up to the end of the string. This way you can be sure the regex engine won't take a match from the very beginning.
Use grouping with (...). It means the following: match a word of at least one character, then a dot (backslashed, because the dot has a special meaning in regex and we want it 'as is'), and then again a word of at least one character.
Use a reluctant quantifier at the beginning of the pattern, because otherwise it will match everything in a greedy manner. For example, if your text is:
abc.def.gh
the greedy match will give f.gh in your group, and that's not what you want.
I assumed that you can have only word characters in your host (\w matches letters, digits and underscores; your real data may need something more complicated).
I post here a working groovy example, you didn't specify the language you use but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
If you need a solution using Perl-compatible regular expressions (PCRE), which will work in a number of languages, you can use something like this. The example is in PHP:
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z0-9-]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the domain name plus the last part of the URL, as long as that last part is two or three letters long. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as output, and with www.stackoverflow.com (with or without the preceding www) it gives you
stackoverflow.com
as a result.
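A shorter version:
/(\.[^\.]+){2}$/
Note that this match includes the leading dot (.yimg.com), and that it only matches when the input contains at least two dots.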
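To compare the split approach and the regex approach side by side, here is a small Python sketch. It assumes, as the question does, that the input is a bare host name with no path; the function names are invented for this example.

```python
import re

# Two dot-free labels separated by one dot, anchored at the end of the string.
LAST_TWO = re.compile(r"[^.]+\.[^.]+$")


def last_two_by_split(host: str) -> str:
    """Split on dots and rejoin the last two labels."""
    return ".".join(host.split(".")[-2:])


def last_two_by_regex(host: str) -> str:
    """Return the last two labels, or the input unchanged if there is no dot."""
    match = LAST_TWO.search(host)
    return match.group(0) if match else host


for host in ("www.stackoverflow.com",
             "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"):
    print(last_two_by_split(host), last_two_by_regex(host))
```

Using [^.] instead of \w keeps hyphens in a label, and the match carries no leading dot.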

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is the last occurrence of the word Get within each pair of Start and Stop delimiters. The result here would be the three bolded Get below.
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.
Any suggestions are welcome.
Thanks
Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
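A quick way to check this pattern is to run it over a sample string. The Python sketch below does that under the same assumptions (balanced, non-nested delimiters); the sample text and the helper name last_gets are made up for illustration.

```python
import re

# "Get" followed by characters that never start Get, Start or Stop, up to Stop.
LAST_GET = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")


def last_gets(text: str) -> list:
    """Return the start offsets of the last Get in each Start...Stop block."""
    return [match.start() for match in LAST_GET.finditer(text)]


sample = "StartaGetbbGetcccGetddddStopeeeeeStartfGetgGethStopiStartjGetkStop"
print(last_gets(sample))
```

The inner negative lookahead is checked before every single character, which is what stops the scan from running past another Get or a delimiter.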
I would have done it with two passes. The first pass finds the word "Get", and the second pass counts the number of occurrences of it.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.
Also, in the lookahead, possibly the . should be a .+ or .{1,2} depending on if the double underscore is a typo or not.
With Perl, I'd do:
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt to your regex flavour.

Regex which ignores comments

Being a regex beginner, I need some help writing a regex. It should match a particular pattern, let's say "ABC". But the pattern shouldn't be matched when it is used in a comment (' being the comment sign). So XYZ ' ABC
shouldn't match. x("teststring ABC") also shouldn't match. But ABC("teststring ' xxx") has to match to the end, that is, xxx must not be cut off.
Also, does anybody know a free regex application that you can use to "debug" your regex? I often have trouble recognizing what's wrong with my attempts. Thanks!
Some will swear by RegexBuddy. I've never used the debugger, but I advise you to steer away from the regex generator it provides. It's just a bad idea.
You may be able to pull this off with whatever regex flavor you're using, but in general I think you're going to find it easier and more maintainable to do this the "hard" way. Regular expressions are for regular languages, and nested anything usually means that regexes aren't a good idea. Modern extensions to regex syntax mean it may be doable, but it's not going to be pretty, and you sure won't remember what happened in the morning. And one place where regular expressions fail quite spectacularly (even with modern non-regular extensions) is parsing nested structures: trying to parse any mixture of comments, quoted strings, and parentheses quickly devolves into an incomprehensible and unmaintainable mess. Don't get me wrong, I'm a fan of regular expressions in the right places. This isn't one of them.
On the topic of good regex tools, I really like RegexBuddy, but it's not free.
Other than that, a regex is the wrong tool for the job if you need to check inside string delimiters and all sorts too. You need a finite-state machine.
Odd that lots of people recommend their favorite tools, but nobody provides a solution for the problem at hand. (I'm the developer of RegexBuddy, so I'll refrain from recommending any tools.)
There's no good way of matching Y unless it's part of XYZ with a single regular expression. What you can do is write a regex that matches both Y and XYZ: Y|XYZ. Then use a bit of extra code to process the matches for Y, and ignore those for XYZ. One way to do that is with a capturing group: (Y)|XYZ. Now you can process the matches of the first capturing group. When XYZ matches, the capturing group doesn't match anything.
To do this for your VB-style comments, you can use the regex:
'.*|(ABC)
This regex matches a single quote and everything up to the end of the line, or ABC. This regex will match all comments (whether those include ABC or not). The capturing group will match all occurrences of ABC, except those in comments.
If you want your regex to both skip comments and strings, you can add strings to your regex:
'.*|"[^"\r\n]*"|(ABC)
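The capturing-group trick can be tried out in a few lines of Python. This is only a sketch of the regex above applied to the three examples from the question; the function name find_abc is invented here.

```python
import re

# Comments and strings are matched (and thereby skipped); only ABC is captured.
SKIP_OR_ABC = re.compile(r"""'.*|"[^"\r\n]*"|(ABC)""")


def find_abc(line: str) -> list:
    """Return the offsets of every ABC outside comments and strings."""
    return [match.start(1)
            for match in SKIP_OR_ABC.finditer(line)
            if match.group(1) is not None]


print(find_abc("XYZ ' ABC"))
print(find_abc('x("teststring ABC")'))
print(find_abc('ABC("teststring \' xxx")'))
```

Because the regex engine consumes a whole string or comment in one match, any ABC inside it is never seen by the third alternative.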
I find the best 'debugger' for regexes is just messing around in an interactive environment trying lots of small bits out. For Python, ipython is great; for Ruby, irb, for command-line type stuff, sed...
Just try out little pieces at a time, make sure you understand them, then add an extra little bit. Rinse and repeat.
For .NET development you might also try RegexDesigner; this tool can generate code (VB/C#) for you. It is a very good tool for us regex starters.
Here is my solution to this problem:
1. Find and store all your comments in a hash
2. Do your regexp replacement
3. Put the comments back into the file
Save your time :-)
using System.Collections.Generic;
using System.Text.RegularExpressions;
string fileTextWithComments = "Some text file contents";
Dictionary<string, string> comments = new Dictionary<string, string>();
// 1. Find and store all your comments in a hash
Regex rc = new Regex("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");
MatchCollection matches = rc.Matches(fileTextWithComments);
int index = 0;
foreach (Match match in matches)
{
string key = string.Format("/*Comment#{0}*/", index++);
comments.Add(key, match.Value);
fileTextWithComments = fileTextWithComments.Replace(match.Value, key);
}
// 2. Do your regexp replacement
Regex r = new Regex("YOUR REGEXP PATTERN");
fileTextWithComments = r.Replace(fileTextWithComments, "NEW STRING");
// 3. Bring comments back to file :-)
foreach (string key in comments.Keys)
{
string comment = comments[key];
fileTextWithComments = fileTextWithComments.Replace(key, comment);
}
Could you clarify? I read it thrice, and I think you want to match a given pattern when it appears as a literal, that is, not as part of a comment or a string.
What you're asking for is pretty tricky to do as a single regexp, because you want to skip strings. Multiple strings in one line would complicate matters.
I wouldn't even try to do it in one regexp. Instead, I'd pass each line through a filter first, to remove strings and then comments, in that order, and then try to match your pattern.
I'd use Perl because of its regexp processing power. Assume @lines is a list of lines you want to match, and $pattern is the pattern you want to match.
my @matches;
for (@lines) {
    my $line = $_;
    $line =~ s/"[^"]*?(?<!\\)"//g;   # strip double-quoted strings
    $line =~ s/'.*//;                # strip the comment, if any
    push @matches, $_ if $line =~ m/$pattern/;
}
The first substitution finds any text that starts with a double quotation mark and ends with the first unescaped double quote, using the standard escape character, the backslash.
The next one strips comments. If the pattern still matches, the line is added to the list of matches.
It's not perfect, because it can't tell the difference between "a\\" and "a\". The first is usually a valid string, the latter is not. Either way, these substitutions will continue to look for another "; if one isn't found, the string isn't thrown out. We could use another substitution to replace all double backslashes with something else, but this will cause problems if the pattern you're looking for contains a backslash.
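The same filter-then-match idea, sketched in Python for comparison. This simplified version makes no attempt to handle escaped quotes at all (an assumption that is even stricter than the Perl above), and the name lines_matching is made up for this example.

```python
import re

STRING = re.compile(r'"[^"\r\n]*"')  # double-quoted string, no escape handling
COMMENT = re.compile(r"'.*")         # from the comment sign to the end of line


def lines_matching(lines, pattern):
    """Return the lines where pattern occurs outside strings and comments."""
    wanted = re.compile(pattern)
    result = []
    for line in lines:
        stripped = STRING.sub("", line)       # remove strings first
        stripped = COMMENT.sub("", stripped)  # then remove the comment
        if wanted.search(stripped):
            result.append(line)               # keep the original line
    return result


examples = ["XYZ ' ABC", 'x("teststring ABC")', 'ABC("teststring \' xxx")']
print(lines_matching(examples, "ABC"))
```

Removing strings before comments matters: the third example contains a comment sign inside a string, which would otherwise cut the line short.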
You can use a zero width look-behind assertion if you only have single line comments, but if you're using multi-line comments, it gets a little trickier.
Ultimately, you really need to solve this kind of issue with some sort of parser, given that the definition of a comment is really driven by a grammar.
This answer to a different but related question looks good too...
If you have Emacs, there is a built-in regex tool called "regexp-builder". I don't really understand the specifics of your regex question well enough to suggest an answer to that.
RegEx1: (-user ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -user "test user"
Regex2: (-daterange ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -daterange "1/4/13 1/20/13"
RegEx3: (-date )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -date 1/4/13 -
RegEx4: (-day )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -day monday -
Search for the quoted value first; if it is not found, search for the unquoted parameter. This expects only one occurrence of the parameter. It also expects the command to either use quotes to encapsulate a string with no quotes inside, or use any character other than a quote in the first position, have no occurrence of ' -' until the next parameter, and have a trailing ' -' (add it onto the string before applying the regex).
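The quoted-first, unquoted-second lookup described above could be wrapped up roughly as follows. This is a Python sketch under the same assumptions (one occurrence per parameter, no quotes inside quoted values); the function name get_param is hypothetical.

```python
import re


def get_param(command: str, name: str):
    """Return the value of -name, trying the quoted form first."""
    flag = "-" + re.escape(name)
    quoted = re.search(flag + ' "(.*?)"', command)
    if quoted:
        return quoted.group(1)
    # Append the trailing " -" so that the last parameter is terminated too.
    bare = re.search(flag + " (.*?) -", command + " -")
    return bare.group(1) if bare else None


cmd = 'report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13"'
for name in ("user", "date", "day", "daterange"):
    print(name, "=", get_param(cmd, name))
```

Unlike the regexes above, the groups here capture only the value, not the flag or the closing delimiter.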

Regular expression replace a word by a link

I want to write a regular expression that will replace the word Paris with a link, but only where the word is not already part of a link.
Example:
i'm living in Paris, near Paris Gare du Nord, i love Paris.
would become
i'm living.........near Paris..........i love Paris.
This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
1. Put a link around every "Paris" there is, regardless of whether another link is already present.
2. Find all incorrectly nested links (<a href="...">Paris</a>), and eliminate the inner link.
Regex for step one is dead-simple:
\bParis\b
Regex for step two is slightly more complex:
(<a[^>]+>.*?(?!:</a>))<a[^>]+>(Paris)</a>
Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
Find every link (<a[^>]+>), optionally followed by anything that is not itself followed by a closing link (.*?(?!:</a>)). Save it into match group 1.
Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
Now look for the word Paris. Save it into match group 2.
Look for a closing link (</a>). Make sure it is there, but don't save it.
Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
The approach assumes these side conditions:
Your input HTML is not horribly broken.
Your regex flavor supports non-greedy quantifiers (.*?) and zero-width negative look-ahead assertions ((?!:...)).
In step 1 you wrap the word "Paris" only in a link, with no additional characters. Every "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
in the <b>capital of France</b>, <a href="">Paris</a>
The surplus link comes from step one, replacement result of step 2 will be:
in the <b>capital of France</b>, Paris
You could search for this regular expression:
(<a[^>]*>.*?</a>)|Paris
This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.
Replace the match with your link only if the capturing group did not match anything.
E.g. in C#:
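resultString =
    Regex.Replace(
        subjectString,
        "(<a[^>]*>.*?</a>)|Paris",
        new MatchEvaluator(ComputeReplacement));
public String ComputeReplacement(Match m) {
    if (m.Groups[1].Success) {
        // An existing link matched: put it back unchanged
        return m.Groups[1].Value;
    } else {
        // A bare "Paris" matched: wrap it in the link (href left elided here)
        return "<a href=\"...\">Paris</a>";
    }
}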
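For readers who don't use C#, the same match-evaluator idea can be sketched in Python with a replacement callback. The URL and the function name link_paris are placeholders chosen for this example, and \b is added so that only the whole word is linked (a small deviation from the regex above).

```python
import re

# Either a complete existing link (captured) or the bare word Paris.
LINK_OR_PARIS = re.compile(r"(<a[^>]*>.*?</a>)|\bParis\b")


def link_paris(html: str, url: str = "https://example.org/paris") -> str:
    """Wrap every Paris that is not already inside a link."""
    def replacement(match):
        if match.group(1) is not None:
            return match.group(1)  # existing link: keep it unchanged
        return '<a href="{}">{}</a>'.format(url, match.group(0))
    return LINK_OR_PARIS.sub(replacement, html)


print(link_paris('I live near <a href="x">Paris Nord</a>, I love Paris.'))
```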
The traditional answer to such a question: use a real HTML parser, because REs aren't really good at operating in a context. And HTML is complex: an 'a' tag can have attributes or not, in any order, can have HTML inside the link or not, etc.
Regular expression:
!(<a.*</a>.*)*Paris!isU
Replacement:
$1<a href="...">Paris</a>
$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use, it could be slightly different.
This should replace all occurrences of "Paris" with the link in the replacement. It just checks whether all opening a tags were closed before "Paris".
PHP example:
<?php
$s = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
$regex = '!(<a.*</a>.*)*Paris!isU';
$replace = '$1<a href="...">Paris</a>';
$result = preg_replace( $regex, $replace, $s);
?>
Addition:
This is not the best solution. One situation where this regex won't work is when you have an img tag that is not within an a element. When you set the title attribute of that image to "Paris", this "Paris" will be replaced too, and that's not what you want.
Nevertheless I see no way to solve your problem completely with a simple regular expression.
If you weren't limited to using Regular expressions in this case, XSLT is a good choice for a language in which you can define this replacement, because it 'understands' XML.
You define two templates:
One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.
$pattern = 'Paris';
$text = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
// 1. Define 2 arrays:
// $matches[1] - array of links with our keyword
// $matches[2] - array of keyword
preg_match_all('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)|(?<!\pL)('.$pattern.')(?!\pL)#', $text, $matches);
// Are there keywords to replace? Find the first keyword that is not inside an <a> tag
$number = array_search($pattern, $matches[2]);
// A keyword exists, let's go
if ($number !== FALSE) {
// Replace all link with temporary value
foreach ($matches[1] as $k => $tag) {
$text = preg_replace('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)#', 'KEYWORD_IS_ALREADY_LINK_'.$k, $text, 1);
}
// Replace our keywords with link
$text = preg_replace('/(?<!\pL)('.$pattern.')(?!\pL)/', '<a href="...">'.$pattern.'</a>', $text);
// Put the original links back
foreach ($matches[1] as $k => $tag) {
$text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k, $tag, $text);
}
// It works!
echo $text;
}
Regexes don't replace. Languages do.
Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with each name. Here's the easiest substitution I can imagine being possible with a single regex (Perl is used for the replacement syntax):
s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i
Proper names might work better:
s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/gi;
Of course "Baton Rouge" would become two links for:
Baton
Rouge
In Perl, you can do this:
# Longest names first, so that longer names win in the alternation
my $barred_list_of_cities
    = join( '|'
          , sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) } keys %url_for_city_of
          );
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;
But again, it's a language that implements a set of operations for regexes; regexes don't do a thing. (In reality, it's such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, and you just need to load the hash.)
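The table-driven substitution can be sketched in Python as well. The dictionary contents and URLs below are invented placeholders, and sorting the names longest-first is my choice so that multi-word names are tried before shorter ones.

```python
import re

# Hypothetical lookup table; in practice it would be loaded from a file or DB.
url_for_city = {
    "Paris": "https://example.org/Paris",
    "Baton Rouge": "https://example.org/Baton_Rouge",
}

# Escape each name and try longer names first.
alternation = "|".join(
    re.escape(city) for city in sorted(url_for_city, key=len, reverse=True)
)
CITY = re.compile(r"\b(?:" + alternation + r")\b")


def link_cities(text: str) -> str:
    """Wrap every known city name in a link to its URL."""
    return CITY.sub(
        lambda m: '<a href="{}">{}</a>'.format(url_for_city[m.group(0)], m.group(0)),
        text,
    )


print(link_cities("Baton Rouge and Paris"))
```

With the whole name in the table, "Baton Rouge" becomes a single link instead of two.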