The regex characters ?, $, | - regex

I have the following data:
abc def; ghi.
This regex will match:
([a-z0-9A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöùúûüýÿ ]*)\W (.*)( (\w\.))?
This regex will also match
([a-z0-9A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöùúûüýÿ ]*)\W (.*)$
I'm still quite new to regexes, but I thought | stood for OR, () grouped, and ? stood for zero or one occurrence. So I thought that combining the above queries would still match. However, the following does not match:
([a-z0-9A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöùúûüýÿ ]*)\W (.*)( (\w\.))|$
What am I doing wrong?
ps.
I am using the following for testing my regex.
http://regexpal.com/
EDIT:
I didn't use the code tag, so a character disappeared
EDIT2:
What I am trying to match is the following, the data will be a name.
So "abc def" is the surname and "ghi." is the salutation (English is not my native language; is that the correct term for words like "sir."?). However, it's possible that the first letter of the first name follows. That's why the pattern should end with either the end of the line or that letter.
The data when there is a first name involved would be:
abc; def. G.

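Operator precedence for the | operator is a little tricky: it has the lowest precedence of all regex operators, so an unparenthesized | splits the entire pattern into two alternatives. It's usually a good idea to explicitly wrap its two operands in parentheses.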
Also, be careful about inserting spaces into your regexes. It looks like you want to match a literal period in the \w. fragment, to match "G."
So I think what you want for the combined expressions is something like
((.*)( (\w\.))?)|(.*)$
But since ? means 0 or 1, as you have learned, this can be rewritten as
(.*)( (\w\.))?$
And, to add the rest of the expression back in, we have
^([a-z0-9A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöùúûüýÿ ]*)\W (.*)( (\w\.))?$
And, yes, "salutation" or "greeting" is a good word for "Mr.", "Ms.", "Dr.", etc.
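To make the precedence point concrete, here is a minimal sketch in Python (my choice of language; the question used a JavaScript-based tester, but alternation precedence works the same way). The short class `[a-z ]*` is a stand-in for the long accented-letter class, and the lazy `(.*?)` is my own assumption: with the greedy `(.*)` the trailing initial would end up inside group 2 instead of the optional group.

```python
import re

# Simplified stand-in for the long accented-letter class in the question.
NAME = r"[a-z ]*"

data = "abc def; ghi."

# Ungrouped: `|` splits the WHOLE pattern into two alternatives,
# "name, separator, rest, initial" OR "end of input".
ungrouped = re.compile(rf"({NAME})\W (.*)( (\w\.))|$")

# Optional initial followed by the end anchor.
optional = re.compile(rf"^({NAME})\W (.*?)( (\w\.))?$")

# Only the bare `$` alternative succeeds, so the match is empty.
m1 = ungrouped.search(data)
print(repr(m1.group(0)), m1.group(1))

for line in (data, "abc; def. G."):
    m2 = optional.search(line)
    print(m2.group(1), "|", m2.group(2), "|", m2.group(4))
```

The ungrouped pattern technically "matches", but only as an empty match at the end of the input, which a visual tester shows as no match at all.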


Regex expression to match word in a string

I've been banging my head with this issue as I can't get this expression to work.
I'm trying to match and output a specific word from a string, so for example, take this string:
<ANIMALS>
<value>DOG CAT COW</value>
</ANIMALS>
And now I want to match any one of them and return that value otherwise none, let's say, COW.
I've tried a lot of varying expressions with no luck such as:
IF(VALUE == "/(^|\W)COW($|\W)/", "COWVALUE", "NONE")
This doesn't work, nor do any other variants I've tried. If I keep the original string as a single word and no actual calling expression, just a word, then it always works. As soon as I introduce a string of words then I can't make it happen.
Could anyone help please?
Thanks!
Get rid of (^|\W) and ($|\W), since that only matches at word boundaries, but you want to match COW when it's not by itself as a word. The regular expression should just be /COW/, it will match that string wherever it appears in the string.
BTW, to match word boundaries you can use \b rather than those alternations.
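I can't tell which tool the IF(...) expression belongs to, so the following is only a sketch in Python of the word-boundary idea. It assumes the problem is that `==` compares the value with the pattern text instead of running a regex search; `classify` is a hypothetical helper name.

```python
import re

def classify(value: str) -> str:
    """Return "COWVALUE" if COW occurs as a whole word in value, else "NONE"."""
    # re.search scans the whole string; \b marks a word boundary.
    return "COWVALUE" if re.search(r"\bCOW\b", value) else "NONE"

print(classify("DOG CAT COW"))
print(classify("DOG CAT"))
print(classify("DOG SCOWL"))
```

Dropping the two `\b` anchors gives the plain /COW/ behaviour, which would also accept SCOWL.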
Not sure what programming language you're using, but the following is a simple JavaScript-based example that achieves your goal.
regex demo

Regex: how to match all character classes and not just one or more [duplicate]

Obviously, you can use the | (pipe?) to represent OR, but is there a way to represent AND as well?
Specifically, I'd like to match paragraphs of text that contain ALL of a certain phrase, but in no particular order.
Use a non-consuming regular expression.
The typical (i.e. Perl/Java) notation is:
(?=expr)
This means "match expr but after that continue matching at the original match-point."
You can do as many of these as you want, and this will be an "and." Example:
(?=match this expression)(?=match this too)(?=oh, and this)
You can even add capture groups inside the non-consuming expressions if you need to save some of the data therein.
You need to use lookahead as some of the other responders have said, but the lookahead has to account for other characters between its target word and the current match position. For example:
(?=.*word1)(?=.*word2)(?=.*word3)
The .* in the first lookahead lets it match however many characters it needs to before it gets to "word1". Then the match position is reset and the second lookahead seeks out "word2". Reset again, and the final part matches "word3"; since it's the last word you're checking for, it isn't necessary that it be in a lookahead, but it doesn't hurt.
In order to match a whole paragraph, you need to anchor the regex at both ends and add a final .* to consume the remaining characters. Using Perl-style notation, that would be:
/^(?=.*word1)(?=.*word2)(?=.*word3).*$/m
The 'm' modifier is for multiline mode; it lets the ^ and $ match at paragraph boundaries ("line boundaries" in regex-speak). It's essential in this case that you not use the 's' modifier, which lets the dot metacharacter match newlines as well as all other characters.
Finally, you want to make sure you're matching whole words and not just fragments of longer words, so you need to add word boundaries:
/^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b).*$/m
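As a rough illustration, the same pattern can be tried in Python, where `re.MULTILINE` plays the role of the 'm' modifier. The sample text is made up for the demonstration.

```python
import re

# word1..word3 are the placeholder words from the pattern above.
pattern = re.compile(
    r"^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b).*$",
    re.MULTILINE,
)

text = (
    "word3 then word1 and word2\n"   # all three words, any order
    "only word1 and word2 here\n"    # word3 is missing
    "word1word2 word3"               # word1 and word2 are not whole words
)

# With no capture groups, findall returns the full text of each match.
matches = pattern.findall(text)
print(matches)
```

Only the first line passes all three lookaheads; the third line fails because of the word boundaries.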
Look at this example:
We have 2 regexps A and B and we want to match both of them, so in pseudo-code it looks like this:
pattern = "/A AND B/"
It can be written without using the AND operator like this:
pattern = "/NOT (NOT A OR NOT B)/"
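in PCRE (negation has to be written as a negative lookahead, because ^ outside a character class is an anchor, not a NOT):
"/^(?!(?!.*A)|(?!.*B))/"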
regexp_match(pattern,data)
The AND operator is implicit in the RegExp syntax.
The OR operator has instead to be specified with a pipe.
The following RegExp:
var re = /ab/;
means the letter a AND the letter b.
It also works with groups:
var re = /(co)(de)/;
it means the group co AND the group de.
Replacing the (implicit) AND with an OR would require the following lines:
var re = /a|b/;
var re = /(co)|(de)/;
You can do that with a regular expression, but you'll probably want to do something else. For example, use several regexps and combine them in an if clause.
You can enumerate all possible permutations with a standard regexp, like this (matches a, b and c in any order):
(abc)|(bca)|(acb)|(bac)|(cab)|(cba)
However, this makes a very long and probably inefficient regexp, if you have more than couple terms.
If you are using some extended regexp version, like Perl's or Java's, they have better ways to do this. Other answers have suggested using positive lookahead operation.
Is it not possible in your case to do the AND on several matching results? in pseudocode
regexp_match(pattern1, data) && regexp_match(pattern2, data) && ...
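In Python, that pseudocode could look roughly like this. `matches_all` is a hypothetical helper name, not a library function.

```python
import re

def matches_all(patterns, data):
    """True only if every pattern is found somewhere in data."""
    # all() stops at the first pattern that fails to match.
    return all(re.search(p, data) for p in patterns)

print(matches_all([r"\byes\b", r"\bno\b"], "no, wait, yes"))
print(matches_all([r"\byes\b", r"\bno\b"], "yes and yes"))
```

This keeps each pattern simple and avoids lookaheads entirely, at the cost of scanning the input once per pattern.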
Why not use awk?
With awk, combining regexes with AND and OR is simple:
awk '/WORD1/ && /WORD2/ && /WORD3/' myfile
The order is always implied in the structure of the regular expression. To accomplish what you want, you'll have to match the input string multiple times against different expressions.
What you want to do is not possible with a single regexp.
If you use Perl regular expressions, you can use positive lookahead:
For example
(?=[1-9][0-9]{2})[0-9]*[05]\b
would match numbers that are at least 100 and divisible by 5
In addition to the accepted answer
I will provide some practical examples that should make things clearer. For example, let's say we have these three lines of text:
[12/Oct/2015:00:37:29 +0200] // only this + will get selected
[12/Oct/2015:00:37:x9 +0200]
[12/Oct/2015:00:37:29 +020x]
What we want to do here is to select the + sign but only if it's after two numbers with a space and if it's before four numbers. Those are the only constraints. We would use this regular expression to achieve it:
'~(?<=\d{2} )\+(?=\d{4})~g'
Note if you separate the expression it will give you different results.
Or perhaps you want to select some text between tags... but not the tags! Then you could use:
'~(?<=<p>).*?(?=<\/p>)~g'
for this text:
<p>Hello !</p> <p>I wont select tags! Only text with in</p>
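Both lookaround patterns can be checked with a short Python sketch. The `~...~g` delimiters and flag are dropped here because Python's `re` takes the bare pattern, and `findall` covers the "global" part.

```python
import re

lines = [
    "[12/Oct/2015:00:37:29 +0200]",
    "[12/Oct/2015:00:37:x9 +0200]",
    "[12/Oct/2015:00:37:29 +020x]",
]

# A "+" preceded by two digits and a space, followed by four digits.
plus_sign = re.compile(r"(?<=\d{2} )\+(?=\d{4})")
hits = [bool(plus_sign.search(line)) for line in lines]
print(hits)

# Text between <p> and </p>, without the tags themselves.
html = "<p>Hello !</p> <p>I wont select tags! Only text with in</p>"
inner = re.findall(r"(?<=<p>).*?(?=</p>)", html)
print(inner)
```

Note that Python only accepts fixed-width lookbehinds, which both of these are.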
You could pipe your output to another regex. Using grep, you could do this:
grep A | grep B
((yes).*(no))|((no).*(yes))
Will match sentence having both yes and no at the same time, regardless the order in which they appear:
Do i like cookies? **Yes**, i do. But milk - **no**, definitely no.
**No**, you may not have my phone. **Yes**, you may go f yourself.
Will both match, ignoring case.
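A quick Python check of this pattern, with the case-insensitive flag that "ignoring case" implies. The sample sentences are shortened, and the last one is my own addition to show a caveat.

```python
import re

# Either order: yes ... no, or no ... yes.
both = re.compile(r"(yes.*no)|(no.*yes)", re.IGNORECASE)

sentences = [
    "Do i like cookies? Yes, i do. But milk - no, definitely no.",
    "No, you may not have my phone. Yes, you may go.",
    "Yes, of course.",
    "You know the answer is yes.",
]
results = [bool(both.search(s)) for s in sentences]
print(results)
```

The last sentence also matches, because "no" is found inside "know"; wrapping each word in \b would avoid that.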
Use AND outside the regular expression. In PHP the lookahead operator did not seem to work for me, so instead I used this:
if( preg_match("/^.{3,}$/",$pass1) && !preg_match("/\s{1}/",$pass1))
return true;
else
return false;
The above regex will match if the password length is 3 characters or more and there are no spaces in the password.
Here is a possible "form" for "and" operator:
Take the following regex for an example:
If we want to match words without the "e" character, we could do this:
/\b[^\We]+\b/g
\W means NOT a "word" character.
^\W means a "word" character.
[^\We] means a "word" character, but not an "e".
see it in action: word without e
"and" Operator for Regular Expressions
I think this pattern can be used as an "and" operator for regular expressions.
In general, if:
A = not a
B = not b
then:
[^AB] = not(A or B)
= not(A) and not(B)
= a and b
Difference Set
So, if we want to implement the concept of difference set in regular expressions, we could do this:
a - b = a and not(b)
= a and B
= [^Ab]
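The double-negation trick is easy to try out. This Python sketch uses the pattern from above plus a second, made-up example of the difference-set form ("digits except 5").

```python
import re

text = "the quick brown fox jumps over lazy dogs"

# [^\We] = NOT (non-word OR "e") = word character AND not "e".
no_e = re.findall(r"\b[^\We]+\b", text)
print(no_e)

# Difference set: digits minus "5", written as NOT (non-digit OR "5").
digits_not_5 = re.findall(r"[^\D5]+", "1255 2015")
print(digits_not_5)
```

This only works for things that can be expressed as a single character class, so it is an "and" over character sets rather than over arbitrary sub-patterns.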

PowerGREP - regular expression

I have log of Apache and each line of file looks like:
script.php?variable1=value1&variable2=value2&variable3=value3&.........................
I need to take out this part of string:
variable1=value1&variable2=value2
and ignore the rest of line. How I can do this in PowerGREP?
I tried:
variable1=(.*)&variable2=(.*)&
But I get rest of line after value2.
Please help me, sorry for my english.
Contrary to what Ed Cottrell wrote about his second example, the first one works better (i. e. correctly); this is because if the subexpression for value2 is made non-greedy, it matches as few characters as possible, i. e. not any.
If you wouldn't mind having the & after value2 included in the match, you could as well hone your try by making the subexpression for value2 non-greedy, so that it only extends to the next &:
variable1=(.*)&variable2=(.*?)&
Replace . with [^&] and drop the final &, like this:
variable1=(.*)&variable2=([^&]*)
. will match anything it can (any character except for the newline character, basically). [^&], on the other hand, matches only characters that are not &.
For even better results and faster performance, you can also replace the first . in the same way and add ? (the non-greedy qualifier), like so:
variable1=([^&]*?)&variable2=([^&]*?)
Here's a working demo.
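I don't have PowerGREP at hand, so here is a sketch in Python instead, assuming its regex flavor treats these constructs the same way. It compares the negated-class pattern with the variant that ends in a non-greedy group; the log line is a shortened, made-up version of the one in the question.

```python
import re

line = "script.php?variable1=value1&variable2=value2&variable3=value3&more=stuff"

# Negated class: each value stops at the next "&".
m = re.search(r"variable1=([^&]*)&variable2=([^&]*)", line)
print(m.group(0))

# A lazy quantifier at the very end of the pattern has nothing forcing it
# to grow, so the second group stays empty.
lazy = re.search(r"variable1=([^&]*?)&variable2=([^&]*?)", line)
print(repr(lazy.group(2)))
```

So the non-greedy form is only useful when something follows it in the pattern, such as the trailing & in variable2=(.*?)&.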

Regex: match everything before FIRST underscore and everything in between AFTER

I have an expression like
test_abc_HelloWorld_there could be more here.
I'd like a regex that takes the first word before the first underscore. So get "test"
I tried [A-Za-z]{1,}_ but that didn't work.
Then I'd like to get "abc" or anything in between the first 2 underscores.
2 Separate Regular expressions, not combined
Any help is very appreciated!
Example:
for 1) the regex would match the word test
for 2) the regex would match the word abc
so any other match for either case would be wrong. As in, if I were to replace what I matched on then I would get something like this:
for case 1) match "test" and replace "test" with "Goat".
'Goat_abc_HelloWorld_there could be more here'
I don't want a replace, I just want a match on a word.
In both cases you can use assertions.
^[^_]+(?=_)
will get you everything up to the first underscore of the line, and
(?<=_)[^_]+(?=_)
will match whatever string is located between two underscores.
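Here is how the two patterns behave on the string from the question, sketched in Python (the question doesn't say which language is in use).

```python
import re

s = "test_abc_HelloWorld_there could be more here."

# Everything up to (but not including) the first underscore.
first = re.search(r"^[^_]+(?=_)", s).group(0)
print(first)

# Every run of non-underscore characters with an underscore on both sides.
between = re.findall(r"(?<=_)[^_]+(?=_)", s)
print(between)

# re.search returns only the first such run.
second = re.search(r"(?<=_)[^_]+(?=_)", s).group(0)
print(second)
```

Because the second pattern matches every underscore-delimited run, take just the first match if only the part between the first two underscores is wanted.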
Step back and consider that maybe you're overengineering the solution here. Ruby has a split method for this, other languages probably have their own equivalents
given something like this "AAPL_annual_i.xls", you could just do this and take advantage of the fact that your data is already structured
string_object = "AAPL_annual_i.xls"
ary = string_object.split("_")
#=> ["AAPL", "annual", "i.xls"]
extension = ary[2].split(".")[1]
#=> "xls"
filetype = ary[2].split(".")[0] #etc
'doh!
But seriously, I've found that leaning on the split method is not only easier on me, it's easier on my associates who have to read my code and understand what it does.

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
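To compare the greedy, lazy and negated-class versions side by side, here is a small Python sketch. I've added a hypothetical extra field (D=456) to the sample string so that the greedy pattern has room to overshoot.

```python
import re

# Sample string from the question plus one made-up extra field.
s = "A=abc;B=def_3%^123+-;C=123;D=456;"

# Greedy: (.+) grabs as much as possible and only backs off far enough
# for the rest of the pattern to match.
greedy = re.search(r".+B=(.+);.+", s).group(1)

# Lazy: (.+?) stops at the first ";" that lets the pattern succeed.
lazy = re.search(r".+B=(.+?);.+", s).group(1)

# Negated class: the group cannot cross a ";" at all.
negated = re.search(r"B=([^;]+);", s).group(1)

print(greedy)
print(lazy)
print(negated)
```

The negated class is usually the most predictable of the three, since it does not depend on what follows the closing token.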
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
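import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        Pattern p = Pattern.compile("B=(.*?);");
        Matcher m = p.matcher(s);
        // find() looks for the pattern anywhere in the string
        if (m.find()) {
            System.out.println(m.group(1));
        }
    }
}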
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.