regex: how to include the first occurrence before the pattern?

My text is:
120 something 130 somethingElse Paris
My goal is to capture "130 somethingElse Paris", which means the match should start at the last number BEFORE Paris.
I tried:
\d+.*Paris
But this captures the WHOLE string (starting from the first digit).
The rule is:
Capture everything before Paris, going backwards until the nearest number is found.
Any clue ?
regards

Try this regex:
/(\d+[^\d]*Paris)/gi
http://jsfiddle.net/rooseve/XDgxL/

Less backtracking, and without relying on greediness:
\d+[^0-9]*Paris

For the last occurrence:
^code:[ ]([0-9a-f-]+)(?:(?!^code:[ ])[\s\S])*Paris
You have to adapt it to your text.
Please refer to these:
Regex match everything from the last occurrence of either keyword
Match from last occurrence using regex in perl
RegExp: Last occurence of pattern that occurs before another pattern
Regex get last occurrence of the pattern

You should add a ? after the * to make it un-greedy. Like this:
\d+.*?Paris

You can use this pattern:
(\d+\D*?)+Paris
Earlier repetitions of the capturing group are overwritten by the last one.
The lazy quantifier *? forces the pattern to stop at the first "Paris". With a greedy quantifier, in a string containing "Paris" more than once, the match would run on to the last "Paris" instead.
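The patterns suggested in this thread can be compared side by side. The following is a minimal sketch using Python's re module (Python and the variable names are assumptions for illustration; the question does not name an engine). It also shows that making .* lazy does not move the start of the match, because the engine still begins at the leftmost digit:

```python
import re

text = "120 something 130 somethingElse Paris"

# Greedy attempt from the question: the match starts at the first digit.
greedy = re.search(r"\d+.*Paris", text).group(0)

# Lazy variant: the match still starts at the leftmost digit,
# so on this input the result is unchanged.
lazy = re.search(r"\d+.*?Paris", text).group(0)

# Negated class: no digit may appear between the number and "Paris".
negated = re.search(r"\d+\D*Paris", text).group(0)

# Repeated group: group 1 keeps only its last repetition.
repeated = re.search(r"(\d+\D*?)+Paris", text)

print(greedy)                   # 120 something 130 somethingElse Paris
print(lazy)                     # 120 something 130 somethingElse Paris
print(negated)                  # 130 somethingElse Paris
print(repr(repeated.group(1)))  # '130 somethingElse '
```

Note that with the repeated-group pattern the wanted text is in group 1 and does not include "Paris" itself, whereas the negated-class pattern returns it as the whole match.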

Related

RegEx, get very first match or very last match?

New to RegEx, PCRE(PHP), have a basic question:
Text String I'm working with is below, text is literal
us%3Aks%2Cus%3Aal%2Cus%3Aok%2Cus%3Aia%2Cus%3Ala%2Cus%3Asc%2Cus%3Aut%2Cus%3Act%2Cus%3Aor%2Cus%3Atn%2Cus%3Amo%2Cus%3Aaz%2Cus%3Ain%2Cus%3Amd%2Cus%3Aco%2Cus%3Awi%2Cus%3Awa
The goal for the first is to get everything up to and including the first %2C -> "us%3Aks%2C"
The goal for the last is to get the last %2C and everything after it -> "%2Cus%3Awa"
What am I doing wrong with my attempts?
1. ^(.+%2C)
2. (%2C.+)$
You may use this regex with a lazy match and a greedy match:
^(.*?%2C).+(%2C.*)$
RegEx Details:
^: Start
(.*?%2C): Match 0 or more characters followed by %2C (lazy match) in group #1
.+: Match 1 or more of any characters (greedy match)
(%2C.*): Match %2C followed by 0 or more characters in group #2
$: End
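As a rough illustration of the combined pattern, here is a sketch in Python's re module rather than PHP's PCRE (the choice of Python and the shortened input string are assumptions made to keep the example readable):

```python
import re

# Shortened version of the string from the question (assumed for brevity).
s = "us%3Aks%2Cus%3Aal%2Cus%3Aok%2Cus%3Awa"

# Lazy group 1 stops at the first %2C; the greedy .+ in the middle
# pushes group 2 to the last %2C.
m = re.match(r"^(.*?%2C).+(%2C.*)$", s)
first, last = m.group(1), m.group(2)

print(first)  # us%3Aks%2C
print(last)   # %2Cus%3Awa
```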
It's a matter of greediness, which controls how many characters the expression will gobble before being satisfied. So, instead of using .+, you could use .*?.
For your case (1), the expression becomes:
1. ^(.*?%2C)
For your second case, unfortunately, lazy matching alone will not help. Instead, we have to skip most of the string up front with a greedy .+, so the second expression becomes something like:
2. .+(%2C.+)$
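To see why the original attempts misbehave and what the two corrected expressions return, here is a small sketch in Python's re module (Python and the shortened input are assumptions for illustration, not part of the question):

```python
import re

# Shortened version of the string from the question (assumed for brevity).
s = "us%3Aks%2Cus%3Aal%2Cus%3Aok%2Cus%3Awa"

# Attempt 1: greedy .+ runs to the LAST %2C.
first_greedy = re.search(r"^(.+%2C)", s).group(1)
# Fix 1: lazy .*? stops at the FIRST %2C.
first_lazy = re.search(r"^(.*?%2C)", s).group(1)

# Attempt 2: the search is leftmost, so it starts at the FIRST %2C.
last_naive = re.search(r"(%2C.+)$", s).group(1)
# Fix 2: a leading greedy .+ swallows everything up to the LAST %2C.
last_fixed = re.search(r".+(%2C.+)$", s).group(1)

print(first_greedy)  # us%3Aks%2Cus%3Aal%2Cus%3Aok%2C
print(first_lazy)    # us%3Aks%2C
print(last_naive)    # %2Cus%3Aal%2Cus%3Aok%2Cus%3Awa
print(last_fixed)    # %2Cus%3Awa
```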

How to match text when part of it was already matched by the previous match?

I have a string like aaa**b***c****ddd, and I want to get the sequence of substrings matching the pattern [^*]\*+[^*], which I think should be [a**b, b***c, c***d]. However, when I test this in a text editor like Vim or Emacs, the second one (b***c) is not matched.
aaa**b***c***ddd
  |--|   |---|
  first  third
     |---|
     second, which I think should be matched but is not
How should I modify the regular expression to match the second?
Yes you can. The trick is to put the whole pattern in a capturing group inside a lookahead, which allows overlapping results:
(?=([^*]\*+[^*]))
But you can't use this to do replacements, since the pattern itself matches nothing (or perhaps you can, if you can get the capture group length and the current offset).
EDIT:
In Vim it seems to be possible to obtain the capture group length with strlen(submatch(1)).
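The lookahead trick can be tried outside an editor. The following is a minimal sketch in Python's re module (an assumption for illustration; the question is about Vim and Emacs), using the string from the diagram in the question:

```python
import re

text = "aaa**b***c***ddd"

# A plain search consumes characters, so the "b" that ends the first
# match cannot also start the second one.
plain = re.findall(r"[^*]\*+[^*]", text)

# The lookahead consumes nothing, so a match is attempted at every
# position; the capturing group inside it carries the text.
overlapping = re.findall(r"(?=([^*]\*+[^*]))", text)

print(plain)        # ['a**b', 'c***d']
print(overlapping)  # ['a**b', 'b***c', 'c***d']
```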
@CommuSoft is correct. One way to approach this problem is to match the whole string against the regex, then match again against the substring that starts at (index of the previous match + 1) and runs to the end of the string, and repeat.
So if the index of your first match above (a**b) was 2, the substring for the second match should start at index 3 and run to the end of the string. This will give you the overlapping results.
However, Casimir's answer is much simpler.
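The restart-one-character-later approach can be sketched as a loop. This is only an illustration in Python's re module; the helper name overlapping_matches is hypothetical and not taken from any answer:

```python
import re

def overlapping_matches(pattern, text):
    """Collect matches, restarting one character after each match start."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while True:
        match = regex.search(text, pos)
        if match is None:
            break
        results.append(match.group(0))
        # Step past the START of the match, not its end, so the next
        # search may reuse characters from the current match.
        pos = match.start() + 1
    return results

print(overlapping_matches(r"[^*]\*+[^*]", "aaa**b***c***ddd"))
# ['a**b', 'b***c', 'c***d']
```

Unlike the lookahead version, each result here is a real match with its own start and end offsets, which may be easier to use for replacements.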

Greedy/non-greedy quantifiers in ABAP regular expressions

I would like to extract 2 things from this string: | 2013.10.10 FEL felsz
regex -> Date field -> the needed value will be only the 2013.10.10 (in this case)
regex -> String between 2013.10.10 and felsz string -> the needed value will be only the FEL string (in this case).
I tried the following regexes, without much success:
(.*?<p/\s>.*?)(?=\s)
(.*?<p/("[0-9]+">.*?)(?=\s)
Do you have any suggestions?
As mentioned in comments, since ABAP doesn't allow non-greedy match with *?, if you can count on felsz occurring only immediately after the second portion you want to match you could use:
(\d{4}\.\d\d\.\d\d) (.*) felsz
(PS: Invalidated first answer: in non-ABAP systems where *? is supported, the following regex will get both values into submatches. The date will be in submatch 1 and the other value (FEL in this case) will be in submatch 2: (\d{4}\.\d\d\.\d\d) (.*?) felsz)
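The greedy-only pattern can be checked quickly outside ABAP. The sketch below uses Python's re module, which is an assumption for illustration only; ABAP's own regex classes and statements are not shown here:

```python
import re

line = "| 2013.10.10 FEL felsz"

# Greedy-only pattern: the literal " felsz" bounds the greedy (.*),
# so no lazy quantifier is needed.
match = re.search(r"(\d{4}\.\d\d\.\d\d) (.*) felsz", line)
date, value = match.group(1), match.group(2)

print(date)   # 2013.10.10
print(value)  # FEL
```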
Is "felsz" variable? Can the white space vary? Can your date format vary? If not:
\| (\d{4}\.\d{2}\.\d{2}) (.*?) felsz
Otherwise:
\|\s+?(\d{4}\.\d{2}\.\d{2})\s+?(.*?)\s+?[a-z]+
Then access capture groups 1/2.
The regex
\d+\.\d+\.\d+
matches 2013.10.10 in the given string. Explanation and demonstration: http://regex101.com/r/bL7eO0
(?<=\d ).*(?= felsz)
should work to match FEL. Explanation and demonstration: http://regex101.com/r/pV2mW5
If you want them in capturing groups, you could use the regex:
\| (\d+\.\d+\.\d+) (.+?) .*
Explanation and demonstration: http://regex101.com/r/rQ6uU4
How about:
(?:\d+\.\d+\.\d+\s)(.*)\s
This matches FEL
Some things I took for granted:
the date always comes first and is a mix of numbers and periods
the date is always followed by a space
the word to capture is always followed by a space
the word to capture never contains a space
Assuming that FEL is always a single word (that is, delimited by a space), you could use the following expression:
(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)
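A negated character class like [^\s]+ avoids the need for a lazy quantifier altogether. Here is a minimal sketch in Python's re module (Python and the second, longer input line are assumptions added for illustration):

```python
import re

pattern = re.compile(r"(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)")

# String from the question.
groups = pattern.search("| 2013.10.10 FEL felsz").groups()

# Hypothetical longer line: group 2 still stops at the first space.
longer = pattern.search("| 2013.10.10 FEL felsz more text").groups()

print(groups)  # ('2013.10.10', 'FEL', 'felsz')
print(longer)  # ('2013.10.10', 'FEL', 'felsz more text')
```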

Regex to match first word in sentence

I am looking for a regex that matches first word in a sentence excluding punctuation and white space. For example: "This" in "This is a sentence." and "First" in "First, I would like to say \"Hello!\""
This doesn't work:
"""([A-Z].*?(?=^[A-Za-z]))""".r
(?:^|(?:[.!?]\s))(\w+)
Will match the first word in every sentence.
http://rubular.com/r/rJtPbvUEwx
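A quick way to try this pattern on the examples from the question is sketched below in Python's re module (the question itself uses Scala; Python is an assumption for illustration):

```python
import re

text = 'This is a sentence. First, I would like to say "Hello!"'

# Group 1 is the word that follows either the start of the string
# or a sentence-ending punctuation mark plus one whitespace character.
first_words = re.findall(r"(?:^|(?:[.!?]\s))(\w+)", text)

print(first_words)  # ['This', 'First']
```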
This is an old thread but people might need this like I did.
None of the above works if your sentence starts with one or more spaces.
I did this to get the first (non empty) word in the sentence :
(?<=^[\s"']*)(\w+)
Explanation:
(?<=^[\s"']*) positive lookbehind in order to look for the start of the string, followed by zero or more spaces or punctuation characters (you can add more between the brackets), but do not include it in the match.
(\w+) the actual match of the word, which will be returned
The following words in the sentence are not matched as they do not satisfy the lookbehind.
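Not every engine accepts a variable-length lookbehind such as (?<=^[\s"']*); Python's re module, for example, requires lookbehinds to be fixed-width. In such engines the same idea can be expressed with a capture group instead. A minimal sketch, where the helper name first_word is hypothetical:

```python
import re

def first_word(sentence):
    """Return the first word, skipping leading whitespace and quotes."""
    # re.match anchors at the start, so no explicit ^ is needed.
    match = re.match(r"""[\s"']*(\w+)""", sentence)
    return match.group(1) if match else None

print(first_word("   First, I would like to say"))  # First
print(first_word('"Hello!" she said'))              # Hello
print(first_word("   "))                            # None
```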
You can use this regex: ^[^\s]+ or ^[^ ]+.
You can use this regex: ^\s*([a-zA-Z0-9]+).
The first word can be found at a captured group.
[a-z]+
This should be enough as it will get the first a-z characters (assuming case-insensitive).
In case it doesn't work, you could try [a-z]+\b, or even ^[a-z]+\b, but the last one assumes that the string starts with the word.

Matching Conditions in Regex

Just a note upfront: I'm a bit of a regex newbie. Perhaps a good answer to this question would involve linking me to a resource that explains how these sorts of conditions work :)
Let's say that I have a street name, like 23rd St or 5th St. I'd like to get rid of the trailing "th", "rd", "nd", and "st". How can this be done?
Right now I have the expression: (st|nd|rd|th) . The problem with this is that it will also match street names that contain a "st", "nd", "rd", or "th". So what I really need is a conditional match that requires at least one digit directly before it (i.e., 1st and not street).
Thank you!
It sounds like you just want to match the ordinal suffix (st|nd|rd|th), yes?
If your regex engine supports it, you could use a lookbehind assertion.
/(?<=\d)(st|nd|rd|th)/
That matches (st|nd|rd|th) only if preceded by a digit \d, but the match does not capture the digit itself.
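The lookbehind version can be used directly in a replacement, since only the suffix is part of the match. A minimal sketch in Python's re module (Python and the helper name strip_ordinal_suffix are assumptions for illustration):

```python
import re

def strip_ordinal_suffix(street):
    # (?<=\d) asserts a digit just before the suffix without consuming it,
    # so replacing the match with "" leaves the number in place.
    return re.sub(r"(?<=\d)(?:st|nd|rd|th)", "", street)

print(strip_ordinal_suffix("23rd St"))       # 23 St
print(strip_ordinal_suffix("5th St"))        # 5 St
print(strip_ordinal_suffix("North Street"))  # North Street
```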
What you really want are anchors.
Try and replace globally:
\b(\d+)(?:st|nd|rd|th)\b
with the first group.
Explanation:
\b --> matches a position where either a word character (digit, letter, underscore) is followed by a non word character (none of the previous group), or the reverse;
(\d+) --> matches one or more digits, and capture them in first group ($1);
(?:st|nd|rd|th) --> matches any of st, etc... without capturing it ((?:...) is a non capturing group);
\b --> see above.
Demonstration using perl:
$ perl -pe 's/\b(\d+)(?:st|nd|rd|th)\b/$1/g' <<EOF
> Mark, 23rd street, New Hampshire
> I live on the 7th avenue
> No match here...
> azoiu32rdzeriuoiu
> EOF
Mark, 23 street, New Hampshire
I live on the 7 avenue
No match here...
azoiu32rdzeriuoiu
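For comparison, the same substitution can be sketched in Python's re module (an assumption for illustration; the answer itself uses perl), with the same four input lines:

```python
import re

# Same pattern as the perl one-liner: digits captured in group 1,
# suffix in a non-capturing group, word boundaries on both sides.
ORDINAL = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b")

lines = [
    "Mark, 23rd street, New Hampshire",
    "I live on the 7th avenue",
    "No match here...",
    "azoiu32rdzeriuoiu",
]

# Replace each match with the captured number only.
cleaned = [ORDINAL.sub(r"\1", line) for line in lines]

for line in cleaned:
    print(line)
```

The last line stays unchanged because there is no word boundary before "32" or after "rd" inside that run of letters and digits.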
Try using this regex:
(\d+)(?:st|nd|rd|th)
I don't know ruby. In PHP I would use something like:
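preg_replace('/(\d+)(?:st|nd|rd|th)\b/', '$1', 'South 2nd Street');
to remove the suffix. (The \b keeps the space after the suffix; matching a literal trailing space would delete it along with the suffix.)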
To remove the ordinal:
/(\d+)(?:st|nd|rd|th)\b/$1/
You must capture the number so you can replace the match with it. You can capture the ordinal or not, it doesn't matter unless you want to output it somewhere else.
http://www.regular-expressions.info/javascriptexample.html