Regex: find last occurrence of word in string

I need to find the last occurrence of a word in a string (and replace it). So in the following sentence I would be looking for the second "chocolate".
I love milk chocolate but I hate white chocolate.
How can that be achieved with regular expression? Could you please give me some explanation?
Thanks.

If you want to use a regex you could use something like this:
(.*)chocolate
And the replacement string would be:
$1banana
^-- whatever you want
Update: as Lucas pointed out in his comment, you can improve the regex by using:
(.*)\bchocolate\b
This allows you to avoid false positives like chocolateeejojo
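For illustration, here is a minimal Python sketch of the same greedy-prefix idea. The helper name replace_last is made up for this example, and Python's re module is an assumption (the question does not name a language):

```python
import re

def replace_last(text, word, replacement):
    # Greedy (.*) consumes as much as possible, so the literal word that
    # follows can only match at its last occurrence; \b rejects partial hits.
    pattern = r'(.*)\b' + re.escape(word) + r'\b'
    # A function replacement avoids escaping issues in `replacement`.
    return re.sub(pattern, lambda m: m.group(1) + replacement, text,
                  count=1, flags=re.DOTALL)

sentence = "I love milk chocolate but I hate white chocolate."
result = replace_last(sentence, "chocolate", "banana")
print(result)
```

Only the last "chocolate" is touched, because the engine has to backtrack from the end of the string before the word can match.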

PCRE would look like this:
/^(.*)chocolate/$1replace/sm

If you want to match the second occurrence of any distinct word, you may be able to use a backreference, depending on the language and regex implementation you're in.
For example, in sed, you might do the following:
sed 's/\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\)/\1russians\4/'
Breaking this down for easier reading, it looks like this:
s/ - substitute in sed
\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\) - the search RE. Not really so complex....
[[:<:]] and [[:>:]] are word boundaries in BSD sed (GNU sed spells them \< and \>),
[[:alpha:]] is the class of alphabetical characters (words)
\( and \) surround atoms for use in backreferences, in BRE (this is sed, remember)
\1russians\4 - replacement string consists of the first (outer) parenthesized backreference from the RE, followed by the replacement word, followed by the trailing characters.
For example:
$ t="I love milk chocolate but I hate white chocolate."
$ sed 's/\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\)/\1russians\4/' <<<"$t"
I love milk chocolate but I hate white russians.
$ t="In a few years, your twenty may be worth twenty bucks."
$ sed 's/\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\)/\1fifty\4/' <<<"$t"
In a few years, your twenty may be worth fifty bucks.
$
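If sed is not available, the same backreference trick can be sketched in Python. This is only an approximation of the sed command above: \b stands in for [[:<:]] and [[:>:]], [A-Za-z] for [[:alpha:]], and the name replace_repeat is hypothetical:

```python
import re

# Same structure as the sed expression: group 1 = everything up to the
# second occurrence, group 2 = the repeated word, group 3 = trailing text.
REPEATED = re.compile(r'(.*\b([A-Za-z]+)\b.*)\b\2\b(.*)')

def replace_repeat(text, new_word):
    # Rebuild the line from group 1, the new word, and group 3.
    return REPEATED.sub(lambda m: m.group(1) + new_word + m.group(3),
                        text, count=1)

print(replace_repeat("I love milk chocolate but I hate white chocolate.", "russians"))
print(replace_repeat("In a few years, your twenty may be worth twenty bucks.", "fifty"))
```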

Related

Remove dashes surrounded by numbers on both sides

I'm trying to search and replace using regex in TextWrangler (https://gist.github.com/ccstone/5385334, http://www.barebones.com/products/textwrangler/textwranglerpower.html)
I have rows like this
56-84 29 STRINGOFLETTERS -2.54
I'd like to replace the dash in "56-84" with a tab, so I get
56 84 29 STRINGOFLETTERS -2.54
But without replacing the dash in "-2.54"
How do I specifically only remove dashes surrounded by numbers on both sides?
My regex knowledge is extremely small. I tried to find [0-9]-[0-9] and replace with [0-9][0-9], but that didn't work.
Your link says "The PCRE engine (Perl Compatible Regular Expressions) is what BBEdit and TextWrangler use". So hopefully you can use lookaround with your regex.
replace regex:
(?<=\d)-(?=\d)
replace with tab(\t).
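To see the lookaround version in action outside the editor, here is a small Python sketch (Python's re module is an assumption on my part; it supports the same lookbehind and lookahead syntax for this pattern):

```python
import re

row = "56-84 29 STRINGOFLETTERS -2.54"
# The lookbehind and lookahead assert a digit on each side without
# consuming it, so only the dash itself is replaced.
fixed = re.sub(r'(?<=\d)-(?=\d)', '\t', row)
print(repr(fixed))
```

The dash in "-2.54" is left alone because it is preceded by a space, not a digit.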
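If it's plain text, you may not need TextWrangler at all. You can use the Unix sed command. Note that sed does not understand \d, and the digits have to be captured and put back in the replacement:
$ sed -E 's/([0-9])-([0-9])/\1\t\2/g' a.txt > b.txt
(The \t in the replacement is understood by GNU sed; BSD/macOS sed may need a literal tab there instead.)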
You actually need to capture the numbers you want. So the regex would be:
^([0-9]+)-([0-9]+)
I'm assuming here that the numbers start at the beginning of the line. If not, you can remove the ^.
Based on your link, the flavor of regex is PCRE, so backreferences look like \1, and \2 in the replacement pattern. So your replacement pattern simply becomes:
\1\t\2
Here \1 refers to the first group (so the first number) and \2 refers to the second group (so the second number).
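One difference between the capture-group approach and the lookaround approach shows up when dashes share a digit. A rough Python sketch (function names are made up for this example, and the pattern is simplified to one digit on each side with no ^ anchor):

```python
import re

def with_groups(text):
    # Each match consumes the digits around the dash and writes them back.
    return re.sub(r'(\d)-(\d)', r'\1\t\2', text)

def with_lookaround(text):
    # Only the dash is consumed; neighbouring digits are merely checked.
    return re.sub(r'(?<=\d)-(?=\d)', '\t', text)

print(repr(with_groups("1-2-3")))
print(repr(with_lookaround("1-2-3")))
```

With groups, the "2" is consumed by the first match, so the second dash is no longer preceded by an unmatched digit and stays in place. For rows like the one in the question, both behave the same.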

Regex to extract first 3 words from a string

I am trying to replace all the words except the first 3 words from the String (using textpad).
Ex value: This is the string for testing.
I want to extract just the first 3 words ("This is the") from the above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?
Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which captures the first three words into capturing group 1 and matches (without capturing) the rest of the string. For your replace string, you use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
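The same pattern can be tried in Python as a quick check. This is a sketch only: the second input line is invented for the example, and re.MULTILINE plays the role of RegexOptions.Multiline:

```python
import re

text = "This is the string for testing.\nAnother line with several more words."
# re.MULTILINE makes ^ match at the start of every line; '.' does not
# cross newlines by default, so each line is trimmed independently.
trimmed = re.sub(r'^((?:\S+\s+){2}\S+).*', r'\1', text, flags=re.MULTILINE)
print(trimmed)
```

Because \s+ can also match a newline, lines with fewer than three words may need a stricter separator such as [ \t]+.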
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
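As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^((?:\w+\s+){3})(?:[^\n\r]+)$
Move the ?: from the first to the second grouping, and wrap the repeated group in an outer capturing group: a capturing group that is itself repeated keeps only its last iteration, so (\w+\s+){3} alone would capture just the third word.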
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
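Not every engine supports \u in the replacement string. As a rough equivalent, here is a Python sketch that uses the three-group pattern above and capitalizes each captured word in code (the helper names are hypothetical):

```python
import re

def upper_first(word):
    # Mimics \u: only the first character changes case.
    return word[:1].upper() + word[1:]

def first_three_capitalized(line):
    m = re.match(r'^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$', line)
    if m is None:
        return line  # pattern did not match: leave the line alone
    return ' '.join(upper_first(w) for w in m.groups())

print(first_three_capitalized("This is the string for testing."))
```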
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)
Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alnum:]_] for \w, [^[:alnum:]_] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes protect the regex from the shell (double quotes would work here, too, but protect less; another option is to backslash every character in the regex that the shell treats as a metacharacter), or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where they would break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.):
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
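For readers without sed at hand, the two substitutions above (first three tokens, and three tokens starting from the second) can be sketched in Python with the same bracket-only patterns. The sample line is borrowed from the original question:

```python
import re

line = "This is the string for testing."

# First three space-separated tokens, using the {2} form of the regex.
first_three = re.sub(r'([^ ]+([ ]+[^ ]+){2}).*', r'\1', line)

# Three tokens starting from the second: skip one token plus its
# separator before the capturing group begins.
from_second = re.sub(r'[^ ]+[ ]+([^ ]+([ ]+[^ ]+){2}).*', r'\1', line)

print(first_three)
print(from_second)
```

In both cases \1 is the outer group, because group numbers follow the order of the opening parentheses.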

grep for words ending in 'ing' immediately after a comma

I am trying to grep files for lines with a word ending in 'ing' immediately after a comma, of the form:
... we gave the dog a bone, showing great generosity ...
... this man, having no home ...
but not:
... this is a great place, we are having a good time ...
I would like to find instances where the 'ing' word is the first word after a comma. It seems like this should be very doable in grep, but I haven't figured out how, or found a similar example.
I have tried
grep -e ", .*ing"
which matches multiple words after the comma. Commands like
grep -i -e ", [a-z]{1,}ing"
grep -i -e ", [a-z][a-z]+ing"
don't do what I expect--they don't match phrases like my first two examples. Any help with this (or pointers to a better tool) would be much appreciated.
Try ,\s*\S+ing
Matches your first two phrases, doesn't match in your third phrase.
\s means 'any whitespace', * means 0 or more of that, \S means 'any non-whitespace' (capitalizing the letter is conventional for inverting the character set in regexes - works for \b \s \w \d), + means 'one or more' and then we match ing.
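As a quick way to check the pattern against the three sample phrases, here is a small Python sketch (Python is used only as a test bench here; the regex itself is unchanged):

```python
import re

ING_AFTER_COMMA = re.compile(r',\s*\S+ing')

lines = [
    "... we gave the dog a bone, showing great generosity ...",
    "... this man, having no home ...",
    "... this is a great place, we are having a good time ...",
]
# Keep only the lines where the first word after a comma contains "ing".
hits = [line for line in lines if ING_AFTER_COMMA.search(line)]
for line in hits:
    print(line)
```

Appending \b after ing would additionally reject words that merely contain "ing", such as "kingdom".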
You can use the \b token to match on word boundaries (see this page).
Something like the following should work:
grep -e ".*, \b\w*ing\b"
EDIT: Except now I realised that the \b is unnecessary, and .*,\s*\w*ing would work, as Patashu pointed out. My regex-fu is rusty.

How can I remove hashes from inside a string?

I want to transform a line that looks like this:
any text #any text# ===#text#text#text#===#
into:
any text #any text# ===#texttexttext===#
As you can see above I want to remove the # between ===# and ===#
The number of # that are supposed to be removed can be any number.
Can I do this with sed?
Give this a try:
sed 'h;s/[^=]*=*=#\(.*\)/\1/;s/\([^=]\)#/\1/g;x;s/\([^=]*=\+#\).*/\1/;G;s/\n//g' inputfile
It splits the line in two at the first "=#", then deletes all "#" that aren't preceded by an "=", then recombines the lines.
Let me know if there are specific cases where it fails.
Edit:
This version, which is increasingly fragile, works for your new example as well as the original:
sed 'h;s/[^=]*=[^=]*=*=#\(.*\)$/\1/;s/\([^=]\)#/\1/g;x;s/\([^=]*=[^=]*=\+#\).*/\1/;G;s/\n//g' inputfile
sed uses Basic Regular Expressions (BRE) by default (GNU BRE in GNU sed), which lack many features of newer regex engines, such as lookaround, which would be very handy in solving this.
I'd say you'd have to first match ===#\(.\+\)===# (note that GNU BRE use backslashes to denote capturing groups and quantifiers, and also does not support lazy quantifiers). Then remove any # found in the captured group (a literal search/replace would be enough), and then put the result back into the string. But I'm not a Unix guy, so I don't know if/how that could be done in sed.
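The match, clean, and put-back approach described in the last answer is easy to express in a language with function replacements. A minimal Python sketch, assuming the line contains one ===# ... ===# pair (the function name is made up):

```python
import re

def strip_inner_hashes(line):
    # Group 2 is the text between the two ===# markers; remove its '#'
    # characters and reassemble the match from the three groups.
    return re.sub(
        r'(===#)(.*)(===#)',
        lambda m: m.group(1) + m.group(2).replace('#', '') + m.group(3),
        line,
    )

print(strip_inner_hashes("any text #any text# ===#text#text#text#===#"))
```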

Need to test for a "\\" (backslash) in this Reg Ex

Currently I use this reg ex:
"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b"
It was just brought to my attention that the text that I use this against could contain a "\" (backslash). How do I add this to the expression?
Add |\\ inside the group, after the \d for instance.
This expression could be simplified if you're also allowing the underscore character in the second capture register, and you are willing to use metacharacters. That changes this:
([a-zA-Z]|\d){2,13}
into this ...
([\w]{2,13})
and you can also add a test for the backslash character with this ...
([\w\x5c]{2,13})
which makes the regex just a tad easier to eyeball, depending on your personal preference.
"\bI([\x20]{1,2})([\w\x5c]{2,13})\b"
See also:
WP Metacharacter
Metacharacters
Shorthand character class
Both @slavy13 and @dreftymac give you the basic solution with pointers, but...
You can use \d inside a character class to mean a digit.
You don't need to put blank into a character class to match it (except, perhaps, for clarity, though that is debatable).
You can use [:alpha:] inside a character class to mean an alpha character, [:digit:] to mean a digit, and [:alnum:] to mean an alphanumeric (specifically not including underscore, unlike \w). Note that these character classes might mean more characters than you expect; think of accented characters and non-arabic digits, especially in Unicode.
If you want to capture the whole of the information after the space, you need the repetition inside the capturing parentheses.
Contrast the behaviour of these two one-liners:
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/'
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/'
Given the input line "I a123", the first prints "3" and the second prints "a123". Obviously, if all you wanted was the last character of the second part of the string, then the original expression is fine. However, that is unlikely to be the requirement. (Obviously, if you're only interested in the whole lot, then using '$&' gives you the matched text, but it has negative efficiency implications.)
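The same contrast can be reproduced in Python, whose re module also keeps only the last iteration of a repeated capturing group. This is a sketch for comparison, not a replacement for the Perl one-liners:

```python
import re

line = "I a123"

# Repetition outside the parentheses: group 2 is overwritten on every
# iteration and ends up holding only the last character matched.
outside = re.search(r'\bI( {1,2})([a-zA-Z\d\\]){2,13}\b', line)

# Repetition inside the parentheses: group 2 holds the whole run.
inside = re.search(r'\bI( {1,2})([a-zA-Z\d\\]{2,13})\b', line)

print(outside.group(2))
print(inside.group(2))
```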
I'd probably use this regex as it seems clearest to me:
m/\bI( {1,2})([[:alnum:]\\]{2,13})\b/
Time for the obligatory plug: read Jeff Friedl's "Mastering Regular Expressions".
As I pointed out in my comment to slavy's post, a trailing \b will not match directly after a backslash (unless a word character follows), because a backslash is not a word character. So my suggestion is
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?:[^\w\\]|$)/
I assumed that you wanted to capture the whole 2-13 characters, not just the first one that applies, so I adjusted my RE.
You can make the last group a lookahead if the engine supports it and you don't want to consume the character. That would look like:
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/
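To illustrate why the trailing \b matters, here is a Python sketch. Python's re module has no \p{IsAlnum}, so the ASCII class [A-Za-z0-9] is substituted as an assumption; the sample text is invented:

```python
import re

text = 'I ab\\'  # the characters: I, space, a, b, backslash

# With a trailing \b the engine backtracks to the last word character,
# so the final backslash is silently dropped from group 2.
with_b = re.search(r'\bI( {1,2})([A-Za-z0-9\\]{2,13})\b', text)

# The lookahead accepts end of input (or any character that is neither
# a word character nor a backslash), so the backslash is kept.
with_lookahead = re.search(
    r'\bI( {1,2})([A-Za-z0-9\\]{2,13})(?=[^\w\\]|$)', text)

print(with_b.group(2))
print(with_lookahead.group(2))
```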