Regex to extract first 3 words from a string

I am trying to remove all the words except the first 3 from a string (using TextPad).
Example value: This is the string for testing.
I want to keep just the first 3 words (This is the) from the string above and remove all the other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?

Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which matches the whole line but captures only the first three words, in capturing group 1. For your replacement string, use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
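For comparison, here is a minimal sketch of the same replacement in Python, assuming Python's re module rather than TextPad or .NET. The helper name first_three_words is made up for illustration.

```python
import re

# Match the whole line, but capture only the first three whitespace-separated words.
FIRST_THREE = re.compile(r"^((?:\S+\s+){2}\S+).*", re.MULTILINE)

def first_three_words(text):
    # Replace each matching line with the contents of capturing group 1.
    return FIRST_THREE.sub(r"\1", text)

print(first_three_words("This is the string for testing."))  # This is the
```

A single line with fewer than three words is left unchanged. Note that \s also matches newlines, so in multi-line input a short line can pick up words from the line after it.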

EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^(\w+\s+){3}(?:[^\n\r]+)$
Just move the ?: from the first to the second grouping.
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)

Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alnum:]_] for \w, [^[:alnum:]_] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes are necessary to protect the regex from the shell (double quotes would work here too, but offer weaker protection; the alternative is to backslash every character in the regex that the shell treats as a metacharacter), or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
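The same two patterns can be tried outside the shell. This is a rough Python sketch, assuming Python's re module in place of grep and sed; the sample line is the one from the question.

```python
import re

line = "This is the string for testing."

# First three space-separated tokens (like grep -Eo): the leftmost match wins.
first_three = re.search(r"[^ ]+[ ]+[^ ]+[ ]+[^ ]+", line).group(0)

# Skip one token plus its separator, then capture the next three (like the sed command).
second_to_fourth = re.sub(r"[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*", r"\1", line, count=1)

print(first_three)       # This is the
print(second_to_fourth)  # is the string
```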
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
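The difference between repeating inside and outside the capturing parentheses is easy to see in a quick experiment. Here is a small sketch using Python's re module (an assumption; TextPad and sed behave analogously for this case):

```python
import re

line = "This is the string for testing."

# Repetition outside the capturing parentheses: group 1 holds only the last repetition.
outside = re.match(r"(\w+\s+){3}", line)

# Repetition inside the capturing parentheses: group 1 holds all three words.
inside = re.match(r"((?:\w+\s+){3})", line)

print(repr(outside.group(1)))  # 'the '
print(repr(inside.group(1)))   # 'This is the '
```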

Related

Separate backreference followed by numeric literal in perl regex

I found this related question : In perl, backreference in replacement text followed by numerical literal
but it seems entirely different.
I have a regex like this one
s/([^0-9])([xy])/\1 1\2/g
(note the whitespace between \1 and 1 in the replacement)
But that whitespace comes up in the substitution.
How do I not get the whitespace in the substituted string without having perl confuse the backreference to \11?
For example:
15+x+y changes to 15+ 1x+ 1y.
I want to get 15+1x+1y.

\1 is a regex atom that matches what the first capture captured. It makes no sense to use it in a replacement expression. You want $1.
$ perl -we'$_="abc"; s/(a)/\1/'
\1 better written as $1 at -e line 1.
In a string literal (including the replacement expression of a substitution), you can delimit $var using curlies: ${var}. That means you want the following:
s/([^0-9])([xy])/${1}1$2/g
The following is more efficient (although gives a different answer for xxx):
s/[^0-9]\K(?=[xy])/1/g

Just put braces around the number:
s/([^0-9])([xy])/${1}1${2}/g
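The same ambiguity exists in other languages. As an illustration only, here is how it looks in Python, where \g<1> plays the role of Perl's ${1}. Python's re module has no \K, so a lookbehind stands in for it in the zero-width variant.

```python
import re

text = "15+x+y"

# \g<1> delimits the group number, so the following literal 1 is not read as part of it.
with_groups = re.sub(r"([^0-9])([xy])", r"\g<1>1\2", text)

# Zero-width variant: insert 1 between a non-digit and x or y without consuming either.
zero_width = re.sub(r"(?<=[^0-9])(?=[xy])", "1", text)

print(with_groups)  # 15+1x+1y
print(zero_width)   # 15+1x+1y
```

On the input xxx the two differ: the capturing version consumes the characters it matches and gives x1xx, while the zero-width version gives x1x1x.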

Explain difference between foo and \(foo\)

grep "http:\/\/.*\.jpg" index.html -o
Gives me text starting with http:// and ending with .jpg
So does: grep "http:\/\/.*\.\(jpg\)" index.html -o
What is the difference? And is there any condition where this might fail?
I got it to match either jpg,png or gif using this regex:
http:\/\/.*\.\(jpg\|png\|gif\)
Something to do with backreference or regex grouping that I read. Cannot understand this part \(\)

Grouping is used for two purposes in regular expressions.
One use is to delimit parts of the regexp when using alternatives. That's the case in your third regexp: it allows you to say that the extension can be any of jpg, png, or gif.
The other use is for backreferences. This allows you to refer to the text that matched an earlier part of the regexp later in the regexp. For instance, the following regexp matches any letter that appears twice in a row:
\([a-z]\)\1
The backreference \1 means "match whatever matched the first group in the regexp".
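Both uses can be tried in a few lines. This is a sketch in Python, where groups are written with plain parentheses instead of grep's \( \); the example.com URL is made up for illustration.

```python
import re

# (jpg|png|gif) limits the alternation to the extension; without the group,
# the | would split the whole pattern into three alternatives.
url = re.search(r"http://.*\.(jpg|png|gif)", "see http://example.com/a.png here")

# ([a-z])\1 matches any lowercase letter that appears twice in a row.
doubled = re.findall(r"([a-z])\1", "letter book")

print(url.group(0))  # http://example.com/a.png
print(doubled)       # ['t', 'o']
```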

In grep's default syntax, \( and \) are metacharacters, i.e. they don't match literal parentheses but mean something to grep.
From here:
Grouping is performed with backslashes followed by parentheses ‘(’, ‘)’.
So in the above, the \( and \) delimit a group of alternatives separated by \|, i.e. your filename extensions.

deciphering vim regex

I'm playing with vim-ruby indent, and there are some pretty complex regexes there:
" Regex used for words that, at the start of a line, add a level of indent.
let s:ruby_indent_keywords = '^\s*\zs\<\%(module\|class\|def\|if\|for' .
\ '\|while\|until\|else\|elsif\|case\|when\|unless\|begin\|ensure' .
\ '\|rescue\):\@!\>' .
\ '\|\%([=,*/%+-]\|<<\|>>\|:\s\)\s*\zs' .
\ '\<\%(if\|for\|while\|until\|case\|unless\|begin\):\@!\>'
With the help of vim documentation I deciphered it to mean:
start-of-line <any number of spaces> <start matching> <beginning of a word> /atom
<one of provided keywords> <colon character> <nothing> <end of word> ...
I have some doubts:
Is it really matching ':'? Doesn't seem to work like that, but I don't see anything about colon being some special character in regexes.
why is there \zs (start of the match) and no \ze (end of the match)?
what does \%() do? Is it just some form of grouping?

:\@! says to match only if there is not a colon, if I read it correctly. I am not familiar with the Ruby syntax that this is matching against, so this may not be quite correct. See :help /\@! and the surrounding topics for more info on lookarounds.
You can have a \zs with no \ze, it just means that the end of the match is at the end of the regex. The opposite is also true.
\%(\) just creates a grouping just as \(\) would except that the group is not available as a backreference (like would be used in a :substitute command).
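To experiment with the non-capturing group and the "not followed by a colon" check outside Vim, here is a rough Python analogue. It is a sketch only: the keyword list is shortened, the sample lines are made up, and \zs is left out because Python's re has no direct equivalent.

```python
import re

# (?:...) groups without capturing, like Vim's \%(...\);
# (?!:) is a negative lookahead, like Vim's :\@!.
keyword = re.compile(r"^\s*\b(?:module|class|def|if|while)(?!:)\b")

# Hypothetical sample lines: two real keywords, one "if:" key, one longer word.
for line in ["def foo", "  if x > 1", "if: true", "define x"]:
    print(repr(line), bool(keyword.search(line)))
```

The first two lines match. The line starting with if: is rejected by the lookahead, and define is rejected by the word boundary.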

You can check whether it matches ':' or any other string by copying the regex and using it to perform a search with / on the code you are working on. Using :set incsearch may help you see what is being matched while you type the regex.
\zs and \ze don't change what the pattern has to match, but determine which part of the matched text counts as the match for commands and functions such as :s and substitute(). You can check that by performing searches with / and the 'incsearch' option set: start a search for a string in the text, which will be highlighted, then add \zs and \ze to change the highlight on the matched text. There is no need to "close" \zs and \ze, as one can discard only the start or only the end of the match.
It is a form of grouping that is not saved in temporary variables for use with \1, \2 or submatch(), as stated in :h \%():
\%(\) A pattern enclosed by escaped parentheses.
Just like \(\), but without counting it as a sub-expression. This
allows using more groups and it's a little bit faster.

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?

There are several issues here.
parens in vim regexen are not for capturing -- you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0+ ( and then a literal )". You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (+ means "1 or more of the previous")
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
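For readers more used to PCRE-style syntax, the corrected substitution can be sketched in Python, where parentheses capture without backslashes. The sample input shop_items is made up.

```python
import re

# Python parentheses capture without backslashes;
# .+ requires at least one character after shop_.
result = re.sub(r"shop_(.+)", r"shop_\1 wp_\1", "shop_items", count=1)

print(result)  # shop_items wp_items
```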

If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)

If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(.*)/shop_\1 wp_\1/

@Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.

Need to test for a "\\" (backslash) in this Reg Ex

Currently I use this reg ex:
"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b"
It was just brought to my attention that the text that I use this against could contain a "\" (backslash). How do I add this to the expression?

Add |\\ inside the group, after the \d for instance.

This expression could be simplified if you're also allowing the underscore character in the second capture register, and you are willing to use metacharacters. That changes this:
([a-zA-Z]|\d){2,13}
into this ...
([\w]{2,13})
and you can also add a test for the backslash character with this ...
([\w\x5c]{2,13})
which makes the regex just a tad easier to eyeball, depending on your personal preference.
"\bI([\x20]{1,2})([\w\x5c]{2,13})\b"
See also:
WP Metacharacter
Metacharacters
Shorthand character class

Both @slavy13 and @dreftymac give you the basic solution with pointers, but...
You can use \d inside a character class to mean a digit.
You don't need to put blank into a character class to match it (except, perhaps, for clarity, though that is debatable).
You can use [:alpha:] inside a character class to mean an alpha character, [:digit:] to mean a digit, and [:alnum:] to mean an alphanumeric (specifically not including underscore, unlike \w). Note that these character classes might mean more characters than you expect; think of accented characters and non-arabic digits, especially in Unicode.
If you want to capture the whole of the information after the space, you need the repetition inside the capturing parentheses.
Contrast the behaviour of these two one-liners:
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/'
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/'
Given the input line "I a123", the first prints "3" and the second prints "a123". Obviously, if all you wanted was the last character of the second part of the string, then the original expression is fine. However, that is unlikely to be the requirement. (Obviously, if you're only interested in the whole lot, then using '$&' gives you the matched text, but it has negative efficiency implications.)
I'd probably use this regex as it seems clearest to me:
m/\bI( {1,2})([[:alnum:]\\]{2,13})\b/
Time for the obligatory plug: read Jeff Friedl's "Mastering Regular Expressions".

As I pointed out in my comment to slavy's post, \b does not behave as intended right after a backslash, because a backslash is not a word character. So my suggestion is
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?:[^\w\\]|$)/
I assumed that you wanted to capture the whole 2-13 characters, not just the first one that applies, so I adjusted my RE.
You can make the last group a lookahead if the engine supports it and you don't want to consume the character after the match. That would look like:
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/
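The effect of the trailing \b can be checked with a short experiment. Below is a sketch in Python, with two assumptions: Python's re does not support \p{IsAlnum}, so [a-zA-Z\d\\] stands in for it, and the sample text is made up.

```python
import re

text = "I ab\\ next"  # the characters: I, space, a, b, backslash, space, next

# With a trailing \b the engine backtracks and drops the final backslash.
with_boundary = re.search(r"\bI( {1,2})([a-zA-Z\d\\]{2,13})\b", text)

# With the lookahead the backslash stays in the capture.
with_lookahead = re.search(r"\bI( {1,2})([a-zA-Z\d\\]{2,13})(?=[^\w\\]|$)", text)

print(with_boundary.group(2))   # ab
print(with_lookahead.group(2))  # ab followed by a backslash
```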
