how to use regex to grab the eighth word

New to Regex
Examples I've seen show searching for very specific exceptions, i.e. specific letter combos.
What I want is to grab the 8th word no matter what comes before, no matter what those words are.
So the spaces are what designate 'words'.
A sample line would be:
Sep 20 11:13:18 10.50.3.100 Sep 20 11:13:15 DC1ASM1.dcl.greendotcorp.com Blah Blah Blah
I want to extract the host name, in this case "DC1ASM1.dcl.greendotcorp.com", which is always preceded by "Month, Day, Timestamp, IP, Month, Day, Timestamp" pattern.
Thanks
Rex

I'm not 100% sure what version or flavor of regex you're using, so I'll avoid the look-behind and use a non-capturing group instead:
^(?:\S+?\s){7}(\S+)
That anchors to the beginning of the line, skips 7 consecutive repetitions of [any character but whitespace, 1+ times] followed by [one single whitespace character], and then captures the eighth word in group 1.
You can be more specific about "words" by using \w instead of \S if you so choose, though.
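As a quick check, here is a minimal sketch of the pattern above applied to the sample line from the question. Python's re module and the variable names are assumptions for illustration; the question does not name a language.

```python
import re

# Sample line from the question.
line = ("Sep 20 11:13:18 10.50.3.100 Sep 20 11:13:15 "
        "DC1ASM1.dcl.greendotcorp.com Blah Blah Blah")

# Skip seven "word + one whitespace character" units, then capture the eighth word.
pattern = re.compile(r"^(?:\S+?\s){7}(\S+)")

match = pattern.search(line)
eighth_word = match.group(1) if match else None
print(eighth_word)
```

If the line has fewer than eight words, search returns None, so the sketch falls back to None rather than raising.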

This expression will capture the host name in the named group HostName. It assumes there are always only single spaces.
^([^ ]+ ){7}(?<HostName>[^ ]+)
To handle multiple spaces use the following expression.
^([^ ]+ +){7}(?<HostName>[^ ]+)
To also support tabs use the following expression.
^([^ \t]+[ \t]+){7}(?<HostName>[^ \t]+)
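A rough sketch of the tab-tolerant expression in Python, for illustration only. Python's re module writes named groups as (?P<name>...) rather than (?<name>...), so the group syntax is adapted here; the function name extract_host and the test line with irregular spacing are made up for this example.

```python
import re

# Same idea as the expression above, with Python's named-group spelling
# and a non-capturing group for the seven skipped words.
host_pattern = re.compile(r"^(?:[^ \t]+[ \t]+){7}(?P<HostName>[^ \t]+)")

def extract_host(line):
    """Return the eighth space/tab-separated word, or None if there is none."""
    match = host_pattern.match(line)
    return match.group("HostName") if match else None

# Hypothetical input mixing double spaces and a tab between the words.
messy = "Sep  20 11:13:18\t10.50.3.100 Sep 20   11:13:15 DC1ASM1.dcl.greendotcorp.com Blah"
print(extract_host(messy))
print(extract_host("too few words"))
```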

Try something like:
regex = "([^\s]+\s+){7}(?<eighthword>[^\s]+)"

Related

Regular expression search everything before a certain separator

I'm trying to create a regular expression that can find strings between two separators. I have log data that looks like this:
1234 ^||^ 5678 ^||^ 127.0.0.1 ^|x|x|^
It's like a CSV, although the data is separated with ^||^ and the lines are terminated by ^|x|x|^. I have no control over this, this is the way the data is being sent to us by a third party.
I'm trying to capture all the data between the separators. I came up with this regex using a positive lookahead for either the separator or the line end:
[^\^]+(?=(\s\^\|\|\^\s|\s\^\|x\|x\|\^))
This comes close, but the problem is that as soon as ^ appears in the text, there is no match. If I replace the [^\^]+ with .+, the regex becomes too greedy and matches everything up until the last field, including the separators itself.
What would I need to change to match everything between the ^||^ separators, including ^?

If your language supports positive lookbehind (e.g. PCRE), you can use this one; otherwise you can use @degant's:
(?<=^|\^\|\|\^\s).+?(?=\s\^\|x?\|x?\|?\^)
Explanation
(?<=^|\^\|\|\^\s) Preceded by the start anchor or by ^||^ and a whitespace character
.+? At least one character, as few as possible
(?=\s\^\|x?\|x?\|?\^) Followed by a whitespace character, then ^|, optional x, |, optional x, optional |, ^

How about the below regex, which will capture anything (including text that contains ^ or even |):
(.+?)(?:\s\^\|x?\|x?\|?\^\s?)
and using capturing group 1 to get just the text that you are looking for.
For test string 1^2|34 ^||^ 56|7|8 ^||^ 6^9 ^|x|x|^
it extracts 1^2|34, 56|7|8 and 6^9
EDIT: Improvements as pointed out by @stej4n.
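For what it's worth, the capturing-group version can be tried in Python with re.findall, which returns the contents of group 1 for every match. This is only a sketch; the variable names are invented, and the test string is the one quoted above.

```python
import re

# Lazily capture a field, then consume the separator (^||^) or terminator (^|x|x|^).
field_pattern = re.compile(r"(.+?)(?:\s\^\|x?\|x?\|?\^\s?)")

record = "1^2|34 ^||^ 56|7|8 ^||^ 6^9 ^|x|x|^"

# With a single capturing group, findall returns a list of that group's text.
fields = field_pattern.findall(record)
print(fields)
```

The capturing form is used here because Python's re module requires fixed-width lookbehind, so the (?<=^|...) variant would not compile there.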

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces ? So extract the text/number after the date until the next whitespace ? All I know is that the date will always have the same format and there will always be a space after the text and then a space after the number I want to extract.
Can someone help me out on this ?

Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
NOTE: the \K and capture group approaches allow 1 or more spaces after the date and are thus more flexible.
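A small sketch of the capture-group approach in Python, using both sample lines from the question. The helper name value_after_date is made up for this example. The forward slashes are left unescaped because Python patterns have no / delimiters, and the standard re module does not accept \K, so only the group variant is shown.

```python
import re

# Date-like prefix, one or more spaces, then the next run of non-whitespace in group 1.
after_date = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

def value_after_date(line):
    """Return the token that follows the date, or None if no date is found."""
    match = after_date.search(line)
    return match.group(1) if match else None

print(value_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(value_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```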

I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1

Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3

Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Regular expression to match text after 3 character from string

I have a string like
January
February
March
I want a regex which matches only uary (January), ruary (February) and ch (March), i.e. the string after the first 3 characters.
I have tried this: [a-zA-Z]{1,3}(.*?)$
It's working, but it gives the match in a group. I don't want a group; I want the whole match to be just that text.

Your regex is actually the kind of thing that would be used for this ($ aside), and the "uary" and what not would be called with $1.
(?<=[a-zA-Z]{3}).*(?=\s|$) will do in non-javascript languages, without any capture groups.
https://regex101.com/r/iV0tR3/1

Use lookbehind.
(?<=^[a-zA-Z]{3}).*
or \K
^[a-zA-Z]{3}\K.*
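Here is a hedged sketch of the lookbehind variant in Python. The lookbehind is fixed-width (three letters), which Python's re module accepts; the \K form is not shown because the standard re module does not support it. The names used are illustrative.

```python
import re

# Match everything that follows exactly three leading letters; no capture group needed.
rest_after_three = re.compile(r"(?<=^[a-zA-Z]{3}).*")

results = {}
for month in ("January", "February", "March"):
    match = rest_after_three.search(month)
    # group(0) is the whole match, i.e. the text after the first three letters.
    results[month] = match.group(0) if match else None

print(results)
```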

You can use:
\b\w{3}
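\b is a word boundary, followed by 3 word characters (alphanumerics or underscore). Note that this matches the first three characters themselves, so you would replace the match with an empty string to keep the rest.
Here's a demo: https://regex101.com/r/dX8eC7/1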

Matching Conditions in Regex

Just a note upfront: I'm a bit of a regex newbie. Perhaps a good answer to this question would involve linking me to a resource that explains how these sorts of conditions work :)
Let's say that I have a street name, like 23rd St or 5th St. I'd like to get rid of the trailing "th", "rd", "nd", and "st". How can this be done?
Right now I have the expression: (st|nd|rd|th). The problem with this is that it will also match street names that contain a "st", "nd", "rd", or "th". So what I really need is a conditional match that looks for a minimum of one number before itself (i.e. 1st and not street).
Thank you!

It sounds like you just want to match the ordinal suffix (st|nd|rd|th), yes?
If your regex engine supports it, you could use a lookbehind assertion.
/(?<=\d)(st|nd|rd|th)/
That matches (st|nd|rd|th) only if preceded by a digit \d, but the match does not capture the digit itself.
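To illustrate the lookbehind idea, a minimal Python sketch follows. The function name strip_ordinals is hypothetical, and Python is only one possible engine; the question does not say which one is in use.

```python
import re

# Match an ordinal suffix only when the character before it is a digit.
ordinal_suffix = re.compile(r"(?<=\d)(?:st|nd|rd|th)")

def strip_ordinals(text):
    """Remove st/nd/rd/th when directly preceded by a digit."""
    return ordinal_suffix.sub("", text)

print(strip_ordinals("23rd St"))
print(strip_ordinals("West 1st Street"))
```

Because the digit sits inside the lookbehind, it is not part of the match, so replacing with an empty string leaves the number in place.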

What you really want are anchors.
Try and replace globally:
\b(\d+)(?:st|nd|rd|th)\b
with the first group.
Explanation:
\b --> matches a position where either a word character (digit, letter, underscore) is followed by a non word character (none of the previous group), or the reverse;
(\d+) --> matches one or more digits, and capture them in first group ($1);
(?:st|nd|rd|th) --> matches any of st, etc... without capturing it ((?:...) is a non capturing group);
\b --> see above.
Demonstration using perl:
$ perl -pe 's/\b(\d+)(?:st|nd|rd|th)\b/$1/g' <<EOF
> Mark, 23rd street, New Hampshire
> I live on the 7th avenue
> No match here...
> azoiu32rdzeriuoiu
> EOF
Mark, 23 street, New Hampshire
I live on the 7 avenue
No match here...
azoiu32rdzeriuoiu
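The same substitution can be sketched in Python with re.sub, where \1 in the replacement stands for the first group. This mirrors the perl demonstration above using the same four input lines; the variable names are just for illustration.

```python
import re

# Word boundary, digits captured in group 1, ordinal suffix, word boundary.
ordinal = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b")

lines = [
    "Mark, 23rd street, New Hampshire",
    "I live on the 7th avenue",
    "No match here...",
    "azoiu32rdzeriuoiu",
]

# Replace each match with just the captured number.
cleaned = [ordinal.sub(r"\1", line) for line in lines]
for line in cleaned:
    print(line)
```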

Try using this regex:
(\d+)(?:st|nd|rd|th)
I don't know ruby. In PHP I would use something like:
preg_replace('/(\d+)(?:st|nd|rd|th)\b/', '$1', 'South 2nd Street');
to remove suffix

To remove the ordinal:
/(\d+)(?:st|nd|rd|th)\b/$1/
You must capture the number so you can replace the match with it. You can capture the ordinal or not, it doesn't matter unless you want to output it somewhere else.
http://www.regular-expressions.info/javascriptexample.html

Matching on repeated substrings in a regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begins and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.

Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.

You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
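The question asks about the perl flavor, but the same backreference syntax also works in other engines. As a rough sketch, here it is in Python against the examples from the question; the function name starts_and_ends_alike is invented for this example.

```python
import re

# Capture the first three characters, allow anything in between,
# then require the same three characters again at the end of the line.
same_ends = re.compile(r"^(.{3}).*\1$")

def starts_and_ends_alike(line):
    """True if the line begins and ends with the same 3-character sequence."""
    return same_ends.search(line) is not None

for sample in ("abcabc", "xyz abc xyz", "abc123"):
    print(sample, starts_and_ends_alike(sample))
```

With this pattern the two occurrences cannot overlap, so a line needs at least six characters to match.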

For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.

This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.
