Using a regular expression to extract full words from text

I have been parsing data and have a string like:
"Scottish Premier League (click here to open|close this coupon)"
I would like to extract "Scottish Premier League", with "Scottish" as matching group 1 and "Premier League" as matching group 2.
Please show me how to do that with a regular expression.
MatchCollection matchCol = reg.Matches("Scottish Premier League (click here to open|close this coupon)");

If you just want to match each specific word then your regex could be something like:
(Scottish) (Premier League)
If you want to match the first word then the next two:
([\w]+) ([\w]+ [\w]+)
Another way of writing this that accounts for multiple spaces between words is:
(\w+)\s+(\w+\s+\w+)
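For comparison, here is a minimal Python sketch of that last pattern applied to the string from the question. The variable names are only illustrative, and the question itself appears to use C#, so treat this as a demonstration of the groups rather than drop-in code.

```python
import re

text = "Scottish Premier League (click here to open|close this coupon)"

# Group 1: the first word; group 2: the next two words.
# \s+ tolerates one or more whitespace characters between words.
m = re.match(r"(\w+)\s+(\w+\s+\w+)", text)
first, rest = (m.group(1), m.group(2)) if m else (None, None)

print(first)
print(rest)
```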

/(Scottish) (Premier League)/

Basic and direct:
$s = "Scottish Premier League (click ... coupon)";
$s =~ m/(Scottish) (Premier League)/;
print "Match groups one and two: '$1' '$2'\n";
You probably wanted more generalized matching:
$s = "Generalized Matching on a string (click ... coupon)";
$s =~ m/^(\S+)\s(.+)\s+\(click/;
print "Match groups one and two: '$1' '$2'\n";
These are Perl; be more specific next time.
Also, help yourself, use a tool, like RegexBuddy or Expresso.

Given that you only gave one string to which the regex would be applied, it is hard to tell if this solution would work for your various other cases:
/^(\w*) (.*) \(/
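A quick way to check this pattern against other inputs is a small Python sketch like the one below. The second sample string is made up for illustration; only the first comes from the question.

```python
import re

# Pattern from the answer: first word, then everything up to " (".
pattern = re.compile(r"^(\w*) (.*) \(")

samples = [
    "Scottish Premier League (click here to open|close this coupon)",
    "English Football League Championship (click here)",  # hypothetical input
]

results = []
for s in samples:
    m = pattern.match(s)
    results.append(m.groups() if m else None)

print(results)
```

Because (.*) is greedy, a string with more than one " (" would put everything up to the last one into group 2.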

Related

How to filter unwanted parts of a PowerShell string with Regex and replace?

I am confused about the workings of PowerShell's -replace operator in regards to its use with regex. I've looked for documentation online but can't find any that goes into more detail than basic use: it looks for a string, and replaces that string with either another string (if defined) or nothing. Great.
I want to do the same thing as the person in this question where the user wants to extract a simple program name from a complex string. Here is the code that I am trying to replicate:
$string = '% O0033(SUB RAD MSD 50R III) G91G1X-6.4Z-2.F500 G3I6.4Z-8.G3I6.4 G3R3.2X6.4F500 G91G0Z5. G91G1X-10.4 G3I10.4 G3R5.2X10.4 G90G0Z2. M99 %'
$program = $string -replace '^%\sO\d{4}\((.+?)\).+$','$1'
$program
SUB RAD MSD 50R III
As you can see the output string is the string that the user wants, and everything else is filtered out. The only difference for me is that I want a string that is composed of six digits and nothing else. However when I attempt to do it on a string with my regex, I get this:
$string2 = '1_123456_1'
$program2 = $string2 -replace '(\d{6})','$1'
$program2
1_123456_1
There is no change. Why is this happening? What should my code be instead? Furthermore, what is the $1 used for in the code?
The -replace operator only replaces the part of the string that matches. A capture group matches some subset of the match (or all of it), and the capture group can be referenced in the replace string as you've seen.
Your second example only ever matches that part you want to extract. So you need to ensure that you match the whole string but only capture the part you want to keep, then make the replacement string match your capture:
$string2 = '1_123456_1'
$program2 = $string2 -replace '\d_(\d{6})_\d','$1'
$program2
How you match "the rest of the string" is up to you; it depends on what could be contained in it. So what I did above is just one possible way. Other possible patterns:
1_(\d{6})_1
[^_]*_(\d{6})_[^_]*
^.*?(\d{6}).*?$
Capturing groups (pairs of unescaped parentheses) in the pattern are used to allow easy access to parts of a match. When you use -replace on a string, all non-overlapping substrings are matched, and these substrings are replaced/removed.
In your case, -replace '(\d{6})', '$1' means you replace the whole match (that is equal to the first capture, since you enclosed the whole pattern with a capturing group) with itself.
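The same behaviour can be reproduced outside PowerShell. Here is a small Python sketch (re.sub standing in for -replace, which is an analogy rather than an exact equivalent) that contrasts replacing a match with itself against matching the whole string and keeping only the capture.

```python
import re

s = "1_123456_1"

# The pattern matches only the six digits, so they are replaced by themselves.
unchanged = re.sub(r"(\d{6})", r"\1", s)

# The pattern matches the whole string, so everything except group 1 is dropped.
extracted = re.sub(r"^.*?(\d{6}).*?$", r"\1", s)

print(unchanged)
print(extracted)
```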
Use -match in cases like yours when you want to get a part of the string:
PS> $string2 = '1_123456_1'
PS> $string2 -match '[0-9]{6}'
PS> $Matches[0]
123456
The -match will get you the first match, just what you want.
Use -replace when you need to get a modified string back (reformatting a string, inserting/removing chars and suchlike).
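For readers working in another language, the "search, don't replace" idea looks roughly like this in Python, where re.search plays the role of -match and group(0) the role of $Matches[0]. This is a sketch of the idea, not PowerShell behaviour.

```python
import re

s = "1_123456_1"

# Find the first run of exactly six consecutive digits in the string.
m = re.search(r"[0-9]{6}", s)
digits = m.group(0) if m else None

print(digits)
```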

How can I match multiple hits between 2 delimiters?

Hi, my fellow RegEx'ers ;)
I'm trying to match multiple texts, each enclosed in a pair of quotes.
Here's my text:
...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...
So far, I managed to get the correct matches with two patterns, executed after each other:
(?s)someArray\[\].*?=.*?\[(.*?)\]
this extracts the text between the two brackets and on the result, I use this one:
"(.*?)"
This is working just fine, but I'd love to get the texts with a single regex.
Any help is highly appreciated!
Consider using \G. With its help, you may match "(.*?)" preceded by either someArray[] = [ or previous match of "(.*?)" (well, strictly speaking previous match of entire regex). Then just grab first capture groups from all matches:
(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"
Demo: https://regex101.com/r/eBQWdU/3
How you grab the first capture groups depends on the language you're using the regex in. For example, in PHP you may do something like this:
preg_match_all('/(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"/', $input, $matches);
$array_items = $matches[1];
Demo: https://ideone.com/mZgU1x
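Not every engine supports \G. Python's built-in re module, for one, does not. In that case the two-pass approach from the question is still a reasonable fallback. A minimal sketch, using the sample text from the question:

```python
import re

source = '''...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...'''

# Pass 1: grab everything between the brackets of the someArray assignment.
block = re.search(r"someArray\[\].*?=.*?\[(.*?)\]", source, re.DOTALL)

# Pass 2: pull each quoted string out of that block.
items = re.findall(r'"(.*?)"', block.group(1)) if block else []

print(items)
```

Note that this stops at the first closing bracket, so it assumes none of the quoted strings contains a ].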

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it also selects the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to require some text next to your match without including it in the match, use a zero-length lookaround (lookahead or lookbehind). In your case, you can use a lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without consuming it".
More information on lookarounds can be found in this tutorial.
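To see the lookahead at work, here is a short Python sketch using the line from the question. The whole match, group(0), is the zip code, because the ,y,, part is only checked and never consumed.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# (?=\,y,,) asserts that ",y,," follows, without making it part of the match.
m = re.search(r"(\d)(\s*\d+)*(?=\,y,,)", line)
zipcode = m.group(0) if m else None

print(zipcode)
```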
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract group containing the match.
Here's code in Python:
import re
string = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
# Raw string so the backslashes reach the regex engine unchanged
search = re.search(r"\D*(\d{5})\D*", string)
print(search.group(1))
Output:
13732

Regex position string

For example, I have this string:
$string = 'test***bas';
How can I display text before the stars with Regex?
You could use a regular expression which makes use of Capture Groups. Once that you have matched your input, you could then access the captured group and print the output.
The following pattern
^(.+?)\*\*\*
captures the text before the stars in group 1 (the parentheses create the group). See http://gskinner.com/RegExr/ for testing your regular expressions (there are many ways of testing online).
The language you use around your regular expression will have its own way of accessing captured groups, so you will need to say which language you are using for any further advice.
Example capturing the text before and after the asterisks:
^(.+?)\*\*\*(.+)$
If you also want what is located after the ***, you can use the following:
$string = 'test***bas';
$pattern = '/(.+)\*{3}(.+)/';
preg_match($pattern, $string, $matches);
$matches will contain the results:
$matches[1] will be "test"
$matches[2] will be "bas"
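Since the question doesn't name a language, here is the same idea as a Python sketch, using the before-and-after pattern from the earlier answer.

```python
import re

string = "test***bas"

# Group 1: text before the three asterisks; group 2: text after them.
m = re.match(r"^(.+?)\*\*\*(.+)$", string)
before, after = m.groups() if m else (None, None)

print(before)
print(after)
```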

Regular expression replace a word by a link

I want to write a regular expression that will replace the word Paris with a link, but only where the word is not already part of a link.
Example:
i'm living in Paris, near Paris Gare du Nord, i love Paris.
would become
i'm living.........near Paris..........i love Paris.
This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
Put a link around every "Paris" there is, regardless if there already is another link present.
Find all incorrectly nested links (<a href="...">Paris</a>), and eliminate the inner link.
Regex for step one is dead-simple:
\bParis\b
Regex for step two is slightly more complex:
(<a[^>]+>.*?(?!</a>))<a[^>]+>(Paris)</a>
Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
Find every link (<a[^>]+>), optionally followed by anything that is not itself followed by a closing link (.*?(?!</a>)). Save it into match group 1.
Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
Now look for the word Paris. Save it into match group 2.
Look for a closing link (</a>). Make sure it is there, but don't save it.
Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
The approach assumes these side conditions:
Your input HTML is not horribly broken.
Your regex flavor supports non-greedy quantifiers (.*?) and zero-width negative look-ahead assertions ((?!...)).
You wrap the word "Paris" only in a link in step 1, no additional characters. Every "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
in the <b>capital of France</b>, <a href="">Paris</a>
The surplus link comes from step one, replacement result of step 2 will be:
in the <b>capital of France</b>, Paris
You could search for this regular expression:
(<a[^>]*>.*?</a>)|Paris
This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.
Replace the match with your link only if the capturing group did not match anything.
E.g. in C#:
resultString =
Regex.Replace(
subjectString,
"(<a[^>]*>.*?</a>)|Paris",
new MatchEvaluator(ComputeReplacement));
public String ComputeReplacement(Match m) {
if (m.Groups[1].Success) {
return m.Groups[1].Value;
} else {
return "Paris";
}
}
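The same "match a link or the bare word, then decide in a callback" trick can be sketched in Python with re.sub and a function. The target URL below is a placeholder I made up, and \b around Paris is borrowed from the first answer. Neither is part of this answer's pattern.

```python
import re

# Hypothetical link target; substitute the real URL.
LINK = '<a href="https://example.com/paris">Paris</a>'

def compute_replacement(m):
    # Group 1 is set only when an existing link was matched: keep it untouched.
    if m.group(1) is not None:
        return m.group(1)
    # Otherwise the bare word matched: wrap it in a link.
    return LINK

def link_paris(html):
    return re.sub(r"(<a[^>]*>.*?</a>)|\bParis\b", compute_replacement, html,
                  flags=re.DOTALL)

sample = 'I live in Paris, near <a href="/gare">Paris Gare du Nord</a>, I love Paris.'
result = link_paris(sample)
print(result)
```

Existing links are consumed as a whole by the first alternative, so any "Paris" inside them is never seen by the second.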
The traditional answer to this kind of question: use a real HTML parser. Regular expressions aren't good at operating in context, and HTML is complex: an a tag may or may not have attributes, in any order, and the link body may or may not contain further HTML.
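As a rough illustration of the parser route, here is a sketch built on Python's standard html.parser. It re-emits the markup as it goes and only links "Paris" in text that is not inside an a element. The class name, function name and URL are all made up for the example. It ignores comments, doctypes and processing instructions, so it is a starting point rather than a complete solution.

```python
import re
from html.parser import HTMLParser

# Hypothetical link target; substitute the real URL.
LINK = '<a href="https://example.com/paris">Paris</a>'

class ParisLinker(HTMLParser):
    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.a_depth = 0  # how many <a> elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text outside any link gets the replacement.
        if self.a_depth == 0:
            data = re.sub(r"\bParis\b", LINK, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def link_paris(html):
    parser = ParisLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = 'I live in Paris, near <a href="/gare">Paris Gare du Nord</a>, I love Paris.'
result = link_paris(sample)
print(result)
```

Because the replacement only ever touches text nodes, attribute values such as an image title are left alone.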
Regular expression:
!(<a.*</a>.*)*Paris!isU
Replacement:
$1Paris
$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use, the syntax may differ slightly.
This should replace all occurrences of "Paris" with the link in the replacement. It just checks whether all opening a tags were closed before "Paris".
PHP example:
<?php
$s = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
$regex = '!(<a.*</a>.*)*Paris!isU';
$replace = '$1Paris';
$result = preg_replace( $regex, $replace, $s);
?>
Addition:
This is not the best solution. One situation where this regex won't work is when you have an img tag that is not inside an a element. If the title attribute of that image is "Paris", that "Paris" will be replaced too, which is not what you want.
Nevertheless I see no way to solve your problem completely with a simple regular expression.
If you weren't limited to using Regular expressions in this case, XSLT is a good choice for a language in which you can define this replacement, because it 'understands' XML.
You define two templates:
One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.
$pattern = 'Paris';
$text = 'i\'m living in Paris, near Paris Gare du Nord, i love Paris.';
// 1. Collect two arrays:
// $matches[1] - existing links that contain the keyword
// $matches[2] - bare occurrences of the keyword
preg_match_all('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)|(?<!\pL)('.$pattern.')(?!\pL)#', $text, $matches);
// Is there at least one keyword outside a link? Find the first one.
$number = array_search($pattern, $matches[2]);
// A bare keyword exists, so do the replacement
if ($number !== FALSE) {
// Swap every existing link for a temporary placeholder
foreach ($matches[1] as $k => $tag) {
$text = preg_replace('#(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)#', 'KEYWORD_IS_ALREADY_LINK_'.$k, $text, 1);
}
// Replace the remaining bare keywords with the link
$text = preg_replace('/(?<!\pL)('.$pattern.')(?!\pL)/', ''.$pattern.'', $text);
// Put the original links back
foreach ($matches[1] as $k => $tag) {
$text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k, $tag, $text);
}
// Done
echo $text;
}
Regexes don't replace. Languages do.
Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with each name. Here's the easiest substitution I can imagine doing with a single regex (Perl is used for the replacement syntax):
s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i
Proper names might work better:
s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/gi;
Of course "Baton Rouge" would become two links for:
Baton
Rouge
In Perl, you can do this:
# Longest names first, so a longer name wins over a shorter one it contains
my $barred_list_of_cities
= join( '|'
, sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) } keys %url_for_city_of
);
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;
But again, it's the language that implements a set of operations for regexes; regexes don't do a thing. (In reality, this is such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, so that you just need to load the hash.)
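For anyone not using Perl, the lookup-table idea can be sketched in Python as below. The dictionary contents and URLs are invented for the example. Sorting the names longest first and escaping them with re.escape are my own precautions, so that "New York" is tried before "York" and punctuation in a name is matched literally.

```python
import re

# Hypothetical data: city name -> URL.
url_for_city = {
    "York": "https://example.com/york",
    "New York": "https://example.com/new-york",
    "Paris": "https://example.com/paris",
}

# Longest names first so a longer name wins over a shorter one it contains.
names = sorted(url_for_city, key=lambda n: (-len(n), n))
city_re = re.compile("|".join(re.escape(n) for n in names))

def link_cities(text):
    # Look up the URL for whichever city name matched.
    return city_re.sub(
        lambda m: f'<a href="{url_for_city[m.group(0)]}">{m.group(0)}</a>',
        text,
    )

result = link_cities("From York to New York")
print(result)
```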