Using regex to find data in between certain data [duplicate] - regex

How can I extract a substring from within a string in Ruby?
Example:
String1 = "<name> <substring>"
I want to extract substring from String1 (i.e. everything within the last occurrence of < and >).

"<name> <substring>"[/.*<([^>]*)/,1]
=> "substring"
No need to use scan, if we need only one result.
No need to use Python's match, when we have Ruby's String[regexp,#].
See: http://ruby-doc.org/core/String.html#method-i-5B-5D
Note: str[regexp, capture] → new_str or nil

String1.scan(/<([^>]*)>/).last.first
scan creates an array which, for each <item> in String1 contains the text between the < and the > in a one-element array (because when used with a regex containing capturing groups, scan creates an array containing the captures for each match). last gives you the last of those arrays and first then gives you the string in it.
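For comparison, the same "last bracketed item" idea can be sketched in Python. This is only an illustration alongside the Ruby answers; the helper name last_bracketed is made up for the example.

```python
import re


def last_bracketed(text):
    """Return the text inside the last <...> pair, or None if there is none."""
    # With a single capturing group, findall returns a list of the captured strings.
    items = re.findall(r"<([^>]*)>", text)
    return items[-1] if items else None


print(last_bracketed("<name> <substring>"))  # substring
```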

You can use a regular expression for that pretty easily…
Allowing spaces around the word (but not keeping them):
str.match(/< ?([^>]+) ?>\Z/)[1]
Or without the spaces allowed:
str.match(/<([^>]+)>\Z/)[1]

Here's a slightly more flexible approach using the match method. With this, you can extract more than one string:
s = "<ants> <pants>"
matchdata = s.match(/<([^>]*)> <([^>]*)>/)
# Use 'captures' to get an array of the captures
matchdata.captures # ["ants","pants"]
# Or use raw indices
matchdata[0] # whole regex match: "<ants> <pants>"
matchdata[1] # first capture: "ants"
matchdata[2] # second capture: "pants"

A simpler scan would be:
String1.scan(/<(\S+)>/).last # => ["substring"] (a one-element array; append .first to get the string)

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest way to filter the list of strings would be to use Js.Re.test (https://rescript-lang.org/docs/manual/latest/api/js/re#test_), and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
let break = ref(false);
while (!break.contents) {
switch (Js.Re.exec_ (re, searchStr)) {
| Some(result) => {
let match = Js.Re.captures(result)[0];
Js.log2("Matching: ", match)
}
| None => {
break := true;
}
}
}
};
input -> escapeRegExCtrl
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with leading and trailing .*):
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for Js.Re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using a regex replace (shown here in JavaScript syntax). The special characters in a regex are [ \ ^ $ . | ? * + ( ). Suitably quoting these (all except *, which is handled in the second step) for replace:
str.replace(/[\[\\\^\$\.\|\?\+\(\)]/g, '\\$&')
The character class is just all those special characters quoted. The $& in the replacement specification says to insert whatever matched, and the \\ before it puts a literal backslash in front of it.
Then pass that result to a second replace for the * to .*? transformation.
str.replace(/\*+/g, '.*?')
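The two steps described above (quote the regex metacharacters, then turn each run of * into a lazy wildcard) can be sketched in Python as follows. This is a rough illustration rather than the ReasonML solution; glob_to_regex and glob_filter are hypothetical helper names, and re.escape is used in place of a hand-written character class.

```python
import re


def glob_to_regex(pattern):
    """Build a regex source in which only '*' is special; all else is literal."""
    # Split on runs of '*', escape each literal piece, then rejoin with '.*?'.
    pieces = re.split(r"\*+", pattern)
    return ".*?".join(re.escape(piece) for piece in pieces)


def glob_filter(pattern, candidates):
    """Keep the candidates that contain a match for the glob pattern."""
    regex = re.compile(glob_to_regex(pattern))
    return [c for c in candidates if regex.search(c)]


print(glob_filter("test^ing*[foo", ["test^ing?123[foo", "testing123foo"]))
```

Splitting before escaping avoids having to exclude * from the escaping step, since the * characters are already gone by the time re.escape runs.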

golang replacing substring of regexp

I am trying to find all occurrences of the following regex \%\%comp\.[^%]+\%\% and replace them with \%comp\.[^%]+\% (i.e., removing one % from each end).
What is the easiest way to do this in Go, aside from using FindAllIndex matches and cleaning up the string in reverse order?
You can use Regexp.ReplaceAll method for that. Example:
re := regexp.MustCompile(`\%(\%comp\.([^%]+)\%)\%`)
fmt.Printf("%s\n", re.ReplaceAll([]byte("test%%comp.test%%"), []byte("$1")))
>>> OUTPUT: test%comp.test%
Note that $n in the replacement string expands to the text matched by the nth capture group of the regexp.
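The same capture-and-reinsert idea, sketched in Python for comparison with the Go answer. Python writes the group reference as \1 rather than $1; the function name strip_outer_percent is invented for this example.

```python
import re

# Group 1 is the inner %comp.<name>% that should survive the substitution.
PATTERN = re.compile(r"%(%comp\.[^%]+%)%")


def strip_outer_percent(text):
    """Replace every %%comp.x%% with %comp.x% by keeping only group 1."""
    return PATTERN.sub(r"\1", text)


print(strip_outer_percent("test%%comp.test%%"))  # test%comp.test%
```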

re.sub (python) substitute part of the matched string

I have a series of strings which are identifiable by finding a substring "p" tag followed by at least two CAPITAL letters.
Input:
<p>JIM <p>SALLY <p>ROBERT <p>Eric
I want to change the "p" tag to an "i" tag if it's followed by those two capital letters (so not the last one, 'Eric').
Desired output:
<i>JIM <i>SALLY <i>ROBERT <p>Eric
I've tried this using regular expressions in Python:
import re
Mytext = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"
changeTags = re.sub('<p>[A-Z]{2}', '<i>' + re.search('<p>[A-Z]{2}', Mytext).group()[-2:], Mytext)
print changeTags
But the output uses the "i" tag + JI in every instance, rather than iterating through to use SA and then RO in entries 2 and 3.
<i>JIM <i>JILLY <i>JIBERT <p>Eric
I believe the problem is that I don't understand the .group() method properly. Can anyone advise what I've done wrong?
Thank you.
Another way using look-ahead assertion:
re.sub(r'<p>(?=[A-Z]{2,})','<i>',Mytext)
Your inner re.search is only evaluated once, and its result is passed as one of the arguments to re.sub. It therefore cannot capture every pair of capital letters, only the first one. This means the approach itself cannot work; the problem is not merely your understanding of groups.
Furthermore, using groups is unnecessary.
You need to capture the capital letters using parentheses, and reference the capture as \1 in the substitution expression:
re.sub('<p>([A-Z]{2})', r'<i>\1', Mytext)
\1 here means: replace with the substring matched by the first (...) in the regular expression.
Note the leading r in front of the substitution string, to make it raw.
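Putting the two suggestions side by side, a small Python 3 sketch using the sample text from the question. The variable names with_group and with_lookahead are chosen only for this illustration.

```python
import re

text = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"

# Backreference: capture the two capitals and put them back after the new tag.
with_group = re.sub(r"<p>([A-Z]{2})", r"<i>\1", text)

# Look-ahead: only the tag is matched, so nothing has to be put back.
with_lookahead = re.sub(r"<p>(?=[A-Z]{2})", "<i>", text)

print(with_group)      # <i>JIM <i>SALLY <i>ROBERT <p>Eric
print(with_lookahead)  # <i>JIM <i>SALLY <i>ROBERT <p>Eric
```

Both calls run the substitution once per match, which is what the single re.search call in the question could not do.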

Regex match between two tags or else match everything

I have a list of email addresses which take various forms:
john#smith.com
Angie <angie#aol.com>
"Mark Jones" <mark#jones.com>
I'm trying to cut only the email portion from each. Ex: I only want the angie#aol.com from the second item in the list. In other words, I want to match everything between < and > or match everything if it doesn't exist.
I know this can be done in 2 steps:
Capture on (?<=\<)(.*)(?=\>).
If there is no match, use the entire text.
But now I'm wondering: Can both steps be reduced into one simple regular expression?
What about:
(?<=\<).*(?=\>)|^[^<]*$
^[^<]*$ will match the entire string, but only if it doesn't contain a <. And that's OR'ed (|) with what you had.
Explanation:
^ - start of string
[^<] - not-< character
[^<]* - zero or more not-< characters
$ - end of string
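A quick way to check the alternation above against the three sample entries, sketched in Python. The inputs are copied as they appear in the question; email_part is a hypothetical helper name, and the fallback to the whole entry is an assumption about what should happen if neither alternative matches.

```python
import re

# Either the text between < and >, or the whole string if it contains no '<'.
EMAIL_PART = re.compile(r"(?<=<).*(?=>)|^[^<]*$")


def email_part(entry):
    """Return the bracketed part if present, otherwise the whole entry."""
    match = EMAIL_PART.search(entry)
    # Fall back to the raw entry if neither alternative matched.
    return match.group(0) if match else entry


for entry in ["john#smith.com", "Angie <angie#aol.com>", '"Mark Jones" <mark#jones.com>']:
    print(email_part(entry))
```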
You're after an exclusive or operator.
(\<.+\#.+\..+\>) matches only those email addresses inside <>...
(\<.+\#.+\..+\>)|(.+) matches everything, rather than matching the first alternative of the OR and then skipping the second.
Depending on what language you are using to implement this regex, you might be able to use an inbuilt exclusive or operator. Otherwise, you might need to put a bit of logic in there to use the string if no matches are found. E.g. (pseudo type code):
string = 'your data above';
if( regex_finds_match ( '(\<.+\#.+\..+\>)', string ) ) {
// found match, use the match
str_to_use = regex_match(es);
} else {
// didn't find a match:
str_to_use = string;
}
It is possible, but your current logic is probably simpler. Here is what I came up with, email address will always be in the first capturing group:
^(?:.*<|)(.*?)(?:>|$)
Example: http://rubular.com/r/8tKHaYYY4T

Capturing a repeated group

I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need to type them in, but they can be there if the user has done a copy & paste.
BTW, in .NET you can replace the [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable character class subtraction:
[A-Z\d-[IOUW]]
If you just want to match 3 groups like that, why don't you use this pattern 3 times in your regex and just use the captured subgroups 1, 2 and 3 to form the new string? Note that the {8} goes inside the parentheses, so that each group captures all eight characters:
([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})
In PHP I would return (I don't know .NET)
return "$1 $2 $3";
I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
string pattern = @"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
Regex re = new Regex(pattern);
Match m = re.Match(input);
if (m.Success)
foreach (Capture c in m.Groups["group"].Captures)
Console.WriteLine(c.Value);
}
After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = @"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you were using the {4} at the end for. This will find the matches based on what you want; then, using the MatchCollection, you can access each match to rebuild the string.
Why use Regex? If the groups are always split by a -, can't you use Split()?
Sorry if this isn't what you intended, but if your string always has the hyphen separating the groups, then instead of using regex couldn't you use the String.Split() method?
Dim stringArray As Array = someString.Split("-")
What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion: validate the charset in a first step, then split and parse in a separate method based on what you expect. If this is in a web site/app then you can use the ASP Regex validation on the front end, then break it up on the back end.
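The "validate first, then extract" suggestion can be sketched in Python. Unlike .NET, Python's re module keeps only the last capture of a repeated group, so the sketch validates the whole key with one pattern and then pulls out the blocks with findall. The names ALPHABET, BLOCK, KEY and split_key are invented for this illustration, and the block count of 3 and length of 8 are taken from the question's example.

```python
import re

# Character set from the question: no I, O, U or W.
ALPHABET = "ABCDEFGHJKLMNPQRSTVXYZ0123456789"
BLOCK = "[" + ALPHABET + "]{8}"

# Whole-key pattern: three blocks, each optionally followed by a hyphen.
KEY = re.compile(r"\s*(?:" + BLOCK + r"-?){3}\s*")


def split_key(text):
    """Return the list of 8-character blocks, or None if the key is invalid."""
    if not KEY.fullmatch(text):
        return None
    # The key is known to be valid here, so every block found is a real one.
    return re.findall(BLOCK, text)


print(split_key("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
```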
If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value
Mike,
You can use character set of your choice inside character group. All you need is to add "+" modifier to capture all groups. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])
You can use this pattern:
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?")
But you will need to filter out empty strings from resulting array.
Citation from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.