Evaluate second regex after one regex pass - regex

I want to grab values after one regex has matched. The sample is:
My test string is ##[FirstVal][SecondVal]##
I want to grab FirstVal and SecondVal.
I have tried the pattern \#\#(.*?)\#\#, but it only returns [FirstVal][SecondVal].
Is it possible to evaluate the result of one regex and then apply another regex to it?

In .NET, you may use a capture stack to grab all the repeated captures.
A regex like
##(?:\[([^][]*)])+##
will find ##, then match and capture any number of substrings of the form [ + text-with-no-brackets + ]. Each text-with-no-brackets is pushed onto the CaptureCollection associated with capture group 1.
See the regex demo online
In C#, you would use the following code:
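using System;
using System.Linq;
using System.Text.RegularExpressions;

var s = "My test string is ##[FirstVal][SecondVal]##";
// Collect every capture recorded for group 1 across all matches.
var myvalues = Regex.Matches(s, @"##(?:\[([^][]*)])+##")
    .Cast<Match>()
    .SelectMany(m => m.Groups[1].Captures
        .Cast<Capture>()
        .Select(t => t.Value));
Console.WriteLine(string.Join(", ", myvalues));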
See the C# demo
Note that you can do something similar in Python with the PyPI regex module.

The details depend on the programming language you are using, since regex modules vary slightly. You didn't specify a language, so I used Python for my solution. You can use two capture groups after the leading #'s, escaping the square brackets so that the regex treats them as literal characters (i.e. \[(\w+)\]). In Python's re module, \w is shorthand for [a-zA-Z0-9_].
import re
data = "##[FirstVal][SecondVal]##"
x = re.search(r'##\[(\w+)\]\[(\w+)\]', data)
print(x.groups())
Which prints ('FirstVal', 'SecondVal')
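The two-pass idea from the question (match once, then apply a second regex to the result) can also be sketched with Python's standard re module alone. This is only an illustration, not the approach of either answer above; the helper name extract_bracketed is made up for the example, and it assumes the bracketed values never contain brackets themselves.

```python
import re

def extract_bracketed(text):
    """Return the bracketed values found between ## ... ## delimiters."""
    values = []
    # Pass 1: isolate everything between the ## markers.
    for outer in re.finditer(r"##(.*?)##", text):
        # Pass 2: pull each [value] out of the isolated span.
        values.extend(re.findall(r"\[([^\][]*)\]", outer.group(1)))
    return values

print(extract_bracketed("My test string is ##[FirstVal][SecondVal]##"))
```

Unlike the fixed two-group pattern, this handles any number of bracketed values between the markers.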

Related

Building a Regex String - Any assistance provided

I'm very new to regex. I understand its purpose, but I'm still struggling to fully comprehend how to use it. I'm trying to build a regex to pull A8OP2B out of the following (or whatever gets dumped in that 5th group).
{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}
The other items in the line above will change in character length, so I cannot rely on the 51st to the 56th character. The value I want will always be the 5th group in quotation marks, though.
I've tried building up various regex strings, but it's still mostly a foreign language to me and I still have much reading to do on it.
Could anyone provide me a working example with the above, so I can reverse engineer and understand better?
Thanks
Demo 1: Assign the JSON to a variable, then use either dot or bracket notation.
Demo 2: Using a regex is not recommended, but here's one in JavaScript:
/\b(\w{6})(?=","RfKey":)/g
How the first match works:
\b is a word boundary: here it sits between the non-word character " and the word character A, and consumes nothing
(\w{6}) is a capture group that consumes six word characters: A8OP2B
(?=","RfKey":) is a lookahead: it requires ","RfKey": to follow, but consumes nothing
Demo 1
var obj = {"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}};
var dataDot = obj.RfReceived.Data;
var dataBracket = obj['RfReceived']['Data'];
console.log(dataDot);
console.log(dataBracket)
Demo 2
Note: The string below contains 3 consecutive objects of the same shape, so 3 matches are expected.
var rgx = /\b(\w{6})(?=","RfKey":)/g;
var str = `{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}},{"RfReceived":{"Sync":8080,"Low":102,"High":1200,"Data":"PFN07U","RfKey":"None"}},{"RfReceived":{"Sync":7580,"Low":471,"High":360,"Data":"XU89OM","RfKey":"None"}}`;
var res = str.match(rgx);
console.log(res);
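For comparison, the "parse it, don't regex it" approach of Demo 1 looks much the same in Python. This is a minimal sketch that assumes the payload arrives as a JSON string; the helper name get_data is invented for the example.

```python
import json

raw = '{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}'

def get_data(payload):
    """Parse the JSON text and return the RfReceived.Data field."""
    return json.loads(payload)["RfReceived"]["Data"]

print(get_data(raw))
```

Because the lookup is by key, it keeps working when the other fields change length or order.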

Regex in a String [Python]

So, there is this string:
str_utf = u'(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))'
From which I have to extract the values of HOST, PORT and SERVICE_NAME.
I used the following regex for all three respectively:
re_exp1 = re.search(r"HOST=\w+.\w+.\w+.\w+", str_utf)
re_exp2 = re.search(r"(PORT=[1-9][0-9]*)", str_utf)
re_exp3 = re.search(r"(SERVICE_NAME=\w+_\w+)", str_utf)
And it gives me following output:
HOST=172.16.102.46
PORT=1521
SERVICE_NAME=SPA1_HJY
Of course, I can remove "HOST=", "PORT=" and "SERVICE_NAME=" from the obtained results and be left with only the values.
But is there a better regex I can use here that will give only the values?
Hope this makes sense. :-)
You can use a positive lookbehind in a Python regex to require a pattern immediately before the part you capture.
An example pattern for your first regex could be (with the dots escaped so that they match only a literal .):
"(?<=HOST=)(\w+\.\w+\.\w+\.\w+)"
Here (?<=HOST=) is a positive lookbehind. There are also negative lookbehinds, as well as positive and negative lookaheads.
A useful website I use to test regex patterns is:
https://regexr.com/
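A minimal sketch of how the lookbehind idea extends to all three values. The helper value_of is hypothetical, and the [^()]+ value pattern assumes that values never contain parentheses, which holds for the sample string but is not guaranteed in general.

```python
import re

conn = "(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))"

def value_of(key, text):
    """Return the text following 'key=' up to the next parenthesis, or None."""
    # The lookbehind requires 'key=' before the match without including it.
    # re.escape guards against keys containing regex metacharacters.
    m = re.search(r"(?<=" + re.escape(key) + r"=)[^()]+", text)
    return m.group(0) if m else None

print(value_of("HOST", conn), value_of("PORT", conn), value_of("SERVICE_NAME", conn))
```

Python's re module only accepts fixed-width lookbehinds, which is why the key is inserted as a literal rather than as an alternation of different-length keys.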
Use a dict comprehension in combination with
(?P<key>\w+)=(?P<value>[^()]+)
In Python:
import re
rx = re.compile(r'(?P<key>\w+)=(?P<value>[^()]+)')
string = u'(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))'
result = {m.group('key'): m.group('value') for m in rx.finditer(string)}
print(result['HOST'], result['PORT'], result['SERVICE_NAME'])
Which yields
172.16.102.46 1521 SPA1_HJY
See a demo for the regular expression on regex101.com.
Assuming each of these pieces of information appears only once and always in the same order, I would use a single regex as follows:
HOST=(?P<host>(?:\d+\.){3}\d+).*PORT=(?P<port>\d+).*SERVICE_NAME=(?P<serviceName>\w+)
Note the following improvements:
host search: the . are escaped, otherwise they'd match any character; the \w is restricted to \d instead (you could also use [\d.]+ to match the whole IP address more concisely)
port search: since you're extracting rather than validating, I didn't bother checking that the port doesn't start with a 0 (which I'm not sure would pose a problem anyway)
service name search: I didn't bother validating that the service name has a _ in the middle, for the same reason (note that \w matches underscores)
the three pieces of information are matched in one pass by the regex, which defines 3 named groups: "host", "port" and "serviceName"
You can use the regex with re.search(pattern, input), then access the 3 values with the .group(groupName) method on the resulting object:
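import re

# The sample string from the question.
input_string = "(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))"
# Raw string, so the backslashes reach the regex engine unchanged.
patternStr = r"HOST=(?P<host>(?:\d+\.){3}\d+).*PORT=(?P<port>\d+).*SERVICE_NAME=(?P<serviceName>\w+)"
result = re.search(patternStr, input_string)
if result:
    print("host : " + result.group("host"))
    print("port : " + result.group("port"))
    print("serviceName : " + result.group("serviceName"))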

Cannot get `Regex::replace()` to replace a numbered capture group

I'm porting a pluralizer to Rust and I'm having some difficulty with regular expressions. I can't get the Regex::replace() method to replace a numbered capture group as I would expect. For example, the following displays an empty string:
let re = Regex::new("(m|l)ouse").unwrap();
println!("{}", re.replace("mouse", "$1ice"));
I would expect it to print "mice", as it does in JavaScript (or Swift, Python, C# or Go)
var re = RegExp("(m|l)ouse")
console.log("mouse".replace(re, "$1ice"))
Is there some method I should be using instead of Regex::replace()?
Examining the Inflector crate, I see that it extracts the first capture group and then appends the suffix to the captured text:
if let Some(c) = rule.captures(&non_plural_string) {
if let Some(c) = c.get(1) {
return format!("{}{}", c.as_str(), replace);
}
}
However, given that replace works in every other language I've used regular expressions in, I would expect it to work in Rust as well.
As mentioned in the documentation:
The longest possible name is used. e.g., $1a looks up the capture group named 1a and not the capture group at index 1. To exert more precise control over the name, use braces, e.g., ${1}a.
And
Sometimes the replacement string requires use of curly braces to
delineate a capture group replacement and surrounding literal text.
For example, if we wanted to join two words together with an
underscore:
let re = Regex::new(r"(?P<first>\w+)\s+(?P<second>\w+)").unwrap();
let result = re.replace("deep fried", "${first}_$second");
assert_eq!(result, "deep_fried");
Without the curly braces, the capture group name first_ would be used,
and since it doesn't exist, it would be replaced with the empty
string.
You want re.replace("mouse", "${1}ice")
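For comparison only (this is not part of the Rust answer): Python's re.sub offers an explicitly delimited group reference too, \g<1>, which plays a role similar to ${1} in Rust's regex crate when the reference is followed directly by literal text. The helper name pluralize_mouse is made up for this sketch.

```python
import re

def pluralize_mouse(word):
    # \g<1> delimits the group number explicitly, much like ${1} in Rust.
    return re.sub(r"(m|l)ouse", r"\g<1>ice", word)

print(pluralize_mouse("mouse"))
print(pluralize_mouse("louse"))
```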

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
The (?i) flag makes the pattern case insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, use this regex and replace the match with $1 (or maybe \1, depending on the language):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is to replace the match of the pattern below with an empty string:
(?i)^[a-z\d]\.
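A minimal sketch of that last option in Python, assuming each input string starts with the single alphanumeric section followed by a dot. The names PREFIX and strip_first_section are invented for the example, and re.IGNORECASE stands in for the inline (?i) flag.

```python
import re

# One alphanumeric character followed by a literal dot, at the start only.
PREFIX = re.compile(r"^[a-z\d]\.", re.IGNORECASE)

def strip_first_section(s):
    """Remove the leading '<alphanumeric>.' section if present."""
    return PREFIX.sub("", s, count=1)

for item in ["A.A2.ABC.2014071", "3.M1.MMB.2014071"]:
    print(strip_first_section(item))
```

Strings that do not start with that section are returned unchanged.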
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
Use this regex:
\w\.[A-Z]\d\.[A-Z]{3}\.\d{7}
Use the above regex like this:
String[] in = {
    "A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
// The dots are escaped so that they match only a literal '.'.
Pattern p = Pattern.compile("\\w\\.[A-Z]\\d\\.[A-Z]{3}\\.\\d{7}");
for (String s : in) {
    Matcher m = p.matcher(s);
    while (m.find()) {
        // Drop the first two characters (the leading section and its dot).
        System.out.println("Result: " + m.group().substring(2));
    }
}
Live demo: http://ideone.com/tns9iY

Parsing of a string with the length specified within the string

Example data:
029Extract this specific string. Do not capture anything else.
In the example above, I would like to capture the first n characters immediately after the 3-digit entry that defines the value of n, i.e. the 29 characters "Extract this specific string."
I can do this within a loop, but it is slow. I would like (if it is possible) to achieve this with a single regex statement instead, using some kind of backreference. Something like:
(\d{3})(.{\1})
With perl, you can do:
my $str = '029Extract this specific string. Do not capture anything else.';
$str =~ s/^(\d+)(.*)$/substr($2,0,$1)/e;
say $str;
output:
Extract this specific string.
You cannot do it with a single regex, but you can use the position where the regex stopped matching as the starting point for substr. For example, in JavaScript you can do something like this: http://jsfiddle.net/75Tm5/
var input = "blahblah 011I want this, and 029Extract this specific string. Do not capture anything else.";
var regex = /(\d{3})/g;
var matches;
while ((matches = regex.exec(input)) != null) {
alert(input.substr(regex.lastIndex, matches[0]));
}
This returns both strings:
I want this
Extract this specific string.
Depending on what you really want, you can modify the regex to match only numbers at the beginning of a line, to stop after the first match, etc.
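The same regex-plus-slicing idea can be sketched in Python. This is an illustration under stated assumptions: the function name extract_length_prefixed is made up, the prefix is assumed to be exactly 3 digits, and, unlike the JavaScript loop above, scanning resumes after each extracted payload so that digits inside a payload are not mistaken for a new prefix.

```python
import re

LENGTH_PREFIX = re.compile(r"\d{3}")

def extract_length_prefixed(text):
    """Collect every payload announced by a 3-digit length prefix."""
    results = []
    pos = 0
    while True:
        m = LENGTH_PREFIX.search(text, pos)
        if m is None:
            break
        n = int(m.group(0))
        start = m.end()
        results.append(text[start:start + n])
        # Resume after the payload so digits inside it are not read as a prefix.
        pos = start + n
    return results

sample = "blahblah 011I want this, and 029Extract this specific string. Do not capture anything else."
print(extract_length_prefixed(sample))
```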
Are you sure you need a regex?
From https://stackoverflow.com/tags/regex/info:
Fools Rush in Where Angels Fear to Tread
The tremendous power and expressivity of modern regular expressions
can seduce the gullible — or the foolhardy — into trying to use
regular expressions on every string-related task they come across.
This is a bad idea in general, ...
Here's a Python three-liner:
foo = "029Extract this specific string. Do not capture anything else."
substr_len = int(foo[:3])
print(foo[3:substr_len + 3])
And here's a PHP three-liner:
$foo = "029Extract this specific string. Do not capture anything else.";
$substr_len = (int) substr($foo,0,3);
echo substr($foo, 3, $substr_len);