Regex in a String [Python] - regex

So, there is this string:
str_utf = u'(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))'
From which I have to extract the values of HOST, PORT and SERVICE_NAME.
I used the following regex for all three respectively:
re_exp1 = re.search(r"HOST=\w+.\w+.\w+.\w+", str_utf)
re_exp2 = re.search(r"(PORT=[1-9][0-9]*)", str_utf)
re_exp3 = re.search(r"(SERVICE_NAME=\w+_\w+)", str_utf)
And it gives me following output:
HOST=172.16.102.46
PORT=1521
SERVICE_NAME=SPA1_HJY
Of course, I can remove "HOST=", "PORT=" and "SERVICE_NAME=" from the obtained results and be left with only values;
But is there a better regex I can use here that will give only the values?
Hope this makes sense. :-)

You can use a positive lookbehind in Python Regex to look for a pattern before the capture group.
An example pattern for your first regex (with the dots escaped so that they match only a literal ".") could be:
"(?<=HOST=)(\w+\.\w+\.\w+\.\w+)"
Where (?<=HOST=) is a positive lookbehind. There are also negative lookbehinds as well as positive and negative lookaheads.
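As a rough sketch of how the lookbehind idea might be applied to all three fields: the helper name extract and the variable conn are illustrative, not from the question, and the value is assumed to run up to the next parenthesis.

```python
import re

conn = "(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))"

def extract(key, text):
    """Return the value that follows 'key=' up to the next parenthesis, or None."""
    # (?<=...) asserts that 'key=' precedes the match without being part of it.
    m = re.search(r"(?<=" + re.escape(key) + r"=)[^()]+", text)
    return m.group(0) if m else None

host = extract("HOST", conn)
port = extract("PORT", conn)
service = extract("SERVICE_NAME", conn)
print(host, port, service)
```

Python's re module only accepts fixed-width lookbehinds, which is satisfied here because each key is a literal string.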
A useful website I use to test regex patterns is:
https://regexr.com/

Use a dict comprehension in combination with
(?P<key>\w+)=(?P<value>[^()]+)
In Python:
import re
rx = re.compile(r'(?P<key>\w+)=(?P<value>[^()]+)')
string = u'(DESCRIPTION=(ENABLE=broken)(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.102.46)(PORT=1521))(CONNECT_DATA=(UR=A)(SERVICE_NAME=SPA1_HJY)))'
result = {m.group('key'): m.group('value') for m in rx.finditer(string)}
print(result['HOST'], result['PORT'], result['SERVICE_NAME'])
Which yields
172.16.102.46 1521 SPA1_HJY
See a demo for the regular expression on regex101.com.

Assuming each of these values appears only once and always in the same order, I would use a single regex as follows:
HOST=(?P<host>(?:\d+\.){3}\d+).*PORT=(?P<port>\d+).*SERVICE_NAME=(?P<serviceName>\w+)
Note the following improvements:
host search: the . characters are escaped, otherwise they would match any character; \w is narrowed to \d (you could also use [\d.]+ to match the whole IP address more concisely)
port search: since you're extracting rather than validating, I didn't bother checking that the port doesn't start with a 0 (which I'm not sure would pose a problem anyway)
service name search: for the same reason, I didn't bother checking that the service name has a _ in the middle (note that \w matches underscores)
the three values are matched in one pass by the regex, which defines three named groups: "host", "port" and "serviceName"
You can use the regex with re.search(pattern, input), then access the three values by calling the .group(groupName) method on the resulting match object:
import re

# Raw string so the backslashes reach the regex engine unchanged.
patternStr = r"HOST=(?P<host>(?:\d+\.){3}\d+).*PORT=(?P<port>\d+).*SERVICE_NAME=(?P<serviceName>\w+)"
result = re.search(patternStr, input)  # input: the connection string from the question
if result:
    print("host : " + result.group("host"))
    print("port : " + result.group("port"))
    print("serviceName : " + result.group("serviceName"))
You can see it in action here.

Related

Python Regex - How to extract the third portion?

My input is of this format: (xxx)yyyy(zz)(eee)fff, where {x,y,z,e,f} are all digits. The fff part is optional, though.
Input: x = (123)4567(89)(660)
Expected output: Only the eee part, i.e. the number inside the 3rd "()", which is 660 in my example.
I am able to achieve this so far:
re.search(r"\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.
You could find all the matches enclosed in parentheses with findall and print the third one:
import re
n = "(123)4567(89)(660)999"
r = re.findall(r"\(\d*\)", n)
print(r[2])
Output:
(660)
The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'
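A minimal sketch of the two options described in this answer, run against the question's sample input. The variable names are illustrative, and the second input (with a trailing fff part) is an assumed example.

```python
import re

x = "(123)4567(89)(660)"
x_with_tail = "(123)4567(89)(660)999"  # assumed sample with the optional fff part

# Option 1: end anchor. Only works when nothing follows the last (...) group.
at_end = re.search(r"\((\d+)\)$", x)
at_end_tail = re.search(r"\((\d+)\)$", x_with_tail)

# Option 2: consume the 13 characters "(123)4567(89)" without capturing them.
prefix_pattern = re.compile(r"[0-9()]{13}\((\d{3})\)")
prefixed = prefix_pattern.search(x)
prefixed_tail = prefix_pattern.search(x_with_tail)

print(at_end.group(1), at_end_tail, prefixed.group(1), prefixed_tail.group(1))
```

The end anchor stops matching as soon as the optional trailing digits are present, which is why matching the leading parts explicitly is the more robust of the two here.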
If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
x = '(123)4567(89)(660)'
print(re.search(r"(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))
Output:
(660)
Demo on rextester

Evaluate second regex after one regex pass

I want to grab values from the result of a first regex match. The sample is:
My test string is ##[FirstVal][SecondVal]##
I want to grab FirstVal and SecondVal.
I have tried the pattern \#\#(.*?)\#\#, but it only returns [FirstVal][SecondVal].
Is it possible to take the result of one regex and apply another regex to it?
In .NET, you may use a capture stack to grab all the repeated captures.
A regex like
##(?:\[([^][]*)])+##
will find ##, then match and capture any amount of strings like [ + texts-with-no-brackets + ] and all these texts-with-no-brackets will be pushed into a CaptureCollection that is associated with capture group 1.
See the regex demo online
In C#, you would use the following code:
var s = "My test string is ##[FirstVal][SecondVal]##";
var myvalues = Regex.Matches(s, @"##(?:\[([^][]*)])+##")
    .Cast<Match>()
    .SelectMany(m => m.Groups[1].Captures
        .Cast<Capture>()
        .Select(t => t.Value));
Console.WriteLine(string.Join(", ", myvalues));
See the C# demo
Note that you can do a similar thing in Python with the PyPI regex module.
The answer depends on the programming language you are using, since regex modules vary slightly. You didn't specify a language, so I used Python for my solution. You can use two capture groups between the ## delimiters, escaping the square brackets so that they are matched literally (i.e. \[(\w+)\]). In Python's re module, \w stands for a-zA-Z0-9_.
import re
data = "##[FirstVal][SecondVal]##"
x = re.search(r'##\[(\w+)\]\[(\w+)\]', data)
print(x.groups())
Which prints ('FirstVal', 'SecondVal')
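If the number of bracketed values is not known in advance, one possible approach with only the standard re module is the two-pass idea from the question: match the outer ## delimiters first, then run a second regex over that result. This is a sketch under the assumption that the values never contain square brackets; the variable names are illustrative.

```python
import re

data = "My test string is ##[FirstVal][SecondVal]##"

# Pass 1: grab everything between the ## delimiters.
outer = re.search(r"##(.*?)##", data)

# Pass 2: pull each bracketed value out of the first result.
values = re.findall(r"\[([^\[\]]*)\]", outer.group(1)) if outer else []
print(values)
```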

Regexr expression doesn't work in groovy

I'm looking to get 4 digits that will be surrounded by spaces.
e.g. foo 2420 blah
Using regexr I got this pattern: \b\d{4}\b
I translated this to Groovy as
def courseNum = course.text =~ $/\b\d{4}\b/$
System.out.print(courseNum.group())
This returns no matches, even though I am positive the string does contain 4 digits by themselves.
What am I doing wrong?
The .group() you are using causes the java.lang.IllegalStateException: No match found exception. You just need to access the match value via the 0th index, courseNum[0].
Also, I would use a simple slashy string here, since it is sufficient and convenient for defining a regular expression.
def text = "New 7234 pcs"
def courseNum = text =~ /\b\d{4}\b/
print(courseNum[0])
See this Groovy demo
However, since you want to get 4 digits that are surrounded by spaces, you do not have to rely on \b word boundaries. Instead, use lookarounds to require string start/end or whitespace around the 4 digits:
/(?<!\S)\d{4}(?!\S)/
See the regex demo.
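To illustrate the difference between the two patterns, here is a small sketch in Python, whose re module accepts the same syntax for both. The sample strings other than "foo 2420 blah" are assumed for illustration.

```python
import re

word_boundary = re.compile(r"\b\d{4}\b")
whitespace_bounded = re.compile(r"(?<!\S)\d{4}(?!\S)")

samples = ["foo 2420 blah", "x-2420.5", "2420"]

# For each sample, record whether each pattern finds a match.
results = [
    (s, bool(word_boundary.search(s)), bool(whitespace_bounded.search(s)))
    for s in samples
]
for s, wb, ws in results:
    print(repr(s), "word boundary:", wb, "whitespace bounded:", ws)
```

The \b version accepts digits that touch punctuation such as "-" or ".", while the lookaround version only accepts digits delimited by whitespace or the string edges.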
Another good way to do this is with the findAll(regex) method.
def text = "CSE 2443, MATH 5003"
text.findAll(/\b\d{4}\b/).each {
    println it
}
Resulting in ([2443, 5003])
2443
5003
Even if nothing matches, it will not throw an error the way your current code does. findAll returns all matches as a list (empty when there are none) and is therefore safer.
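The same "collect everything, possibly nothing" behaviour can be sketched in Python with re.findall, which also returns an empty list instead of raising when there is no match. The second input string is an assumed example.

```python
import re

pattern = re.compile(r"\b\d{4}\b")

found = pattern.findall("CSE 2443, MATH 5003")
missing = pattern.findall("no course numbers here")

for number in found:
    print(number)
print(len(missing))
```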

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) makes the pattern case-insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, use this regex and replace the match with $1 (or maybe \1, depending on the language):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is to replace the match of the pattern below with an empty string:
(?i)^[a-z\d]\.
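A minimal sketch of both replacement options in Python, where the group reference is written \1 and the (?i) flag is passed as re.IGNORECASE. The variable names are illustrative.

```python
import re

inputs = ["A.A2.ABC.2014071", "3.M1.MMB.2014071"]

# Option 1: match the whole string and keep only the captured remainder.
full = re.compile(r"^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$", re.IGNORECASE)
via_group = [full.sub(r"\1", s) for s in inputs]

# Option 2: match only the leading field and its dot, and replace it with nothing.
prefix = re.compile(r"^[a-z\d]\.", re.IGNORECASE)
via_strip = [prefix.sub("", s) for s in inputs]

print(via_group)
print(via_strip)
```

Option 1 leaves strings that do not fit the full format unchanged, so it doubles as a light validity check; option 2 strips the prefix regardless of what follows.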
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
Use this regex:
\w\.[A-Z]\d\.[A-Z]{3}\.\d{7}
Use the above regex like this:
String[] in = {
    "A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
// Dots are escaped so that they match only a literal ".".
Pattern p = Pattern.compile("\\w\\.[A-Z]\\d\\.[A-Z]{3}\\.\\d{7}");
for (String s : in) {
    Matcher m = p.matcher(s);
    while (m.find()) {
        // Drop the first two characters (the leading field and its dot).
        System.out.println("Result: " + m.group().substring(2));
    }
}
Live demo: http://ideone.com/tns9iY

Get second part of a string using RegEx

I have a string like "first#second", and I wonder how to get the "second" part, without the "#" symbol, as the result of the regex itself rather than as a capture group.
Update: I forgot to add one more special character at the end of the string; the real string is "first#second*".
Simple regex:
/#(.*)$/
If you really don't want it to be a match capture, and you know there's a # in the string but none in the part you want, you can do
/[^#]*$/
and the whole regex is what you want.
If you must use regex, and you insist on not using capturing groups, you can use lookbehind in flavors that support them like this:
(?<=#).*
Or you can also capture just anything but #, to the end of the string, so something like this:
[^#]*$
The capturing group option, of course, is:
#(.*)
 \__/
   1
This matches the # too, but group 1 captures the part that you want.
Lastly, a non-regex alternative may look something like this:
secondPart = wholeString.substring( wholeString.indexOf("#") + 1 )
There may be issues with some of these solutions if # can also appear (perhaps escaped) anywhere else in the string.
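The options above might look as follows in Python; this is only a sketch with illustrative variable names. The last line is an assumption about how the trailing "*" from the question's update should be handled (it stops at the first "*").

```python
import re

whole = "first#second"

lookbehind = re.search(r"(?<=#).*", whole).group()   # whole match, no capture group
tail = re.search(r"[^#]*$", whole).group()           # whole match, no capture group
captured = re.search(r"#(.*)", whole).group(1)       # capture-group variant
sliced = whole[whole.index("#") + 1:]                # non-regex alternative

# Assumed variant for the updated input "first#second*".
without_star = re.search(r"(?<=#)[^*]*", "first#second*").group()

print(lookbehind, tail, captured, sliced, without_star)
```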
References
regular-expressions.info
Lookarounds, Brackets for Capturing, Anchors
/[a-z]+#([a-z]+)/
You can use lookaround to exclude parts of an expression.
http://www.regular-expressions.info/lookaround.html
If you're using Java, you can consider using the Pattern and Matcher classes. Pattern gives you a compiled, optimized version of the regular expression, and Matcher exposes the details of each match.
Matching with Pattern and splitting with String.split give the same result; the first can be comparatively faster.
For example:
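String s = "first#second#third";
String re = "#";
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(s);  // the input must be passed to matcher()
int ms = 0;  // start of the current segment
int me = 0;  // end of the current segment
while (m.find()) {
    System.out.println("start " + m.start() + " end " + m.end() + " group " + m.group());
    me = m.start();
    System.out.println(s.substring(ms, me));
    ms = m.end();
}
// Print the segment after the last "#".
System.out.println(s.substring(ms));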
In other languages, you can also consider using groups and, if the input contains repetitions, back-references.