How to extract parameter names and values using regular expressions - regex

I would like to know how to extract the values of all these parameters.
My regular expression:
([\w]+)(\s*=\s*)(['|"|\w])(.+)['|"|\w]
Parameter names and values that should match:
name='John Doe'
name=John Doe
organization=Acme Widgets Inc.
server=192.0.2.62
port='143'
file="payroll.dat"
DOS=HIGH,UMB
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE RAM
DEVICEHIGH=C:\DOS\ANSI.SYS
FILES=30
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P
When I run my expression on regex101.com, it only finds the first parameter that matches, in this case name='John Doe'.
Desired output is name John Doe
I am having extra trouble understanding how to find and extract parameter names and values without parentheses and equals signs.

Try this:
(\w+)\s*=\s*['"]?([^'"\n]+)
The keyword will be in capture group 1 (there's no need for [] around \w).
There's no need for a capture group around the equal sign.
['"]? allows an optional quote after the equal sign. There's no need to put it in a capture group.
([^'"\n]+) then matches everything that isn't another quote or newline. So it will capture everything until either a quote or the end of the line. This value will be put into group 2.
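For reference, a minimal Python sketch of the same pattern, assuming the parameters arrive as one multi-line string. The names PARAM_RE and sample are illustrative and not from the question. re.findall returns every match as a (name, value) tuple, much as a tester typically needs the global flag to show more than the first match.

```python
import re

# Pattern from the answer above: name, "=", optional opening quote, then the value
PARAM_RE = re.compile(r"""(\w+)\s*=\s*['"]?([^'"\n]+)""")

# Illustrative subset of the question's input (raw string keeps the backslashes)
sample = r"""name='John Doe'
name=John Doe
port='143'
file="payroll.dat"
DEVICE=C:\DOS\EMM386.EXE RAM
FILES=30"""

pairs = PARAM_RE.findall(sample)  # list of (name, value) tuples
for name, value in pairs:
    print(name, value)
```

Group 1 holds the parameter name and group 2 the value with any surrounding quotes dropped.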

I hope this will be useful for you:
/^.*?=.*?$/gm
I tested the pattern at https://regexr.com/
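// String.raw keeps the backslashes in the DOS paths; a plain template literal would drop them (\D becomes D)
var str = String.raw`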
name='John Doe'
name=John Doe
organization=Acme Widgets Inc.
server=192.0.2.62
port='143'
file="payroll.dat"
DOS=HIGH,UMB
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE RAM
DEVICEHIGH=C:\DOS\ANSI.SYS
FILES=30
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P`;
// Strip the quotes and turn the first "=" of each matched line into a space
console.log(str.match(/^.*?=.*?$/gm).map(line => line.replace(/["']/g, '').replace('=', ' ')))
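The same line-by-line idea can be sketched in Python. This is an assumed translation rather than part of the answer, and the text and cleaned names are illustrative. re.MULTILINE plays the role of the m flag, and findall the role of g.

```python
import re

# Illustrative subset of the question's input (raw string keeps the backslashes)
text = r"""name='John Doe'
port='143'
file="payroll.dat"
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P"""

# Grab each line that contains "=", then strip quotes and turn the first "=" into a space
lines = re.findall(r"^.*?=.*?$", text, flags=re.MULTILINE)
cleaned = [line.replace('"', "").replace("'", "").replace("=", " ", 1) for line in lines]
print(cleaned)
```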

Related

How to capture text between a specific word and the semicolon immediately preceding it with regex?

I have many rows of people and titles in Excel, and am looking to filter out certain people by title. For example, cells may contain the following:
John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder
These cells are varying lengths and have varying numbers of people and titles. My plan is to add semicolons at the beginning and end to standardize it. This would give me:
;John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;
Currently, I have code that iterates through the cells and uses the regex Founder.*?; which returns each instance of Founder (i.e. Founder;Founder;), but I can't figure out how to also capture the names of the people. I think I need to match back to the semicolon immediately preceding "Founder", but so far I have not been able to get this working. My ultimate goal is to return something like the following; I already have the code for it, except for the correct regular expression.
;John Smith, Co-Founder;James Jackson, Co-Founder;
Depending on your version of Excel, you could also do this with a formula:
=FILTERXML("<t><s>" & SUBSTITUTE(A1,";","</s><s>")&"</s></t>","//s[contains(.,'Co-Founder')]")
However, for a regex, you could use
(?:^|;)([^;]*?Co-Founder)
which will return the Co-Founders in capturing group 1.
There is no need for leading/trailing semicolons.
Even though VBA regex does not support look-behind, you can work with that limitation.
The Co-Founders regex
(?:^|;)([^;]*?Co-Founder)
Options: Case sensitive (or not, as you prefer); ^$ match at line breaks
- Match the regular expression below (?:^|;)
  - Match this alternative ^
    - Assert position at the beginning of the string ^
  - Or match this alternative ;
    - Match the character “;” literally ;
- Match the regex below and capture its match into backreference number 1 ([^;]*?Co-Founder)
  - Match any character that is NOT a “;” [^;]*?
    - Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
  - Match the character string “Co-Founder” literally Co-Founder
Created with RegexBuddy
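To see the breakdown above in action outside VBA, here is a small Python sketch. The cell variable is illustrative; the pattern is the one from the answer, and no leading or trailing semicolons are added.

```python
import re

cell = "John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder"

# Start at the beginning of the string or just after a semicolon,
# then lazily take non-semicolon characters up to "Co-Founder"
founders = re.findall(r"(?:^|;)([^;]*?Co-Founder)", cell)
print(";".join(founders))
```

Because [^;] cannot cross a semicolon, the "Jane Doe, CEO" entry fails to reach "Co-Founder" and is skipped.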
Splitting the whole string on semicolons and applying a positive filter also works without a regex; the getCoFounders() function returns an array of findings:
Sub ExampleCall()
    Dim s As String
    s = ";John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;"
    Debug.Print Join(getCoFounders(s), "|")
End Sub
Function getCoFounders(s As String) As Variant
    ' Split on semicolons, then keep only the items containing "Co-Founder" (case-insensitive)
    getCoFounders = Filter(Split(s, ";"), "Co-Founder", True, vbTextCompare)
End Function
Results in VB Editor's immediate window
John Smith, Co-Founder|James Jackson, Co-Founder
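The split-and-filter approach carries over to other languages. A rough Python equivalent, assuming a case-insensitive substring test stands in for vbTextCompare (get_co_founders and needle are hypothetical names):

```python
def get_co_founders(s, needle="Co-Founder"):
    """Split on semicolons and keep the parts containing needle, ignoring case."""
    return [part for part in s.split(";") if needle.lower() in part.lower()]

s = ";John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;"
print("|".join(get_co_founders(s)))
```

The empty strings produced by the leading and trailing semicolons drop out because they do not contain the search text.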

Python Regex - How to extract the third portion?

My input is of this format: (xxx)yyyy(zz)(eee)fff where {x,y,z,e,f} are all numbers. fff is optional, though.
Input: x = (123)4567(89)(660)
Expected output: only the eee part, i.e. the number inside the 3rd "()", which is 660 in my example.
I am able to achieve this so far:
re.search("\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.
You could find all the matches enclosed in round brackets () with findall and print the third one:
import re
n = "(123)4567(89)(660)999"
r = re.findall(r"\(\d*\)", n)  # raw string avoids invalid-escape warnings
print(r[2])  # third bracketed group
Output:
(660)
The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'
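A short sketch of that idea, run against the input with and without the optional fff part. The explicit pattern is an assumed alternative that spells out each part instead of counting 13 characters; it is not from the answer.

```python
import re

# Pattern from the answer: skip the 13 characters of "(xxx)yyyy(zz)", capture eee
fixed_width = re.compile(r"[0-9()]{13}\((\d{3})\)")
# Assumed alternative: match each preceding part explicitly, capture only eee
explicit = re.compile(r"\(\d+\)\d+\(\d+\)\((\d+)\)")

results = []
for text in ["(123)4567(89)(660)", "(123)4567(89)(660)999"]:
    results.append((fixed_width.search(text).group(1), explicit.search(text).group(1)))
print(results)
```

Both patterns capture the digits without the surrounding brackets, since only \d+ sits inside the capturing group.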
If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
import re
x = '(123)4567(89)(660)'
# Skip two "(digits)" groups plus whatever follows each, then capture the third
print(re.search(r"(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))
Output:
(660)

Go ReplaceAllString

I read the example code from golang.org website. Essentially the code looks like this:
re := regexp.MustCompile("a(x*)b")
fmt.Println(re.ReplaceAllString("-ab-axxb-", "T"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
The output is like this:
-T-T-
--xx-
---
-W-xxW-
I understand the first output, but I don't understand the remaining three. Can someone explain results 2, 3 and 4 to me? Thanks.
The most intriguing is the fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W")) line. The docs say:
Inside repl, $ signs are interpreted as in Expand
And Expand says:
In the template, a variable is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores.
A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.
In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.
So, in the 3rd replacement, $1W is treated as ${1W} and since this group is not initialized, an empty string is used for replacement.
When I say "the group is not initialized", I mean that the group is not defined in the regex pattern, so it was not populated during the match operation. Replacing means finding all matches and then substituting the replacement pattern for each. Backreferences ($xx constructs) are populated during the matching phase. There is no group named 1W in the pattern, so nothing was captured for it, and an empty string is used in the replacing phase.
The 2nd and 4th replacements are easy to understand and have been described in the other answers. $1 simply backreferences the characters captured with the first capturing group (the subpattern enclosed in a pair of unescaped parentheses); the same goes for example 4.
You can think of {} as a means to disambiguate the replacement pattern.
Now, if you need to make the results consistent, use a named capture (?P<1W>....):
re := regexp.MustCompile("a(?P<1W>x*)b") // <= See here, pattern updated
fmt.Println(re.ReplaceAllString("-ab-axxb-", "T"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
Results:
-T-T-
--xx-
--xx-
-W-xxW-
The 2nd and 3rd lines now produce consistent output since the named group 1W is also the first group, and $1 numbered backreference points to the same text captured with a named capture $1W.
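The "name is taken to be as long as possible" rule can be illustrated with a rough Python emulation. This is only a sketch of the documented behaviour, not Go's actual implementation: expand and replace_all are hypothetical helpers, $$ escapes are not handled, and the original unnamed pattern a(x*)b is used.

```python
import re

# Matches ${name} or $name; in the $name form the name is as long as possible
_REF = re.compile(r"\$(?:\{([A-Za-z0-9_]+)\}|([A-Za-z0-9_]+))")

def expand(template, match):
    """Rough emulation of $name / ${name} expansion (illustrative only)."""
    def lookup(ref):
        name = ref.group(1) or ref.group(2)
        key = int(name) if name.isdigit() else name
        try:
            return match.group(key) or ""
        except IndexError:  # unknown group number or name -> empty string
            return ""
    return _REF.sub(lookup, template)

def replace_all(pattern, text, template):
    """Replace every match of pattern with the expanded template."""
    return re.sub(pattern, lambda m: expand(template, m), text)

outputs = [replace_all(r"a(x*)b", "-ab-axxb-", t) for t in ["T", "$1", "$1W", "${1}W"]]
print(outputs)
```

In the third template the reference is read as the name 1W, which does not exist, so each match is replaced with nothing. The braces in the fourth template end the name after 1.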
$number or $name refers to a subgroup of the regex by index or by name.
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
$1 is subgroup 1 in the regex, i.e. x*
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
$1W: there is no subgroup named 1W, so every match is replaced with an empty string
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
$1 and ${1} are the same, so every match is replaced with subgroup 1 followed by W
For more information: https://golang.org/pkg/regexp/
$1 is a shorthand for ${1}
${1} is the value of the first (1) group, e.g. the content of the first pair of (). This group is (x*) i.e. any number of x.
ReplaceAllString replaces every match. There are two matches. The first is ab, the second is axxb.
No. 2 replaces each match with the content of the group: this is "" in the first match and "xx" in the second.
No. 4 adds a "W" after the content of the group.
No. 3 is left as an exercise. Hint: the twelfth capturing group would be $12.

Regex in Postgres: replacing part of found pattern

I'm looking to do some simple partial redaction for addresses. Basically I'd like to replace the street name before the street suffix with ### while keeping the street suffix.
Examples:
Cherry Street -> ### Street
America Lane -> ### Lane
The full list of suffixes will be known at some point soon, and I could end up replacing the entirety of the address (getting rid of Street and Lane as well as the street name) with something like
regexp_replace(col1, '(\w) (Street|Lane)', '###', 'g')
but I can't figure out how to replace just the word before the street suffix.
What you need is a positive lookahead (?= )
Try the regex
(\w)* (?=Street|Lane)
See how the regex works: http://regex101.com/r/pS9oV3/2
Explanation
(\w)* followed by a space matches zero or more word characters and then a space
(?=Street|Lane) asserts that Street or Lane comes next without consuming it; the match succeeds only if the assertion holds
regexp_replace(col1, '(\w)* (?=Street|Lane)', '### ', 'g')
would produce
### Street
### Lane
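The lookahead idea is not specific to PostgreSQL. A minimal Python sketch, assuming the two example addresses from the question (addresses and redacted are illustrative names):

```python
import re

addresses = ["Cherry Street", "America Lane"]

# The space before the lookahead is part of the match, so the replacement re-adds it;
# the suffix itself is only asserted, never consumed
redacted = [re.sub(r"\w+ (?=Street|Lane)", "### ", a) for a in addresses]
print(redacted)
```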
I'm not familiar with PostgreSQL's exact syntax, but you need to reference the capture group you wanna keep. Something like:
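regexp_replace(col1, '(\w+) (Street|Lane)', '### $2', 'g')
or
regexp_replace(col1, '(\w+) (Street|Lane)', '### \2', 'g')
Where $2 or \2 references the second capture group, so whatever was in the 2nd pair of parentheses will be placed there. PostgreSQL's regexp_replace uses the \2 form, and \w needs the + quantifier so that the whole word, not just its last character, is replaced.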
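For comparison, the capture-group variant looks like this in Python, where backreferences in the replacement are written \2. This is a sketch with illustrative variable names; it matches the whole word before the suffix with \w+.

```python
import re

addresses = ["Cherry Street", "America Lane"]

# Group 1 is the street name (discarded), group 2 is the suffix (kept via \2)
redacted = [re.sub(r"(\w+) (Street|Lane)", r"### \2", a) for a in addresses]
print(redacted)
```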
I wouldn't expect all street names to be that simple. Therefore:
SELECT col1
, regexp_replace(col1, '.+(?=\m(Street|Lane)\M)', '### ', 'i')
FROM (
VALUES
('Cherry Street'::text)
, ('America Lane')
, ('Weird name Lane')
, ('Lanex Lane') -- 'Lane' included in street name
, ('Buster-Keaton-Lane')
, ('White Space street') -- with tab
) t(col1);
Explanation
.+ ... any string of one or more characters (including white space)
(?=\m(Street|Lane)\M) ... that is followed by 'Street' or 'Lane'
- (?=) ... positive look-ahead
- \m\M ... begin and end of word
The additional parameter 'i' switches to case-insensitive matching. No point in adding 'g' (replace "globally"), since only a single replacement should happen.
Details in the manual.
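The same approach can be tried outside the database. In this Python sketch, \b stands in for PostgreSQL's \m and \M word-boundary escapes (an approximation, since \b does not distinguish word start from word end), re.IGNORECASE for the 'i' flag, and count=1 for the absence of 'g'. The test names are taken from the query above.

```python
import re

# Greedy .+ up to the last position followed by a whole word Street or Lane
pattern = re.compile(r".+(?=\b(?:Street|Lane)\b)", re.IGNORECASE)

names = [
    "Cherry Street",
    "Weird name Lane",
    "Lanex Lane",            # 'Lane' included in street name
    "Buster-Keaton-Lane",
    "White Space\tstreet",   # with tab
]
redacted = [pattern.sub("### ", n, count=1) for n in names]
print(redacted)
```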

Regex nested optional groups

I am trying to capture the bold part of strings like this:
'capture a year range at the end of a string 1995-2010'
'if there's no year range just capture the single year 2005'
'capture a year/year range followed by a parenthesis, including the parenthesis 2007-2012 (58 months)'
This regex works for 1 and 2, but I can't get it to work for 3:
/(\d+([-–— ]\d+( \(\d+ months\))?)?$)/
What am I doing wrong?
Try this regex:
/\d{4}(?:[-–— ]\d{4})?(?:\s*\([^)]+\))?$/gm
This one captures everything in the brackets.
If you need a regex specific to the text "(number) months" in the brackets, then you can use this: \d{4}(?:[-–— ]\d{4})?(?:\s+\(\d+\smonths\))?$
Link to test: RegexPal or RegExr
Sample text:
capture a year range at the end of a string 1995-2010
if there's no year range just capture the single year 2005
capture a year/year range followed by a parenthesis, including the parenthesis 2007-2012 (58 months)
trying out another example 1990 (23 weeks)
trying out another example 1995-2002 (x days)
trying out another example 2050 (blah blah)
trying out another example 2050—3000
trying out another example 2050-3000
trying out another example 2050–3000
And the JavaScript code:
var regex = /\d{4}(?:[-–— ]\d{4})?(?:\s*\([^)]+\))?$/gm; // multiline enabled
var input = "your input string";
var matches = input.match(regex); // null when nothing matches
if (matches) {
  // iterate over the matched strings themselves rather than array keys
  matches.forEach(function (match) {
    alert(match);
  });
} else {
  alert("No matches found!");
}
This regex works nicely. :)
/(?:(?:\d{4}[-–— ])?\d{4})(?: \(\d+ months\))?$/
The main difference between my regex and Jonah's is that mine uses ?: to make the sub-groups non-capturing. A group in a regex captures its contents unless you tell it not to, and captured groups change what methods such as replace or split return, which can look buggy and may be your problem as well.
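The effect of ?: is easy to see by comparing what each kind of group reports. A small Python sketch, using one capturing and one non-capturing pattern from this thread on an illustrative input:

```python
import re

s = "2007-2012 (58 months)"

capturing = re.search(r"(\d{4}([-–— ]\d{4})?( \(\d+ months\))?)$", s)
non_capturing = re.search(r"(?:(?:\d{4}[-–— ])?\d{4})(?: \(\d+ months\))?$", s)

print(capturing.groups())      # sub-groups are returned alongside the full match
print(non_capturing.groups())  # empty tuple: nothing is captured
print(non_capturing.group(0))  # the full match is still available
```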
The following regex works for me in a sample Perl script. It should be workable in JavaScript:
/(\d{4}([-–— ]\d{4})?( \(\d+ months\))?)$/
We first match a 4-digit year: \d{4}
Then we match an optional separator followed by another 4-digit year: ([-–— ]\d{4})?
Finally, we match the optional months portion: ( \(\d+ months\))?
You may need to insert whitespace matches (\s*) where needed, if your data doesn't always follow this strict template.
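The three steps above can be assembled piece by piece. A Python sketch, assuming the three example strings from the question and using non-capturing groups so that only the full match is reported (YEAR, SECOND_YEAR and MONTHS are illustrative names):

```python
import re

YEAR = r"\d{4}"                      # a 4-digit year
SECOND_YEAR = r"(?:[-–— ]\d{4})?"    # optional separator plus another year
MONTHS = r"(?: \(\d+ months\))?"     # optional " (N months)" portion
YEAR_RE = re.compile(YEAR + SECOND_YEAR + MONTHS + r"$")

samples = [
    "capture a year range at the end of a string 1995-2010",
    "if there's no year range just capture the single year 2005",
    "capture a year/year range followed by a parenthesis, including the parenthesis 2007-2012 (58 months)",
]
found = [YEAR_RE.search(s).group(0) for s in samples]
print(found)
```

The two optional parts are siblings rather than nested, so the months portion does not depend on a second year being present.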
It actually works fine here, if I understand your needs correctly: Gskinner RegExr
Just alternate which sentence is last, as $ does not match at newlines here (no multiline flag), only at the end of the string.
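The difference the multiline flag makes can be checked with a short Python sketch. Flag names differ by language; re.MULTILINE here corresponds to the m flag, and the two-line text is an illustrative input.

```python
import re

text = "a year range 1995-2010\nthe single year 2005"
pattern = r"\d{4}(?:[-–— ]\d{4})?$"

end_only = re.findall(pattern, text)                      # $ only at the end of the string
per_line = re.findall(pattern, text, flags=re.MULTILINE)  # $ at the end of every line
print(end_only, per_line)
```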