extract a variable value from the middle of a string - regex

I have been trying to figure this out for quite some time. How do I get the PID value from the following string using PowerShell? I thought regex was the way to go, but I can't quite figure out the syntax.
For what it is worth, everything except the PID will remain the same.
$foo = <VALUE>I am just a string and the string is the thing. PID:25973. After this do that and blah blah.</VALUE>
I have tried the following in regex
[regex]::Matches($foo, 'PID:.*') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>(.+).') | % {$_.Captures[0].Groups[1].value}

For your regex you'll want to indicate what's before and after the portion you're looking for. PID:.* will find everything from the PID to the end of the string.
And to use a capture group you'll want to have some ( and ) in your regex, which defines a group.
So try this on for size:
[regex]::Matches($foo,'PID:(\d+)') | % {$_.Captures[0].Groups[1].value}
I'm using a regex of PID:(\d+). The \d+ means "one or more digits". The parentheses around that (\d+) identifies it as a group I can access using Captures[0].Groups[1].
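For comparison, the same capture-group idea can be sketched in Python. This is only an illustration, not part of the PowerShell answer; the variable foo is a stand-in for $foo from the question.

```python
import re

# Stand-in for the $foo string from the question.
foo = ("<VALUE>I am just a string and the string is the thing. "
       "PID:25973. After this do that and blah blah.</VALUE>")

# PID:(\d+) anchors on the literal "PID:" and captures the digits that follow.
match = re.search(r"PID:(\d+)", foo)
pid = match.group(1) if match else None
print(pid)
```

Group 0 would be the whole match (PID:25973); group 1 is only the part inside the parentheses.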

Here's another option. It replaces the whole string with the first capture group (the digits after 'PID:'):
$foo -replace '^.+PID:(\d+).+$','$1'


Powershell Regex question. Escape parenthesis

Been beating my head around this one all day and I'm getting close but not quite getting there. I have a small subset of my much larger script for just the regex part. Here is the script so far:
$CCI_ID = @(
"003417 AR-2.1"
"003425 AR-2.9"
"003392 AP-1.12"
"009012 APP-1(21).1"
)
[regex]::matches($CCI_ID, '(\d{1,})|([a-zA-Z]{2}[-][\d][\(?\){0,1}[.][\d]{1,})') |
ForEach-Object {
if($_.Groups[1].Value.length -gt 0){
write-host $('CCI-' + $_.Groups[1].Value.trim())}
else{$_.Groups[2].Value.trim()}
}
CCI-003417
AR-2.1
CCI-003425
AR-2.9
CCI-003392
AP-1.12
CCI-009012
PP-1(21
CCI-1
The output is correct for all but the last one. It should be:
CCI-009012
APP-1(21).1
Thanks for any advice.
Instead of describing and quantifying the (optional) opening and closing parenthesis separately, group them together and then make the whole group optional:
(?:\(\d+\))?
The whole pattern thus ends up looking like:
[regex]::Matches($CCI_ID, '(\d{1,})|([a-zA-Z]{2,3}[-][\d](?:\(\d+\))?[.][\d]{1,})')
In your pattern you are using an alternation |, but judging by the example data you can instead match one or more whitespace characters between the two parts.
If the pattern matches, group 1 already contains one or more digits, so you don't have to check Value.length.
The pattern with the optional digits between parentheses:
\b(\d+)\s+([a-zA-Z]{2,}-\d(?:\(\d+\))?\.\d+)\b
See a regex101 demo.
$CCI_ID = @(
"003417 AR-2.1"
"003425 AR-2.9"
"003392 AP-1.12"
"009012 APP-1(21).1"
)
[regex]::matches($CCI_ID, '\b(\d+)\s+([a-zA-Z]{2,}-\d(?:\(\d+\))?\.\d+)\b') |
ForEach-Object {
write-host $( 'CCI-' + $_.Groups[1].Value.trim() )
write-host $_.Groups[2].Value.trim()
}
Output
CCI-003417
AR-2.1
CCI-003425
AR-2.9
CCI-003392
AP-1.12
CCI-009012
APP-1(21).1
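The same pattern can be tried outside PowerShell. Below is a minimal Python sketch, assuming the four sample lines from the question; the names cci_ids and results are made up for the illustration.

```python
import re

cci_ids = [
    "003417 AR-2.1",
    "003425 AR-2.9",
    "003392 AP-1.12",
    "009012 APP-1(21).1",
]

# Group 1: the leading digits. Group 2: letters, a dash, a digit,
# an optional "(digits)" part, a dot, and the trailing digits.
pattern = re.compile(r"\b(\d+)\s+([a-zA-Z]{2,}-\d(?:\(\d+\))?\.\d+)\b")

results = []
for line in cci_ids:
    m = pattern.search(line)
    if m:
        results.append(("CCI-" + m.group(1), m.group(2)))

for cci, control in results:
    print(cci)
    print(control)
```

Because (?:\(\d+\))? is a single optional unit, the parentheses either appear together with their digits or not at all.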
As you are experiencing here, regular expressions can become very complex and unreadable.
Therefore it is often a good idea to view your problem from two different angles:
Try matching the part(s) you want, or
Try matching the part(s) you don't want
In your case it is probably easier to match the part that you don't want (the delimiter, here the space) and split your string on that, which is apparently what you want to achieve:
$CCI_ID | Foreach-Object {
$Split = $_ -Split '\s+', 2
'CCI-' + $Split[0]
$Split[1]
}
$_ -Split '\s+', 2 splits the current string on one or more whitespace characters (you might also consider a literal space: -Split ' '). The , 2 limits the result to at most two parts, meaning that the second part will not be split further even if it contains spaces.
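The split-with-a-limit idea is not specific to PowerShell. As a rough Python sketch (the names cci_ids and pairs are made up for the illustration), maxsplit=1 plays the role of the , 2 limit:

```python
import re

cci_ids = [
    "003417 AR-2.1",
    "003425 AR-2.9",
    "003392 AP-1.12",
    "009012 APP-1(21).1",
]

pairs = []
for line in cci_ids:
    # maxsplit=1 performs at most one split, so at most two parts come back.
    number, control = re.split(r"\s+", line, maxsplit=1)
    pairs.append(("CCI-" + number, control))

# The second part is left alone even when it contains further spaces.
limited = re.split(r"\s+", "a b c", maxsplit=1)
```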

Match the word "bar" if found anywhere in a field

I am trying to use a CASE statement in Google Data Studio to return a Boolean result if a given string is found within an existing field.
As Google Data Studio uses RE2 RegEx syntax, I believe the following would work, but it returns a could not parse formula error:
CASE
WHEN REGEXP_MATCH(Foo, '(\W|^)bar(\W|$)') THEN 1
ELSE 0
END
I have tried many different combinations of RegEx syntax, but can't work it out. Any help would be much appreciated as this should be a simple REGEXP_MATCH?
The Boolean result should be true if the string is found anywhere within the field:
+---------------------------+----------------+
| Foo | Boolean Result |
+---------------------------+----------------+
| blah bar / boo doo | True |
| but is / should not match | False |
| but match / here bar | True |
+---------------------------+----------------+
You need to make sure you match the whole string with the pattern that you want to use in a REGEXP_MATCH and when using regex escapes, make sure to double escape them:
CASE WHEN REGEXP_MATCH(Foo, '(.*\\W|^)bar(\\W.*|$)') THEN 1 ELSE 0 END
If there are line breaks in Foo, add (?s) at the start of the pattern.
Details
(.*\\W|^) - either any 0+ chars as many as possible followed with a non-word char or start of a string
bar - the word
(\\W.*|$) - either a non-word char followed with any 0+ chars as many as possible or end of a string
See the regex demo.
A Boolean field can be created using the single REGEXP_MATCH Calculated Field below, where \\b on either side of bar represents a Word Boundary thus matching bar but not bark, embark or embar:
REGEXP_MATCH(Foo, ".*(\\bbar\\b).*")
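To see why the surrounding .* matters, here is a small Python sketch. It assumes, as the answers above state, that REGEXP_MATCH has to match the whole field, and uses re.fullmatch as a stand-in for that behaviour. The first three rows come from the question's table; the fourth is a made-up row to exercise the word boundaries. In a Python raw string the backslashes do not need doubling.

```python
import re

rows = [
    "blah bar / boo doo",
    "but is / should not match",
    "but match / here bar",
    "embark on a barge",  # hypothetical row: contains "bar" only inside words
]

# fullmatch must consume the entire field, hence the leading and trailing ".*".
pattern = re.compile(r".*\bbar\b.*")

flags = {row: pattern.fullmatch(row) is not None for row in rows}

# Without the ".*" wrappers a whole-field match fails even when "bar" is present.
bare = re.fullmatch(r"\bbar\b", "blah bar / boo doo") is not None
```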

regex to duplicate repeated patterns, substituting part of the pattern

I'd like to duplicate multiple matches in a line, substituting part of the match, but keeping the runs of matches together (that seems to be the tricky part).
e.g.:
Regex:
(x(\d)(,)?)
Replacement:
X$2,O$2$3
Input:
x1,x2,Z3,x4,Z5,x6
Output: (repeated groups broken apart)
X1,O1,X2,O2,Z3,X4,O4,Z5,X6,O6
Desired output (repeated groups, "X1,X2" kept together):
X1,X2,O1,O2,Z3,X4,O4,Z5,X6,O6
Demo: https://regex101.com/r/gH9tL9/1
Is this possible with regex or do I need to use something else?
Update: Will's answer is what I expected. It occurs to me that it might be possible with multiple passes of regex.
You would have to capture the repeating patterns as one match and write out replacements for the whole repeating pattern at once. Your current pattern cannot tell that your first and second matches, x1 and x2 respectively, are adjacent.
I'm going to say no, this is not possible with one pure regex.
This is because of two important facts about capture groups and replacing.
Repeated capture groups will return the last capture:
Regexes can capture patterns that repeat an arbitrary number of times by using the forms <PATTERN>{1,}, <PATTERN>+ or <PATTERN>*. However, any capture group within <PATTERN> only returns the capture from the last iteration of the pattern. This prevents you from capturing matches that repeat arbitrarily.
"Hold on", you might say, "I only want to capture patterns that repeat one or two times, I could use (x(\d)(,)?)(x(\d)(,)?)?", which brings us to point 2.
There is no conditional replacement
Using the above pattern we could get your desired output for the repeated match, but not without mangling the solo match replacement.
See: https://regex101.com/r/gH9tL9/2 Without the ability to turn off sections of the replacement based on the existence of capture groups, we cannot achieve the desired output.
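The first point can be checked quickly in Python. This is only a sketch using Python's re module; other flavours may expose earlier captures differently.

```python
import re

# A capture group inside a repeated group keeps only its last iteration.
m = re.fullmatch(r"(?:x(\d),?)+", "x1,x2")
last_digit = m.group(1)  # the "1" from the first iteration is no longer available

# Collecting every digit takes a second pass over the matched run.
all_digits = re.findall(r"x(\d)", m.group(0))

print(last_digit, all_digits)
```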
But "No, you can't do that" is a challenge to a hacker, I hope I am shown up by a true regex ninja.
Solution with 2 regexes and some code
There are definitely ways to achieve this goal with some code.
Here's a quick and dirty python hack using two regexes http://pythonfiddle.com/wip-soln-for-so-q/
This makes use of python's re.sub(), which can pass each match of one regex to a function, ordered_repl, that returns the replacement string. By using your original regex within ordered_repl we can extract the information we want and get the right order by buffering our lists of Xs and Os.
import re

input_string = "x1,x2,Z3,x4,Z5,x6"
re1 = re.compile(r"(?:x\d,?)+")   # matches a whole run of adjacent items using a repeating non-capturing group
re2 = re.compile(r"(x(\d)(,)?)")  # your actual matcher

def ordered_repl(m):  # m is a match object for one run
    buf1 = []
    buf2 = []
    cap_iter = re.finditer(re2, m.group(0))  # iterator of match objects for all non-overlapping matches
    for cap_group in cap_iter:
        capture = cap_group.group(2)      # capture the digit
        buf1.append("X%s" % capture)      # buffer X's of this run
        buf2.append("O%s" % capture)      # buffer O's of this run
    return "%s,%s," % (",".join(buf1), ",".join(buf2))  # concatenate the buffers and return

# search the string for matches of re1, pass each to ordered_repl, then drop the trailing comma
print(re.sub(re1, ordered_repl, input_string).rstrip(','))
In my specific case I'm using powershell, so I was able to come up with the following:
(linebreaks added for readability)
("x1,x2,z3,x4,z5,x6"
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
| Foreach-Object {
if ($_ -match 'x') {
$_ + ',' + ($_ -replace 'x','y')
} else {$_}
}
) -join ''
Outputs:
x1,x2,y1,y2,z3,x4,y4,z5,x6,y6
Where:
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
breaks apart the string into these groups:
x1,x2
,
z3
,
x4
,
z5
,
x6
using positive and negative lookahead and lookbehind:
comma with x\d before and without x after:
(?<=x\d),(?!x)
comma without x\d before and with x after:
(?<!x\d),(?=x)
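The same lookaround split also works in Python, since both lookbehinds are fixed-width. A rough sketch follows; the names text, parts and result are made up for the illustration.

```python
import re

text = "x1,x2,z3,x4,z5,x6"

# Split only on commas that sit on the edge of a run of x items.
# The outer capturing group keeps the separators in the result list.
parts = re.split(r"((?<=x\d),(?!x)|(?<!x\d),(?=x))", text)

pieces = []
for part in parts:
    if "x" in part:
        # Emit the run of x items, then the same run with x replaced by y.
        pieces.append(part + "," + part.replace("x", "y"))
    else:
        pieces.append(part)

result = "".join(pieces)
print(result)
```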

Regex to get password from a long string of mess

I am using power-shell and am getting the below output from my program.
I am having problems getting the password out of the mess of other things. Ideally I need to get Hiva!!66 by itself. I am using regex to accomplish this and it's just not working. The password will always be 8 characters and have an uppercase letter, a lowercase letter and a special character. I have created the split and everything else I need, but the regex part is messing with me.
I am aware that there are a lot of questions around regex and passwords, but those don't seem to have a lot of mess before and after the password. Any help would be appreciated.
My best attempt so far is:
"(?=.*\d)(?=.*[A-Z])(?=.*[!@#\$%\^&\*\~()_\+\-={}\[\]\\:;`"'<>,./]).{8}$"
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:5:For intTmp = 1 To 4
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:8:cboCOMPort.SelectString 1, "1"
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:11:str2CRLF = Chr(13) & Chr(10) & Chr(13) & Chr(10)
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:14: & "include emulation type (currently Tandem), the I/O method (currently Async) and host connection information
for the session (currently COM9, 8N1)" _
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:15: & " to the correct values for your target host (e.g., TCP/IP and host IP name or address) and save the
IOSet "CHARSIZE", "8"
PASS="Hiva!!66" If DDEAppReturnCode() <> 0 Then
If DDEAppReturnCode() <> 0 Then
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\DDEtoXL.vcb:28: MsgBox "Could not load " & txtWorkSheet.text, 48
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\DDEtoXL.vcb:37:DDESheetChan = -1
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\DDEtoXL.vcb:38:DDESystemChan = -2
If you can't count on the quotes or the PASS= being there, you'll have to rely on the password's composition to do everything. The following regex matches a string of eight consecutive characters of the allowed types, with the lookahead and lookbehind to make sure there aren't more than eight.
$regex = [regex] @'
(?x)
(?<![!@#$%^&*~()_+\-={}\[\]\\:;`<>,./A-Za-z0-9])
(?:
[!@#$%^&*~()_+\-={}\[\]\\:;`<>,./]()
|
[A-Z]()
|
[a-z]()
|
[0-9]()
){8}
\1\2\3\4
(?![!@#$%^&*~()_+\-={}\[\]\\:;`<>,./A-Za-z0-9])
'@
It also verifies that there's at least one of each character type: uppercase letter, lowercase letter, digit and special. The lookahead approach used in your regex won't work because it can look too far ahead, beyond the end of the word you're trying to match. Instead, I put an empty group in each branch to act like check boxes. If a backreference to one of those groups fails, it means that branch didn't participate in the match, meaning in turn that the associated character type was not present.
Did you try the following regex:
^PASS="(.{8})"
?
Just use this
(?<=PASS=").+(?=")
You can extract the password from that output with something like this:
... | ? { $_ -cmatch 'PASS="(.{8})"' } | % { $matches[1] }
or like this (in PowerShell v3):
... | Select-String -Case 'PASS="(.{8})"' | % { $_.Matches.Groups[1].Value }
In PowerShell v2 you'll have to do something like this if you want to use Select-String:
... | Select-String -Case 'PASS="(.{8})"' | select -Expand Matches |
select -Expand Groups | select -Last 1 | % { $_.Value }
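For readers outside PowerShell, the same extraction can be sketched in Python. The sample lines are copied from the output in the question; the names lines and passwords are made up for the illustration. Python's re is case-sensitive by default, which corresponds to -cmatch and -Case above.

```python
import re

lines = [
    'IOSet "CHARSIZE", "8"',
    'PASS="Hiva!!66" If DDEAppReturnCode() <> 0 Then',
    'If DDEAppReturnCode() <> 0 Then',
]

# Capture exactly eight characters between the quotes that follow PASS=.
pattern = re.compile(r'PASS="(.{8})"')

passwords = []
for line in lines:
    m = pattern.search(line)
    if m:
        passwords.append(m.group(1))

print(passwords)
```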

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace('/--(?=[^\']*\'([^\']|\'[^\']*\')*$)/', "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
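As a quick check of the lookahead approach in Python, here is a small sketch run on the example string from the question. Like the regex itself, it assumes the single quotes are balanced; the variable names are made up for the illustration.

```python
import re

text = ("xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' "
        "ghh-g '--ggh--' vcbcvb")

# A "--" is removed only if, after it, there is one more quote followed by
# a balanced remainder (single characters or complete quoted groups).
pattern = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")

cleaned = pattern.sub("", text)
print(cleaned)
```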
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
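A minimal Python sketch of that character-by-character approach is shown below. The function name strip_dashes_in_quotes is made up for the illustration, and it assumes quotes are never escaped inside a quoted region.

```python
def strip_dashes_in_quotes(text):
    """Remove every '--' that occurs inside single quotes."""
    out = []
    inside = False  # True while we are between an opening and a closing quote
    i = 0
    while i < len(text):
        if text[i] == "'":
            inside = not inside       # each quote toggles the state
            out.append("'")
            i += 1
        elif inside and text.startswith("--", i):
            i += 2                    # skip the pair of dashes
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


sample = ("xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' "
          "ghh-g '--ggh--' vcbcvb")
print(strip_dashes_in_quotes(sample))
```

A lone dash inside quotes is kept, so '---' becomes '-' and '----' becomes ''.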
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside theses single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and use whatever exec magic your scripting language offers to run the equivalent of this shell command:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.