Rematch same or part of previous matched group - regex

I'm looking for a way to re-match part, or all, of a previously matched group. For instance, assume we have the following text:
this is a very long text "with" some quoted strings I "need" to extract in their own context
A regex like (.{1,20})(".*?")(.{1,20}) gives the following output:
# | 1st group | 2nd group | 3rd group
------------------------------------------------------------------
1 | is a very long text | "with" | some quoted strings
2 | I | "need" | to extract in their
The goal is to force the regex to re-match part of the 3rd group from the 1st match (or the whole group, when the quoted strings are close together) while it is matching the 2nd one. Basically, I'd like to have the following output instead:
# | 1st group | 2nd group | 3rd group
------------------------------------------------------------------
1 | is a very long text | "with" | some quoted strings
2 | me quoted strings I | "need" | to extract in their
Backreference support would probably do the trick, but Go's regex engine lacks it.

Going back to the original problem: you need to extract the quoted strings in context.
Since Go's regexp package has no lookahead, you could use a regexp just to match the quotes (or even just strings.Index) and get their byte ranges, then expand each range yourself to include the context. This may require more work with multi-byte UTF-8 text, because the offsets are bytes, not characters.
Something like:
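package main

import (
	"fmt"
	"regexp"
)

func main() {
	input := `this is a very long text "with" some quoted strings I "need" to extract in their own context`
	re := regexp.MustCompile(`(".*?")`)
	// Each match is a [start, end) pair of byte offsets into input.
	matches := re.FindAllStringIndex(input, -1)
	for _, m := range matches {
		// Widen the range by 20 bytes on each side, clamped to the string bounds.
		s := m[0] - 20
		e := m[1] + 20
		if s < 0 {
			s = 0
		}
		if e > len(input) {
			e = len(input)
		}
		fmt.Printf("%s\n", input[s:e])
	}
}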
https://play.golang.org/p/brH8v6OM-Fx

Related

Match a word in a list of words regex

I want the user to only be able to enter the values in the following regex:
^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$
My problem is that words like PP, AA and QQ are accepted.
I am not sure how I can prevent that. Thank you.
Site I use to verify the expression: https://regex101.com/
In most RegExp flavors, square brackets [] denote character classes; that is, a set of individual characters, any one of which can be matched at a specific position.
Because P is included in this character class (along with a quantifier of {2}), PP is matched.
Instead, you seem to want a group with alternatives; for that, you'd use parentheses () (while also eliminating the whitespace, which doesn't appear to have been intentional on your part):
^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT){2}$
RegEx101
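As written, with the {2} quantifier kept, this matches two codes in a row: things like ABBC, ABAB, NLBC, etc. To accept exactly one code, drop the quantifier:
^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT)$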
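To make the difference concrete, here is a small Python sketch (Python's re module treats [] and () the same way here). The names CHAR_CLASS, ONE_CODE and is_province are made up for illustration, and the single-code pattern assumes the goal is to accept exactly one two-letter code.

```python
import re

# The pattern from the question: a character class, so any two of the
# listed characters (letters, space or |) are accepted.
CHAR_CLASS = re.compile(
    r"^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$"
)

# A group of alternatives with no quantifier: exactly one listed code.
ONE_CODE = re.compile(r"^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT)$")


def is_province(text):
    """Return True if text is exactly one of the listed codes."""
    return ONE_CODE.match(text) is not None


for candidate in ["AB", "PP", "ABBC"]:
    # Columns: input, accepted by the character class, accepted by the group.
    print(candidate, bool(CHAR_CLASS.match(candidate)), is_province(candidate))
```

The character class accepts PP because P appears somewhere in the brackets; the group only accepts whole codes.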

Only output matching regex pattern

I have a CSV file that contains tens of thousands of rows. Each row has 8 columns. One of those columns contains text similar to this:
this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row: http://yetanotherdomain.net
this is a row: https://hereisadomain.org | some_text
I'm currently accessing the data in this column this way:
for row in csv_reader:
    the_url = row[3]
    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)
Output from the print statement:
http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text
How do I obtain only the URLs?
http://somedomain.com
http://someanotherdomain.com
http://yetanotherdomain.net
https://hereisadomain.org
Just change your pattern to:
\b(?:http|ftp)s?://\S+
Instead of matching anything with .*, match only non-whitespace characters with \S+. You might want to add a word boundary before your non-capturing group, too.
Check it live here.
Instead of repeating any character at the end
'(?:http|ftp)s?://.*'
                  ^^
repeat any character except a space, so that the pattern stops matching at the end of a URL:
'(?:http|ftp)s?://[^ ]*'
                  ^^^^
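Here is a minimal, self-contained sketch of the \S+ version run against the sample rows from the question. The cells list and the extract_urls helper are made up for illustration; in the real script the text would come from row[3].

```python
import re

# \S+ stops at the first whitespace character, so each URL ends
# before the " | " separator.
URL_PATTERN = re.compile(r"\b(?:http|ftp)s?://\S+")

# Sample cell values copied from the question.
cells = [
    "this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row: http://yetanotherdomain.net",
    "this is a row: https://hereisadomain.org | some_text",
]


def extract_urls(cell):
    """Return every URL found in one cell, in order of appearance."""
    return URL_PATTERN.findall(cell)


for cell in cells:
    for link in extract_urls(cell):
        print(link)
```

Because the group is non-capturing, findall returns the whole match for each URL rather than just the scheme.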

Match the word "bar" if found anywhere in a field

I am trying to use a CASE statement in Google Data Studio to return a Boolean result if a given string is found within an existing field.
As Google Data Studio uses RE2 RegEx syntax, I believe the following would work, but it returns a could not parse formula error:
CASE
WHEN REGEXP_MATCH(Foo, '(\W|^)bar(\W|$)') THEN 1
ELSE 0
END
I have tried many different combinations of RegEx syntax, but can't work it out. Any help would be much appreciated as this should be a simple REGEXP_MATCH?
The Boolean result should be true if the string is found anywhere within the field:
+---------------------------+----------------+
| Foo | Boolean Result |
+---------------------------+----------------+
| blah bar / boo doo | True |
| but is / should not match | False |
| but match / here bar | True |
+---------------------------+----------------+
REGEXP_MATCH only succeeds when the pattern matches the whole string, so the pattern has to account for everything around bar. Also, when using regex escapes, make sure to double-escape them:
CASE WHEN REGEXP_MATCH(Foo, '(.*\\W|^)bar(\\W.*|$)') THEN 1 ELSE 0 END
If there are line breaks in Foo, add (?s) at the start of the pattern.
Details
(.*\\W|^) - either any 0+ chars as many as possible followed with a non-word char or start of a string
bar - the word
(\\W.*|$) - either a non-word char followed with any 0+ chars as many as possible or end of a string
See the regex demo.
A Boolean field can be created using the single REGEXP_MATCH Calculated Field below, where \\b on either side of bar represents a Word Boundary thus matching bar but not bark, embark or embar:
REGEXP_MATCH(Foo, ".*(\\bbar\\b).*")
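For anyone who wants to check these patterns outside Data Studio, here is a rough Python sketch. It uses re.fullmatch to imitate the whole-string requirement described above; Python's re is not RE2, so treat it as an approximation. The names PADDED, BOUNDED and contains_bar are made up for illustration, and the backslashes are single because a Python raw string needs no double escaping.

```python
import re

# Pattern from the first answer.
PADDED = re.compile(r"(.*\W|^)bar(\W.*|$)")
# Pattern from the second answer, using word boundaries.
BOUNDED = re.compile(r".*(\bbar\b).*")


def contains_bar(field, pattern=PADDED):
    """Return True if the pattern matches the entire field."""
    return pattern.fullmatch(field) is not None


# Rows from the table in the question, plus one near miss.
rows = [
    "blah bar / boo doo",
    "but is / should not match",
    "but match / here bar",
    "embark",
]

for row in rows:
    print(row, contains_bar(row), contains_bar(row, BOUNDED))
```

Both patterns agree on these rows: the first and third match, the second and embark do not.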

Regex to get password from a long string of mess

I am using PowerShell and am getting the output below from my program.
I am having problems getting the password out of the mess of other things. Ideally I need to get Hiva!!66 by itself. I am using regex to accomplish this and it's just not working. The password will always be 8 characters and have an uppercase letter, a lowercase letter and a special character. I have created the split and everything else I need, but the regex part is messing with me.
I am aware that there are a lot of questions about regex and passwords, but those don't seem to have a lot of mess before and after the password. Any help would be appreciated.
My best attempt so far is:
"(?=.*\d)(?=.*[A-Z])(?=.*[!@#\$%\^&\*\~()_\+\-={}\[\]\\:;`"'<>,./]).{8}$"
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:5:For intTmp = 1 To 4
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:8:cboCOMPort.SelectString 1, "1"
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:11:str2CRLF = Chr(13) & Chr(10) & Chr(13) & Chr(10)
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:14: & "include emulation type (currently Tandem), the I/O method (currently Async) and host connection information
for the session (currently COM9, 8N1)" _
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\CONNECTEXP.VCB:15: & " to the correct values for your target host (e.g., TCP/IP and host IP name or address) and save the
IOSet "CHARSIZE", "8"
PASS="Hiva!!66" If DDEAppReturnCode() <> 0 Then
If DDEAppReturnCode() <> 0 Then
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\DDEtoXL.vcb:28: MsgBox "Could not load " & txtWorkSheet.text, 48
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\DDEtoXL.vcb:37:DDESheetChan = -1
C:\Users\<username>\AppData\Roaming\Crystal Point\OutsideView\Macro\DDEtoXL.vcb:38:DDESystemChan = -2
If you can't count on the quotes or the PASS= being there, you'll have to rely on the password's composition to do everything. The following regex matches a string of eight consecutive characters of the allowed types, with the lookahead and lookbehind to make sure there aren't more than eight.
$regex = [regex] @'
(?x)
(?<![!@#$%^&*~()_+\-={}\[\]\\:;`<>,./A-Za-z0-9])
(?:
  [!@#$%^&*~()_+\-={}\[\]\\:;`<>,./]()
  |
  [A-Z]()
  |
  [a-z]()
  |
  [0-9]()
){8}
\1\2\3\4
(?![!@#$%^&*~()_+\-={}\[\]\\:;`<>,./A-Za-z0-9])
'@
It also verifies that there's at least one of each character type: uppercase letter, lowercase letter, digit and special. The lookahead approach used in your regex won't work because it can look too far ahead, beyond the end of the word you're trying to match. Instead, I put an empty group in each branch to act like check boxes. If a backreference to one of those groups fails, it means that branch didn't participate in the match, meaning in turn that the associated character type was not present.
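The "check box" idea is not specific to .NET. Below is a rough Python sketch of the same technique, since Python's re also fails a backreference to a group that never took part in the match. The SPECIAL character set and the find_password helper are assumptions made for illustration, not part of the original answer.

```python
import re

# Assumed set of "special" characters, escaped for use inside [...].
SPECIAL = r"!@#$%^&*~()_+\-={}\[\]\\:;`<>,./"
ALLOWED = SPECIAL + "A-Za-z0-9"

PASSWORD = re.compile("".join([
    "(?<![", ALLOWED, "])",   # not preceded by an allowed character
    "(?:",
    "[", SPECIAL, "]()",      # empty group 1: a special character was seen
    "|[A-Z]()",               # empty group 2: an uppercase letter was seen
    "|[a-z]()",               # empty group 3: a lowercase letter was seen
    "|[0-9]()",               # empty group 4: a digit was seen
    "){8}",                   # exactly eight allowed characters
    r"\1\2\3\4",              # fails unless all four groups were set
    "(?![", ALLOWED, "])",    # not followed by an allowed character
]))


def find_password(text):
    """Return the first 8-character run containing all four types, or None."""
    match = PASSWORD.search(text)
    return match.group(0) if match else None


print(find_password('PASS="Hiva!!66" If DDEAppReturnCode() <> 0 Then'))
```

A run such as abcdefgh is rejected because groups 1, 2 and 4 never participate, so the backreferences cannot match.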
Did you try the following regex?
^PASS="(.{8})"
Just use this
(?<=PASS=").+(?=")
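As a quick comparison, here is a hedged Python sketch of the lookaround approach next to a plain capture group. It uses a lazy .+? rather than the greedy .+ above, because a greedy match would run to the last double quote on the line if more quoted text followed. The helper names are made up for illustration.

```python
import re

# Lookbehind/lookahead: the match itself is only the password.
LOOKAROUND = re.compile(r'(?<=PASS=").+?(?=")')
# Capture group: match the surrounding text too, then read group 1.
CAPTURE = re.compile(r'PASS="(.{8})"')

# Line taken from the sample output in the question.
line = 'PASS="Hiva!!66" If DDEAppReturnCode() <> 0 Then'


def with_lookaround(text):
    """Return the text between PASS=" and the next double quote, or None."""
    match = LOOKAROUND.search(text)
    return match.group(0) if match else None


def with_capture(text):
    """Return the 8 characters between PASS=" and a closing quote, or None."""
    match = CAPTURE.search(text)
    return match.group(1) if match else None


print(with_lookaround(line))
print(with_capture(line))
```

The capture-group form also enforces the 8-character length, which the lookaround form does not.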
You can extract the password from that output with something like this:
... | ? { $_ -cmatch 'PASS="(.{8})"' } | % { $matches[1] }
or like this (in PowerShell v3):
... | Select-String -Case 'PASS="(.{8})"' | % { $_.Matches.Groups[1].Value }
In PowerShell v2 you'll have to do something like this if you want to use Select-String:
... | Select-String -Case 'PASS="(.{8})"' | select -Expand Matches |
select -Expand Groups | select -Last 1 | % { $_.Value }

extract a variable value from the middle of a string

I have been trying to figure this out for quite some time. How do I get the PID value from the following string using PowerShell? I thought regex was the way to go, but I can't quite figure out the syntax.
For what it's worth, everything except the PID will remain the same.
$foo = '<VALUE>I am just a string and the string is the thing. PID:25973. After this do that and blah blah.</VALUE>'
I have tried the following regexes:
[regex]::Matches($foo, 'PID:.*') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>(.+).') | % {$_.Captures[0].Groups[1].value}
For your regex you'll want to indicate what's before and after the portion you're looking for. PID:.* will find everything from the PID to the end of the string.
And to use a capture group, you'll want some ( and ) in your regex, which define a group.
So try this on for size:
[regex]::Matches($foo,'PID:(\d+)') | % {$_.Captures[0].Groups[1].value}
I'm using a regex of PID:(\d+). The \d+ means "one or more digits". The parentheses around \d+ identify it as a group I can access using Captures[0].Groups[1].
Here's another option. It replaces the whole string with the first capture group (the digits after 'PID:'):
$foo -replace '^.+PID:(\d+).+$','$1'
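The same two approaches carry over to other regex flavors. As a rough illustration, here is what they look like in Python; the variable foo mirrors $foo from the question, and the helper names are made up.

```python
import re

foo = ("<VALUE>I am just a string and the string is the thing. "
       "PID:25973. After this do that and blah blah.</VALUE>")


def pid_by_capture(text):
    """Search for PID:<digits> and return the digits, or None."""
    match = re.search(r"PID:(\d+)", text)
    return match.group(1) if match else None


def pid_by_replace(text):
    """Replace the whole string with the first capture group."""
    return re.sub(r"^.+PID:(\d+).+$", r"\1", text)


print(pid_by_capture(foo))
print(pid_by_replace(foo))
```

One difference worth noting: when there is no PID in the input, the search version returns None, while the replace version returns the input unchanged.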