Not understanding group/value/capture attributes of Powershell object matches method

Because I don't fully understand PowerShell objects, my question may not be worded accurately. From the PowerShell 7.3 ForEach-Object documentation I gather that I am using a script block and the PowerShell automatic variable $_, but that is about as close as those docs get to my example.
I'm trying to access each of two parts of a collection of name/address listings in a text file: the first three listings (001 - 003) or the second three (004 - 006).
Using $regexListings and $testListings, I have tested that I can access the first three or the second three listings by referring to the capture groups, e.g. $1 and $2. See this example running here: regex101
When I run the following Powershell code:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches}
Output is:
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 204
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
My interpretation of the Powershell output is:
there are 3 match groups?
no captures available
the value is all of it?
Why does the Powershell script output Captures {0} when the link page (regex101) above describes two capture groups which I can access?
The documentation Groups, Captures, and Substitutions is helpful but doesn't address this kind of issue. I have gone on using trial & error examples like:
ForEach-Object {$_.Matches.Groups}
ForEach-Object {$_.Matches.Captures}
ForEach-Object {$_.Matches.Value}
And I'm still none the wiser.

Information overflow. What's being output is what's considered relevant to us, the administrators. Capture group 0 is the entire value because $regexListings does match the entire string. This is where PowerShell tries to be helpful with its rich type system and displays what we may find useful, although this may simply be how the creators of the cmdlet implemented it. So you were on the right track with $_.Matches.Groups, which should have exposed the capture groups and the values for the regex match.
If you want to access those values, you have to iterate over .Matches.Groups inside that ForEach-Object, as mentioned above. What you're passing to that cmdlet isn't the individual captures but the match for the expression as a whole. This is why you're better off saving to a variable and indexing into the groups, such as $var.Matches.Groups[0] or $var.Matches.Groups[1]. You can also use the automatic variable $Matches to avoid some of the confusion: it's populated by the -match operator, and you can index the captures with $Matches[n] instead. Using your same example:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings -match $regexListings
$Matches
Which outputs:
True # this is output by -match letting you know it's succeeded in matching.
Name Value
---- -----
1 001 AALTON Alan 25 Every Street ...
0 001 AALTON Alan 25 Every Street ...
Now you have a hashtable with a more representative view of the pattern matching.
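For anyone who wants to poke at the same match outside PowerShell, here is a minimal sketch in Python. It is an illustration, not a port of the cmdlet: Python's re module stands in for the .NET engine (for this input, (?s) and an unanchored $ behave the same way in both), and the variable names are my own.

```python
import re

# Same sample data as in the question, joined with plain "\n" line breaks.
listings = "\n".join([
    "001 AALTON Alan 25 Every Street",
    "002 BROWN James 101 Browns Road",
    "003 BROWN Jemmima 101 Browns Road",
    "004 BROWN John 101 Browns Road",
    "005 CAMPBELL Colin 57 Camp Avenue",
    "006 DONNAGAN Dolores 11 Main Road",
])

# (?s) lets "." cross line breaks, but "$" still only matches at the very end,
# so the first alternative has to run all the way to the end of the text.
pattern = r"(?s)(001.*?003.*?$)|(004.*?006.*?$)"
m = re.search(pattern, listings)

whole = m.group(0)   # group 0 is always the entire match
first = m.group(1)   # the alternative that actually matched
second = m.group(2)  # the other alternative never took part

print(whole == listings)  # True
print(first == whole)     # True
print(second)             # None
```

That mirrors the $Matches output above: keys 0 and 1 hold the same text, and there is no entry for group 2 because the second alternative never matched.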

In order to access each of two parts of the listings I needed to be able to see them in the output using:
$regexListings = '(?ms)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 102
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
ValueSpan :
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 103
Length : 101
Value : 004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
The differences from the question code being:
using multi line modifier (?ms) instead of (?s) in the regex
using {$_.Matches.Captures} as the regex contains capture grouping
These captures can be accessed by assigning the result to a variable and then indexing, e.g.:
$result = $testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
$result[1]
$result[0]
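To see why adding the m modifier changes the result from one match to two, here is a small sketch in Python (again only a stand-in for the .NET engine, with names of my own choosing). With (?ms), $ is also allowed to match at the end of each line, so each alternative can stop after its third listing.

```python
import re

# Same sample data as in the question, joined with plain "\n" line breaks.
listings = "\n".join([
    "001 AALTON Alan 25 Every Street",
    "002 BROWN James 101 Browns Road",
    "003 BROWN Jemmima 101 Browns Road",
    "004 BROWN John 101 Browns Road",
    "005 CAMPBELL Colin 57 Camp Avenue",
    "006 DONNAGAN Dolores 11 Main Road",
])

pattern = re.compile(r"(?ms)(001.*?003.*?$)|(004.*?006.*?$)")

blocks = []
for m in pattern.finditer(listings):
    # lastindex is the number of the capture group that took part in this match.
    blocks.append((m.lastindex, m.group(0).splitlines()))

for group_number, lines in blocks:
    print(group_number, len(lines), lines[0])
# 1 3 001 AALTON Alan 25 Every Street
# 2 3 004 BROWN John 101 Browns Road
```

Each match object carries both groups, but only one of them is populated per match, which is why indexing the list of matches is the simplest way to pick a block.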

Related

Regular expression in Oracle to filter particular characters only

I have a scenario
Case 1: "NO 41 ABC STREET"
Case 2: "42 XYZ STREET"
There are almost 100,000 rows like this in my table.
I want a regexp that
omits 'NO 41' and leaves back ABC STREET as output in case 1, whereas
in case 2 I want '42 XYZ STREET' as output.

regexp_replace('NO 41 ABC STREET', 'NO [0-9]+ |([0-9]+)', '\1') outputs ABC STREET.
regexp_replace('42 XYZ STREET', 'NO [0-9]+ |([0-9]+)', '\1') outputs 42 XYZ STREET.
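The trick in that pattern is that only the second alternative has a capture group, so \1 is empty whenever the "NO <digits> " alternative matched. A minimal sketch of the same idea in Python (not Oracle; the function name is hypothetical, and it relies on Python 3.5+ replacing an unmatched group with an empty string):

```python
import re

PATTERN = r"NO [0-9]+ |([0-9]+)"

def strip_no_prefix(address):
    # When "NO 41 " matches, group 1 took no part and \1 expands to nothing.
    # When a bare number matches, \1 puts that same number straight back.
    return re.sub(PATTERN, r"\1", address)

print(strip_no_prefix("NO 41 ABC STREET"))  # ABC STREET
print(strip_no_prefix("42 XYZ STREET"))     # 42 XYZ STREET
```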

You have provided only 2 scenarios of your data in the table. Assuming that you only want to strip the prefix from values that start with "NO" followed by digits and then a space before the remaining characters, you could use this.
SQL Fiddle
Query:
select s,REGEXP_REPLACE(s,'^NO +\d+ +') as r FROM data
Results:
| S | R |
|------------------|---------------|
| NO 41 ABC STREET | ABC STREET |
| 42 XYZ STREET | 42 XYZ STREET |
If you have more complex data to be filtered, please edit your question and describe it clearly.
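As a rough illustration of what the ^ anchor buys you, here is the same pattern tried in Python rather than Oracle. The "CASINO 5 ROAD" row is a made-up value, not something from the question's data.

```python
import re

def strip_leading_no(address):
    # Anchored: only a "NO <digits> " prefix at the very start is removed.
    return re.sub(r"^NO +\d+ +", "", address)

def strip_any_no(address):
    # Unanchored variant, shown only for comparison.
    return re.sub(r"NO +\d+ +", "", address)

print(strip_leading_no("NO 41 ABC STREET"))  # ABC STREET
print(strip_leading_no("42 XYZ STREET"))     # 42 XYZ STREET
print(strip_leading_no("CASINO 5 ROAD"))     # CASINO 5 ROAD
print(strip_any_no("CASINO 5 ROAD"))         # CASIROAD
```

With 100,000 rows it is probably worth checking for values like the last one before running an unanchored replace.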

I am trying to write a regex in PowerShell for Canadian addresses

This is the address format. The street number may vary (e.g. 12 or 412), and so may the number of words in the street name (here, Finch Ave East):
1460 Finch Ave East, Toronto, Ontario, A1A1A1
So I tried this:
^[0-9]+\s+[a-zA-Z]+\s+[a-zA-Z]+\s+[a-zA-Z]+[,]{1}+\s[a-zA-Z]+[,]{1}+\s+[a-zA-Z]+[,]{1}+\s[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$

I usually recommend using regex capture groups, so you can break the matching problem down into smaller pieces. In most cases I use \d, \w and \s for matching digits, word characters and whitespace.
I usually experiment on https://regex101.com before I put it into code, because it provides a nice interactive way to play with expressions and samples.
Regarding your question, the expression that I came up with is:
$regexp = "^(\d+)\s*((\w+\s*)+),\s*(\w+),\s*(\w+),\s*((\w\d)*)$"
In PowerShell I like to use the direct regex class, because it offers more granularity than the standard -match operator.
# Example match and results
$sample = "1460 Finch Ave East, Toronto, Ontario, A1A1A1"
$match = [regex]::Match($sample, $regexp)
$match.Success
$match | Select -ExpandProperty groups | Format-Table Name, Value
# Constructed fields
@{
    number   = $match.Groups[1]
    street   = $match.Groups[2]
    city     = $match.Groups[4]
    state    = $match.Groups[5]
    areacode = $match.Groups[6]
}
This results in $match.Success being $true, and the following numbered capture groups are listed in Groups:
Name Value
---- -----
0 1460 Finch Ave East, Toronto, Ontario, A1A1A1
1 1460
2 Finch Ave East
3 East
4 Toronto
5 Ontario
6 A1A1A1
7 A1
For constructing the fields, you can ignore 3 and 7 as those are partial-groups:
Name Value
---- -----
areacode A1A1A1
street Finch Ave East
city Toronto
state Ontario
number 1460
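If it is unclear why groups 3 and 7 only hold "East" and "A1": both sit inside a repeated group, and a repeated group keeps only its last iteration. A quick sketch in Python shows the same numbering (Python's re is used here purely as a stand-in for [regex]::Match; it handles this pattern the same way).

```python
import re

regexp = r"^(\d+)\s*((\w+\s*)+),\s*(\w+),\s*(\w+),\s*((\w\d)*)$"
sample = "1460 Finch Ave East, Toronto, Ontario, A1A1A1"

m = re.match(regexp, sample)

# Walk group 0 (whole match) through group 7 in order.
for number in range(m.re.groups + 1):
    print(number, m.group(number))

# Groups 3 and 7 are the inner, repeated groups: each iteration overwrites
# the previous capture, so only the last word / last letter-digit pair is left.
street_last_word = m.group(3)   # "East"
postal_last_pair = m.group(7)   # "A1"
```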

To add to mákos' excellent answer, I would suggest using named capture groups and the $Matches automatic variable. This makes it super easy to grab the individual fields and turn them into objects for multiple input strings:
function Split-CanadianAddress {
    param(
        [Parameter(Mandatory,ValueFromPipeline)]
        [string[]]$InputString
    )
    $Pattern = "^(?<Number>\d+)\s*(?<Street>(\w+\s*)+),\s*(?<City>(\w+\s*)+),\s*(?<State>(\w+\s*)+),\s*(?<AreaCode>(\w\d)*)$"
    foreach($String in $InputString){
        if($String -match $Pattern){
            # Copy only the named groups; $Matches also holds the numbered ones
            $Fields = @{}
            $Matches.Keys |Where-Object {$_ -isnot [int]} |ForEach-Object {
                $Fields.Add($_,$Matches[$_])
            }
            [pscustomobject]$Fields
        }
    }
}
The $Matches hashtable will contain both the numbered and named capture groups, which is why I copy only the named entries to the $Fields variable before creating the pscustomobject
Now you can use it like:
PS C:\> $sample |Split-CanadianAddress
Street : Finch Ave East
State : Ontario
AreaCode : A1A1A1
Number : 1460
City : Toronto
I've updated the pattern to allow for spaces in city and state names as well (think "New Westminster, British Columbia")
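For comparison, here is a rough Python sketch of the named-group idea. Two things differ from the PowerShell function and are my own choices, not the answer's: the multi-word fields are written as \w+(?:\s+\w+)* so there are no unnamed inner groups to filter out, and the function name and the second sample address are hypothetical.

```python
import re

ADDRESS = re.compile(
    r"^(?P<Number>\d+)\s+"
    r"(?P<Street>\w+(?:\s+\w+)*),\s*"
    r"(?P<City>\w+(?:\s+\w+)*),\s*"
    r"(?P<State>\w+(?:\s+\w+)*),\s*"
    r"(?P<AreaCode>(?:\w\d)*)$"
)

def split_canadian_address(text):
    """Return the named fields as a dict, or None if the text does not match."""
    m = ADDRESS.match(text)
    # groupdict() contains only the named groups, so no key filtering is needed.
    return m.groupdict() if m else None

print(split_canadian_address("1460 Finch Ave East, Toronto, Ontario, A1A1A1"))
print(split_canadian_address("100 Main Street, New Westminster, British Columbia, V3L1A1"))
print(split_canadian_address("not an address"))  # None
```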

Grep Regex: How to find multiple area codes in phone number?

I have a file: each line consist of a name, room number, house address, phone number.
I want to search for the lines that have the area codes of either 404 or 202.
I tried "(404)|(202)", but it also returns lines where those digits appear anywhere in the phone number rather than only in the area code, for example:
John Smith 300 123 N. Street 808-543-2029
I do not want the above, I am targeting lines like this, examples:
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200

Let's consider this test file:
$ cat addresses
John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
The distinguishing feature of area codes, as opposed to other three digit numbers, is that they have a space before them and a - after them. Thus, use:
$ grep -E ' (202|404)-' addresses
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
More complex example
Suppose that phone numbers appear at the end of lines but can have any of the three forms 808-543-2029, 8085432029, or 808 543 2029 as in the following example:
$ cat addresses
John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
To select the lines with 202 or 404 area codes:
$ grep -E ' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})$' addresses
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
If it is possible that the phone numbers are followed by stray whitespaces, then use:
$ grep -E ' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})[[:blank:]]*$' addresses
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
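If you need the same filter inside a script, the last pattern carries over to Python almost unchanged. This is only a sketch: the POSIX classes [[:digit:]] and [[:blank:]] are spelled out as [0-9] and [ \t], which is my translation, and the lines are the test data from above.

```python
import re

lines = [
    "John Smith 202 404 N. Street 808-543-2029",
    "Danny Brown 173 555 W. Avenue 2023831540",
    "Martha Keith 567 322 S. Example 404 653 1200",
]

# A space, the area code, then the other seven digits in any of the three
# layouts, optionally followed by trailing blanks, at the end of the line.
AREA_CODE = re.compile(
    r" (202|404)([- ][0-9]{3}[- ][0-9]{4}|[0-9]{7})[ \t]*$"
)

selected = [line for line in lines if AREA_CODE.search(line)]

for line in selected:
    print(line)
# Danny Brown 173 555 W. Avenue 2023831540
# Martha Keith 567 322 S. Example 404 653 1200
```

The room numbers 202 and 404 on the first line are skipped because nothing that looks like the rest of a phone number follows them.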

You need to add a word boundary token \b right at the beginning of the expression, such as \b(202|404).
Demo.
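One caveat worth checking before relying on the boundary alone: a hyphen is not a word character, so there is also a boundary in front of the last block of digits. A quick test in Python (assuming grep's \b behaves the same way here) suggests the sample line from the question still matches unless the trailing hyphen is part of the pattern.

```python
import re

lines = [
    "John Smith 300 123 N. Street 808-543-2029",
    "Danny Brown 173 555 W. Avenue 202-383-1540",
    "Martha Keith 567 322 S. Example 404-653-1200",
]

# Boundary only: "2029" also starts at a word boundary (right after "-").
boundary_only = [l for l in lines if re.search(r"\b(202|404)", l)]

# Boundary plus the hyphen that follows an area code.
boundary_and_dash = [l for l in lines if re.search(r"\b(202|404)-", l)]

print(len(boundary_only))      # 3
print(len(boundary_and_dash))  # 2
```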

Regex to extract (german) street number

I have the following street constellations:
| Street name | extracted value |
| --------------------------------------- | --------------- |
| Lilienstr. 12a | 12a |
| Hagentorwall 3 | 3 |
| Seilerstr. 14 (Eingang Birkenstr.) | 14 |
| Guentherstr. 43 B | 43 B |
| Eberhard-Leibnitz Str. 1 WH 5B 241 | 1 |
| 1019-1781 Borderlinx C/O SEKO Logistics | - |
My Regex is partially working (https://regex101.com/r/KumamP/2):
\d+(?:[a-zA-Z]$|\s[a-zA-Z]$)?
Does anyone have a better solution? Eberhard-Leibnitz Str. should give me only one result or none. 1019-1781 Borderlinx C/O SEKO Logistics should give me no result.

The following regex is working for your example
^[ \-a-zA-Z.]+\s+(\d+(\s?\w$)?)
https://regex101.com/r/KumamP/4
The basic assumption is (like your samples suggest), that valid "street constellations" always start with a street name followed by the street/house number.
The next regex is also working if there is an entry like Straße des 17. Juni 1:
^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})
https://regex101.com/r/KumamP/5
But as the commentators already wrote, it is difficult for a regular expression to distinguish between numerical parts of a street name and the street number. Even more so if you allow "unspecified" suffixes like (Eingang Birkenstr.) or WH 5B 241 in your example.
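To check both expressions against the samples from the question without opening regex101, here is a small sketch in Python. The helper name house_number is made up for illustration; the patterns are the two given above, unchanged.

```python
import re

SIMPLE = re.compile(r"^[ \-a-zA-Z.]+\s+(\d+(\s?\w$)?)")
EXTENDED = re.compile(
    r"^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})"
)

def house_number(pattern, address):
    """Return capture group 1 (the street number) or None if nothing matches."""
    m = pattern.search(address)
    return m.group(1) if m else None

samples = [
    "Lilienstr. 12a",
    "Hagentorwall 3",
    "Seilerstr. 14 (Eingang Birkenstr.)",
    "Guentherstr. 43 B",
    "Eberhard-Leibnitz Str. 1 WH 5B 241",
    "1019-1781 Borderlinx C/O SEKO Logistics",
]

for address in samples:
    print(address, "->", house_number(SIMPLE, address))

# The first pattern has no "ß" or digits in its street-name class,
# which is where the second pattern comes in.
print(house_number(SIMPLE, "Straße des 17. Juni 1"))    # None
print(house_number(EXTENDED, "Straße des 17. Juni 1"))  # 1
```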

Parsing address lines is not trivial. Many countries have their own special rules and Germany and Austria are really tricky.
To better understand the examples you provided, one in particular illustrates the point:
"Eberhard-Leibnitz Str. 1 WH 5B 241"
The "WH" here stands for "Wohnung", but they usually use just "W" (and use some separator like "//"). So it would be more like:
"Eberhard-Leibnitz Str. 1 // W 5B 241"
It's also common to find "co" or "c/o" or "z. H" (abbreviation for "zu Händen von"). Anything that follows it is just the name on the mailbox.
And last but not least, the address line could also contain the zip code + city name. It depends on the API you're interacting with, or on whether it's user input (it can get very wild then!).
So, to properly parse address lines, you should first normalize them, by removing that extra information. Then you can use a regex. Take a look at this gem: https://github.com/matiasalbarello/address_line_divider
Some good reads about the topic:
https://www.german-way.com/germans-we-dont-need-apartment-numbers/
https://allaboutberlin.com/guides/addressing-a-letter-in-germany
http://interactive.zeit.de/strassennamen/

Returning and collating instances of arbitrary text in files (plus some other requirements)

I have a job that doesn't seem super difficult, but i'm not clever enough to know where to start. :)
So i have a number of files that contain lines with names on the end, something like this:
File 001:
blahblahblahblahblah:Mrs Jane Doe
blahblahblahblahblah:John Doe
blahblahblahblahblah:Joe Bloggs
File 002:
blahblahblahblahblah:Dr Jane Doe
blahblahblahblahblah:John Doe
blahblahblahblahblah:Joe Bloggs
blahblahblahblahblah:Fred Bloggs
...
And so on. What i would like to do is have a script go through all of these files and return output like this:
John Doe
001
002
...
Fred Bloggs
002
...
But i DON'T know what the names are, so it needs to find them (easy enough with regexp obviously) and then collate them on its own.
On top of that, it ALSO needs to take into account the case of Jane Doe above, where she appears in different places with different titles. Ideally i would like the output for her to be something like the following:
Jane Doe
001 - Mrs
002 - Dr
...
So this script would need to be able to exclude certain terms (which i'd provide, of course) for the purposes of collation, but be smart enough to add those terms back into my results so i can keep track of the changes.
Is this something i can do with a single tool like awk, or am i looking at like a Ruby/Python/perl script sort of job? If the latter, could someone point me to the functions or libraries or whatever i might want to use? (I'm kind of mediocre at scripting/programming but with sufficient documentation i can usually muddle through eventually)
Cheers!
Edit: Seems i've managed it on my own. :) Here's what i did, in case anyone runs across this in the future:
1. I merged all of my names into a single file and stripped all of the titles, like this:
cat * | \
perl -pe 's/^(Dr |Mrs |Mr |blahblah )//ig' | \
sort | \
uniq -i > \
notitles.txt
2. I looped through the resulting file with grep:
while read name; do
export fname="`echo $name | perl -pe 's/[\r\n]//g'`"
echo "\n=={$fname}=="
egrep -i "$name" * | \
sed -E 's/\.txt:/ /g' | \
perl -pe 's/(?<!==)$ENV{"fname"}//ig'
done < notitles.txt
And then i got my list of names like i wanted!

I think it's pretty easy to implement with the help of awk.
I create two files named by ID: 001, 002. And put them in a directory.
$ ls *
001 002
$ gawk -F: '{sub(/^(Dr|Mrs|Mr|Miss) +/, "", $2); a[$2]=a[$2]FILENAME":"}; END{for(i in a)printf("%s:%s\n", i, a[i])}' * | sort | tr : '\n'
Fred Bloggs
002
Jane Doe
001
002
Joe Bloggs
001
002
John Doe
001
002
$ gawk -F: '
{
if(match($2, /^(Dr|Mrs|Mr) +/, m)) {
t = " - " substr($2, m[1, "start"], m[1, "length"])
n = substr($2, m[0, "start"]+m[0, "length"])
}
else {
t = ""
n = $2
}
a[n] = a[n] FILENAME t ":"
}
END{
for(i in a)
printf("%s:%s\n", i, a[i])
}' * | sort | tr : '\n'
Fred Bloggs
002
Jane Doe
001 - Mrs
002 - Dr
Joe Bloggs
001
002
John Doe
001
002
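Since the question also asked about Python, here is a rough equivalent of the second gawk script as a sketch. To keep it self-contained it reads from an in-memory dictionary standing in for the files 001 and 002 (a real script would read the files from disk), and the function name collate is my own.

```python
import re
from collections import defaultdict

# Hypothetical in-memory stand-ins for the files "001" and "002".
files = {
    "001": [
        "blahblahblahblahblah:Mrs Jane Doe",
        "blahblahblahblahblah:John Doe",
        "blahblahblahblahblah:Joe Bloggs",
    ],
    "002": [
        "blahblahblahblahblah:Dr Jane Doe",
        "blahblahblahblahblah:John Doe",
        "blahblahblahblahblah:Joe Bloggs",
        "blahblahblahblahblah:Fred Bloggs",
    ],
}

# Titles to ignore when grouping; extend the alternation as needed.
TITLE = re.compile(r"^(Dr|Mrs|Mr|Miss) +")

def collate(files):
    """Map each name (title stripped) to the files it appears in."""
    seen = defaultdict(list)
    for file_id in sorted(files):
        for line in files[file_id]:
            name = line.rsplit(":", 1)[-1].strip()
            title = TITLE.match(name)
            if title:
                # Group under the bare name, but keep the title in the entry.
                name = name[title.end():]
                seen[name].append(f"{file_id} - {title.group(1)}")
            else:
                seen[name].append(file_id)
    return dict(seen)

for name, entries in sorted(collate(files).items()):
    print(name)
    for entry in entries:
        print(entry)
```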
