Powershell parsing regex in text file

I have a text file:
Text.txt
2015-08-31 05:55:54,881 INFO (ClientThread.java:173) - Login successful for user = Test, client = 123.456.789.100:12345
2015-08-31 05:56:51,354 INFO (ClientThread.java:325) - Closing connection 123.456.789.100:12345
I would like output to be:
2015-08-31 05:55:54 Login Test 123.456.789.100
2015-08-31 05:56:51 Closing connection 123.456.789.100
Code:
$files = Get-Content "Text.txt"
$grep = $files | Select-String "serviceClient:" , "Unregistered" |
Where {$_ -match '^(\S+)+\s+([^,]+).*?-\s+(\w+).*?(\S+)$' } |
Foreach {"$($matches[1..4])"} | Write-Host
How can I do it with the current code?

Add another group and make it optional.
^(\S+)+\s+([^,]+).*?-\s+(\w+)(?:.*?=\s+(\w+))?.*?(\S+?)(?::\d+)?$
^
( \S+ )+ # (1)
\s+
( [^,]+ ) # (2)
.*? - \s+
( \w+ ) # (3)
(?:
.*? = \s+
( \w+ ) # (4)
)?
.*?
( \S+? ) # (5)
(?: : \d+ )?
$
Output:
** Grp 0 - ( pos 0 , len 121 )
2015-08-31 05:55:54,881 INFO (ClientThread.java:173) - Login successful for user = Test, client = 123.456.789.100:12345
** Grp 1 - ( pos 0 , len 10 )
2015-08-31
** Grp 2 - ( pos 11 , len 8 )
05:55:54
** Grp 3 - ( pos 57 , len 5 )
Login
** Grp 4 - ( pos 85 , len 4 )
Test
** Grp 5 - ( pos 100 , len 15 )
123.456.789.100
----------------
** Grp 0 - ( pos 123 , len 97 )
2015-08-31 05:56:51,354 INFO (ClientThread.java:325) - Closing connection 123.456.789.100:12345
** Grp 1 - ( pos 123 , len 10 )
2015-08-31
** Grp 2 - ( pos 134 , len 8 )
05:56:51
** Grp 3 - ( pos 180 , len 7 )
Closing
** Grp 4 - NULL
** Grp 5 - ( pos 199 , len 15 )
123.456.789.100
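
As a cross-check outside PowerShell, here is a minimal Python sketch that applies the same pattern line by line and joins whichever groups took part in the match. The function name summarize is made up for illustration. Note that group 3 captures only the first word after the dash, so the second line yields Closing rather than Closing connection.

```python
import re

# The pattern from the answer above, unchanged.
LINE_RE = re.compile(
    r'^(\S+)+\s+([^,]+).*?-\s+(\w+)(?:.*?=\s+(\w+))?.*?(\S+?)(?::\d+)?$'
)

def summarize(line):
    """Return the matched groups joined by spaces, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Group 4 (the user name) is optional, so skip groups that did not participate.
    return " ".join(g for g in m.groups() if g is not None)

log = [
    "2015-08-31 05:55:54,881 INFO (ClientThread.java:173) - Login successful "
    "for user = Test, client = 123.456.789.100:12345",
    "2015-08-31 05:56:51,354 INFO (ClientThread.java:325) - Closing connection "
    "123.456.789.100:12345",
]

for line in log:
    print(summarize(line))
```

Joining only the groups that are not None is what makes the optional group harmless for lines without a user name.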

How to get specific error message to Data Validation using Regex with python 3

How do I get a specific output like the example below?
Example 1: if the user inputs Alberta, UN, I want the program to print I'm sorry, UN is an invalid province abbreviation.
I would love the program to display an exact error related to the user's input, instead of a generic error message that does not tell the user what they got wrong.
I would really appreciate some help, because I have been brainstorming on how to make this work.
# Import
import re
# Program interface
print("====== POSTAL CODE CHECKER PROGRAM ====== ")
print("""
Select from below city/Province and enter it
--------------------------------------------
Alberta, AB,
British Columbia, BC
Manitoba, MB
New Brunswick, NB
Newfoundland, NL
Northwest Territories, NT
Nova Scotia, NS
Nunavut, NU
Ontario, ON
Prince Edward Island, PE
Quebec, QC
Saskatchewan, SK
Yukon, YT
""")
# User input 1
province_input = input("Please enter the city/province: ")
pattern = re.compile(r'[ABMNOPQSYabcdefhiklmnorstuvw]| [CBTSE]| [Idasln], [ABMNOPQSY]+[BCLTSUNEK]')
if pattern.match(province_input):
    print("You have successfully inputted {} the right province.".format(province_input))
elif not pattern.match(province_input):
    print("I'm sorry, {} is an invalid province abbreviation".format(province_input))
else:
    print("I'm sorry, your city or province is incorrectly formatted.")

I tried to generalize your question, so the code checks that the first part of the input is a valid city and the second part is a valid state abbreviation, where "valid" means that each appears in its list of valid inputs.
The core of the code is the regex ([A-Z][A-Za-z ]+), ([A-Z]{2}), which matches two groups: the first group contains the city and, after a comma and a space, the second group contains the state abbreviation (which must consist of two capital letters).
Please notice there are 5 possible outputs, according to the validity of each part.
import re

cities = ["Alberta", "British Columbia", "Manitoba"]
states = ["AB", "BC", "MB"]
province_input = input("Please enter the city/province: ")
regexp = r"([A-Z][A-Za-z ]+), ([A-Z]{2})"
# Match once and reuse the match object for both groups.
m = re.match(regexp, province_input)
if m:
    city_input = m.group(1)
    state_input = m.group(2)
    if city_input not in cities and state_input not in states:
        print("Both '%s' and '%s' are invalid" % (city_input, state_input))
    elif city_input in cities:
        if state_input in states:
            print("Your input '%s, %s' was valid" % (city_input, state_input))
        else:
            print("'%s' is an invalid province abbreviation" % state_input)
    else:
        print("The city '%s' is invalid" % city_input)
else:
    print("Wrong input format")
I tried to make the code as clear as possible, but please do let me know if anything is unclear.
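
One thing the two separate lists do not check is whether the abbreviation belongs to the province that was entered, so an input such as Alberta, BC would be reported as valid. A possible variation, sketched here with a dictionary built from the list in the question, ties each name to its abbreviation. The names PROVINCES and check_entry are made up for this illustration.

```python
import re

# Province names and abbreviations as listed in the question.
PROVINCES = {
    "Alberta": "AB", "British Columbia": "BC", "Manitoba": "MB",
    "New Brunswick": "NB", "Newfoundland": "NL",
    "Northwest Territories": "NT", "Nova Scotia": "NS", "Nunavut": "NU",
    "Ontario": "ON", "Prince Edward Island": "PE", "Quebec": "QC",
    "Saskatchewan": "SK", "Yukon": "YT",
}

def check_entry(entry):
    """Return a message describing exactly what is wrong with the entry, if anything."""
    m = re.fullmatch(r"\s*([A-Za-z ]+?)\s*,\s*(\w+)\s*", entry)
    if m is None:
        return "Wrong input format"
    name, abbr = m.group(1), m.group(2)
    if name not in PROVINCES:
        return "The province '%s' is invalid" % name
    if abbr != PROVINCES[name]:
        return "I'm sorry, %s is an invalid province abbreviation" % abbr
    return "Your input '%s, %s' was valid" % (name, abbr)

for sample in ["Alberta, UN", "British Columbia, BC", "Gotham, AB", "Alberta"]:
    print(check_entry(sample))
```

The regex only splits the input at the comma; deciding what is valid is left to the dictionary lookup, which keeps the pattern short.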

I'm not a Python programmer, but I think that if you import regex (the newer replacement for the re library), you get access to the Branch Reset construct.
Using that, it's trivial to divide City and Province into 3 groups.
So, just match with this regex:
[ \t]*(?|(Alberta)(?:[ \t]*,[ \t]*(?:(AB)|(\w+)))?|(British[ \t]+Columbia)(?:[ \t]*,[ \t]*(?:(BC)|(\w+)))?|(Manitoba)(?:[ \t]*,[ \t]*(?:(MB)|(\w+)))?|(New[ \t]+Brunswick)(?:[ \t]*,[ \t]*(?:(NB)|(\w+)))?|(Newfoundland)(?:[ \t]*,[ \t]*(?:(NL)|(\w+)))?|(Northwest[ \t]+Territories)(?:[ \t]*,[ \t]*(?:(NT)|(\w+)))?|(Nova[ \t]+Scotia)(?:[ \t]*,[ \t]*(?:(NS)|(\w+)))?|(Nunavut)(?:[ \t]*,[ \t]*(?:(NU)|(\w+)))?|(Ontario)(?:[ \t]*,[ \t]*(?:(ON)|(\w+)))?|(Prince[ \t]+Edward[ \t]+Island)(?:[ \t]*,[ \t]*(?:(PE)|(\w+)))?|(Quebec)(?:[ \t]*,[ \t]*(?:(QC)|(\w+)))?|(Saskatchewan)(?:[ \t]*,[ \t]*(?:(SK)|(\w+)))?|(Yukon)(?:[ \t]*,[ \t]*(?:(YT)|(\w+)))?|()(\w+)())
Check the groups in this order :
if NO match : Please enter 'City, Province' from the list
else if length $1 equals 0 : '$2' is not a valid City
else if length $3 > 0 : '$3' is not a valid Province
else if length $2 equals 0 : Please enter a Province
else : Thank you, your entry is valid '$1, $2'
Demo: https://regex101.com/r/MrlqEN/1
Expanded
[ \t]*
(?|
( Alberta ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( AB ) # (2)
| ( \w+ ) # (3)
)
)?
| ( British [ \t]+ Columbia ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( BC ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Manitoba ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( MB ) # (2)
| ( \w+ ) # (3)
)
)?
| ( New [ \t]+ Brunswick ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NB ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Newfoundland ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NL ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Northwest [ \t]+ Territories ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NT ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Nova [ \t]+ Scotia ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NS ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Nunavut ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NU ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Ontario ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( ON ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Prince [ \t]+ Edward [ \t]+ Island ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( PE ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Quebec ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( QC ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Saskatchewan ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( SK ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Yukon ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( YT ) # (2)
| ( \w+ ) # (3)
)
)?
| ( ) # (1)
( \w+ ) # (2)
( ) # (3)
)

How do named and unnamed PCRE capturing groups interact?

If the regular expression is e.g. ^(?<object>[\-\w]+)/([\-\w]+)$, will one invoke the second capturing group as $2 or as $1? In other words, are anonymous capturing groups absolutely or relatively numbered?

Use $2 to refer to the second numbered capturing group. Note that I would not call it anonymous; "unnamed" would suit better here.
See a sample regex demo.
See PCRE docs:
PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers.
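
The same rule can be observed in Python's built-in re module, which is not PCRE but also gives named groups a number. The sketch below uses Python's (?P<name>...) spelling and a made-up input string.

```python
import re

# Same shape as the pattern in the question, written with Python's named-group syntax.
m = re.match(r'^(?P<object>[-\w]+)/([-\w]+)$', 'widgets/blue-42')

print(m.group('object'))  # the named group, looked up by name
print(m.group(1))         # the same group, looked up by number
print(m.group(2))         # the unnamed group is number 2, not number 1
```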

In PCRE, capture groups are numbered sequentially in the order in which their opening parentheses are found.
Here is an example where the groups are annotated, indented and numbered (mixed with some conditionals).
# ==============================
# Variations of the same thing
# ==============================
1 ( a )?
2 ( b )?
3 ( c )?
c (?(1)
|
c (?(2)
|
c (?(3) | (*FAIL) )
)
)
# ==============================
4 (
5 ( a )?
6 ( b )?
7 ( c )?
4 )
c (?(2)
|
c (?(3)
|
c (?(4) | (*FAIL) )
)
)
# ==============================
8 (?<A> a )?
9 (?<B> b )?
10 (?<C> c )?
c (?(<A>)
|
c (?(<B>)
|
c (?(<C>) | (*FAIL) )
)
)
# ==============================
11 (?<M>
12 (?<A> a )?
13 (?<B> b )?
14 (?<C> c )?
c (?(<A>)
|
c (?(<B>)
|
c (?(<C>) | (*FAIL) )
)
)
11 )
# ==============================
The Branch Reset changes the numbering rules a little.
Numbering inside each branch restarts at the group number where the Branch Reset begins.
After the Branch Reset, numbering continues from one past the largest number assigned in any single branch.
Example:
# Super Branch with Conditional's
1 ( a ) # (1)
(?|
x
br 2 ( y ) # (2)
z
(?|
br 3 ( u ) # (3)
4 ( u ) # (4)
c (?(1)
5 ( R ) # (5)
| (?|
br 6 ( x ) # (6)
|
br 6 ( x ) # (6)
c (?(2)
a
|
7 ( b ) # (7)
)
8 ( c ) # (8)
)
)
9 ( u ) # (9)
10 ( u ) # (10)
|
br 3 ( e ) # (3)
4 ( e ) # (4)
5 ( e ) # (5)
|
br 3 ( c ) # (3)
)
11 ( K ) # (11)
|
br 2 ( # (2 start)
p
3 ( # (3 start)
q
(?|
br 4 ( M ) # (4)
5 ( M ) # (5)
6 ( M ) # (6)
7 ( M ) # (7)
(?|
br 8 ( T ) # (8)
9 ( T ) # (9)
10 ( T ) # (10)
|
br 8 ( D ) # (8)
9 ( D ) # (9)
)
12 ( R ) # (12)
13 ( R ) # (13)
|
br 4 ( B ) # (4)
5 ( B ) # (5)
6 ( B ) # (6)
|
br 4 ( v ) # (4)
)
3 ) # (3 end)
r
2 ) # (2 end)
14 ( o ) # (14)
15 ( i ) # (15)
|
br 2 ( t ) # (2)
s
3 ( w ) # (3)
)
16 ( Z ) # (16)
Addendum for Dot-Net counting
There are 2 options for counting Dot-Net captures:
1. Count named capture groups
2. Named groups last
Obviously, without 1 you don't get 2.
Example: Don't count named groups
1 ( # (1 start)
(?'overall'
^
(?= [^&] )
(?:
(?<scheme> [^:/?#]+ )
:
)?
(?:
//
2 ( ) # (2)
(?<authority> [^/?#]* )
)?
(?<path> [^?#]* )
(?:
\?
(?<query> [^#]* )
)?
3 ( ) # (3)
(?:
\#
(?<fragment> .* )
)?
)
1 ) # (1 end)
Example: Count named groups
1 ( # (1 start)
2 (?'overall' # (2 start)
^
(?= [^&] )
(?:
3 (?<scheme> [^:/?#]+ ) # (3)
:
)?
(?:
//
4 ( ) # (4)
5 (?<authority> [^/?#]* ) # (5)
)?
6 (?<path> [^?#]* ) # (6)
(?:
\?
7 (?<query> [^#]* ) # (7)
)?
8 ( ) # (8)
(?:
\#
9 (?<fragment> .* ) # (9)
)?
2 ) # (2 end)
1 ) # (1 end)
Example: Count named groups, and Named groups last
1 ( # (1 start)
4 (?'overall' #_(4 start)
^
(?= [^&] )
(?:
5 (?<scheme> [^:/?#]+ ) #_(5)
:
)?
(?:
//
2 ( ) # (2)
6 (?<authority> [^/?#]* ) #_(6)
)?
7 (?<path> [^?#]* ) #_(7)
(?:
\?
8 (?<query> [^#]* ) #_(8)
)?
3 ( ) # (3)
(?:
\#
9 (?<fragment> .* ) #_(9)
)?
4 ) #_(4 end)
1 ) # (1 end)

Perl nested parentheses expression

How do I use perl regex to extract the contents within the outermost parentheses?
text = (-(A + (B - C)))
output = -(A + (B - C))
Thanks

It can be done with this (\(((?:[^()]++|(?1))*)\)) and there are several
ways to do it.
Formatted and tested:
( # (1 start), Recursion code group
\( # Opening (
( # (2 start), Capture, inner core
(?: # Cluster group
[^()]++ # Possessive, not parentheses
| # or,
(?1) # Recurse to group 1
)* # End cluster, do 0 to many times
) # (2 end)
\) # Closing )
) # (1 end)
Output
** Grp 0 - ( pos 4 , len 16 )
(-(A + (B - C)))
** Grp 1 - ( pos 4 , len 16 )
(-(A + (B - C)))
** Grp 2 - ( pos 5 , len 14 )
-(A + (B - C))
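
For comparison, engines without recursion (Python's built-in re is one) need a different technique. The sketch below swaps the recursive pattern for a plain depth counter; the function name outermost_groups is made up for this illustration.

```python
def outermost_groups(text):
    """Return the contents of each top-level (...) group found in text."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1  # contents begin just after the outer "("
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])  # closed an outermost group
    return groups

print(outermost_groups("text = (-(A + (B - C)))"))
```

Unbalanced closing parentheses are simply ignored here, which may or may not be the behaviour you want.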

I don't see that anything more than this is required:
use strict;
use warnings 'all';
my $text = "(-(A + (B - C)))";
my ($result) = $text =~ / \( (.*) \) /x;
print $result, "\n";
output
-(A + (B - C))
The pattern captures everything from after the first opening parenthesis to before the last closing parenthesis. From your question, I don't think there's a need to check that the string is balanced.
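
To make the trade-off concrete, here is the same greedy idea as a small Python sketch (the function name is made up). It gives the wanted result for a single outer group, but when the text holds two separate top-level groups it spans from the first opening parenthesis to the last closing one.

```python
import re

def between_first_and_last(text):
    """Capture everything between the first '(' and the last ')'."""
    m = re.search(r'\((.*)\)', text)
    return m.group(1) if m else None

print(between_first_and_last("(-(A + (B - C)))"))
print(between_first_and_last("(a) + (b)"))
```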

Perl: Keep only one of two consecutive characters

I'm having trouble applying a regex to keep only one of two specific consecutive characters in a column. In the following file, C-O appears for both number 1 and number 2, as indicated. I would like to write a new file in which only the C-O for number 1 is present. The same rule should apply throughout the file, for example between numbers 2 and 3 (keep number 2), between numbers 3 and 4 (keep number 3), and so on.
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
This is what I have so far, hope my logic is semi-clear. I'm still learning and any commentary is greatly appreciated!
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'data.txt';
open my $fh, '<', $file or die "Can't read $file: $!";
while (my $line = <fh>) {
    chomp $line;
    my @column = split(/\t/,$line);
    if ($column[1] =~ s/COCO/\s+/g) {
        print "@columns\n";
    }
}

You could maybe do it all at once. Read the whole file into a string.
Then put it through this regex.
# s/(?m)(^\h+(\d+)\h+C.*\s+^\h+\2\h+O.*\n)\s*^\h+(?!\2)(\d+)\h+C.*\s+^\h+\3\h+O.*\n(?!\s*\z)/$1/g
(?xm-)
# C-O in the bottom of a segment
( # (1 start), Keep this
^ \h+ # new line
( \d+ ) # (2), col 1 number
\h+ C .* \s+ # C
^ \h+ # next line
\2 \h+ O .* \n # \2 .. O
) # (1 end)
# Throw this away
# C-O in the top of next segment
\s*
^ \h+ # new line
(?! \2 ) # Not \2
( \d+ ) # (3), col 1 num
\h+ C .* \s+ # C
^ \h+ # next line
\3 \h+ O .* \n # \3 .. O
(?! \s* \z ) # Not the last in file
Perl code:
use strict;
use warnings;
undef $/;    # slurp mode, so <DATA> returns the whole input as one string
my $input = <DATA>;
print "Input:\n$input\n";
$input =~
s/(?xm-)
# C-O in the bottom of a segment
( # (1 start), Keep this
^ \h+ # new line
( \d+ ) # (2), col 1 number
\h+ C .* \s+ # C
^ \h+ # next line
\2 \h+ O .* \n # \2 .. O
) # (1 end)
# Throw this away
# C-O in the top of next segment
\s*
^ \h+ # new line
(?! \2 ) # Not \2
( \d+ ) # (3), col 1 num
\h+ C .* \s+ # C
^ \h+ # next line
\3 \h+ O .* \n # \3 .. O
(?! \s* \z ) # Not the last in file
/$1/g;
print "Output:\n$input\n";
__DATA__
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Code output:
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
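
The same rule can also be written without a regex. The Python sketch below is a line-based approximation of it, not a translation of the pattern: it assumes whitespace-separated columns (number, atom, value), the helper names are made up, and it leaves out the pattern's final check that the dropped pair is not the last thing in the file.

```python
def is_co_pair(rows, i):
    """True if rows[i], rows[i+1] are a C line then an O line with the same number."""
    return (i + 1 < len(rows)
            and rows[i][1] == "C" and rows[i + 1][1] == "O"
            and rows[i][0] == rows[i + 1][0])

def drop_repeated_co(lines):
    """Drop a C-O pair that directly follows a kept C-O pair of another number."""
    rows = [line.split() for line in lines if line.strip()]
    kept, i, just_dropped = [], 0, False
    while i < len(rows):
        if (not just_dropped and i >= 2
                and is_co_pair(rows, i - 2) and is_co_pair(rows, i)
                and rows[i][0] != rows[i - 2][0]):
            i += 2               # skip the C-O pair that opens the next segment
            just_dropped = True
            continue
        kept.append(rows[i])
        i += 1
        just_dropped = False
    return [" ".join(r) for r in kept]

data = """1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581""".splitlines()

print("\n".join(drop_repeated_co(data)))
```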

How to match the main subgroup in a regular expression?

Having this string:
"example( other(1), 123, [25]).othermethod(456)"
How can I capture only the arguments of the main functions:
"other(1), 123, [25]" and "456"
I am trying this:
http://regex101.com/r/cR0uS9/2
The same applies to an HTML example. Given this:
<div>
<div>
<div>12</div>
<div>34</div>
</div>
</div>
<div>56</div>
I want to get:
<div>
<div>12</div>
<div>34</div>
</div>
and 56 as second match.

Here's a pattern that doesn't use recursion:
\w+\s*\((?P<parameters>(?:(?:(?:[^()]*\([^()]*\))+|[^()]*)(?:,(?!\s*\))|(?=\))))*)\)
Caveats:
Does not support more than 2 levels of nested braces. e.g.
a(b(c()))
Strings containing ( or ) will trip it up. e.g.
a(")")
You'll find the parameters in the group called "parameters".
Demo.
Explanation:
\w+ # function name
\s* # white space
\(
(?P<parameters> # parameters:
(?:
# two possibilities: 1: a simple parameter, like "12", "'hello'", or "3*1+2"
# 2: the parameter contains braces.
# we'll try to consume pairs of braces. If that fails, we'll simply match a parameter.
(?:
(?: # match a pair of braces ()
[^()]*
\(
[^()]*
\)
)+ # consume as many pairs of braces as possible. Make sure there's at least one, though, because we can't go matching nothing.
|
[^()]* # since there are no more (pairs of) braces, simply consume the function's parameters.
)
# next, either consume a "," or assert there's a ")"
(?:
,
(?! # make sure there is another parameter after the comma
\s*
\)
)
|
(?=
\)
)
)
)*
)
\)
P.S.: I haven't managed to come up with an acceptable pattern for the HTML example yet.
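
Since the pattern uses the (?P<name>...) syntax, it can be tried directly with Python's re module. A minimal sketch on the string from the question; the surrounding whitespace is stripped because the pattern keeps the space after the opening parenthesis:

```python
import re

# The non-recursive pattern from above, unchanged.
PARAMS_RE = re.compile(
    r'\w+\s*\((?P<parameters>(?:(?:(?:[^()]*\([^()]*\))+|[^()]*)'
    r'(?:,(?!\s*\))|(?=\))))*)\)'
)

text = "example( other(1), 123, [25]).othermethod(456)"

params = [m.group("parameters").strip() for m in PARAMS_RE.finditer(text)]
print(params)
```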

This does some recursion. Use it in a global find function.
# '~(?is)(?:([a-z]\w*)\s*\(((?&core)|)\))(?(DEFINE)(?<core>(?>(?&content)|(?:[a-z]\w*\s*\(|\()(?:(?=.)(?&core)|)\))+)(?<content>(?>(?![a-z]\w*\s*\(|[()]).)+))~'
(?xis-)
(?:
( [a-z] \w* ) # (1), Start-Delimiter, Function
\s* \(
( # (2), CORE
(?&core)
|
)
\) # End-Delimiter, close paren
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
(?: # Start-Delimiter
[a-z] \w* \s* \( # Function
| \( # Or, a open paren
)
(?:
(?= . )
(?&core) # Recurse core
|
)
\) # End-Delimiter, close paren
)+
)
# content
(?<content>
(?>
(?!
[a-z] \w* \s* \(
| [()]
)
.
)+
)
)
Output:
** Grp 0 - ( pos 0 , len 29 )
example( other(1), 123, [25])
** Grp 1 - ( pos 0 , len 7 )
example
** Grp 2 - ( pos 8 , len 20 )
other(1), 123, [25]
** Grp 3 - NULL
** Grp 4 - NULL
-----------------------
** Grp 0 - ( pos 30 , len 16 )
othermethod(456)
** Grp 1 - ( pos 30 , len 11 )
othermethod
** Grp 2 - ( pos 42 , len 3 )
456
** Grp 3 - NULL
** Grp 4 - NULL
For the html div -
# '~(?s)(?:<div>((?&core)|)</div>)(?(DEFINE)(?<core>(?>(?&content)|<div>(?:(?=.)(?&core)|)</div>)+)(?<content>(?>(?!</?div>).)+))~'
(?xs-)
(?:
<div> # Start-Delimiter <div>
( # (1), CORE
(?&core)
|
)
</div> # End-Delimiter </div>
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
<div> # Start-Delimiter <div>
(?:
(?= . )
(?&core) # Recurse core
|
)
</div> # End-Delimiter </div>
)+
)
# content
(?<content>
(?>
(?! </?div> )
.
)+
)
)
Output:
** Grp 0 - ( pos 0 , len 82 )
<div>
<div>
<div>12</div>
<div>34</div>
</div>
</div>
** Grp 1 - ( pos 5 , len 71 )
<div>
<div>12</div>
<div>34</div>
</div>
** Grp 2 - NULL
** Grp 3 - NULL
---------------------------
** Grp 0 - ( pos 84 , len 13 )
<div>56</div>
** Grp 1 - ( pos 89 , len 2 )
56
** Grp 2 - NULL
** Grp 3 - NULL
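
Where a recursive regex engine is not available, the HTML case can be approached with a parser that tracks nesting depth instead. The sketch below swaps the regex for Python's standard html.parser module; the class and function names are made up, and it assumes well-formed markup whose div tags carry no attributes that need special handling.

```python
from html.parser import HTMLParser

class TopLevelDivs(HTMLParser):
    """Collect the inner markup of every outermost <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <div> elements are currently open
        self.parts = []      # pieces of the top-level div being collected
        self.contents = []   # inner markup of each finished top-level div

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # The outermost div just closed: store what was inside it.
                self.contents.append("".join(self.parts))
                self.parts = []
                return
        if self.depth > 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

def top_level_div_contents(html):
    parser = TopLevelDivs()
    parser.feed(html)
    parser.close()
    return parser.contents

html = "<div><div><div>12</div><div>34</div></div></div><div>56</div>"
print(top_level_div_contents(html))
```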
