Trying to parse string values into an array after a pattern match - regex

I have the following lines in a text file:
<Entry>
<Key argument="ComputerNames"/>
<Value type="string" argument="localhost,localhost,engine1,engine2"/></Entry>
<Entry>
<Key argument="BranchIDMultiple"/>
<Value type="int" argument="1"/></Entry>
I know how to find the line that has ComputerNames. I know how to read the next line as well.
I need to parse the line as follows where the number of arguments can be dynamic. Parse output should be:
@result = $result[0]=localhost, $result[1]=localhost, $result[2]=engine1, $result[3]=engine2.
There must be at least one argument, but there can be more.
I'm not able to construct the right regex to accomplish the split. Any ideas?

Let's say input contains the XML line in question.
Since you mentioned that you already know how to extract this line, I've left that part to you.
Once you have the line, capture the attribute value with the following regex and split it on commas:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String regex = "argument=\"([a-zA-Z0-9,]*)\"";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
String[] op = new String[0];
if (matcher.find())
{
    // group(1) is the text between the quotes, without the argument=" prefix
    op = matcher.group(1).split(",");
}
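For comparison, the same capture-then-split idea can be sketched in Python. This is only an illustration: the helper name parse_arguments is made up here, and it assumes the attribute value never contains a double quote. Because it captures the whole attribute value before splitting, it also handles a line with a single argument.

```python
import re

def parse_arguments(line):
    """Return the comma-separated values of the argument="..." attribute.

    parse_arguments is a hypothetical helper name, not from the original post.
    """
    m = re.search(r'argument="([^"]*)"', line)
    if m is None:
        return []          # no argument attribute on this line
    return m.group(1).split(",")

line = '<Value type="string" argument="localhost,localhost,engine1,engine2"/></Entry>'
result = parse_arguments(line)
print(result)
```

With the sample line from the question, result[0] through result[3] hold localhost, localhost, engine1 and engine2.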

Ok, Here is what I have:
--- After many different trials, I was finally able to get something that works. See below:
BEGIN { require 5.8.0; }
use strict;
use warnings;
# string to test regular expressions
my $test_string = '<Value type="string" argument="400teets,localhost,localhost,engine1,engine2,engine50,engine100,100afdasfdas"/></Entry>';
# print out the initial string
print "The initial string is: $test_string\n\n";
# first set of arguments - all words that have a comma after them
my @first_words = ($test_string =~ /(\w+),/g);
# print first set of arguments
print "\nFirst set of arguments found\n";
foreach my $word (@first_words) {
print "$word\n";
}
# second set of arguments - all words that have a comma before them
my @last_words = ($test_string =~ /,(\w+)/g);
#print second set of arguments
print "\nSecond set of arguments found\n";
foreach my $word (@last_words) {
print "$word\n";
}
#merge the sets by popping the last element off of last_words array and pushing it into the first_words array
push(@first_words,pop(@last_words));
#print the results
print "\nMerged Sets\n";
foreach my $word (@first_words) {
print "$word\n";
}
# END OF PROGRAM
--- Really, if you exclude all of the print statements and comments, all you really need is these three lines:
my @first_words = ($test_string =~ /(\w+),/g);
my @last_words = ($test_string =~ /,(\w+)/g);
push(@first_words,pop(@last_words));
--- Here is the output:
The initial string is:
First set of arguments found
400teets
localhost
localhost
engine1
engine2
engine50
engine100
Second set of arguments found
localhost
localhost
engine1
engine2
engine50
engine100
100afdasfdas
Merged Sets
400teets
localhost
localhost
engine1
engine2
engine50
engine100
100afdasfdas


How to represent many parts of awk sub/gsub's matched string

How can I refer to more than one part of the string matched by awk's sub or gsub?
For a regexp like "##code", if I want to insert a word between "##" and "code", I would like something like VSCode's syntax, in which $1 represents the first part and $2 represents the second part:
sub(/(##)(code)/, "$1before$2", str)
From awk's user manual, I found that awk uses & to represent the whole matched string. How can I represent one, two, or more parts of the matched string, as in VSCode?
sub(regexp, replacement [, target])
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).
The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.
This function is peculiar because target is not simply used to compute a value, and not just any expression will do—it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0. For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line. Here is another example:
The user manual's link is here
Your best option is to use GNU awk for either of these:
$ awk '{$0=gensub(/(##)(code)/,"\\1before\\2",1)} 1' <<<'##code'
##beforecode
$ awk 'match($0,/(##)(code)/,a){$0=a[1] "before" a[2]} 1' <<<'##code'
##beforecode
The first one only lets you move text segments around while the 2nd lets you call functions, perform math ops or do anything else on the matching text before moving it around in the original or doing anything else with it:
$ awk 'match($0,/(##)(code)/,a){$0=length(a[1])*10 "before" toupper(a[2])} 1' <<<'##code'
20beforeCODE
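As a point of comparison only (this is not awk), the same two styles can be sketched in Python's re module: backreferences in the replacement string, which play the role of gensub's \\1 and \\2, and a replacement function, which plays the role of match() with an array. The function name transform is made up for this sketch.

```python
import re

s = "##code"

# Style 1: backreferences in the replacement string, like gensub's \\1 and \\2.
first = re.sub(r"(##)(code)", r"\1before\2", s, count=1)
print(first)   # ##beforecode

# Style 2: a replacement function that can compute on each captured group,
# like using the array filled in by GNU awk's three-argument match().
def transform(m):
    # transform is a hypothetical name used only for this illustration
    return str(len(m.group(1)) * 10) + "before" + m.group(2).upper()

second = re.sub(r"(##)(code)", transform, s, count=1)
print(second)  # 20beforeCODE
```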
After thinking about this for a bit, I don't know how to get the desired behavior in any reasonable way using just POSIX awk constructs. Here's something I tried (the matches() function):
$ cat tst.awk
BEGIN {
str = "foobar"
re = "(f.*o)(b.*r)"
printf "\nre \"%s\" matching string \"%s\"\n", re, str
print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
print "succ: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
str = "foofoo"
re = "(f.*o)(f.*o)"
printf "\nre \"%s\" matching string \"%s\"\n", re, str
print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
print "fail: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
}
function matches(str,re,arr, start,tgt,n,i,segs) {
delete arr
if ( start=match(str,re) ) {
tgt = substr($0,RSTART,RLENGTH)
n = split(re,segs,/[)(]+/) - 1
for (i=1; RSTART && (i < n); i++) {
if ( match(str,segs[i+1]) ) {
arr[i] = substr(str,RSTART,RLENGTH)
str = substr(str,RSTART+RLENGTH)
}
}
}
return start
}
$ awk -f tst.awk
re "(f.*o)(b.*r)" matching string "foobar"
succ: gensub(): <foo> <bar>
succ: match(): <foo> <bar>
succ: matches(): <foo> <bar>
re "(f.*o)(f.*o)" matching string "foofoo"
succ: gensub(): <foo> <foo>
succ: match(): <foo> <foo>
fail: matches(): <foofoo> <>
But of course that doesn't work for the 2nd case, as the first RE segment f.*o matches the whole string foofoo, and the same thing happens if you try to take the RE segments in reverse. I also considered getting the RE segments as above, then building up a new string one char at a time from the string passed in and comparing the first RE segment to that until it matches, since that would be the shortest string matching the RE segment. But that would fail for a string+RE like:
str='foooobar'
re='(f.*o)(b.*r)'
since f.*o would match foo with that algorithm when it really needs to match foooo.
So I guess you'd need to keep iterating (being careful about which direction you iterate in; from the end is correct, I expect) until you get the string split up into segments that each match their RE segment in a leftmost-longest fashion. Seems like a lot of work!
When you use GNU awk, you can use gensub for this purpose. Without gensub for any generic awk it becomes a bit more tedious. The procedure could be something like this:
ere="(ere1)(ere2)"
match(str,ere)
tmp=substr(str,RSTART,RLENGTH)
match(tmp,"ere1"); part1=substr(tmp,RSTART,RLENGTH)
part2=substr(tmp,RSTART+RLENGTH)
sub(ere,part1 "before" part2,str)
The problem with this is that it will not always work, so you have to engineer it a bit. A simple failure can be created through the greediness of the ERE:
str="foocode"
ere="(f.*o)(code)"
match(str,ere) # finds "foocode"
tmp=substr(str,RSTART,RLENGTH) # tmp <: "foocode"
match(tmp,"(f.*o)"); # greedy "fooco"
part1=substr(tmp,RSTART,RLENGTH) # part1 <: "fooco"
part2=substr(tmp,RSTART+RLENGTH) # part2 <: "de"
sub(ere,part1 "before" part2,str) # :> "foocobeforede"

perl regular expression match scalar plus punctuation

I have scalars (columns in a table) that hold one or two email addresses separated by a comma, such as 'Joek@xyznco.com, jrancher@candyco.us' or 'jsmith@wellingent.com,mjones@wellingent.com'. For several of these records I need to remove a bad/old email address and the trailing comma (if one exists).
If jsmith@wellingent.com is no longer valid, how do I remove that address and the trailing comma?
This only removes the address but leaves the comma.
my $general_email = 'jsmith@wellingent.com,mjones@wellingent.com';
my $bad_addr = 'jsmith@wellingent.com';
$general_email =~ s/$bad_addr//;
Thanks for any help.
You may be better off without a regex but with list splitting:
use strict;
use warnings;
sub remove_bad {
my ($full, $bad) = @_;
my @emails = split /\s*,\s*/, $full; # split at comma, allowing for spaces around the comma
my @filtered = grep { $_ ne $bad } @emails;
return join ",", @filtered;
}
print 'First: ' , remove_bad('me@example.org, you@example.org', 'me@example.org'), "\n";
print 'Last: ', remove_bad('me@example.org, you@example.org', 'you@example.org'), "\n";
print 'Middle: ', remove_bad('me@example.org, you@example.org, other@eample.org', 'you@example.org'), "\n";
First, split the email address list at the commas, creating an array. Filter it with grep to remove the bad address, then join the remaining elements back into a string.
The above code prints:
First: you@example.org
Last: me@example.org
Middle: me@example.org,other@eample.org
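The same split, filter, join steps can be sketched in Python for comparison. This is a rough equivalent under the same assumptions as the Perl version (comma-separated addresses with optional surrounding spaces, exact string comparison); the function name simply mirrors the one above.

```python
import re

def remove_bad(full, bad):
    """Drop every address equal to `bad` from a comma-separated list."""
    # Split on commas, tolerating whitespace around them, and skip empty pieces.
    emails = [e for e in re.split(r"\s*,\s*", full.strip()) if e]
    # Keep everything except the bad address, then rebuild the string.
    return ",".join(e for e in emails if e != bad)

print(remove_bad("jsmith@wellingent.com,mjones@wellingent.com",
                 "jsmith@wellingent.com"))
```

Because the list is rebuilt with join, no stray leading or trailing comma is left behind, whichever position the bad address was in.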

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|
TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.
The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.
An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'
/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665
You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my @first_array = split(/\|/,$input); #splitting $input on "|"
#split drops trailing empty fields by default, so the final "|"
#does not produce an extra element; the grep for defined values
#below is only a safeguard.
@first_array = grep{defined}@first_array;
#Also filter out elements that do not have an equals sign appearing.
@first_array = grep{/=/}@first_array;
#Now, put these elements into an associative array:
my %assoc_array;
foreach(@first_array)
{
if(/^([^=]+)=(.+)$/)
{
$assoc_array{$1} = $2;
}
else
{
#Something weird may be happening...
#we may have an element starting with "=" for example.
#Do what you want: throw a warning, die, silently move on, etc.
}
}
if(exists $assoc_array{TrxId})
{
print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
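The modification for repeated keys could look roughly like the following, sketched in Python rather than Perl. The message string is a shortened, made-up variant of the one in the question with TrxId deliberately repeated, and parse_fields is a hypothetical helper name.

```python
from collections import defaultdict

def parse_fields(message):
    """Map each key to the list of all values seen for it, in order."""
    fields = defaultdict(list)
    for item in message.split("|"):
        key, sep, value = item.partition("=")
        if sep and key:            # skip empty pieces and items without "key="
            fields[key].append(value)
    return fields

# Made-up sample input: TrxId appears twice on purpose.
msg = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|TrxId=123|WepNr=|"
fields = parse_fields(msg)
for trx in fields["TrxId"]:
    print("|TrxId=" + trx + "|")
```

Unlike the Perl pattern above, this sketch keeps keys with an empty value (such as WepNr) and stores an empty string for them.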

In Perl, how many groups are in the matched regex?

I would like to tell the difference between a number 1 and string '1'.
The reason that I want to do this is because I want to determine the number of capturing parentheses in a regular expression after a successful match. According the perlop doc, a list (1) is returned when there are no capturing groups in the pattern. So if I get a successful match and a list (1) then I cannot tell if the pattern has no parens or it has one paren and it matched a '1'. I can resolve that ambiguity if there is a difference between number 1 and string '1'.
You can tell how many capturing groups are in the last successful match by using the special @+ array. $#+ is the number of capturing groups. If that's 0, then there were no capturing parentheses.
For example, bitwise operators behave differently for strings and integers:
~1 = 18446744073709551614
~'1' = Î ('1' = 0x31, ~'1' = ~0x31 = 0xce = 'Î')
#!/usr/bin/perl
($b) = ('1' =~ /(1)/);
print isstring($b) ? "string\n" : "int\n";
($b) = ('1' =~ /1/);
print isstring($b) ? "string\n" : "int\n";
sub isstring {
return ($_[0] & ~$_[0]);
}
isstring returns either 0 (as a result of numeric bitwise op) which is false, or "\0" (as a result of bitwise string ops, see perldoc perlop) which is true as it is a non-empty string.
If you want to know the number of capture groups a regex matched, just count them. Don't look at the values they return, which appears to be your problem:
You can get the count by looking at the result of the list assignment, which returns the number of items on the right hand side of the list assignment:
my $count = my @array = $string =~ m/.../g;
If you don't need to keep the capture buffers, assign to an empty list:
my $count = () = $string =~ m/.../g;
Or do it in two steps:
my @array = $string =~ m/.../g;
my $count = @array;
You can also use the @+ or @- variables, using some of the tricks I show in the first pages of Mastering Perl. These arrays have the starting and ending positions of each of the capture buffers. The values in index 0 apply to the entire pattern, the values in index 1 are for $1, and so on. The last index, then, is the total number of capture buffers. See perlvar.
Perl converts between strings and numbers automatically as needed. Internally, it tracks the values separately. You can use Devel::Peek to see this in action:
use Devel::Peek;
$x = 1;
$y = '1';
Dump($x);
Dump($y);
The output is:
SV = IV(0x3073f40) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
SV = PV(0x30698cc) at 0x3073484
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x3079bb4 "1"\0
CUR = 1
LEN = 4
Note that the dump of $x has a value for the IV slot, while the dump of $y doesn't but does have a value in the PV slot. Also note that simply using the values in a different context can trigger stringification or nummification and populate the other slots. e.g. if you did $x . '' or $y + 0 before peeking at the value, you'd get this:
SV = PVIV(0x2b30b74) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x3079c5c "1"\0
CUR = 1
LEN = 4
At which point 1 and '1' are no longer distinguishable at all.
Check for the definedness of $1 after a successful match. The logic goes like this:
If the list is empty then the pattern match failed
Else if $1 is defined then the list contains all the captured substrings
Else the match was successful, but there were no captures
Your question doesn't make a lot of sense, but it appears you want to know the difference between:
$a = "foo";
@f = $a =~ /foo/;
and
$a = "foo1";
@f = $a =~ /foo(1)?/;
Since they both return the same thing regardless if a capture was made.
The answer is: Don't try and use the returned array. Check to see if $1 is not equal to ""
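For readers comparing languages: the distinction the answers draw, counting groups in the pattern rather than inspecting the returned values, can be illustrated in Python, where a compiled pattern exposes its group count directly. This is only an analogy and says nothing about Perl's internals.

```python
import re

no_groups = re.compile(r"1")
one_group = re.compile(r"(1)")

# The number of capturing groups is a property of the pattern itself.
print(no_groups.groups)   # 0
print(one_group.groups)   # 1

# An optional group that did not participate still counts as a group;
# only its captured value is missing.
m = re.compile(r"foo(1)?").match("foo")
print(m.re.groups)        # 1
print(m.group(1))         # None
```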

Regex: optional group

I want to split a string like this:
abc//def//ghi
into a part before and after the first occurrence of //:
a: abc
b: //def//ghi
I'm currently using this regex:
(?<a>.*?)(?<b>//.*)
Which works fine so far.
However, sometimes the // is missing in the source string and obviously the regex fails to match. How is it possible to make the second group optional?
An input like abc should be matched to:
a: abc
b: (empty)
I tried (?<a>.*?)(?<b>//.*)? but that left me with lots of NULL results in Expresso so I guess it's the wrong idea.
Try a ^ at the beginning of your expression to match the beginning of the string and a $ at the end to match the end of the string (this will make the ungreedy match work).
^(?<a>.*?)(?<b>//.*)?$
A proof of Stevo3000's answer (Python):
import re
test_strings = ['abc//def//ghi', 'abc//def', 'abc']
regex = re.compile("(?P<a>.*?)(?P<b>//.*)?$")
for ts in test_strings:
    match = regex.match(ts)
    print('a:', match.group('a'), 'b:', match.group('b'))
a: abc b: //def//ghi
a: abc b: //def
a: abc b: None
Why use group matching at all? Why not just split by "//", either as a regex or a plain string?
use strict;
my $str = 'abc//def//ghi';
my $short = 'abc';
print "The first:\n";
my @groups = split(/\/\//, $str, 2);
foreach my $val (@groups) {
print "$val\n";
}
print "The second:\n";
@groups = split(/\/\//, $short, 2);
foreach my $val (@groups) {
print "$val\n";
}
gives
The first:
abc
def//ghi
The second:
abc
[EDIT: Fixed to return max 2 groups]
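If the second part should keep its leading // as in the question (a: abc, b: //def//ghi), a plain-string split can still do it. Here is a minimal Python sketch using str.partition; split_first is a made-up helper name, and an empty string stands in for the "(empty)" case.

```python
def split_first(s, sep="//"):
    """Split s at the first occurrence of sep, keeping sep on the second part."""
    head, found, tail = s.partition(sep)
    # When sep is absent, found and tail are both empty strings.
    return head, found + tail

for text in ["abc//def//ghi", "abc//def", "abc"]:
    a, b = split_first(text)
    print("a:", a, "b:", b)
```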