How to represent more than one part of awk sub or gsub's matched string.
For a regexpr like "##code", if I want to insert a word between "##" and "code", I would want a way like VSCode's syntax in witch $1 represent the first part and $2 represent the second part
sub(/(##)(code)/, "$1before$2", str)
from awk's user manual, I found that awk use & to represent the whole matched string。 How can I represent one,two or more part in the matched string like VSCode.
sub(regexp, replacement [, target])
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).
The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.
This function is peculiar because target is not simply used to compute a value, and not just any expression will do—it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0.48 For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line. Here is another example:
The user manual's link is here
Your best option is to use GNU awk for either of these:
$ awk '{$0=gensub(/(##)(code)/,"\\1before\\2",1)} 1' <<<'##code'
##beforecode
$ awk 'match($0,/(##)(code)/,a){$0=a[1] "before" a[2]} 1' <<<'##code'
##beforecode
The first one only lets you move text segments around while the 2nd lets you call functions, perform math ops or do anything else on the matching text before moving it around in the original or doing anything else with it:
$ awk 'match($0,/(##)(code)/,a){$0=length(a[1])*10 "before" toupper(a[2])} 1' <<<'##code'
20beforeCODE
After thinking about this for a bit, I don't know how to get the desired behavior in any reasonable way using just POSIX awk constructs. Here's something I tried (the matches() function):
$ cat tst.awk
BEGIN {
str = "foobar"
re = "(f.*o)(b.*r)"
printf "\nre \"%s\" matching string \"%s\"\n", re, str
print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
print "succ: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
str = "foofoo"
re = "(f.*o)(f.*o)"
printf "\nre \"%s\" matching string \"%s\"\n", re, str
print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
print "fail: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
}
function matches(str,re,arr, start,tgt,n,i,segs) {
delete arr
if ( start=match(str,re) ) {
tgt = substr($0,RSTART,RLENGTH)
n = split(re,segs,/[)(]+/) - 1
for (i=1; RSTART && (i < n); i++) {
if ( match(str,segs[i+1]) ) {
arr[i] = substr(str,RSTART,RLENGTH)
str = substr(str,RSTART+RLENGTH)
}
}
}
return start
}
.
$ awk -f tst.awk
re "(f.*o)(b.*r)" matching string "foobar"
succ: gensub(): <foo> <bar>
succ: match(): <foo> <bar>
succ: matches(): <foo> <bar>
re "(f.*o)(f.*o)" matching string "foofoo"
succ: gensub(): <foo> <foo>
succ: match(): <foo> <foo>
fail: matches(): <foofoo> <>
but of course that doesn't work for the 2nd case as the first RE segment of f.*o matches the whole string foofoo and of course the same thing happens if you try to take the RE segments in reverse. I also considered getting the RE segments like above but then build up a new string one char at a time from the string passed in and compare the first RE segment to THAT until it matches as THAT would be the shortest matching string to the RE segment BUT that would fail for a string+RE like:
str='foooobar'
re='(f.*o)(b.*r)'
since f.*o would match foo with that alorigthm when it really needs to match fooooo.
So - I guess you'd need to keep iterating (being careful of what direction you iterate in - from the end is correct I expect) till you get the string split up into segments that each match every RE segment in a left-most-longest fashion. Seems like a lot of work!
When you use GNU awk, you can use gensub for this purpose. Without gensub for any generic awk it becomes a bit more tedious. The procedure could be something like this:
ere="(ere1)(ere2)"
match(str,ere)
tmp=substr(str,RSTART,RLENGTH)
match(tmp,"ere1"); part1=substr(tmp,RSTART,RLENGTH)
part2=substr(tmp,RLENGTH)
sub(ere,part1 "before" part2,str)
The problem with this is that it will not always work and you have to engineer it a bit. A simple fail can be created due to the greedyness of the ERE":
str="foocode"
ere="(f.*o)(code)"
match(str,ere) # finds "foocode"
tmp=substr(str,RSTART,RLENGTH) # tmp <: "foocode"
match(tmp,"(f.*o)"); # greedy "fooco"
part1=substr(tmp,RSTART,RLENGTH) # part1 <: "fooco"
part2=substr(tmp,RLENGTH) # part2 <: "de"
sub(ere,part1 "before" part2,str) # :> "foocobeforede
My String is : PI Last Name equal to one of
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN','ALVAREZ','AMARYAN','AMBESI-IMPIOMBATO','AMEGBETO','AMOWITZ', 'ANAGNOSTARAS','ANAND','ANDERSEN','ANDERSON', 'ANDRADE','ANDREEFF','ANDROPHY','ANGER','ANHOLT','ANTHONY','ANTLE','ANTONELLI','ANTONY', 'ANZULOVICH', 'APODACA','APOSHIAN','APPEL','APPLEBY','APRIL','ARAUJO','ARBIB','ARBOLEDA', 'ARCHAKOV','ARCHER', 'ARECHAVALETA-VELASCO','ARENS','ARGON','ARGYROKASTRITIS', 'ARIAS','ARIZAGA','ARMSTRONG','ARNON', 'ARSHAVSKY','ARVIN','ASATRYAN','ASCOLI','ASKENASE','ASSI','ATALAY','ATANASOVA','ATKINSON','ATTYGALLE','ATWEH','AU','AVETISYAN','AWE','AYOUB','AZAD','BACSO','BAGASRA','BAKER','BALAS', 'BALCAZAR','BALK','BALKAY','BALLOU','BALRAJ','BALSTER','BANERJEE','BANKOLE','BANTA','BARAL','BARANOWSKA','BARBAS', 'BARBER','BARILLAS-MURY','BARKHOLT','BARNES','BARNETT','BARRETT','BARRIA','BARROW','BARROWS','BARTKE','BARTLETT','BASSINGTHWAIGHTE','BASSIOUNY','BASU','BATES','BATTAGLIA','BATTERMAN','BAUER','BAUERLE','BAUM','BAUME', 'BAUMLER','BAVISTER','BAWA','BAYNE','BEASLEY','BEATTY','BEATY','BEBENEK','BECK','BECKER','BECKMAN','BECKMAN-SUURKULA' ,'BEDFORD','BEDOLLA','BEEBE','BEEMON','BEHETS','BEHRMAN','BEIER','BEKKER','BELL','BELLIDO','BELMAIN', 'BENATAR','BENBENISHTY','BENBROOK','BENDER','BENEDETTI','BENNETT','BENNISH','BENZ','BERG','BERGER','BERGEY','BERGGREN','BERK','BERKOWITZ','BERLIN','BERLINER','BERMAN','BERTINO','BERTOZZI','BERTRAND','BERWICK','BETHONY','BEYERS','BEYRER' ,'BEZPROZVANNY','BHAGWAT','BHANDARI','BHARGAVA','BHARUCHA','BHUJWALLA','BIANCO','BIDLACK','BIELERT','BIER','BIESSMANN','BIGELOW' ,'BILLER','BILLINGS','BINDER','BINDMAN','BINUTU','BIRBECK','BIRGE','BIRNBAUM','BIRO','BIRT','BISHAI','BISHOP','BISSELL','BJORKEGREN','BJORNSTAD','BLACK','BLANCHARD','BLASS','BLATTNER','BLIGNAUT','BLOCH','BLOCK','BLOOM','BLOOM,','BLUM','BLUMBERG' ,'BLUMENTHAL','BLYUKHER','BODDULURI','BOFFETTA','BOGOLIUBOVA', 'BOLLINGER','BOLLS','BOMSZTYK','BONANNO','BONNER','BOOM','BOOTHROYD','BOPPANA','BORAWSKI','BORG','BORIS-LAWRIE','BORISY','BORLONGAN','BORNSTEIN','BORODOVSKY','BORST','BOS','BOTO','BOWDEN','BOWEN','BOYCE-JACINO','BRADEN','BRADY' ,'BRAITHWAITE','BRANN','BRASH','BRAUNSTEIN', 'BREMAN','BRENNAN','BRENNER','BRETSCHER','BREW','BREYSSE','BRIGGS','BRITES','BRITT','BRITTENHAM','BRODIE','BRODY','BROOK','BROOTEN','BROSCO','BROSNAN','BROWN','BROWNE','BRUCKNER','BRUNENGRABER','BRYL','BRYSON','BU','BUCHAN','BUDD','BUDNIK', 'BUEKENS','BUKRINSKY','BULLMORE','BULUN','BURBANO','BURGENER','BURGESS','BURKS','BURMEISTER','BURNETT','BURNHAM','BURNS','BURRIDGE','BURTON','BUSCIGLIO','BUSHEK','BUSIJA','BUZSAKI','BZYMEK','CABA')
I need to have a regex which will greedily looks for up to 150 characters with a last character being a ','. And then replace the last ',' of the 150 with a <br />
Any suggestions pls?
I used this ','(?=[^()]*\)) but this one replaces all the occurences. I want the 150th ones to be replaced.
Thanks everyone for your suggestions. I managed to do it with Java code instead of regex.
StringBuilder sb = new StringBuilder(html);
int i = 0;
while ((i = sb.indexOf("','", i + 150)) != -1) {
int j = sb.lastIndexOf("','", i + 150);
sb.insert(i+1, "<BR>");
}
return sb.toString();
However, this breaks at the first encounter of ',' in the 150 chars.
Can anyone help modify my code to incorporate the break at the last occurence of ',' withing the 150 chars.
You'll want something like this:
Look for every occurrence of \([^)]+*,[^)]+*\) (Find a parenthesis-wrapped string with a comma in it and then run the following regular expression on each of the matched elements:
(.{135,150}[^,]*?),
The first number is the minimum number of characters you want to match before you add a break tag -- the second is the maximum number of characters you would like to match before inserting a break tag. If there is no , between the characters in question then the regular expression will continue to consume characters until it finds a comma.
You could probably do it like this:
regex ~ /(^.{1,14}),/
replacement ~ '\1<replacement' or "$1<insert your text>"
In Perl:
$target = ','x 22;
$target =~ s/(^ .{1,14}) , /$1<15th comma>/x;
print $target;
Output
,,,,,,,,,,,,,,<15th comma>,,,,,,,
Edit: As an alternative, if you want to break the string up into succesive 150 or less
you could do it this way:
regex ~ /(.{1,150},)/sg
replacement ~ '\1<br/>' or "$1<br\/>"
// That is a regex of type global (/g) and include newlines (/s)
In Perl:
$target = "
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN', ... )
";
if ($target =~ s/( .{1,150} , )/$1<br\/>/sxg) {
print $target;
}
Output:
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI',<br/>'ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA',<br/>'AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE',<br/>'ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN',<br/> ... )
I've got a bunch of strings like:
"Hello, here's a test colon:. Here's a test semi-colon;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom
The big advantage of using a regex is to deal with the tricky cases like & Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits and if you have 3 digits, the first one will be a 1. Printable ASCII is after all -~. For that reason, you could consider ?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.
This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END
* Repaired SNOBOL4 Solution
* & -> &
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end
I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
s = "Hello, here's a test colon:. Here's a test semi-colon;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.
The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello, here's a test colon:. Here's a test semi-colon; Further test A. abc.~.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
my $x=$_[0];
if ( ($x >= 32) && ($x <= 126) )
{
return sprintf( "%c", $x );
}
else
{
return "&#" . $x . ";" ;
}
}
while (<>)
{
s/&#(1?\d\d);/&translate($1)/ge;
print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
Here's another Perl's one-liner (see #mrree's answer):
a test file:
$ cat ent.txt
Hello, here's a test colon:.
Here's a test semi-colon; ''
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello, here's a test colon:.
Here's a test semi-colon; ''
boost::spirit parser generator framework allows easily to create a parser that transforms desirable NCRs.
// spirit_ncr2a.cpp
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
using namespace BOOST_SPIRIT_CLASSIC_NS;
std::string line;
while (std::getline(std::cin, line)) {
assert(parse(line.begin(), line.end(),
// match "&#(\d+);" where 32 <= $1 <= 126 or any char
*(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
| anychar_p[&putchar])).full);
putchar('\n');
}
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello, here's a test colon:." | spirit_ncr2a
output:
"Hello, here's a test colon:."
I did think I was pretty good at regex but I have never seen lambdas been used in regex, please enlighten me!
I'm currently using python and would have solved it with this oneliner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split('&#(\d+);',THESTRING)])
Does that make any sense?
Here's a NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello, here's a test colon:.
> Here's a test semi-colon; ''
> ; <-- may be recursive" \
> | ncr2a
It prints for non-recursive version:
Hello, here's a test colon:.
Here's a test semi-colon; ''
; <-- may be recursive
And the recursive one produces:
Hello, here's a test colon:.
Here's a test semi-colon; ''
; <-- may be recursive
This is one of those cases where the original problem statement apparently isn't very complete, it seems, but if you really want to only trigger on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in-range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case (e.g. ;#131;#58; produces ";#131;:" as well:
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end
Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
const int subs[] = {-1, 1}; // non-match & subexpr
boost::sregex_token_iterator end;
std::string line;
while (std::getline(std::cin, line)) {
boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
if (isncr) { // convert NCR e.g., ':' -> ':'
const int d = boost::lexical_cast<int>(*tok);
assert(32 <= d && d < 127);
std::cout << static_cast<char>(d);
}
else
std::cout << *tok; // output as is
}
std::cout << '\n';
}
}