Creating matching brackets with awk/sed regex

I have a data set that has three patterns:
First:
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
Second:
inaccurate in:prefix<>accurate:stem
inactive in:prefix<>active:stem
Third:
incommunicable in:prefix<>communicate:stem<>able:suffix
incompatibility in:prefix<>compatible:stem<>ity:suffix
I need to convert the above to the following form, matching the brackets in the style of the Penn Treebank (http://languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf):
First:
abrasion ((abrade:stem) ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
Second:
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
Third:
incommunicable (in:prefix ((communicate:stem)able:suffix))
incompatibility (in:prefix ((compatible:stem)ity:suffix))
The code I am working with uses awk:
{
    n = gsub(/<>/, ")", $2)
    s = sprintf("%*s", n, "")
    gsub(/ /, "(", s)
    print "(" $1, s "((" $2 "))"
}
EDIT
More complex forms
nationalistic national: stem <>ism:suffix<>ist:suffix<>ic:suffix
to:
nationalistic ((((national: stem) ism:suffix)ist:suffix)ic:suffix)
It is not producing the expected outputs shown in the examples above.
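For reference, running the question's script on one line of the first pattern shows the problem (this trace is mine, not from the question): gsub() replaces `<>` with `)` and returns n=1, so s becomes a single "(", and the print then adds two more opening and two closing parentheses around $2, producing an extra, unbalanced layer:

```shell
printf 'abrasion abrade:stem<>ion:suffix\n' |
awk '{
    n = gsub(/<>/, ")", $2)        # $2 -> "abrade:stem)ion:suffix", n = 1
    s = sprintf("%*s", n, "")      # s = " " (one space per "<>")
    gsub(/ /, "(", s)              # s = "("
    print "(" $1, s "((" $2 "))"
}'
# output: (abrasion (((abrade:stem)ion:suffix))
```

Four opening parentheses against three closing ones, instead of the expected ((abrade:stem)ion:suffix).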

This should be general enough as it takes into account :stem, :prefix, and :suffix for matching:
awk 'BEGIN{FS=OFS="\n"}{
    a = gensub(/([a-zA-Z]*):stem/, "(\\1:stem)", "g");
    b = gensub(/(\([a-zA-Z]*:stem\))<>([a-zA-Z]*):suffix/, "(\\1\\2:suffix)", "g", a);
    c = gensub(/([a-zA-Z]*:prefix)<>(.*)/, "(\\1\\2)", "g", b);
    print c;}' testfile
Demo here: https://ideone.com/U3ux91
EDIT
This should take care of multiple suffixes and prefixes:
awk 'BEGIN{FS=OFS="\n"}{
    a = gensub(/([a-zA-Z]*):stem/, "(\\1:stem)", "g");
    while ( a ~ /stem)<>.*:suffix/ ) {
        a = gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/, "(\\1\\2:suffix)", "g", a);
    }
    while ( a ~ /<>/ ) {
        a = gensub(/([a-zA-Z]*?:prefix)<>(.*)/, "(\\1\\2)", "g", a);
    }
    print a;}' test
Demo here: https://ideone.com/U7LYXi
(sorry if antinationalistic is not a word, but it's there for testing's sake)

The expected output for pattern 1 may have a problem: the brackets are not paired. I guess those were typos and it should be:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
I made this awk script:
awk -v d="<>" '{$2="("$2")"}
$1~/^ab/{sub(d,")",$2);$2="(" $2}
$1~/^ina/{sub(d,"(",$2);$2=$2")"}
$1~/^inc/{sub(d,"((",$2);sub(d,")",$2);$2=$2")"}7' file
with your 3-pattern example in the same file (the trailing 7 is just an always-true condition that makes awk print every line), it gives:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
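Since the question's expected output had unpaired brackets, a small checker (a hypothetical helper, not part of either post) can verify that every generated line balances, failing on the first imbalance:

```shell
# Exit nonzero on the first line whose parentheses don't pair up.
awk '{
    bal = 0
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "(") bal++
        else if (c == ")") bal--
        if (bal < 0) { print "unbalanced: " $0; exit 1 }   # ")" before "("
    }
    if (bal != 0) { print "unbalanced: " $0; exit 1 }      # leftover "("
}' output
```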

awk -F'<>| ' -v OFS= '{
$1 = $1 " "
for (i=2; i<=NF; i++) {
if ($i ~ /prefix$/) { $i = "(" $i; $NF = $NF ")" }
if ($i ~ /stem\)?$/) { stem = i; $i = "(" $i ")" }
if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem } }
} { print }'

awk to the rescue!
$ awk 'function wrap(v) {return "("v")"; }
{n=split($2,a,"<>");
if(n==3) w=wrap(a[1] wrap(wrap(a[2]) a[3]));
else if(a[1]~/:prefix/) w=wrap(a[1] wrap(a[2]));
else w=wrap(wrap(a[1]) a[2]);
print $1, w}' stems
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
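The split-and-wrap idea above generalizes beyond three tokens: locate the stem token, then fold each suffix outward to the right and each prefix outward to the left. A sketch (my generalization, not the answerer's code, and it assumes exactly one :stem token per line) that also handles the nationalistic example:

```shell
awk '{
    n = split($2, t, /<>/)                 # tokens: prefix* stem suffix*
    for (s = 1; s <= n; s++)
        if (t[s] ~ /:stem/) break          # find the stem token
    w = "(" t[s] ")"                       # wrap the stem first
    for (i = s + 1; i <= n; i++)
        w = "(" w t[i] ")"                 # fold each suffix rightward
    for (i = s - 1; i >= 1; i--)
        w = "(" t[i] w ")"                 # fold each prefix leftward
    print $1, w
}' stems
```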

backreferencing in awk gensub with conditional branching

I'm referring to the answer to "GNU awk: accessing captured groups in replacement text",
but with the ? quantifier in the regex.
I would like to use an if statement, the ternary operator ?:, or something more elegant, so that if the regex group backreferenced with \\1 matches a nonempty string then one arbitrary string (\\1 not excluded) is inserted, and if it matches an empty string then some other arbitrary string is inserted. My example works when the capturing group returns a nonempty string, but it doesn't take the expected branch "B" when the backreference is empty. How can I do conditional branching based on backreferenced values?
echo abba | awk '{ print gensub(/a(b*)?a/, "\\1"?"A":"B", "g", $0)}'
You can do the assignment in the gensub and use the value with the ternary operator afterwards, something like this:
... | awk '{ v=gensub(/a(b*)?a/, "\\1", "g", $0); print v?"A":"B"}'
Something like this, maybe?:
$ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< abba
A
$ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< acca
B
The expressions in any arguments you pass to any function are evaluated before the function is called, so gensub(/a(b*)?a/, "\\1"?"A":"B", "g", $0) is the same as str=("\\1"?"A":"B"); gensub(/a(b*)?a/, str, "g", $0), which is the same as gensub(/a(b*)?a/, "A", "g", $0).
So you cannot do what you're apparently trying to do with a single call to any function. Nor can you call gsub() twice, once with ab+a and then again with aa or similar, without breaking the left-to-right, leftmost-longest order in which such a replacement function would match the regexp against the input, if it existed.
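A minimal way to see that evaluation order: before gensub() ever runs, "\\1" is just the literal two-character string \1, which is non-empty and therefore always true, so the ternary can only ever pick the "A" branch:

```shell
# The replacement argument is evaluated first; the ternary never sees
# the captured group, only the non-empty string constant "\1":
awk 'BEGIN { print ("\\1" ? "A" : "B") }'
# prints: A
```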
It looks like you might be trying to do the following, using GNU awk for patsplit():
awk '
n = patsplit($0,f,/ab*a/,s) {
    $0 = s[0]
    for ( i=1; i<=n; i++ ) {
        $0 = $0 (f[i] ~ /ab+a/ ? "A" : "B") s[i]
    }
}
1'
or with any awk:
awk '
{
    head = ""
    while ( match($0,/ab*a/) ) {
        str = substr($0,RSTART,RLENGTH)
        head = head substr($0,1,RSTART-1) (str ~ /ab+a/ ? "A" : "B")
        $0 = substr($0,RSTART+RLENGTH)
    }
    $0 = head $0
}
1'
but without sample input/output it's a guess. FWIW given this sample input file:
$ cat file
XabbaXaaXabaX
foo
abbaabba
aabbaabba
bar
abbaaabba
the above will output:
XAXBXAX
foo
AA
BbbBbba
bar
ABbba

How to use sed to extract numbers from a comma separated string?

I managed to extract the following response and comma-separate it. It's a comma-separated string and I'm only interested in the comma-separated values of the ACCOUNT_IDs. How do you pattern-match using sed?
Input: ACCOUNT_ID,711111111119,ENVIRONMENT,dev,ACCOUNT_ID,111111111115,dev
Expected Output: 711111111119, 111111111115
My $input variable stores the input
I tried the below, but it joins all the numbers together; I would like to preserve the comma ',':
echo $input | sed -e "s/[^0-9]//g"
I think you're better served with awk:
awk -v FS=, '{for(i=1;i<=NF;i++)if($i~/[0-9]/){printf sep $i;sep=","}}'
If you really want sed, you can go for
sed -e "s/[^0-9]/,/g" -e "s/,,*/,/g" -e "s/^,\|,$//g"
$ awk '
BEGIN {
    FS = OFS = ","
}
{
    c = 0
    for (i = 1; i <= NF; i++) {
        if ($i == "ACCOUNT_ID") {
            printf "%s%s", (c++ ? OFS : ""), $(i + 1)
        }
    }
    print ""
}' file
711111111119,111111111115
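For completeness, a sketch without sed or awk, assuming GNU-style grep -o and paste are available. Like the sed attempts, this keeps every run of digits, not just the values following ACCOUNT_ID:

```shell
# -o prints each run of digits on its own line; paste joins them with commas
echo "$input" | grep -oE '[0-9]+' | paste -sd, -
# with the sample input this yields: 711111111119,111111111115
```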

I changed repetitive awk expressions into one function

I had 8 awk expressions that only differed by the 2 patterns I was searching for. So I created an awk function to improve my code; however, now it won't work. What I am doing is...
printFmt () {
    awk -v MYPATH="$MYPATH" -v FILE_EXT="$FILE_EXT" -v NAME_OF_FILE="$NAME_OF_FILE" -v DATE="$DATE" -v PATTERN="$1" -v SEARCH="$2" '
    $0 ~ PATTERN {
        rec = $1 OFS $2 OFS $4 OFS $7
        for (i=9; i<=NF; i++) {
            rec = rec OFS $i
            if ($i ~ SEARCH) {
                break
            }
        }
        print rec >> "'$MYPATH''$NAME_OF_FILE''$DATE'.'$FILE_EXT'"
    }
    ' "$FILE_LOCATION"
}
and calling it with printFmt "$STORED_PROCS_FINISHED" "/([01])/". My code was exactly as above except that instead of SEARCH it was /([01])/. Is there something in the syntax that I am missing?
Do this and read the book Effective Awk Programming, 4th Edition, by Arnold Robbins:
printFmt () {
    awk -v regexp1="$1" -v regexp2="$2" '
    $0 ~ regexp1 {
        rec = $1 OFS $2 OFS $4 OFS $7
        for (i=9; i<=NF; i++) {
            rec = rec OFS $i
            if ($i ~ regexp2) {
                break
            }
        }
        print rec
    }
    ' "$FILE_LOCATION" >> "${MYPATH}${NAME_OF_FILE}${DATE}.${FILE_EXT}"
}
printFmt "$STORED_PROCS_FINISHED" "[01]"
Your use of all-caps for variable names is bad - that's for exported shell variables only.
Don't use the word "pattern" as it's ambiguous, and "search" is meaningless - come up with 2 meaningful names for the variables that I named regexp1 and regexp2.
As noted in comments:
You should omit the slashes from the regex passed as a parameter. Passing "([01])" instead of "/([01])/" should work correctly. I'm not convinced the parentheses are necessary either; just "[01]" should work too.
You pass values with -v to the awk script that are not used inside the awk script. You have the shell use those values to create a file name as well. You should either not pass the values to awk, or you should not use the shell to create the file name.
Given these comments, I think your code could be:
printFmt() {
    awk -v PATTERN="$1" -v SEARCH="$2" '
    $0 ~ PATTERN {
        rec = $1 OFS $2 OFS $4 OFS $7
        for (i=9; i<=NF; i++) {
            rec = rec OFS $i
            if ($i ~ SEARCH) {
                break
            }
        }
        print rec
    }
    ' "$FILE_LOCATION" >> "$MYPATH$NAME_OF_FILE$DATE.$FILE_EXT"
}
printFmt "$STORED_PROCS_FINISHED" "[01]"
Unless the constructed file name changes on each invocation of the function, I would create the file name once, outside the function, and use it outside the function:
printFmt() {
    awk -v PATTERN="$1" -v SEARCH="$2" '
    $0 ~ PATTERN {
        rec = $1 OFS $2 OFS $4 OFS $7
        for (i=9; i<=NF; i++) {
            rec = rec OFS $i
            if ($i ~ SEARCH) {
                break
            }
        }
        print rec
    }
    ' "$FILE_LOCATION"
}
OUTFILE="$MYPATH$NAME_OF_FILE$DATE.$FILE_EXT"
printFmt "$STORED_PROCS_FINISHED" "[01]" >> "$OUTFILE"
…7 other calls to printFmt each with I/O redirection…
Or even:
{
printFmt "$STORED_PROCS_FINISHED" "[01]"
…7 other calls to printFmt…
} >> "$OUTFILE"
On the whole, I'd probably pass the file(s) to be scanned as an argument to the function too:
printFmt() {
    pattern="${1:?}"
    search="${2:?}"
    shift 2
    awk -v PATTERN="$pattern" -v SEARCH="$search" '
    $0 ~ PATTERN {
        rec = $1 OFS $2 OFS $4 OFS $7
        for (i=9; i<=NF; i++) {
            rec = rec OFS $i
            if ($i ~ SEARCH) {
                break
            }
        }
        print rec
    }
    ' "$@"   # all the remaining arguments
}
{
printFmt "$STORED_PROCS_FINISHED" "[01]" "$FILE_LOCATION"
…7 other calls to printFmt…
} >> "$OUTFILE"
This gives the most flexibility about where the data comes from and goes to. It allows the function to read its standard input if no file name arguments are supplied. The ${1:?} notation will generate an error if $1 is not set to a non-empty string; it is a crude but effective way of checking that argument 1 (the pattern) was provided to the function. Similarly with the search argument too. The error message won't be wonderfully informative, but any message is probably better than trying to proceed when the values were not provided.
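A tiny illustration of that ${1:?} guard (the function name here is hypothetical, just to show the behavior): expansion succeeds when the argument is a non-empty string, and in a non-interactive shell it prints the message and aborts when the argument is missing or empty.

```shell
require_arg() {
    p="${1:?pattern argument is required}"   # aborts if $1 is unset or empty
    echo "got $p"
}
require_arg '[01]'       # prints: got [01]
# require_arg            # would print the error message and exit the script
```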

awk remove unwanted records and consolidate multiline fields to one line record in specific order

I have an output file that I am trying to process into a formatted csv for our audit team.
I thought I had this mastered until I stumbled across bad data within the output. As such, I want to be able to handle this using awk.
MY OUTPUT FILE EXAMPLE
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
THE OUTPUT I WANT AFTER PROCESSING
joe,bloggs,s01234565;uid=bloggsj,cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=E09876543;cn=andy-peters,ou=appserver,ou=components,o=hoster
As you can see:
we always have a cn= variable that contains o=hoster
uid can have any value
we may have multiple cn= variables without o=hoster
I have achieved the following:
cat output | awk '!/^o.*/ && !/^Enter.*/{print}' | awk '{getline a; getline b; getline c; getline d; print $0,a,b,c,d}' | awk -v srch1="cn=" -v repl1="" -v srch2="sn=" -v repl2="" '{ sub(srch1,repl1,$2); sub(srch2,repl2,$3); print $4";"$2" "$3";"$1 }'
Any pointers or guidance using awk is greatly appreciated. Or should I give up and just use the age-old, long-winded method of a large looping script to process the file?
You may try the following awk code.
$ cat file
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
Awk code:
awk '
function out(){
    print s, u, last
    i = 0; s = ""
}
/^cn/,!NF{
    ++i
    last = i == 1 ? $0 : last
    s = i>1 && !/uid/ && NF ? s ? s "," $NF : $NF : s
    u = /uid/ ? $0 : u
}
i && !NF{
    out()
}
END{
    out()
}
' FS="=" OFS=";" file
Resulting output:
joe,bloggs,S01234565;uid=bloggsj;cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=petersa;cn=andy-peters,ou=appserver,ou=components,o=hoster
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
This awk script works for your sample and produces the sample output:
BEGIN { delete cn[0]; OFS = ";" }
function print_info() {
    if (length(cn)) {
        names = cn[1] "," sn
        for (i=2; i <= length(cn); ++i) names = names "," cn[i]
        print names, uid, dn
        delete cn
    }
}
/^cn=/ {
    if ($0 ~ /o=hoster/) dn = $0
    else {
        cn[length(cn)+1] = substr($0, index($0, "=") + 1)
        uid = $0; sub("cn", "uid", uid)
    }
}
/^sn=/ { sn = substr($0, index($0, "=") + 1) }
/^uid=/ { uid = $0 }
/^$/ { print_info() }
END { print_info() }
This should help you get started.
awk '$1 ~ /^cn/ {
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^uid/) {
            u = $i
            continue
        }
        sub(/^[^=]*=/, x, $i)
        r = length(r) ? r OFS $i : $i
    }
    print r, u, $1
    r = u = x
}' OFS=, RS= infile
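The trailing RS= assignment here switches awk into paragraph mode, where blank-line-separated blocks become single records and, with FS left at its default (or set to newline-ish splitting in paragraph mode), each line becomes a field. A quick illustration, not tied to the LDAP data:

```shell
# RS="" -> records are blank-line-separated paragraphs; FS="\n" -> one field per line
printf 'a\nb\n\nc\nd\n' |
awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 "/" $2 }'
# prints:
# 1: a/b
# 2: c/d
```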
I assume that there is an error in your sample output: in the 3rd record the uid should be petersa and not E09876543.
You might want to look at some of the "already been there and done that" solutions to accomplish the task.
Apache Directory Studio for example, will do the LDAP query and save the file in CSV or XLS format.
-jim

Command line to match lines with matching first field (sed, awk, etc.)

What is a fast and succinct way to match lines from a text file with a matching first field?
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output, alternative:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx 0.5 GB.
There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.
Here's a method where you only have to remember the previous line (it therefore requires the input file to be sorted):
awk -F \| '
$1 == prev_key {print prev_line; matches ++}
$1 != prev_key {
    if (matches) print prev_line
    matches = 0
    prev_key = $1
}
{prev_line = $0}
END { if (matches) print $0 }
' filename
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Alternate output
awk -F \| '
$1 == prev_key {
    if (matches == 0) printf "%s", $1
    printf "%s%s", FS, prev_value
    matches ++
}
$1 != prev_key {
    if (matches) printf "%s%s\n", FS, prev_value
    matches = 0
    prev_key = $1
}
{prev_value = $2}
END {if (matches) printf "%s%s\n", FS, $2}
' filename
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
For fixed-width fields you can use uniq (-w 1 compares only the first character, which works here because the keys are single characters):
$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
If you don't have fixed-width fields, here are two awk solutions:
awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
Using awk:
awk -F '|' '!($1 in a){a[$1]=$2; next} $1 in a{b[$1]=b[$1] FS a[$1] FS $2}
END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor
This might work for you (GNU sed):
sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file
This reads 2 lines into the pattern space, then checks whether the keys in both lines are the same. If so, it removes the second key and repeats. If not, it checks whether more than two fields exist in the first line; if so, it prints and then deletes that line, otherwise it just deletes it.
$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
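All of the streaming answers above assume the input is already grouped by the first field. If it isn't, a sort stage in front keeps the pipeline memory-friendly, since sort spills to temporary files for large inputs. This sketch reuses the last one-liner above; note that sort's last-resort whole-line comparison may reorder lines within a group, so add GNU sort's -s flag if the original order matters:

```shell
sort -t'|' -k1,1 file |
awk -F'|' '
    $1 == prev { rec = rec RS $0; size++; next }
    { if (size > 1) print rec; rec = $0; size = 1; prev = $1 }
    END { if (size > 1) print rec }
'
```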