How to pass a parameter to a regex inside an awk command

I'm having trouble passing a parameter to a regex inside an awk command. What seems to be the problem here? Does the regex read the parameter name instead of the value? Thanks
FILE=*some file here*
TEST_STRING1=test
awk -v testString1="$TEST_STRING1" 'BEGIN {
}
{
##Sample REGEX HERE
if ( $0 ~ "^testString1.* - \[.*\] - .*$") {
##DO SOMETHING HERE
}
}
END{}
' $FILE

You need to use awk string concatenation:
if ( $0 ~ "^" testString1 ".* - \\[.*\\] - .*$" ) {
Or, do the variable substitution in the shell -- the quoting is a bit tricky
awk -v regex="^${TEST_STRING1}"'.* - \\[.*\\] - .*$'
Then, in awk
if ($0 ~ regex) ...
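Putting it together, here's a minimal runnable sketch of both approaches; the sample log line and the printed labels are made up for illustration (note the \\[ doubling, since escapes in a dynamic regex string are processed twice):
TEST_STRING1=test
printf '%s\n' 'test42 - [INFO] - something happened' |
awk -v testString1="$TEST_STRING1" '
$0 ~ "^" testString1 ".* - \\[.*\\] - .*$" { print "concat: " $0 }'
printf '%s\n' 'test42 - [INFO] - something happened' |
awk -v regex="^${TEST_STRING1}"'.* - \\[.*\\] - .*$' '
$0 ~ regex { print "shell: " $0 }'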

Related

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$#"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
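A quick way to see that difference (the grep behavior shown is GNU's; POSIX leaves a stray \t in an ERE undefined):
printf 'a\tb\n' | awk '$0 ~ "\t" { print "awk: matched a tab" }'
printf 'a\tb\n' | grep -E '\t' || echo 'grep: no match'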
If you can consider a gnu-awk solution, then using RS and RT may give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT and the same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers, I think I now have a working, robust, and (maybe) efficient version:
tested on AIX/Solaris/FreeBSD/macOS/Linux
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$#"
}
regextract "$#"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by awk's regex engine, though (thanks @dan for pointing that out).
Because assigning to $0 resets the values of all fields, I chose FS = '^$' to limit that overhead (see the sketch below).
Copying $0 into a separate variable nullifies the overhead induced by assigning to $0 in the while loop (thanks @EdMorton for pointing that out).
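Here is the sketch mentioned above: with the default FS, every assignment to $0 re-splits the record into fields, whereas FS='^$' is a regex that never matches inside a record, so the record stays a single field:
echo 'a b c' | awk '{ $0 = $0; print NF }'            # prints 3
echo 'a b c' | awk -v FS='^$' '{ $0 = $0; print NF }' # prints 1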
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction for a regex that might match zero characters. Consider the following simple example; let file.txt's content be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
Your while condition is match($0,ERE) && RLENGTH > 0; here the former part evaluates to true, but the latter to false, because the match found is the zero-length one before the first character (RSTART was set to 1), so the body of the while loop executes zero times.
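You can see this with a quick sketch of the original loop against that file; the very first match is the zero-length one before '1', so the && RLENGTH > 0 test stops the loop immediately:
printf '1A2A3\n' |
awk -v ERE='A*' '{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}'
# prints nothing, while grep -Eo 'A*' file.txt prints A twice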

Removing multiple delimiters between outside delimiters on each line

Using awk or sed in a bash script, I need to remove comma-separated delimiters located between an inner and outer delimiter. The problem is that the wrong values end up in the wrong columns, where only 3 columns are desired.
For example, I want to turn this:
2020/11/04,Test Account,569.00
2020/11/05,Test,Account,250.00
2020/11/05,More,Test,Accounts,225.00
Into this:
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
I've tried a few things, testing regexes, but I cannot find a solution that selects only the commas that need removing.
awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
Using awk, print the first comma-delimited field, then loop through the remaining fields up to the last but two, printing each followed by a space. Finally, print the last-but-one field, a comma, and the last field.
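Running this against the sample file should produce:
$ awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00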
With GNU awk for the 3rd arg to match():
$ awk -v OFS=, '{
match($0,/([^,]*),(.*),([^,]*)/,a)
gsub(/,/," ",a[2])
print a[1], a[2], a[3]
}' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
or with any awk:
$ awk '
BEGIN { FS=OFS="," }
{
n = split($0,a)
gsub(/^[^,]*,|,[^,]*$/,"")
gsub(/,/," ")
print a[1], $0, a[n]
}
' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
Use this Perl one-liner:
perl -F',' -lane 'print join ",", $F[0], "@F[1 .. ($#F-1)]", $F[-1];' in.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
$F[0] : first element of the array @F (= first comma-delimited value).
$F[-1] : last element of @F.
@F[1 .. ($#F-1)] : elements of @F between the second from the start and the second from the end, inclusive.
"@F[1 .. ($#F-1)]" : the above elements, joined on blanks into a string.
join ",", ... : join the LIST "..." on a comma, and return the resulting string.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perl -pe 's{,\K.*(?=,)}{$& =~ y/,/ /r}e' file
sed -e ':a' -e 's/\(,[^,]*\),\([^,]*,\)/\1 \2/; t a' file
awk '{$1=$1","; $NF=","$NF; gsub(/ *, */,","); print}' FS=, file
awk '{for (i=2; i<=NF; ++i) $i=(i>2 && i<NF ? " " : ",") $i} 1' FS=, OFS= file
awk doesn't support lookarounds, but we can get the same effect using awk's match() function. With that, you could try the following, written and tested with the shown samples in GNU awk.
awk '
match($0,/,.*,/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/," ",val)
print substr($0,1,RSTART) val substr($0,RSTART+RLENGTH-1)
}
' Input_file
Yet another perl
$ perl -pe 's/(?:^[^,]*,|,[^,]*$)(*SKIP)(*F)|,/ /g' ip.txt
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
(?:^[^,]*,|,[^,]*$) matches the first/last field along with the comma character
(*SKIP)(*F) prevents the text matched by the preceding pattern from being modified
|, provides , as the alternate pattern to be matched for modification
With sed (assuming \n is supported by the implementation; otherwise, you'll have to find a character that cannot be present in the input):
sed -E 's/,/\n/; s/,([^,]*)$/\n\1/; y/,/ /; y/\n/,/'
s/,/\n/; s/,([^,]*)$/\n\1/ replace first and last comma with newline character
y/,/ / replace all comma with space
y/\n/,/ change newlines back to comma
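Traced on one sample line (with \n shown literally), the pipeline goes:
2020/11/05,More,Test,Accounts,225.00    # input
2020/11/05\nMore,Test,Accounts,225.00   # after s/,/\n/
2020/11/05\nMore,Test,Accounts\n225.00  # after s/,([^,]*)$/\n\1/
2020/11/05\nMore Test Accounts\n225.00  # after y/,/ /
2020/11/05,More Test Accounts,225.00    # after y/\n/,/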
A similar answer to Timur's, in awk
awk '
BEGIN { FS = OFS = "," }
function join(start, stop, sep, str, i) {
str = $start
for (i = start + 1; i <= stop; i++) {
str = str sep $i
}
return str
}
{ print $1, join(2, NF-1, " "), $NF }
' file.csv
It's a shame awk doesn't ship with a built-in join function.

How to use sed to extract numbers from a comma separated string?

I managed to extract the following response and comma-separate it. It's a comma-separated string, and I'm only interested in the comma-separated values of the ACCOUNT_IDs. How do you pattern match using sed?
Input: ACCOUNT_ID,711111111119,ENVIRONMENT,dev,ACCOUNT_ID,111111111115,dev
Expected Output: 711111111119, 111111111115
My $input variable stores the input
I tried the below, but it joins all the numbers together; I would like to preserve the commas.
echo $input | sed -e "s/[^0-9]//g"
I think you're better served with awk:
awk -v FS=, '{for(i=1;i<=NF;i++)if($i~/[0-9]/){printf sep $i;sep=","}}'
If you really want sed, you can go for
sed -e "s/[^0-9]/,/g" -e "s/,,*/,/g" -e "s/^,\|,$//g"
$ awk '
BEGIN {
FS = OFS = ","
}
{
c = 0
for (i = 1; i <= NF; i++) {
if ($i == "ACCOUNT_ID") {
printf "%s%s", (c++ ? OFS : ""), $(i + 1)
}
}
print ""
}' file
711111111119,111111111115

How to replace second pattern(dot) after pattern(comma) in bash

How do I replace the second dot after a comma?
This is the closest I could get:
echo '0.592922148,0.821504176,1.174.129.731' | xargs -d ',' -n1 echo | sed 's/\([^\.]*\.[^\.]*\)\./\1/' | sed 's/\([^\.]*\.[^\.]*\)\./\1/'
Output:
0.592922148
0.821504176
1.174129731
Expected output:
0.592922148,0.821504176,1.174129731
You may use
sed -e ':a' -e 's/\(\.[^.,]*\)\./\1/' -e 't a'
Demo:
s='0.592922148,0.821504176,1.174.129.731'
sed -e ':a' -e 's/\(\.[^.,]*\)\./\1/' -e 't a' <<< "$s"
Details
:a - label a
s/\(\.[^.,]*\)\./\1/ - finds and captures into Group 1 a dot, then any 0+ chars other than dot and comma, and then just matches a dot, and replaces this match with the value in Group 1 (thus, removing the second matched dot)
t a - if there was a successful replacement, branches back to the a label to try again.
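On the third value of the sample string, each pass of the loop strips one extra dot until the substitution fails and t a falls through:
1.174.129.731 -> 1.174129.731 -> 1.174129731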
While I think the sed solution is your best choice, since you have tagged your question with both sed and awk, an awk solution is fairly straightforward as well using split() and basic string concatenation (just not nearly as short). For example, you could do:
awk -v OFS=, -F, '{
for (i=1; i<=NF; i++) {
n=split ($i, a,".")
if (n > 2) {
s=a[1] "." a[2]
for (j=3; j<=n; j++)
s = s a[j]
$i=s
}
}
}1'
Here you define the field separator and output field separator as ','. Then, looping over each field, check the return of split(), which splits the field on '.' into array a. If the resulting number of elements is greater than 2, put the first two elements back together, restoring the first '.' in the number, then simply concatenate the remaining elements and reassign the field. The 1 at the end is the default "print record", printing the updated record.
Example Use/Output
$ echo '0.592922148,0.821504176,1.174.129.731' |
> awk -v OFS=, -F, '{
> for (i=1; i<=NF; i++) {
> n=split ($i, a,".")
> if (n > 2) {
> s=a[1] "." a[2]
> for(j=3;j<=n;j++)
> s = s a[j]
> $i=s
> }
> }
> }1'
0.592922148,0.821504176,1.174129731
Could you please try the following.
echo '0.592922148,0.821504176,1.174.129.731' |
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
ind=index($i,".")
if(ind){
val1=substr($i,1,ind)
val2=substr($i,ind+1)
gsub(/\./,"",val2)
$i=val1 val2
}
}
val1=val2=""
}
1'
Explanation of the above code:
echo '0.592922148,0.821504176,1.174.129.731' | ##Printing values as per OP mentioned and using pipe to send its output as standard input for awk command.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="," ##Setting FS and OFS as comma for each line of Input_file here.
} ##Closing BEGIN BLOCK here.
{
for(i=1;i<=NF;i++){ ##Starting a for loop to traverse through the fields of the line.
ind=index($i,".") ##Checking index of DOT in current field and saving it into ind variable.
if(ind){ ##Checking that variable ind is non-zero, i.e. the field contains a DOT.
val1=substr($i,1,ind) ##Creating variable val1 from sub-string in current field from 1 to ind value.
val2=substr($i,ind+1) ##Creating variable val2 from sub-string in current field from ind+1 value to till complete length of current field.
gsub(/\./,"",val2) ##Globally substituting DOTs with NULL in val2 variable.
$i=val1 val2 ##Re-creating the current field from the values of val1 and val2.
} ##Closing BLOCK for if condition.
} ##Closing BLOCK for for loop.
val1=val2="" ##Nullifying val1 and val2 variables here.
} ##Closing main code BLOCK here.
1' ##Mentioning 1 will print edited/non-edited line.
An awk version:
echo '0.592922148,0.821504176,1.174.129.731' | awk -F, '{for (i=1;i<=NF;i++) {sub(/\./,"#",$i);gsub(/\./,"",$i);sub(/#/,".",$i);print $i}}'
0.592922148
0.821504176
1.174129731
It splits the line into multiple fields on ,. Then it replaces the first . with #, removes the remaining dots, turns # back into . and prints the field.
Edit
awk -F, '{a=""; for (i=1;i<=NF;i++) {sub(/\./,"#",$i);gsub(/\./,"",$i);sub(/#/,".",$i);a=a (i==1?"":",")$i} print a}' file
0.592922148,0.821504176,1.174129731

SED regex find (and remove) option from a command text

I have a config file with param=option[,option...]. Using standard bash utilities, perhaps with the help of sed, I need to remove one option from the list.
#
param=aa,bb,cc
param=aa,bb
param=bb,cc
param=bb
#
In this example, I want to remove 'bb' (and the separator) from all lines; in the last case, because 'bb' was the sole option, remove the complete line, so the final result will be
#
param=aa,cc
param=aa
param=cc
#
Option 'bb' can be alone, or at the start, middle, or end of the list. Obviously, 'bb' embedded in another option name (i.e. xxbb, bbxx, etc.) should not be considered.
edit: fix typo, addn'l example
Here is a sed version to remove the bb parameter from any position and delete the line if bb is the only parameter:
First the input file:
#
param=aa,bb,cc
param=aa,bb
param=bb,cc
param=bb
#
Now run this sed:
sed -E '/^param=/{/=bb$/d; s/,bb(,|$)/\1/; s/=bb,/=/;}' file
This will give:
#
param=aa,cc
param=aa
param=cc
#
To use inline editing use:
sed -i.bak -E '/^param=/{/=bb$/d; s/,bb(,|$)/\1/; s/=bb,/=/;}' file
Note: The solutions below do not address updating the input file; a simple (though not fully robust) approach is to use
awk '...' file > file.$$ && mv file.$$ file
A POSIX-compliant awk solution that should work robustly:
awk -F'=' '
$1 != "param" { print; next }
{
sub(/,bb,/, ",", $2)
sub(/(^|,)bb$/, "", $2)
sub(/^bb,/, "", $2)
if ($2 != "") print $1 FS $2
}
' file
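Run against the sample file, this should print:
#
param=aa,cc
param=aa
param=cc
#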
GNU awk allows for a simpler solution, using its (nonstandard) gensub() function:
awk -F'=' '
$1 != "param" { print; next }
{
newList = gensub(/(^|,)bb(,|$)/, "\\2", 1, $2)
sub(/^,/, "", newList)
if (newList != "") print $1 FS newList
}
' file
A (POSIX-compliant) field-based alternative (more verbose, but perhaps easier to generalize):
awk -F'=' '
$1 != "param" { print; next }
{
n = split($2, opts, ","); optList = ""
for (i=1; i<=n; ++i) {
if (opts[i] != "bb") {
optList = optList (optList == "" ? "" : ",") opts[i]
}
}
if (optList != "") print $1 FS optList
}
' file
Let's say your Input_file is as follows:
param=aa,bb,cc
param=aa,bb
param=bb
Then the following code:
awk -F"=" '$2=="bb"{next} {sub(/,bb/,"");print}' Input_file
outputs:
param=aa,cc
param=aa
I'd use a temporary format to be able to find the occurrences more easily, and to remove lines I would suggest using grep:
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d'
the s/=/=,/ converts it to:
param=,aa,bb,cc
param=,aa,bb
param=,bb
then s/$/,/ to:
param=,aa,bb,cc,
param=,aa,bb,
param=,bb,
then s/,bb,/,/
param=,aa,cc,
param=,aa,
param=,
and s/=,/=/;s/,$// will remove the commas at the beginning and end again
removing empty options can be done with grep -v '=$', or some more advanced sed magic (so it can still be used with sed -i)
EDIT:
the "sed magic" is just appending '/=$/d'
Tested this one, and it works fine:
sed -i 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d' filename
or
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d' filename_in > filename_out