Bash regular expression for "not space, comma, not space"

I have a file like this:
a,b,c,"hello, hi",d
I want the field separator to be not space, comma, not space.
Currently I have
cat file | awk 'BEGIN { FS = "[^ ],[^ ]" } ; { print $4 }'
which should give "hello, hi" but it returns nothing. I'm quite new to this regular expression thing so any help would be appreciated.

Eh, no, it should not give "hello, hi". The separator regex consumes one character on each side of the comma, so what actually happens is:
a,b,c,"hello, hi",d
^^^                  first field separator (a,b); $1, the text before it, is empty
   ^                 $2 is a comma
    ^^^              second field separator (c,")
       ^^^^^^^^^     $3 is hello, hi
                ^^^  third field separator (",d)
So after the third field separator the line is empty, and $4 is empty as well. You can verify this behaviour with
aaa,baa,caa,"hello, hi",daa
as the input file.
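To see how the regex separator eats one character on each side of every matching comma, the fields can be printed one per line. This is only a minimal sketch; the helper name show_fields is made up for illustration.

```shell
# Print every field that awk produces for the separator regex "[^ ],[^ ]".
# show_fields is a hypothetical helper name, not part of the question.
show_fields() {
    awk 'BEGIN { FS = "[^ ],[^ ]" }
         { for (i = 1; i <= NF; i++) printf "field %d: [%s]\n", i, $i }'
}

echo 'a,b,c,"hello, hi",d' | show_fields
echo 'aaa,baa,caa,"hello, hi",daa' | show_fields
```

For the first input this shows four fields, of which the first and the last are empty; for the second input, field 4 is hello, hi without the quotes, because the quotes are part of the separators.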

If you work with CSV files regularly, consider installing csvtool; then you can simply say:
echo 'a,b,c,"hello, hi",d' | csvtool col 4 -
and it will spit out
"hello, hi"

You can also use sed:
$ sed 's/.*\("[^"]*"\).*/\1/' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
or grep:
$ grep -o '"[^"]*"' <<< 'a,b,c,"hello, hi",d'
"hello, hi"

One solution is to define the field content instead of the field separator. This needs gawk, because FPAT is a GNU extension that standard awk does not have (on many Linux systems awk is gawk, but not on all of them).
echo 'a,b,c,"hello, hi",d' \
| gawk '
# define the field content with FPAT:
# either a run of non-comma characters or a double-quoted string
BEGIN{ FPAT = "[^,]*|\"[^\"]*\"" }
# for showing each field
{for (i=1;i<=NF;i++) printf( "field %d: %s\n", i, $i)}
'
field 1: a
field 2: b
field 3: c
field 4: "hello, hi"
field 5: d
Regex matching here is leftmost-longest. At a double quote both alternatives can match, but the quoted alternative "..,..." is longer than the unquoted prefix "..", so the full quoted string is taken as one field instead of being split at the comma inside it.
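The leftmost-longest rule can be checked without FPAT, using the POSIX match() function with the same alternation. A small sketch, with a made-up helper name:

```shell
# Report where the alternation matches and how long the match is.
# longest_at_start is a hypothetical helper name.
longest_at_start() {
    awk '{ if (match($0, /[^,]*|"[^"]*"/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
}

echo '"hello, hi",d' | longest_at_start
```

The match starts at position 1 and is 11 characters long, that is the whole quoted string, although the first alternative alone would have stopped after the 6 characters "hello.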

Related

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If you can consider a GNU awk solution, then using RS and RT may give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers, I think I now have working, robust, and (maybe) efficient code, tested on AIX/Solaris/FreeBSD/macOS/Linux:
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by the regex engine of awk though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS = '^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead induced by assigning $0 in the while loop (thanks @EdMorton for pointing that out).
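The difference between the two ways of handing a string to awk can be seen in isolation. The sketch below only prints the string as awk receives it; both helper names are made up.

```shell
# Show what awk sees when the same string arrives via -v and via ARGV.
# via_v and via_argv are hypothetical helper names.
via_v()    { awk -v s="$1" 'BEGIN { print s }'; }
via_argv() { awk 'BEGIN { print ARGV[1] }' "$1"; }

via_v    'a\\b'   # escape sequences are processed in -v assignments
via_argv 'a\\b'   # the operand is delivered untouched
```

The first call prints a\b (one backslash), the second prints a\\b (both backslashes), which is why the regex is taken from ARGV[1] in the function above.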
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction for a regex that can match zero characters. Consider the following simple example. Let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives the output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. Here the former part is true, but the latter is false, because the first match found is the zero-length one before the first character (RSTART is set to 1, RLENGTH to 0). The body of the while loop therefore never runs and nothing is printed.
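The two behaviours can be compared side by side. This is a sketch under the assumption that the regex contains no backslashes (it is passed with -v for brevity); the function names are made up.

```shell
# extract_stop_on_empty uses the same stop condition as the question: it gives
# up at the first zero-length match. extract_skip_empty advances by one
# character instead. Both function names are hypothetical.
extract_stop_on_empty() {
    awk -v ere="$1" '{
        s = $0
        while (match(s, ere) && RLENGTH > 0) {
            print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

extract_skip_empty() {
    awk -v ere="$1" '{
        s = $0
        while (s != "" && match(s, ere)) {
            if (RLENGTH > 0) print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        }
    }'
}

echo 1A2A3 | extract_stop_on_empty 'A*'   # prints nothing
echo 1A2A3 | extract_skip_empty 'A*'      # prints A twice, one per line
```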

How can I group unknown (but repeated) words to create an index?

I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
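For reference, the error most likely comes from the shell rather than from grep: an unquoted < is an input redirection, its target .* expands to several file names (hence "ambiguous redirect"), and the unquoted > would redirect the output into index.txt. Quoting the pattern avoids all of that. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); the helper name is made up and it reads standard input:

```shell
# Print line number and bracketed word for every <...> occurrence.
# find_bracketed is a hypothetical helper name.
find_bracketed() {
    grep -on '<[^<>]*>'
}

printf '%s\n' 'a <big> dog' 'no match here' 'the <sun> and the <but>' | find_bracketed
```

Each match is printed on its own line, prefixed with the number of the input line it came from.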
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
You could try the following. The order in which for (key in name) visits the keys is unspecified, so if you need sorted output, pipe the result through sort.
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Creating array named name indexed by $2; each new $1 is appended to the value already stored for that index, separated by OFS.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
Something like this might be what you need, it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
.
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1) # continue right after the closing >, so adjacent <a><b> are both found
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
.
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (Plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '
{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

Replace special characters except the following ,.@

I'm looking for an option to remove special characters from a file except for the following 3 items ,.@
The following awk command gets close but it removes all punctuation.
awk '{gsub(/[[:punct:]]/,"",except(".","@",","))}1' test.csv > test2.csv
Any ideas...
There are no opposite character classes in POSIX and no lookarounds to restrict a more generic pattern with some exceptions. With a single regex, the only way is to spell out the members of the POSIX character class that should still be removed.
According to the "Character Classes and Bracket Expressions" section of the GNU grep manual:
‘[:punct:]’
Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
You may use
/[!-+\/:-?[-`{-~-]/
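Legend (the ranges rely on ASCII order, so this assumes the C locale or an ASCII-compatible one):
!-+ - the chars from ! to +, that is ! " # $ % & ' ( ) * +
\/ - a slash
:-? - the chars from : to ?, that is : ; < = > ?
[-` - the chars from [ to `, that is [ \ ] ^ _ `
{-~ - the chars from { to ~, that is { | } ~
- - a hyphen (literal because it is the last char in the bracket expression)
The comma, the dot and @ are the only punctuation chars left out.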
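A different technique, not the one used in the answer above, is to list what should be kept and delete everything else with tr. This is a sketch only: it assumes that, besides the three exceptions, only letters, digits and whitespace should survive, so control characters and other bytes outside those classes are deleted too. The helper name is made up.

```shell
# Delete every character that is not alphanumeric, whitespace, or one of , . @
# keep_basic_punct is a hypothetical helper name.
keep_basic_punct() {
    tr -cd '[:alnum:][:space:],.@'
}

echo 'a.b?c#d@e,f' | keep_basic_punct
```

This prints a.bcd@e,f; the newline survives because it belongs to [:space:].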
All 3 of these approaches work in any locale, work for any character class just by changing the class name, and also work for other bracket expressions, strings, etc.:
1) Just look for any punct but only change it if it's not one of the chars you don't want changed:
$ echo 'a.b?c#d@e,f' |
awk '{
new = ""
while ( match($0,/[[:punct:]]/) ) {
chr = substr($0,RSTART,1)
new = new substr($0,1,RSTART-1) (chr ~ /[,.@]/ ? chr : "")
$0 = substr($0,RSTART+RLENGTH)
}
print new $0
}'
a.bcd@e,f
2) Turn the chars you don't want changed into other strings first then turn them back afterwards:
$ echo 'a.b?c#d@e,f' |
awk '{
gsub(/a/,"aA"); gsub(/,/,"aB"); gsub(/\./,"aC"); gsub(/@/,"aD")
gsub(/[[:punct:]]/,"")
gsub(/aD/,"@"); gsub(/aC/,"."); gsub(/aB/,","); gsub(/aA/,"a")
print
}'
a.bcd@e,f
Changing a into aA first guarantees that every a in the input is followed by A, so the strings aB, aC and aD created when converting the comma, dot and @ cannot already exist elsewhere in the input. That is why you can safely convert them back afterwards.
3) Suffix the puncts with the RS value, then remove the RS suffix from the chars you don't want changed, then remove the remaining RS-suffixed puncts:
$ echo 'a.b?c#d@e,f' |
awk '{
gsub(/[[:punct:]]/,"&"RS)
$0 = gensub("([,.@])"RS,"\\1","g")
gsub("[[:punct:]]"RS,"")
print
}'
a.bcd@e,f
That one uses GNU awk for gensub(); with other awks you'd need match()+substr().

Dynamically generated regex for gsub not working

I have an input CSV file:
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
I need to create a string out of this as follows:
;5,1,3,5;6,2,6,7;7,4
So each character, except the first which is the value of the field $2, in the substring in between the ; denotes the row number of middle field; for example ;5,1,3,5 means that 5 is at row number 1,3,5.
I've been trying to use awk with gsub, trying to create the string MYSTR dynamically.
The regex inside the gsub is not working. I need a regex that will match ;$3 (the value of $3, which can be a two digit number) and replace it with ;$3,RowNO, if the pattern is not matched then add ;$3 at the end of the string.
This is what I have so far:
awk -F',' '{
print NR, $3;
noofchars=gsub(/;$3/,";"$3","NR,MYSTR);
print noofchars;
if ( noofchars == 1 )
;
else
MYSTR=MYSTR";"$3","NR;
print NR, $3;
print MYSTR;
}
END{print MYSTR;}' $1
The regex doesn't work because $3 isn't interpreted as the field #3 value but is seen as the anchor $ (that matches the end of the line) and a literal 3.
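If a dynamic regex is wanted after all, it can be built by string concatenation instead of a /.../ literal. The sketch below contrasts the two forms; the sample string s and the helper name are made up, and field $2 is used as the key because that is the field the desired output is grouped by.

```shell
# Compare a regex literal containing "$2" with a regex built from the field value.
# literal_vs_dynamic is a hypothetical helper name; s is a made-up sample string.
literal_vs_dynamic() {
    awk -F, '{
        s = ";5,9"
        literal = (s ~ /;$2/)       # "$" is taken as part of the regex here, not as field 2
        dynamic = (s ~ (";" $2))    # concatenation builds the regex ";5" at run time
        print literal, dynamic
    }'
}

echo '1,5,1' | literal_vs_dynamic
```

This prints 0 1: the literal form never matches, while the concatenated form does.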
You can do it without gsub:
awk -F, '{a[$2]=a[$2]","NR}END{for (i in a){printf(";%d%s",i,a[i])}}'
Input
$ cat file
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
Output
$ awk -F, '{gsub(/[ ]+/,"",$3);a[$2] = ($2 in a ? a[$2]:$2) FS $3 }END{for(i in a)printf("%s%s",";",a[i]); print ""}' file
;5,1,3,5;6,2,6,7;7,4
Better Readable version
awk -F, '
{
gsub(/[ ]+/,"",$3); # suppress space char in third field
a[$2] = ($2 in a ? a[$2]:$2) FS $3 # array a where index being field2 and value will be field3, if index exists before append string with existing value
}
END{
for(i in a) # loop through array a and print values
printf("%s%s",";",a[i]);
print ""
}
' file
@vsshekhar: Try the following too. It prints the values in the same order in which they ($2) appear in Input_file.
awk -F, '{A[++i]=$2;B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR} END{for(j=1;j<=i;j++){if(B[A[j]]){printf(";%s,%s",A[j],B[A[j]]);delete B[A[j]]}};print ""}' Input_file
Adding a non-one liner form of solution too now.
awk -F, '{
A[++i]=$2;
B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR
}
END{
for(j=1;j<=i;j++){
if(B[A[j]]){
printf(";%s,%s",A[j],B[A[j]]);
delete B[A[j]]
}
};
print ""
}
' Input_file

How can I get the hostname from a file using awk and regex or substring

The file name is in a format like this:
YYYY-MM-DD_hostname_something.log
I want to get the hostname from the filename. The hostname can be any length, but always has a _ before and after. This is my current awk statement. It worked fine until the hostname length changed. Now I can't use it anymore.
awk 'BEGIN { OFS = "," } FNR == 1 { d = substr(FILENAME, 1, 10) } { h = substr(FILENAME, 12, 10) } $2 ~ /^[AP]M$/ && $3 != "CPU" { print d, $1 "" $2, h, $4+$5, $6, $7+$8+$9}' *_something.log > myfile.log
echo 'YYYY-MM-DD_hostname_something.log' | awk -F"_" '{print $2}'
Output:
hostname
I suppose your hostname contains no _.
$ ls YYYY-MM-DD_hostname_something.log | cut -d _ -f 2
hostname
The cut(1) utility is POSIX and accepts the -d _ option to specify a delimiter and -f 2 to specify the second field. It has got a few more nifty options that you can read about in its fine manual page.
Since you mentioned that you need to modify your awk code, replace your substr calls with split:
split(FILENAME,a,"_");date = a[1];host = a[2]
This splits the FILENAME value into the array a, using _ as the separator.
a[1] will contain the date.
a[2] will contain the hostname.
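If the hostname itself may contain a _, splitting on _ is no longer enough. One possible sketch strips the date prefix and the last part of the name instead; it assumes that the date and the final _something.log part contain no underscore. The helper name and the sample file names are made up.

```shell
# Strip the leading "date_" and the trailing "_something.log" from a file name.
# host_from_name is a hypothetical helper; it assumes the date and the final
# part contain no underscore, while the hostname may.
host_from_name() {
    awk -v name="$1" 'BEGIN {
        host = name
        sub(/^[^_]*_/, "", host)   # drop everything up to the first underscore
        sub(/_[^_]*$/, "", host)   # drop everything from the last underscore on
        print host
    }'
}

host_from_name '2020-01-02_web01_something.log'
host_from_name '2020-01-02_db_primary_something.log'
```

The two calls print web01 and db_primary. Inside the original awk command the same two sub() calls could be applied to a copy of FILENAME in the FNR == 1 block.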