Replace special characters except the following ,.# - regex

I'm looking for an option to remove special characters from a file except for the following 3 items ,.#
The following awk command gets close but it removes all punctuation.
awk '{gsub(/[[:punct:]]/,"",except(".","#",","))}1' test.csv > test2.csv
Any ideas...

POSIX regular expressions have no character class subtraction and no lookarounds that could restrict a more generic pattern with a few exceptions. The only way is to spell out the characters that the POSIX character class contains.
According to Character Classes and Bracket Expressions:
‘[:punct:]’
Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
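To match every ASCII punctuation character except , . and # you may use
/[!"$-+\/:-@[-`{-~-]/
Legend:
!" : an exclamation mark and a double quote (# comes next and is left out)
$-+ : the characters from $ to +, that is $ % & ' ( ) * +
\/ : a slash (the comma and the dot before it are left out)
:-@ : the characters from : to @, that is : ; < = > ? @
[-` : the characters from [ to `, that is [ \ ] ^ _ `
{-~ : the characters from { to ~, that is { | } ~
- : a literal hyphen, placed last so that it is not read as a range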
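A minimal sketch of the spelled-out approach for the question as stated (keep , . and #). The function name strip_punct is made up for illustration, and the bracket expression is passed as a string so that the slash needs no escaping. It assumes ASCII input and forces the C locale so that the ranges follow ASCII order.

```shell
# strip_punct is a hypothetical helper name, not part of the original answer.
# It deletes ASCII punctuation except , . and # from stdin or the named files.
strip_punct() {
    # LC_ALL=C keeps the ranges in plain ASCII order.
    LC_ALL=C awk '
    BEGIN { punct = "[!\"$-+/:-@[-`{-~-]" }   # all punctuation but , . #
    { gsub(punct, ""); print }
    ' "$@"
}

printf '%s\n' 'a.b?c@d#e,f' | strip_punct    # prints: a.bcd#e,f
```

The trailing hyphen inside the brackets is the literal hyphen; everything else is either a single character or an ASCII range.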

All 3 of these approaches will work in any locale and will work for any character class by just changing the class name and will work for other bracket expressions or strings etc.:
1) Just look for any punct but only change it if it's not one of the chars you don't want changed:
$ echo 'a.b?c@d#e,f' |
awk '{
new = ""
while ( match($0,/[[:punct:]]/) ) {
chr = substr($0,RSTART,1)
new = new substr($0,1,RSTART-1) (chr ~ /[,.#]/ ? chr : "")
$0 = substr($0,RSTART+RLENGTH)
}
print new $0
}'
a.bcd#e,f
2) Turn the chars you don't want changed into other strings first then turn them back afterwards:
$ echo 'a.b?c@d#e,f' |
awk '{
gsub(/a/,"aA"); gsub(/,/,"aB"); gsub(/\./,"aC"); gsub(/#/,"aD")
gsub(/[[:punct:]]/,"")
gsub(/aD/,"#"); gsub(/aC/,"."); gsub(/aB/,","); gsub(/aA/,"a")
print
}'
a.bcd#e,f
Changing a into aA and back is what guarantees that the strings you create when converting the #, etc. are strings that cannot exist elsewhere in the input at that time and that's why you can safely convert them back afterwards.
3) Suffix the puncts with the RS value, then remove the RS suffix from the chars you don't want changed, then change the remaining RS-suffixed puncts:
$ echo 'a.b?c@d#e,f' |
awk '{
gsub(/[[:punct:]]/,"&"RS)
$0 = gensub("([,.#])"RS,"\\1","g")
gsub("[[:punct:]]"RS,"")
print
}'
a.bcd#e,f
That one uses GNU awk for gensub(), with other awks you'd need match()+substr().
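For awks without gensub(), the third approach can be sketched with match() and substr() as the remark above suggests. This is only one possible reading of that remark: the function name keep_some_punct is hypothetical, the punctuation class is spelled out for ASCII instead of using [[:punct:]], and a literal newline stands in for RS (with the default RS a record cannot contain one).

```shell
# keep_some_punct is a hypothetical name; the steps mirror approach 3.
keep_some_punct() {
    LC_ALL=C awk '
    BEGIN { punct = "[!-/:-@[-`{-~]" }      # ASCII punctuation, spelled out
    {
        gsub(punct, "&\n")                   # tag every punct char with a newline
        while (match($0, /[,.#]\n/))         # untag the characters to keep
            $0 = substr($0, 1, RSTART) substr($0, RSTART + 2)
        gsub(punct "\n", "")                 # drop whatever is still tagged
        print
    }' "$@"
}

printf '%s\n' 'a.b?c@d#e,f' | keep_some_punct    # prints: a.bcd#e,f
```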

Related

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If a gnu-awk solution is acceptable, then RS and RT can be used to behave like grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers I think that I have a working, robust, and (maybe) efficient code now:
tested on AIX/Solaris/FreeBSD/macOS/Linux
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by awk's regex engine though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS='^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead induced by assigning $0 in the while loop (thanks @EdMorton for pointing that out).
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction when the regex can match zero characters. Consider the following simple example; let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. Here the first part is true but the second is false, because the match that is found is the zero-length one before the first character (RSTART is set to 1 and RLENGTH to 0), so the body of the loop is never executed.
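The difference can be reproduced with a small side-by-side sketch. Both function names are hypothetical: extract_stop uses the stop condition from the question, while extract_skip steps over one character after a zero-length match and keeps going.

```shell
# Both function names are hypothetical. Each takes an ERE and reads stdin.

# Gives up on the line at the first zero-length match.
extract_stop() {
    awk -v ere="$1" '{
        s = $0
        while (match(s, ere) && RLENGTH > 0) {
            print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

# Skips one character after a zero-length match and continues.
extract_skip() {
    awk -v ere="$1" '{
        s = $0
        while (s != "" && match(s, ere)) {
            if (RLENGTH > 0) {
                print substr(s, RSTART, RLENGTH)
                s = substr(s, RSTART + RLENGTH)
            } else
                s = substr(s, RSTART + 1)
        }
    }'
}

printf '%s\n' '1A2A3' | extract_stop 'A*'    # prints nothing
printf '%s\n' '1A2A3' | extract_skip 'A*'    # prints A twice, one per line
```

Note that the regex is passed with -v here for brevity, so awk processes escape sequences in it.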

Removing multiple delimiters between outside delimiters on each line

Using awk or sed in a bash script, I need to remove the comma delimiters that sit between the first and the last delimiter on each line. The problem is that values end up in the wrong columns, where only 3 columns are desired.
For example, I want to turn this:
2020/11/04,Test Account,569.00
2020/11/05,Test,Account,250.00
2020/11/05,More,Test,Accounts,225.00
Into this:
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
I've tried a few things while testing regexes, but I cannot find a pattern that selects only the commas that should be removed.
awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
Using awk, print the first comma-delimited field, then loop over the remaining fields up to the third from last, printing each one followed by a space. Finally print the second-to-last field, a comma and the last field.
With GNU awk for the 3rd arg to match():
$ awk -v OFS=, '{
match($0,/([^,]*),(.*),([^,]*)/,a)
gsub(/,/," ",a[2])
print a[1], a[2], a[3]
}' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
or with any awk:
$ awk '
BEGIN { FS=OFS="," }
{
n = split($0,a)
gsub(/^[^,]*,|,[^,]*$/,"")
gsub(/,/," ")
print a[1], $0, a[n]
}
' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
Use this Perl one-liner:
perl -F',' -lane 'print join ",", $F[0], "@F[1 .. ($#F-1)]", $F[-1];' in.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
$F[0] : first element of the array @F (= first comma-delimited value).
$F[-1] : last element of @F.
@F[1 .. ($#F-1)] : elements of @F between the second from the start and the second from the end, inclusive.
"@F[1 .. ($#F-1)]" : the above elements, joined on blanks into a string.
join ",", ... : join the LIST "..." on a comma, and return the resulting string.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perl -pe 's{,\K.*(?=,)}{$& =~ y/,/ /r}e' file
sed -e ':a' -e 's/\(,[^,]*\),\([^,]*,\)/\1 \2/; t a' file
awk '{$1=$1","; $NF=","$NF; gsub(/ *, */,","); print}' FS=, file
awk '{for (i=2; i<=NF; ++i) $i=(i>2 && i<NF ? " " : ",") $i} 1' FS=, OFS= file
awk doesn't support lookarounds, but its match() function can stand in for them. Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
match($0,/,.*,/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/," ",val)
print substr($0,1,RSTART) val substr($0,RSTART+RLENGTH-1)
}
' Input_file
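Because the action only runs when match() succeeds, a line with fewer than two commas produces no output at all with this command. A hedged variant that passes such lines through unchanged could look like this; the function name merge_middle is made up for the example.

```shell
# merge_middle is a hypothetical name. Commas between the first and the
# last comma become spaces; lines with fewer than two commas are kept as is.
merge_middle() {
    awk '
    match($0, /,.*,/) {
        mid = substr($0, RSTART + 1, RLENGTH - 2)   # text between the outer commas
        gsub(/,/, " ", mid)
        $0 = substr($0, 1, RSTART) mid substr($0, RSTART + RLENGTH - 1)
    }
    { print }
    ' "$@"
}

printf '%s\n' '2020/11/05,More,Test,Accounts,225.00' 'a,b' | merge_middle
# prints:
# 2020/11/05,More Test Accounts,225.00
# a,b
```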
Yet another perl
$ perl -pe 's/(?:^[^,]*,|,[^,]*$)(*SKIP)(*F)|,/ /g' ip.txt
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
(?:^[^,]*,|,[^,]*$) matches first/last field along with the comma character
(*SKIP)(*F) this would prevent modification of preceding regexp
|, provide , as alternate regexp to be matched for modification
With sed (assuming \n is supported by the implementation, otherwise, you'll have to find a character that cannot be present in the input)
sed -E 's/,/\n/; s/,([^,]*)$/\n\1/; y/,/ /; y/\n/,/'
s/,/\n/; s/,([^,]*)$/\n\1/ replace first and last comma with newline character
y/,/ / replace all comma with space
y/\n/,/ change newlines back to comma
A similar answer to Timur's, in awk
awk '
BEGIN { FS = OFS = "," }
function join(start, stop, sep, str, i) {
str = $start
for (i = start + 1; i <= stop; i++) {
str = str sep $i
}
return str
}
{ print $1, join(2, NF-1, " "), $NF }
' file.csv
It's a shame awk doesn't ship with a join function builtin

How can I output the number of repeats of a pattern in regex?

I would like to output the number of repeats of a pattern with regex. For example, convert "aaad" to "3xad", "bCCCCC" to "b5xC". I want to do this in sed or awk.
I know I can match it with (.)\1+ or even capture it with ((.)\1+). But how can I obtain the number of repetitions and insert that value back into the string in regex or sed or awk?
Perl to the rescue!
perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'
-p reads the input line by line and prints it after processing
s/// is the substitution similar to sed
/e makes the replacement evaluated as code
e.g.
aaadbCCCCCxx -> 3xadb5xC2xx
In GNU awk:
$ echo aaadbCCCCCxx | awk -F '' '{
for(i=1;i<=NF;i+=RLENGTH) {
c=$i
match(substr($0,i),c"+")
b=b (RLENGTH>1?RLENGTH "x":"") c
}
print b
}'
3xadb5xC2xx
If regex metacharacters should be treated as literal characters, as noted in the comments, one could try to detect and escape them (the solution below is only a rough sketch):
$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
for(i=1;i<=NF;i+=RLENGTH) {
c=$i
# print i,c # for debugging
if(c~/[*.\\]/) # if c is a regex metachar (not complete)
c="\\"c # escape it
match(substr($0,i),c"+") # find all c:s
b=b (RLENGTH>1?RLENGTH "x":"") $i # buffer to b
}
print b
}'
3x\2x.2x*3xadb5xC2x+2xx
Just for fun.
With sed it is cumbersome but doable. Note that this example relies on GNU sed:
parse.sed
/(.)\1+/ {
: nextrepetition
/((.)\2+)/ s//\n\1\n/ # delimit the repetition with new-lines
h # and store the delimited version
s/^[^\n]*\n|\n[^\n]*$//g # now remove prefix and suffix
b charcount # count repetitions
: aftercharcount # return here after counting
G # append the new-line delimited version
# Reorganize pattern space to the desired format
s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/
# Run again if more repetitions exist
/(.)\1+/b nextrepetition
}
b
# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount
s/./a/g
# Do the carry. The t's and b's are not necessary,
# but they do speed up the thing
t a
: a; s/aaaaaaaaaa/b/g; t b; b done
: b; s/bbbbbbbbbb/c/g; t c; b done
: c; s/cccccccccc/d/g; t d; b done
: d; s/dddddddddd/e/g; t e; b done
: e; s/eeeeeeeeee/f/g; t f; b done
: f; s/ffffffffff/g/g; t g; b done
: g; s/gggggggggg/h/g; t h; b done
: h; s/hhhhhhhhhh//g
: done
# On the last line, convert back to decimal
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
b aftercharcount
Run it like this:
sed -Ef parse.sed infile
With an infile like this:
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa
The output is:
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
I was hoping we'd have a MCVE by now but we don't so what the heck - here is my best guess at what you're trying to do:
$ cat tst.awk
{
out = ""
for (pos=1; pos<=length($0); pos+=reps) {
char = substr($0,pos,1)
for (reps=1; char == substr($0,pos+reps,1); reps++);
out = out (reps > 1 ? reps "x" : "") char
}
print out
}
$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
The above was run against the sample input that @Thor kindly provided:
$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa
The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
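The tolower() suggestion can be sketched as follows. The function name count_runs_ci is hypothetical, and the sketch keeps the spelling of the first character of each run in the output, which is an assumption about what is wanted.

```shell
# count_runs_ci is a hypothetical name: a case-insensitive variant of tst.awk.
count_runs_ci() {
    awk '{
        out = ""
        for (pos = 1; pos <= length($0); pos += reps) {
            char = substr($0, pos, 1)
            # count how far the run extends, ignoring case
            for (reps = 1; tolower(char) == tolower(substr($0, pos + reps, 1)); reps++);
            out = out (reps > 1 ? reps "x" : "") char
        }
        print out
    }' "$@"
}

printf '%s\n' 'aAad' 'bCcCCC' | count_runs_ci
# prints:
# 3xad
# b5xC
```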

Bash Regular expression for "not space, comma, not space"

I have a file like this:
a,b,c,"hello, hi",d
I want the field separator to be not space, comma, not space.
Currently I have
cat file | awk 'BEGIN { FS = "[^ ],[^ ]" } ; { print $4 }'
which should give "hello, hi" but it returns nothing. I'm quite new to this regular expression thing so any help would be appreciated.
Eh, no, it should not give "hello, hi". What actually happens is:
a,b,c,"hello, hi",d
^^^                  first field separator (a,b); $1, in front of it, is empty
   ^                 $2 is a comma
    ^^^              second field separator (c,")
       ^^^^^^^^^     $3 is hello, hi
                ^^^  third field separator (",d); nothing follows it, so $4 is empty
So after the third field separator, the line is empty. You can verify this behaviour with
aaa,baa,caa,"hello, hi",daa
as input-file.
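A quick way to check the splitting is to print every field together with its number. This is a small sketch; show_fields is a made-up helper name.

```shell
# show_fields is a hypothetical name. It splits stdin with the FS from the
# question and prints each field in square brackets.
show_fields() {
    awk 'BEGIN { FS = "[^ ],[^ ]" }
    { for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

printf '%s\n' 'a,b,c,"hello, hi",d' | show_fields
# prints:
# $1=[]
# $2=[,]
# $3=[hello, hi]
# $4=[]

printf '%s\n' 'aaa,baa,caa,"hello, hi",daa' | show_fields
# prints:
# $1=[aa]
# $2=[a]
# $3=[a]
# $4=[hello, hi]
# $5=[aa]
```

With the longer input each separator still swallows one character on either side of the comma, which is why the fields lose their last and first letters.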
If you work with CSV files regularly, consider installing the csvtool, then you can simply say:
echo 'a,b,c,"hello, hi",d' | csvtool col 4 -
and it will spit out
"hello, hi"
You can also use sed:
>sed 's/.*\("[^"]*"\).*/\1/' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
or grep:
>grep -o '"[^"]*"' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
The solution is to define the field content instead of the field separator. You need gawk because standard awk does not have this feature natively (on many Linux systems, awk is gawk).
echo 'a,b,c,"hello, hi",d' \
| awk '
# define the content with FPAT
# here: any run of non-comma characters, or a double-quoted string
BEGIN{ FPAT = "[^,]*|\"[^\"]*\"" }
# for showing each field
{for (i=1;i<=NF;i++) printf( "field %d: %s\n", i, $i)}
'
field 1: a
field 2: b
field 3: c
field 4: "hello, hi"
field 5: d
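By default, regex matching always takes the longest possible match, so at an opening quote the quoted alternative, which runs up to the closing quote, wins over the shorter run of non-comma characters. The whole quoted string therefore becomes one field instead of being split at the comma inside it.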
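Where gawk and FPAT are not available, the same "describe the field, not the separator" idea can be approximated in POSIX awk with match() and index(). This is only a sketch under simplifying assumptions: quoted fields contain no escaped quotes and are directly followed by a comma or the end of the line. The function name csv_fields is hypothetical.

```shell
# csv_fields is a hypothetical name. It prints one "field N: ..." line per
# field of each input line.
csv_fields() {
    awk '{
        rest = $0
        n = 0
        for (;;) {
            if (match(rest, /^"[^"]*"/)) {
                len = RLENGTH                        # quoted field
            } else {
                i = index(rest, ",")                 # unquoted, possibly empty
                len = (i ? i - 1 : length(rest))
            }
            printf "field %d: %s\n", ++n, substr(rest, 1, len)
            if (len >= length(rest)) break           # nothing left after this field
            rest = substr(rest, len + 2)             # skip the field and its comma
        }
    }' "$@"
}

printf '%s\n' 'a,b,c,"hello, hi",d' | csv_fields
# prints:
# field 1: a
# field 2: b
# field 3: c
# field 4: "hello, hi"
# field 5: d
```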

How do I properly match unicode characters with awk's regex?

I have the following statement in a script, to retrieve the domain portion of an email address from a variety of email logs with a reliably formatted To: line:
awk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
This matches lines such as To: doc@bequerelint.net (Omer). However, it does not match the lines To: andy.vitrella@uol.com.br (André) or To: boggers@operamail.com (Pål), nor any other line with a non-ASCII character within the trailing parentheses after the email address.
Incidentally, od -c for the first non-matching example gives:
0000000 T o : a n d y . v i t r e l l
0000020 a @ u o l . c o m . b r ( A n
0000040 d r 351 ) \n
0000045
I surmise there is something going on with awk's regex's . not matching the non-ascii character in (André). What is the correct regex statement to match such a line?
I give my comment as an answer to have the code formatted correctly,
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' | gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
uol.com.br
operamail.com
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' > fileee12
$ gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}' fileee12
uol.com.br
operamail.com
$ env | grep -e '\(LOC\)\|\(LAN\)'
LANG=C
XTERM_LOCALE=C
$
As you can see, your command works both when reading from stdin and when reading from a file, using the C locale, so on my computer neither the locale nor the input source makes a difference.
My computer runs Linux and my gawk is 4.1.1; what are your circumstances?
Simplifying further, in a way where the locale setting simply doesn't matter (works with mawk or gawk):
awk 'BEGIN { FS = "\100" }    # "\100" is the at sign (@)
/^To: / && NF > 1 {           # play it safe in case there is no at sign
    # no space means there is no "(Omer)" part towards the end
    print (($2 !~ / /) ? $2 : substr($2, 1, index($2, " ") - 1))
}'
Since spaces aren't valid in an email address (unless URI-encoded?), and you're forcing the delimiter to be @, this substr() alone does it without gensub(), Unicode handling and so on.
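One common reason for . not matching a byte such as \351 is a UTF-8 locale, in which that byte on its own is not a valid character. Whether that is what happens on the asker's system is an assumption; under it, forcing the C locale for the awk call makes every byte a character. A minimal regex-based sketch, with mail_domain as a made-up helper name:

```shell
# mail_domain is a hypothetical name. It prints the domain of each To: line.
mail_domain() {
    # LC_ALL=C makes awk treat the input as single bytes (assumed cause, see above).
    LC_ALL=C awk '/^To: .+@/ {
        sub(/^To: [^@]*@/, "")    # drop everything up to the first @
        sub(/ .*$/, "")           # drop the first space and what follows
        print
    }' "$@"
}

printf 'To: andy.vitrella@uol.com.br (Andr\351)\n' | mail_domain    # prints: uol.com.br
```

Unlike the greedy gensub() pattern in the question, this cuts at the first @ and the first space after it.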