How do I properly match unicode characters with awk's regex?

I have the following statement in a script, to retrieve the domain portion of an email address from a variety of email logs with a reliably formatted To: line:
awk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
This matches lines such as To: doc@bequerelint.net (Omer). However, it does not match the lines To: andy.vitrella@uol.com.br (André) or To: boggers@operamail.com (Pål), nor any other line with a non-ASCII character within the trailing parentheses after the email address.
Incidentally, od -c for the first non-matching example gives:
0000000   T   o   :       a   n   d   y   .   v   i   t   r   e   l   l
0000020   a   @   u   o   l   .   c   o   m   .   b   r       (   A   n
0000040   d   r 351   )  \n
0000045
I surmise there is something going on with awk's regex . not matching the non-ASCII character in (André). What is the correct regex statement to match such a line?

I give my comment as an answer to have the code formatted correctly:
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' | gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
uol.com.br
operamail.com
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' > fileee12
$ gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}' fileee12
uol.com.br
operamail.com
$ env | grep -e '\(LOC\)\|\(LAN\)'
LANG=C
XTERM_LOCALE=C
$
As you can see, your command works both when reading from stdin and when reading from a file, using a C locale, so on my machine I can rule out the locale, and any stdin-versus-file difference, as the cause.
My computer runs Linux and my gawk is 4.1.1; what are your circumstances?
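One hedged thing worth checking, going by the od output in the question: octal 351 is é in ISO-8859-1, and a lone 0xE9 byte is not valid UTF-8, so in a UTF-8 locale gawk's . can decline to match it. A quick diagnostic sketch (fileee12 as above; not from the original thread):
$ LC_ALL=C gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r }' fileee12
If that matches while your default locale does not, the file's encoding (Latin-1) and your locale (UTF-8) disagree.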

Further simplifying it, in a way where the locale setting simply doesn't matter:
{mawk/mawk2/gawk [-b]? -e} 'BEGIN { FS = "\100" }   # octal \100 is the at sign "@"
/^To: / && NF > 1 {                                 # play it safe in case of no "@"
    # if there is no trailing " (Omer)" part, $2 is already the domain;
    # otherwise chop $2 off at the first space
    print (($2 !~ / /) ? $2 : substr($2, 1, index($2, " ") - 1))
}'
Since spaces aren't valid in an email address (unless URI-encoded?), and you're force-delimiting on @, this substr alone does the job without all the gensub and Unicode handling and what not.
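For illustration, here is a hedged run of the above against the sample lines from the question (output derived from the logic, not from the original thread):
$ printf 'To: andy.vitrella@uol.com.br (André)\nTo: boggers@operamail.com (Pål)\n' |
gawk 'BEGIN { FS = "\100" } /^To: / && NF > 1 { print (($2 !~ / /) ? $2 : substr($2, 1, index($2, " ") - 1)) }'
uol.com.br
operamail.com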


How can I output the number of repeats of a pattern in regex?

I would like to output the number of repeats of a pattern with regex. For example, convert "aaad" to "3xad", "bCCCCC" to "b5xC". I want to do this in sed or awk.
I know I can match it by (.)\1+ or even capture it by ((.)\1+). But how can I obtain the repeat count and insert that value back into the string with sed or awk?
Perl to the rescue!
perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'
-p reads the input line by line and prints it after processing
s/// is the substitution similar to sed
/e makes the replacement evaluated as code
e.g.
aaadbCCCCCxx -> 3xadb5xC2xx
In GNU awk:
$ echo aaadbCCCCCxx | awk -F '' '{
    for(i=1;i<=NF;i+=RLENGTH) {           # jump ahead by the length of each run
        c=$i                              # current character
        match(substr($0,i),c"+")          # RLENGTH = length of the run of c
        b=b (RLENGTH>1?RLENGTH "x":"") c  # prefix with "Nx" only when the run > 1
    }
    print b
}'
3xadb5xC2xx
If regex metacharacters should be read as literal characters, as noted in the comments, one could try to detect and escape them (the solution below is only directional):
$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i
        # print i,c               # for debugging
        if(c~/[*.\\]/)            # if c is a regex metachar (list not complete)
            c="\\"c               # escape it
        match(substr($0,i),c"+")  # find all c:s
        b=b (RLENGTH>1?RLENGTH "x":"") $i  # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx
Just for fun.
With sed it is cumbersome but doable. Note this example relies on GNU sed:
parse.sed
/(.)\1+/ {
  : nextrepetition
  /((.)\2+)/ s//\n\1\n/      # delimit the repetition with new-lines
  h                          # and store the delimited version
  s/^[^\n]*\n|\n[^\n]*$//g   # now remove prefix and suffix
  b charcount                # count repetitions
  : aftercharcount           # return here after counting
  G                          # append the new-line delimited version
  # Reorganize pattern space to the desired format
  s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/
  # Run again if more repetitions exist
  /(.)\1+/ b nextrepetition
}
b
# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount
s/./a/g
# Do the carry. The t's and b's are not necessary,
# but they do speed up the thing
t a
: a; s/aaaaaaaaaa/b/g; t b; b done
: b; s/bbbbbbbbbb/c/g; t c; b done
: c; s/cccccccccc/d/g; t d; b done
: d; s/dddddddddd/e/g; t e; b done
: e; s/eeeeeeeeee/f/g; t f; b done
: f; s/ffffffffff/g/g; t g; b done
: g; s/gggggggggg/h/g; t h; b done
: h; s/hhhhhhhhhh//g
: done
# On the last line, convert back to decimal
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
b aftercharcount
Run it like this:
sed -Ef parse.sed infile
With an infile like this:
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa
The output is:
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
I was hoping we'd have a MCVE by now but we don't so what the heck - here is my best guess at what you're trying to do:
$ cat tst.awk
{
    out = ""
    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)
        # count how many times char repeats, starting at pos
        for (reps=1; char == substr($0,pos+reps,1); reps++);
        out = out (reps > 1 ? reps "x" : "") char
    }
    print out
}
$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
The above was run against the sample input that @Thor kindly provided:
$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa
The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
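For instance, here is a hedged sketch of that case-insensitive variant; only the inner loop of tst.awk changes:
for (reps=1; tolower(char) == tolower(substr($0,pos+reps,1)); reps++);
With this change an input like aaAAd would collapse to 4xad (the run keeps its first character's case).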

Replace special characters except the following ,.@

I'm looking for an option to remove special characters from a file except for the following 3 items ,.@
The following awk command gets close but it removes all punctuation.
awk '{gsub(/[[:punct:]]/,"",except(".","@",","))}1' test.csv > test2.csv
Any ideas...
There are no opposite character classes in POSIX and no lookarounds to restrict a more generic pattern with some exceptions. The only way is to spell out the POSIX character class.
According to Character Classes and Bracket Expressions:
‘[:punct:]’
Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
You may use
/[!-+\/:-?[-`{-~-]/
(The linked regex demo and its legend image did not survive here.) Reading the bracket expression range by range: !-+ covers ! through + (stopping before , at 0x2C), then / and the trailing - are listed individually (skipping . at 0x2E), :-? covers : through ? (stopping before @ at 0x40), [-` covers [ through the backtick, and {-~ covers { through ~. Together these match every punct character except , . and @.
All 3 of these approaches will work in any locale, will work for any character class just by changing the class name, and will work for other bracket expressions or strings, etc.:
1) Just look for any punct but only change it if it's not one of the chars you don't want changed:
$ echo 'a.b?c#d@e,f' |
awk '{
    new = ""
    while ( match($0,/[[:punct:]]/) ) {
        chr = substr($0,RSTART,1)
        new = new substr($0,1,RSTART-1) (chr ~ /[,.@]/ ? chr : "")
        $0 = substr($0,RSTART+RLENGTH)
    }
    print new $0
}'
a.bcd@e,f
2) Turn the chars you don't want changed into other strings first then turn them back afterwards:
$ echo 'a.b?c#d@e,f' |
awk '{
    gsub(/a/,"aA"); gsub(/,/,"aB"); gsub(/\./,"aC"); gsub(/@/,"aD")
    gsub(/[[:punct:]]/,"")
    gsub(/aD/,"@"); gsub(/aC/,"."); gsub(/aB/,","); gsub(/aA/,"a")
    print
}'
a.bcd@e,f
Changing a into aA first and back afterwards is what guarantees that the strings you create when converting the @, etc. cannot already exist elsewhere in the input at that time, and that's why you can safely convert them back afterwards.
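To make the round trip concrete, here is the intermediate state of the sample line after each stage, traced by hand from the gsub() calls above:
a.b?c#d@e,f        original
aAaCb?c#daDeaBf    after encoding a, then "," "." "@"
aAaCbcdaDeaBf      after gsub(/[[:punct:]]/,"") removes ? and #
a.bcd@e,f          after decoding back in reverse order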
3) Suffix the puncts with the RS value, then remove the RS suffix from the chars you don't want changed, then change the remaining RS-suffixed puncts:
$ echo 'a.b?c#d@e,f' |
awk '{
    gsub(/[[:punct:]]/,"&"RS)
    $0 = gensub("([,.@])"RS,"\\1","g")
    gsub("[[:punct:]]"RS,"")
    print
}'
a.bcd@e,f
That one uses GNU awk for gensub(), with other awks you'd need match()+substr().
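For completeness, here is a hedged sketch of that match()+substr() replacement for the gensub() step, which should work in a POSIX awk (same assumed input and output as above):
$ echo 'a.b?c#d@e,f' |
awk '{
    gsub(/[[:punct:]]/,"&"RS)        # suffix every punct with RS
    while ( match($0, "[,.@]"RS) )   # strip the RS suffix after each keeper
        $0 = substr($0,1,RSTART) substr($0,RSTART+RLENGTH)
    gsub("[[:punct:]]"RS,"")         # remove the remaining RS-suffixed puncts
    print
}'
a.bcd@e,f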

Bash Regular expression for "not space, comma, not space"

I have a file like this:
a,b,c,"hello, hi",d
I want the field separator to be not space, comma, not space.
Currently I have
cat file | awk 'BEGIN { FS = "[^ ],[^ ]" } ; { print $4 }'
which should give "hello, hi" but it returns nothing. I'm quite new to this regular expression thing so any help would be appreciated.
Eh, no, it should not give hello, hi. The field separator [^ ],[^ ] eats one non-space character on each side of every comma it matches, so the three separators found in a,b,c,"hello, hi",d are a,b, then c,", then ",d. That leaves $1 empty (nothing comes before the first separator), $2 as the lone comma sitting between a,b and c,", $3 as hello, hi, and $4 empty: after the third field separator the line is exhausted, so print $4 prints nothing. You can verify this behaviour with
aaa,baa,caa,"hello, hi",daa
as input-file.
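Equivalently, here is a hedged sketch that dumps every field of the original line so the split is visible (output derived from the separator walk above):
$ echo 'a,b,c,"hello, hi",d' |
awk 'BEGIN { FS = "[^ ],[^ ]" } { for (i=1; i<=NF; i++) printf("$%d = [%s]\n", i, $i) }'
$1 = []
$2 = [,]
$3 = [hello, hi]
$4 = []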
If you work with CSV files regularly, consider installing the csvtool, then you can simply say:
echo 'a,b,c,"hello, hi",d' | csvtool col 4 -
and it will spit out
"hello, hi"
You can also use sed:
$ sed 's/.*\("[^"]*"\).*/\1/' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
or grep:
$ grep -o '"[^"]*"' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
A cleaner solution is to define the field content instead of the field separator. You need to use gawk, because standard awk does not have this feature natively (on many Linux systems, awk is gawk).
echo 'a,b,c,"hello, hi",d' \
| awk '
    # define the field content with FPAT:
    # here, any run of non-commas, or a double-quote-encapsulated string
    BEGIN{ FPAT = "[^,]*|\"[^\"]*\"" }
    # for showing each field
    {for (i=1;i<=NF;i++) printf( "field %d: %s\n", i, $i)}
'
field 1: a
field 2: b
field 3: c
field 4: "hello, hi"
field 5: d
By default, regex matching always takes the longest possible match, so the quoted alternative "[^"]*" wins over the shorter non-comma run, taking the full quoted string instead of partial comma-separated pieces of the same string.
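So if all you want is the quoted field, a hedged one-liner variant of the same FPAT idea:
$ echo 'a,b,c,"hello, hi",d' |
gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]*\"" } { print $4 }'
"hello, hi"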

copying first string into second line

I have a text file in this format:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Here I call the first string before the first space the word (for example abacası).
The string which starts after the first space and ends with a number is a definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875).
I want to do this: if a line includes more than one definition (the first line has one, the second has two, the third has three), insert a newline and put the first string (the word) at the beginning of the new line. Expected output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
I have almost 1,500,000 lines in my text file and the number of definitions is not fixed for each line. It can be 1 to 5.
A small Python script does the job. Input is expected in input.txt, output goes to output.txt.
import re

rf = re.compile(r'([^\s]+\s).+')
r = re.compile(r'([^\s]+\s\:\s\d+\.\d+)')

with open("input.txt", "r") as f:
    text = f.read()

with open("output.txt", "w") as f:
    for l in text.split('\n'):
        offset = 0
        first = ""
        match = re.search(rf, l[offset:])
        if match:
            first = match.group(1)
            offset = len(first)
        while True:
            match = re.search(r, l[offset:])
            if not match:
                break
            s = match.group(1)
            offset += len(s)
            f.write(first + " " + s + "\n")
I am assuming the following format:
word definitionkey : definitionvalue [definitionkey : definitionvalue …]
None of those elements may contain a space and they are always delimited by a single space.
The following code should work:
awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file
Explanation (this is the same code but with comments and more spaces):
awk '
# match any line
{
    # iterate over each "key : value"
    for (i=2; i<=NF; i+=3)
        print $1, $i, $(i+1), $(i+2)   # prints each "word key : value"
}
' file
awk has some tricks that you may not be familiar with. It works on a line-by-line basis. Each stanza has an optional conditional before it (awk 'NF >=4 {…}' would make sense here since we'll have an error given fewer than four fields). NF is the number of fields and a dollar sign ($) indicates we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). print will default to using spaces between its arguments and adds a line break at the end (otherwise, we'd need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).
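As a hedged sanity check, here is the one-liner run by hand against the question's second sample line (the output follows from the field positions just described):
$ echo 'abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875' |
awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }'
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875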
With perl:
perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt
With bash:
while read -r line
do
    pre=${line%% *}
    echo "$line" | sed 's/\([0-9]\) /\1\n'$pre' /g'
done < "yourfile.txt"
This script reads the file line by line. For each line, the prefix is extracted with a parameter expansion (everything up to the first space), and spaces preceded by a digit are replaced with a newline plus the prefix using sed.
Edit: as tripleee suggested, it's much faster to do it all with sed:
sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt
Assuming there are always 4 space-separated words for each definition:
awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file
Or if the split should occur after that floating point number
perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file
(This is the perl equivalent of Avinash's answer)
Bash and grep:
#!/bin/bash
while IFS=' ' read -r in1 in2 in3 in4; do
if [[ -n $in4 ]]; then
prepend="$in1"
echo "$in1 $in2 $in3 $in4"
else
echo "$prepend $in1 $in2 $in3"
fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")
The output of grep -o puts every definition on a separate line, but definitions originating from the same line are missing the "word" at the beginning:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
The while loop then reads this, using a space as the field separator (IFS). If in4 is a zero-length string, we're on a line where the "word" is missing, so we prepend it.
The script takes the input file name as its argument, and saving output to an output file can be done with simple redirection:
./script inputfile > outputfile
Using perl:
$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log
Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Explanation:
Split the line to separate first field as word.
Then split the remaining line using the regex .*?[0-9]+\.[0-9]+.
Print word concatenated with every match of above regex.
I would approach this with one of the excellent Awk answers here; but I'm posting a Python solution to point to some oddities and problems with the currently accepted answer:
It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is kind of big.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.
I would also prefer a tool which prints to standard output, so that I can redirect it where I want it from the shell; but to keep this compatible with the earlier solution, this hard-codes output.txt as the destination file.
with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            tokens = line.rstrip().split()
            word = tokens[0]
            for idx in range(1, len(tokens), 3):
                print(word, ' '.join(tokens[idx:idx+3]), file=outfile)
If you really, really wanted to do this in pure Bash, I suppose you could:
while read -r word analyses; do
    set -- $analyses
    while [ $# -gt 0 ]; do
        printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
        shift; shift; shift
    done
done <input.txt >output.txt
Please find the following bash code
#!/bin/bash
# read.sh
while read variable
do
    for i in "$variable"
    do
        var=`echo "$i" | wc -w`
        array_1=( $i )
        counter=0
        for (( j=1; j<$var; j++ ))
        do
            if [ $counter = 0 ] #1
            then
                echo -ne ${array_1[0]}' '
            fi #1
            echo -ne ${array_1[$j]}' '
            counter=$(expr $counter + 1)
            if [ $counter = 3 ] #2
            then
                counter=0
                echo
            fi #2
        done
    done
done
I have tested and it is working.
To test it, give the following command at the bash shell prompt:
$ ./read.sh < input.txt > output.txt
where read.sh is the script, input.txt is the input file, and output.txt is where the output is generated.
Here is sed in action:
sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file
output
indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875

How can I get the hostname from a file using awk and regex or substring

The file name is in a format like this:
YYYY-MM-DD_hostname_something.log
I want to get the hostname from the filename. The hostname can be any length, but always has a _ before and after. This is my current awk statement. It worked fine until the hostname length changed. Now I can't use it anymore.
awk 'BEGIN { OFS = "," } FNR == 1 { d = substr(FILENAME, 1, 10) } { h = substr(FILENAME, 12, 10) } $2 ~ /^[AP]M$/ && $3 != "CPU" { print d, $1 "" $2, h, $4+$5, $6, $7+$8+$9}' *_something.log > myfile.log
echo 'YYYY-MM-DD_hostname_something.log' | awk -F"_" '{print $2}'
Output:
hostname
I suppose your hostname contains no _.
$ ls YYYY-MM-DD_hostname_something.log | cut -d _ -f 2
hostname
The cut(1) utility is POSIX and accepts the -d _ option to specify a delimiter and -f 2 to specify the second field. It has got a few more nifty options that you can read about in its fine manual page.
Since you have mentioned you need to modify your awk code, replace your substr function with split.
split(FILENAME,a,"_");date = a[1];host = a[2]
This splits the FILENAME value into array a with _ as the separator:
a[1] will contain the date
a[2] will contain the hostname value
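Putting it together, here is a hedged sketch of the command from the question with the two substr() calls replaced by one split() (the field tests and output fields are copied unchanged from the question; d and h keep their original names):
awk 'BEGIN { OFS = "," }
     FNR == 1 { split(FILENAME, a, "_"); d = a[1]; h = a[2] }
     $2 ~ /^[AP]M$/ && $3 != "CPU" { print d, $1 "" $2, h, $4+$5, $6, $7+$8+$9 }' *_something.log > myfile.log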