awk regex pattern does not match beginning of the line - regex

I'm using GNU awk version 3.1.7 on Windows 10, MinGW installation.
The test file has the following contents, but the behaviour is the same with other files as well.
test.txt
line one
second line
another line
end this one should match
double test
yet another
I want to print only the first words that begin with e.
The awk command I'm using is:
awk '{ if ($1 ~ /^e/) {print $1} }' test.txt
But this prints every first word that contains the character e anywhere.
output
line
second
another
end
double
yet
Matching the end of the word works fine.
For example, to match every first word ending with d:
awk '{ if ($1 ~ /d$/) {print $1} }' test.txt
output
second
end
Any idea why the first example, matching the beginning of the word, does not work?
What am I doing wrong there?

That's got nothing to do with gawk; it's Windows quoting rules. gawk doesn't even see the quotes - it just runs whatever script Windows passes to it (i.e. the part between the quotes), and it's entirely Windows that interprets the quotes to isolate the script that it then passes to gawk. The standard advice is to avoid the problem by putting the awk script in a file and running it as awk -f script instead of trying to deal with the Windows quoting nightmare. The best advice, though, is to run cygwin on top of Windows.
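A minimal sketch of the awk -f approach, assuming a POSIX-like shell is available for creating the files (on plain Windows you would create the script file in an editor instead). The file name first_e.awk and the shortened sample data are illustrative assumptions, not taken from the question.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Sample input, shortened from the question.
printf '%s\n' 'line one' 'second line' 'end this one should match' > "$workdir/test.txt"

# The awk program lives in its own file (hypothetical name), so no
# command-line quoting of the script is involved at all.
cat > "$workdir/first_e.awk" <<'EOF'
$1 ~ /^e/ { print $1 }
EOF

# Run the script file against the input.
awk -f "$workdir/first_e.awk" "$workdir/test.txt"
```

With this input the command prints only end, because the ^ reaches awk untouched by any shell.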

I just tried it with gawk 3.1.6 - 1 on Windows 10.
When I try with single quotes it gives a syntax error:
awk '{ if ($1 ~ /^e/) {print $1} }' test.txt
// Error
awk: '{
awk: ^ invalid char ''' in expression
With double quotes it works fine and prints only end.
awk "{ if ($1 ~ /^e/) {print $1} }" test.txt
So I tried the same line with double quotes on gawk 3.1.7 as well.
It works.
It prints only end.
gawk 3.1.7 does not give any error when I use the single-quoted example, but the /^e/ regex in it does not match as it should for some reason.
So, at least from my perspective, if you are using gawk on Windows, always use double quotes around the awk code on the command line.

awk "{ if ($1 ~ /^^e/) {print $1} }" test.txt
on Windows platform:
1- exchange " with ' and vice versa
2- for ^ use ^^

Related

Using protected wildcard character in awk field separator doesn't work

I have a file that contains paragraphs separated by lines of *(any amount). When I use egrep with the regex of '^\*+$' it works as intended, only displaying the lines that contain only stars.
However, when I use the same expression in awk -F or awk FS, it doesn't work and just prints out the whole document, excluding the lines of stars.
Commands that I tried so far:
awk -F'^\*+$' '{print $1, $2}' msgs
awk -F'/^\*+$/' '{print $1, $2}' msgs
awk 'BEGIN{ FS="/^\*+$/" } ; { print $1,$2 }' msgs
Printing the first field always prints out the whole document, using the first version it excludes the lines with the stars, other versions include everything from the file.
Example input:
Par1 test teststsdsfsfdsf
fdsfdsfdsftesyt
fdsfdsfdsf fddsteste345sdfs
***
Par2 dsadawe232343a5edsfe
43s4esfsd s45s45e4t rfgsd45
***
Par3 dsadasd
fasfasf53sdf sfdsf s45 sdfs
dfsf dsf
***
Par4 dasdasda r3ar d afa fs
ds fgdsfgsdfaser ar53d f
***
Par 5 dasdawr3r35a
fsada35awfds46 s46 sdfsds5 34sdf
***
Expected output for print $1:
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
EDIT: Added example input and expected output
Strings used as regexps in awk are parsed twice:
to turn them into a regexp, and
to use them as a regexp.
So if you want to use a string as a regexp (including any time you assign a Field Separator or Record Separator as a regexp) then you need to double any escapes as each iteration of parsing will consume one of them. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for details.
Good (a literal/constant regexp):
$ echo 'a(b)c' | awk '$0 ~ /\(b)/'
a(b)c
Bad (a poorly-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\(b)"'
awk: cmd. line:1: warning: escape sequence `\(' treated as plain `('
a(b)c
Good (a well-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\\(b)"'
a(b)c
but IMHO if you're having to double escapes to make a char literal then it's clearer to use a bracket expression instead:
$ echo 'a(b)c' | awk '$0 ~ "[(]b)"'
a(b)c
Also, ^ in a regexp means "start of string", which is only matched at the start of all the input, just like $ would only be matched at the end of all of the input. ^ does not mean "start of line" as some documents/scripts may lead you to believe. It only appears to mean that in grep and sed because they are line-oriented and so usually the script is being compared to 1 line at a time, but awk isn't line-oriented, it's record-oriented, and so the input being compared to the regexp isn't necessarily just a line (the same is true in sed if you read multiple lines into its hold space).
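A small sketch of that point, assuming POSIX awk paragraph mode (RS set to the empty string) so that one record spans two lines. The function name anchor_demo and the labels it prints are made up for illustration.

```shell
# Hypothetical demo: the record is "a\nb", so it contains two lines.
anchor_demo() {
  printf 'a\nb\n' | awk -v RS= '{
    # ^ anchors at the start of the record, not at the start of each line.
    whole = ($0 ~ /^b/) ? "yes" : "no"
    # Allowing "start of record OR after a newline" finds the second line.
    line = ($0 ~ /(^|\n)b/) ? "yes" : "no"
    print "start-of-record anchor: " whole
    print "start-of-line alternative: " line
  }'
}

anchor_demo
```

The first test reports no and the second reports yes, since b starts a line but not the record.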
So to match a line of *s as a Record Separator (RS) assuming you're using gawk or some other awk that can treat a multi-char RS as a regexp, you'd have to write this regexp:
(^|\n)[*]+(\n|$)
but be aware that also matches the newlines before the first and after the last *s on the target lines so you need to handle that appropriately in your code.
It seems like this is what you're really trying to do:
$ awk -v RS='(^|\n)[*]+(\n|$)' 'NR==1{$1=$1; print}' file
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
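If a regexp RS is not available (it is not required by POSIX), a sketch like the following gets the same first paragraph with line-oriented matching instead. The function name first_paragraph is hypothetical, and it assumes separator lines consist only of * characters.

```shell
# Hypothetical helper: join the lines before the first all-stars line.
first_paragraph() {
  awk '
    /^[*]+$/ { exit }                        # first separator line: stop reading
    { out = (NR == 1 ? "" : out " ") $0 }    # join lines with single spaces
    END { print out }                        # END still runs after exit
  ' "$@"
}

# Usage with a shortened version of the sample input.
printf '%s\n' 'Par1 test' 'second line' '***' 'Par2 other' '***' | first_paragraph
```

Here ^ and $ behave like line anchors only because the default RS makes each record a single line.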

unix sed not backtracking to finish the job

I'm trying to make a script to convert postgres CSV dumps into Oracle csv dumps. Aka, I'm trying to replace "true" with "Y" and "false" with "N".
So I want a script called to_oracle like this:
echo "false,false,false,true" | to_oracle
N,N,N,Y
So here is my attempt:
sed -E -e 's:(,|^)true(,|$):\1Y\2:g' -e 's:(,|^)false(,|$):\1N\2:g' "$@"
The logic is that a field in a CSV file either starts with beginning of line or a comma "," and it ends with either the end of line or a comma ","
The problem with this script is that it greedily absorbs the comma and thus every second field doesn't work:
echo "false,false,false,true" | to_oracle
N,false,N,Y
Now I suppose I could pipe it to the script twice, and that would do the job, but I'm wondering is there a more elegant solution?
An awk version:
echo "false,false,false,true" | awk -F, -v OFS=, '{for(i=1;i<=NF;i++) $i=$i=="true"?"Y":"N"}1'
N,N,N,Y
It tests the fields one by one: if a field is true it uses Y, otherwise N.
If you would like to test for false as well:
echo "false,false,false,true" | awk -F, -v OFS=, '{for(i=1;i<=NF;i++) $i=($i=="true"?"Y":($i=="false"?"N":"other"))}1'
N,N,N,Y
With GNU sed, you may use
sed -E ':a;s/(,|^)false(,|$)/\1N\2/;ta; :b;s/(,|^)true(,|$)/\1Y\2/;tb'
See the online demo
Details
-E will enable POSIX ERE syntax
:a;s/(,|^)false(,|$)/\1N\2/;ta; will repeatedly replace false in between commas or start/end of string with N, looping back to label a until no match is left
:b;s/(,|^)true(,|$)/\1Y\2/;tb will repeatedly replace true in between commas or start/end of string with Y in the same way.
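As a sketch, the loop can be wrapped into the to_oracle command from the question. Splitting each label, substitution and branch into its own -e is an assumption made here for portability across sed implementations that dislike ; after a label; -E itself is supported by GNU and BSD sed.

```shell
# Sketch of to_oracle as a shell function (the question wanted a script).
to_oracle() {
  sed -E \
    -e ':a' -e 's/(^|,)false(,|$)/\1N\2/' -e 'ta' \
    -e ':b' -e 's/(^|,)true(,|$)/\1Y\2/'  -e 'tb' "$@"
}

# Adjacent fields are all converted because each pass rescans the line.
echo "false,false,false,true" | to_oracle
```

Each t branch restarts the substitution from the beginning of the line, so a comma consumed by one match is available again for the next field.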

AWK print based on FILENAME pattern

I have a directory of files with filenames of the form file000.txt to filennn.txt. I would like to be able to specify a range of file names and print the content of those files based on a match. I have achieved it with a single file pattern:
$ gawk 'FILENAME ~/file038.txt/ {print FILENAME, $0}' file*.txt
file038.txt Some 038 text here
But I cannot get a pattern that would allow me to specify a range of file names, for instance
gawk 'FILENAME ~/file[038-040].txt/ {print FILENAME, $0}' file*.txt
I'm sure I'm missing something simple here, I'm an AWK newbie. Any suggestions?
you can do some substitution on the filename, for example:
awk '{x=FILENAME;gsub(/[^0-9]/,"",x);x+=0}x>10&&x<50{your logic}' file*.txt
In this way, files file011.txt through file049.txt would be handled by "your logic".
You can adjust the x>10&&x<50 part, for example to handle only files with an odd or even number in the name; just write a boolean expression there.
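A runnable sketch of the same idea with the bounds passed in as variables. The function name print_range and its argument order are hypothetical; stripping the directory part first is an addition so that digits in a path do not end up in the number.

```shell
# Hypothetical helper: print_range LO HI file...
print_range() {
  lo=$1; hi=$2; shift 2
  awk -v lo="$lo" -v hi="$hi" '
    FNR == 1 {                  # first line of each file: recompute its number
      n = FILENAME
      sub(/.*\//, "", n)        # drop any leading directory
      gsub(/[^0-9]/, "", n)     # keep only the digits, e.g. "038"
      n += 0                    # force numeric, e.g. 38
    }
    n >= lo && n <= hi { print FILENAME, $0 }
  ' "$@"
}
```

For example, print_range 38 40 file*.txt prints every line of file038.txt through file040.txt, each prefixed with its file name.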
An odd way, but something along these lines (the alternation matches exactly 038, 039 and 040, and the dot is escaped):
awk '{ if (match(FILENAME,/file0(3[89]|40)\.txt/)) { print FILENAME, $0}}' file*.txt
Solution using gawk and a recent version of bash
There is a bash primitive to handle file[038-040].txt. It makes the code quite simple:
gawk 'FNR==1 {print FILENAME, $0} {nextfile}' file{038..040}.txt
Key points:
FNR==1 {print FILENAME, $0}
This prints the filename and the first line of each file
{nextfile}
This saves time by skipping directly to the next file.
file{038..040}.txt
The construct {038..040} is a bash feature called brace expansion. bash will replace this with the file names that you want. If you want to test out brace expansion to see how it works, try it on the command line with this simple statement:
echo file{038..040}.txt
UPDATE 1: Mac OSX currently uses bash v3.2 which does not support leading zeros in brace expansion.
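On a shell without zero-padded brace expansion, the names can be generated with printf instead. This is a sketch; the function name padded_names is hypothetical, and the bounds must be given without leading zeros because printf would read 038 as an invalid octal number.

```shell
# Hypothetical helper: padded_names FIRST LAST prints file038.txt style names.
padded_names() {
  i=$1
  while [ "$i" -le "$2" ]; do
    printf 'file%03d.txt\n' "$i"   # %03d pads to three digits with zeros
    i=$((i + 1))
  done
}

padded_names 38 40
```

The result can then be passed on the command line, for example gawk '...' $(padded_names 38 40), as long as the names contain no whitespace.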
UPDATE 2: If there are missing files and you have a modern gawk (v4.0 or better), use this instead:
gawk 'BEGINFILE{ if (ERRNO) nextfile} FNR==1 {print FILENAME, $0} {nextfile}' file{038..040}.txt
Solution using gawk with a plain POSIX shell
gawk '{n=0+substr(FILENAME,5,3)} FNR==1 && n>=38 && n<=40 {print FILENAME, $0} {nextfile}' file*.txt
Explanation:
n=0+substr(FILENAME,5,3)
Extract the number from the filename. 0+ is a trick to force awk to treat n as numeric.
n>=38 && n<=40 {print FILENAME, $0}
This selects the file based on its number and prints the filename and first line.
{nextfile}
As before, this saves time by stopping awk from reading the rest of each file.
file*.txt
This can be expanded by any POSIX shell to the list of file names.
Should work
awk '(x=FILENAME)~/(3[8-9]|40).txt$/{print x,$0;nextfile}' file*.txt
As nextfile doesn't work (at least with my version of awk), here is another way
awk 'FNR==((x=FILENAME)~/(3[8-9]|40).txt$/){print x,$0}' file*.txt

extract all values for specific key from space delimited text file

I have a text file in the format
1=23 2=44 15=17:31:37.640 5=abc 15=17:31:37.641 4=23 15=17:31:37.643 15=17:31:37.643
I need a regex to extract all the values for key 15 from a multiline text file.
The output should be
17:31:37.640 17:31:37.641 17:31:37.643 17:31:37.643
Sorry, I should have stated that the values I'm trying to extract are timestamps in the form 17:31:37.643
You can use GNU grep to extract the substrings.
grep -Po '\b15=\K\S+' | tr '\n' ' '
-P option interprets the pattern as a Perl regular expression.
-o option shows only the matching part that matches the pattern.
\K throws away everything that it has matched up to that point.
Output
17:31:37.640 17:31:37.641 17:31:37.643 17:31:37.643
You can use sed:
sed 's/15=\([^ ]*\)/\1/g;s/[0-9]\+[^ ]\+ //g' input.file
Gave that answer before OP added the expected output, it will work too, but adds a new line after every value:
If you have GNU grep, you can use a lookbehind assertion that comes with perl compatible regex mode:
grep -oP '(?<=15=)[^ ]*' <<< '1=23 2=44 15=xyz 5=abc 15=yyy 4=23 15=omnet 15=that'
Output:
xyz
yyy
omnet
that
Using awk:
awk -F'=' -v RS=' ' -v ORS=' ' '$1==15 { print $2 }' file
xyz yyy omnet that
Set the Input and Output Record Separator to space and Input Field Separator to =. Test the condition of column1 to be 15. If that is true, print the second column.
As suggested by Ed Morton in the comments, this would leave a trailing blank char or even a missing final newline. If that's a concern, you can use the following, which relies on GNU awk for multi-char RS.
gawk -F'=' -v RS='[[:space:]]+' '$1==15{ printf "%s%s", (c++?OFS:""), $2 } END{print ""}' file
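For awks without a regexp RS, a sketch that walks the space-separated fields of each line and splits every field at its first = works as well. The function name values_for_key is hypothetical; all values are collected onto one output line, as in the expected output.

```shell
# Hypothetical helper: values_for_key KEY [file...]
values_for_key() {
  key=$1; shift
  awk -v key="$key" '
    {
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")                      # position of the first "="
        if (eq > 0 && substr($i, 1, eq - 1) == key)
          out = out (out == "" ? "" : " ") substr($i, eq + 1)
      }
    }
    END { print out }                            # one line, no trailing blank
  ' "$@"
}

echo '1=23 2=44 15=17:31:37.640 5=abc 15=17:31:37.641' | values_for_key 15
```

Because the separator is only added between values, the output has no trailing blank and ends with a normal newline.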

AWK: Access captured group from line pattern

If I have an awk command
pattern { ... }
and pattern uses a capturing group, how can I access the string so captured in the block?
With gawk, you can use the match function to capture parenthesized groups.
gawk 'match($0, pattern, ary) {print ary[1]}'
example:
echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}'
outputs cd.
Note the specific use of gawk which implements the feature in question.
For a portable alternative you can achieve similar results with match() and substr.
example:
echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'
outputs cd.
That was a stroll down memory lane...
I replaced awk by perl a long time ago.
Apparently the AWK regular expression engine does not capture its groups.
you might consider using something like :
perl -n -e'/test(\d+)/ && print $1'
the -n flag causes perl to loop over every line like awk does.
This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.
Definition
Add this to your .bash_profile etc.
function regex { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'0'}']}'; }
Usage
Capture regex for each line in file
$ cat filename | regex '.*'
Capture 1st regex capture group for each line in file
$ cat filename | regex '(.*)' 1
You can use GNU awk:
$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]
$ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/
NOTE: the use of gensub is not POSIX compliant
You can simulate capturing in vanilla awk too, without extensions. It's not intuitive though:
step 1. use gensub to surround matches with some character that doesn't appear in your string.
step 2. Use split against the character.
step 3. Every other element in the split array is your capture group.
$ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad
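Since gensub is a gawk extension, here is a sketch of the same "collect every match" result using only match() and substr(), which POSIX awk provides. The function name all_matches is hypothetical; note that a regex passed through -v goes through string escape processing, so backslashes in it would need doubling.

```shell
# Hypothetical helper: all_matches REGEX [file...]
all_matches() {
  re=$1; shift
  awk -v re="$re" '{
    s = $0; out = ""
    # Repeatedly match against the unconsumed remainder of the line.
    while (length(s) > 0 && match(s, re)) {
      out = out (out == "" ? "" : "|") substr(s, RSTART, RLENGTH)
      # Skip past the match; advance at least one char on an empty match.
      s = substr(s, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
    }
    print out
  }' "$@"
}

echo 'ab cb ad' | all_matches 'a.'
```

This prints ab|ad, matching the gensub/split example above without needing a separator character that is absent from the data.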
I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:
function regex
{
perl -n -e "/$1/ && printf \"%s\n\", "'$1'
}
I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.
'([0-9]*)ms$'
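For this particular pattern, a sketch in plain awk can drop the "ms" by trimming the match with substr(), along the lines of the portable match()/substr answer earlier in the thread. The function name strip_ms is hypothetical, and it uses [0-9]+ rather than [0-9]* so that a bare "ms" with no digits is not reported.

```shell
# Hypothetical helper: print the digits that precede a trailing "ms".
strip_ms() {
  # RLENGTH covers digits plus "ms", so drop the last two characters.
  awk 'match($0, /[0-9]+ms$/) { print substr($0, RSTART, RLENGTH - 2) }' "$@"
}

printf '%s\n' 'request took 125ms' 'no timing here' | strip_ms
```

Only the first line matches, so the output is 125.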
I think gawk's match()-to-array only captures the first instance of the capture group.
If there are multiple things you'd like to capture, and you want to perform complex operations on them, perhaps:
gawk 'BEGIN { S = SUBSEP }
{
    # str in the original is assumed here to be the current record, $0
    nx = split(gensub(/(..(..)..(..))/,
                      "\\1" S "\\2" S "\\3", "g", $0),
               arr, S)
    # split() returns a count, so loop by index rather than with "in"
    for (x = 1; x <= nx; x++) { print arr[x] }  # stand-in for your own operations on arr[x]
}'
This way you aren't constrained by either gensub(), which limits the complexity of your modifications, or by match().
by pure trial-and-error, one caveat i've noted about gawk in unicode mode : for a valid unicode string 뀇꿬 with the 6 octal codes listed below :
Scenario 1 : matching individual bytes is fine, but it will report the multi-byte RSTART of 1 instead of a byte-level answer of 2. It also won't provide info on whether \207 is the 1st continuation byte or the second one, since RLENGTH will always be 1 here.
$ gawk 'BEGIN{ print match("\353\200\207\352\277\254", "\207") }'
1
Scenario 2 : match() also works against unicode-invalid patterns like this
$ gawk 'BEGIN{ match("\353\200\207\352\277\254", "\207\352");
print RSTART, RLENGTH }'
1 2
Scenario 3 : you can check for the existence of a pattern against a unicode-illegal string (\300 \xC0 is UTF8-invalid for all possible byte pairings)
$ gawk 'BEGIN{ print ("\300\353\200\207\352\277\254" ~ /\200/) }'
1
Scenarios 4/5/6 : the warning message will show up for either (a) match() with a unicode-invalid string, or (b) index() when either argument is unicode-invalid/incomplete.
$ gawk 'BEGIN{ match("\300\353\200\207\352\277\254", "\207\352"); print RSTART, RLENGTH }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2 2
$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\352") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0
$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\200") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0