I have a text file, and I need to identify a certain pattern in one field. I am using AWK, and trying to use the match() function.
The requirement is that I need to see if one of the following patterns exists in a string of digits:
??????1?
??????3?
??????5?
??????7?
i.e. I am only interested in the last-but-one digit being a 1, 3, 5, or a 7.
I have a solution, which looks like this:
b = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]1[0-9]")
c = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]3[0-9]")
d = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]5[0-9]")
e = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]7[0-9]")
if (b || c || d || e)
{
print "Found a match" $23
}
I think, though, that I should be able to write the regex more succinctly, like this:
b = match($23, "[0-9]{6}1[0-9]")
but this does not work.
Am I missing something, or are my regex skills (which are not great), really all that bad?
Thanks in anticipation
The regex delimiter is /.../, not "...". When you use quotes in an RE context, you're telling awk that there's an RE stored inside a string literal. That string literal gets parsed twice, once when the script is read and again when it's executed, which makes your RE specification that much more complicated, since it has to accommodate the double parsing.
So, do not write:
b = match($23, "[0-9]{6}1[0-9]")
write:
b = match($23, /[0-9]{6}1[0-9]/)
instead.
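To see the double parsing in action (a small illustration of my own, not part of the original answer): matching a literal dot needs a doubled backslash in the string form but only a single backslash in the regexp constant, because the string literal is parsed once as a string and then again as a regexp:
echo 'x a.b y' | awk '{ print match($0, "a\\.b"), match($0, /a\.b/) }'
Both calls print 3, the position of the match; drop one backslash from the string version and it becomes the regexp a.b, where the dot matches any character.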
That's not your problem though. The most likely problem you have is that you are calling a version of awk that does not support RE-intervals like {6}. If you are using an older version of GNU awk, then you can enable that functionality by adding the --re-interval flag:
awk --re-interval '...b = match($23, /[0-9]{6}1[0-9]/)...'
but whether it's that or you're using an awk that just doesn't support RE intervals, the best thing to do is get a newer version of gawk.
Finally, your whole script can be reduced to:
awk --re-interval '$23 ~ /[0-9]{6}[1357][0-9]/{print "Found a match", $23}'
Change [0-9] to [[:digit:]] for locale-independence if you like.
The reason why RE intervals weren't supported by default in gawk until recently is that old awk didn't support them, so a script with an RE of a{2}b, when executed in old awk, would have been looking for literally those 5 characters, and gawk didn't want old scripts to quietly break when executed in gawk instead of old awk. A few releases back the gawk maintainers rightly decided to take the plunge and enable RE intervals by default, favoring convenience over backward compatibility.
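A quick way to check which behavior your awk has (a small test of my own): with RE-interval support the following prints the line, while an awk that treats {2} literally prints nothing:
echo 'aab' | awk '/a{2}b/'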
Here is one awk solution:
awk -v FS="" '$7~/(1|3|5|7)/' file
By setting FS to nothing, every character becomes a field. We can then test field #7.
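For example, with an awk that supports an empty FS (a gawk extension that this answer relies on), and assuming as above that the whole line is the 8-digit string:
echo '12345678' | awk -v FS="" '{ print $7, NF }'
prints 7 8. If the digit strings can vary in length, testing $(NF-1) instead of $7 targets the last-but-one character.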
As Tom posted.
awk -v FS="" '$7~/[1357]/' file
Sorry for the nth simple question on regexps, but I'm not able to get what I need without what seems to me a too complicated solution. I'm parsing a file containing sequences of only 3 letters, A, E, D, as in
AADDEEDDA
EEEEEEEE
AEEEDEEA
AEEEDDAAA
and I'd like to identify only those that start with E and end in D with only one change in the sequence, as for example in
EDDDDDDDD
EEEDDDDDD
EEEEEEEED
I'm fighting with the proper regexp to do that. Here is my last attempt
echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'
which does not work. Any help?
Thanks in advance.
If I understand your request correctly, a simple
awk '/^E+D+$/' file.input
will do the trick.
UPDATE: if the line format contains pre/post numbers (with the post number optional), as shown later in the example, this is a possible pure-regex adaptation (an alternative to using the field switch -F,):
awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
First of all, you need the regular expression:
^E+[^ED]*D+$
This matches one or more Es at the beginning, zero or more characters that are neither E nor D in the middle, and one or more Ds at the end.
Then your AWK program will look like
$2 ~ /^E+[^ED]*D+$/
$2 refers to the 2nd field of the current record, ~ is the regex matching operator, and the slashes delimit a regular expression. Together, these components form what is known in AWK jargon as a "pattern", which amounts to a boolean filter for input records. Note that there is no "action" (a series of statements in braces) specified here. That's because when no action is specified, AWK assumes the action should be { print $0 }, which prints the entire line.
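As a quick check against the sample lines from the question (my own run-through; -F, makes the letter sequence field 2):
printf '1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED\n' | awk -F, '$2 ~ /^E+[^ED]*D+$/'
prints only 2,EEEEDDDD,2.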
If I understand you correctly, you want to match patterns that start with at least one E and then continue with at least one D until the end.
echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E+D+$) print $0}'
Been struggling to figure out a way to do this. Basically I need to change the case of anything enclosed in {} from lower to upper within a string representing a URI (and also strip out the braces, but I can use sed to do that).
E.g.
/logs/{server_id}/path/{os_id}
To
/logs/SERVER_ID/path/OS_ID
The case of the rest of the string must be preserved in lower case, which is what has been beating me. I've looked at combos of sed, awk, and tr with regexes so far. Any help appreciated.
sed "s/{\([^{}]*\)}/\U\1/g"
This works by matching all text enclosed within {} and replacing it with its uppercase version.
echo "/logs/{server_id}/path/{os_id}" | sed "s/{\([^{}]*\)}/\U\1/g"
Gives /logs/SERVER_ID/path/OS_ID as the result.
echo "/logs/{server_id}/path/{os_id}" \
| sed 's#{\([^{}][^{}]*\)}#\U\1#;s#{\([^{}][^{}]*\)}#\U\1#'
output
/logs/SERVER_ID/path/OS_ID
The part of the solution you seem to have missed is the 'capture groups' available in sed, i.e. \(regex\). This is then referenced by \1. You could have anywhere from 1-9 capture groups if you're a real masochist ;-)
Also note that I just repeat the same cmd 2 times, as the first {...} pair has been converted to the UC version (without the surrounding {}s), so only the remaining {...} targets will match.
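If more than two brace pairs can appear on a line, one possible adaptation (my own sketch, still assuming GNU sed since it relies on \U) is to loop the substitution until no braces remain:
echo "/logs/{server_id}/path/{os_id}/{foo}" | sed -e ':a' -e 's/{\([^{}]*\)}/\U\1/' -e 'ta'
which gives /logs/SERVER_ID/path/OS_ID/FOO.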
There is probably less verbose syntax available for [^{}][^{}]*, but this will work with just about any sed going back to the 80s. I seem to recall that some seds don't support the \U directive, but for the systems I have access to, this works.
Does that help?
$ awk '{
while(match($0,/{[^}]+}/))
$0=substr($0,1,RSTART-1) toupper(substr($0,RSTART+1,RLENGTH-2)) substr($0,RSTART+RLENGTH)
}1' file
/logs/SERVER_ID/path/OS_ID
This one handles arbitrary number and format of braces:
echo "/logs/{server_id}/path/{os_id}/{foo}" | awk -v RS='{' -v FS='}' -v ORS='\0' -v OFS='\0' '!/}/ { print } /}/ { $1 = toupper($1); print}'
Output:
/logs/SERVER_ID/path/OS_ID/FOO
I'm parsing a template on the Linux command line using pipes; it has some 1-indexed pseudo-variables, and I need 0-indexing. Basically:
...{1}...{7}...
to become
...{0}...{6}...
I'd like to use one of grep, sed or awk, in this order of preference. I guess grep can't do that, but I'm not sure. Is such an arithmetic operation even possible using any of these?
numbers used in the file are in range 0-9, so ignore problems like 23 becoming 12
no other numbers in the file, so you can even ignore the {}
I can do that using Python, Ruby, whatever, but I prefer not to, so stick to standard command-line utils
other command-line utils usable with pipes and regexes that I don't know about are fine too
EDIT: Reworded first bullet point to clarify
If the input allows it, you may be able to get away with simply:
tr 123456789 012345678
Note that this will replace all instances of any digit, so may not be suitable. (For example, 12 becomes 01. It's not clear from the question if you have to deal with 2 digit values.)
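For instance, on the sample from the question:
echo '...{1}...{7}...' | tr 123456789 012345678
gives ...{0}...{6}...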
If you do need to handle multi-digit numbers, you could do:
perl -pe 's/\{(\d+)\}/sprintf( "{%d}", $1-1)/ge'
You can use Perl.
perl -pe 's/(?<={)\d+(?=})/$&-1/ge' file.txt
If you are sure you can ignore {...}, then use
perl -pe 's/\d+/$&-1/ge' file.txt
And if the index is always just a one-digit number, then go with the shorter
perl -pe 's/\d/$&-1/ge' file.txt
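On the sample from the question, the first variant gives:
echo '...{1}...{7}...' | perl -pe 's/(?<={)\d+(?=})/$&-1/ge'
...{0}...{6}...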
With gawk version 4, you can write:
gawk '
{
n = split($0, a, /[0-9]/, seps)
for (i=1; i<n; i++)
printf("%s%d", a[i], seps[i]-1)
print a[n]
}
'
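A quick check on the sample line (the fourth argument to split(), which collects the separators, needs gawk 4 or later):
echo '...{1}...{7}...' | gawk '{ n = split($0, a, /[0-9]/, seps); for (i = 1; i < n; i++) printf("%s%d", a[i], seps[i] - 1); print a[n] }'
prints ...{0}...{6}...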
Older awks can use
awk '
{
while (match(substr($0,idx+1), /[0-9]/)) {
idx += RSTART
$0 = substr($0,1, idx-1) (substr($0,idx,1) - 1) substr($0,idx+1)
}
print
}
'
Both less elegant than the Perl one-liners.
Many people say we can do without lazy quantifiers in regular expressions, but I've just run into a problem that I can't solve without them (I'm using sed here).
The string I want to process is composed of substrings separated by the word rate, for example:
anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate
I want to replace those substrings (apart from the word 'rate') with 3 dashes (---) each; the result should be:
---rate---rate---rate
From what I understand (I don't know Perl), it can be easily done using lazy quantifiers. In vim there are lazy quantifiers too; I did it using this command
:s/.\{-}rate/---rate/g
where \{-} tells vim to match as few as possible.
However, vim is a text editor and I need to run the script on many machines, some of which have no Perl installed. It could also be solved if you could tell the regex not to match an atomic grouping like .*[^(rate)]rate, but that did not work.
Any ideas how to achieve this using POSIX regex, or is it impossible?
In a case like this, I would use split():
perl -n -e 'print join ("rate", ("---") x split /rate/)' [input-file]
Are there any characters that are guaranteed not to be in the input? For instance, if '!'
can't occur, you could transform the input to substitute that unique character, and then do a global replace on the transformed input:
sed 's/ rate /!/g' < input | sed -e 's/[^!]*/---/g' -e 's/!/rate/g'
Another alternative is to use awk's split command in an analogous way to
the perl suggestion above, assuming awk is any more reliably available than perl.
awk '
{ ans="---"
n=split($0, x, / rate /);
while ( n-- ) { ans = ans "rate---";}
print ans
}'
It's not easy without using lazy quantifiers or negative lookaheads (neither of which POSIX supports), but this seems to work.
([^r]*((r($|[^a]|a([^t]|$)|at([^e]|$))))?)+rate
I vaguely recall POSIX character classes being a bit persnickety. You may need to alter the character classes in that regex if they're not already POSIX-compliant.
The fact that you don't care about the contents of the substrings opens up a lot of options. For example, to add to Bob Lied's suggestion — even if '!' can occur in the input, you can start by changing it to something else:
sed -e 's/!/./g' -e 's/rate/!/g' -e 's/[^!]\+/---/g' -e 's/!/rate/g' <input >output
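On the sample string this produces the requested output (note that \+ is a GNU sed extension; [^!][^!]* is the portable spelling):
echo 'anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate' | sed -e 's/!/./g' -e 's/rate/!/g' -e 's/[^!]\+/---/g' -e 's/!/rate/g'
---rate---rate---rate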
With awk:
awk -Frate '{
for (i = 0; ++i <= NF;)
$i = (i == 1 || i == NF) && $i == x ? x : "---"
}1' OFS=rate infile
Or, awk 'BEGIN {OFS=FS="rate"} {for (i=1; i<=NF-1; i++) {$i = "---"}; print}'
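For example, piping the sample line from the question through the second version:
echo 'anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate' | awk 'BEGIN {OFS=FS="rate"} {for (i=1; i<=NF-1; i++) {$i = "---"}; print}'
---rate---rate---rate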
I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+). Either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, if you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains any nested markup (such as a list that contains a list), then you can use awk, but you need to write a whole awk program that first finds lines containing list elements and then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Starting from your script, and assuming the <li> and <a> tags are on one line, you can get what you want with:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
The first one is for every awk; the second one is for GNU awk.
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism in place for pulling sub-fields out of matches, etc. (GNU awk does; see the sketch after this list.)
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
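That said, if GNU awk is available, its match() extension with a third array argument can pull out the captured group directly. A minimal sketch of my own, assuming (as in the simple case above) that each <li><a>...</a> pair sits on one line:
gawk 'match($0, /<li[^>]*><a[^>]*>([^<]+)<\/a>/, m) { print m[1] }' cities.html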
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ $text =~ /<li>(.*?)<\/li>(.*)/; $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.