gawk regex to find any record having characters other than those specified by the character class in the regex pattern

I have a list of email addresses in a text file, and a pattern with character classes that specifies which characters are allowed in the email addresses.
From that input file, I want to find only the email addresses that contain characters other than the allowed ones.
I am trying to write a gawk command for this, but I can't get it to work properly.
Here is the gawk command I am trying:
gawk -F "," ' $2!~/[[:alnum:]#\.]]/ { print "has invalid chars" }' emails.csv
The problem is that the above gawk command only matches records that have NONE of the alphanumeric, # and . (dot) characters in them. What I am looking for is records that contain the allowed characters along with disallowed ones.
For example, the above command would find
"_-()&(()%"
as it contains only characters that are not in the regex pattern, but will not find
"abc-123#xyz,com"
as it also contains characters that are present in the character classes specified in the regex pattern.

How about several tests together: contains an alnum and an # and a dot and an invalid character
$2 ~ /[[:alnum:]]/ && $2 ~ /#/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]#.]/

Your regex is wrong here:
/[[:alnum:]#\.]]/
It should be:
/[[:alnum:]#.]/
Note the removal of the extra ] from the end.
Test Case:
# regex with extra ]
awk -F "," '{print ($2 !~ /[[:alnum:]#.]]/)}' <<< 'abc,ab#email.com'
1
# correct regex
awk -F "," '{print ($2 !~ /[[:alnum:]#.]/)}' <<< 'abc,ab#email.com'
0

Do you really care whether the string has a valid character? If not (and it seems like you don't), the simple solution is
$2 ~ /[^[:alnum:]#.]/{ print "has invalid chars" }
That won't trigger on an empty string, so you might want to add a test for that case.
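A minimal sketch of that idea with the empty-string test added. The helper name classify_field and its three output labels are made up for illustration, and it checks a single value passed as an argument rather than the second field of emails.csv:

```shell
# classify_field VALUE  (hypothetical helper, not from the original post)
classify_field() {
    printf '%s\n' "$1" | awk '
        $0 == ""          { print "empty"; next }              # nothing to check
        /[^[:alnum:]#.]/  { print "has invalid chars"; next }  # any char outside the allowed set
                          { print "ok" }                       # only allowed chars
    '
}

classify_field 'abc-123#xyz,com'
classify_field 'ab#email.com'
```

The negated bracket expression fires as soon as one disallowed character is present, regardless of how many allowed characters surround it.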

Your question would REALLY benefit from some concise, testable sample input and expected output as right now we're all guessing at what you want but maybe this does it?
awk -F, '{r=$2} gsub(/[[:alnum:]#.]/,"",r) && (r!="") { print "has invalid chars" }' emails.csv
e.g. using the 2 input examples you provided:
$ cat file
_-()&(()%
abc-123#xyz,com
$ awk '{r=$0} gsub(/[[:alnum:]#.]/,"",r) && (r!="") { print $0, "has invalid chars" }' file
abc-123#xyz,com has invalid chars
There are more accurate email regexps btw, e.g.:
\<[[:alnum:]._%+-]+#[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>
which is a gawk-specific (for word delimiters \< and \>) modification of the one described at http://www.regular-expressions.info/email.html after updating to use POSIX character classes.
If you are trying to validate email addresses do not use the regexp you started with as it will declare # and 7 to each be valid email addresses.
See also How to validate an email address using a regular expression? for more email regexp details.

Related

what is '/[A-Z]/ s| |/|gp' meaning?

I am reading a sed tutorial at https://riptutorial.com/sed/example/13753/lines-matching-regular-expression-pattern.
Looks like
$ sed -n '/[A-Z]/ s| |/|gp' ip.txt
is filtering 'Add Sub Mul Div' out of the file and converting it to 'Add/Sub/Mul/Div'.
I really don't understand the regex considering I just read https://www.tldp.org/LDP/abs/html/x23170.html.
It does not even match the print syntax which is:
[address-range]/p
and is the pipe sign '|' here alternation?
Could anyone explain:
'/[A-Z]/ s| |/|gp'
in English?
Edit
I also found that the extra space before 's' and after '/' is allowed and does nothing. The correct syntax should be:
[address-range]/s/pattern1/pattern2/
The syntax checking of sed patterns is not strict, which is confusing.
The -n option turns off automatic printing.
sed allows commands to be qualified with an address filter, which can be a regex or line addresses.
For example, /foo/ d will delete lines containing foo
and /foo/ s/baz/123/ will change baz to 123 only if the line also contains foo.
/[A-Z]/ matches only lines containing at least one uppercase letter.
If such a line is matched:
s| |/|gp performs this substitution (every space becomes /) and prints the result.
The s command allows delimiters other than / too (see Using different delimiters in sed commands and range addresses).
In this case, using | allows you to use / as a normal character instead of having to escape it.
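To see the address and the alternate delimiter at work, here is a small sketch. The function names are invented for this example, and LC_ALL=C is only there so that [A-Z] means plain ASCII uppercase in every locale:

```shell
# Same command as in the question: address /[A-Z]/, then s with | as delimiter.
slash_join_upper() {
    LC_ALL=C sed -n '/[A-Z]/ s| |/|gp' "$@"
}

# Equivalent form with the default / delimiter; the replacement / must be escaped.
slash_join_upper_escaped() {
    LC_ALL=C sed -n '/[A-Z]/ s/ /\//gp' "$@"
}

printf '%s\n' 'Add Sub Mul Div' 'no capitals here' | slash_join_upper
```

Only the first input line contains an uppercase letter, so only that line is substituted and printed.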

Gawk - Regexp - unable to get results

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To further elaborate, if the values in Column 1/Field 1 are Vijay, Viijay and Vijayini, I want only Viijay to be returned. That is, only values where a character ("i" in the example here) is repeated consecutively (not "recurring" as in Vijayini, where the character "i" recurs in the string but the occurrences are not adjacent).
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2
As awk regex doesn't support backreferences in matching, you need to find the duplicated characters some other way. This one doubles every character in $1 and adds the pairs to a regex that is then matched against the original string, i.e. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (notice that it does not test whether the entry is already in re; you might want to consider adding some checking, with more checking considerations in the comments):
$ awk '
{                                                  # you should test for empty $1
    re="("                                         # reset re
    for(i=1;i<=length($1);i++)                     # for each char in $1
        re=re (i==1?"":"|") (b=substr($1,i,1)) b   # generate duplicated re entry
    re=re ")"                                      # terminating )
    if($1~re)                                      # match
        print                                      # and print if needed
}' file
Output:
Viijay 2
Ironically or exemplarily, it fails on Busybox awk, in which backreferences can be used:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2
Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
That might be GNU-only, google to check.
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters as that's the only character that has a different meaning than literal when it's the first character in a bracket expression (i.e. negation), so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ differently because [\] means the same as [] which is an unterminated bracket expression: [ is the start but ] is just the first character inside the bracket expression, not the ] needed to terminate it.
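If building a regexp from the data feels fragile, another option (not taken from the answers above, just a sketch) is to skip regexps entirely and compare each character of $1 with its left neighbour. The function name has_adjacent_repeat is hypothetical:

```shell
# Print records whose first field contains the same character twice in a row.
has_adjacent_repeat() {
    awk '{
        for (i = 2; i <= length($1); i++)
            if (substr($1, i, 1) == substr($1, i - 1, 1)) {
                print       # found a doubled character
                next        # one hit per record is enough
            }
    }' "$@"
}

printf '%s\n' 'Vijay 1' 'Viijay 2' 'Vijayini 3' | has_adjacent_repeat
```

Because no regexp is generated, characters such as ^ or \ in the data need no special handling.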

Regex: find elements regardless of order

If I have the string:
geo:FR, host:www.example.com
(In reality the string is more complicated and has more fields.)
And I want to extract the "geo" value and the "host" value, I am facing a problem when the order of the keys change, as in the following:
host:www.example.com, geo:FR
I tried this line:
sed 's/.*geo:\([^ ]*\).*host:\([^ ]*\).*/\1,\2/'
But it only works on the first string.
Is there a way to do it in a single regex, and if not, what's the best approach?
I suggest extracting each text you need with a separate sed command:
s="geo:FR, host:www.example.com"
host="$(sed -n 's/.*host:\([^[:space:],]*\).*/\1/p' <<< "$s")"
geo="$(sed -n 's/.*geo:\([^[:space:],]*\).*/\1/p' <<< "$s")"
See the online demo, echo "$host and $geo" prints
www.example.com and FR
for both inputs.
Details
-n suppresses line output and p prints the matches
.* - matches any 0+ chars up to the last...
host: - host: substring and then
\([^[:space:],]*\) - captures into Group 1 any 0 or more chars other than whitespace and a comma
.* - the rest of the line.
The result is just the contents of Group 1 (see \1 in the replacement pattern).
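The two commands can be folded into one small helper to check that the key order really doesn't matter. get_field is a made-up name, and the sketch assumes the key contains no characters that are special in a sed regex:

```shell
# get_field KEY STRING: print the value following "KEY:" in STRING.
get_field() {
    printf '%s\n' "$2" | sed -n "s/.*$1:\([^[:space:],]*\).*/\1/p"
}

get_field geo  'geo:FR, host:www.example.com'
get_field geo  'host:www.example.com, geo:FR'
get_field host 'host:www.example.com, geo:FR'
```

Note that the leading .* is greedy, so if a key occurs more than once (or as the tail of a longer key), the last occurrence is the one extracted.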
Whenever you have tag/name to value pairs in your input I find it best (clearest, simplest, most robust, easiest to enhance, etc.) to first create an array that contains that mapping (f[] below) and then you can simply access the values by their tags:
$ cat file
geo:FR, host:www.example.com
host:www.example.com, geo:FR
foo:bar, host:www.example.com, stuff:nonsense, badgeo:uhoh, geo:FR, nastygeo:wahwahwah
$ cat tst.awk
BEGIN { FS=":|, *"; OFS="," }
{
    for (i=1; i<=NF; i+=2) {
        f[$i] = $(i+1)
    }
    print f["geo"], f["host"]
}
$ awk -f tst.awk file
FR,www.example.com
FR,www.example.com
FR,www.example.com
The above will work using any awk in any shell on every UNIX box.
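One thing to watch with this kind of tag-to-value map: the array persists between records, so a line that lacks a tag would print the value left over from an earlier line. A hedged variant that clears the map per record (kv_extract is an invented name; split("", f) is used as a widely portable way to empty an array):

```shell
# Print geo,host for each line, with the map reset for every record.
kv_extract() {
    awk 'BEGIN { FS = ":|, *"; OFS = "," }
    {
        split("", f)                   # empty the map left over from the previous record
        for (i = 1; i <= NF; i += 2)
            f[$i] = $(i + 1)           # tag -> value
        print f["geo"], f["host"]
    }' "$@"
}

printf '%s\n' 'geo:FR, host:a.example' 'host:b.example' | kv_extract
```

For the second input line, which has no geo tag, the first output column is empty instead of repeating FR.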
Here I've used GNU Awk to convert your delimited key:value pairs to valid shell assignment. With Bash, you can load these assignments into your current shell using <(process substitution):
# source the file descriptor generated by proc sub
. <(
# use comma-space as field separator, literal apostrophe as variable q
awk -F', ' -vq=\' '
# change every foo:bar in line to foo='bar' on its own line
{for(f=1;f<=NF;f++) print gensub(/:(.*)/, "=" q "\\1" q, 1, $f)}
# use here-string to load text; remove everything but first quote to use standard input
' <<< 'host:www.example.com, geo:FR'
)

AWK to match strings beginning with a number

I want to print all the lines of a file where the first element of each line begins with a number using awk. Below are the details on the data contained in the file and command used:
filename contents:
12.44.4444goad ABCDEF/END
LMNOP/START joker
98.0 kites
command used:
awk '{ $1 ~ /^\d[a-zA-Z0-9]*/ }' filename
After running the above command, no results are displayed on the prompt.
Please let me know if there is any correction that needs to be made to the above command.
To print the lines starting with a digit, you can try the following:
awk '/^[[:digit:]]+/' file
As pointed out by @HenkLangeveld, your syntax is incorrect. Also, the regex \d is not available in awk.
If you only need to match at least one digit at the start of the line, all you need is ^ to match the start of a line and [0-9] to match a digit.
You can use curly brackets with an if statement:
awk '{if($1 ~ /^[0-9]/) print $0}' filename
But that would just be longhand for this:
awk '$1 ~ /^[0-9]/' filename
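Run against the three sample lines from the question, the short form behaves like this (the function wrapper starts_with_digit is only there to make the sketch reusable, it is not part of the answer):

```shell
# Print records whose first field begins with a digit.
starts_with_digit() {
    awk '$1 ~ /^[0-9]/' "$@"
}

printf '%s\n' '12.44.4444goad ABCDEF/END' 'LMNOP/START joker' '98.0 kites' | starts_with_digit
```

The pattern has no action block, so awk falls back to printing the whole record for every line where the match succeeds; the LMNOP/START line is skipped.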
From your attempted solution, it looks like you want:
awk 'NF>1 && $1 ~ /^[0-9.]*$/' filename
You need to explicitly match the . if you want to include the decimal point, and you need the $ anchor to make the * meaningful. This will miss lines in which the first column looks like 5e39 or -2.3. You can try to catch those cases with:
awk 'NF>1 && $1 ~ /^-?[0-9.]*(e[0-9]+)?$/' filename
but at this point I would tell you to use perl and stop trying to be more robust with awk.
Perhaps (this will print blank lines...not sure which behavior you want):
perl -lane 'use POSIX qw(strtod); my ($num, $end) = strtod($F[0]);
print unless $end;' filename
This uses strtod to parse the number and tells you the number of characters at the end of the string that are not part of it.
Drop the braces and the \d, like this:
awk ' $1 ~ /^[0-9]/ ' filename
Awk programs come in chunks. A chunk is a pattern-block pair, where the block defaults to { print }. (An empty pattern defaults to true.)
The /\d/ is a perl-ism and might work in some versions of awk, but not in those that I tried*. You need either the traditional /^[0-9]/ or the POSIX /^[[:digit:]]/ notation.
* gnu and ast

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
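Wrapped in a function, this slots straight into the myvalue=$( ... ) form the question asks for. extract_num is a hypothetical name, and the sample input is piped in instead of being read from example.txt:

```shell
# Print only the digits found between abc and xyz, one result per matching line.
extract_num() {
    sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' "$@"
}

myvalue=$(printf '%s\n' a b c abc12345xyz a b c | extract_num)
echo "$myvalue"
```

The -n plus p combination is what keeps the non-matching lines (a, b, c) out of the result.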
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
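The greedy behaviour is easy to see side by side. In this sketch (both function names are invented) the first command lets .* swallow the digits, while the second uses a negated character class so the group gets them:

```shell
# Leading .* is greedy: it takes the whole line and leaves the group empty.
greedy_extract() {
    sed -e 's/.*\([0-9]*\).*/\1/' "$@"
}

# A negated class stops at the first digit, so the group captures the number.
negated_class_extract() {
    sed -e 's/^[^0-9]*\([0-9]*\).*/\1/' "$@"
}

printf '%s\n' abc12345xyz | greedy_extract
printf '%s\n' abc12345xyz | negated_class_extract
```

The first call prints an empty line and the second prints the number, which is the same effect the three -e expressions above achieve by trimming from both ends.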
You can use awk with match() to access the captured group (the third, array argument is a GNU awk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
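Where gawk is not available, a similar result can be had from the RSTART and RLENGTH variables that every POSIX awk sets after match(). This is a sketch under the assumption that the prefix and suffix are the fixed three-character strings abc and xyz; extract_between is a made-up name:

```shell
# Print the digits between abc and xyz using only POSIX awk features.
extract_between() {
    awk 'match($0, /abc[0-9]+xyz/) {
        # the match spans abc + digits + xyz, so drop 3 chars at each end
        print substr($0, RSTART + 3, RLENGTH - 6)
    }' "$@"
}

printf '%s\n' a b c abc12345xyz a b c | extract_between
```

With a variable-length prefix or suffix the offsets would have to be computed differently, which is where the array form of match() is more convenient.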
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then one way to use gawk and components of a regex is the gensub function.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex match (between the //), so you need to put the .*abc before and the xyz.* after the ([0-9]+) to get rid of the text before and after the number in the substitution. A bare .* in front of the group would be greedy and leave only the last digit for the group to capture.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '^[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
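For comparison, the same multi-match extraction can be done in a single awk pass by calling match() in a loop and chopping off what has already been consumed. This is only a sketch: find_all_nums is an invented name and the fixed offsets assume the literal abc/xyz delimiters from the example:

```shell
# Print every number that sits between abc and xyz, several per line if present.
find_all_nums() {
    awk '{
        s = $0
        while (match(s, /abc[0-9]+xyz/)) {
            print substr(s, RSTART + 3, RLENGTH - 6)   # digits only
            s = substr(s, RSTART + RLENGTH)            # continue after this match
        }
    }' "$@"
}

printf '%s\n' a abc12345xyz 'abc23451xyz asdf abc34512xyz' c | find_all_nums
```

Whether this reads better than the two-pass grep -o pipeline is a matter of taste; it mainly helps when grep -o is not available.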
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
If you also run OFS = ""; $1 = $1; then, instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, the fields of $0 follow a data1-sep1-data2-sep2-... pattern, while $0 itself looks EXACTLY the same as when you first read in the line. A straight print will be byte-for-byte identical to printing immediately upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
You can do it with the shell:
while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            t="${line##*abc}"
            echo "num is ${t%%xyz*}";;
    esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
    print $0;
    next;
}
{
    # default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file