Gawk - Regexp - unable to get results - regex

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To further elaborate, if the values in Column 1/Field 1 are Vijay, Viijay and Vijayini, i want only Viijay to be returned. That is, only values where a character ("i" in the example here) is repeated (not "recurring" as in Vijayini where the character "i" is recurring in the string but not clustered together.)
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2

As awk regex doesn't support backreferences in matching, you need to find the duplicated characters some other way. This one duplicates every character in $1 and adds them to a variable which is then matched against the original string in, ie. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (notice, that it does not test if the entry is already in re, you might want to consider adding some checking, more checking considerations in the comments):
$ awk '
{ # you should test for empty $1
re="(" # reset re
for(i=1;i<=length($1);i++) # for each char in $1
re=re (i==1?"":"|") (b=substr($1,i,1)) b # generate dublicated re entry
re=re ")" # terminating )
if($1~re) # match
print # and print if needed
}' file
Output:
Viijay 2
Ironically or exemplarily it fails on Busybox awk—in which the backreferences can be used Ɑ:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2

Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
That might be GNU-only, google to check.
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters as that's the only character that has a different meaning than literal when it's the first character in a bracket expression (i.e. negation) so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ different because [\] means the same as [] which is an unterminated bracket expression because [ is the start but ] is just the first character inside the bracket expression, it's not the ] needed to terminate it.

Related

Regex: find elements regardless of order

If I have the string:
geo:FR, host:www.example.com
(In reality the string is more complicated and has more fields.)
And I want to extract the "geo" value and the "host" value, I am facing a problem when the order of the keys change, as in the following:
host:www.example.com, geo:FR
I tried this line:
sed 's/.\*geo:\([^ ]*\).\*host:\([^ ]*\).*/\1,\2/'
But it only works on the first string.
Is there a way to do it in a single regex, and if not, what's the best approach?
I suggest extracting each text you need with a separate sed command:
s="geo:FR, host:www.example.com"
host="$(sed -n 's/.*host:\([^[:space:],]*\).*/\1/p' <<< "$s")"
geo="$(sed -n 's/.*geo:\([^[:space:],]*\).*/\1/p' <<< "$s")"
See the online demo, echo "$host and $geo" prints
www.example.com and FR
for both inputs.
Details
-n suppresses line output and p prints the matches
.* - matches any 0+ chars up the last...
host: - host: substring and then
\([^[:space:],]*\) - captures into Group 1 any 0 or more chars other than whitespace and a comma
.* - the rest of the line.
The result is just the contents of Group 1 (see \1 in the replacement pattern).
Whenever you have tag/name to value pairs in your input I find it best (clearest, simplest, most robust,, easiest to enhance, etc.) to first create an array that contains that mapping (f[] below) and then you can simply access the values by their tags:
$ cat file
geo:FR, host:www.example.com
host:www.example.com, geo:FR
foo:bar, host:www.example.com, stuff:nonsense, badgeo:uhoh, geo:FR, nastygeo:wahwahwah
$ cat tst.awk
BEGIN { FS=":|, *"; OFS="," }
{
for (i=1; i<=NF; i+=2) {
f[$i] = $(i+1)
}
print f["geo"], f["host"]
}
$ awk -f tst.awk file
FR,www.example.com
FR,www.example.com
FR,www.example.com
The above will work using any awk in any shell on every UNIX box.
Here I've used GNU Awk to convert your delimited key:value pairs to valid shell assignment. With Bash, you can load these assignments into your current shell using <(process substitution):
# source the file descriptor generated by proc sub
. < <(
# use comma-space as field separator, literal apostrophe as variable q
awk -F', ' -vq=\' '
# change every foo:bar in line to foo='bar' on its own line
{for(f=1;f<=NF;f++) print gensub(/:(.*)/, "=" q "\\1" q, 1, $f)}
# use here-string to load text; remove everything but first quote to use standard input
' <<< 'host:www.example.com, geo:FR'
)

Replace non-alphanumeric characters in substring

I am trying to replace any non-alphanumeric characters present in the first part (before the = sign) of a bunch of key value pairs, by a _:
Input
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
Desired Output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
I have tried patterns such as: s/\([^a-zA-Z]*\)=\(.*\)/\1=\2/g without much success. Any basic GNU/Linux tools can probably be used.
With awk
$ awk -F= -v OFS='=' '{gsub("[^a-zA-Z]", "_", $1)} 1' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
Input and output field separators are set to = and then gsub("[^a-zA-Z]", "_", $1) will substitute all non-alphabet characters with _ only for first field
With perl
$ perl -pe 's/^[^=]+/$&=~s|[^a-z]|_|gir/e' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
^[^=]+ non = characters from start of line
$&=~s|[^a-z]|_|gir replace non-alphabet characters with _ only for the matched portion
Use perl -i -pe for inplace editing
Assuming your input is in a file called infile, you could do this:
while IFS== read key value; do
printf '%s=%s\n' "${key//[![:alnum:]]/_}" "${value}"
done < infile
with the output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
This sets the IFS variable to = and reads your key/value pairs line by line into a key and a value variable.
The printf command prints them and adds the = back in; "${key//[![:alnum:]]/_}" substitutes all non-alphanumeric characters in key by an underscore.
Any Posix compliant awk
$ cat f
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
$ awk 'BEGIN{FS=OFS="="}gsub(/[^[:alnum:]]/,"_",$1)+1' f
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
Explanation
BEGIN{FS=OFS="="} Set input and Output field separator =
/[^[:alnum:]]/ Match a character not present in the list,
[:alnum:] matches a alphanumeric character [a-zA-Z0-9]
gsub(REGEXP, REPLACEMENT, TARGET)
This is similar to the sub function, except gsub replaces
all of the longest, leftmost, nonoverlapping matching
substrings it can find. The g in gsub stands for global, which means replace everywhere,The gsub function returns the number
of substitutions made
+1 It takes care of default operation {print $0} whenever gsub returns 0
Thought I would throw a little ruby in:
ruby -pe '$_.sub!(/[^=]+/){|m| m.gsub(/[^[:alnum:]]/,"_")}'

gawk regex to find any record having characters other then the specified by character class in regex pattern

I have list of email addresses in a text file. I have a pattern having character classes that specifies what characters are allowed in the email addresses.
Now from that input file, I want to only search the email addresses that has the characters other than the allowed ones.
I am trying to write a gawk for the same, but not able to get it to work properly.
Here is the gawk that I am trying:
gawk -F "," ' $2!~/[[:alnum:]#\.]]/ { print "has invalid chars" }' emails.csv
The problem I am facing is that the above gawk command only matches the records that has NONE of the alphanumeric, # and . (dot) in them. But what I am looking for is the records that are having the allowed characters but along with them the not-allowed ones as well.
For example, the above command would find
"_-()&(()%"
as the above only has the characters not in regex pattern, but will not find
"abc-123#xyz,com"
. as it also has the characters that are present in specified character classes in regex pattern.
How about several tests together: contains an alnum and an # and a dot and an invalid character
$2 ~ /[[:alnum:]]/ && $2 ~ /#/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]#.]/
Your regex is wrong here:
/[[:alnum:]#\.]]/
It should be:
/[[:alnum:]#.]/
Not removal of an extra ] fron end.
Test Case:
# regex with extra ]
awk -F "," '{print ($2 !~ /[[:alnum:]#.]]/)}' <<< 'abc,ab#email.com'
1
# correct regex
awk -F "," '{print ($2 !~ /[[:alnum:]#.]/)}' <<< 'abc,ab#email.com'
0
Do you really care whether the string has a valid character? If not (and it seems like you don't), the simple solution is
$2 ~ /[^[:alnum:]#.]/{ print "has invalid chars" }
That won't trigger on an empty string, so you might want to add a test for that case.
Your question would REALLY benefit from some concise, testable sample input and expected output as right now we're all guessing at what you want but maybe this does it?
awk -F, '{r=$2} gsub(/[[:alnum:]#.]/,"",r) && (r!="") { print "has invalid chars" }' emails.csv
e.g. using the 2 input examples you provided:
$ cat file
_-()&(()%
abc-123#xyz,com
$ awk '{r=$0} gsub(/[[:alnum:]#.]/,"",r) && (r!="") { print $0, "has invalid chars" }' file
abc-123#xyz,com has invalid chars
There are more accurate email regexps btw, e.g.:
\<[[:alnum:]._%+-]+#[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>
which is a gawk-specific (for word delimiters \< and \>) modification of the one described at http://www.regular-expressions.info/email.html after updating to use POSIX character classes.
If you are trying to validate email addresses do not use the regexp you started with as it will declare # and 7 to each be valid email addresses.
See also How to validate an email address using a regular expression? for more email regexp details.

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character. {5} means five of the preceding. [[:alnum:]] is unicode-safe.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ an $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ 'BEGIN{OFS=FS}$2==""&&$6~/[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/&&NF=8'
The awk I worked it out on didn't support the {5} regexp frob.
You can use -P option for extended perl support like
grep -P "^(?:[^/]*/){5}[A-Za-z0-9]{5}/(?:/|$)" input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^ #Start of line
(?: #Non capturing group
[^/]* #Match anything except /
/ #Match / literally
){5} #Repeat this 5 times
[A-Za-z0-9]{5} #Match alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?: #Non capturing group
/ #Next character should be /
| #OR
$ #End of line
)
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak in place edit creating a test.in.bak backup file, -n quiet, do not print non-matches to output
and ".../p" print matches.

regex match square brackets once

There is a text file with the following info:
[[parent]]
[son]
[daughter]
How to get only [son] and [daughter]?
$0 ~ /\[([a-z])*\]/ ???
Your regex is almost right. Just put the * inside the round brackets (in order to have the whole text inside the only group) and remember to use the ^ and $ delimiters (to avoid matching [[parent]]):
^\[([a-z]*)\]$
Match any square bracket at beginning of line where the next character is an alphabetic.
awk '/^\[[a-z]/' file
You might want to add uppercase and/or numbers to the character class, depending on what your real requiements are. (Your examples show only lowercase, so I have assumed that's a valid generalization.)
You can use this awk command:
awk -F '[][]+' 'NF && !/\[\[/{print $2}' file
son
daughter
awk command breakup:
-F '[][]+' # set input field separator as 1 or more of [ or ]
NF # only if at least one field is found
!/\[\[/ # when input doesn't start with [[