How to check last 3 chars of a string are alphabets or not using awk? - regex

I want to check if the last 3 letters in column 1 are alphabets and print those rows. What am I doing wrong?
My code :-
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output :
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script:
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're using $str, which looks like shell variable syntax; in awk the variable is just str, and $str means the field whose number is the value of str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
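For comparison, here is a minimal sketch of the substr() approach from the question with those four points addressed. It assumes the sample rows are piped in instead of read from file, and it repeats the character class three times instead of using {3}, since not every awk supports interval expressions.

```shell
# Sample rows from the question, piped in so the sketch needs no input file.
result=$(printf '%s\n' \
  '12300USD|0392' 'abc56eur|97834' '238aed|23911' 'aabccde|38731' \
  '73716yen|19287' '.*/|982376' '0NRT0|928731' |
awk -F'|' '
  # Runs for every line: keep the last 3 characters of column 1.
  { str = substr($1, length($1) - 2) }
  # A pattern without an action prints the line when the pattern is true.
  length($1) >= 3 && str ~ /^[[:alpha:]][[:alpha:]][[:alpha:]]$/
')
printf '%s\n' "$result"
```

The length check is there because substr() with a start position below 1 returns the whole string, so a 1 or 2 letter column would otherwise be compared as is.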

Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the string.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
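Nothing in that pattern needs Perl syntax, so a sketch with plain extended regexes (-E) and a POSIX character class should behave the same on the sample data. The input is piped in here only to keep the example self-contained.

```shell
# Same idea with -E instead of -P, and [[:alpha:]] instead of [A-Za-z].
result=$(printf '%s\n' \
  '12300USD|0392' 'abc56eur|97834' '238aed|23911' 'aabccde|38731' \
  '73716yen|19287' '.*/|982376' '0NRT0|928731' |
grep -E '^[^|]*[[:alpha:]]{3}[|]')
printf '%s\n' "$result"
```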

sed -n '/^[^|]*[[:alpha:]][[:alpha:]][[:alpha:]]|/p' file
grep '^[^|]*[[:alpha:]][[:alpha:]][[:alpha:]]|' file

{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

Related

grep regex how to get only results with one preceding word?

My string is :
www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
I am trying to get results only with "one" word before texas.com. Expectation when I do a regex grep :
mail.texas.com
www2.texas.com
So mail & www2 are the "one" word that I'm talking about. I tried :
grep "*.texas.com", but I get all of them in results. Can someone please help ?
You can use
grep '^[^.]*\.texas\.com'
Details:
^ - start of string
[^.]* - zero or more chars other than a . char
\.texas\.com - .texas.com string (literal . char must be escaped in the regex pattern).
See the online demo:
#!/bin/bash
s='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com'
grep '^[^.]*\.texas\.com' <<< "$s"
Output:
mail.texas.com
www2.texas.com
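Note that the pattern is anchored only at the start, so a longer name that merely begins with one word plus .texas.com would also match. A small sketch with $ added, where mail.texas.com.au is a made-up entry used only to show the difference:

```shell
hosts='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
mail.texas.com.au'   # last entry is hypothetical, added for illustration

# $ requires the line to end right after .texas.com
result=$(printf '%s\n' "$hosts" | grep '^[^.]*\.texas\.com$')
printf '%s\n' "$result"
```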
With awk:
awk 'BEGIN{FS=OFS="."} /\.texas\.com$/ && NF==3' file
Output:
mail.texas.com
www2.texas.com
Set one dot as input and output field separator, check for texas.com at the end ($) of your line and check for three fields.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
With your shown samples, please try following awk code.
awk -F'.' 'NF==3 && $2=="texas" && $3=="com"' Input_file
Explanation: Set the field separator to . for all lines. Then check that the number of fields in the current line (NF) is 3, that the 2nd field is texas and that the 3rd field is com. If all 3 conditions are met, the line is printed.

Regex, select the line that starts with my condition, but take only the characters after space

I have a file that has content similar to the below:
ptrn: 435324kjlkj34523453
Note1: rtewqtiojdfgkasdktewitogaidfks
Note2: t4rwe3tewrkterqwotkjrekqtrtlltre
I am trying to get the characters after the space on the line that starts with "ptrn:". I am trying the command below:
>>> cat daily.txt | grep '^p.*$' > dailynew.txt
and I am getting the result in the new file:
ptrn: 435324kjlkj34523453
But I want only the characters after space, which are " 435324kjlkj34523453" to be written in the new file without "ptrn:" at the beginning.
The result should be like:
435324kjlkj34523453
How can I achieve this with an efficient regex?
You can use
grep -oP '^ptrn:\s*\K.*' daily.txt > dailynew.txt
awk '/^ptrn:/{print $2}' daily.txt > dailynew.txt
sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p' daily.txt > dailynew.txt
See the online demo. All output 435324kjlkj34523453.
In the grep PCRE regex (enabled with -P option) the patterns match
^ - the start of string
ptrn: - a ptrn: substring
\s* - zero or more whitespaces
\K - match reset operator that clears the current match value
.* - the rest of the line.
In the awk command, ^ptrn: regex is used to find the line starting with ptrn: and then {print $2} prints the value after the first whitespace, from the second "column" (since the default field separator in awk is whitespace).
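Because {print $2} prints only the second whitespace-separated field, a value that itself contains spaces would be cut at the first one. A sketch that strips the prefix and prints the rest of the line instead; the trailing "extra words" in the sample is hypothetical, added only to show the difference:

```shell
result=$(printf '%s\n' \
  'ptrn: 435324kjlkj34523453 extra words' \
  'Note1: rtewqtiojdfgkasdktewitogaidfks' |
awk '/^ptrn:/ {
  sub(/^ptrn:[[:space:]]*/, "")   # delete the label and the spaces after it
  print                           # print what is left of the line
}')
printf '%s\n' "$result"
```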
In sed, the command means
-n - suppresses the default line output
s - substitution command is used
^ptrn:[[:space:]]*\(.*\) - start of string, ptrn:, zero or more whitespace, and the rest of the line captured into Group 1
\1 - replaces the match with group 1 value
p - prints the result of the substitution.
You can use this sed:
sed -nE 's/^ptrn: (.*)/\1/p' file > output_file.txt

How to match and cut the string with different conditions using sed?

I want to grep the string which comes after WORK= and ignore it if parentheses come after that string.
The text looks like this :
//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU
So, desirable output should print only :
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
So far , I could just match and cut before WORK= but could not remove WORK= itself:
sed -E 's/(.*)(WORK=.*)/\2/'
I am not sure how to continue . Can anyone help please ?
You can use
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' file > newfile
Details:
-n - suppresses the default line output
/WORK=.*([^()]*)/! - if a line contains a WORK= followed with any text and then a (...) substring skips it
s/.*WORK=\([^,]*\).*/\1/p - else, takes the line and removes all up to and including WORK=, and then captures into Group 1 any zero or more chars other than a comma, and then remove the rest of the line; p prints the result.
See the sed demo:
s='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU'
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' <<< "$s"
Output:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
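The same rule can also be written field by field, which some may find easier to read than a single regex. This is only a sketch under the assumption that parameters are separated by commas or spaces, as in the sample; the array name part is arbitrary.

```shell
s='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU'

result=$(printf '%s\n' "$s" | awk '{
  n = split($0, part, /[ ,]/)            # break the line on spaces and commas
  for (i = 1; i <= n; i++)
    if (part[i] ~ /^WORK=/ && part[i] !~ /[()]/)
      print substr(part[i], 6)           # drop the 5 characters of WORK=
}')
printf '%s\n' "$result"
```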
Could you please try following awk, written and tested with shown samples in GNU awk.
awk '
match($0,/WORK=[^,]*/){
val=substr($0,RSTART+5,RLENGTH-5)
if(val!~/\([a-zA-Z]+\)/){ print val }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/WORK=[^,]*/){ ##Using match function to match WORK= till comma comes.
val=substr($0,RSTART+5,RLENGTH-5) ##Creating val with sub string of match regex here.
if(val!~/\([a-zA-Z]+\)/){ print val } ##checking if val does not contain ( alphabets ) then print val here.
}
' Input_file ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -n '/.*WORK=\([^,]\+\).*/{s//\1/;/(.*)/!p}' file
Extract the string following WORK= and if that string does not contain (...) print it.
This will work if there is only zero or one occurrence of WORK= and that the exclusion depends only on the (...) occurring within that string and not other following fields.
For a global solution with the same stipulations for parens:
sed -n '/WORK=\([^,]\+\)/{s//\n\1\n/;s/[^\n]*\n//;/(.*).*\n/!P;D}' file
N.B. This prints each such string on a separate line and excludes empty strings.

Remove a specific part of string split by multiple-character delimiter in bash

I'm trying to process my string and steps as below:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
And I want to remove a part of string (and delimiter) that match myvar after split by delimiter \n, the result should be
\\[[123 one (/)\n\\[[789 three (/)
But I still have not found a solution.
If the delimiter is a single character like :, it can be done with a sed command:
mytext2='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar2=456
echo "$mytext2" | sed -E "s/[^:]*${myvar2}[^:]*(:|$)//g"
Result as expected: \\[[123 one (/):\\[[789 three (/)
How can this be done if the delimiter consists of multiple characters, as in this case?
Thanks.
Even if you use sed -E, you still lack some regex features, such as lookarounds ((?<=...), (?!...), (?=...)) and lazy quantifiers (*?). I suggest you use perl, whose regular expressions support them.
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)';
myvar=456;
echo $mytext | perl -pe "s/(?<=\\\n).*${myvar}.*?(\\\n|$)//g";
Details:
(?<=\\\n): string starts after \n. \\ escape character \
.*${myvar}.*?(\\\n|$): get string which contains value of variable myvar and ends with \n or end of line.
Result.
\\[[123 one (/)\n\\[[789 three (/)
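If you can find a single character that does not occur in the text, for example :, you can first replace the delimiter with it. Since the delimiter here is a literal backslash followed by n and not a real newline character, use sed (not tr) for that replacement:
echo "$mytext" | sed 's/\\n/:/g' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g"
The full script looks like this
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
v=$(echo "$mytext" | sed 's/\\n/:/g' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g")
echo "$v"
The result uses : as the delimiter; pipe it through sed 's/:/\\n/g' to restore the original one. The same first step works for any other multiple-character delimiter, such as abcd.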
I suggest awk in this case since you may specify the literal multichar delimiter pattern as the input field separator, iterate over the fields and discard all those matching your value:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
awk -v myvar="$myvar" 'BEGIN{FS="\\\\n"; OFS="\\n"} {s="";
for (i=1; i<=NF; i++) {
if ($i !~ myvar) {s = s (s=="" ? "" : OFS) $i;}
}
} END{print s}' <<< "$mytext"
# => \\[[123 one (/)\n\\[[789 three (/)
See the online awk demo.
NOTES:
BEGIN{FS="\\\\n"; OFS="\\n"} - sets the input field separator to a regex matching the literal two-character \n, and the output field separator to the string \n (FS is used as a regex, so its backslash needs one more level of escaping than OFS)
-v myvar="$myvar" passes the myvar to awk
s="" - assigns s to an empty string
for (i=1; i<=NF; i++) {...} - iterates over all fields
if ($i !~ myvar) {...} - if the current field value does not match myvar...
s = s (s=="" ? "" : OFS) $i;} - append either the current field value to s (if s is still empty) or the output separator and the current field value (if it is not)
END{print s} - prints s after the field checks.
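One thing to keep in mind is that !~ treats myvar as a regular expression, so a value containing characters such as [ or . may not behave as a plain substring. A sketch of a literal comparison using index() instead; the value [[456 is a hypothetical one chosen because it is not a valid regex, and the separator handling is an assumption of this sketch:

```shell
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar='[[456'   # hypothetical value containing regex metacharacters

result=$(printf '%s\n' "$mytext" | awk -v myvar="$myvar" '
BEGIN { FS = "\\\\n"; OFS = "\\n" }   # FS: regex for a literal \n, OFS: the string \n
{
  s = ""
  for (i = 1; i <= NF; i++)
    if (index($i, myvar) == 0)        # 0 means myvar does not occur in this part
      s = s (s == "" ? "" : OFS) $i   # join the kept parts with the delimiter
  print s
}')
printf '%s\n' "$result"
```

Values that contain backslashes would need extra care, because awk processes escape sequences in -v assignments.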

Regular expression for not more than one occurrence of consecutive characters

I'm looking for regular expression that will match only if 2 consecutive characters occur in string once.
for example:
1123456 - match
1122345 - not match
1121125 - not match
1234567 - not match
1112345 - not match
Currently I have this regex: ([0-9])\1{1,} but it matches 1122345 as well, which is not what I need.
This awk does it, if you have minimal awk (mawk) or GNU awk (gawk):
awk -F "" '
{
d=0
for(i=1;i<NF;i++){
if ($i==$(i+1)) d++
}
if (d==1) print
}' file
Setting the field separator to the empty string ("") makes each character its own field, so you can read each line character-wise! If character i equals character i+1, d is incremented. If d==1 at the end of the line, the line is printed.
From your sample:
$ cat file
1123456
1122345
1121125
1234567
1112345
It outputs:
1123456
Important remark:
GNU awk manual says the use of empty string as field separator is a "dark corner", meaning that it is not standard and some implementations may handle it differently. If you want to be sure that it will work with any awk, go for
awk '
{
d=0
n=split($0,ch,"")
for(i=1;i<n;i++){
if (ch[i]==ch[i+1]) d++
}
if (d==1) print
}' file
It passed the gawk --posix test and yields the same result.
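If you would rather not rely on splitting with an empty separator at all, the same count can be sketched with substr(), which walks the line one character at a time. The sample lines are piped in here only to keep the example self-contained.

```shell
result=$(printf '%s\n' 1123456 1122345 1121125 1234567 1112345 |
awk '{
  d = 0
  n = length($0)
  for (i = 1; i < n; i++)
    if (substr($0, i, 1) == substr($0, i + 1, 1)) d++   # adjacent characters are equal
  if (d == 1) print                                     # exactly one such pair
}')
printf '%s\n' "$result"
```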