AWK regex: convert 3-letter words beginning with 'a' to uppercase

I have a regex that finds 3-letter words beginning with "a":
\b[aA][a-z]{2}\b
(It seems to work according to Rubular: http://rubular.com/r/Jil0E4WZnW)
Now I need to take each match and replace it with the same word in uppercase.
Thanks!

Call the toupper function in awk (note that this uppercases the whole line, not just the matching words):
echo "Abc" | awk '{print toupper($0)}'
gets you:
ABC

You can use Perl's uc($string) function.

You can do it with Sed like this:
echo 'Ass ass ant Ant' | sed -re 's/\ba[a-z]{2}\b/\U&/gI'
(with your example string)

Another way is to use tr:
echo "Abc" | tr 'a-z' 'A-Z'

This solution "cheats" because it uses a loop and sub instead of gsub, but it is in awk and it works.
echo "abc Ape baaa ab abcd ant" | awk '{for (i=1;i<=NF;i++) if (length($i)==3){sub(/[aA][a-z]{2}/,toupper($i),$i)};print}'

perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;' your_file
tested:
> echo "Abc ab Ab" | perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;'
ABC ab Ab
>
Taken from here
Here is the awk version:
awk '{for(i=1;i<=NF;i++)
if((length($i)==3) && $i~/[aA][a-zA-Z][a-zA-Z]/)
$i=toupper($i)
}1' your_file
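The two awk answers above split on whitespace and rebuild the line, so a word followed by punctuation ("ant.") is missed and runs of spaces are collapsed. A minimal sketch that avoids both, assuming a "word" is simply a run of ASCII letters (which is narrower than what \b means in Perl or GNU sed). The function name upper3a is made up for illustration; the pattern is written without {2} so that it also runs on awks lacking interval expressions.

```shell
# upper3a: hypothetical helper, reads text on stdin.
upper3a() {
  awk '{
    out = ""; rest = $0
    # Walk the line one run of letters at a time.
    while (match(rest, /[A-Za-z]+/)) {
      word = substr(rest, RSTART, RLENGTH)
      # Uppercase only a/A followed by exactly two lowercase letters.
      if (word ~ /^[aA][a-z][a-z]$/) word = toupper(word)
      out = out substr(rest, 1, RSTART - 1) word
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}

printf '%s\n' 'abc Ape  baaa ab, ant.' | upper3a
```

Everything between the words is copied through unchanged, so the original spacing and punctuation survive.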

Related

How to extract a number out of a string preceded by zeroes

I got a string that looks like this SOMETHING00000076XYZ
How can I extract the number 76 out of the string using a shell script? Note that 76 is preceded by zeroes and followed by letters.
1st solution: if you are OK with awk, try the following.
echo "SOMETHING00000076XYZ" | awk 'match($0,/0+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/0+/,"",val);print val;val=""}'
If you want to save the result in a variable, use the following.
variable="$(echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1')"
2nd solution: one more awk solution (keeping your sample in mind).
echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1'
Here is a sed option:
echo "SOMETHING00000076XYZ" | sed -r 's/[^0-9]*0*([0-9]+).*/\1/g';
76
Here is an explanation of the regex pattern used:
[^0-9]* match zero or more non digits
0* match zero or more 0's
([0-9]+) match AND capture one or more digits (any leading zeros were already consumed by 0*)
.* match the remainder of the string
Then, we just replace with \1, which is the first (and only) capture group.
echo 'SOMETHING00000076XYZ' | grep -o '[1-9][0-9]*'
Using gnu grep:
grep -oP '0+\K\d+' <<< 'SOMETHING00000076XYZ'
76
\K discards everything matched so far, so only the digits after the zeros are printed.
Here is another variant of awk:
awk -F '0+' 'match($2, /^[0-9]+/){print substr($2, 1, RLENGTH)}' <<< 'SOMETHING00000076XYZ'
76
You can try Perl as well
$ echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 '
76
$ a=$(echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 ')
$ echo $a
76
$
$ echo 'SOMETHING00000076XYZ' | awk '{sub(/^[^0-9]+/,""); print $0+0}'
76
You can use sed as
echo "SOMETHING00000076XYZ" | sed "s/[a-zA-Z]//g" | sed "s/^0*//"
The first step is for removing all letters
The second step is for removing leading zeroes
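The same steps (drop the letters, then drop the leading zeroes) can also be done without any external command. This is only a sketch using POSIX parameter expansion; the function name extract_number is made up, and it assumes the string contains a single run of digits, as in the sample.

```shell
# extract_number: hypothetical helper. The body runs in a subshell
# (parentheses instead of braces) so the variable s does not leak out.
extract_number() (
  s=$1
  s=${s#"${s%%[0-9]*}"}   # drop everything before the first digit
  s=${s%%[!0-9]*}         # drop everything from the first non-digit on
  s=${s#"${s%%[!0]*}"}    # drop the leading zeroes
  printf '%s\n' "${s:-0}" # an all-zero (or missing) number prints as 0
)

extract_number SOMETHING00000076XYZ
```

Because the result is produced by string trimming rather than arithmetic, a value such as 00000076 is never interpreted as octal.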

Regex to get number after last underscore

I am having trouble coming up with the regex command that will get me Y in the following string: X_X_X_Y. BTW: Y is an integer, but I can validate that afterwards.
You could use shell parameter expansion:
$ s="X_X_X_Y"
$ echo "${s##*_}"
Y
Using sed:
$ sed 's/.*_//' <<< "$s"
Y
Using grep:
$ grep -oP '.*_\K.*' <<< "$s"
Y
This regex will work as long as the part you are matching is an integer:
[^_]+_[^_]+_[^_]+_(\d+)
As an alternative, if you are always tokenizing on the _ character, you can skip the regex and use awk:
echo 'X_X_X_Y' | awk -F_ '{print $NF}'
Using BASH regex:
s='X_X_X_10'
[[ "$s" =~ [^_]+$ ]] && echo "${BASH_REMATCH[0]}"
10
This will print an integer at the end of the string after an underscore.
perl -e '"0_0_0_1" =~ /_([0-9]+)$/; print $1,"\n" if defined $1'
1
This might work for you:
sed 's/.*_\([0-9][0-9]*\)/\1/' file
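Since the question mentions validating the integer afterwards, here is a small sketch that combines the parameter expansion shown earlier with a case test. The function name last_int is made up, and "integer" is assumed to mean one or more ASCII digits with no sign.

```shell
# last_int: hypothetical helper. Prints the text after the last underscore
# if it consists only of digits; otherwise prints nothing and returns 1.
last_int() {
  case ${1##*_} in
    '' | *[!0-9]*) return 1 ;;             # empty, or contains a non-digit
    *) printf '%s\n' "${1##*_}" ;;
  esac
}

last_int X_X_X_10                          # prints 10
last_int X_X_X_Y || echo 'not an integer'
```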

awk - How to get only the matching portion of a regex

I have code like this
echo abc | awk '$0 ~ "a\(b\)c" {print $0}'
What if I only wanted what's in the parentheses instead of the whole line? This is obviously very simplified, and there is really a lot of awk code so I don't want to switch to sed or grep or something. Thanks
As far as I know you cannot do it in the pattern part; you must do it inside the action part with the match() function (the three-argument form with an array is a GNU awk extension):
echo abc | awk '{ if ( match($0, /a(b)c/, a) > 0 ) { print a[1] } }'
It yields:
b
With GNU awk:
$ echo abc | awk '{print gensub(/a(b)c/,"\\1","g")}'
b
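Both answers above need GNU awk. If the script has to run on other awks as well, one possible workaround is to take the whole match with RSTART and RLENGTH and then strip the parts around the "group" with sub(). This is a sketch only; the pattern a[^c]*c and the function name between_a_c are illustrative, not from the question.

```shell
# between_a_c: hypothetical helper, reads lines on stdin.
between_a_c() {
  awk 'match($0, /a[^c]*c/) {
    m = substr($0, RSTART, RLENGTH)   # the whole match, e.g. "abc"
    sub(/^a/, "", m)                  # strip what precedes the group
    sub(/c$/, "", m)                  # strip what follows the group
    print m
  }'
}

printf '%s\n' 'abc' 'xx a123c yy' 'none' | between_a_c
```

This only works when the text around the group can itself be described by an anchored regex, which is usually the case for fixed prefixes and suffixes.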

Substitute a regex pattern using awk

I am trying to write a regex expression to replace one or more '+' symbols present in a file with a space. I tried the following:
echo This++++this+++is+not++done | awk '{ sub(/\++/, " "); print }'
This this+++is+not++done
Expected:
This this is not done
Any ideas why this did not work?
Use gsub which does global substitution:
echo This++++this+++is+not++done | awk '{gsub(/\++/," ");}1'
The sub function replaces only the first match; to replace all matches, use gsub.
Or the tr command:
echo This++++this+++is+not++done | tr -s '+' ' '
The idiomatic awk solution would be just to translate the input field separator to the output separator:
$ echo This++++this+++is+not++done | awk -F'[+]+' '{$1=$1}1'
This this is not done
Try this
echo "This++++this+++is+not++done" | sed -re 's/(\+)+/ /g'
You could use sed too.
echo This++++this+++is+not++done | sed -e 's/+\{1,\}/ /g'
This matches one or more + and replaces it with a space.
For this case I recommend sed: it is powerful for substitution and has a short syntax.
Solution sed:
echo This++++this+++is+not++done | sed -En 's/\++/ /gp'
Result:
This this is not done
For awk:
You must use the gsub function for global substitution (more than one substitution per line).
The syntax:
gsub(regexp, replacement [, target])
If the third parameter is omitted, $0 is the target.
The target must be a variable, field, or array element. gsub modifies the target in place, overwriting it with the result.
Solution awk:
echo This++++this+++is+not++done | awk 'gsub(/\++/," ")'
Result:
This this is not done
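To illustrate the optional target argument and the return value described above, here is a small sketch that substitutes only inside the first field and reports how many replacements were made. The function name squeeze_first_field and the sample input are made up for the example.

```shell
# squeeze_first_field: hypothetical helper, reads lines on stdin.
squeeze_first_field() {
  # gsub returns the number of substitutions; with $1 as the target,
  # the rest of the line is left alone.
  awk '{ n = gsub(/[+]+/, " ", $1); print n ": " $0 }'
}

printf '%s\n' 'a++b+c x++y' | squeeze_first_field
```

The second field keeps its ++ because only $1 was passed as the target.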
echo "This++++this+++is+not++done" | sed 's/++*/ /g'
If you have access to node on your computer, you can do it by installing rexreplace:
npm install -g rexreplace
and then run
rexreplace '\++' ' ' myfile.txt
Or if you have more files in a dir data you can do
rexreplace '\++' ' ' data/*.txt

How to print matched regex pattern using awk?

Using awk, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
xxx yyy zzz
And pattern:
/yyy/
I want to only get:
yyy
EDIT:
thanks to kurumi i managed to write something like this:
awk '{
    for(i=1; i<=NF; i++) {
        tmp=match($i, /[0-9]..?.?[^A-Za-z0-9]/)
        if(tmp) {
            print $i
        }
    }
}' $1
and this is what i needed :) thanks a lot!
This is the most basic form:
awk '/pattern/{ print $0 }' file
This asks awk to search for pattern using // and then print the line, which awk calls a record and denotes by $0. The awk documentation covers this in detail.
If you only want to print the matched word:
awk '{for(i=1;i<=NF;i++){ if($i=="yyy"){print $i} } }' file
It sounds like you are trying to emulate GNU's grep -o behaviour. This will do that providing you only want the first match on each line:
awk 'match($0, /regex/) {
print substr($0, RSTART, RLENGTH)
}
' file
Here's an example, using GNU's awk implementation (gawk):
awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match, substr, RSTART and RLENGTH in the awk manual.
After that you may wish to extend this to deal with multiple matches on the same line.
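One way to handle several matches on the same line is to call match() in a loop and keep cutting off the part of the line already consumed. This is a rough sketch, not a full grep -o replacement: the function name match_all is made up, the regex is passed with -v (so backslashes in it go through awk's escape processing), and anchors such as ^ are not handled correctly after the first match.

```shell
# match_all: hypothetical helper. $1 is an ERE; input lines come from stdin.
match_all() {
  awk -v re="$1" '{
    rest = $0
    while (rest != "" && match(rest, re)) {
      if (RLENGTH == 0) {             # empty match: step forward one char
        rest = substr(rest, 2)
        continue
      }
      print substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

printf '%s\n' 'a1b22c333' 'none' | match_all '[0-9]+'
```

Each match is printed on its own line, in the order it appears in the input.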
gawk can get the matching part of every line using this as action:
{ if (match($0,/your regexp/,m)) print m[0] }
match(string, regexp [, array])
If array is present, it is cleared,
and then the zeroth element of array is set to the entire portion of
string matched by regexp. If regexp contains parentheses, the
integer-indexed elements of array are set to contain the portion of
string matching the corresponding parenthesized subexpression.
http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
If Perl is an option, you can try this:
perl -lne 'print $1 if /(regex)/' file
To implement case-insensitive matching, add the i modifier
perl -lne 'print $1 if /(regex)/i' file
To print everything AFTER the match:
perl -lne 'if ($found){print} else{if (/regex(.*)/){print $1; $found++}}' textfile
To print the match and everything after the match:
perl -lne 'if ($found){print} else{if (/(regex.*)/){print $1; $found++}}' textfile
If you are only interested in the last line of input and you expect to find only one match (for example a part of the summary line of a shell command), you can also try this very compact code, adopted from How to print regexp matches using `awk`?:
$ echo "xxx yyy zzz" | awk '{match($0,"yyy",a)}END{print a[0]}'
yyy
Or the more complex version with a partial result:
$ echo "xxx=a yyy=b zzz=c" | awk '{match($0,"yyy=([^ ]+)",a)}END{print a[1]}'
b
Warning: the awk match() function with three arguments only exists in gawk, not in mawk
Here is another solution using a lookbehind regex in grep instead of awk. It does not need gawk, but it does need a grep that supports -P (GNU grep built with PCRE):
$ echo "xxx=a yyy=b zzz=c" | grep -Po '(?<=yyy=)[^ ]+'
b
Slightly off topic: this can also be done with grep. I am posting it here in case anyone is looking for a grep solution.
echo 'xxx yyy zzze ' | grep -oE 'yyy'
Using sed can also be elegant in this situation. Example (replace line with matched group "yyy" from line):
$ cat testfile
xxx yyy zzz
yyy xxx zzz
$ cat testfile | sed -r 's#^.*(yyy).*$#\1#g'
yyy
yyy
Relevant manual page: https://www.gnu.org/software/sed/manual/sed.html#Back_002dreferences-and-Subexpressions
If you know what column the text/pattern you're looking for (e.g. "yyy") is in, you can just check that specific column to see if it matches, and print it.
For example, given a file with the following contents, (called asdf.txt)
xxx yyy zzz
to only print the second column if it matches the pattern "yyy", you could do something like this:
awk '$2 ~ /yyy/ {print $2}' asdf.txt
Note that this will also match basically any line where the second column has a "yyy" in it, like these:
xxx yyyz zzz
xxx zyyyz
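echo "abc123def" | awk '
# MATCH(haystack, needle [, ltrim [, rtrim]]): return the text matched by
# needle, minus ltrim leading and rtrim trailing characters.
function MATCH(haystack, needle, ltrim, rtrim)
{
    # Omitted parameters are uninitialized; default them to 0.
    if (ltrim == 0 && !length(ltrim))
        ltrim = 0;
    if (rtrim == 0 && !length(rtrim))
        rtrim = 0;
    return substr(haystack, match(haystack, needle) + ltrim, RLENGTH - ltrim - rtrim);
}
{
    print $0 " - " MATCH($0, "123");             # 123
    print $0 " - " MATCH($0, "[0-9]*d", 0, 1);   # 123
    print $0 " - " MATCH($0, "1234");            # empty string (no match)
}'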