awk - remove character in regex - regex

I want to remove the 1 with awk from matches of this regex: ^1[0-9]{10}$, if the regex is found in any field. I've been trying to make it work with sub or substr for a few hours now, but I am unable to find the correct logic for this. I already have the solution for sed: s/^1\([0-9]\{10\}\)$/\1/; I need to make this work with awk.
Edit for input and output example. Input:
10987654321
2310987654321
1098765432123
Output:
0987654321
2310987654321
1098765432123
Basically the leading 1 needs to be removed only when it's followed by ten digits. The 2nd and 3rd example lines are correct, 2nd has 23 in front of 1, 3rd has a leading 1 but it's followed by 12 digits instead of ten. That's what the regex specifies.

With sub(), you could try:
awk '/^1[0-9]{10}$/ { sub(/^1/, "") }1' file
Or with substr():
awk '/^1[0-9]{10}$/ { $0 = substr($0, 2) }1' file
If you need to test each field, try looping over them:
awk '{ for(i=1; i<=NF; i++) if ($i ~ /^1[0-9]{10}$/) sub(/^1/, "", $i) }1' file
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
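For example, a quick check of the field-by-field variant, with the question's sample numbers placed on a single line (a hypothetical input):
$ echo '10987654321 2310987654321 10987654321' | awk '{ for(i=1; i<=NF; i++) if ($i ~ /^1[0-9]{10}$/) sub(/^1/, "", $i) }1'
0987654321 2310987654321 0987654321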

If GNU awk is available to you, you could use the gensub function:
echo '10987654321'|awk '{s=gensub(/^1([0-9]{10})$/,"\\1","g");print s}'
0987654321
edit:
do it for every field:
awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7' file
test:
kent$ echo '10987654321 10987654321'|awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7'
0987654321 0987654321

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
i am trying to print the last letter of each word to make a string using awk command
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of each column? And instead of repeating the substr command, how can I use it only once to print specific characters from different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} - iterate over all fields in the current record; i is the field index, $i is the field value, and the last character of each field (retrieved with substr($i,length($i))) is appended to the r variable.
END{print r} prints the r variable once the awk script finishes processing.
In the second solution, r is cleared at the start of each record and its value is printed after all fields in the current record have been processed.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s"
Output:
GMUCHOS
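And a quick check of the second, per-line variant on a hypothetical two-line input, which shows why r has to be reset for every record:
$ printf 'UDACBG UYAZAM\nDJSUBU WJKMBC\n' | awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}'
GM
UC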
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set the record separator to any character followed by spaces OR the end of the line. Then, as per the OP's requirement, remove unnecessary newlines/spaces from the fetched value (RT) and keep appending it to val, which accumulates the matched value of RS; finally, when the awk program is done reading the whole Input_file, print the value of that variable.
2nd solution: Set the record separator to null and use the match function with the regex (.[[:space:]]+)|(.$) to get the last-letter values only; with each match found, keep adding the matched value to a variable, and at last, in the END block of the awk program, print the variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
Separate the words onto lines, reverse them so that we can pick the first character easily, and finally paste them back together without a delimiter. Not the shortest solution, but I think the most trivial one...
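To see the intermediate steps, here is the same pipeline without the final paste, fed a shortened sample as a here-string:
$ tr -s ' ' '\n' <<< 'UDACBG UYAZAM DJSUBU' | rev | cut -c1
G
M
U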
I would harness GNU AWK for this as follows. Let the content of file.txt be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Inform AWK to treat any alphabetic character at the end of a word as a field, and use an empty string as the output field separator. $1=$1 is used to trigger rebuilding of the line with the specified OFS. If you want to know more about start/end of word, read GNU Regexp Operators.
(tested in gawk 4.2.1)
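To see exactly which fields FPAT produces, a quick gawk check against the same file.txt:
$ gawk 'BEGIN{FPAT="[[:alpha:]]\\>"}{for(i=1;i<=NF;i++) print i, $i}' file.txt
1 G
2 M
3 U
4 C
5 H
6 O
7 S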
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to define the fields as exactly the content you want to keep:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatically:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage of both of these is that single-letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

Print everything before relevant symbol and keep 1 character after relevant symbol

I'm trying to find a one-liner to print everything before a relevant symbol and keep just 1 character after that symbol:
Input:
thisis#atest
thisisjust#anothertest
just#testing
Desired output:
thisis#a
thisisjust#a
just#t
awk -F"#" '{print $1 "#" }' will almost give me what I want but I need to find a way to print the second character as well. Any ideas?
You can substitute what's after the first character after # with nothing with sed:
sed 's/\(#.\).*/\1/'
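For example, on the second input line:
$ sed 's/\(#.\).*/\1/' <<< 'thisisjust#anothertest'
thisisjust#a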
You could use grep:
$ grep -o '[^#]*#.' infile
thisis#a
thisisjust#a
just#t
This matches a sequence of characters other than #, followed by # and any character. The -o option retains only the match itself.
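Note that on a line containing more than one #, grep -o prints every non-overlapping match, each on its own line (shown here with a hypothetical input):
$ echo 'just#testing#more' | grep -o '[^#]*#.'
just#t
esting#m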
With the special RT variable in GNU's awk, you can do:
awk 'BEGIN{RS="#.|\n"}RT!="\n"{print $0 RT}'
Get the index of the '#', then pull out the substring.
$ awk '{print substr($0,1,index($0,"#")+1);}' in.txt
thisis#a
thisisjust#a
just#t
1st Solution: Could you please try the following:
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH)}' Input_file
The above will print the lines, as per your requirement, which have # in them and leave out the lines which do not have it; in case you want to print those lines in full as well, use the following:
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH);next} 1' Input_file
2nd solution:
awk 'BEGIN{FS=OFS="#"} {print $1,substr($2,1,1)}' Input_file
Some small variations of Ravinder's 2nd example:
awk -F# '{print $1"#"substr($2,1,1)}' file
awk -F# '{print $1FS substr($2,1,1)}' file
Another grep variation (shortest posted so far):
grep -oP '.+?#.' file
-o print only the matching part
-P Perl regex (due to the +?)
. any character
+ and more
? but stop with:
# a literal #
. plus one more character
If we do not add the ?, the line test#one#two becomes test#one#t instead of test#o, due to the greedy +.
If you want to use awk, the cleanest way to do this is with index, which finds the position of a character:
awk 'n=index($0,"#") { print substr($0,1,n+1) }' file
There are, however, shorter and more dedicated tools for this. See the other answers.

Bash Regex Capture Groups

I have a single string that is this kind of format:
"Mike H<michael.haken#email1.com>" michael.haken#email2.com "Mike H<hakenmt#email1.com>"
If I was writing a normal regex in JS, C#, etc, I'd do this
(?:"(.+?)"|'(.+?)'|(\S+))
And iterate the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example, I'd end up with 3 items in an array as follows:
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
I can't figure out how to replicate this functionality with grep or sed or bash regex's. I've tried some things like
echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"
The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiples, so I get captures like
"Mike
H<michael.haken#email1.com>"
michael.haken#email2.com
If I remove the look ahead/behind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can individually add each string to the array, but I'm open to other options.
EDIT:
I think my input example may have been confusing, it's just a possible input. The real input could be double quoted, single quoted, or non-quoted (without spaces) strings in any order with any quantity. The Javascript/C# regex I provided is the real behavior I'm trying to achieve.
You can use Perl:
$ email='"Mike H<michael.haken#email1.com>" michael.haken#email2.com "Mike H<hakenmt#email1.com>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}'
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
Or in pure Bash, it gets kinda wordy:
re='\"([^\"]+)\"[[:space:]]*|([^[:space:]]+)[[:space:]]*'
while [[ $email =~ $re ]]; do
    echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"   # whichever capture group matched
    i=${#BASH_REMATCH}                            # length of the whole match
    email=${email:i}                              # drop the consumed part
done
# same output
You may use sed to achieve that,
$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$EMAIL"
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
gawk + bash solution (adding each item to array):
email_str='"Mike H<michael.haken#email1.com>" michael.haken#email2.com "Mike H<hakenmt#email1.com>"'
readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
'{ for(i=1;i<=NF;i++) print $i }' <<<$email_str)
Now, all items are in email_arr
Accessing the 2nd item:
echo "${email_arr[1]}"
michael.haken#email2.com
Accessing the 3rd item:
echo "${email_arr[3]}"
Mike H<hakenmt#email1.com>
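To list every element of the array (assuming it was built as above):
printf '%s\n' "${email_arr[@]}"
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>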
Your first expression is fine; just be careful with the quotes (use single quotes when \ are present). In the end, trim the " characters with sed.
$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
Using gawk, where RS can be set to a regular expression:
awk -v RS='"|" ' 'NF' inputfile
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
Modify your regex like this:
grep -oP '("?\s*)\K.*?(?=")' file
Output:
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
Using GNU awk and FPAT to define fields by content:
$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" } # define a field to be space-separated or in quotes
{
for(i=1;i<=NF;i++) { # iterate every field
gsub(/^\"|\"$/,"",$i) # remove leading and trailing quotes
print $i # output
}
}' file
Mike H<michael.haken#email1.com>
michael.haken#email2.com
Mike H<hakenmt#email1.com>
What I was able to do that worked, but wasn't as concise as I wanted the code to be:
arr=()
while read line; do
line="${line//\"/}"
arr+=("${line//\'/}")
done < <(echo $email | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")
This gave me an array of the capturing group and handled the input in any order, wrapped in double or single quotes or none at all if it didn't have a space. It also provided the elements in the array without the wrapping quotes. Appreciate all of the suggestions.

removing last character of every word in files

I have multiple files with just one line of simple text. I want to remove the last character of every word in each file. Every file has a different length of text.
The closest I got is to edit one file:
awk '{ print substr($1, 1, length($1)-1); print substr($2, 1, length($2)-1); }' file.txt
But I cannot figure out how to make this general, for files with different word counts.
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file
this should do the removal.
If it was tested ok, and you want to overwrite your file, you can do:
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file > tmp && mv tmp file
Example:
kent$ awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' <<<"foo bar foobar"
fo ba fooba
Use awk to loop over all the fields in each row (up to NF) and apply the substr function to each.
awk '{for (i=1; i<=NF; i++) {printf "%s ", substr($i, 1, length($i)-1)}}END{printf "\n"}' file
For a sample input file
ABCD ABC BC
The awk logic produces an output
ABC AB B
Another way, by setting the output record separator (ORS) to an empty string and just using print:
awk 'BEGIN{ORS="";}{for (i=1; i<=NF; i++) {print substr($i, 1, length($i)-1); print " "}}END{print "\n"}' file
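For the same sample input this prints the same result (with a trailing space before the final newline):
$ echo 'ABCD ABC BC' | awk 'BEGIN{ORS="";}{for (i=1; i<=NF; i++) {print substr($i, 1, length($i)-1); print " "}}END{print "\n"}'
ABC AB B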
I would go for a Bash approach:
Since ${var%?} removes the last character of a variable:
$ var="hello"
$ echo "${var%?}"
hell
And you can use the same approach on arrays:
$ arr=("hello" "how" "are" "you")
$ printf "%s\n" "${arr[#]%?}"
hell
ho
ar
yo
What about going through the files, reading their only line (you said the files consist of just one line) into an array, and using the above-mentioned expansion to remove the last character of each word:
for file in dir/*; do
    read -r -a myline < "$file"
    printf "%s " "${myline[@]%?}"
done
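A quick single-line check of the same idea, using a hypothetical sample read from a here-string instead of a file:
$ read -r -a myline <<< 'foo bar foobar'
$ printf '%s ' "${myline[@]%?}"; echo
fo ba fooba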
Sed version, assuming words are only composed of letters (if not, just adapt the class [[:alpha:]] to reflect your needs) and are separated by spaces and punctuation:
sed 's/$/ /;s/[[:alpha:]]\([[:blank:][:punct:]]\)/\1/g;s/ $//' YourFile
awk (in fact gawk, for the regex word boundaries):
gawk '{gsub(/.\>/, "");print}' YourFile
# or, optimized by @kent ;-) thanks for the tips
gawk '4+gsub(/.\>/, "")' YourFile
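For example, with the same sample as in the other answers:
$ gawk '{gsub(/.\>/, "");print}' <<< 'foo bar foobar'
fo ba fooba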
$ cat foo
word1
word2 word3
$ sed 's/\([^ ]*\)[^ ]\( \|$\)/\1\2/g' foo
word
word word
A word is any string of characters excluding space (=[^ ]).
EDIT: If you want to enforce POSIX (--posix), you can use:
$ sed --posix 's/\([^ ]*\)[^ ]\([ ]\{,1\}\)/\1\2/g' foo
word
word word
The \( \|$\) changes to \([ ]\{,1\}\), i.e. there is an optional space at the end.

How to print matched regex pattern using awk?

Using awk, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
xxx yyy zzz
And pattern:
/yyy/
I want to only get:
yyy
EDIT:
Thanks to kurumi I managed to write something like this:
awk '{
    for(i=1; i<=NF; i++) {
        tmp=match($i, /[0-9]..?.?[^A-Za-z0-9]/)
        if(tmp) {
            print $i
        }
    }
}' $1
and this is what I needed :) thanks a lot!
This is the very basic
awk '/pattern/{ print $0 }' file
Ask awk to search for the pattern using //, then print out the line, which by default is called a record, denoted by $0. At the very least, read up on the documentation.
If you only want to print out the matched word:
awk '{for(i=1;i<=NF;i++){ if($i=="yyy"){print $i} } }' file
It sounds like you are trying to emulate GNU's grep -o behaviour. This will do that providing you only want the first match on each line:
awk 'match($0, /regex/) {
print substr($0, RSTART, RLENGTH)
}
' file
Here's an example, using GNU's awk implementation (gawk):
awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match, substr, RSTART and RLENGTH in the awk manual.
After that you may wish to extend this to deal with multiple matches on the same line.
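One possible extension (a sketch, not part of the original answer): repeatedly apply match() to the remainder of the line and print each hit:
awk '{
    s = $0
    while (match(s, /a.t/)) {              # find the next match in what is left
        print substr(s, RSTART, RLENGTH)   # print the matched text
        s = substr(s, RSTART + RLENGTH)    # continue after this match
    }
}' /usr/share/dict/words | head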
gawk can get the matching part of every line using this as action:
{ if (match($0,/your regexp/,m)) print m[0] }
match(string, regexp [, array])
If array is present, it is cleared,
and then the zeroth element of array is set to the entire portion of
string matched by regexp. If regexp contains parentheses, the
integer-indexed elements of array are set to contain the portion of
string matching the corresponding parenthesized subexpression.
http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
If Perl is an option, you can try this:
perl -lne 'print $1 if /(regex)/' file
To implement case-insensitive matching, add the i modifier
perl -lne 'print $1 if /(regex)/i' file
To print everything AFTER the match:
perl -lne 'if ($found){print} else{if (/regex(.*)/){print $1; $found++}}' textfile
To print the match and everything after the match:
perl -lne 'if ($found){print} else{if (/(regex.*)/){print $1; $found++}}' textfile
If you are only interested in the last line of input and you expect to find only one match (for example a part of the summary line of a shell command), you can also try this very compact code, adopted from How to print regexp matches using `awk`?:
$ echo "xxx yyy zzz" | awk '{match($0,"yyy",a)}END{print a[0]}'
yyy
Or the more complex version with a partial result:
$ echo "xxx=a yyy=b zzz=c" | awk '{match($0,"yyy=([^ ]+)",a)}END{print a[1]}'
b
Warning: the awk match() function with three arguments only exists in gawk, not in mawk
Here is another nice solution using a lookbehind regex in grep instead of awk. This solution has lower requirements to your installation:
$ echo "xxx=a yyy=b zzz=c" | grep -Po '(?<=yyy=)[^ ]+'
b
Off topic, but this can be done using grep as well; just posting it here in case anyone is looking for a grep solution.
echo 'xxx yyy zzze ' | grep -oE 'yyy'
Using sed can also be elegant in this situation. Example (replace the whole line with the matched group "yyy"):
$ cat testfile
xxx yyy zzz
yyy xxx zzz
$ cat testfile | sed -r 's#^.*(yyy).*$#\1#g'
yyy
yyy
Relevant manual page: https://www.gnu.org/software/sed/manual/sed.html#Back_002dreferences-and-Subexpressions
If you know what column the text/pattern you're looking for (e.g. "yyy") is in, you can just check that specific column to see if it matches, and print it.
For example, given a file with the following contents, (called asdf.txt)
xxx yyy zzz
to only print the second column if it matches the pattern "yyy", you could do something like this:
awk '$2 ~ /yyy/ {print $2}' asdf.txt
Note that this will also match basically any line where the second column has a "yyy" in it, like these:
xxx yyyz zzz
xxx zyyyz
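If you need the second column to be exactly yyy, anchor the pattern (a small tweak, not part of the original answer):
awk '$2 ~ /^yyy$/ {print $2}' asdf.txt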
echo "abc123def" | awk '
function MATCH(haystack, needle, ltrim, rtrim)
{
    # default ltrim/rtrim to 0 when the argument was not supplied
    if(ltrim == 0 && !length(ltrim))
        ltrim = 0;
    if(rtrim == 0 && !length(rtrim))
        rtrim = 0;
    # return the matched text, trimmed by ltrim/rtrim characters on either side
    return substr(haystack, match(haystack, needle) + ltrim, RLENGTH - ltrim - rtrim);
}
{
    print $0 " - " MATCH($0, "123");           # 123
    print $0 " - " MATCH($0, "[0-9]*d", 0, 1); # 123
    print $0 " - " MATCH($0, "1234");          # Nothing printed
}'