removing last character of every word in files - regex

I have multiple files with just one line of simple text. I want to remove last character of every word in each file. Every file has different length of text.
The closest I got is to edit one file:
awk '{ print substr($1, 1, length($1)-1); print substr($2, 1, length($2)-1); }' file.txt
But I can not figure out, how to make this general, for files with different words count.

awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file
this should do the removal.
If it was tested ok, and you want to overwrite your file, you can do:
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file > tmp && mv tmp file
Example:
kent$ awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' <<<"foo bar foobar"
fo ba fooba

Use awk to loop till max fields in each row upto NF, and apply the substr function.
awk '{for (i=1; i<=NF; i++) {printf "%s ", substr($i, 1, length($i)-1)}}END{printf "\n"}' file
For a sample input file
ABCD ABC BC
The awk logic produces an output
ABC AB B
Another way by changing the record-separator to NULL and just using print:-
awk 'BEGIN{ORS="";}{for (i=1; i<=NF; i++) {print substr($i, 1, length($i)-1); print " "}}END{print "\n"}' file

I would go for a Bash approach:
Since ${var%?} removes the last character of a variable:
$ var="hello"
$ echo "${var%?}"
hell
And you can use the same approach on arrays:
$ arr=("hello" "how" "are" "you")
$ printf "%s\n" "${arr[#]%?}"
hell
ho
ar
yo
What about going through the files, read their only line (you said the files just consist in one line) into an array and use the abovementioned tool to remove the last character of each word:
for file in dir/*; do
read -r -a myline < "$file"
printf "%s " "${myline[#]%?}"
done

Sed version, assuming word are only composed of letter (if not, just adapt the class [[:alpha:]] to reflect your need) and separate by space and puctuation
sed 's/$/ /;s/[[:alpha:]]\([[:blank:][:punct:]]\)/\1/g;s/ $//' YourFile
awk (gawk for regex boundaries in fact)
gawk '{gsub(/.\>/, "");print}' YourFile
#or optimized by #kent ;-) thks for the tips
gawk '4+gsub(/.\>/, "")' YourFile

$ cat foo
word1
word2 word3
$ sed 's/\([^ ]*\)[^ ]\( \|$\)/\1\2/g' foo
word
word word
A word is any string of characters excluding space (=[^ ]).
EDIT: If you want to enforce POSIX (--posix), you can use:
$ sed --posix 's/\([^ ]*\)[^ ]\([ ]\{,1\}\)/\1\2/g' foo
word
word word
This \( \|$\) changes to \([ ]\{,1\}\), ie there is an optional space in the end.

Related

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
i am trying to print the last letter of each word to make a string using awk command
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of $column, and instead of the repeding substr command, how can I use it only once to print specific characters in different columns
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i)) - iterate over all fields in the current record, i is the field ID, $i is the field value, and all last chars of each field (retrieved with substr($i,length($i))) are appended to r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s"
Output:
GMUCHOS
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try following awk program, written and tested eith shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set record separator as any character followed by space OR end of value/line. Then as per OP's requirement remove unnecessary newline/spaces from fetched value; keep on creating val which has matched value of RS, finally when awk program is done with reading whole Input_file print the value of variable then.
2nd solution: Using record separator as null and using match function on values to match regex (.[[:space:]]+)|(.$) to get last letter values only with each match found, keep adding matched values into a variable and at last in END block of awk program print variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
separate the words to lines, reverse so that we can pick the first char easily, and finally paste them back together without a delimiter. Not the shortest solution but I think the most trivial one...
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Inform AWK to treat any alphabetic character at end of word and use empty string as output field seperator. $1=$1 is used to trigger line rebuilding with usage of specified OFS. If you want to know more about start/end of word read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatic:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

awk - remove character in regex

I want to remove 1 with awk from this regex: ^1[0-9]{10}$ if said regex is found in any field. I've been trying to make it work with sub or substr for a few hours now, I am unable to find the correct logic for this. I already have the solution for sed: s/^1\([0-9]\{10\}\)$/\1/, I need to make this work with awk.
Edit for input and output example. Input:
10987654321
2310987654321
1098765432123
(awk twisted and overcomplicated syntax)
Output:
0987654321
2310987654321
1098765432123
Basically the leading 1 needs to be removed only when it's followed by ten digits. The 2nd and 3rd example lines are correct, 2nd has 23 in front of 1, 3rd has a leading 1 but it's followed by 12 digits instead of ten. That's what the regex specifies.
With sub(), you could try:
awk '/^1[0-9]{10}$/ { sub(/^1/, "") }1' file
Or with substr():
awk '/^1[0-9]{10}$/ { $0 = substr($0, 2) }1' file
If you need to test each field, try looping over them:
awk '{ for(i=1; i<=NF; i++) if ($i ~ /^1[0-9]{10}$/) sub(/^1/, "", $i) }1' file
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
if gnu awk is available for you, you could use gensub function:
echo '10987654321'|awk '{s=gensub(/^1([0-9]{10})$/,"\\1","g");print s}'
0987654321
edit:
do it for every field:
awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7 file
test:
kent$ echo '10987654321 10987654321'|awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7'
0987654321 0987654321

Perl, sed, or awk one-liner to change the format of the file

I need advice on how to change the file formatted following way
file1:
A 504688
B jobnameA
A 504690
B jobnameB
A 504691
B jobnameC
...
into file2:
A B
504688 jobnameA
504690 jobnameB
504691 jobnameC
...
One solution I could think of is:
cat file1 | perl -0777 -p -e 's/\s+B/\t/' | awk '{print $2"\t"$3}'.
But I am wondering if there is more efficient way or already known practice that does this job.
perl -nawe 'print "#F[1 .. $#F]", $F[0] eq "A" ? "\t" : "\n"' < /tmp/ab
Look up the options in perlrun.
Another useful one to add is -l (append newline to print), but not in this case.
Assuming your input file is tab separated:
echo $'A\tB'
cut -f2 filename | paste - -
Should be pretty quick because this is exactly what cut and paste were written to do.
awk '/^A/{num=$2}/^B/{print num,$2}' file
Or, alternately,
awk '{num=$2;getline;print num,$2}' file
Here is an sed solution:
sed -e 'N' -e 's/A\s*\(.*\)\nB\s*\(.*\)/\1\t\2/' file
This version will also print the header at the top:
sed '1{h;s/.*/A\tB/p;g};N;s/A\s*\(.*\)\nB\s*\(.*\)/\1\t\2/' file
Or an alternative:
sed -n '/^A\s*/{s///;h};/^B\s*/{s///;H;g;s/\n/\t/p}' file
If your sed does not support semicolons as a command separator for the alternative:
sed -n '
/^A\s*/{ # if the line starts with "A"
s/// # remove the "A" and the whitespace
h # copy the remainder into the hold space
} # end if
/^B\s*/{ # if the line starts with "B"
s/// # remove the "B" and the whitespace
H # append pattern space to hold space
g # copy hold space to pattern space
s/\n/\t/p # replace newline with tab and print
}' file
This version will also print the header at the top:
sed -n '/^A\s*/{s///;h;1s/.*/A\tB/p};/^B\s*/{s///;H;g;s/\n/\t/p}' file
This will work with any header text, not just fixed A and B >>
awk '{a=$1;b=$2;getline;if(c!=1){print a,$1;c=1};print b,$2}' file1 >file2
...and it will print also header row
If you need \t separator, then use:
awk '{a=$1;b=$2;getline;if(c!=1){print a"\t"$1;c=1};print b"\t"$2}' file1 >file2
This might work for you:
sed -e '1i\A\tB' -e 'N;s/A\s*\(\S*\).*\nB\s*\(\S*\).*/\1\t\2/' file

sed replace character only between two known strings

Is it possible to replace a character between two known strings only? I have a number of files in the format
title.header.index.subtitle.goes.here.footer
I can pick out the "subtitle.goes.here" with pattern matching between the index (which I need to backreference) and a footer (which is constant), but I then want to replace the period/dot character with an underscore, to give me
title.header.index.subtitle_goes_here.footer
So from input such as
title.header.01.the.first.subtitle.is.here.footer
I want to end up with
title.header.01.the_first_subtitle_is_here.footer
What I have so far is useless, but a start:
sed -r 's/([0-9][0-9]\.)([a-z]*\.*)*footer/\1footer/g'
But this is removing the entire subtitle and footer before manually adding it back in and has plenty of other flaws I'm sure. Any help would be much appreciated.
This might work for you:
echo "title.header.01.the.first.subtitle.is.here.footer" |
sed 's/\./_/4g;s/.\(footer\)/.\1/'
title.header.01.the_first_subtitle_is_here.footer
An ugly alternative:
sed 'h;s/\([0-9][0-9]\.\).*\(\.footer\)/\1\n\2/;x;s/.*[0-9][0-9]\.\(.*\).footer/\1/;s/\./_/g;x;G;s/\(\n\)\(.*\)\1\(.*\)/\3\2/' file
If you are open to awk solution then this might help -
awk '
{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"_":$i"."} print $NF}
' FS='.' OFS='.' file
Input File:
[jaypal:~/Temp] cat file
title.header.index.subtitle.goes.here.footer
title.header.01.the.first.subtitle.is.here.footer
Test:
[jaypal:~/Temp] awk '
{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"_":$i"."} print $NF}
' FS='.' OFS='.' file
title.header.index.subtitle_goes_here.footer
title.header.01.the_first_subtitle_is_here.footer

How to print matched regex pattern using awk?

Using awk, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
xxx yyy zzz
And pattern:
/yyy/
I want to only get:
yyy
EDIT:
thanks to kurumi i managed to write something like this:
awk '{
for(i=1; i<=NF; i++) {
tmp=match($i, /[0-9]..?.?[^A-Za-z0-9]/)
if(tmp) {
print $i
}
}
}' $1
and this is what i needed :) thanks a lot!
This is the very basic
awk '/pattern/{ print $0 }' file
ask awk to search for pattern using //, then print out the line, which by default is called a record, denoted by $0. At least read up the documentation.
If you only want to get print out the matched word.
awk '{for(i=1;i<=NF;i++){ if($i=="yyy"){print $i} } }' file
It sounds like you are trying to emulate GNU's grep -o behaviour. This will do that providing you only want the first match on each line:
awk 'match($0, /regex/) {
print substr($0, RSTART, RLENGTH)
}
' file
Here's an example, using GNU's awk implementation (gawk):
awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match, substr, RSTART and RLENGTH in the awk manual.
After that you may wish to extend this to deal with multiple matches on the same line.
gawk can get the matching part of every line using this as action:
{ if (match($0,/your regexp/,m)) print m[0] }
match(string, regexp [, array])
If array is present, it is cleared,
and then the zeroth element of array is set to the entire portion of
string matched by regexp. If regexp contains parentheses, the
integer-indexed elements of array are set to contain the portion of
string matching the corresponding parenthesized subexpression.
http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
If Perl is an option, you can try this:
perl -lne 'print $1 if /(regex)/' file
To implement case-insensitive matching, add the i modifier
perl -lne 'print $1 if /(regex)/i' file
To print everything AFTER the match:
perl -lne 'if ($found){print} else{if (/regex(.*)/){print $1; $found++}}' textfile
To print the match and everything after the match:
perl -lne 'if ($found){print} else{if (/(regex.*)/){print $1; $found++}}' textfile
If you are only interested in the last line of input and you expect to find only one match (for example a part of the summary line of a shell command), you can also try this very compact code, adopted from How to print regexp matches using `awk`?:
$ echo "xxx yyy zzz" | awk '{match($0,"yyy",a)}END{print a[0]}'
yyy
Or the more complex version with a partial result:
$ echo "xxx=a yyy=b zzz=c" | awk '{match($0,"yyy=([^ ]+)",a)}END{print a[1]}'
b
Warning: the awk match() function with three arguments only exists in gawk, not in mawk
Here is another nice solution using a lookbehind regex in grep instead of awk. This solution has lower requirements to your installation:
$ echo "xxx=a yyy=b zzz=c" | grep -Po '(?<=yyy=)[^ ]+'
b
Off topic, this can be done using the grep also, just posting it here in case if anyone is looking for grep solution
echo 'xxx yyy zzze ' | grep -oE 'yyy'
Using sed can also be elegant in this situation. Example (replace line with matched group "yyy" from line):
$ cat testfile
xxx yyy zzz
yyy xxx zzz
$ cat testfile | sed -r 's#^.*(yyy).*$#\1#g'
yyy
yyy
Relevant manual page: https://www.gnu.org/software/sed/manual/sed.html#Back_002dreferences-and-Subexpressions
If you know what column the text/pattern you're looking for (e.g. "yyy") is in, you can just check that specific column to see if it matches, and print it.
For example, given a file with the following contents, (called asdf.txt)
xxx yyy zzz
to only print the second column if it matches the pattern "yyy", you could do something like this:
awk '$2 ~ /yyy/ {print $2}' asdf.txt
Note that this will also match basically any line where the second column has a "yyy" in it, like these:
xxx yyyz zzz
xxx zyyyz
echo "abc123def" | awk '
function MATCH(haystack, needle, ltrim, rtrim)
{
if(ltrim == 0 && !length(ltrim))
ltrim = 0;
if(rtrim == 0 && !length(rtrim))
rtrim = 0;
return substr(haystack, match(haystack, needle) + ltrim, RLENGTH - ltrim - rtrim);
}
{
print $0 " - " MATCH($0, "123"); # 123
print $0 " - " MATCH($0, "[0-9]*d", 0, 1); # 123
print $0 " - " MATCH($0, "1234"); # Nothing printed
}'