sed replace character only between two known strings - regex

Is it possible to replace a character between two known strings only? I have a number of files in the format
title.header.index.subtitle.goes.here.footer
I can pick out the "subtitle.goes.here" with pattern matching between the index (which I need to backreference) and a footer (which is constant), but I then want to replace the period/dot character with an underscore, to give me
title.header.index.subtitle_goes_here.footer
So from input such as
title.header.01.the.first.subtitle.is.here.footer
I want to end up with
title.header.01.the_first_subtitle_is_here.footer
What I have so far is useless, but a start:
sed -r 's/([0-9][0-9]\.)([a-z]*\.*)*footer/\1footer/g'
But this is removing the entire subtitle and footer before manually adding it back in and has plenty of other flaws I'm sure. Any help would be much appreciated.

This might work for you:
echo "title.header.01.the.first.subtitle.is.here.footer" |
sed 's/\./_/4g;s/.\(footer\)/.\1/'
title.header.01.the_first_subtitle_is_here.footer
An ugly alternative:
sed 'h;s/\([0-9][0-9]\.\).*\(\.footer\)/\1\n\2/;x;s/.*[0-9][0-9]\.\(.*\).footer/\1/;s/\./_/g;x;G;s/\(\n\)\(.*\)\1\(.*\)/\3\2/' file

If you are open to an awk solution, this might help:
awk '
{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"_":$i"."} print $NF}
' FS='.' OFS='.' file
Input File:
[jaypal:~/Temp] cat file
title.header.index.subtitle.goes.here.footer
title.header.01.the.first.subtitle.is.here.footer
Test:
[jaypal:~/Temp] awk '
{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"_":$i"."} print $NF}
' FS='.' OFS='.' file
title.header.index.subtitle_goes_here.footer
title.header.01.the_first_subtitle_is_here.footer
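A variant of the awk approach that builds the output line explicitly instead of passing field text to printf as the format string (which would misbehave if a field contained a % sign). This is only a sketch: the function name fix_subtitle is made up for illustration, and it assumes every line has at least the four parts title.header.index.footer.

```shell
# fix_subtitle [FILE...]: hypothetical helper, not from the answers above.
# Assumes the layout title.header.index.<subtitle words>.footer on every line.
fix_subtitle() {
  awk -F. '{
    out = $1 "." $2 "." $3 "."              # title, header and index keep their dots
    for (i = 4; i < NF; i++)                # fields 4 .. NF-1 are the subtitle words
      out = out $i (i < NF - 1 ? "_" : ".") # "_" between words, "." before the footer
    print out $NF                           # the last field is the constant footer
  }' "$@"
}

printf '%s\n' 'title.header.01.the.first.subtitle.is.here.footer' | fix_subtitle
# prints: title.header.01.the_first_subtitle_is_here.footer
```

With no file argument the function reads standard input, so it can sit at the end of a pipeline.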

Related

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string using an awk command
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of $column? And instead of repeating the substr command, how can I use it only once to print specific characters in different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} - iterates over all fields in the current record: i is the field number, $i is the field value, and the last character of each field (retrieved with substr($i,length($i))) is appended to the r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' <<< "$s"
Output:
GMUCHOS
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set the record separator to any character followed by whitespace or by the end of the value/line. RT then holds the text that matched RS; as per OP's requirement, remove the unnecessary newlines/spaces from it and keep appending it to val. Finally, when the awk program is done reading the whole Input_file, print the value of val.
2nd solution: Using record separator as null and using match function on values to match regex (.[[:space:]]+)|(.$) to get last letter values only with each match found, keep adding matched values into a variable and at last in END block of awk program print variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
separate the words to lines, reverse so that we can pick the first char easily, and finally paste them back together without a delimiter. Not the shortest solution but I think the most trivial one...
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Inform AWK to treat any alphabetic character at the end of a word as a field, and use an empty string as the output field separator. $1=$1 is used to trigger rebuilding of the line with the specified OFS. If you want to know more about start/end of word matching, read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
Here gensub() extracts the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or, more tersely and idiomatically:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where a loop that calls length($1) instead of length($i), and the gensub answer that requires at least two characters per word, do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG
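For comparison, the plain field loop with length($i) also copes with single-letter words and needs no gawk extensions. A minimal sketch, assuming whitespace-separated words; the function name last_letters is hypothetical:

```shell
# last_letters [FILE...]: hypothetical helper, prints the last character of
# every whitespace-separated word, one result string per input line.
last_letters() {
  awk '{
    r = ""                                 # reset the result for each line
    for (i = 1; i <= NF; i++)
      r = r substr($i, length($i))         # substr from the last position = last char
    print r
  }' "$@"
}

printf '%s\n' 'SINGLE X LETTER Z' | last_letters
# prints: EXRZ
printf '%s\n' 'UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS' | last_letters
# prints: GMUCHOS
```

Because length($i) is evaluated per field, a one-character word such as X yields substr("X", 1), which is the word itself.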

rename specific lines in a text file with sed

I have a file that looks like this:
>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj
I would like to edit just the lines starting with >, ideally in-place, to get a file:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj
I know that in principle this is achievable with various combinations of sed/awk/cut, but I haven't been able to figure out the right combination. Ideally it should be fast - the file has many millions of lines, and many of the lines are also very long.
Key things about the lines I want to edit:
Always start with >
The bit I want to keep is always between the first and second pipe symbol | (hence thinking cut is going to help)
The bit I want to keep has alphanumeric symbols and sometimes underscores. The rest of the string on the same line can have any symbols
What I've tried that seems helpful
(Most of my sed attempts are pure garbage)
cut -d '|' -f 2 test.txt
Gets me the bit of the string that I want, and it keeps the other lines too. So it's close, but (of course) it doesn't preserve the initial > on the lines where cut applies, so it's missing a crucial part of the solution.
With sed:
sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
/^>/ to select lines starting with >, not strictly necessary for given sample but sometimes this provides faster result than using s alone
^[^|]+\| this will match non | characters from the start of line
([^|]+) capture the second field
.* rest of the line
>\1 replacement string where \1 will have the contents of ([^|]+)
If your input has only ASCII character, this would give you much faster results:
LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
Timing
Checking the timing results by creating a huge file from given input sample, awk is much faster and mawk is even faster
However, OP reports that the sed solution is faster for the actual data
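The same substitution can also be written without -E, using basic regular expressions, where | is an ordinary character. A minimal sketch (the function name keep_second_field is made up for illustration; header lines without any | are left unchanged):

```shell
# keep_second_field [FILE...]: hypothetical helper.
# On lines starting with ">", keep only ">" plus the text between the
# first and second "|"; every other line is passed through untouched.
keep_second_field() {
  sed '/^>/ s/^[^|]*|\([^|]*\).*/>\1/' "$@"
}

printf '%s\n' '>alks|keep1|aoiuor|lskdjf' 'ldkfj' \
              '>jhoj_kl|keep2|kjghoij|adfjl' 'aldskj' | keep_second_field
# prints:
# >keep1
# ldkfj
# >keep2
# aldskj
```

Prefixing the sed call with LC_ALL=C, as in the answer above, should apply here in the same way when the input is plain ASCII.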
With your shown samples, you could try the following. It sets the field separator to | for every line of Input_file; if a line starts with >, it prints > followed by the 2nd field, otherwise it prints the complete line.
awk -F'|' '/^>/{print ">"$2;next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk -F'|' ' ##Starting awk program from here and setting field separator as | here.
/^>/{ ##Checking condition if line starts from > then do following.
print ">"$2 ##Printing 2nd field of current line here.
next ##next will skip all further statements from here.
}
1 ##Will print current line.
' Input_file ##mentioning Input_file name here.
You can also use the following awk command:
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file
# Inplace replacement with gawk (GNU awk)
gawk -i inplace -F\| '/^>/{print ">"$2} !/^>/{print}' file
# "Inline-like" replacement with any awk
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file > tmp && mv tmp file
Here,
-F\| - sets the field separator to a | char
/^>/ is the condition: if line starts with > (and !/^>/ means the opposite)
{print ">"$2} prints the Field 2 value with a > char prepended to it
{print} simply prints the full line.
Note that !/^>/{print} can be reduced to !/^>/, as print is the default action.
See an online demo:
s='>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj'
awk -F\| '/^>/{print ">"$2} !/^>/{print}' <<< "$s"
Output:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj

removing last character of every word in files

I have multiple files with just one line of simple text. I want to remove last character of every word in each file. Every file has different length of text.
The closest I got is to edit one file:
awk '{ print substr($1, 1, length($1)-1); print substr($2, 1, length($2)-1); }' file.txt
But I can not figure out, how to make this general, for files with different words count.
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file
this should do the removal.
If it was tested ok, and you want to overwrite your file, you can do:
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file > tmp && mv tmp file
Example:
kent$ awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' <<<"foo bar foobar"
fo ba fooba
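Since the question is about several files, the loop above can be wrapped so that each file is rewritten in turn. This is a sketch under assumptions: chop_file is a made-up name, the demo directory is a throwaway one created with mktemp, and assigning to $i makes awk rebuild each line with single spaces between words.

```shell
# chop_file FILE: hypothetical helper, removes the last character of every
# word in FILE and writes the result back to the same file.
chop_file() {
  tmp=$(mktemp) || return 1
  awk '{ for (i = 1; i <= NF; i++) $i = substr($i, 1, length($i) - 1); print }' \
    "$1" > "$tmp" && cat "$tmp" > "$1"     # cat keeps the original file permissions
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Demo on a throwaway directory
demo=$(mktemp -d)
printf '%s\n' 'foo bar foobar' > "$demo/a.txt"
printf '%s\n' 'ABCD ABC BC'    > "$demo/b.txt"
for f in "$demo"/*; do chop_file "$f"; done
cat "$demo/a.txt" "$demo/b.txt"
# prints:
# fo ba fooba
# ABC AB B
rm -rf "$demo"
```

The original file is only overwritten if awk succeeded, so a failed run leaves the file as it was.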
Use awk to loop over every field in each row, up to NF, and apply the substr function.
awk '{for (i=1; i<=NF; i++) {printf "%s ", substr($i, 1, length($i)-1)}}END{printf "\n"}' file
For a sample input file
ABCD ABC BC
The awk logic produces an output
ABC AB B
Another way, by setting the output record separator (ORS) to an empty string and just using print:
awk 'BEGIN{ORS="";}{for (i=1; i<=NF; i++) {print substr($i, 1, length($i)-1); print " "}}END{print "\n"}' file
I would go for a Bash approach:
Since ${var%?} removes the last character of a variable:
$ var="hello"
$ echo "${var%?}"
hell
And you can use the same approach on arrays:
$ arr=("hello" "how" "are" "you")
$ printf "%s\n" "${arr[@]%?}"
hell
ho
ar
yo
What about going through the files, read their only line (you said the files just consist in one line) into an array and use the abovementioned tool to remove the last character of each word:
for file in dir/*; do
read -r -a myline < "$file"
printf "%s " "${myline[@]%?}"
done
Sed version, assuming words are composed only of letters (if not, just adapt the class [[:alpha:]] to your needs) and are separated by spaces and punctuation
sed 's/$/ /;s/[[:alpha:]]\([[:blank:][:punct:]]\)/\1/g;s/ $//' YourFile
awk (gawk for regex boundaries in fact)
gawk '{gsub(/.\>/, "");print}' YourFile
# or optimized by @kent ;-) thanks for the tips
gawk '4+gsub(/.\>/, "")' YourFile
$ cat foo
word1
word2 word3
$ sed 's/\([^ ]*\)[^ ]\( \|$\)/\1\2/g' foo
word
word word
A word is any string of characters excluding space (=[^ ]).
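EDIT: If you want to enforce POSIX (--posix), you can use:
$ sed --posix 's/\([^ ]*\)[^ ]\([ ]\{0,1\}\)/\1\2/g' foo
word
word word
Here \( \|$\) changes to \([ ]\{0,1\}\), i.e. there is an optional space at the end. (The \| alternation and an interval with an omitted lower bound, \{,1\}, are GNU extensions, so the bound is written out as \{0,1\}.)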

Match a string that could have a newline anywhere in it in - bash

I have a string containing a number that is represented as follows:
\S2=number_goes_here\
The number could be anything from 0.00000 and up. However, there could be a newline anywhere in that string, and I am not entirely sure how to go about matching that. Ultimately, I just want the number from this. Importantly, this string is amidst a large chunk of text that can be represented by this sample (S2 is found on the last line there):
1.454187\H,0,0.719618,3.525801,1.633708\H,0,-0.454651,2.80328,2.23844\
Ru,0,0.025774,1.557599,-0.253913\\Version=EM64L-G09RevD.01\State=6-A\H
F=-1238.5377983\S2=8.75446\S2-1=0.\S2A=8.750006\RMSD=2.314e-09\Dipole=
I'm open to bash, sed, awk, gawk; whatever thoughts you have to address this.
EDIT:
Here is an example for which the first answer below does not seem to work correctly. It only prints "2."
.631441,-2.132979\H,0,0.20151,-1.464802,-2.95553\H,0,0.377883,-2.50668
5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.
053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769
grep -Po '(?<=S2=)[\d.]+' <(tr -d '\n' < file)
gives
8.75446
You can use perl, read the whole file in slurp mode, remove newline characters and search it using a regular expression:
perl -0777 -nE '
$_ = join q||, split /\n/;
printf qq|%s\n|, $1 if m/\\S2=([\d.]+)/
' infile
It yields:
8.75446
Also possible using just bash, though this won't work so well for very large files.
#!/bin/bash
IFS=$'\n'
string=$(<"test.txt")
var=$(echo $string) # word-splitting will replace each newline with a space here
while IFS= read -r word; do
[[ $word =~ '\S2='([0-9]*\.[0-9]*)'\' ]] && echo ${BASH_REMATCH[1]}
done <<< "$var"
e.g.
> ./abovescript
8.75446
Here is a GNU awk version (GNU awk is needed because RS has multiple characters):
awk -F'\' 'NR==2 {print $1}' RS="S2=" file
8.75446
A version that works with most awk
awk -F\\ '{for (i=1;i<=NF;i++) if ($i~/S2=/) {split($i,a,"=");print a[2]}}' file
8.75446
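For the case where the line break falls inside the number itself, the newlines have to go before any matching happens. A sketch with only tr and sed; the function name extract_s2 is hypothetical, and it assumes the file contains a single \S2=...\ entry and is small enough to handle as one long line:

```shell
# extract_s2 FILE: hypothetical helper.
# Joins all lines of FILE, then prints the digits and dots found between
# "\S2=" and the next backslash.
extract_s2() {
  { tr -d '\n' < "$1"; echo; } |           # echo restores the final newline for sed
    sed -n 's/.*\\S2=\([0-9.]*\)\\.*/\1/p'
}

# Demo with the wrapped sample from the question's EDIT
sample=$(mktemp)
printf '%s\n' \
  '5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.' \
  '053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769' > "$sample"
extract_s2 "$sample"
# prints: 2.053325
rm -f "$sample"
```

The entries \S2-1= and \S2A= are not picked up because the pattern requires the = sign directly after S2.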

Perl, sed, or awk one-liner to change the format of the file

I need advice on how to change the file formatted following way
file1:
A 504688
B jobnameA
A 504690
B jobnameB
A 504691
B jobnameC
...
into file2:
A B
504688 jobnameA
504690 jobnameB
504691 jobnameC
...
One solution I could think of is:
cat file1 | perl -0777 -p -e 's/\s+B/\t/' | awk '{print $2"\t"$3}'
But I am wondering if there is more efficient way or already known practice that does this job.
perl -nawe 'print "@F[1 .. $#F]", $F[0] eq "A" ? "\t" : "\n"' < /tmp/ab
Look up the options in perlrun.
Another useful one to add is -l (append newline to print), but not in this case.
Assuming your input file is tab separated:
echo $'A\tB'
cut -f2 filename | paste - -
Should be pretty quick because this is exactly what cut and paste were written to do.
awk '/^A/{num=$2}/^B/{print num,$2}' file
Or, alternately,
awk '{num=$2;getline;print num,$2}' file
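If the header row and tab-separated output are wanted as well, the first awk variant can be extended along these lines. A minimal sketch: the function name pair_ab is made up, and it assumes the first column is exactly A or B and that each A line is followed by its B line.

```shell
# pair_ab [FILE...]: hypothetical helper.
# Prints an "A<TAB>B" header, then one tab-separated row per A/B pair.
pair_ab() {
  awk 'BEGIN { OFS = "\t"; print "A", "B" }   # header row, tab as output separator
       $1 == "A" { num = $2 }                 # remember the value from the A line
       $1 == "B" { print num, $2 }            # emit the pair when the B line arrives
      ' "$@"
}

printf '%s\n' 'A 504688' 'B jobnameA' 'A 504690' 'B jobnameB' | pair_ab
```

Unlike the getline variant, this does not depend on the lines strictly alternating from the first line on; a stray line that starts with neither A nor B is simply ignored.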
Here is an sed solution:
sed -e 'N' -e 's/A\s*\(.*\)\nB\s*\(.*\)/\1\t\2/' file
This version will also print the header at the top:
sed '1{h;s/.*/A\tB/p;g};N;s/A\s*\(.*\)\nB\s*\(.*\)/\1\t\2/' file
Or an alternative:
sed -n '/^A\s*/{s///;h};/^B\s*/{s///;H;g;s/\n/\t/p}' file
If your sed does not support semicolons as a command separator for the alternative:
sed -n '
/^A\s*/{ # if the line starts with "A"
s/// # remove the "A" and the whitespace
h # copy the remainder into the hold space
} # end if
/^B\s*/{ # if the line starts with "B"
s/// # remove the "B" and the whitespace
H # append pattern space to hold space
g # copy hold space to pattern space
s/\n/\t/p # replace newline with tab and print
}' file
This version will also print the header at the top:
sed -n '/^A\s*/{s///;h;1s/.*/A\tB/p};/^B\s*/{s///;H;g;s/\n/\t/p}' file
This will work with any header text, not just fixed A and B:
awk '{a=$1;b=$2;getline;if(c!=1){print a,$1;c=1};print b,$2}' file1 >file2
...and it will also print the header row
If you need \t separator, then use:
awk '{a=$1;b=$2;getline;if(c!=1){print a"\t"$1;c=1};print b"\t"$2}' file1 >file2
This might work for you:
sed -e '1i\A\tB' -e 'N;s/A\s*\(\S*\).*\nB\s*\(\S*\).*/\1\t\2/' file