bash scripting - using sed or awk to split and extract data

I'm having trouble with a specific situation. If I have a file filled with entries like:
my.site.example.com
somelinewithnodot
some.line .with.a.weird.space..this.is
this.one.has , and.stuff*.all.I
&&&83%23^&4,I;dont,even.need.2see
Using bash, how can I use awk, sed, or something similar to split each line on "." and print only the entries directly before and directly after the last ".", leaving lines with no "." as they are?
Desired output:
example.com
somelinewithnodot
this.is
all.I
need.2see
I've been trying to use sed but I'm having trouble setting up the regex. I've done stuff like this before but it's been a minute and I'm having trouble remembering how to properly set it up...

Could you please try the following?
awk -F'.' 'NF>1{print $(NF-1) FS $NF;next} 1' Input_file
OR
awk 'BEGIN{FS=OFS="."}NF>1{print $(NF-1) FS $NF;next} 1' Input_file
OR
awk -F'.' 'NF>1{$0=$(NF-1) FS $NF} 1' Input_file
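A minimal sketch of the same idea wrapped in a shell function, assuming a POSIX awk. The name last_two_fields is made up for illustration. Lines with fewer than two dot-separated fields pass through unchanged.

```shell
#!/bin/sh
# Hypothetical helper: print the last two dot-separated fields of each
# input line; lines without a dot are printed as they are.
last_two_fields() {
    awk -F'.' 'NF > 1 { print $(NF-1) FS $NF; next } { print }'
}

# Demo with three lines taken from the question.
printf '%s\n' \
    'my.site.example.com' \
    'somelinewithnodot' \
    'this.one.has , and.stuff*.all.I' | last_two_fields
# prints:
# example.com
# somelinewithnodot
# all.I
```

Because a line with exactly one dot already has only two fields, it comes out unchanged as well.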

You can use substitution with sed:
sed 's/^\([^.]*\.\)*\([^.]\+\.[^.]\+\)$/\2/'

This might work for you (GNU sed):
sed -E 's/.*[.](.*[.].*)$/\1/' file
Match everything up to and including the second-to-last ., and replace the whole line with the last . and the words on either side of it.
Alternative:
sed 's/.*\.\(.*\..*\)$/\1/' file
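To see why the greedy .* lands on the second-to-last dot, here is a small sketch using the basic-regex form above. The function name last_two_sed is hypothetical. The leading .* takes as much as it can while still leaving one dot for \. and one more dot inside the group, so a line needs at least two dots for the substitution to apply.

```shell
#!/bin/sh
# Hypothetical wrapper around the BRE substitution shown above.
last_two_sed() {
    sed 's/.*\.\(.*\..*\)$/\1/'
}

printf '%s\n' \
    'my.site.example.com' \
    'one.dot' \
    'nodot' \
    'some.line .with.a.weird.space..this.is' | last_two_sed
# prints:
# example.com
# one.dot
# nodot
# this.is
```

Lines with zero or one dot do not match the pattern at all, so sed prints them untouched.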

You can try Perl also
perl -ne ' /(^[^\.]+$)|(?<=\.)([^\.]+\.[^\.]+$)/g and print "$1$2" '
with Inputs
$ cat johnred.txt
my.site.example.com
somelinewithnodot
some.line .with.a.weird.space..this.is
this.one.has , and.stuff*.all.I
&&&83%23^&4,I;dont,even.need.2see
$ perl -ne ' /(^[^\.]+$)|(?<=\.)([^\.]+\.[^\.]+$)/g and print "$1$2" ' johnred.txt
example.com
somelinewithnodot
this.is
all.I
need.2see
$
. loses its special meaning when used in [ ], so you can use
perl -ne ' /(^[^.]+$)|(?<=\.)([^.]+\.[^.]+$)/g and print "$1$2" ' johnred.txt
Another solution using array operation
perl -lne ' @b=$_=~/([^.]+)/g ; print $b[-2]? "$b[-2].":"", $b[-1] ' johnred.txt

Related

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string using an awk command:
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
If I don't know how many characters a word contains, what is the correct command to print the last character of a given column? And instead of repeating the substr command, how can I use it only once to print specific characters from different columns?

If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} - iterates over all fields in the current record; i is the field index, $i is the field value, and the last character of each field (retrieved with substr($i,length($i))) is appended to the r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' <<< "$s"
Output:
GMUCHOS
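A short sketch of the per-line variant as a reusable function, assuming a POSIX awk. The name last_letters is hypothetical. Taking the length from $i (not $1) is what makes words of different lengths, including single-letter words, work.

```shell
#!/bin/sh
# Hypothetical helper: for each input line, concatenate the last
# character of every whitespace-separated word.
last_letters() {
    awk '{
        r = ""                              # reset the result for each line
        for (i = 1; i <= NF; i++)
            r = r substr($i, length($i))    # last character of field i
        print r
    }'
}

printf '%s\n' \
    'UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS' \
    'SINGLE X LETTER Z' | last_letters
# prints:
# GMUCHOS
# EXRZ
```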

Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS

1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set the record separator to any character followed by whitespace or by the end of the input. Each matched separator is available in RT; strip the whitespace from it and append what remains (the last letter of the word) to val. When awk has finished reading Input_file, print val.
2nd solution: Set the record separator to the empty string and use the match function with the regex (.[[:space:]]+)|(.$) to find the last letter of each word. Append each match to a variable, and in the END block strip the whitespace and print the variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file

Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS

using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
Separate the words into lines, reverse each one so that the first character is easy to pick, and finally paste them back together without a delimiter. Not the shortest solution, but I think the most trivial one...

I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Tell AWK to treat any alphabetic character at the end of a word as a field, and to use the empty string as the output field separator. $1=$1 triggers rebuilding of the line with the specified OFS. If you want to know more about start/end of word matching, read GNU Regexp Operators.
(tested in gawk 4.2.1)

Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i=1; i<=n; i++) printf "%s", a[i]; print "" }' file
GMUCHOS

An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatically:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Whereas the demo variant that takes the length from $1, and the gensub answer that needs at least two characters per word, do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

bash regex multiple match in one line

I'm trying to process my text.
For example, I have:
asdf asdf get.this random random get.that
get.it this.no also.this.no
My desired output is:
get.this get.that
get.it
So the regexp should match only this pattern (get.\w), but it has to be applied repeatedly because there can be multiple occurrences in one line, so the easiest way with sed
sed 's/.*(REGEX).*/\1/'
does not work (it shows only one occurrence per line).
Probably the best way is to use grep -o, but I have an old version of grep and the -o flag is not available.

This grep may give what you need:
grep -o "get[^ ]*" file

Try awk:
awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
You might need to tweak the regex between the slashes for your specific issue. Sample output:
$ awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
get.this
get.that
get.it
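If the matches should stay grouped per input line, as in the desired output, and grep -o is not available, one option is a match() loop. This is a sketch assuming a POSIX awk; the function name extract_get is made up. Note the pattern has no word boundary, so a word like "forget.this" would also yield "get.this".

```shell
#!/bin/sh
# Hypothetical helper: print every "get.<word>" occurrence on a line,
# space-separated, one output line per input line that has a match.
extract_get() {
    awk '{
        out = ""
        rest = $0
        # match() sets RSTART/RLENGTH for the leftmost match in rest.
        while (match(rest, /get\.[A-Za-z0-9_]+/)) {
            out = out (out == "" ? "" : " ") substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)   # continue after the match
        }
        if (out != "") print out
    }'
}

printf '%s\n' \
    'asdf asdf get.this random random get.that' \
    'get.it this.no also.this.no' | extract_get
# prints:
# get.this get.that
# get.it
```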

With awk:
awk -v patt="^get" '{
for (i=1; i<=NF; i++)
if ($i ~ patt)
printf "%s%s", $i, OFS;
print ""
}' <<< "$text"
bash
while read -a words; do
for word in "${words[@]}"; do
if [[ $word == get* ]]; then
echo -n "$word "
fi
done
echo
done <<< "$text"
perl
perl -lane 'print join " ", grep {$_ =~ /^get/} @F' <<< "$text"

This might work for you (GNU sed):
sed -r '/\bget\.\S+/{s//\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1 /g;s/ $//}' file
or if you want one per line:
sed -r '/\n/!s/\bget\.\S+/\n&\n/g;/^get/P;D' file

Match a string that could have a newline anywhere in it in - bash

I have a string containing a number that is represented as follows:
\S2=number_goes_here\
The number could be anything from 0.00000 and up. However, there could be a newline anywhere in that string, and I am not entirely sure how to go about matching that. Ultimately, I just want the number from this. Importantly, this string is amidst a large chunk of text that can be represented by this sample (S2 is found on the last line there):
1.454187\H,0,0.719618,3.525801,1.633708\H,0,-0.454651,2.80328,2.23844\
Ru,0,0.025774,1.557599,-0.253913\\Version=EM64L-G09RevD.01\State=6-A\H
F=-1238.5377983\S2=8.75446\S2-1=0.\S2A=8.750006\RMSD=2.314e-09\Dipole=
I'm open to bash, sed, awk, gawk; whatever thoughts you have to address this.
EDIT:
Here is an example. The first answer below does not seem to work correctly for it; it only prints "2."
.631441,-2.132979\H,0,0.20151,-1.464802,-2.95553\H,0,0.377883,-2.50668
5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.
053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769

grep -Po '(?<=S2=)[\d.]+' <(tr -d '\n' < file)
gives
8.75446
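Where grep lacks -P, the same "delete the newlines first, then match" approach can be sketched with tr and sed. The function name extract_s2 is hypothetical, and the sketch assumes the value is made of digits and dots and is followed by a backslash, as in the samples.

```shell
#!/bin/sh
# Hypothetical helper: join all input lines, then print the number
# that follows the literal text \S2= and precedes the next backslash.
extract_s2() {
    tr -d '\n' | sed -n 's/.*\\S2=\([0-9.]*\)\\.*/\1/p'
}

# The second sample from the question, where the number is split
# across two lines ("2." and "053325").
printf '%s\n' \
    '5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.' \
    '053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769' | extract_s2
# prints 2.053325
```

Requiring the leading backslash and the = sign keeps \S2-1= and \S2A= from matching.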

You can use perl, read the whole file in slurp mode, remove newline characters and search it using a regular expression:
perl -0777 -nE '
$_ = join q||, split /\n/;
printf qq|%s\n|, $1 if m/\\S2=([\d.]+)/
' infile
It yields:
8.75446

Also possible using just bash, though this won't work so well for very large files.
#!/bin/bash
IFS=$'\n'
string=$(<"test.txt")
var=$(echo $string) # word-splitting will replace each newline with a space here
while IFS= read -r word; do
[[ $word =~ '\S2='([0-9]*\.[0-9]*)'\' ]] && echo ${BASH_REMATCH[1]}
done <<< "$var"
e.g.
> ./abovescript
8.75446

Here is a GNU awk version (GNU-specific because RS has multiple characters):
awk -F'\' 'NR==2 {print $1}' RS="S2=" file
8.75446
A version that works with most awk
awk -F\\ '{for (i=1;i<=NF;i++) if ($i~/S2=/) {split($i,a,"=");print a[2]}}' file
8.75446

i have a file and i need to extract a particular string followed after the regex 'LN:' from the second line

Please refer to the file contents below.
@HD VN:1.0 SO:unsorted
@SQ SN:Chr1 LN:30427680
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
How can I extract just the number 30427680 using awk or any other Unix command?

Using sed
sed -n 's/.*LN://p' < input.txt
This will erase everything up until LN:, and print what's left, and only if a substitution did take place.
Using awk
awk -v FS=: '/LN:/ { print $3; }' < input.txt
This will match lines that contain LN:, use : as field separator, and print the 3rd column.
Using grep
grep -o '[0-9]\{3,\}' < input.txt
This will match sequences of 3 or more digits, and print only the matched pattern thanks to the -o.
Depending on other cases not included in your question, you might have to make the patterns more strict.
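As one example of a stricter pattern, this sketch only accepts a whitespace-separated field that is exactly LN: followed by digits. It assumes a POSIX awk; the function name ln_value is made up, and the sample lines are written with the @ prefix that SAM headers normally use, which is an assumption about the original file.

```shell
#!/bin/sh
# Hypothetical helper: print the digits of any field of the form LN:<digits>.
ln_value() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^LN:[0-9]+$/)
                print substr($i, 4)   # drop the 3-character "LN:" prefix
    }'
}

printf '%s\n' \
    '@HD VN:1.0 SO:unsorted' \
    '@SQ SN:Chr1 LN:30427680' \
    '@PG ID:bowtie2 PN:bowtie2 VN:2.1.0' | ln_value
# prints 30427680
```

Unlike the digits-only grep, this will not pick up other long numbers that happen to appear in the file.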

Using grep:
grep -oP 'LN:\K.*' filename

Just use grep:
grep -o 30427680 file
-o, --only-matching
Prints only the matching part of the lines.

Using perl :
perl -ne 'print $& if /LN:\K.*/' filename
or
perl -ne 'print $1 if /LN:(.*)/' filename

Another awk
awk -F"LN:" 'NF>1 {print $2}' file

Substitute a regex pattern using awk

I am trying to write a regex expression to replace one or more '+' symbols present in a file with a space. I tried the following:
echo This++++this+++is+not++done | awk '{ sub(/\++/, " "); print }'
This this+++is+not++done
Expected:
This this is not done
Any ideas why this did not work?

Use gsub which does global substitution:
echo This++++this+++is+not++done | awk '{gsub(/\++/," ");}1'
sub function replaces only 1st match, to replace all matches use gsub.
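The difference is easy to see side by side. A minimal sketch, assuming a POSIX awk; the two function names are hypothetical, and [+]+ is used in place of \++ only to sidestep escaping differences between awk implementations.

```shell
#!/bin/sh
# sub() replaces only the first run of plus signs on each line.
plus_to_space_first() {
    awk '{ sub(/[+]+/, " "); print }'
}

# gsub() replaces every run of plus signs on each line.
plus_to_space_all() {
    awk '{ gsub(/[+]+/, " "); print }'
}

echo 'This++++this+++is+not++done' | plus_to_space_first
# prints: This this+++is+not++done
echo 'This++++this+++is+not++done' | plus_to_space_all
# prints: This this is not done
```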

Or the tr command:
echo This++++this+++is+not++done | tr -s '+' ' '

The idiomatic awk solution would be just to translate the input field separator to the output separator:
$ echo This++++this+++is+not++done | awk -F'[+]+' '{$1=$1}1'
This this is not done

Try this
echo "This++++this+++is+not++done" | sed -re 's/(\+)+/ /g'

You could use sed too.
echo This++++this+++is+not++done | sed -e 's/+\{1,\}/ /g'
This matches one or more + and replaces it with a space.

For this case I recommend sed, this is powerful for substitution and has a short syntax.
Solution sed:
echo This++++this+++is+not++done | sed -En 's/\++/ /gp'
Result:
This this is not done
For awk:
You must use the gsub function for global line substitution (more than one substitution).
The syntax:
gsub(regexp, replacement [, target]).
If the third parameter is omitted, then $0 is the target.
The target must be a variable, field, or array element. gsub works in place, overwriting the matched parts of the target with the replacement.
Solution awk:
echo This++++this+++is+not++done | awk 'gsub(/\++/," ")'
Result:
This this is not done
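A small sketch of the optional third argument and of gsub's return value (the number of substitutions made), assuming a POSIX awk. The function name count_plus_runs and the variable name line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy the record into a variable, run gsub on that
# variable (the third argument), and report how many runs were replaced.
count_plus_runs() {
    awk '{
        line = $0
        n = gsub(/[+]+/, " ", line)   # modifies line, leaves $0 untouched
        print n ": " line
    }'
}

echo 'This++++this+++is+not++done' | count_plus_runs
# prints: 4: This this is not done
```

The one-liner above relies on the same return value: used as a pattern, gsub(...) is true (non-zero) only for lines where at least one substitution happened, so only those lines are printed.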

echo "This++++this+++is+not++done" | sed 's/++*/ /g'

If you have access to node on your computer, you can do it by installing rexreplace
npm install -g rexreplace
and then run
rexreplace '\++' ' ' myfile.txt
Or, if you have more files in a dir data, you can do
rexreplace '\++' ' ' data/*.txt
