using sed to insert whitespaces between a number and word - regex

I have a series of files that use fixed-width fields instead of comma-separated values. They all look like this:
2015/09/29 659027 RIH619 25 105.80IN921186
2015/09/29 659027 RIH619 25 105.80IN921186
2015/09/29 659027 RIH619 25 105.80IN921186
2015/09/29 659027 RIH619 25 105.80IN921186
I would like to replace all the spaces with commas. I have a piece of code that accomplishes this:
sed -r 's/^\s+//;s/\s+/,/g'
After running the code I get this result:
2015/09/29,659027,RIH619,25,105.80IN921186
2015/09/29,659027,RIH619,25,105.80IN921186
2015/09/29,659027,RIH619,25,105.80IN921186
2015/09/29,659027,RIH619,25,105.80IN921186
My problem is that the files I get don't have a space between the amount and the reference. My output needs to look like this:
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
What I tried is:
sed -r 's/^\s+//;s/\.\d\d\D+/\.\d\d,\D/;s/\s+/,/g'
But it didn't seem to do anything

with tr and sed
tr ' ' ',' <file | sed -r 's/(\.[0-9]{2})/\1,/'

You can use this single sed for both:
sed -r 's/[[:blank:]]+/,/g; s/([[:digit:]])([[:alpha:]])/\1,\2/g' file
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
([[:digit:]]) matches a digit and captures it in group#1
([[:alpha:]]) matches a letter and captures it in group#2
\1,\2 places a comma between 2 groups.
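Note that the digit-then-letter rule above puts a comma between any digit and a following letter, so a field such as 12AB would be split as well. A narrower sketch, assuming the amount always ends in exactly two decimals directly followed by a letter (an assumption based on the sample lines), might look like the following. It sticks to bracket expressions and POSIX classes because sed does not treat the Perl shorthands \d and \D from the attempt in the question as digit classes. The function name to_csv is made up for illustration.

```shell
# to_csv: hypothetical helper; reads space-separated lines on stdin, writes CSV.
to_csv() {
    sed -E '
        # drop leading blanks
        s/^[[:blank:]]+//
        # split an amount such as 105.80IN... right after its two decimals
        s/(\.[0-9]{2})([[:alpha:]])/\1,\2/
        # every remaining run of blanks becomes one comma
        s/[[:blank:]]+/,/g
    '
}

printf '%s\n' '2015/09/29 659027 RIH619 25 105.80IN921186' | to_csv
```

With this variant a field like RIH619 or 12AB is left alone, since only a dot followed by two digits triggers the extra comma.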

GNU awk (gawk) has fixed field width support that is good for this sort of thing:
$ echo "2015/09/29 659027 RIH619 25 105.80IN921186" |
awk 'BEGIN { FIELDWIDTHS="10 1 6 1 6 1 2 1 6 8"; OFS="," }{ print $1,$3,$5,$7,$9,$10 }'
2015/09/29,659027,RIH619,25,105.80,IN921186
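Where gawk is not available, the same split can be sketched with substr, which every POSIX awk provides. The offsets below are derived from the sample line and assume every record has exactly that layout; split_fixed is a hypothetical name.

```shell
# split_fixed: hypothetical helper using only POSIX awk features.
split_fixed() {
    # Offsets are 1-based: a 10-char field, then fields of 6, 6, 2 and 6 chars
    # with one separator column between them; the last field runs to end of line.
    awk -v OFS=, '{
        print substr($0, 1, 10), substr($0, 12, 6), substr($0, 19, 6),
              substr($0, 26, 2), substr($0, 29, 6), substr($0, 35)
    }'
}

printf '%s\n' '2015/09/29 659027 RIH619 25 105.80IN921186' | split_fixed
```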

Related

Inserting a "," in a particular position of a text

(I have pasted the exact text and the command I ran, so this may look a bit messy.)
I have a .TXT file that looks like this:
11111111111111111111111111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111111111111111111111111
And the outcome I am looking for is:
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,111,111
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,111,111
The command I tried is:
sed -i 's/\(.\{14\}\)\(.\{7\}\)\(.\{2\}\)\(.\{1\}\)\(.\{3\}\)\(.\{13\}\)\(.\{1\}\)\(.\{8\}\)\(.\{16\}\)\(.\{3\}\)/\1,\2,\3,\4,\5,\6,\7,\8,\9,\10,/' SOME.TXT
And the outcome I got was:
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,1111111111111110,111
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,1111111111111110,111
I have no idea why these 0s suddenly appeared, or why the ',' doesn't show up in the position I specified, even though the command worked halfway.
Is this a bug or something in sed command?
It prints 0 in the output because sed back-references only go up to \9, so \10 is interpreted as \1 followed by a literal 0.
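A minimal demonstration of that behaviour, using a made-up input where each of ten groups captures one letter (demo_backref is a hypothetical name):

```shell
# Ten groups capture the letters a..j. In the replacement, \10 is read as
# back-reference \1 (the letter a) followed by a literal 0, not as group 10.
demo_backref() {
    printf '%s\n' 'abcdefghijk' |
        sed -E 's/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)/[\10]/'
}

demo_backref   # prints [a0]k
```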
You can solve it easily using the FIELDWIDTHS feature of GNU awk:
awk -v OFS=, 'BEGIN { FIELDWIDTHS = "14 7 2 1 3 13 1 8 16 3 *" } {$1 = $1} 1' file
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,111,111
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,111,111
Just as an academic exercise, here is a working sed that solves this using two substitutions:
sed -E 's/(.{14})(.{7})(.{2})(.)(.{3})(.{13})(.)(.{8})(.+)/\1,\2,\3,\4,\5,\6,\7,\8,\9/; s/(.+,.{16})(.{3})(.*)/\1,\2,\3/' file
sed can't reference capture groups > 9, Perl can:
perl -i -pe 's/(.{14})(.{7})(.{2})(.)(.{3})(.{13})(.)(.{8})(.{16})(.{3})/$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,/' SOME.TXT
If you insist on using sed, you can do something like:
sed 's/./&,/68;s/./&,/65;s/./&,/49;s/./&,/41;s/./&,/40;s/./&,/27;s/./&,/24;s/./&,/23;s/./&,/21;s/./&,/14' test.txt
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,111,111
11111111111111,1111111,11,1,111,1111111111111,1,11111111,1111111111111111,111,111
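The offsets in that last command are the running totals of the field widths (14, 21, 23, and so on), applied from right to left so that a comma already inserted never shifts a position that is still to be processed. A small sketch that builds the sed script from a list of widths could look like this; insert_commas is a hypothetical helper, not something from the answers above.

```shell
# insert_commas WIDTH...: read lines on stdin and insert a comma after each
# field of the given widths. Runs in a subshell so its variables stay local.
insert_commas() (
    pos=0
    script=''
    for width in "$@"; do
        pos=$((pos + width))
        # Prepend each command so the largest offset is applied first.
        script="s/./&,/${pos};${script}"
    done
    sed "${script%;}"
)

printf '%s\n' 'aaabbccccd' | insert_commas 3 2 4   # prints aaa,bb,cccc,d
```

For the file in the question the call would be insert_commas 14 7 2 1 3 13 1 8 16 3 < SOME.TXT.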

How to match a regex 1 to 3 times in a sed command?

Problem
I want to get any text that consists of one to three digits followed by a %, but without the %, using sed.
What I tried
So I guess the following regex should match the right pattern: [0-9]{1,3}%.
Then I can use this sed command to capture the digits and print only them:
sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
Example
However, when I run it, it shows:
$ echo "100%" | sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
0
instead of
100
Obviously, there's something wrong with my sed command, and I think the problem comes from here:
[0-9]{1,3}
which apparently doesn't do what I want it to do.
edit:
Solution
The .* at the start of sed -nE 's/.*([0-9]{1,3})%.*/\1/p' "ate" the first two digits.
The right way to write it, according to Wicktor's answer, is:
sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
The .* grabs all digits leaving just the last of the three digits in 100%.
Use
sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
Details
(.*[^0-9])? - (Group 1) an optional sequence of any 0 or more chars up to the non-digit char including it
([0-9]{1,3}) - (Group 2) one to three digits
% - a % char
.* - the rest of the string.
The match is replaced with Group 2 contents, and that is the only value printed since -n suppresses the default line output.
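To see the difference on a few inputs, the two commands can be wrapped in small functions. The function names and the extra sample lines are made up for illustration; only the 100% case comes from the question.

```shell
# greedy_pct: the original attempt; the leading .* swallows all but one digit.
greedy_pct() {
    sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
}

# extract_pct: the corrected version; the optional prefix must end in a non-digit.
extract_pct() {
    sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
}

printf '%s\n' '100%' | greedy_pct     # prints 0
printf '%s\n' '100%' 'abc 7% done' 'load: 42%' 'no percent sign' | extract_pct
# prints 100, 7 and 42 on separate lines; the last input prints nothing
```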
It will be easier to use a cut + grep option:
echo "abc 100%" | cut -d% -f1 | grep -oE '[0-9]{1,3}'
100
echo "100%" | cut -d% -f1 | grep -oE '[0-9]{1,3}'
100
Or else you may use this awk:
echo "100%" | awk 'match($0, /[0-9]{1,3}%/){print substr($0, RSTART, RLENGTH-1)}'
100
Or, if you have GNU grep, use its -P (PCRE) option (GNU grep is often installed as ggrep on macOS; on most Linux systems it is plain grep):
echo "abc 100%" | ggrep -oP '[0-9]{1,3}(?=%)'
100
This might work for you (GNU sed):
sed -En 's/.*\<([0-9]{1,3})%.*/\1/p' file
This is a filtering exercise, so use the -n option.
Use a capture group to grab 1 to 3 digits followed by %, and print the result if the substitution succeeds.
N.B. The \< ensures the digits start on a word boundary; \b could also be used. The -E option is employed to reduce the number of backslashes which would normally be necessary to quote the (, ), { and } metacharacters.

How to extract a number out of a string preceded by zeroes

I have a string that looks like this: SOMETHING00000076XYZ
How can I extract the number 76 out of the string using a shell script? Note that 76 is preceded by zeroes and followed by letters.
1st solution: If you are OK with awk, try the following.
echo "SOMETHING00000076XYZ" | awk 'match($0,/0+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/0+/,"",val);print val;val=""}'
In case you want to save the result in a variable, use the following.
variable="$(echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1')"
2nd solution: One more awk solution (keeping your sample in mind).
echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1'
Here is a sed option:
echo "SOMETHING00000076XYZ" | sed -r 's/[^0-9]*0*([0-9]+).*/\1/g';
76
Here is an explanation of the regex pattern used:
[^0-9]* match zero or more non digits
0* match zero or more 0's
([0-9]+) match AND capture one or more digits (the number that remains after the leading zeros)
.* match the remainder of the string
Then, we just replace with \1, which is the first (and only) capture group.
echo 'SOMETHING00000076XYZ' | grep -o '[1-9][0-9]*'
Using gnu grep:
grep -oP '0+\K\d+' <<< 'SOMETHING00000076XYZ'
76
\K discards everything matched so far from the reported match, so only the digits after the leading zeros are printed.
Here is another variant of awk:
awk -F '0+' 'match($2, /^[0-9]+/){print substr($2, 1, RLENGTH)}' <<< 'SOMETHING00000076XYZ'
76
You can try Perl as well
$ echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 '
76
$ a=$(echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 ')
$ echo $a
76
$ echo 'SOMETHING00000076XYZ' | awk '{sub(/^[^0-9]+/,""); print $0+0}'
76
You can use sed as
echo "SOMETHING00000076XYZ" | sed "s/[a-zA-Z]//g" | sed "s/^0*//"
The first step is for removing all letters
The second step is for removing leading zeroes
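For completeness, the same extraction can be sketched with shell parameter expansion only, without calling sed, awk or grep. This is not taken from the answers above; it assumes the string contains a single run of digits, and strip_to_number is a hypothetical name.

```shell
# strip_to_number STRING: print the first run of digits without leading zeros.
strip_to_number() (
    s=$1
    s=${s#"${s%%[0-9]*}"}    # drop the non-digit prefix (SOMETHING)
    s=${s%%[!0-9]*}          # drop everything from the first non-digit on (XYZ)
    # Remove leading zeros one at a time, but keep a lone 0.
    while [ "${#s}" -gt 1 ] && [ "${s#0}" != "$s" ]; do
        s=${s#0}
    done
    printf '%s\n' "$s"
)

strip_to_number 'SOMETHING00000076XYZ'   # prints 76
```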

How to delete lines before a match while preserving it?

I have the following script to remove all lines before a line that matches a word:
str='
1
2
3
banana
4
5
6
banana
8
9
10
'
echo "$str" | awk -v pattern=banana '
print_it {print}
$0 ~ pattern {print_it = 1}
'
It returns:
4
5
6
banana
8
9
10
But I want to include the first match too. This is the desired output:
banana
4
5
6
banana
8
9
10
How could I do this? Do you have any better idea with another command?
I've also tried sed '0,/^banana$/d', but it seems to work only with files, and I want to use it with a variable.
And how could I get all lines before a match using awk?
I mean, with banana as the regex, this would be the output:
1
2
3
This awk should do:
echo "$str" | awk '/banana/ {f=1} f'
banana
4
5
6
banana
8
9
10
sed -n '/^banana$/,$p'
Should do what you want. -n instructs sed to print nothing by default, and the p command specifies that all addressed lines should be printed. This will work on a stream, and is different than the awk solution since this requires the entire line to match 'banana' exactly whereas your awk solution merely requires 'banana' to be in the string, but I'm copying your sed example. Not sure what you mean by "use it with a variable". If you mean that you want the string 'banana' to be in a variable, you can easily do sed -n "/$variable/,\$p" (note the double quotes and the escaped $) or sed -n "/^$variable\$/,\$p" or sed -n "/^$variable"'$/,$p'. You can also echo "$str" | sed -n '/banana/,$p' just like you do with awk.
Just invert the commands in the awk:
echo "$str" | awk -v pattern=banana '
$0 ~ pattern {print_it = 1}   # if the line matches, activate the flag
print_it {print}              # if the flag is active, print the line
'
The print_it flag is activated when the pattern is found. From that moment on (including that line), lines are printed while the flag is on. Previously the print was done before the check.
cat in.txt | awk "/banana/,0"
In case you don't want to preserve the matched line then you can use
cat in.txt | sed "0,/banana/d"
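The question also asks how to get only the lines before the first match. A minimal awk sketch for both directions, using the same -v pattern convention as the question (the function names are made up for illustration):

```shell
# from_match PATTERN: print the first matching line and everything after it.
from_match() {
    awk -v pattern="$1" '$0 ~ pattern {found = 1} found'
}

# before_match PATTERN: print only the lines that precede the first match.
before_match() {
    awk -v pattern="$1" '$0 ~ pattern {exit} {print}'
}

printf '%s\n' 1 2 3 banana 4 banana 5 | from_match banana
printf '%s\n' 1 2 3 banana 4 banana 5 | before_match banana
```

Both read standard input, so they work on a variable via echo "$str" | from_match banana just as well as on a file.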

If the first space is 2 spaces, make it 1 in a file

I have a text file, and in some lines the first space from the left is 2 spaces long; I want it to be 1 space long. What's the script for this in bash?
123  2 5//problem
1 2 5
1 2 5
1  32 5//problem
What I want:
123 2 5
1 2 5
1 2 5
1 32 5
tr way (note that this squeezes every run of spaces on the line, not only the first one):
tr -s ' ' < test.txt
Using sed:
sed 's/^\([^ ][^ ]*[ ]\)[ ]*/\1/' input
Starting from the left
^
match and capture non-space characters and a space
\([^ ][^ ]*[ ]\)
and any number of additional spaces:
[ ]* # remove the star if you only care about exactly 2 spaces
and replace these with the captured part:
\1
Edit: I realized that David's answer was almost right.
You can use sed.
cat x | sed -e 's/ \+/ /'
This replaces the first occurrence of one or more spaces with a single space.
But you can do it purely in bash as well:
cat x | while read -r a b ; do echo "$a" "$b" ; done
This splits each line at the first word, and echoes back the first word and the rest of the line. The result is that there is only one space between the first word and the rest of the line.
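The approaches in this thread differ in whether they touch only the first run of spaces or every run. A short sketch that makes the difference visible, written with plain POSIX sed syntax (the function names and the sample line are made up for illustration):

```shell
# squeeze_first: collapse only the first run of spaces on each line.
squeeze_first() {
    sed 's/  */ /'
}

# squeeze_all: collapse every run of spaces on each line.
squeeze_all() {
    tr -s ' '
}

printf '%s\n' '1  32  5' | squeeze_first   # prints "1 32  5"
printf '%s\n' '1  32  5' | squeeze_all     # prints "1 32 5"
```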