sed regex match non-whitespace or tab - regex

I am trying to parse input that looks like this:
i171_chr1_C_MSTA_K0.184_full i266_chr1_+_MSTA_K0.195_full 92.06 2255 125 21 1 2221 2235 1 0.0 3123
i172_chr1_+_MLT1D_K0.575_full i172_chr1_+_MLT1D_K0.575_full 100.00 2290 0 0 1 2290 1 2290 0.0 4229
i172_chr1_+_MLT1D_K0.575_full i172_chr1_+_MLT1D_K0.575_full 100.00 2290 0 0 1 2290 1 2290 0.0 4229
Desired output is:
i171 1 i266 1 92
i172 1 i172 1 100
i172 1 i172 1 100
In other words, I am extracting the name before the first "_" into the first column and the part after chr into the second column (similarly for the third and fourth columns).
I wrote command that works properly for first four columns:
grep -v "#" blastGE90_lengthGE1000 | cut -f 1,2 | sed -r 's/(.+)_chr([0-9XY]+)_.+\t(.+)_chr([0-9XY]+).+/\1 \2 \3 \4/'
However, when I try to match third column in input, I am not successful. I always match the last match instead of one I want:
grep -v "#" blastGE90_lengthGE1000 | cut -f 1,2 | sed -r 's/(.+)_chr([0-9XY]+)_.+\t(.+)_chr([0-9XY]+).+([0-9]+\.).+/\1 \2 \3 \4 \5/'
Therefore, I would like to use a regexp that matches non-whitespace characters or stops at a tab, but I can't figure it out.

I have fixed your command:
grep -v "#" blastGE90_lengthGE1000 | cut -f 1-3 | sed -r 's/(.+)_chr([0-9XY]+)_.+\t(.+)_chr([0-9XY]+)_.+\t([0-9]+).+/\1 \2 \3 \4 \5/'
You need to use cut -f 1-3 not cut -f 1,2 because you need the first three columns.
I also fixed the last capture group in the sed expression.
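The greedy `.+` is what lets the original attempt latch onto the last possible match. As a minimal sketch of the same extraction with explicit character classes, assuming a sed that accepts `-E` (GNU and BSD sed both do): `[^_]+` stops at the first underscore and `[^[:space:]]+` stops at the first tab or space, so it works for tab-separated and space-separated columns alike. The function name `extract_ids` is made up for illustration.

```shell
#!/bin/sh
# extract_ids: hypothetical helper name, not part of the original answer.
# Reads BLAST-like rows on stdin and prints: name1 chr1 name2 chr2 int(identity)
extract_ids() {
    grep -v '#' |
    sed -E 's/^([^_]+)_chr([0-9XY]+)_[^[:space:]]+[[:space:]]+([^_]+)_chr([0-9XY]+)_[^[:space:]]+[[:space:]]+([0-9]+).*/\1 \2 \3 \4 \5/'
}

# Example row with tab-separated columns
printf 'i171_chr1_C_MSTA_K0.184_full\ti266_chr1_+_MSTA_K0.195_full\t92.06\t2255\n' | extract_ids
```

Because no group can run past a column separator, there is no need to trim the input with cut first.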

I would use awk here:
$ awk -F'_| +' '{gsub(/chr/,"");print $1,$2,$7,$8,int($13)}' file
i171 1 i266 1 92
i172 1 i172 1 100
i172 1 i172 1 100
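That one-liner depends on fixed field positions after splitting on underscores and spaces. If the real file is tab-separated (as the `cut -f` in the question suggests), a sketch that splits on tabs first and on underscores second might look like this. The name `extract_ids_awk` is illustrative only.

```shell
#!/bin/sh
# extract_ids_awk: hypothetical helper; expects tab-separated rows on stdin.
extract_ids_awk() {
    awk -F'\t' '
        /#/ { next }                  # skip lines containing "#", like grep -v "#"
        {
            split($1, q, "_")         # q[1] = name, q[2] = "chr..."
            split($2, s, "_")
            sub(/^chr/, "", q[2])     # keep only the chromosome label
            sub(/^chr/, "", s[2])
            print q[1], q[2], s[1], s[2], int($3)
        }'
}

printf 'i171_chr1_C_MSTA_K0.184_full\ti266_chr1_+_MSTA_K0.195_full\t92.06\t2255\n' | extract_ids_awk
```

Splitting each column separately means extra underscores later in a name do not shift the field numbers.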

Bash: How do I extract the count of all the "n" digit numbers in a string? [duplicate]

I am trying to count the n-digit numbers in a string using bash.
E.g. for 3-digit numbers:
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1
Unfortunately, I will not be able to use sed or grep with perl-regexp.
Appreciate any suggestions!
You can use regular expressions in bash.
#! /bin/bash
cat <<EOF |
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3, but should ignore 12345
I have 243 pens for sale #should return 1
123 should work at text boundaries, too 123
EOF
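while IFS= read -r line ; do
    c=0
    while [[ $line =~ ([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$) ]] ; do
        m=${BASH_REMATCH[0]}
        # Drop everything up to and including the match; quoting "$m" keeps glob characters literal.
        line=${line#*"$m"}
        ((++c))
    done
    echo "$c"
done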
The regex explained:
([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$)
 ~~~~~~~~~~~~                                      non-digit
             ~~                                    or the very beginning
                 ~~~~~~~~~~~~~~~                   three digits
                                  ~~~~~~~~~~~~     non-digit
                                              ~~   or the very end
As bash can't match the same string several times, we need to remove the already processed part from the string before trying another match.
echo "$str" | grep -o '\b[0-9]\{3\}\b' | wc -l
This way we match 3-digit numbers inside word boundaries, which are allowed to be re-used (e.g. if two numbers are separated by a single boundary character, like a comma or a space).
Or like this:
echo "$str" | grep -o '\<[[:digit:]]\{3\}\>' | wc -l
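Both commands hard-code the length 3. A small sketch that takes the digit count as a parameter, assuming GNU grep (`-o` and `\b` are extensions, not POSIX); the name `count_n_digit` is hypothetical:

```shell
#!/bin/sh
# count_n_digit N STRING: hypothetical helper; counts standalone N-digit numbers.
# Assumes GNU grep, since -o and \b are not part of POSIX grep.
count_n_digit() {
    printf '%s\n' "$2" | grep -o "\\b[0-9]\\{$1\\}\\b" | wc -l | tr -d ' '
}

count_n_digit 3 'I have 243 pens for sale'
```

Since letters and underscores count as word characters, a number glued to text (as in `tester876tester`) is not counted by this pattern.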
Using POSIX shell grammar only:
#!/usr/bin/env sh
# Should return 3
str1='I am trying to extract 3 digited numbers 333, 334, 335 from this string'
# Should return 1
str2='I have 243 pens for sale'
# should return 2
str3='This is 123 456'
_OIFS=$IFS
IFS=$IFS' ,.:;!?-_+=*#$§^&{}[]|`#"()\\/'\'
for str in "$str1" "$str2" "$str3"
do
count=0
for word in $str
do
case $word in
[[:digit:]][[:digit:]][[:digit:]])
count=$((count + 1 ))
;;
esac
done
printf 'String:\n%s\n-> Count: %d\n\n' "$str" "$count"
done
IFS=$_OIFS
Output:
String:
I am trying to extract 3 digited numbers 333, 334, 335 from this string
-> Count: 3
String:
I have 243 pens for sale
-> Count: 1
String:
This is 123 456
-> Count: 2
Could you please try the following? It was written and tested on the shown samples at
https://ideone.com/bh6zjR#stdin
Since the OP said in the comments that the digits can't have anything else before, in between, or after them (apart from a comma, I believe, as per the samples), this traverses all fields of the current line and uses a regex to find matches.
awk '
{
for(i=1;i<=NF;i++){
if(match($i,/^[0-9]{3}[,]?$/)){
count++
}
}
print "Line " FNR " has " (count+0) " number of 3 digits."
count=0
}
' Input_file
Output will be as follows.
Line 1 has 3 number of 3 digits.
Line 2 has 1 number of 3 digits.
Assuming the OP only wants exactly 3-digit numbers and is not interested in breaking longer numbers down into 3-digit segments, eg, the string 12345 will return a zero count as opposed to a 3 count ( 123 / 234 / 345 ).
Some sample data:
$ cat numbers.dat
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1
123 xyz
def 456
def 789-345 abc # should match 7-8-9 and 3-4-5
tester876tester # should match 8-7-6
testing9999testing # should not match 9-9-9-9
$ str=$(cat numbers.dat) # load data into a variable
A 2-pass grep solution:
NOTE: borrowed thanasisp's word boundary flag (\b)
Find patterns of 3-digits with non-digit book ends (including front/end of line)
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}"
333,
334,
335
243
123
456
789-
345
r876t
Now strip off the non-digits:
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}'
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}'
333
334
335
243
123
456
789
345
876
Pipe to wc -l for a count:
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l
9
Storing count in a variable:
$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l)
# or
$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l)
$ echo "${counter}"
9
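Under the same assumption (only maximal runs of exactly n digits count, with any non-digit acting as a separator), the two grep passes can be approximated with tr and awk. This is a sketch rather than a drop-in replacement, and `count_digit_runs` is a made-up name:

```shell
#!/bin/sh
# count_digit_runs N: hypothetical helper; reads text on stdin and counts
# maximal runs of digits whose length is exactly N.
count_digit_runs() {
    # Turn every run of non-digits into a single newline, one digit run per line.
    tr -cs '0-9' '\n' |
    awk -v n="$1" 'length($0) == n { c++ } END { print c + 0 }'
}

printf '%s\n' 'def 789-345 abc' 'tester876tester' 'testing9999testing' | count_digit_runs 3
```

For the three sample lines above this prints 3 (789, 345 and 876), while 9999 is skipped because its run is four digits long.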

grep match distinguish between 1 and -1

Let's say I have a following:
>>tmp='1 1 1 1 1 -1 -1 -1 -1 -1'
>>echo $tmp
1 1 1 1 1 -1 -1 -1 -1 -1
And I use the commands:
>>echo $tmp | grep -ow 1 | wc -l
10
>>echo $tmp | grep -ow "\-1" | wc -l
5
How am I able to get just the counts of 1 (which the answer should be 5 given the example above) without including the negative 1's?
You may use
echo "$tmp" | grep -oE '(^|[^-0-9])1\b' | wc -l
Or, if the numbers are separated by whitespace, you may use whitespace boundaries with a PCRE regex in GNU grep, or a Perl equivalent:
echo "$tmp" | grep -oP '(?<!\S)1(?!\S)' | wc -l
echo "$tmp" | perl -lne 'END {print $c} map ++$c, /(?<!\S)1(?!\S)/g'
Details
-o - output matches only
-E - enable POSIX ERE syntax
-P - enables PCRE syntax
(^|[^-0-9]) - matches start of string (^) or (|) a char other than - and a digit
(?<!\S) - left-hand whitespace boundary
1 - a 1 digit
\b - a word boundary
(?!\S) - right-hand whitespace boundary
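When the values are plain whitespace-separated fields, a regex is not strictly needed. A sketch that compares whole fields as strings in awk, with the hypothetical name `count_field`:

```shell
#!/bin/sh
# count_field VALUE: hypothetical helper; counts whitespace-separated fields
# on stdin that are exactly equal to VALUE.
count_field() {
    awk -v want="$1" '
        {
            for (i = 1; i <= NF; i++)
                # Appending "" forces a string comparison, so 01 or 1.0 do not equal 1.
                if (($i "") == (want "")) c++
        }
        END { print c + 0 }'
}

tmp='1 1 1 1 1 -1 -1 -1 -1 -1'
echo "$tmp" | count_field 1
echo "$tmp" | count_field -1
```

Both calls print 5 for this input, because `-1` is a different field value from `1` rather than a `1` preceded by punctuation.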

RegEx: How to get the file which name has contain hour 13 to 20

I want to list the files whose names contain an hour between 13 and 20.
For example, I have the following files in a folder:
$ ls
A_13_a.txt A_14_a.txt A_17_a.txt A_20_a.txt A_21_a.txt
where the number represents the hour.
I want to run a command that returns the names below:
A_13_a.txt A_14_a.txt A_17_a.txt A_20_a.txt
I have tried the commands below, but they do not give the right output:
ls | egrep 'A_[1][3-9]_a.txt | A_[2][0-0]_a.txt'
ls | grep 'A_[1][3-9]_a.txt'
Another option is:
ls | awk -F_ '{ if ( $2 > 12 && $2 < 21 ) print $0 }'
You need to escape the dot to be parsed as a literal dot and use an alternation group (1[3-9]|20) with egrep like this:
ls | egrep 'A_(1[3-9]|20)_a\.txt'
              ^^^^^^^^^^^  ^
The (1[3-9]|20) matches either of the 2 alternatives:
1[3-9] - 1 followed with a digit from 3 to 9
| - or
20 - a literal char sequence 20.
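The alternation above can be tried without touching a real directory. The sketch below feeds names on stdin and uses `grep -E`, the standard spelling of `egrep`; it also anchors the pattern with `^` and `$`, which is an addition of this example so that names with extra text around them are rejected. `hours_13_to_20` is a hypothetical name.

```shell
#!/bin/sh
# hours_13_to_20: hypothetical filter; keeps names of the form A_<hour>_a.txt
# where <hour> is 13..20. The anchors reject names with extra leading or trailing text.
hours_13_to_20() {
    grep -E '^A_(1[3-9]|20)_a\.txt$'
}

printf '%s\n' A_12_a.txt A_13_a.txt A_20_a.txt A_21_a.txt A_13_a.txt.bak | hours_13_to_20
```

With real files, `ls | hours_13_to_20` would apply the same filter.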

using sed to insert whitespaces between a number and word

I have a series of files that use fixed-width columns instead of comma-separated values. They all look like this:
2015/09/29 659027 RIH619 25 105.80IN921186
2015/09/29 659027 RIH619 25 105.80IN921186
2015/09/29 659027 RIH619 25 105.80IN921186
2015/09/29 659027 RIH619 25 105.80IN921186
I would like to replace all the spaces with commas. I have a piece of code that accomplishes this:
sed -r 's/^\s+//;s/\s+/,/g'
After running the code I get this result:
2015/09/29,659027,RIH619,25,105.80IN921186
2015/09/29,659027,RIH619,25,105.80IN921186
2015/09/29,659027,RIH619,25,105.80IN921186
2015/09/29,659027,RIH619,25,105.80IN921186
My problem is that the files I get don't have a space between the amount and the reference. My output needs to look like this:
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
What I tried is:
sed -r 's/^\s+//;s/\.\d\d\D+/\.\d\d,\D/;s/\s+/,/g'
But it didn't seem to do anything.
With tr and sed (tr -s squeezes each run of spaces into a single comma):
tr -s ' ' ',' <file | sed -r 's/(\.[0-9]{2})/\1,/'
You can use this single sed for both:
sed -r 's/[[:blank:]]+/,/g; s/([[:digit:]])([[:alpha:]])/\1,\2/g' file
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
2015/09/29,659027,RIH619,25,105.80,IN921186
([[:digit:]]) matches a digit and captures it in group#1
([[:alpha:]]) matches a letter and captures it in group#2
\1,\2 places a comma between 2 groups.
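The attempt in the question failed because sed's POSIX regex syntax has no `\d` or `\D` shorthand; bracket expressions such as `[0-9]` or `[[:digit:]]` are needed instead. A narrower sketch that only inserts the comma after a two-decimal amount, so that a reference mixing digits and letters elsewhere is left alone (the name `to_csv` is hypothetical, and `sed -E` is assumed to be available):

```shell
#!/bin/sh
# to_csv: hypothetical helper; converts the space-aligned rows on stdin to CSV.
# sed regexes have no \d or \D, hence the bracket expressions.
to_csv() {
    sed -E 's/^[[:space:]]+//; s/[[:space:]]+/,/g; s/(\.[0-9]{2})([[:alpha:]])/\1,\2/'
}

echo '2015/09/29   659027 RIH619  25   105.80IN921186' | to_csv
```

The first two substitutions are the ones from the question; only the third is new.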
GNU awk (gawk) has fixed field width support, via its FIELDWIDTHS variable, that is good for this sort of thing:
$ echo "2015/09/29 659027 RIH619 25 105.80IN921186" |
awk 'BEGIN { FIELDWIDTHS="10 1 6 1 6 1 2 1 6 8"; OFS="," }{ print $1,$3,$5,$7,$9,$10 }'
2015/09/29,659027,RIH619,25,105.80,IN921186
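FIELDWIDTHS is a gawk extension. Where only a POSIX awk is available, the same slicing can be sketched with `substr()`. The column offsets below are assumed from the single sample row shown, and `fixed_to_csv` is a made-up name:

```shell
#!/bin/sh
# fixed_to_csv: hypothetical helper; slices fixed-width columns with substr(),
# which any POSIX awk provides. Offsets are assumed from the sample row.
fixed_to_csv() {
    awk 'BEGIN { OFS = "," }
         { print substr($0, 1, 10), substr($0, 12, 6), substr($0, 19, 6),
                 substr($0, 26, 2), substr($0, 29, 6), substr($0, 35, 8) }'
}

echo "2015/09/29 659027 RIH619 25 105.80IN921186" | fixed_to_csv
```

Each `substr($0, start, length)` call corresponds to one of the non-separator widths in the FIELDWIDTHS string.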

if first space is 2 space, make it 1 in a file

I have a text file, and in some lines the first space from the left is 2 spaces long; I want it to be 1 space long. What is the script for this in bash?
123  2 5//problem
1 2 5
1 2 5
1  32 5//problem
What I want:
123 2 5
1 2 5
1 2 5
1 32 5
The tr way (note that this squeezes every run of spaces on the line, not just the first one):
tr -s ' ' < test.txt
Using sed:
sed 's/^\([^ ][^ ]*[ ]\)[ ]*/\1/' input
Starting from the left
^
match and capture non-space characters and a space
\([^ ][^ ]*[ ]\)
and any number of additional spaces:
[ ]* # remove the star if you only care about exactly 2 spaces
and replace these with the captured part:
\1
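If the rule should apply only when the first gap is exactly two spaces, leaving longer gaps and later gaps untouched, a stricter variant can be sketched as follows. It uses only basic regular expressions, and `fix_first_gap` is a hypothetical name:

```shell
#!/bin/sh
# fix_first_gap: hypothetical helper; if the first gap on a line is exactly
# two spaces, shrink it to one. Longer gaps and later gaps are left alone.
fix_first_gap() {
    # \1 = first word, then two spaces, then \2 = first character of the next word
    sed 's/^\([^ ][^ ]*\)  \([^ ]\)/\1 \2/'
}

printf '%s\n' '123  2 5' '1 2 5' '1  32  5' | fix_first_gap
```

Requiring a non-space character right after the two spaces is what keeps a three-space gap from being shortened to two.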
Edit: I realized that David's answer was almost right.
You can use sed.
cat x | sed -e 's/ \+/ /'
This replaces the first occurrence of one or more spaces with a single space.
But you can do it purely in bash as well:
while read -r a b ; do echo "$a $b" ; done < x
This splits each line at the first word, and echoes back the first word and the rest of the line. The result is that there is only one space between the first word and the rest of the line. Note that read also strips leading and trailing whitespace from each line.