Match single digits only sed [duplicate] - regex

This question already has answers here:
regex: find one-digit number
(5 answers)
Closed 6 years ago.
I have to make a regex to match one digit only.
It should match 7 and a7b but not 77.
I made this, but it doesn't seem to work in sed.
(?<![\d])(?<![\S])[1](?![^\s.,?!])(?!^[\d])
(?<![\d])(?<!^[\a-z])\d(?![^a-z])(?!^[\d])
What am I doing wrong?
Edit:
I need to replace only 1-digit numbers with something like
sed 's/regex/#/g' file //regex to match "1"
file content
1 2 3 4 5 11 1
agdse1tg1xw
6 97 45 12
Should become
# 2 3 4 5 11 #
agdse#tg#xw
6 97 45 12

Input
a77
a7b
2ab
882
9
abcfg9
9fg
ab9
Script
sed -En '/^[^[:digit:]]*[[:digit:]]{1}[^[:digit:]]*$/p' filename
Output
a7b
2ab
9
abcfg9
9fg
ab9

To do what you show in the Example in your question is:
$ sed -r 's/(^|[^0-9])1([^0-9]|$)/\1#\2/g' file
# 2 3 4 5 11 #
agdse#tg#xw
6 97 45 12
but that only works because you didn't have 1 1 in your data. If you did you'd need 2 passes:
$ echo '1 1' | sed -r 's/(^|[^0-9])[0-9]([^0-9]|$)/\1#\2/g'
# 1
$ echo '1 1' | sed -r 's/(^|[^0-9])[0-9]([^0-9]|$)/\1#\2/g; s/(^|[^0-9])[0-9]([^0-9]|$)/\1#\2/g'
# #
and if you wanted to do that for any single digit it would be:
$ sed -r 's/(^|[^0-9])[0-9]([^0-9]|$)/\1#\2/g; s/(^|[^0-9])[0-9]([^0-9]|$)/\1#\2/g' file
# # # # # 11 #
agdse#tg#xw
# 97 45 12
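An alternative to writing the substitution twice is to let sed repeat it until nothing changes. The second pass is needed because each match consumes the separator that the next single digit would need as its left context; a loop sidesteps that. This is only a minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the function name mask_single_digits is made up for illustration.

```shell
# Illustrative helper (hypothetical name): replace every isolated digit with '#'.
mask_single_digits() {
  # :a  defines a label
  # s   replaces the first isolated digit on the line (no g flag needed)
  # ta  jumps back to :a if the substitution changed anything
  sed -E -e ':a' -e 's/(^|[^0-9])[0-9]([^0-9]|$)/\1#\2/' -e 'ta'
}

printf '1 1\n6 97 45 12\n' | mask_single_digits
```

The loop ends as soon as a substitution attempt fails, so lines without isolated digits cost a single attempt.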

sed only supports BRE and ERE, but you can enable PCRE with grep -P:
% printf 'a77\na7b\n2ab\n82\n' | grep -P '(?<!\d)\d(?!\d)'
a7b
2ab
As demonstrated, grep prints the matching lines, but it also has an option (-o) to print only the match:
% printf 'a77\na7b\n2ab\n82\n' | grep -oP '(?<!\d)\d(?!\d)'
7
2
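Where grep -P is not available, a similar result can be sketched without lookarounds: first extract every maximal run of digits, then keep only the runs that are exactly one character long. This assumes a grep that supports -o (GNU and BSD grep do, but it is not required by POSIX); the function name is hypothetical.

```shell
# Illustrative helper (hypothetical name): print each digit that has no digit next to it.
single_digits() {
  # First grep: one maximal digit run per output line.
  # Second grep: -x keeps only lines that consist of a single digit.
  grep -oE '[0-9]+' | grep -xE '[0-9]'
}

printf 'a77\na7b\n2ab\n82\n' | single_digits
```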

Related

Bash: How do I extract the count of all the "n" digit numbers in a string? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I am trying to extract the total number of n digited numbers from a string using bash.
E.g. For a 3 digit number,
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1
Unfortunately, I will not be able to use sed or grep with perl-regexp.
Appreciate any suggestions!
You can use regular expressions in bash.
#! /bin/bash
cat <<EOF |
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3, but should ignore 12345
I have 243 pens for sale #should return 1
123 should work at text boundaries, too 123
EOF
while read line ; do
c=0
while [[ $line =~ ([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$) ]] ; do
m=${BASH_REMATCH[0]}
line=${line#*$m}
((++c))
done
echo $c
done
The regex explained:
([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$)
~~~~~~~~~~~~~ non-digit
~~ or the very beginning
~~~~~~~~~~~~~~~ three digits
~~~~~~~~~~~~ non-digit
~~ or the very end
As bash can't match the same string several times, we need to remove the already processed part from the string before trying another match.
echo "$str" | grep -o '\b[0-9]\{3\}\b' | wc -l
This way we match 3-digit numbers inside word boundaries, which can be shared between matches (e.g. if two numbers are separated by a single boundary character, such as a comma or space).
Or like this:
echo "$str" | grep -o '\<[[:digit:]]\{3\}\>' | wc -l
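Note that word-boundary operators treat letters, digits and underscores as word characters, so a run glued to letters (as in tester876tester) is not counted by the two commands above. Whether such runs should count depends on the requirement. If they should, one possible sketch is to split the string on every non-digit and count the pieces of the wanted length. The function name below is hypothetical.

```shell
# Illustrative helper (hypothetical name): count runs of exactly $1 digits in the string $2.
count_n_digit_runs() {
  n=$1
  # tr turns every non-digit character into a newline, so each digit run lands on its own line;
  # grep -c then counts the lines that consist of exactly n digits.
  printf '%s\n' "$2" | tr -c '0-9\n' '\n' | grep -cE "^[0-9]{$n}\$"
}

count_n_digit_runs 3 'I have 243 pens for sale'
```

Longer runs such as 12345 are not broken into 3-digit segments, because the whole run has to fit the pattern.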
Using POSIX shell grammar only:
#!/usr/bin/env sh
# Should return 3
str1='I am trying to extract 3 digited numbers 333, 334, 335 from this string'
# Should return 1
str2='I have 243 pens for sale'
# should return 2
str3='This is 123 456'
_OIFS=$IFS
IFS=$IFS' ,.:;!?-_+=*#$§^&{}[]|`#"()\\/'\'
for str in "$str1" "$str2" "$str3"
do
count=0
for word in $str
do
case $word in
[[:digit:]][[:digit:]][[:digit:]])
count=$((count + 1 ))
;;
esac
done
printf 'String:\n%s\n-> Count: %d\n\n' "$str" "$count"
done
IFS=$_OIFS
Output:
String:
I am trying to extract 3 digited numbers 333, 334, 335 from this string
-> Count: 3
String:
I have 243 pens for sale
-> Count: 1
String:
This is 123 456
-> Count: 2
Could you please try the following? It was written and tested on the shown samples at
https://ideone.com/bh6zjR#stdin
Since the OP said in the comments that the digits can't have anything else before, in between, or after them (apart from a comma, I believe, as per the samples), this traverses all fields of the current line and uses a regex to find the matching ones.
awk '
{
for(i=1;i<=NF;i++){
if(match($i,/^[0-9]{3}[,]?$/)){
count++
}
}
print "Line " FNR " has " (count+0) " number of 3 digits."
count=0
}
' Input_file
Output will be as follows.
Line 1 has 3 number of 3 digits.
Line 2 has 1 number of 3 digits.
Assuming the OP only wants exactly 3-digit numbers and is not interested in breaking longer numbers down into 3-digit segments, eg, the string 12345 will return a zero count as opposed to a 3 count ( 123 / 234 / 345 ).
Some sample data:
$ cat numbers.dat
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1
123 xyz
def 456
def 789-345 abc # should match 7-8-9 and 3-4-5
tester876tester # should match 8-7-6
testing9999testing # should not match 9-9-9-9
$ str=$(cat numbers.dat) # load data into a variable
A 2-pass grep solution:
NOTE: borrowed thanasisp's word boundary flag (\b)
Find patterns of 3-digits with non-digit book ends (including front/end of line)
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}"
333,
334,
335
243
123
456
789-
345
r876t
Now strip off the non-digits:
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}'
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}'
333
334
335
243
123
456
789
345
876
Pipe to wc -l for a count:
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l
9
Storing count in a variable:
$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l)
# or
$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l)
$ echo "${counter}"
9

Grep everything before a specific character [duplicate]

This question already has answers here:
How can I print all the characters until a certain pattern (excluding the pattern itself) using grep/awk/sed?
(2 answers)
Closed 2 years ago.
I have a file, my_file.
The contents of the file look like this:
4: something
5: something
7: another thing
I want to print out the following:
4
5
7
Basically I want to get all the numbers before the character :
Here is what I tried:
grep -i "^[0-9]+(?=(:)" my_file
This returned nothing. How can I change this command to make it work?
This is a use-case for awk:
$ awk -F":" '{print $1}' < inputfile
because you're using : as a field delimiter.
Try this:
grep -Eo "^[0-9]+" my_file # either -E (extended) or -P (Perl-compatible) regular expressions work here
-o prints only the matching part of each line.
We also need to tell grep which regex flavor to use, because + is not a quantifier in grep's default basic syntax.
Both of the following will work:
-E extended regular expressions
-P Perl-compatible regular expressions
Breakdown:
^ signifies the start
[0-9] match a digit
+ match 1 or more from [0-9]
Output:
4
5
7
Using grep
grep -oE '^[0-9]+:' my_file | tr -d ':'
Using sed
sed 's#:.*$##g' my_file
Demo:
$cat test.txt
4: something
5: something
7: another thing
$sed 's#:.*$##g' test.txt
4
5
7
$grep -oE '^[0-9]+:' test.txt | tr -d ':'
4
5
7
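The attempt in the question fails for two reasons worth spelling out: without -E or -P, grep treats + as a literal character, and lookaheads such as (?=:) exist only in the Perl-compatible flavor. Where -P is not available, the intent "digits at the start of the line, but only when a colon follows" can be approximated with a sed capture group. A minimal sketch; the function name is hypothetical.

```shell
# Illustrative helper (hypothetical name): print the leading number of each line that starts with "<digits>:".
leading_numbers() {
  # -n suppresses the default output; the p flag prints only lines where the substitution succeeded,
  # so lines without a leading "<digits>:" are skipped.
  sed -n 's/^\([0-9][0-9]*\):.*/\1/p' "$@"
}

printf '4: something\nnote: skipped\n7: another thing\n' | leading_numbers
```

Unlike sed 's#:.*$##g', this variant drops lines that do not start with a number instead of printing them truncated.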

How to use grep to extract multiple groups

Say I have this file data.txt:
a=0,b=3,c=5
a=2,b=0,c=4
a=3,b=6,c=7
I want to use grep to extract 2 columns corresponding to the values of a and c:
0 5
2 4
3 7
I know how to extract each column separately:
grep -oP 'a=\K([0-9]+)' data.txt
0
2
3
And:
grep -oP 'c=\K([0-9]+)' data.txt
5
4
7
But I can't figure how to extract the two groups. I tried the following, which didn't work:
grep -oP 'a=\K([0-9]+),.+c=\K([0-9]+)' data.txt
5
4
7
I am also curious about grep being able to do so. \K "removes" the previous content that is stored, so you cannot use it twice in the same expression: it will just show the last group. Hence, it should be done differently.
In the meanwhile, I would use sed:
sed -r 's/^a=([0-9]+).*c=([0-9]+)$/\1 \2/' file
it catches the digits after a= and c=, whenever this happens on lines starting with a= and not containing anything else after c=digits.
For your input, it returns:
0 5
2 4
3 7
You could try the grep command below. Note, however, that grep prints each match on a separate line, so you won't get the format shown in the question.
$ grep -oP 'a=\K([0-9]+)|c=\K([0-9]+)' file
0
5
2
4
3
7
To get that format, pipe the output of grep to paste (or a similar command).
$ grep -oP 'a=\K([0-9]+)|c=\K([0-9]+)' file | paste -d' ' - -
0 5
2 4
3 7
Use this:
awk -F'[=,]' '{print $2" "$6}' data.txt
This uses = and , as field separators and splits each line on them.
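The field-number approach assumes that a is always the first pair and c the third. If the order of the pairs can vary, a sketch that looks the keys up by name might look like this. The function name is hypothetical, and a missing key is printed as an empty string.

```shell
# Illustrative helper (hypothetical name): print the values of keys a and c from "k=v,k=v,..." lines.
extract_a_c() {
  awk -F, '{
    a = ""; c = ""
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")          # kv[1] is the key, kv[2] the value
      if (kv[1] == "a") a = kv[2]
      if (kv[1] == "c") c = kv[2]
    }
    print a, c
  }' "$@"
}

printf 'a=0,b=3,c=5\nc=4,a=2,b=0\n' | extract_a_c
```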

delete lines with specific pattern

Hi I have to delete some lines in a file:
file 1
1 2 3
4 5 6
file 2
1 2 3 6
5 7 8 7
4 5 6 9
I have to delete from file 2 all the lines that contain a line of file 1:
output
5 7 8 7
I used sed:
for sample_index in $(seq 1 3)
do
sample=$(awk 'NR=='$sample_index'' file1)
sed "/${sample}/d" file2 > tmp
done
But it doesn't work: it doesn't print anything. It gives me the error 'sed: -e expression #1, char 0: precedent regular expression needed'. Do you have any idea?
This could be a start:
$ grep -vf file1 file2
5 7 8 7
One potential pitfall here is that the output won't change if you put 5 6 9 as the second line of file1. I'm not sure if you want that or not. If not, you can try
grep -vf <(sed 's/^/^/' file1) file2
This should work if your real data has 3 columns:
awk 'NR==FNR{a[$1$2$3]++;next}!($1$2$3 in a)' file{1,2}
For variable columns:
awk 'NR==FNR{a[$0]++;next}{for(x in a) if(index($0,x)>0) next}1' file{1,2}
And the code for GNU sed
sed -r 's#(.*)#/\1/d#' file1 | sed -f - file2
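As for the error in the question: seq 1 3 produces three indices, but file1 as shown has only two lines, so the third $sample is empty and sed is handed the expression //d, which has no earlier regex to reuse. In addition, every iteration rewrites tmp from the original file2, so the deletions never accumulate. The grep -f answers avoid both problems. A small variation, sketched here under the assumption that the lines of file1 should be matched literally rather than as regular expressions, adds -F. The function name is only illustrative.

```shell
# Illustrative helper (hypothetical name): print the lines of file "$2"
# that do not contain any line of file "$1" as a literal substring.
drop_matching_lines() {
  # -f reads one pattern per line, -F treats the patterns as fixed strings, -v inverts the match.
  grep -vFf "$1" "$2"
}
```

Called as drop_matching_lines file1 file2, this prints 5 7 8 7 for the sample files. An empty line in file1 acts as an empty pattern and matches every line, so it may be worth stripping blank lines from file1 first.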

sed regex match non-whitespace or tab

I am trying to parse input that looks like this:
i171_chr1_C_MSTA_K0.184_full i266_chr1_+_MSTA_K0.195_full 92.06 2255 125 21 1 2221 2235 1 0.0 3123
i172_chr1_+_MLT1D_K0.575_full i172_chr1_+_MLT1D_K0.575_full 100.00 2290 0 0 1 2290 1 2290 0.0 4229
i172_chr1_+_MLT1D_K0.575_full i172_chr1_+_MLT1D_K0.575_full 100.00 2290 0 0 1 2290 1 2290 0.0 4229
Desired output is:
i171 1 i266 1 92
i172 1 i172 1 100
i172 1 i172 1 100
In other words, I am extracting the name before the first "_" into the first column and the part after chr into the second column (similarly for the third and fourth columns).
I wrote a command that works properly for the first four columns:
grep -v "#" blastGE90_lengthGE1000 | cut -f 1,2 | sed -r 's/(.+)_chr([0-9XY]+)_.+\t(.+)_chr([0-9XY]+).+/\1 \2 \3 \4/'
However, when I try to match the third column of the input, I am not successful. I always get the last match instead of the one I want:
grep -v "#" blastGE90_lengthGE1000 | cut -f 1,2 | sed -r 's/(.+)_chr([0-9XY]+)_.+\t(.+)_chr([0-9XY]+).+([0-9]+\.).+/\1 \2 \3 \4 \5/'
Therefore, I would like to use a regexp that matches non-whitespace or a tab, but I can't figure it out.
I have fixed your command:
grep -v "#" blastGE90_lengthGE1000 | cut -f 1-3 | sed -r 's/(.+)_chr([0-9XY]+)_.+\t(.+)_chr([0-9XY]+)_.+\t([0-9]+).+/\1 \2 \3 \4 \5/'
You need to use cut -f 1-3 not cut -f 1,2 because you need the first three columns.
I also fixed the last capture group in the sed expression.
I would use awk here:
$ awk -F'_| +' '{gsub(/chr/,"");print $1,$2,$7,$8,int($13)}' file
i171 1 i266 1 92
i172 1 i172 1 100
i172 1 i172 1 100