Use grep to match a pattern in a line only once - regex

I have this:
echo 12345 | grep -o '[[:digit:]]\{1,4\}'
Which gives this:
1234
5
I understand what's happening. How do I stop grep from trying to continue matching after one successful match?
How do I get only
1234

Do you want grep to stop matching, or do you only care about the first match? You could use head if the latter is true:
`grep stuff | head -n 1`
grep is a line-based utility, so the -m 1 flag tells it to stop after the first matching line, not after the first match within a line. Combined with head, that is pretty good in practice.
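A minimal sketch of the difference, assuming GNU grep. The name first_match is a hypothetical helper, not part of grep. -m 1 limits the number of matching lines, so -o still prints every match found on that line, while head -n 1 keeps only the first printed match.

```shell
# first_match is a hypothetical helper: print only the first match read from stdin.
first_match() {
    grep -o -- "$1" | head -n 1
}

# -m 1 stops after the first matching LINE; both matches on that line are still printed.
printf '12345\n678\n' | grep -m 1 -o '[[:digit:]]\{1,4\}'

# head -n 1 keeps only the first match.
echo 12345 | first_match '[[:digit:]]\{1,4\}'
```

The first command prints 1234 and 5; the second prints only 1234.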

You need to group the expression with \(...\) and follow it with the exact number of occurrences, \{<n>\}:
maci:~ san$ echo 12345 | grep -o '\([[:digit:]]\)\{4\}'
1234
Hope it helps. Cheers!!

Use sed instead of grep:
echo 12345 | sed -n '/^\([0-9]\{1,4\}\).*/s//\1/p'
This matches up to 4 digits at the beginning of the line, followed by anything, keeps just the digits, and prints them. The -n prevents lines from being printed otherwise. If the digit string might also appear mid-line, then you need a slightly more complex command.
In fact, ideally you'll use a sed with PCRE regular expressions since you really need a non-greedy match. However, up to a reasonable approximation, you can use: (A semi-solution to a considerably more complex problem...now removed!)
Since you want the first string of up to 4 digits on the line, simply use sed to remove any non-digits and then print the digit string:
echo abc12345 | sed -n '/^[^0-9]*\([0-9]\{1,4\}\).*/s//\1/p'
This matches a string of non-digits followed by 1-4 digits followed by anything, keeps just the digits, and prints them.
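Because [^0-9]* also matches an empty prefix, the two sed commands above can be folded into a single substitution. The following is only a sketch; first_digits is a hypothetical helper name.

```shell
# first_digits is a hypothetical helper: print the first run of up to 4 digits on each line.
first_digits() {
    sed -n 's/^[^0-9]*\([0-9]\{1,4\}\).*/\1/p'
}

echo 12345 | first_digits       # digits at the start of the line
echo abc12345 | first_digits    # digits after leading non-digits
echo nodigits | first_digits    # no digits: nothing is printed
```

The first two calls print 1234. The third prints nothing, because -n suppresses lines on which the substitution did not match.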

If – as in your example – your numeric expression will appear at the beginning of the string you're starting with, you could just add a start-of-line anchor ^:
echo 12345 | grep -o '^\([[:digit:]]\)\{1,4\}'
Depending on which exact digits you want, an end-of-line anchor $ might help also.

grep manpage says on this topic (see chapter 'regular expressions'):
(…)
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
(…)
So the answer should be:
echo 12345 | grep -o '[[:digit:]]\{4\}'
I just tested it on cygwin terminal (2018) and it worked!
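One caveat worth checking: a fixed count of 4 only looks like "match once" here because the input has five digits. A quick sketch with two hypothetical inputs shows that grep -o still reports every non-overlapping run of four digits, and prints nothing when fewer than four digits are present:

```shell
# A longer input yields one match per non-overlapping group of 4 digits.
echo 123456789 | grep -o '[[:digit:]]\{4\}'

# Fewer than 4 digits: no match, no output (grep exits non-zero, hence "|| true").
echo 123 | grep -o '[[:digit:]]\{4\}' || true
```

The first command prints 1234 and 5678 on separate lines.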

Related

Bash script to extract the 10 most common double-vowel words from a file

So I have tried to write a Bash script to extract the 10 most common double-vowel words from a file, like good, teeth, etc.
Here is what I have so far:
grep -E -o '[aeiou]{2}' $1|tr 'A-Z' 'a-z' |sort|uniq -c|sort -n | tail -10
I tried to use grep with the -E flag to find patterns such as 'aa', 'ee', 'ii', etc., but it is not working at all.
What I got back was just 'ai', 'ea', and so on. Can anyone help me figure out how to do this pattern match in a Bash script?
You can match any number of letters before and after a repeated vowel with this ERE pattern in GNU grep (backreferences in ERE are a GNU extension, not part of POSIX ERE):
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' words.txt
FreeBSD (non-GNU) grep does not support a backreference in the pattern, so you will have to list all possible vowel sequences:
grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' words.txt
See the online demo:
#!/bin/bash
s='Some good feed
Soot and weed'
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' <<< "$s"
Details:
[[:alpha:]]* - zero or more letters
(aa|ee|ii|oo|uu) - one of the char sequences, aa, ee, ii, oo or uu (| is an alternation operator in a POSIX ERE regex)
([aeiou]) - Group 1: a vowel
\1 - the same vowel as in Group 1
[[:alpha:]]* - zero or more letters
Simple way to change your regex: replace [aeiou]{2} with aa|ee|ii|oo|uu. (This does not fix the issue of only finding the match rather than the full word.)
Building on Andrew's answer (re: matching double vowels):
$ cat words.txt
good food;foul make chicken,eek too brave
eye you yuu something:three food too tu too
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt
good
food
eek
too
yuu
three
food
too
too
The grep finds only words (\< and \> represent word boundaries) with letters and/or digits and containing a dual vowel, printing each word on a separate line.
Applying the rest of OP's counting/sorting logic:
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt | sort | uniq -c | sort -n
1 eek
1 good
1 three
1 yuu
2 food
3 too
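For the original goal (the 10 most common words), one possible sketch lowercases the input before matching, so that capitalized words are counted together with their lowercase forms, and sorts in descending order. top_double_vowel_words is a hypothetical function name. The alternation form is used so that the pattern does not depend on backreference support.

```shell
# Hypothetical helper: list the 10 most frequent double-vowel words in file $1.
top_double_vowel_words() {
    tr 'A-Z' 'a-z' < "$1" |
        grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' |
        sort | uniq -c | sort -rn | head -n 10
}

# Usage with a throwaway sample file:
sample=$(mktemp)
printf 'Good food;foul make chicken,eek too brave\neye you yuu something:three food too tu too\n' > "$sample"
top_double_vowel_words "$sample"
rm -f "$sample"
```

With this sample, too (3 occurrences) is listed first, followed by food (2). The order among words with equal counts is up to sort.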

sed - capture a group and replace only one character

I have the following question, a file with pattern like this:
1XYZ00
so the result would be
2XYZ00
I want to replace only the first digit with another digit, for example 9, without changing the rest of the pattern.
I really need to capture this pattern 1XYZ00, and only this one, and replace the first digit with another one.
In this file I can have numbers in different patterns, and those must not be modified.
The OS is CentOS 7.
Here is what I have tested
sed -E 's/1{1}[A-Z]{3}[0-9]{2}/9{1}[A-Z]{3}[0-9]{2}/g' ~/input.txt > ~/output.txt
I also tried with capture group:
sed --debug -E -r 's/1\(1{1}[A-Z]{3}[0-9]{2}\)/9\1/g' ~/input.txt > ~/output.txt
The sed's debug mode tells me that the pattern matches but no replacement is made.
I think I am pretty close. Could any expert help me, please?
Thanks a lot,
$ cat ip.txt
foo 1XYZ00
xyz 1 2 3
hi 3XYZ00
1XYZ0A
cool 3ABC23
$ # matches any number followed by 3 uppercase and 2 digit characters
$ sed -E 's/[0-9]([A-Z]{3}[0-9]{2})/9\1/' ip.txt
foo 9XYZ00
xyz 1 2 3
hi 9XYZ00
1XYZ0A
cool 9ABC23
$ # matches digit '1' followed by 3 uppercase and 2 digit characters
$ sed -E 's/1([A-Z]{3}[0-9]{2})/9\1/' ip.txt
foo 9XYZ00
xyz 1 2 3
hi 3XYZ00
1XYZ0A
cool 3ABC23
Issue with OP's attempts:
1{1}[A-Z]{3}[0-9]{2} is same as 1[A-Z]{3}[0-9]{2}
Using 9{1}[A-Z]{3}[0-9]{2} in replacement section will give you those characters literally. They don't have any special meaning.
s/1\(1{1}[A-Z]{3}[0-9]{2}\)/9\1/ does use a capture group, but () shouldn't be escaped when the -E option is active, and 1{1} shouldn't be part of the capture group.
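If only the exact string 1XYZ00 should change, one sketch under that assumption is to spell the pattern out literally and capture the part to keep. The sample lines are taken from ip.txt above.

```shell
# Replace the leading 1 with 9 only in the literal token 1XYZ00.
# With -E, parentheses form a capture group without backslashes.
printf 'foo 1XYZ00\nhi 3XYZ00\n1XYZ0A\n' | sed -E 's/1(XYZ00)/9\1/g'
```

This prints foo 9XYZ00, hi 3XYZ00 and 1XYZ0A. Note that it would also rewrite 1XYZ00 inside a longer token such as A1XYZ007; excluding that would require adding surrounding context to the pattern.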
I'm not sure if this is enough for you.
sed -i 's/1/9/' input.txt

Why doesn't grep work in pattern with colon

I know a colon (:) should be literal, so I'm not clear why grep matches all lines. Here's a file called "test":
cat test
123|4444
4546|4444
666666|5678
7777777|7890675::1
I need to match the line with ::1. Of course, the real case is more complicated, so I can't simply search for "::1". I tried many iterations, like
grep -E '^[0-9]|[0-9]:' test
grep -E '^[0-9]|[0-9]::1' test
But they return all lines:
123|4444
4546|4444
666666|5678
7777777|7890675::1
I am expecting to match just the last line. Any idea why that is?
This is GNU/Linux bash.
The pipe needs to be escaped and you need to allow repeated digits:
grep -E '^[0-9]+\|[0-9]+:' test
Otherwise ^[0-9] is all that needs to match for a line to be retained by the grep.
Given:
$ echo "$txt"
123|4444
4546|4444
666666|5678
7777777|7890675::1
Use repetition (+ means 'one or more') and character classes:
$ echo "$txt" | grep -E '^[[:digit:]]+[|][[:digit:]]+[:]+'
7777777|7890675::1
Since | is a regex meta character, it has to be either escaped (\|) or in a character class.
There are two issues:
The regex [0-9] matches any single digit. Since you have multiple digits, you need to replace those parts with [0-9]+, which matches one or more digits. If you want to allow an empty sequence with no digits, replace the + with a *, which means “zero or more”.
The pipe character | means alternation in regex. What you provided matches either a digit at the start of the line or a digit followed by a colon. Since every line satisfies at least one of those, every line matches. To get a literal | character, you can use either [|] or \|; the second form is the more common style.
Applying both of these, you get ^[0-9]+\|[0-9]+::1.
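The effect of both fixes can be checked against the sample data from the question. This is only a small sketch; the variable name data is arbitrary.

```shell
# Sample data from the question.
data='123|4444
4546|4444
666666|5678
7777777|7890675::1'

# Unescaped |: alternation, so "starts with a digit" alone is enough.
# -c counts matching lines; all 4 lines match.
printf '%s\n' "$data" | grep -cE '^[0-9]|[0-9]::1'

# Escaped | and repeated digits: only the last line matches.
printf '%s\n' "$data" | grep -E '^[0-9]+\|[0-9]+::1'
```

The first command prints 4; the second prints only 7777777|7890675::1.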
Another approach is to use a tool like awk that can process the fields of each line, and match lines where the 2nd field ends with "::1"
awk -F'|' '$2 ~ /::1$/' test

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all the domains in the middle of each line. They all share a common part, a.site, that I can grep for, but I can't work out how to do it without returning the whole line.
Maybe sed or a regex is needed here, as a simple grep isn't enough?
You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
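sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
(The space after the leading .* keeps the greedy match from swallowing the start of the domain; this assumes the domain is preceded by a space.)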
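A quick check of the grep and awk variants against two sample lines from the question. The two differ in what they assume: grep -o finds the word containing a.site wherever it sits on the line, while awk relies on the domain always being the second field.

```shell
# Sample lines from the question.
input='blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk'

# grep -o prints only the matched word, not the whole line.
printf '%s\n' "$input" | grep -o '[^ ]*a\.site[^ ]*'

# awk prints the second whitespace-separated field of every line.
printf '%s\n' "$input" | awk '{print $2}'
```

Both commands print 123.a.site.com and 456.a.site.com for this input.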
Try this to pull out the part in that position:
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - matches a digit zero or more times.
.* - matches any character zero or more times.
\. - matches a literal dot.
() - captures the text matched by the expression inside the parentheses; it can be reused as \1, \2 ... \9. Only nine groups (1 to 9) are available. \0 stands for the whole matched text (in GNU sed; & is the standard way to write it).

regex and grep match only string with only single or double digit

I need to extract strings that contain only a single- or double-digit number. My file (test) looks like this:
test1correct
test12something
test123wrong
In the above example, I want to grep only for
test1correct and test12something
I tried this:
grep "test[0-9]{1,2}" test
but it gives me all 3 lines.
use: grep -E "test[0-9]{1,2}[^0-9]" test
(-E is needed so that {1,2} is treated as a repetition count; without it, write \{1,2\}.)
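Interval expressions such as {1,2} are only special in extended syntax (grep -E); in basic syntax the braces must be backslash-escaped. A small sketch against the three sample lines, with an arbitrary variable name lines:

```shell
lines='test1correct
test12something
test123wrong'

# ERE: {1,2} is an interval expression.
printf '%s\n' "$lines" | grep -E 'test[0-9]{1,2}[^0-9]'

# BRE equivalent: the braces are backslash-escaped.
printf '%s\n' "$lines" | grep 'test[0-9]\{1,2\}[^0-9]'
```

Both commands print test1correct and test12something. The trailing [^0-9] requires a character after the digits, so a line that ends right after them (for example test12) would not match; grep -E 'test[0-9]{1,2}([^0-9]|$)' covers that case as well.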
Using lookaheads and lookbehinds you can specify "exactly one digit" or "exactly three digits" or whatever. This does exactly one digit:
echo 'WB123_4' | grep -Po '(?<![[:digit:]])([[:digit:]]{1})(?![[:digit:]])'
Result: 4
It finds a digit that is neither preceded nor followed by a digit. This also works for more than one digit. The following matches three digits, then at least one of anything else, then one digit:
echo 'WB123_4' | grep -Po '(?<![[:digit:]])([[:digit:]]{3})(?![[:digit:]]).+(?<![[:digit:]])([[:digit:]]{1})(?![[:digit:]])'
Result: 123_4
While I'm at it, this combination of grep and sed will find a string with three digits, then one or more of anything else, then one digit, and extract just those parts nicely. (There might have been another way to do that just in grep with groups.)
echo 'WB123_4' | grep -Po '(?<![[:digit:]])([[:digit:]]{3})(?![[:digit:]]).+(?<![[:digit:]])([[:digit:]]{1})(?![[:digit:]])' | sed -r -e 's/[^[:digit:]]+/ /'
Result: 123 4
Note: the -P flag to grep means to use Perl-style regular expressions, which lets you use lookaheads and lookbehinds.
Try this:
test[0-9]{1,2}[A-Za-z]+
The tst file contains the following data (cat tst):
1
0
operator
4
5
5
When I grep it with cat tst | grep [0-9], it returns only 1:
1
How do I grep all the numbers in the tst file?