Why doesn't grep work in pattern with colon - regex

I know a colon (:) should be literal, so I'm not clear why grep matches all lines. Here's a file called "test":
cat test
123|4444
4546|4444
666666|5678
7777777|7890675::1
I need to match the line with ::1. Of course, the real case is more complicated, so I can't simply search for "::1". I tried many iterations, like
grep -E '^[0-9]|[0-9]:' test
grep -E '^[0-9]|[0-9]::1' test
But they return all lines:
123|4444
4546|4444
666666|5678
7777777|7890675::1
I am expecting to match just the last line. Any idea why that is?
This is GNU/Linux bash.

The pipe needs to be escaped and you need to allow repeated digits:
grep -E '^[0-9]+\|[0-9]+:' test
Otherwise ^[0-9] is all that needs to match for a line to be retained by the grep.

Given:
$ echo "$txt"
123|4444
4546|4444
666666|5678
7777777|7890675::1
Use repetition (+ means 'one or more') and character classes:
$ echo "$txt" | grep -E '^[[:digit:]]+[|][[:digit:]]+[:]+'
7777777|7890675::1
Since | is a regex meta character, it has to be either escaped (\|) or in a character class.

There are two issues:
The regex [0-9] matches any single digit. Since you have multiple digits, you need to replace those parts with [0-9]+, which matches one or more digits. If you want to allow an empty sequence with no digits, replace the + with a *, which means “zero or more”.
The pipe character | means alternation in regex. What you provided will match either a digit at the start of the line, or a digit followed by a colon. Since every line has at least one of those, you match every line. To get a literal | character, you can use either [|] or \|; the second option is preferred in most styles.
Applying both of these, you get ^[0-9]+\|[0-9]+::1.
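A minimal sketch of the difference, assuming a POSIX shell with grep -E available. The temporary file and the variable names (tmp, all_count, fixed) are illustrative and not part of the original question:

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
printf '%s\n' '123|4444' '4546|4444' '666666|5678' '7777777|7890675::1' > "$tmp"

# Unescaped | is alternation: "^[0-9]" alone is enough, so every line matches.
all_count=$(grep -Ec '^[0-9]|[0-9]::1' "$tmp")

# Escaped \| is a literal pipe, and + allows runs of digits on both sides.
fixed=$(grep -E '^[0-9]+\|[0-9]+::1' "$tmp")

echo "unescaped pattern matched $all_count lines"
echo "fixed pattern matched: $fixed"
rm -f "$tmp"
```

The first pattern reports all four lines, while the second keeps only the line ending in ::1.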

Another approach is to use a tool like awk, which can split each line into fields, and match lines where the 2nd field ends with "::1":
awk -F'|' '$2 ~ /::1$/' test
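As a small sketch of the field-based approach, the same awk test can be run on the sample data inline. The variable names (sample, matched, field) are made up for this illustration:

```shell
# Sample data from the question, held in a variable instead of a file.
sample='123|4444
4546|4444
666666|5678
7777777|7890675::1'

# Print whole lines whose second |-separated field ends with ::1.
matched=$(printf '%s\n' "$sample" | awk -F'|' '$2 ~ /::1$/')

# Same test, but print only the second field of the matching lines.
field=$(printf '%s\n' "$sample" | awk -F'|' '$2 ~ /::1$/ { print $2 }')

echo "$matched"
echo "$field"
```

Because the separator is handled by -F, the pattern itself never has to deal with the | character.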

Related

Extract capture group only from string

I have the following rule:
https://regex101.com/r/noX9lj/4
I want to make this work in a script so I'm using grep like this:
echo "\$this->table('test')" | grep -Po "qr/\$this->table\(\'(test)\'\);/"
The output should be "test"
It's not working, not sure why..
You may use
echo "\$this->table('test');" | grep -oP "\\\$this->table\\('\\K[^']+(?='\\);)"
Or, if you feed a file path to grep:
grep -oP "\\\$this->table\\('\\K[^']+(?='\\);)" file
To match a literal $, the regex needs \$. Inside a double-quoted shell string, the backslash must itself be written as \\, and the $ must be written as \$ to stop variable expansion, which is why the pattern contains "\\\$".
To match any text between two single quotes, you may use [^']+ - 1 or more chars other than '.
Pattern details
\$this->table\(' - $this->table(' string
\K - match reset operator that discards the text matched so far from the overall match buffer
[^']+ - one or more chars other than '
(?='\);) - a positive lookahead that requires '); string to be present immediately to the right of the current position.
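One way to sidestep the backslash tripling is to single-quote the pattern, so the shell leaves it alone. The sketch below assumes GNU grep built with PCRE support (-P) and writes the single quote as \x27 inside the regex; the variable names line and name are illustrative only:

```shell
# Sample input from the question; \$ keeps the shell from expanding $this.
line="\$this->table('test');"

# Single-quoted pattern: the shell passes it through untouched.
# \x27 is the PCRE escape for a single quote, \K discards what was matched
# so far, and the lookahead requires '); right after the captured name.
name=$(printf '%s\n' "$line" | grep -oP '\$this->table\(\x27\K[^\x27]+(?=\x27\);)')

echo "$name"
```

With single quotes the regex reads the same as in the pattern details above, apart from \x27 standing in for the quote character.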
There were multiple issues:
had to use "cat" instead of echo for some reason
used this rule instead:
grep -oP "this->table\('\K\w+(?='\);)"

regular expression to extract number from string

I want to extract a number from a string. This is the string:
#all/30
All I want is 30. How can I extract it?
I tried:
echo "#all/30" | sed 's/.*\/([^0-9])\..*//'
But nothing happens.
How should I write the regular expression?
Sorry for my bad English.
You may consider using grep to extract the numbers from a simple string like this.
echo "#all/30" | grep -o '[0-9]\+'
The -o option prints only the part of the line that matches the pattern.
You could try the below sed command,
$ echo "#all/30" | sed 's/[^0-9]*\([0-9]\+\)[^0-9]*/\1/'
30
[^0-9]* - [^...] is a negated character class: it matches any character except those listed inside it, so [^0-9]* matches zero or more non-digit characters.
\([0-9]\+\) - Captures one or more digit characters.
[^0-9]* - Matches zero or more non-digit characters.
Replacing the matched text with the contents of group 1 gives you the number 30.
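A portable variant of the same idea, as a sketch: \+ is a GNU extension, so this version spells "one or more digits" as [0-9][0-9]* and adds a trailing .* so that only the first number survives. The function name first_number is hypothetical:

```shell
# Print the first run of digits in the argument, or nothing if there is none.
first_number() {
  # -n plus the p flag prints the line only when the substitution succeeded.
  printf '%s\n' "$1" | sed -n 's/[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

first_number '#all/30'
first_number 'a1b2'
```

The trailing .* matters for inputs with more than one number: without it, the digits after the first match would be left in the output.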
echo "all/30" | sed 's/[^0-9]*\/\([0-9][0-9]*\)/\1/'
Avoid a leading '.*': matching is greedy by default, so it consumes as much of the string as it can.
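To see the greediness in action, here is a short sketch comparing a leading .* with a leading [^0-9]* on the same input. The variable names greedy and bounded are illustrative:

```shell
# A leading .* takes everything it can and leaves only one digit for the group.
greedy=$(printf '%s\n' 'all/30' | sed 's#.*\([0-9][0-9]*\)#\1#')

# A leading [^0-9]* has to stop at the first digit, so the group gets the whole number.
bounded=$(printf '%s\n' 'all/30' | sed 's#[^0-9]*\([0-9][0-9]*\).*#\1#')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

The first command prints 0 and the second prints 30.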
echo "all/30" | sed 's/[^0-9]*//g'
# OR
echo "all/30" | sed 's#.*/##'
# OR
echo "all/30" | sed 's#[^0-9]*\([0-9]*\).*#\1#'
Without more information about the possible input strings, we can only assume that the structure is #all/ followed by the number (and nothing else).

Extract numbers from a string using sed and regular expressions

Another question for the sed experts.
I have a string representing a pathname that will have two numbers in it. An example is:
./pentaray_run2/Trace_220560.dat
I need to extract the second of these numbers, i.e. 220560.
I have (with some help from the forums) been able to extract all the numbers together (i.e. 2220560) with:
sed "s/[^0-9]//g"
or extract only the first number with:
sed -r 's|^([^.]+).*$|\1|; s|^[^0-9]*([0-9]+).*$|\1|'
But what I'm after is the second number!! Any help much appreciated.
PS the number I'm after is always the second number in the string.
Is this OK?
sed -r 's/.*_([0-9]*)\..*/\1/g'
with your example:
kent$ echo "./pentaray_run2/Trace_220560.dat"|sed -r 's/.*_([0-9]*)\..*/\1/g'
220560
You can extract the last numbers with this:
sed -e 's/.*[^0-9]\([0-9]\+\)[^0-9]*$/\1/'
It is easier to think about this backwards:
1. From the end of the string, match zero or more non-digit characters
2. Match (and capture) one or more digit characters
3. Match at least one non-digit character
4. Match all the characters to the start of the string
Part 3 of the match is where the "magic" happens, but it also limits your matches to have at least a non-digit before the number (i.e. you can't match a string with only one number that is at the start of the string, although there is a simple workaround of inserting a non-digit at the start of the string).
The magic is to counter-act the left-to-right greediness of the .* (part 4). Without part 3, part 4 would consume all it can, which includes the numbers, but with it, matching makes sure that it stops in order to allow at least a non-digit followed by a digit to be consumed by parts 1 and 2, allowing the number to be captured.
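The effect of part 3 can be checked with a small sketch. It uses [0-9][0-9]* instead of the GNU-only \+, and the variable names (path, guarded, unguarded, leading) are made up for the illustration:

```shell
path='./pentaray_run2/Trace_220560.dat'

# With the guarding non-digit (part 3), the whole last number is captured.
guarded=$(printf '%s\n' "$path" | sed 's/.*[^0-9]\([0-9][0-9]*\)[^0-9]*$/\1/')

# Without it, the greedy .* swallows all but the final digit.
unguarded=$(printf '%s\n' "$path" | sed 's/.*\([0-9][0-9]*\)[^0-9]*$/\1/')

# Workaround for a number at the very start: prepend a non-digit first.
leading=$(printf '%s\n' '42.dat' | sed -e 's/^/x/' -e 's/.*[^0-9]\([0-9][0-9]*\)[^0-9]*$/\1/')

echo "$guarded $unguarded $leading"
```

This prints 220560, 0 and 42 respectively.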
If grep is an option:
$ echo './pentaray_run2/Trace_220560.dat' | grep -oP '\d+\D+\K\d+'
220560
Or, more portably, with Perl and the same regex:
echo './pentaray_run2/Trace_220560.dat' | perl -lne 'print $& if /\d+\D+\K\d+/'
220560
I think the approach is cleaner & more robust than using sed
This might work for you (GNU sed):
sed -r 's/([^0-9]*([0-9]*)){2}.*/\2/' file
The command above extracts the second number. Changing the repetition count to {1} extracts the first:
sed -r 's/([^0-9]*([0-9]*)){1}.*/\2/' file
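The repetition count can be turned into a parameter. The following is only a sketch, assuming a sed that accepts -E and that keeps the last iteration's capture in \2 (GNU sed behaves this way). The function name nth_number is hypothetical, and it uses [0-9]+ rather than [0-9]* so that every iteration must contain a number:

```shell
# Print the Nth number found in a string.
# $1 = which number to extract (1-based), $2 = input string
nth_number() {
  # Inside double quotes, \\2 reaches sed as \2 and {$1} becomes the repeat count.
  printf '%s\n' "$2" | sed -E "s/([^0-9]*([0-9]+)){$1}.*/\\2/"
}

nth_number 1 './pentaray_run2/Trace_220560.dat'
nth_number 2 './pentaray_run2/Trace_220560.dat'
```

If the string contains fewer numbers than requested, the substitution does not apply and the input is printed unchanged, so callers may want to check for that case.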

Use grep to match a pattern in a line only once

I have this:
echo 12345 | grep -o '[[:digit:]]\{1,4\}'
Which gives this:
1234
5
I understand what's happening. How do I stop grep from trying to continue matching after 1 successful match?
How do I get only
1234
Do you want grep to stop matching, or do you only care about the first match? You could use head if the latter is true:
`grep stuff | head -n 1`
grep is a line-based utility, so the -m 1 flag only tells it to stop after the first matching line; with -o it still prints every match on that line. Piping the output through head -n 1 keeps just the first match, which works well in practice.
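A quick sketch of the difference between -m 1 and head -n 1 on the input from the question, assuming a grep that supports -m (GNU and BSD grep do). The variable names first and with_m are illustrative:

```shell
# head keeps only the first of the matches that -o prints.
first=$(printf '%s\n' 12345 | grep -o '[[:digit:]]\{1,4\}' | head -n 1)

# -m 1 stops after the first matching LINE, so both matches on it are still printed.
with_m=$(printf '%s\n' 12345 | grep -o -m 1 '[[:digit:]]\{1,4\}' | wc -l)

echo "first match: $first"
echo "matches printed with -m 1: $with_m"
```

Here head yields 1234, while -m 1 alone still produces two output lines.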
Group the expression with \(...\) and follow it with an exact repetition count, \{<n>\}:
maci:~ san$ echo 12345 | grep -o '\([[:digit:]]\)\{4\}'
1234
Hope it helps. Cheers!!
Use sed instead of grep:
echo 12345 | sed -n '/^\([0-9]\{1,4\}\).*/s//\1/p'
This matches up to 4 digits at the beginning of the line, followed by anything, keeps just the digits, and prints them. The -n prevents lines from being printed otherwise. If the digit string might also appear mid-line, then you need a slightly more complex command.
In fact, ideally you'll use a sed with PCRE regular expressions since you really need a non-greedy match. However, up to a reasonable approximation, you can use: (A semi-solution to a considerably more complex problem...now removed!)
Since you want the first string of up to 4 digits on the line, simply use sed to remove any non-digits and then print the digit string:
echo abc12345 | sed -n '/^[^0-9]*\([0-9]\{1,4\}\).*/s//\1/p'
This matches a string of non-digits followed by 1-4 digits followed by anything, keeps just the digits, and prints them.
If – as in your example – your numeric expression will appear at the beginning of the string you're starting with, you could just add a start-of-line anchor ^:
echo 12345 | grep -o '^\([[:digit:]]\)\{1,4\}'
Depending on which exact digits you want, an end-of-line anchor $ might help also.
The grep man page says on this topic (see the section on regular expressions):
(…)
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
(…)
So the answer should be:
echo 12345 | grep -o '[[:digit:]]\{4\}'
I just tested it on cygwin terminal (2018) and it worked!
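It may help to check how an exact count behaves on inputs of other lengths. The sketch below is an illustration only; the variable names (exact, five, nine, three) are made up:

```shell
exact='[[:digit:]]\{4\}'

# Five digits: one full group of four, the leftover digit is not printed.
five=$(printf '%s\n' 12345 | grep -o "$exact" || true)

# Nine digits: two full groups of four, so two matches are printed.
nine=$(printf '%s\n' 123456789 | grep -o "$exact" | wc -l)

# Three digits: no group of four exists, so nothing is printed.
three=$(printf '%s\n' 123 | grep -o "$exact" || true)

echo "five=$five nine=$nine three=$three"
```

So \{4\} gives a single result for 12345, but it does not by itself limit grep to one match per line, and it skips numbers shorter than four digits.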

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all the domains in the middle of each line. They all have a common part, a.site, which I can grep for, but I can't work out how to do it without returning the whole line.
Maybe sed or a regex is needed here, as a simple grep isn't enough?
You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
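As a sketch, the grep and awk variants can be run against the example data to confirm they agree. The data is held in a variable here rather than a file called input, and the variable names (sample, by_grep, by_awk) are illustrative:

```shell
sample='blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk'

# -o prints only the space-delimited word that contains a.site.
by_grep=$(printf '%s\n' "$sample" | grep -o '[^ ]*a\.site[^ ]*')

# awk relies on position instead: the domain is always the second field.
by_awk=$(printf '%s\n' "$sample" | awk '{print $2}')

echo "$by_grep"
```

The grep form keys on the content (a.site), while the awk form keys on the column, so the choice depends on which of the two is more stable in the real data.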
Try this to find anything in that position
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - Matches a digit zero or more times.
.* - Matches any character zero or more times.
\. - Matches a literal dot.
() - Captures the text matched by the expression inside the parentheses; it can be referenced as \1, \2 ... \9 (only nine groups are available). \0 refers to the entire match (the portable spelling is &).