Matching arbitrary number of digits using grep regex

I've got a file with lines that look like the following:
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.

Use + instead of *.
+ matches one or more of the preceding element.
* matches zero or more.
^[Dd]ata[0-9]+later$
In grep you need to escape the +, and we can use \d which is a character class and matches single digits.
^[Dd]ata\d\+later$
In your example file you also have a line:
datafhj893724897290384later
This currently will not be matched due to there being letters in-between data and the numbers. We can fix this by adding a [^0-9]* to match anything after data until the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*\d\+later$' filename
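The difference between * and + can be checked with a portable sketch. It assumes a grep that supports -E and uses [0-9] instead of \d, since \d is a Perl-style class that not every grep honors. The variable names are only for illustration.

```shell
# Sample lines from the question, one per line.
sample='data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later'

# "*" allows zero digits, so "datalater" slips through.
star_matches=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[0-9]*later$')

# "+" requires at least one digit.
plus_matches=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[0-9]+later$')

# Also allow non-digits between "data" and the digits.
loose_matches=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[^0-9]*[0-9]+later$')

printf 'star:\n%s\nplus:\n%s\nloose:\n%s\n' "$star_matches" "$plus_matches" "$loose_matches"
```

With -E the + needs no backslash, which avoids the escaping question altogether.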

Using Cygwin, the above commands didn't work. I had to modify the commands given above to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag. This meant I couldn't use the [^0-9]+, whose necessity @Tom_Cammann aptly pointed out. Instead, I used .*, which matches any run of characters and then backtracks as far as needed for the rest of the pattern to match. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.
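Note that .* and [^0-9]* are not interchangeable here, which a small sketch can show. It uses -E and [0-9] rather than -P and \d so that it does not depend on PCRE support, and the test line is hypothetical (it is not in the original file).

```shell
line='data12abc34later'   # hypothetical line, not part of the original sample

# ".*" may match digits too, so this line is accepted.
if printf '%s\n' "$line" | grep -Eq '^[Dd]ata.*[0-9]+later$'; then
  dot_star=match
else
  dot_star=no-match
fi

# "[^0-9]*" stops at the first digit, so the digits must run straight into "later".
if printf '%s\n' "$line" | grep -Eq '^[Dd]ata[^0-9]*[0-9]+later$'; then
  non_digit=match
else
  non_digit=no-match
fi

echo "dot_star=$dot_star non_digit=$non_digit"
```

Which behavior is wanted depends on whether digits are allowed to appear more than once between data and later.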
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$

You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. You were also finding commas at the beginning of the string (e.g. ",ata1234later"). And \d is a shortcut to finding any digit character. So I changed those as well.

You should use "+" (which means one or more) instead of "*" (which means zero or more).

The "+" syntax only works for extended-regexp, not standard grep.
At least, that's my experience on RHEL.
To use extended-regexp, run egrep or pass "-E" / "--extended-regexp"
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH
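For comparison, here is a sketch of the three spellings side by side. It assumes a POSIX-style grep. The interval form \{1,\} is the basic-regex way of saying "one or more", and the variable names are illustrative only.

```shell
# Basic regex (default): a bare "+" is a literal plus sign, so nothing matches.
bre_plain=$(echo abc123n1 | grep -c 'abc[0-9]+n1' || true)

# Basic regex with the interval \{1,\}: one or more digits.
bre_interval=$(echo abc123n1 | grep -c 'abc[0-9]\{1,\}n1' || true)

# Extended regex: "+" is a quantifier.
ere_plus=$(echo abc123n1 | grep -E -c 'abc[0-9]+n1' || true)

echo "$bre_plain $bre_interval $ere_plus"
```

The counts come from -c, which prints the number of matching lines. The || true only keeps a zero-match result from being treated as a failure.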

grep -Eio "^(data)[0-9]+(later)$"
^[dD]ata=^d later$=r$

🎯 MOTIVATION
The rest of answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
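Applied to the lines from the question, the digit group might be used as in the sketch below. It uses -E, the short form of --extended-regexp, and the variable name is an assumption made for the example.

```shell
# [[:digit:]]+ requires at least one digit between "data" and "later".
digit_matches=$(printf '%s\n' data datalater 983290842 Data387428later 4329804928later |
  grep -E '^[Dd]ata[[:digit:]]+later$')

echo "$digit_matches"
```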

Related

How to get the release value?

I've a file with the below name formats:
rzp-QAQ_SA2-5.12.0.38-quality.zip
rzp-TEST-5.12.0.38-quality.zip
rzp-ASQ_TFC-5.12.0.38-quality.zip
I want the value as: 5.12.0.38-quality.zip from the above file names.
I'm trying as below, but not getting the correct value though:
echo "$fl_name" | sed 's#^[-[:alpha:]_[:digit:]]*##'
fl_name is the variable containing the file name.
Thanks a lot in advance!
You are matching too much with all the alpha, digit - and _ in the same character class.
You can match alpha and - and optionally _ and alphanumerics
sed -E 's#^[-[:alpha:]]+(_[[:alnum:]]*-)?##' file
Or you can shorten the first character class, and match a - at the end:
sed -E 's#^[-[:alnum:]_]*-##' file
Output of both examples
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
With GNU grep you could try following code. Written and tested with shown samples.
grep -oP '(.*?-){2}\K.*' Input_file
OR as an alternative use(with a non-capturing group solution, as per the fourth bird's nice suggestion):
grep -oP '(?:[^-]*-){2}\K.*' Input_file
Explanation: this uses GNU grep. The -o option prints only the matched part of each line, and -P enables the PCRE flavor. The regex (.*?-){2} lazily matches up to a - twice, consuming everything through the second hyphen. \K then discards what has been matched so far, so only the text matched by the trailing .* is printed.
It is much easier to use cut here:
cut -d- -f3- file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
If you want sed then use:
sed -E 's/^([^-]*-){2}//' file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
Assumptions:
all filenames contain 3 hyphens (-)
the desired result always consists of stripping off the 1st two hyphen-delimited strings
OP wants to perform this operation on a variable
We can eliminate the overhead of sub-process calls (eg, grep, cut and sed) by using parameter substitution:
$ f1_name='rzp-ASQ_TFC-5.12.0.38-quality.zip'
$ new_f1_name="${f1_name#*-}" # strip off first hyphen-delimited string
$ echo "${new_f1_name}"
ASQ_TFC-5.12.0.38-quality.zip
$ new_f1_name="${new_f1_name#*-}" # strip off next hyphen-delimited string
$ echo "${new_f1_name}"
5.12.0.38-quality.zip
On the other hand if OP is feeding a list of file names to a looping construct, and the original file names are not needed, it may be easier to perform a bulk operation on the list of file names before processing by the loop, eg:
while read -r new_f1_name
do
... process "${new_f1_name}"
done < <( command-that-generates-list-of-file-names | cut -d- -f3-)
In plain bash:
echo "${fl_name#*-*-}"
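As a sketch, the same expansion applied to the three sample names from the question. The loop and the release variable are illustrative additions.

```shell
for fl_name in rzp-QAQ_SA2-5.12.0.38-quality.zip \
               rzp-TEST-5.12.0.38-quality.zip \
               rzp-ASQ_TFC-5.12.0.38-quality.zip
do
  # "#" removes the shortest prefix matching *-*-, i.e. everything through the second hyphen.
  release="${fl_name#*-*-}"
  echo "$release"
done
```

Because # removes the shortest match, hyphens later in the name (as in -quality) are left alone.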
You can do a reverse of each line, and get the two last elements separated by "-" and then reverse again:
echo "$fl_name" | rev | cut -f1,2 -d'-' | rev
A Perl solution capturing digits and characters trailing a '-'
cat f_name | perl -lne 'chomp; /.*?-(\d+.*?)\z/g;print $1'

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: With awk this is simple. Written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: the field separator is set to / or -, and the 2nd field, a hyphen, and the 3rd field of the current line are printed.
2nd solution: Using match function of awk program here.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text, and -P enables PCRE (Perl-compatible regular expressions). The regex matches .*/, uses \K to discard what has been matched so far, and then [^-]*-[^-]* matches everything up to, but not including, the 2nd occurrence of - in the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to the first /, reset the start of the match with \K, and then repeat the character class one or more times on each side of a hyphen, so that at least one character is required before and after it.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using gnu grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
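The last note can be sketched as a single awk call that handles every string in one pass. The second input string is hypothetical, and splitting on hyphens after dropping the directory part is just one way to do it under the assumptions listed above.

```shell
# Two input strings, one per line; the second one is made up for the example.
strings='home/JOHNSMITH-4991-common-task-list
home/dir/JANEDOE-1234-other-task'

result=$(printf '%s\n' "$strings" | awk '{
  sub(/.*\//, "")          # drop everything up to the last slash
  split($0, parts, "-")    # split the remainder on hyphens
  print parts[1] "-" parts[2]
}')

echo "$result"
```

Only one awk process is started no matter how many lines are fed in.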

Replace the separator between pairs of numbers

I want to replace all strings like [0-9][0-9]-[0-9][0-9] with [0-9][0-9]/[0-9][0-9] using sed.
In other words, I want to replace - with /.
If I have somewhere in my text:
09-36
32-43
54-65
I want this change:
09/36
32/43
54/65
Using GNU sed:
$ echo '09-36 32-43 54-65' | sed -r 's|\<([0-9]{2})-([0-9]{2})\>|\1/\2|g'
09/36 32/43 54/65
-r turns on extended regular expressions, so:
( ) { } don't require \-escaping.
\< and \> match only at word boundaries; they are GNU extensions rather than something -r itself enables (if the expression should only match full lines, use ^ and $ instead, and omit the g option)
| is used as an alternative regex delimiter so that / can be used without \-escaping.
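To see what the word boundaries buy, here is a sketch comparing the substitution with and without them. It assumes GNU sed, since \< and \> are GNU extensions, and the input string is hypothetical.

```shell
input='109-365 09-36'   # hypothetical input containing a longer number pair

# Without boundaries, the "09-36" inside "109-365" is rewritten as well.
without=$(echo "$input" | sed -r 's|([0-9]{2})-([0-9]{2})|\1/\2|g')

# With boundaries, only the standalone two-digit pair is rewritten.
with=$(echo "$input" | sed -r 's|\<([0-9]{2})-([0-9]{2})\>|\1/\2|g')

echo "without boundaries: $without"
echo "with boundaries:    $with"
```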
A BSD/macOS sed solution would look slightly different:
echo '09-36 32-43 54-65' | sed -E 's|[[:<:]]([0-9]{2})-([0-9]{2})[[:>:]]|\1/\2|g'
sed -e 's/\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1\/\2/g'
Might not be the most elegant version, but works for me. The gazillion backslashes make this rather unreadable in my opinion. You might improve the readability by not using / to separate the pattern and the replacement maybe?
perl -C -npe 's/(?<!\d)(\d\d)-(\d\d)(?!\d)/\1\/\2/g' file
Input
维基 1-11 22-33 444-44 55-555 66-66百科
77-77
8 88-88
Output
维基 1-11 22/33 444-44 55-555 66/66百科
77/77
8 88/88
In the command above
-C enables Unicode;
-n causes Perl to process the script for each input line;
-p causes Perl to print the result of the script to the standard output;
-e accepts a Perl expression (particularly, it is a substitution).
In this mode (-npe), Perl works just like sed. The script replaces each pair of two-digit numbers separated by - with the same pair separated by a slash.
(?<!\d) and (?!\d) are negative lookaround expressions.
To edit the file in place use -i option: perl -C -i.backup -npe ....
If the input is not a file, you can pass the input to Perl via pipe, e.g.:
echo '维基 1-11 22-33 444-44 55-555 66-66百科' | \
perl -C -npe 's/(?<!\d)(\d\d)-(\d\d)(?!\d)/\1\/\2/g'

How do I use egrep to list words that match a regular expression?

I need to use egrep to count words that contain strings which match a regular expression. For instance, I need to do something like "Count the number of words containing three consecutive vowels" (not exactly that, but that's the gist of it).
I've figured out how to do it to count lines which contain these words, but when I add the -w tag I get an egrep: illegal option -- w error.
Here's the regular expression I'd use to count lines in the scenario above, which seems to work:
egrep -i -c '[aeiou][aeiou][aeiou]' full.html
Using the -w tag with this command causes the error I listed above, even if I add \b tags around the regex expression. e.g.:
egrep -i -c -w '\b.*[aeiou][aeiou][aeiou].*\b' full.html
What am I doing wrong?
EDIT: I'm running this on Solaris 10 out of the terminal.
You can also count the matching words this way:
grep --color -Eow '[aeiou][aeiou][aeiou]' filename | wc -l
or
egrep -ow '[aeiou][aeiou][aeiou]' filename | wc -l
-o prints only the matched text.
-w matches whole words only.
Piping the result to wc -l then displays the count of the words.
You'll have to consult your solaris man-pages to know if your egrep supports any/all/some of the GNU like extensions.
Does your system have /usr/xpg4/bin ? If yes, make sure your MANPATH includes /usr/xpg4/man. That dir used to have the newest versions, short of having something like /opt/gnu install added.
In any case, your regexp '\b.*[aeiou][aeiou][aeiou].*\b' reads to my eye as ...
1 word-boundary
followed by any number of any chars (including blanks and vowels)
followed by three vowels,
followed by any number of any chars (including blanks and vowels),
followed by 1 word-boundary.
Probably not what you really want.
To meet your need of words with 3 vowels in a row and using old/square reg-ex long hand, try
egrep -i -c '[a-z]*[aeiou][aeiou][aeiou][a-z]*' full.html
This says: match any number of [a-z] characters (including none), then 3 vowels, then any number of [a-z] characters (including none). Space characters won't match [a-z]. You're using -i to ignore case, so you don't have to use [A-Za-z]. If there are other characters you want to treat as word characters, such as '_', add them to both sides.
Sorry, but I'm going from memory here, I don't work in a Solaris shop, and can't test it there.
edit
Also note that the man page on my current system for grep says
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
Note it's the number of matching lines, not the number of matches.
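The difference between counting lines and counting matches can be seen in a small sketch. It assumes a grep with the -o option (GNU and BSD grep have it; it may be missing on Solaris), and the sample text is made up.

```shell
# Hypothetical text: one line holds two words with three consecutive vowels.
text='a beautiful queue
plain text'

# -c counts matching lines.
line_count=$(printf '%s\n' "$text" | grep -E -i -c '[aeiou]{3}')

# -o prints each matching word on its own line, so wc -l counts words.
word_count=$(printf '%s\n' "$text" | grep -E -i -o '[a-z]*[aeiou]{3}[a-z]*' | wc -l | tr -d ' ')

echo "lines=$line_count words=$word_count"
```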
Might be easier to use
awk '{for (i=1;i<=NF;i++){if ($i ~ /.*[aeiou][aeiou][aeiou].*/) cnt++};}; END{print "count="cnt}' file
IHTH
I believe that egrep does not support \b for word boundaries. Try \< for beginning of word boundary and \> for end of word boundary.
EDIT
Hmm... never mind. According to the man page \b is supported.
Actually, I think the answer is that only grep supports the "-w" option. I don't think egrep does.
http://ss64.com/bash/egrep.html
Which platform and which version of egrep?
The -w option works for me (CentOS and Mac with GNU egrep) - see below. Also, \b works as expected - see below.
Also, I used a different regex - see below.
$ grep --version
grep (GNU grep) 2.5.1
$ cat test.txt
this and that iou and eai
not this aaih
not this haai
$ egrep -i -w '[aeiou]{3}' test.txt
this and that iou and eai
# with no -w
$ egrep -i '\b[aeiou]{3}\b' test.txt
this and that iou and eai
# with neither -w nor {3}
$ egrep -i '\b[aeiou][aeiou][aeiou]\b' /tmp/test.txt
this and that iou and eai
# using '\<' and '\>' works as well for word boundaries
$ egrep -i '\<[aeiou][aeiou][aeiou]\>' /tmp/test.txt
this and that iou and eai

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all those domains in the middle of each line, they all have a common part a.site with which I can grep for, but I can't work out how to do it without returning the whole line?
Maybe sed or a regex is needed here as a simple grep isn't enough?
You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/g' input
Try this to find anything in that position
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - matches a digit zero or more times.
.* - matches any character zero or more times.
\. - matches a literal dot.
() - captures the text matched by the expression inside the parentheses; it can be reused as \1, \2 ... \9. Only nine such buffers (1 to 9) are available. \0 refers to the entire text matched by the expression.
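A minimal sketch of a capture group pulling out the domain from one of the sample lines. The pattern here is a simplified variant that relies only on the fields being space-separated. It is not the expression above.

```shell
line='blahblah 123.a.site.com some-junk'

# \1 is whatever the (...) group matched: the second space-separated field.
domain=$(echo "$line" | sed -E 's/^[^ ]+ ([^ ]+) .*$/\1/')

echo "$domain"
```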