Regex match and grouping - regex

Here's a sample string that I want to run a regex on:
101-nocola_conte_-_fuoco_fatuo_(koop_remix)
The first digit in "101" is the disc number and the next two digits are the track number. How do I match the track number and ignore the disc number (the first digit)?

Something like
/^\d(\d\d)/
This matches one digit at the start of the string, then captures the following two digits.
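As a minimal sketch in Python (the variable names are illustrative and not part of the question), the same pattern skips the disc digit and captures the two track digits:

```python
import re

name = "101-nocola_conte_-_fuoco_fatuo_(koop_remix)"

# ^\d consumes the disc digit; (\d\d) captures the two track digits.
match = re.match(r"^\d(\d\d)", name)
track = match.group(1) if match else None
print(track)  # prints 01
```

The captured value is a string, so the leading zero is preserved.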

Do you mean that you don't mind what the disc number is, but you want to match, say, track number 01?
In perl you would match it like so: "^[0-9]01.*"
or more simply "^.01.*" - which means that you don't even mind if the first char is not a digit.

^\d(\d\d)
You may need a \ in front of the ( and ) depending on the environment in which you intend to run the regex (such as vi(1)).

Which programming language? For the shell something with egrep will do the job:
echo '101-nocola_conte_-_fuoco_fatuo_(koop_remix)' | egrep -o '^[0-9]{3}' | egrep -o '[0-9]{2}$'

Related

grep regex unexpected match - literal matches against number

Why does the following literal string
1998-${year}
..match against the grep command:
grep "[0-9 ]*-[ 0-9]*" filename.txt ?
What I need is a regex to match any of the following strings containing either a year range or one value of year only.
sdkfmslf 1998-2008
asdassdadsa 1998 - 2008
mkklml mklsmdf 2006
..but NOT this one:
asdsad a s 1998-${year}
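* means "match zero or more". You want +, which means "one or more". Note that in grep's default basic syntax + is a literal character, so use grep -E (or write \+):
grep -E "[0-9 ]+-[0-9]+" filename.txt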
Try [0-9]{4}(\s*-\s*[0-9]{4})?. This will match a 4 digit number, or if it is followed by (optional white space)-(optional whitespace) then that must be followed by another 4 digit number.
Your string "asdsad a s 1998-${year}" would still match, since it has a single 4 digit value in it.
I don't like answering my own question, but none of the above worked. Here is what I found by experimenting. I'm sure there could be more elegant solutions, but here is a working version:
grep "[0-9][0-9][0-9][0-9][ ]*[\-]*[ ]*[0-9]*" filename.txt
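If the `${year}` line has to be rejected, one option is to anchor the pattern to the end of the line, because a pattern whose parts after the first four digits are all optional will still match it. The Python sketch below assumes that the year or year range is the last thing on each line; that is true of the sample lines but is not stated in the question.

```python
import re

# Assumption: the year or year range is the last thing on the line.
# Four digits, an optional "- YYYY" part, optional trailing spaces, end of line.
YEAR_AT_END = re.compile(r"[0-9]{4}(?: *- *[0-9]{4})? *$")

lines = [
    "sdkfmslf 1998-2008",
    "asdassdadsa 1998 - 2008",
    "mkklml mklsmdf 2006",
    "asdsad a s 1998-${year}",
]

kept = [line for line in lines if YEAR_AT_END.search(line)]
print(kept)
```

With grep -E, the same idea would be written with a trailing $ in the pattern.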

How to match this expression with regex?

I have a text with some lines (200+) in this format:
10684 - The jackpot ? discuss Lev 3 --- ? ---
10755 - Garbage Heap ? discuss Lev 5 --- ? ---
I want to retrieve the first number (10684 or 10755) only if the number after "Lev" is greater than 3.
I'm able to get the first number with this regex: ([0-9]+) - but without the 'level' restrictions.
How can this be done?
Thanks in advance.
(\d+) - .*?Lev (?:[4-9]|[1-9]\d+)
The first \d+ matches the leading number, as you have already done.
The .*? that follows is a lazy quantifier, so it consumes as few characters as possible; the literal "Lev " after it guides the match to the right place. (A lazy quantifier can also mean less backtracking here.)
The second group, (?:[4-9]|[1-9]\d+), matches either a single digit greater than 3 or a number of two or more digits with no leading zero.
Stack Overflow doesn't display my image properly, so use this link instead: http://regexr.com?36n5l
Regular expressions don't treat numbers as numbers (only as strings). You can do this, though:
([0-9]+) - .*Lev (?:[4-9][^0-9]|[1-9][0-9]+)
Basically, we use the alternation operator (|) to accept only a single digit greater than 3 (enforced by checking that the following character is not a digit) or a multi-digit number not beginning with a zero.
In case that level number might be the end of the line, though, you might have to do this:
([0-9]+) - .*Lev (?:[4-9](?:[^0-9]|$)|[1-9][0-9]+)
(I'm assuming whatever regex engine you're using can't handle lookaround assertions. In the future, try to always include what language you're using when you're asking a regex question.)
Ah, I just read your edit that the number is always less than 10. Well, that's much easier then:
([0-9]+) - .*Lev [4-9]
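Because a regex only sees characters, another option is to capture both numbers and do the comparison in code. The following Python sketch assumes the line layout shown in the question; the function name `ids_above_level` and the `threshold` parameter are made up for illustration.

```python
import re

# Leading number, " - ", anything, then "Lev " and the level number.
LINE = re.compile(r"^(\d+) - .*Lev (\d+)")

def ids_above_level(lines, threshold=3):
    """Return the leading numbers of lines whose Lev value exceeds threshold."""
    result = []
    for line in lines:
        m = LINE.search(line)
        # Compare as an integer, not as text.
        if m and int(m.group(2)) > threshold:
            result.append(m.group(1))
    return result

sample = [
    "10684 - The jackpot ? discuss Lev 3 --- ? ---",
    "10755 - Garbage Heap ? discuss Lev 5 --- ? ---",
]
print(ids_above_level(sample))  # prints ['10755']
```

This keeps the pattern simple and works for any threshold, including levels of two or more digits.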
A lookahead is really the best thing because it will leave just the number:
/\d+(?=.*Lev (0*[4-9]|[1-9]\d))/
A bit of Awk trickery:
awk -F '[?] +discuss +Lev +' '$2+0 > 3 { split($1, a, " "); print a[1] }' file
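In bash, use this (the match is tested first, so a non-matching line cannot reuse the captures from a previous line):
var=">3"
perl -lne 'print $1 if /(\d+) - .*Lev (\d+)/ && $2'"$var"
This is useful when you want to pass the condition in as a parameter.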

Regex: plus sign not doing what I expect

I have a file with many lines including a string like this: blah blah num=12345; blah blah
I would like to find lines where the number after the equals sign is greater than 1, with no upper limit. (I do not expect a number to ever start with zero.)
I started with this expression that will match any number starting with any digit that's not a 1, and it works fine and I understand it.
grep 'num=[2-9][0-9]*;'
This next expression should, I thought, return any number starting with a 1 that has two or more digits, but I instead get nothing back:
grep 'num=1[0-9]+;'
I thought the above meant: must match num=1, then must match something between 0-9 one or more times, then must match ;. Where am I going wrong?
With grep's default basic regular expressions you must escape the + quantifier (alternatively, use grep -E and leave it unescaped):
grep 'num=1[0-9]\+;'
For your problem you can use this (for all numbers greater than 1, if I understand correctly):
grep 'num=\([2-9]\|1[0-9]\)[0-9]*;'
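For comparison, here is a small Python sketch of the same "greater than 1" test. Python's `re` module treats + as a quantifier without escaping, much like grep -E; the sample lines are invented for illustration.

```python
import re

# Either a first digit 2-9 followed by any digits,
# or a 1 followed by at least one more digit.
GREATER_THAN_ONE = re.compile(r"num=(?:[2-9][0-9]*|1[0-9]+);")

samples = [
    "blah num=1; blah",
    "blah num=2; blah",
    "blah num=10; blah",
    "blah num=12345; blah",
]

matched = [s for s in samples if GREATER_THAN_ONE.search(s)]
print(matched)
```

Only the num=1; line is left out, since it is the one value that is not greater than 1.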

Edit large textfile in mac terminal

I have this very large dictionary file with 1 word on each line, and I would like to trim it down.
What I would like to do is leave 3-6 letter improper nouns, so it has to detect the words based on these:
if the word is less than 3 letters, delete it
if the word is more than 6 letters, delete it
if the word has a capital letter, delete it
if the word has a single quote or space, delete it.
I used this:
cat Downloads/en-US/en-US.dic | egrep '[a-z]{3,6}' > Downloads/3-6.txt
but the output is incorrect. It outputs the words with greater than 3 characters alright, but that's about my progress so far.
So how do I go about doing this in the mac terminal? There must be a way to do this right?
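Your pattern is not anchored, so it matches any line that contains three lowercase letters in a row, whatever else is on the line. The following command selects only words that consist entirely of three to six lowercase a-z letters: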
egrep '^[a-z]{3,6}$' /usr/share/dict/words > filtered.txt
Replace /usr/share/dict/words with your input file, and filtered.txt with a name for your output file. I just verified that this works on my Mac. Hope this helps!
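The same filter can be sketched in Python, where `fullmatch` plays the role of the ^ and $ anchors. The word list and the helper name `keep_word` are made up for illustration.

```python
import re

WORD = re.compile(r"[a-z]{3,6}")

def keep_word(word):
    """True if the word is made up of 3 to 6 lowercase a-z letters only."""
    # fullmatch requires the whole string to match, like ^...$ in grep.
    return WORD.fullmatch(word) is not None

words = ["at", "cat", "Cat", "don't", "banana", "bananas", "ice cream"]
kept = [w for w in words if keep_word(w)]
print(kept)  # prints ['cat', 'banana']
```

Capital letters, quotes and spaces are rejected automatically because they fall outside [a-z].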
Use grep and write a regex rule to match the lines you want to keep. You can get info on grep by typing man grep in the terminal.

Regex find time values

I keep getting into situations where I end up writing two regular expressions to handle small variations (such as one for 0-9 and another for 10-99 because of the extra digit).
I usually use [0-9] to find strings with one digit and [0-9][0-9] to find strings with two digits. Is there a better way to express this?
For example, what expression would I use to find both of these strings:
6:45 AM and 10:52 PM
You can specify repetition with curly braces. [0-9]{2,5} matches two to five digits. So you could use [0-9]{1,2} to match one or two.
[0-9]{1,2}:[0-9]{2} (AM|PM)
I personally prefer to use \d for digits, thus
\d{1,2}:\d{2} (AM|PM)
[0-9] 1 or 2 times followed by : followed by 2 [0-9]:
[0-9]{1,2}:[0-9]{2}\s(AM|PM)
or, to accept only valid 12-hour times:
(?:1[0-2]|[1-9]):[0-5][0-9]\s(?:AM|PM)
If you are looking for a time pattern, you'd do something like:
\d{1,2}:\d{1,2} (AM|PM)
Or, for a more specific time regex (an optional leading 0 or 1, one more hour digit, and minutes from 00 to 59):
[01]?[0-9]:[0-5][0-9] (AM|PM)
Much like the other answers, except that the AM/PM is not captured, which avoids storing a group you don't need:
\d{1,2}:\d{1,2}\s(?:AM|PM)
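To see how the quantifier-based patterns behave on real text, here is a hedged Python sketch. The sample sentence is invented, and the pattern is one possible 12-hour variant with word boundaries added so that it does not match inside a longer number.

```python
import re

# Hours 1-12 (optional leading zero), minutes 00-59, whitespace, AM or PM.
TIME_12H = re.compile(r"\b(?:1[0-2]|0?[1-9]):[0-5][0-9]\s(?:AM|PM)\b")

text = "Wake at 6:45 AM, lunch at 12:30 PM, sleep at 10:52 PM, not 13:75 PM."
times = TIME_12H.findall(text)
print(times)  # prints ['6:45 AM', '12:30 PM', '10:52 PM']
```

Because the group is non-capturing, `findall` returns the full matched strings rather than just the AM/PM part.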
If I have a file containing:
ABC
123XYZ
6:45 AM
123DHD
ABC
10:52 PM
CDE
and run the following:
$ grep -P '6:45\sAM|10:52\sPM' temp
6:45 AM
10:52 PM
that should do the trick (-P enables Perl-compatible regular expressions in GNU grep).
EDIT:
Perhaps I misunderstood. The other answers are very good if you just want to find a time, but you seem to be after specific times; the others would match ANY time in HH:MM format.
Overall, I believe the two things you are after are the | pipe character, which is used here to allow alternative phrases, and the {n,m} quantifier, which matches n to m times ({1,2} would match one or two times, and so on).
This can check several 12-hour time formats, e.g. 12:05PM, 3:19AM, 04:25PM (it does not match 24-hour values such as 23:52PM):
my $time = "12:52AM";
# Optional leading 0 or 1, one more hour digit, minutes 00-59, then AM or PM.
if ($time =~ /^[01]?[0-9]:[0-5][0-9](AM|PM)$/) {
    print "Right Time Dude...\n";
}
else {
    print "Wrong time Dude\n";
}
This is the regex you want:
/^[01]?[0-9]:[0-5][0-9](AM|PM)$/
Given this string as input:
Sat, 6 May 2017 02:08:08 +0000
I used this regex to match groups of one or two digits separated by colons ({1,2} rather than *, since * would also accept zero digits):
[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}
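As a small Python sketch of pulling the time out of that string, the pattern below uses bounded {1,2} quantifiers and capture groups. The variable names are illustrative only.

```python
import re

stamp = "Sat, 6 May 2017 02:08:08 +0000"

# Three colon-separated groups of one or two digits each.
m = re.search(r"\b([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2})\b", stamp)
parts = m.groups() if m else None
print(parts)  # prints ('02', '08', '08')
```

The day (6) and the year (2017) are skipped because neither is followed by a colon.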