Strange regex behavior with grep


grep '[:digit:]{1,}-{1,}' *.txt| wc -l
This command outputs: 0
grep '1-' *.txt| wc -l
However, this command outputs: 10598
Both commands are being run from the same directory. The first command should have returned greater than or equal to the output of the second command. Can anyone shed some insight about what is going on here?

echo 1 | grep '[:digit:]'
#nothing....
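grep uses a different syntax: a POSIX character class is only recognized inside a bracket expression, so you need [[:digit:]] or [0-9]. On its own, [:digit:] is just a bracket expression listing the characters :, d, i, g and t.
The unescaped {1,} syntax is not supported by basic grep, where the interval has to be written \{1,\}; you can use other modes instead, like the extended one with -E. Note: in extended mode one would normally use + to match one or more characters.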
General note: always test regexes in small parts to see that each part really does what you thought it does. Once the expression gets complicated, it's really hard to tell what went wrong.
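As a rough sketch of that piece-by-piece testing, the snippet below checks the digit class on its own before adding the hyphen part. The four input lines and the names sample, digits, ere and bre are made up for illustration; they are not the asker's data.

```shell
# Hypothetical sample input (not the asker's files).
sample() { printf '%s\n' '1-' '22--x' '7 apples' 'no digits here'; }

# Step 1: does the digit class alone match what we expect?
digits=$(sample | grep -c '[[:digit:]]')

# Step 2: extended mode, one or more digits followed by one or more hyphens.
ere=$(sample | grep -cE '[[:digit:]]+-+')

# Step 3: the same pattern in basic mode needs backslashed interval braces.
bre=$(sample | grep -c '[[:digit:]]\{1,\}-\{1,\}')

echo "digits=$digits ere=$ere bre=$bre"
# prints: digits=3 ere=2 bre=2
```

Counting with -c at each step makes it easy to spot the step where the count stops matching your expectation.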

Related

Can I perform a 'non-global' grep and capture only the first match found for each line of input?

I understand that what I'm asking can be accomplished using awk or sed, I'm asking here how to do this using GREP.
Given the following input:
.bash_profile
.config/ranger/bookmarks
.oh-my-zsh/README.md
I want to use GREP to get:
.bash_profile
.config/
.oh-my-zsh/
Currently I'm trying
grep -Po '([^/]*[/]?){1}'
Which results in output:
.bash_profile
.config/
ranger/
bookmarks
.oh-my-zsh/
README.md
Is there some simple way to use GREP to only get the first matched string on each line?

I think you can grep non / letters like:
grep -Eo '^[^/]+'
On another SO site there is another similar question with solution.

You don't need grep for this at all.
cut -d / -f 1

The -o option says to print every substring which matches your pattern, instead of printing each matching line. Your current pattern matches every string which doesn't contain slashes (optionally including a trailing slash); but it's easy to switch to one which only matches this pattern at the beginning of a line.
grep -o '^[^/]*' file
Notice the addition of the ^ beginning-of-line anchor, and the omission of the -P option (which you were not really using anyway) as well as the {1}, which is redundant.
(I should add that plain grep treats unescaped parentheses and braces as literal characters; grep -E supports these constructs just fine, or you could switch to the POSIX BRE variation, which requires a backslash to use round or curly brackets as metacharacters. You can probably ignore these details and just use grep -E everywhere unless you really need the features of grep -P, though also be aware that -P is not portable.)
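A small sketch of the anchored approach applied to the input from the question. It adds an optional trailing slash (/?) so the output matches what the asker wanted; that tweak and the names paths and first are assumptions for illustration. Note that -o itself is not in POSIX, although GNU and BSD grep both provide it.

```shell
# Input lines taken from the question.
paths() { printf '%s\n' '.bash_profile' '.config/ranger/bookmarks' '.oh-my-zsh/README.md'; }

# The ^ anchor restricts -o to the single match at the start of each line;
# the optional /? keeps the trailing slash when there is one.
first=$(paths | grep -oE '^[^/]*/?')
printf '%s\n' "$first"
# prints:
# .bash_profile
# .config/
# .oh-my-zsh/
```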

Output in grep command

I'm new with regexp and I believe this is a beginner's glitch in getting the output.
My regexp for AB(CD) digits/digits is: [A-Z]+[^a-zA-Z0-9][A-Z]+[^a-zA-Z0-9] [0-9]+[^a-zA-Z0-9][0-9]+
Grep command is: grep "regexp" xyz.txt
There's no output from the above command, but when I use the same regex in the Sublime editor, I get the desired result. I made many attempts with the grep command; the only time it gave results was when I deleted the [0-9]+[^a-zA-Z0-9][0-9]+ portion from the regex (because there is a space in between), but the results were still not what I wanted. I tried grep -e and grep --regexp= too, with no results.
Can someone tell me where I went wrong, or what the correct syntax for this command is? Much grateful.
Edit:
The data looks like the following:
AB(C.D.) nnnnn/nnnnnn
A.B(C.D.) nnnnnn/nnnnn
A.B.(CD) nnnnn/nnnnnn
AB(CD) nnnnn/nnnnnn
AAB(CD) nnnnn/nnnnnn
....
....
further P & C
I was looking only for AB(CD) nnnn/nnnnnn. Would really like to learn the correct expression.

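Use grep -E, which switches grep into a mode where the expression is evaluated as an ERE (Extended Regular Expression) rather than as a basic regular expression, which is its default. In the default mode an unescaped + is an ordinary character, which is why your pattern matches nothing.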
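To illustrate the difference, here is a sketch that runs the asker's expression in both modes against records shaped like the ones in the question. The digits are invented, and the last pattern is only one possible reading of "only AB(CD)": it anchors the line and spells out the letters and parentheses literally.

```shell
# Made-up records in the shape shown in the question (digits are invented).
records() {
  printf '%s\n' \
    'AB(C.D.) 12345/123456' \
    'A.B(C.D.) 123456/12345' \
    'A.B.(CD) 12345/123456' \
    'AB(CD) 12345/123456' \
    'AAB(CD) 12345/123456'
}

re='[A-Z]+[^a-zA-Z0-9][A-Z]+[^a-zA-Z0-9] [0-9]+[^a-zA-Z0-9][0-9]+'

# Basic mode: each + is a literal plus sign, so no line matches.
# (|| true because grep exits non-zero when the count is 0.)
bre_count=$(records | grep -c "$re" || true)

# Extended mode: + means "one or more"; the unanchored pattern also matches AAB(CD).
ere_count=$(records | grep -cE "$re")

# Anchored, literal variant for exactly AB(CD) followed by digits/digits.
exact=$(records | grep -E '^AB\(CD\) [0-9]+/[0-9]+$')

echo "bre_count=$bre_count ere_count=$ere_count exact=$exact"
# prints: bre_count=0 ere_count=2 exact=AB(CD) 12345/123456
```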

OS X groups seems to allow character classes to repeat many times by default

I am trying to grep a text file for some things, and I noticed some odd behavior on OS X. I feel that I have a pretty solid grasp of regular expressions, but maybe I don't know as much as I think. So, I apologize if the answer is obvious.
Each line of my text file has this format:
<number> <number> <text>
So just to start, I want to see if I could match lines starting with a 1:
grep "^1" dataset.txt
However, it seems grep matches any line starting with 1, 11, 111, etc. This is just incorrect, I think. EDIT: grep is matching 1, 11, 111, etc. This was causing some confusion. My problem is that grep is matching too many 1's, not that it is returning lines starting with 11.
Next, I wanted to see what would happen if I searched for any line starting with any digit:
grep "^[0-9]" dataset.txt
This matched the whole number at the start of each line, such as 130380, which is also incorrect. I tried this to see if I could only match the first digit in the line:
grep "^[0-9]?" dataset.txt
This pattern returns nothing. I also tried specifying -P to use perl style regular expressions and got this:
grep -P "^[0-9]" dataset.txt
usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
[-e pattern] [-f file] [--binary-files=value] [--color=when]
[--context[=num]] [--directories=action] [--label] [--line-buffered]
[--null] [pattern] [file ...]
Clearly P is in the list of arguments, although I read the man page on my system, and -P was not listed. Does anyone know why grep is acting like this?
Thanks

grep "^1" dataset.txt
However, it seems grep matches any line starting with 1, 11, 111, etc. This is just incorrect, I think.
This is expected behavior: you're asking for lines whose first char is 1, without further constraining what comes after.
If, by contrast, you don't want to constrain matching, but instead want to constrain the output by only printing the matching part of the line, you must use grep's -o option.
Update: It turns out that the OP was referring to the behavior of the --color option: --color is supposed to color (highlight) the matching part of every matching line, but does so incorrectly due to a bug, as of grep (BSD grep) 2.5.1-FreeBSD (OS X 10.9.2).
Clearly P is in the list of arguments, although I read the man page on my system, and -P was not listed. Does anyone know why grep is acting like this?
-P (Perl-style regexes) is indeed NOT supported on OS X; what you see is a typo in the usage message (it should be -p (lowercase!), an entirely different option; see man grep).
grep "^[0-9]?" dataset.txt
This pattern returns nothing.
This is expected behavior: OSX grep defaults to basic (aka obsolete) regular expressions, which require escaping ? as \?.
If you want to use extended (aka modern) regular expressions - where such escaping is not needed - invoke grep either as egrep or with the -E option.
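The points above can be sketched with a single made-up line in the question's format. The line content and variable names are assumptions, and the -o result shown is what a grep without the highlighting and matching bug described above should produce.

```shell
# Hypothetical line in the "<number> <number> <text>" format from the question.
line='130380 42 some text'

# -o prints only the matched part of the line: here, just the first digit.
first_digit=$(printf '%s\n' "$line" | grep -o '^[0-9]')

# Basic mode: an unescaped ? is a literal question mark, so this line does not match.
# (|| true because grep exits non-zero when the count is 0.)
bre_hits=$(printf '%s\n' "$line" | grep -c '^[0-9]?' || true)

# Extended mode: ? means "zero or one", so the line matches.
ere_hits=$(printf '%s\n' "$line" | grep -cE '^[0-9]?')

echo "first_digit=$first_digit bre_hits=$bre_hits ere_hits=$ere_hits"
# prints: first_digit=1 bre_hits=0 ere_hits=1
```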

Simple but elusive expression

I feel even shy of asking this here but it took already more time than it should.
Say I have these four files:
IPCDR_ARB06067956VPLUS_T_201103
IPCDR_ARB06067957VPLUS_T_201103
IPCDR_MOV_ARB06067959VPLUS_T_20110
MOV_CDRARB06067959VPLUS_T_201103
I want to grep for only those starting with IPCDR_MOV and MOV_CDR.
First thing off was:
ls -1 | grep "^IPCDR_MOV|^MOV_CDR"
but didn't work.
I've done plenty of dumb tests (which I won't bother you with) and nothing comes out. Can someone please put me out of my pain?
Thanks!

Add the -E switch for using extended regex.
$ ls -1 | grep -E "^IPCDR_MOV|^MOV_CDR"
IPCDR_MOV_ARB06067959VPLUS_T_20110
MOV_CDRARB06067959VPLUS_T_201103

Use egrep instead of grep (it is equivalent to grep -E). Otherwise the | is treated as a literal character instead of as alternation.
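For comparison, a sketch that feeds the four file names from the question through two equivalent filters: alternation with -E, and two separate -e patterns, which works in basic mode too. The function name names and the variables are made up for illustration.

```shell
# File names taken from the question.
names() {
  printf '%s\n' \
    'IPCDR_ARB06067956VPLUS_T_201103' \
    'IPCDR_ARB06067957VPLUS_T_201103' \
    'IPCDR_MOV_ARB06067959VPLUS_T_20110' \
    'MOV_CDRARB06067959VPLUS_T_201103'
}

# Extended mode: | is the alternation operator.
with_E=$(names | grep -E '^IPCDR_MOV|^MOV_CDR')

# Basic mode alternative: give each pattern its own -e; a line matching either is printed.
with_e=$(names | grep -e '^IPCDR_MOV' -e '^MOV_CDR')

printf '%s\n' "$with_E"
# prints:
# IPCDR_MOV_ARB06067959VPLUS_T_20110
# MOV_CDRARB06067959VPLUS_T_201103
```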

Sed substitution not doing what I want and think it should do

I am trying to use sed to get some info that is encoded in the path of a file, which is passed as a parameter to my script (Bourne sh, if it matters).
From this example path, I'd like the result to be 8
PATH=/foo/bar/baz/1-1.8/sing/song
I first got the regex close by using sed as grep:
echo $PATH | sed -n -e "/^.*\/1-1\.\([0-9][0-9]*\).*/p"
This properly recognized the string, so I edited it to make a substitution out of it:
echo $PATH | sed -n -e "s/^.*\/1-1\.\([0-9][0-9]*\).*/\1/"
But this doesn't produce any output. I know I'm just not seeing something simple, but would really appreciate any ideas about what I'm doing wrong or about other ways to debug sed regular expressions.
(edit)
In the example path, the components other than the numerical one can contain numbers similar to the numeric path component that I listed, but not quite the same. I'm trying to exactly match the component that is 1-1.<some-number> and see what some-number is.
It is also possible to have an input string that the regular expression should not match, which should produce no output.

The -n option to sed suppresses normal output, and since your second command doesn't have a p command, nothing is output. Get rid of the -n or stick a p back on the end.
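A short sketch of the "keep -n, add p" variant. It has the side effect that input which does not match produces no output, which is what the question asks for. The variable names p, hit and miss and the second, non-matching path are made up for illustration.

```shell
# Example path from the question, stored in a variable that is not named PATH.
p=/foo/bar/baz/1-1.8/sing/song

# -n suppresses automatic printing; the trailing p flag prints the line
# only when the substitution actually happened.
hit=$(printf '%s\n' "$p" | sed -n 's/^.*\/1-1\.\([0-9][0-9]*\).*/\1/p')

# A hypothetical path without a 1-1.<number> component yields no output at all.
miss=$(printf '%s\n' '/foo/bar/baz/2-2.8/sing/song' | sed -n 's/^.*\/1-1\.\([0-9][0-9]*\).*/\1/p')

echo "hit=$hit miss=[$miss]"
# prints: hit=8 miss=[]
```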

It looks like you're trying to get the 8 from the 1-1.8 (where 8 is any sequence of numerics), yes? If so, I would just use:
echo /foo/bar/baz/1-1.8/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
No doubt you could get it working with one sed "instruction" (-e) but sometimes it's easier just to break it down.
The first strips out everything from the start up to and including 1-1., the second strips from the first non-numeric after that to the end.
$ echo /foo/bar/baz/1-1.8/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
8
$ echo /foo/bar/baz/1-1.752/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
752
And, as an aside, this is actually how I debug sed regular expressions. I put simple ones in independent instructions (or independent part of a pipeline for other filtering commands) so I can see what each does.
Following your edit, this also works:
$ echo /foo/bar/baz/1-1.962/sing/song | sed -e "s/.*\/1-1\.\([0-9][0-9]*\).*/\1/"
962
As to your comment:
In the example path, the components other than the numerical one can contain numbers similar to the numeric path component that I listed, but not quite the same. I'm trying to exactly match the component that is 1-1.<some-number> and see what some-number is.
The two-part sed command I gave you should work with numerics anywhere in the string (as long as there's no 1-1. after the one you're interested in). That's because it actually deletes everything up to the specific 1-1. string, and then everything from the first non-numeric character onward. If you have some examples that don't work as expected, toss them into the question as an update and I'll adjust the answer.

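You can shorten your command by using \+ (one or more) instead of * (zero or more). Note that \+ is a GNU extension to basic regular expressions, so it may not work in other sed implementations, and that with -n you still need the p flag to get any output:
sed -n -e "s/^.*\/1-1\.\([0-9]\+\).*/\1/p"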

Don't use PATH as your variable name: it clashes with the PATH environment variable.
echo "$path" | sed -e 's/.*1-1\.//;s/\/.*//'

You needn't delimit your patterns with / (s/a/b/g); you may choose almost any character, so if you're dealing with paths, # is more useful than /:
echo /foo/1-1.962/sing | sed -e "s#.*/1-1\.\([0-9]\+\).*#\1#"
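Since \+ in a basic regular expression is a GNU extension, here are two hedged alternatives that keep the # delimiter: sed -E (extended syntax, accepted by GNU and BSD sed) and the basic-syntax interval \{1,\}. The variable names are made up for illustration.

```shell
# Extended syntax: plain + and unescaped grouping parentheses.
ere_num=$(echo /foo/1-1.962/sing | sed -E 's#.*/1-1\.([0-9]+).*#\1#')

# Basic syntax without GNU extensions: \{1,\} means "one or more".
bre_num=$(echo /foo/1-1.962/sing | sed 's#.*/1-1\.\([0-9]\{1,\}\).*#\1#')

echo "ere_num=$ere_num bre_num=$bre_num"
# prints: ere_num=962 bre_num=962
```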
