Matching the First Character on Each Line (UNIX egrep) - regex

I'm looking to match and return just the first character from each line in a plain-text, UTF-8 encoded file, using egrep in a UNIX terminal. I presumed that the following egrep command with a simple RegEx would produce the desired result:
egrep -o "^." FILE.txt
However, the output appears to be matching and returning every character in the file; that is, it is behaving as if the command were:
egrep -o "." FILE.txt
Similar results occur with the following command,
egrep -o "^[a-z]" FILE.txt
namely, the results act as if the RegEx "[a-z]" were supplied (i.e., every lowercase ASCII character in the range a-z is matched).
Commands in which just one specific alphanumeric character is supplied seem, as expected, to return every line that begins with that character, e.g.,
egrep -o "^1" FILE.txt
or
egrep -o "^T" FILE.txt
return all lines beginning with "1" or "T", respectively.
I have tried pasting the entirety of the file into a RegEx tester, such as at https://regexr.com/, and the expression "^." indeed behaves as expected, so I don't think that my file has any further whitespace characters that could be interfering.
Is there some other behavior of the line-beginning metacharacter "^" with egrep that could be causing this problem?

This is a known bug in BSD grep and GNU grep 2.5.1-FreeBSD (also discussed here).
In -o mode, ^ anchor isn't handled properly (reported here, patched here):
$ echo abc | bsdgrep -o "^."
a
b
c
GNU grep on Linux behaves as expected:
$ echo abc | grep -o "^."
a
As for what you are trying to achieve here (printing the first character of every line), grep is overkill. A simple cut would suffice:
$ echo abc | cut -c1
a
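If the grep at hand has the -o anchor bug, the same result can be had without -o at all. The following is a minimal sketch, assuming a POSIX sed and awk; the sample input and the function names are made up for illustration. Unlike grep -o, both variants print an empty line for an empty input line.

```shell
# Hypothetical sample input: three lines, the middle one empty.
sample() { printf 'abc\n\nTuv\n'; }

# sed: keep the first character of each line and drop the rest.
first_char_sed() { sed 's/^\(.\).*$/\1/'; }

# awk: substr() returns the first character; empty lines stay empty.
first_char_awk() { awk '{ print substr($0, 1, 1) }'; }

sample | first_char_sed
sample | first_char_awk
```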

Related

grep -o multiple occurrences of variable string in same line

I have the following line of text in a file:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/), I checked Ronit's [DNased ctenidia RNA (from 20181016)](http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I would like to extract each of the strings that match this pattern:
(http://onsnetwork.org/kubu4/.*/)
I've tried the following command, but it returns the entire line, despite the -o flag:
grep -o "(http://onsnetwork.org/kubu4/.*/)" file.txt
The output I'd like is this:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/)
(http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I'll be applying the grep command to a series of files that will have different text after (http://onsnetwork.org/kubu4/, so the command needs to allow for that flexibility.
I'm just not sure why the regex portion of the grep causes grep to return the entire line instead of each matching occurrence.
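Match only the URLs that sit inside parentheses by excluding the closing parenthesis from the middle of the match:
grep -o '(http://onsnetwork.org/kubu4/[^)]*/)' # So, [^)]* and not .*
With .*/, grep will extract everything from the first ( to the last /) on the line, because .* is greedy.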
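The difference between the greedy and the bounded pattern can be seen side by side. This is a small sketch with a shortened, hypothetical sample line (the paths a/ and b/ are not from the question) and a grep that supports -o:

```shell
# Hypothetical one-line sample with two parenthesised URLs.
line='(http://onsnetwork.org/kubu4/a/), see [x (y)](http://onsnetwork.org/kubu4/b/)'

# Greedy .* runs to the last "/)" on the line: one long match.
greedy=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/.*/)')

# [^)]* cannot cross a ")", so each URL is matched separately.
bounded=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/[^)]*/)')

printf '%s\n' "$greedy"
printf '%s\n' "$bounded"
```

Here the greedy version returns the whole sample line, while the bounded one returns the two URLs on separate lines.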

grep regex with backtick matches all lines

$ cat file
anna
amma
kklks
ksklaii
$ grep '\`' file
anna
amma
kklks
ksklaii
Why? How is that match working ?
This appears to be a GNU extension for regular expressions. The backtick ('\`') anchor matches the very start of a subject string, which explains why it is matching all lines. OS X apparently doesn't implement the GNU extensions, which would explain why your example doesn't match any lines there. See http://www.regular-expressions.info/gnu.html
If you want to match an actual backtick when the GNU extensions are in effect, this works for me:
grep '[`]' file
twm's answer provides the crucial pointer, but note that it is the sequence \`, not ` by itself that acts as the start-of-input anchor in GNU regexes.
Thus, to match a literal backtick in a regex specified as a single-quoted shell string, you don't need any escaping at all, neither with GNU grep nor with BSD/macOS grep:
$ { echo 'ab'; echo 'c`d'; } | grep '`'
c`d
When using double-quoted shell strings - which you should avoid for regexes, for reasons that will become obvious - things get more complicated, because you then must escape the ` for the shell's sake in order to pass it through as a literal to grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\`"
c`d
Note that, after the shell has parsed the "..." string, grep still only sees `.
To recreate the original command with a double-quoted string with GNU grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\\\`" # !! BOTH \ and ` need \-escaping
ab
c`d
Again, after the shell's string parsing, grep sees just \`, which to GNU grep is the start-of-the-input anchor, so all input lines match.
Also note that since grep processes input line by line, \` has the same effect as ^, the start-of-a-line anchor; with multi-line input, however - such as if you used grep -z to read all lines at once - \` only matches the very start of the whole string.
To BSD/macOS grep, \` simply escapes a literal `, so it only matches input lines that contain that character.
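To see what grep actually receives under each quoting style, the pattern can be printed instead of passed to grep. This is only a sketch; the variable names are invented for illustration, and the last line assumes nothing beyond a grep that supports -c.

```shell
single='\`'        # single quotes: grep would receive \` (two characters)
double="\`"        # double quotes: the shell strips the backslash, leaving `
double_esc="\\\`"  # \\ yields \ and \` yields `, so grep would receive \`

# Show the three patterns exactly as they would arrive at grep.
printf '%s\n' "$single" "$double" "$double_esc"

# Counting lines that contain a literal backtick needs no escaping at all.
literal_hits=$(printf 'ab\nc`d\n' | grep -c '`')
printf '%s\n' "$literal_hits"
```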

grep not returning expected result with regex on xml

I'm running a grep command on some XML, and it appears to be misinterpreting the regular expression I'm trying to use.
Here's the command
grep '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
What appears to be happening is that the ?> part of the regex causes no match at all, rather than matching up to the first occurrence of >.
Any ideas?
If you want to get the text up to the first occurrence of the > character, then try the command below:
grep -o '<ernm:NewReleaseMessage[^>]*>' file
If you want the whole line then remove -o parameter.
Example:
$ cat aa1.txt
<ernm:NewReleaseMessage blah> foo bar>
$ grep -o '<ernm:NewReleaseMessage[^>]*>' aa1.txt
<ernm:NewReleaseMessage blah>
grep with -o prints only the matched text.
[^>]* matches zero or more characters that are not >, so the match stops at the first occurrence of >.
By default, grep uses basic regular expressions, in which ? is a literal question mark. For it to be treated as a quantifier (a GNU extension in basic mode), you need to escape that character.
grep '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
Alternatively, you can use the -E option, which interprets the pattern as an extended regular expression, where an unescaped ? is a quantifier.
grep -E '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
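Note: The commands above return the whole line that matches your pattern; if you only want the matched text, use the -o option, which prints only the matched parts of matching lines. Also be aware that neither basic nor extended POSIX syntax has lazy quantifiers, so .*? is not a non-greedy match here: with GNU grep it still runs to the last > on the line. Use [^>]* to stop at the first >.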
grep -o '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
OR
grep -Eo '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml

UNIX grep with $

I have a quick question:
Suppose I have a file contains:
abc$
$
$abc
and then I use grep "c\$" filename, then I got $abc only. But if I use grep "c\\$" filename, I got abc$.
I am pretty confused: doesn't the backslash already turn off the special meaning of $? So shouldn't grep "c\$" filename return the line abc$?
I really hope someone can kindly give me some suggestions.
Many thanks in advance.
The double quotes are throwing you off. They let the shell process backslashes and expand meta-characters before grep ever sees the pattern. On my Linux box using single quotes only:
$ grep 'abc$' <<<'abc$'
$ grep 'abc\$' <<<'abc$'
abc$
$ grep 'abc\$' <<<"abc$"
abc$
$ grep 'abc$' <<<'abc$'
$ grep 'abc\\$' <<<'abc$'
$
Note that the only pattern in the five commands above that found a match (and printed it) was abc\$. When I didn't escape the $, grep assumed I was looking for the string abc anchored to the end of the line. When I put a single backslash before the $, it recognized the $ as a literal character and not as an end-of-line anchor.
Note that the $ as an end of line anchor has some intelligence. If I put the $ in the middle of a regular expression, it's a regular character:
$ grep 'a$bc' <<<'a$bc'
a$bc
$ grep 'a\$bc' <<<'a$bc'
a$bc
Here, it found the literal string a$bc whether or not I escaped the $.
Tried things with double quotes:
$ grep "abc\$" <<<'abc$'
$ grep "abc\\$" <<<'abc$'
abc$
With a single \, the shell consumed the backslash, so grep saw abc$ and treated the $ as an end-of-line anchor. With two (\\), the shell passed a single backslash through, so grep saw abc\$ and treated the $ as a regular expression literal.
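You might wonder whether $ needs escaping at all.
The GNU grep manual says this about basic regular expressions:
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). Note that $ is not in that list: at the end of a pattern it remains an end-of-line anchor, so it must be escaped to match literally.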
I would suggest using fgrep if you want to search for literal $ and avoid escaping $ (which means end of line):
fgrep 'abc$' <<< 'abc$'
gives this output:
abc$
PS: fgrep is the same as grep -F; as per man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
The $ sign has a special meaning in regexp patterns (end of line), so when you use double quotes
grep "c\$"
the shell expands the string to the two characters c and $, and grep treats it as the regexp "find all lines with c at the end".
With single quotes, every character is passed through as-is, i.e. the
grep 'c\$'
command receives the three characters c, \ and $. grep gets all of those symbols as input, sees the escaped $ (i.e. \$), and does what you expected.
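The quoting rules discussed in the answers above can be checked in one go. The sketch below recreates the three-line file from the question in a temporary file (the use of mktemp and the variable names are assumptions for illustration) and first prints each pattern as grep would receive it:

```shell
# Recreate the three-line file from the question in a temporary location.
f=$(mktemp)
printf '%s\n' 'abc$' '$' '$abc' > "$f"

# What grep receives after the shell has processed each quoting style.
printf '%s\n' 'c\$' "c\$" "c\\$"   # prints c\$ then c$ then c\$

single=$(grep 'c\$' "$f")    # grep sees c\$: a literal $ after c
double=$(grep "c\$" "$f")    # grep sees c$: a line ending in c
double2=$(grep "c\\$" "$f")  # grep sees c\$: a literal $ again
fixed=$(grep -F 'c$' "$f")   # -F: fixed string, no regex at all

printf '%s %s %s %s\n' "$single" "$double" "$double2" "$fixed"
rm -f "$f"
```

Only the single-backslash, double-quoted form selects $abc; the other three select abc$.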

Matching arbitrary number of digits using grep regex

I've got a file that has lines in it that look similar as follows
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.
Use + instead of *.
+ matches at least one or more of the preceding.
* matches zero or more.
^[Dd]ata[0-9]+later$
In grep's default (basic) syntax you need to escape the +. Note that \d, the Perl-style shorthand for a digit, is only understood by grep -P, so [0-9] is used here:
^[Dd]ata[0-9]\+later$
In your example file you also have a line:
datafhj893724897290384later
This currently will not be matched because there are letters between data and the numbers. We can fix this by adding a [^0-9]* to match anything after data until the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*[0-9]\+later$' filename
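For systems where neither \+ nor \d is available in basic mode, "one or more digits" can be spelled as [0-9][0-9]*. The following is a sketch under that assumption, run against the sample lines from the question; the temporary file and variable names are illustrative only.

```shell
# The sample lines from the question, written to a temporary file.
f=$(mktemp)
printf '%s\n' data datalater 983290842 Data387428later \
  datafhj893724897290384later 4329804928later > "$f"

# Digits directly after "data": [0-9][0-9]* means one or more digits.
strict=$(grep '^[Dd]ata[0-9][0-9]*later$' "$f")

# Allow non-digits between "data" and the digits.
loose=$(grep '^[Dd]ata[^0-9]*[0-9][0-9]*later$' "$f")

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
rm -f "$f"
```

The strict form selects only Data387428later; the loose form also selects datafhj893724897290384later, and neither selects datalater.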
Using Cygwin, patterns that rely on \d without the -P flag didn't work. I had to modify the commands to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag. Instead of the [^0-9]* whose necessity @Tom_Cammann aptly pointed out, I used .*, which matches any run of characters between data and the digits. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
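The reason Perl expressions are needed is most likely \d: it is a Perl-style shorthand that grep only understands with -P, which is not specific to Cygwin. In -P mode a + must also be left unescaped, since \+ there means a literal plus sign.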
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$
You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. You were also matching a comma at the beginning of the string (e.g. ",ata1234later"), because [D,d] includes the comma. And \d is a shortcut for any digit character (with grep, it requires the -P flag). So I changed those as well.
You should put a "+" (which means one or several) instead of "*" (which means zero, one or several).
The "+" syntax only works for extended-regexp, not standard grep.
At least, that's my experience on RHEL.
To use extended-regexp, run egrep or pass "-E" / "--extended-regexp"
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH
grep -Eio "^(data)[0-9]+(later)$"
Here ^[dD]ata (or ^data with -i) anchors the match to the start of the line, and later$ anchors it to the end.
🎯 MOTIVATION
The rest of answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
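As a concrete instance of the template above, digit can be substituted for the group placeholder. This sketch uses -E, the short form of --extended-regexp, and a few of the sample lines from the question; the variable names are made up for illustration.

```shell
# A few of the question's sample lines, fed to grep on standard input.
lines() { printf '%s\n' data datalater 983290842 Data387428later 4329804928later; }

# One or more digits between "data" and "later", anchored at both ends.
matches=$(lines | grep -E '^[Dd]ata[[:digit:]]+later$')

# The same class on its own: lines that consist only of digits.
numeric=$(lines | grep -E '^[[:digit:]]+$')

printf '%s\n' "$matches" "$numeric"
```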