grep -o multiple occurrences of variable string in same line - regex

I have the following line of text in a file:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/), I checked Ronit's [DNased ctenidia RNA (from 20181016)](http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I would like to extract each of the strings that match this pattern:
(http://onsnetwork.org/kubu4/.*/)
I've tried the following command, but it returns the entire line, despite the -o flag:
grep -o "(http://onsnetwork.org/kubu4/.*/)" file.txt
The output I'd like is this:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/)
(http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I'll be applying the grep command to a series of files that will have different text after (http://onsnetwork.org/kubu4/, so the command needs to allow for that flexibility.
I'm just not sure why the regex portion of the grep causes grep to return the entire line instead of each matching occurrence.

You should match URLs that are inside parentheses:
grep -o '(http://onsnetwork.org/kubu4/[^)]*/)' file.txt # So, [^)]* and not .*
With .*/, grep will extract from ( to the last /) on the line, because .* is greedy.
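To make the difference concrete, here is a minimal sketch that runs both patterns against one line. The shortened paths `a/` and `b/` and the variable names are made up for illustration; only the URL prefix comes from the question.

```shell
# Sample line with two parenthesised URLs (shortened, hypothetical paths).
line='(http://onsnetwork.org/kubu4/a/), see [x](http://onsnetwork.org/kubu4/b/)'

# Greedy .* runs to the last "/)" on the line, so there is a single match.
greedy=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/.*/)')

# [^)]* cannot cross a closing parenthesis, so each URL is matched separately.
each=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/[^)]*/)')

printf '%s\n' "$greedy"
printf -- '---\n'
printf '%s\n' "$each"
```

The greedy pattern prints the whole sample line once, while the negated bracket expression prints the two URLs on separate lines.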

Related

Matching the First Character on Each Line (UNIX egrep)

I'm looking to match and return just the first character from each line in a plain-text UTF-8 encoded file, using egrep in a UNIX terminal. I presumed that the following egrep command with a simple RegEx would produce the desired result:
egrep -o "^." FILE.txt
However, the output appears to be matching and returning every character in the file; that is, it is behaving as if the command were:
egrep -o "." FILE.txt
Similar results occur with the following command,
egrep -o "^[a-z]" FILE.txt
namely, the results act as if the RegEx "[a-z]" were supplied (i.e., every lowercase ASCII character in the range a-z is matched).
Commands in which just one specific alphanumeric character is supplied seem, as expected, to return every line that begins with that character, e.g.,
egrep -o "^1" FILE.txt
or
egrep -o "^T" FILE.txt
return all lines beginning with "1" or "T", respectively.
I have tried pasting the entirety of the file into a RegEx tester, such as at https://regexr.com/, and the expression "^." indeed behaves as expected, so I don't think that my file has any further whitespace characters that could be interfering.
Is there some other behavior of the line-beginning metacharacter "^" with egrep that could be causing this problem?
This is a known bug in BSD grep and GNU grep 2.5.1-FreeBSD (also discussed here).
In -o mode, ^ anchor isn't handled properly (reported here, patched here):
$ echo abc | bsdgrep -o "^."
a
b
c
GNU grep on Linux behaves as expected:
$ echo abc | grep -o "^."
a
As for what you are trying to achieve here (printing the first character of every line), grep is overkill. A simple cut would suffice:
$ echo abc | cut -c1
a
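For a multi-line input, a small sketch along these lines may help. The sample content and variable names are hypothetical, and the sed variant is an alternative of mine rather than part of the answer. Note that some cut implementations count bytes rather than characters for -c, so non-ASCII first characters in a UTF-8 file may need the sed form in a UTF-8 locale.

```shell
# Three-line sample input (hypothetical content).
input=$(printf 'abc\n1de\nTfg\n')

# cut -c1 keeps the first character of every line.
with_cut=$(printf '%s\n' "$input" | cut -c1)

# sed alternative: capture the first character and drop the rest of the line.
with_sed=$(printf '%s\n' "$input" | sed 's/^\(.\).*/\1/')

printf '%s\n' "$with_cut"
printf '%s\n' "$with_sed"
```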

Regex to match plurals only if an exact match is not found

I could use a little help figuring out regex. Given a list of words in a file:
Peril
Is
I
Non
No
I'm trying to find a regex that will match a plural if necessary but only if there is not another match available. What I have at the moment:
#!/bin/bash
findword(){
grep -iE "^$1?" file
}
If I run it like findword perils it returns Peril. That's what I want to happen.
But if I run it like findword non it matches both Non and No.
Same with findword is matches both Is and I. That's not what I want to happen. I only want non exact matches if it can't find an exact match in the list.
$ cat file
Peril
Is
I
Non
No
$ findword(){ grep -ix "$1" file || grep -ix "${1::-1}" file; }
$ findword no
No
$ findword non
Non
$ findword none
Non
$ findword i
I
$ findword is
Is
-x forces a match against the entire line only
grep -ix "$1" file: if a match is found, it is printed and the exit status is 0
otherwise, the command after || comes into play
grep -ix "${1::-1}" file: check again with the last character removed
you can also use grep -ixE "$1?" file for that second check
Also, you can add the -F option in case words contain metacharacters like . that you want to search for literally
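Putting those notes together, a self-contained sketch might look like the following. The temporary file and the `wordlist` variable are assumptions for the demo, and `${1%?}` is used here as a POSIX way to drop the last character in place of the bash-only `${1::-1}`.

```shell
# Hypothetical word list written to a temporary file for the demo.
wordlist=$(mktemp)
printf '%s\n' Peril Is I Non No > "$wordlist"

findword() {
    # Exact whole-line match first (-x), case-insensitive (-i), literal (-F).
    grep -ixF -- "$1" "$wordlist" ||
        # Only if that fails, retry with the last character removed.
        grep -ixF -- "${1%?}" "$wordlist"
}

exact=$(findword non)
plural=$(findword perils)
printf '%s\n%s\n' "$exact" "$plural"

rm -f "$wordlist"
```

Because the fallback only runs when the first grep exits non-zero, `findword non` never reaches the shorter pattern that would also match No.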

Not able to match colon using grep regexp

I have a large ASCII file. Each line contains a field like:
"id":"N119PM-1442267121-144-0"
The double quotes are actually in the file, not my addition. The fields are delimited by commas but they do not necessarily appear in the same order from line to line, which means that using cut is not a viable option.
I have been using:
grep -o '"id":"[A-Aa-z0-9-]\+' <filename>
and it works for the type of field shown above. But, there is a problem. A large number of these fields look like
"id":"JBU19-1442091600-schedule-0000:4"
In other words, they have an extra colon and number at the end. I have not been able to select fields with these extra characters.
I've tried:
grep -o '"id":"[A-Aa-z0-9:-]\+' <filename>
grep -o '"id":"[A-Aa-z0-9\:-]\+' <filename>
grep -o '"id":"[A-Aa-z0-9-]\+\(:[0-9]\+\)' <filename>
without success. Any help would be appreciated.
EDIT: I have also tried changing the : to % first then search on %, but this didn't work, either.
If you are using GNU grep, you can use the -P option:
grep -oP '"id":"[A-Za-z0-9-:]+"' <filename>
"id":"N119PM-1442267121-144-0"
"id":"JBU19-1442091600-schedule-0000:4"
-P, --perl-regexp PATTERN is a Perl regular expression
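If -P is not available, a plain basic-regex version should behave the same for these fields. This is a sketch under the assumption that ids contain only letters, digits, hyphens and colons; the surrounding `"x"` and `"y"` fields are invented for the demo. Placing the hyphen last in the bracket expression keeps it literal.

```shell
# Two sample ids from the question, embedded in hypothetical records.
data='{"x":1,"id":"N119PM-1442267121-144-0"}
{"id":"JBU19-1442091600-schedule-0000:4","y":2}'

# Hyphen last in the bracket expression so it is taken literally;
# the closing quote sits outside the class, so the match stops there.
ids=$(printf '%s\n' "$data" | grep -o '"id":"[A-Za-z0-9:-]*"')

printf '%s\n' "$ids"
```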

grep regular expression returns full line

I'm trying to print everything after a keyword using grep, but the command returns the whole line. I'm using the following:
grep -P (\skeyword\s)(.*)
an example line is:
abcdefg keyword hello, how are you.
The result should be hello, how are you but instead it gives the full line. Am I doing something wrong here?
You need to use the -o (only matching) option together with \K (which discards the previously matched characters) or a positive lookbehind.
grep -oP '\skeyword\s+\K.*' file
\K keeps the text matched so far out of the overall regex match. \s+ matches one or more whitespace characters.
Example:
$ echo 'abcdefg keyword hello, how are you.' | grep -oP '\skeyword\s+\K.*'
hello, how are you.
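Where grep lacks -P, the same result can be sketched with sed instead. This is an alternative technique, not the method above: it deletes everything up to and including the keyword, and because .* is greedy it cuts at the last occurrence of the keyword on the line.

```shell
line='abcdefg keyword hello, how are you.'

# Remove everything through " keyword " and print only lines where that
# substitution happened (-n plus the p flag skips non-matching lines).
rest=$(printf '%s\n' "$line" | sed -n 's/.*[[:space:]]keyword[[:space:]]\{1,\}//p')

printf '%s\n' "$rest"
```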
By default, grep prints lines that match. To print only the matching part, try the -o option.

grep not returning expected result with regex on xml

I'm running a grep command on some xml, and it appears to be misinterpreting the regular expression I'm trying to use.
Here's the command
grep '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
What appears to be happening is that the ?> part of the regex causes no match at all, rather than matching up to the first occurrence of >.
Any ideas?
If you want to get the text up to the first occurrence of the > character, then try the command below:
grep -o '<ernm:NewReleaseMessage[^>]*>' file
If you want the whole line then remove -o parameter.
Example:
$ cat aa1.txt
<ernm:NewReleaseMessage blah> foo bar>
$ grep -o '<ernm:NewReleaseMessage[^>]*>' aa1.txt
<ernm:NewReleaseMessage blah>
grep with -o prints only the matched text.
[^>]* - zero or more characters other than >. So it matches up to the first occurrence of the > character.
By default, grep uses basic regular expressions and treats ? as a literal question mark. For it to be treated as regex syntax, you need to escape that character.
grep '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
You can use the -E option which interprets the pattern as an extended regular expression.
grep -E '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
Note: The commands above return the whole line that matches your pattern. If you only want the matched text, use the -o option, which prints only the matched parts of matching lines.
grep -o '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
OR
grep -Eo '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
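One caveat worth checking on your own system: in POSIX-style extended regexes, ? after .* only makes the group optional and does not make it lazy. The sketch below, which reuses the sample line from the earlier answer and assumes GNU grep, compares the -E pattern with the negated bracket expression.

```shell
line='<ernm:NewReleaseMessage blah> foo bar>'

# Under -E, "?" only makes the preceding ".*" optional; GNU grep still
# matches greedily, up to the last ">" on the line.
ere=$(printf '%s\n' "$line" | grep -Eo '<ernm:NewReleaseMessage.*?>')

# A negated bracket expression stops at the first ">".
first=$(printf '%s\n' "$line" | grep -o '<ernm:NewReleaseMessage[^>]*>')

printf '%s\n%s\n' "$ere" "$first"
```

With GNU grep the first command prints the entire sample line, while the second prints only the opening tag, so [^>]* is the safer choice when the line contains more than one >.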