Regex to match plurals only if an exact match is not found

I could use a little help figuring out regex. Given a list of words in a file:
Peril
Is
I
Non
No
I'm trying to find a regex that will match a plural if necessary but only if there is not another match available. What I have at the moment:
#!/bin/bash
findword(){
grep -iE "^$@?" file
}
If I run it as findword perils, it returns Peril. That's what I want.
But if I run findword non, it matches both Non and No.
Likewise, findword is matches both Is and I. That's not what I want: a non-exact match should only be returned when there is no exact match in the list.

$ cat file
Peril
Is
I
Non
No
$ findword(){ grep -ix "$1" file || grep -ix "${1::-1}" file; }
$ findword no
No
$ findword non
Non
$ findword none
Non
$ findword i
I
$ findword is
Is
-x forces the pattern to match the entire line
grep -ix "$1" file: if a match is found, it is printed and the exit status is 0
otherwise, the command after || comes into play
grep -ix "${1::-1}" file: check again with the last character removed (the negative length in ${1::-1} needs bash 4.2 or later)
you can also use grep -ixE "$1?" file as the fallback
Also, you can add the -F option in case the words can contain metacharacters such as . that you want to search for literally
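The exact-match-first logic described above can be put into a self-contained sketch. The temporary word list and the variable name wordfile are assumptions made for illustration; -F is included so the word is searched for literally, and the fallback is skipped for one-character words.

```shell
#!/bin/bash
# Hypothetical word list, written to a temp file so the sketch is self-contained.
wordfile=$(mktemp)
trap 'rm -f "$wordfile"' EXIT
printf '%s\n' Peril Is I Non No > "$wordfile"

findword() {
    # 1) Exact match: case-insensitive (-i), whole line (-x), literal (-F).
    grep -ixF -- "$1" "$wordfile" && return 0
    # 2) Only if that failed: retry with the last character removed.
    #    ${1%?} is the POSIX spelling of ${1::-1}.
    [ "${#1}" -gt 1 ] && grep -ixF -- "${1%?}" "$wordfile"
}

findword perils   # Peril
findword non      # Non
findword is       # Is
```

Because the second grep only runs when the first one finds nothing, findword non can never return No.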

Related

grep -o multiple occurrences of variable string in same line

I have the following line of text in a file:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/), I checked Ronit's [DNased ctenidia RNA (from 20181016)](http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I would like to extract each of the strings that match this pattern:
(http://onsnetwork.org/kubu4/.*/)
I've tried the following command, but it returns the entire line, despite the -o flag:
grep -o "(http://onsnetwork.org/kubu4/.*/)" file.txt
The output I'd like is this:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/)
(http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I'll be applying the grep command to a series of files that will have different text after (http://onsnetwork.org/kubu4/, so the command needs to allow for that flexibility.
I'm just not sure why the regex portion of the grep causes grep to return the entire line instead of each matching occurrence.

You should match only up to the closing parenthesis of each URL:
grep -o '(http://onsnetwork.org/kubu4/[^)]*/)' file.txt # So, [^)]* and not .*
With .*/, grep will extract everything from ( to the last /) on the line, because .* is greedy.
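A minimal sketch of the difference, assuming a shortened, made-up input line (the path segments a/ and b/ are placeholders, not the real post URLs):

```shell
#!/bin/bash
# Hypothetical input: two parenthesized URLs on one line.
line='see (http://onsnetwork.org/kubu4/a/), then [text](http://onsnetwork.org/kubu4/b/) end'

# Greedy .* runs to the LAST "/)" on the line, so both URLs come back as one match.
greedy=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/.*/)')

# [^)]* cannot cross a closing parenthesis, so each URL is its own match.
each=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/[^)]*/)')

printf 'greedy:\n%s\n\nper-URL:\n%s\n' "$greedy" "$each"
```

The greedy variant prints a single line that spans both links, while the negated class prints one line per link.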

using grep in ubuntu

I am trying to search for a pattern in a file named test using grep on Ubuntu.
The following is the content of test:
./foldera/[hello]this.mp4
./foldera/folderb/[hello]that.mp4
./folderc/[these]hello.mp4
On a regex-testing website, I used the following pattern and it worked; all three lines matched.
.*\/[A-Za-z0-9\[\]]+\.mp4
But when I run the following command in a terminal on Ubuntu, it doesn't work; nothing is returned.
timothy@ubuntu:~$ cat ~/Desktop/test
./foldera/[hello]this.mp4
./foldera/folderb/[hello]that.mp4
./folderc/[these]hello.mp4
timothy@ubuntu:~$ cat ~/Desktop/test | grep -E '.*\/[A-Za-z0-9\[\]]+\.mp4'
timothy@ubuntu:~$
Why can't grep find any of the lines in the file?

grep's extended regular expressions don't use a backslash to escape square brackets inside a bracket expression. The proper way is to put ] as the first character inside the brackets; there it is treated as a literal character, because a bracket expression can't be empty.
grep -E '/[]A-Za-z0-9[]+\.mp4' test.txt
There's also no need for .* at the beginning. grep simply checks whether anything on the line matches the pattern, so adding a match for anything at the beginning or end is redundant (it's only necessary if you're using -o to print just the part of the line that matches).
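To see both behaviours side by side, here is a small sketch using the three sample paths from the question. The temp file and variable names are arbitrary, and the leading .*\/ of the original pattern is shortened to / since it does not affect whether a line matches.

```shell
#!/bin/bash
# Sample paths from the question, written to a temp file.
testfile=$(mktemp)
trap 'rm -f "$testfile"' EXIT
printf '%s\n' './foldera/[hello]this.mp4' \
              './foldera/folderb/[hello]that.mp4' \
              './folderc/[these]hello.mp4' > "$testfile"

# Backslash-escaped brackets: in POSIX syntax the backslash is just another
# member of the set and the set ends at the first "]", so the pattern then
# demands a literal "]" right before ".mp4". No sample line has that.
escaped=$(grep -cE '/[A-Za-z0-9\[\]]+\.mp4' "$testfile")

# "]" placed first is a literal member of the set; "[" needs no escaping.
fixed=$(grep -cE '/[]A-Za-z0-9[]+\.mp4' "$testfile")

echo "escaped pattern: $escaped matching lines"
echo "fixed pattern:   $fixed matching lines"
```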

Capture group from regex in bash

I have the following string /path/to/my-jar-1.0.jar for which I am trying to write a bash regex to pull out my-jar.
Now I believe the following regex would work: ([^\/]*?)-\d but I don't know how to get bash to run it.
The following: echo '/path/to/my-jar-1.0.jar' | grep -Po '([^\/]*?)-\d' captures my-jar-1

In BASH you can do:
s='/path/to/my-jar-1.0.jar'
[[ $s =~ .*/([^/[:digit:]]+)-[[:digit:]] ]] && echo "${BASH_REMATCH[1]}"
my-jar
Here "${BASH_REMATCH[1]}" prints capture group #1, which is the expression inside the first (...).

You can do this as well with shell prefix and suffix removal:
$ path=/path/to/my-jar-1.0.jar
# Remove the longest prefix ending with a slash
$ base="${path##*/}"
# Remove the longest suffix starting with a dash followed by a digit
$ base="${base%%-[0-9]*}"
$ echo "$base"
my-jar
Although it's a little annoying to have to do the transform in two steps, it has the advantage of using only POSIX features, so it will work with any compliant shell.
Note: The order is important, because the basename cannot contain a slash, but a path component could contain a dash. So you need to remove the path components first.
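To illustrate the note on ordering, here is a small sketch. The function names and the second path, whose directory contains a dash followed by a digit, are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: strip the directory first, then the "-<digit>..." suffix.
jar_name() {
    base="${1##*/}"                   # drop everything up to the last slash
    printf '%s\n' "${base%%-[0-9]*}"  # drop from the first "-<digit>" onward
}

# Wrong order, for comparison: the suffix pattern is applied to the whole path,
# so a "-<digit>" inside a directory name cuts the path too early.
jar_name_wrong_order() {
    stripped="${1%%-[0-9]*}"
    printf '%s\n' "${stripped##*/}"
}

jar_name /path/to/my-jar-1.0.jar                       # my-jar
jar_name /opt/app-2.1/lib/my-jar-1.0.jar               # my-jar
jar_name_wrong_order /opt/app-2.1/lib/my-jar-1.0.jar   # app
```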

grep -o doesn't recognize "capture groups" I think, just the entire match. That said, with Perl regexps (-P) you have the "lookahead" option to exclude the -\d from the match:
echo '/path/to/my-jar-1.0.jar' | grep -Po '[^/]*(?=-\d)'
Some reference material on lookahead/lookbehind:
http://www.perlmonks.org/?node_id=518444

Regex for uppercase matches with exclusions

I'm trying to come up with a regex for the following case: I need to find any matching paths using grep for the following paths:
Include all uppercase matching paths.
Example:
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Notice the capital B in Bar.
Exclude all uppercase matching paths that only contain SNAPSHOT and have no other uppercase letters.
Example:
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
Is this possible with grep?

Something like this might do:
grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$'
Breakdown:
-v inverts the match (shows all non-matching lines). -E enables extended regular expressions.
^ # Start of line
( )* # Capturing group repeated zero or more times
[^[:upper:]]* # Match all but uppercase zero or more times
(SNAPSHOT)? # Followed by literal SNAPSHOT zero or one time
$ # End of line
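A quick way to check the breakdown is to run the command on the two example paths from the question. The third, all-lowercase path and the temp file are assumptions added for illustration.

```shell
#!/bin/bash
# The two example paths from the question plus a hypothetical all-lowercase one.
paths=$(mktemp)
trap 'rm -f "$paths"' EXIT
printf '%s\n' 'com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar' \
              'com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar' \
              'com/foo/baz/1.2.3/baz-1.2.3.jar' > "$paths"

# A line made up only of non-uppercase runs and literal SNAPSHOTs matches the
# pattern and is dropped by -v; any other uppercase letter breaks the match,
# so that line is kept.
kept=$(grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$' "$paths")
printf '%s\n' "$kept"
```

Only the path containing Bar should be printed.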

Just use awk:
$ cat file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
With GNU awk for gensub():
$ awk 'gensub(/SNAPSHOT/,"","g")~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
With other awks:
$ awk '{r=$0; gsub(/SNAPSHOT/,"",r)} r~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar

Well, you need find to list all the paths. Then you can do it with two grep runs: the first keeps every path that contains a capital letter, and the second excludes those whose only capitals are in SNAPSHOT:
find . | grep '[A-Z]' | grep -v '.*\/[^A-Z]*SNAPSHOT[^A-Z]*$'
I think only the last grep needs some explanation:
grep -v excludes the matching lines
.*\/ greedily matches everything up to the last slash. There'll always be a slash because of find .
[^A-Z]* matches any run of characters that are not capital letters. We apply it before and after the SNAPSHOT literal, up to the end of the string.
Here you can play with it online.

If you only want the matching files, I'd do it like this:
find . -type f -regex '.*[A-Z].*' | while read -r line; do echo "$line" | sed 's/SNAPSHOT//g' | grep -q '.*[A-Z].*' && echo "$line"; done

Matching arbitrary number of digits using grep regex

I've got a file with lines that look like the following:
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.

Use + instead of *.
+ matches one or more of the preceding item.
* matches zero or more.
^[Dd]ata[0-9]+later$
In GNU grep's default (basic) syntax you need to escape the +. Also note that \d, the Perl-style shorthand for a single digit, is only recognized by grep with -P, so stick with [0-9] here.
^[Dd]ata[0-9]\+later$
In your example file you also have this line:
datafhj893724897290384later
It will not be matched yet, because there are letters in between data and the numbers. We can fix this by adding [^0-9]* to match anything after data up to the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*[0-9]\+later$' filename
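The effect of each step can be checked with a short sketch on the sample lines from the question. It uses -E and [0-9] so that it does not depend on escaped quantifiers or on \d support; the temp file and variable names are arbitrary.

```shell
#!/bin/bash
# Sample lines from the question.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' data datalater 983290842 Data387428later \
              datafhj893724897290384later 4329804928later > "$sample"

# * allows zero digits, so "datalater" slips through.
with_star=$(grep -E '^[Dd]ata[0-9]*later$' "$sample")

# + requires at least one digit.
with_plus=$(grep -E '^[Dd]ata[0-9]+later$' "$sample")

# Allowing non-digits before the digits also picks up the "fhj" line.
with_prefix=$(grep -E '^[Dd]ata[^0-9]*[0-9]+later$' "$sample")

printf '%s\n' '--- * ---' "$with_star" \
              '--- + ---' "$with_plus" \
              '--- [^0-9]* then + ---' "$with_prefix"
```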

Using Cygwin, the above commands didn't work. I had to modify the commands given above to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag. I didn't use the [^0-9]* whose necessity @Tom_Cammann aptly pointed out; instead, I used .*, which matches any run of characters and then backtracks so that the rest of the pattern can still match. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$

You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. Your character class [D,d] also allowed a comma at the beginning of the string (e.g. ",ata1234later"). And \d is a shorthand for any digit character (in grep it needs -P). So I changed those as well.

You should use "+" (which means one or more) instead of "*" (which means zero or more).

The unescaped "+" syntax only works with extended regexps, not with standard grep.
At least, that's my experience on RHEL.
To use extended-regexp, run egrep or pass "-E" / "--extended-regexp"
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH

grep -Eio "^(data)[0-9]+(later)$"
^[dD]ata=^d later$=r$

🎯 MOTIVATION
The rest of the answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
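In the command above, [:group:] is a placeholder for one of the listed names. As a sketch of how that might look for the question's sample lines, assuming digit is the group wanted (the temp file and variable names are arbitrary):

```shell
#!/bin/bash
# Sample lines from the question.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' data datalater 983290842 Data387428later \
              datafhj893724897290384later 4329804928later > "$sample"

# [[:digit:]]+ : one or more digits; [^[:digit:]]* : optional non-digit filler.
matches=$(grep --extended-regexp '^[Dd]ata[^[:digit:]]*[[:digit:]]+later$' "$sample")
printf '%s\n' "$matches"
```

This should print the Data387428later and datafhj893724897290384later lines and skip datalater.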
