How do I create a RegEx which has multiple criteria? - regex

I am working through a lab on RegEx which asks me to:
Search the 'countries' file for all the words with nine characters and
the letter i. How many results are found?
I am working in a generic Linux command prompt in an online emulated environment. I am allowed to use grep, awk or sed, though I have a preference for grep.
(I am 100% a noob when it comes to RegEx so please explain it to me like I'm 5)
Per a previous lab I already used something like below which finds me all countries which have 9 characters, however I cannot find a way to make it find all words which have 9 characters AND contain the letter i in any position.
grep -E '\b\w{9}\b' countries
The | operator does not help because it's an OR operator: it finds every instance of i as well as every word which is 9 characters long, whereas I need both conditions to hold at the same time. I also tried multiple grep statements, but it seems the emulator may not accept that.
I am also trying to stick to [] character sets as the next question asks for multiple letters within the 9 letter word.

One way of solving this problem is to use grep twice, and pipe one result to the next.
First, we find all words with length 9, like you did on the previous exercise:
grep -Eo '\b\w{9}\b' countries
I'm using the flag o that lists only the matching words, printing one word per line.
Next, we use Linux pipe (not regex OR) to feed the output of the first grep to a second grep:
grep -Eo '\b\w{9}\b' countries | grep 'i'
The final output will be all words with nine characters and i.
Depending on your requirements, this approach may be considered "cheating" if you're more focused on Regex, but a good solution if you're also learning Linux.
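A minimal sketch of the two-step pipeline, run against a made-up sample file. The lab's real countries file is not shown in the question, so the contents below are an assumption, as is the idea of counting with grep -c:

```shell
# Hypothetical sample data; the lab's real 'countries' file will differ.
tmp=$(mktemp)
printf '%s\n' 'Argentina' 'Australia' 'Guatemala' 'Indonesia' 'Peru' 'Costa Rica' > "$tmp"

# Step 1: print every nine-character word on its own line (-o).
# Step 2: keep only the words that contain a lowercase i.
matches=$(grep -Eo '\b\w{9}\b' "$tmp" | grep 'i')
printf '%s\n' "$matches"

# Count the results instead of listing them (-c on the second grep).
count=$(grep -Eo '\b\w{9}\b' "$tmp" | grep -c 'i')
echo "$count"
rm -f "$tmp"
```

The second grep is case-sensitive, so a word whose only i is a capital I would be skipped; adding -i to it would include such words.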
The fact that you are looking for words (as opposed to whole lines in the file) complicates the regex, but it is also possible to come up with a single regex to match these words.
\b(?=\w*i)\w{9}\b
This builds on the \b\w{9}\b you already have. (?=\w*i) is the AND condition. It is a lookahead, which is a PCRE feature, so with GNU grep it needs -P instead of -E. After we find the beginning of the word (\b), we look ahead for \w*i (zero or more word characters, and then our i). We're using \w* in the lookahead, not .*, so that we stay inside the same word; (?=.*i) would also have matched an i that comes after the nine characters.
After finding the i, we continue to make sure the word is only 9 letters.
Working example: https://regex101.com/r/G5EVdM/1
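The single-regex version can be tried from the command line as sketched below. This assumes a grep built with PCRE support (GNU grep's -P option), and the word list is made up for illustration:

```shell
# Lookaheads need a PCRE-capable grep (GNU grep -P); plain -E does not accept (?=...).
words='Argentina Australia Guatemala Indonesia Peru'   # hypothetical sample data

# Nine-character words containing an i.
with_i=$(printf '%s\n' "$words" | grep -Po '\b(?=\w*i)\w{9}\b')
printf '%s\n' "$with_i"

# Each additional lookahead adds one more AND condition: here, an i and a t.
with_i_and_t=$(printf '%s\n' "$words" | grep -Po '\b(?=\w*i)(?=\w*t)\w{9}\b')
printf '%s\n' "$with_i_and_t"
```

Stacking lookaheads is one way to handle a follow-up that asks for several letters at once; a character set such as [it] inside a single lookahead would mean "i or t", not "i and t".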

Related

grep with regex to match lines starting with exactly three lowercase letters

So normally I have a pretty good handle on using grep with regular expressions, but I'm hung up on this simple thing. I have the following text file named first-paragraph.txt (shown below) and I want to find all lines that start with exactly three lowercase characters '^[a-z]{3}', but I cannot seem to make it work.
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
Trying this gives me all lines starting with three or more lowercase characters, but I don't understand why I'm not getting only the one line that actually starts with exactly three characters, 'its ....'
$ grep '^[a-z]\{3\}' first-paragraph.txt
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
$ grep '^[a-z]\{3\}$' first-paragraph.txt # no lines
$ grep '^([a-z])\{3\}' first-paragraph.txt # no lines
You may use
grep '^[a-z]\{3\}\b' file # \b is a word boundary
See this grep demo.
Or, if your grep does not support the word boundary construct, you may use
grep -e '^[a-z]\{3\}$' -e '^[a-z]\{3\}[^[:alnum:]]' file
where the first pattern will match 3-letter only lines and the second one will only match those where the fourth char is a non-alphanumeric char.
See this grep demo.
You may replace [^[:alnum:]] with [^[:alpha:]] if you want any char other than a letter to count as the end of the three-letter word (i.e. if you want to get a match for cat123). Or, you may replace [^[:alnum:]] with [[:space:]] to only signal the end of a word when it is followed by whitespace.
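As a rough check of the portable two-pattern form, here is a sketch that uses three lines from the question's text, written to a temporary file (the temporary file is an assumption; the real first-paragraph.txt has more lines):

```shell
# A few lines from the question's first-paragraph.txt, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'it was the age of wisdom,' \
  'its noisiest authorities insisted on its being received, for good or for' \
  'evil, in the superlative degree of comparison only.' > "$tmp"

# Without a boundary, \{3\} only demands at least three leading lowercase letters.
at_least_three=$(grep -c '^[a-z]\{3\}' "$tmp")

# Portable form: three letters, then end of line or a non-alphanumeric character.
exactly_three=$(grep -e '^[a-z]\{3\}$' -e '^[a-z]\{3\}[^[:alnum:]]' "$tmp")

echo "$at_least_three"
printf '%s\n' "$exactly_three"
rm -f "$tmp"
```

The first count includes the line starting with evil because the regex is satisfied after three letters and does not care what follows; the second command rejects it because the fourth character is a letter.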

Finding word permutations in text

I am trying to find words in a text file that have the same characters but in a different order. For example, I input a word like "hyone" and I want to find a word with the same length and the same characters in the text file; in this case "honey" or "heony".
I have already tried using grep with regex but the code I used returns words that are the same length but don't have the same number of characters.
I used this command:
grep -E "^[hyone]{5}$" list.txt
This command returns words that are 5 characters long, but they include words that are not made with all of the characters, like "hoooo" or "yeehe".
Please note that the examples given are made up but they summarize the problem.
Not the best-looking regexp, but for your example it's working:
\b(?=.*h)(?=.*y)(?=.*o)(?=.*n)(?=.*e).{5}\b
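This one matches 5 signs (change . to \w for letters, digits and underscore only, or use [a-z] for lowercase ASCII) and uses one lookahead per letter to check that h, y, o, n, and e each appear somewhere after the starting word boundary. Because the lookaheads use .*, a letter found later on the same line, outside the 5 signs, also satisfies them. Lookaheads need a PCRE engine (grep -P rather than grep -E).
It might not work on other examples, though. And for use as a one-liner, building it for other sets of characters could be a bit tricky. So, regexps might not be the best solution for such problems. Levenshtein (as suggested by Thomas; maybe in addition to Soundex) could work a lot better, though they are a bit more complicated.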
You can test the given regexp online at: https://regex101.com/r/7Cdu03/3/
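A different technique, not the regex above: compare sorted-letter signatures. Two words are permutations of each other exactly when their letters, once sorted, are identical. The sketch below assumes one ASCII word per line in the list; signature and find_permutations are hypothetical helper names, and the sample list stands in for list.txt:

```shell
# Sorted-letter signature: split into one character per line, sort, rejoin.
signature() {
  printf '%s' "$1" | fold -w1 | LC_ALL=C sort | tr -d '\n'
}

# find_permutations WORD FILE: print every line of FILE whose letters are
# a permutation of WORD (hypothetical helper, ASCII words assumed).
find_permutations() {
  target=$(signature "$1")
  while IFS= read -r word; do
    [ "$(signature "$word")" = "$target" ] && printf '%s\n' "$word"
  done < "$2"
  return 0
}

list=$(mktemp)   # stand-in for list.txt
printf '%s\n' 'honey' 'hoooo' 'yeehe' 'heony' 'money' > "$list"
result=$(find_permutations 'hyone' "$list")
printf '%s\n' "$result"
rm -f "$list"
```

Unlike the lookahead regex, this also handles repeated letters correctly, since the signature keeps every occurrence.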

I'm struggling with Bash regular expressions

I'm frustrated trying to find out how to use regex to do anything useful. I'm completely uncertain on everything that I do, and I've resorted to trial and error; which has not been effective.
I'm trying to list files in the current directory that start with a letter, contain a number, end with a dot followed by a lowercase character, etc.
So I know starts with a letter would be:
^[a-zA-Z]
but I don't know how to follow that up with CONTAINS a number. I know ends with a dot can be [\.]*, but I'm not sure. I'm seeing that $ is also used to match strings at the end of the word.
I have no idea if I should be using find with regex to do this, or ls | grep .... I'm completely lost. Any direction would be appreciated.
I guess the specific question I was trying to ask was how do I glue the expressions together. For example, I tried ls | grep ^[a-zA-Z][0-9] but this only shows files that start with a letter followed immediately by a number. I don't know how to write a regex that starts with a letter and then adds the next requirement, i.e. contains a number.
Starts with a letter: ^[a-zA-Z]
Contains a number: .*[0-9].*
Ends with a dot and lowercase letter: \.[a-z]$
Together:
^[a-zA-Z].*[0-9].*\.[a-z]$
The best way to find files that match a regex is with find -regex. When you use that, the ^ and $ anchors are implied, so you can omit them. You'll need to tack on .*/ at the front to match any directory, since -regex matches against the whole path, directory and file name together.
find -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]'
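To see which names the find command picks up, here is a small sketch in a scratch directory. The file names are made up, and an explicit starting point (.) plus -maxdepth 1 and -type f are added as assumptions to keep the test contained:

```shell
# Scratch directory with hypothetical file names.
dir=$(mktemp -d)
touch "$dir/a1.c" "$dir/report2.x" "$dir/9abc1.c" "$dir/notes.txt" "$dir/b7.TXT"

# Run find from inside the directory so every path looks like ./name;
# -regex must match the whole path, hence the leading .*/
found=$(cd "$dir" && LC_ALL=C find . -maxdepth 1 -type f \
          -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]' | LC_ALL=C sort)
printf '%s\n' "$found"
rm -rf "$dir"
```

Because . also matches a slash, in a deeper tree a directory name can satisfy part of the pattern; writing [^/]* instead of .* after the leading .*/ keeps the match inside the file name.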
There's plenty of documentation online, eg. GNU's Reference Manual.
Your particular example, would require something like:
^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$
or if POSIX classes are not available:
^[a-zA-Z].*[0-9].*\.[a-z]$
You can read either as:
start of line
a letter (upper or lower case)
any character, zero or more times
a digit
any character, zero or more times
a dot (must be escaped with a backslash)
a lower case letter
end of line
Once you settle on a regular expression, you can use it with ls within the directory you wish to find the files in:
cd <dir>
ls -1 | grep '^[a-zA-Z].*[0-9].*\.[a-z]$'
NOTE: I tried to improve my answer based on some of the comments.
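A quick way to try the POSIX-class version without touching real files is to feed grep a list of names, the way ls -1 would print them. The names below are hypothetical:

```shell
# Hypothetical file names, one per line, as `ls -1` would print them.
names=$(printf '%s\n' 'a1.c' 'report2.x' '9abc1.c' 'notes.txt' 'b7.TXT')

# POSIX classes only work inside a bracket expression: [[:alpha:]], not [:alpha:].
picked=$(printf '%s\n' "$names" | grep '^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$')
printf '%s\n' "$picked"
```

Note that \.[[:lower:]]$ requires exactly one lowercase letter after the final dot, so notes.txt would not match even if it contained a digit.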

Finding words which match regex in multiple text files

So, I'm new to manipulating data from the command line, and also a beginner at regex.
I have multiple .txt files in multiple subdirectories. What I want to do is to find all words which have a certain number of consecutive consonants.
What I've tried so far is something like this:
find . | grep -orhn '[bdfghjklmnprstvxzþ]\{2\}' > ../words.txt
Which only prints out something like:
2:rt
2:gr
2:xl
3:gr
3:st
3:kk
I want to get the whole word, not just the two consecutive consonants (and the numbers and colon. I don't know where that comes from since it's not in the original data, but it really doesn't matter for what I am trying to do).
Do you have a tip?
The -n option is the line number in the text.
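My suggestion is to also match the word characters before and after the consonants.
Something like this should work; \w* means zero or more word characters, so words that begin or end with the two consonants are matched too.
grep -orh '\w*[bdfghjklmnprstvxzþ]\{2\}\w*'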
The -o option will only show what is matching, which is the entire word.
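The -r option makes grep search directories recursively. Note that find . | only pipes a list of file names into grep, not the file contents; with -r and no file operand, GNU grep searches the working directory by itself, so the find part can be dropped.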
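A sketch of the whole-word search on a scratch tree. The directory layout and file contents are made up, and þ is left out of the character class here only to keep the example ASCII:

```shell
# Scratch tree standing in for the .txt files in subdirectories (made-up contents).
dir=$(mktemp -d)
mkdir "$dir/sub"
printf '%s\n' 'a street of glass' > "$dir/one.txt"
printf '%s\n' 'start here' > "$dir/sub/two.txt"

# -r recurses into the directory operand, -h hides file names,
# -o prints only the matched text. \w* (zero or more) lets the consonant
# pair sit at the very start or end of the word.
found=$(grep -rho '\w*[bdfghjklmnprstvxz]\{2\}\w*' "$dir" | LC_ALL=C sort)
printf '%s\n' "$found"
rm -rf "$dir"
```

The output is sorted only because the order in which grep -r visits files is not guaranteed.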

Grep is messing up my understanding

For some time I have been playing with grep to retrieve data from files, and I noticed something funny.
It might be my ignorance, but here is what happens...
Suppose I have a file ABC. The data is:
a
abc
ab
bac
bb
ac
Now I ran this grep command:
grep a* ABC
I found the output to contain lines starting with a as well as lines starting with b. Why is this happening?
You used 'a*' as your search pattern... the '*' means ZERO or MORE of the previous character, so a line starting with 'b' matches too, having ZERO or more 'a's in it.
On a semi-related note, I'd recommend quoting the 'a*' bit, since if you have ANY files in the current subdirectory which start with a, you'll be VERY surprised to see what you're really searching for, since the shell (bash,zsh,csh,sh,dash,wtfsh...) will perform wildcard expansion automatically BEFORE the command is executed.
If you want to search for lines which START with 'a', then you'll need to anchor the search pattern with a leading ^ character, so your pattern becomes '^a*', but again, the * means ZERO or more, so it's not useful in this situation where you only have one letter... use '^a' instead.
As a contrived example, if you wanted to find all the lines containing a 'c' AND those containing the letters 'bc', then you could use 'b*c' as the search pattern... meaning ZERO or more b's, and a c.
The power of the regex search pattern is immense, and takes some time to grok. Peruse the man pages for grep(1), regex(7), pcre(3), pcresyntax(3), pcrepattern(3).
Once you get the hang of them, regexes are useful in sed, grep, perl, vim, (probably emacs too), ... uh, it's late (early?) nothing more comes to mind, but they're VERY powerful.
As a bonus, '*' means ZERO or more, '+' means ONE or more, and '?' means ZERO or ONE (with grep, '+' and '?' need the -E option).
So to search for things with two or more a's... 'aa+', which is 1 a, and 1+ a (1 or more).
I ramble.... (regex(7)!)
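The three quantifiers can be compared side by side on a small made-up sample; the words below are assumptions chosen only to show the difference:

```shell
sample=$(printf '%s\n' 'b' 'a' 'aa' 'baaab' 'color' 'colour')

# + and ? are extended-regex operators, so grep needs -E for them.
two_or_more=$(printf '%s\n' "$sample" | grep -E 'aa+')     # an a, then one or more a's
optional_u=$(printf '%s\n' "$sample" | grep -E 'colou?r')  # the u appears zero or one time
every_line=$(printf '%s\n' "$sample" | grep -c 'a*')       # zero a's is still a match

printf '%s\n' "$two_or_more" "$optional_u"
echo "$every_line"
```

The last count equals the number of lines in the sample, which is the same effect the question ran into with a*.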
grep tries to find that pattern anywhere in the line. Use ^a to get lines starting with a, or ^a*$ to find lines containing only a's (including the empty line).
Also, please quote that shell argument (e.g. '^a*$'); if you use a* and there is a file in the working directory starting with an a, you will get very weird results...
Try this, it works for me. The ^ means beginning of a line - so it has to start with a.
grep ^a ABC
You need to put quotes around your pattern:
grep "a*" ABC
Otherwise the * is interpreted by the shell (which does wild-card filename matching), instead of by grep itself.
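The effect of quoting can be reproduced in a scratch directory. The sketch below recreates the question's ABC file and adds a hypothetical file named apple.txt so that the unquoted a* has something to expand to:

```shell
# Scratch directory holding the question's ABC file plus a file starting with a.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'a' 'abc' 'ab' 'bac' 'bb' 'ac' > ABC
: > apple.txt   # hypothetical extra file

# Unquoted, the shell expands a* to matching file names before grep runs.
set -- a*
expanded=$*     # what grep would have received in place of the pattern

# Quoted, grep receives the pattern itself.
all_lines=$(grep -c 'a*' ABC)    # zero or more a's: every line matches
starts_with_a=$(grep '^a' ABC)   # anchored: only lines beginning with a

echo "$expanded"
echo "$all_lines"
printf '%s\n' "$starts_with_a"
cd / && rm -rf "$dir"
```

With the extra file present, an unquoted grep a* ABC would actually run grep apple.txt ABC, searching ABC for the literal-looking pattern apple.txt.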