Finding words which match regex in multiple text files

So, I'm new to manipulating data from the command line, and also a beginner at regex.
I have multiple .txt files in multiple subdirectories. What I want to do is to find all words which have a certain number of consecutive consonants.
What I've tried so far is something like this:
find . | grep -orhn '[bdfghjklmnprstvxzþ]\{2\}' > ../words.txt
Which only prints out something like:
2:rt
2:gr
2:xl
3:gr
3:st
3:kk
I want to get the whole word, not just the two consecutive consonants (or the numbers and colon; I don't know where those come from since they're not in the original data, but it really doesn't matter for what I am trying to do).
Do you have a tip?

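The numbers come from the -n option, which prefixes each match with its line number in the file. Drop it if you don't want them.
My suggestion is to also match the word characters before and after the consonants.
This is what I tried, and it seemed to work:
grep -orh '\w*[bdfghjklmnprstvxzþ]\{2\}\w*' .
The -o option prints only the matched text, which is now the entire word. Using \w* (zero or more) rather than \w\+ keeps words that begin or end with the consonant pair.
The -r option makes grep recurse on its own. With -r, GNU grep searches the directory tree rather than reading the list that find pipes in, so the find . | part is not needed.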
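For reference, here is a hedged end-to-end sketch of that idea on a throwaway directory. The demo directory, file names, and sample text are made up for illustration, and --include is a GNU grep option, so this assumes GNU grep. It uses \w* on both sides of the consonants so that words beginning or ending with the pair are reported too.

```shell
# Throwaway sample tree (hypothetical files and text, for illustration only).
demo=$(mktemp -d)
mkdir -p "$demo/sub/deeper"
printf 'start a tree\n' > "$demo/sub/one.txt"
printf 'see a bakka\n' > "$demo/sub/deeper/two.txt"

# -r: recurse, -h: no file names, -o: print only the matched text.
# --include limits the search to .txt files (GNU grep).
# \w* on both sides extends the match to the whole word.
words=$(grep -rho --include='*.txt' '\w*[bdfghjklmnprstvxzþ]\{2\}\w*' "$demo" | sort -u)
printf '%s\n' "$words"

rm -rf "$demo"
```

Piping through sort -u is optional; it just removes repeated words before you redirect the result to a file.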

How do I create a RegEx which has multiple criteria?

I am working through a lab on RegEx which asks me to:
Search the 'countries' file for all the words with nine characters and
the letter i. How many results are found?
I am working in a generic Linux command prompt in an online emulated environment. I am allowed to use grep, awk or sed, though I have a preference for grep.
(I am 100% a noob when it comes to RegEx so please explain it to me like I'm 5)
Per a previous lab I already used something like below which finds me all countries which have 9 characters, however I cannot find a way to make it find all words which have 9 characters AND contain the letter i in any position.
grep -E '\b\w{9}\b' countries
The | operator does not help because it's an OR operator: it finds every instance where i occurs as well as every word that is 9 characters long, and I need both conditions to hold at the same time. I tried multiple grep statements as well, and it seems the emulator may not accept that.
I am also trying to stick to [] character sets as the next question asks for multiple letters within the 9 letter word.

One way of solving this problem is to use grep twice, and pipe one result to the next.
First, we find all words with length 9, like you did on the previous exercise:
grep -Eo '\b\w{9}\b' countries
I'm using the -o flag, which lists only the matching words, printing one word per line.
Next, we use a shell pipe (not the regex OR) to feed the output of the first grep to a second grep:
grep -Eo '\b\w{9}\b' countries | grep 'i'
The final output will be all words with nine characters and i.
Depending on your requirements, this approach may be considered "cheating" if you're more focused on Regex, but a good solution if you're also learning Linux.
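Since the lab asks how many results there are, the same pipeline can end in grep -c, which counts matching lines instead of printing them. A minimal sketch, assuming GNU grep; the temporary file and its five names are made-up stand-ins for the real countries file:

```shell
# Hypothetical stand-in for the lab's 'countries' file.
countries=$(mktemp)
printf '%s\n' Argentina Guatemala Indonesia Mauritania France > "$countries"

# Step 1: -o prints every nine-character word on its own line.
# Step 2: -c counts the lines of that output that contain an "i".
count=$(grep -Eo '\b\w{9}\b' "$countries" | grep -c 'i')
echo "$count"

rm -f "$countries"
```

The second grep is case-sensitive, so it only looks for a lowercase i; add -i to it if a capital I should count as well.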
The fact that you are looking for words (as opposed to whole lines in the file) complicates the regex, but it is also possible to come up with a single regex to match these words.
\b(?=\w*i)\w{9}\b
This builds on \b\w{9}\b you already have. (?=\w*i) is the AND condition. After we find the beginning of the word (\b), we look ahead for \w*i (zero or more letters, and then our i). We're using \w* in the lookahead, not .*, so we are looking at the same word. (?=.*i) would have matched any i also after the nine characters.
After finding the i, we continue to make sure the word is only 9 letters.
Working example: https://regex101.com/r/G5EVdM/1
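To try the lookahead version on the command line, note that grep -E does not understand (?=...); GNU grep needs -P (Perl-compatible regex) for it, and that option is only present when grep was built with PCRE support. A hedged sketch, with a made-up temporary file standing in for countries:

```shell
# Hypothetical stand-in for the lab's 'countries' file.
countries=$(mktemp)
printf '%s\n' Argentina 'Guatemala City' Indonesia Mauritania Fiji > "$countries"

# -P: Perl-compatible regex (needed for the (?=...) lookahead).
# -o: print only the matched word, one per line.
matches=$(grep -oP '\b(?=\w*i)\w{9}\b' "$countries")
printf '%s\n' "$matches"

rm -f "$countries"
```

"Guatemala" is not reported even though "City" on the same line contains an i, because \w* in the lookahead cannot cross the space between the two words.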

Finding word permutations in text

I am trying to find words in a text file that are made of the same characters but in a different order. For example, I input a word like "hyone" and I want to find words in the text file with the same length and the same characters, in this case "honey" or "heony".
I have already tried using grep with regex but the code I used returns words that are the same length but don't have the same number of characters.
I used this command:
grep -E "^[hyone]{5}$" list.txt
This command returns words that are 5 characters long, but they include words that are not made from all of the characters, like "hoooo" or "yeehe".
Please note that the examples given are made up but they summarize the problem.

Not the best-looking regexp, but it works for your example:
\b(?=.*h)(?=.*y)(?=.*o)(?=.*n)(?=.*e).{5}\b
Each lookahead checks that one of the characters h, y, o, n, and e occurs somewhere ahead, and .{5} then matches 5 characters (change . to \w for word characters only, or use [a-z] for lowercase ASCII letters).
It might not work on other examples, though, and building such a one-liner for a different set of characters can be a bit tricky. So regexps might not be the best solution for this kind of problem. Levenshtein distance (as suggested by Thomas; maybe in addition to Soundex) could work a lot better, although those are a bit more complicated.
You can test the given regexp online at: https://regex101.com/r/7Cdu03/3/
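Building the lookaheads by hand for every new word is the tedious part, so here is a hedged sketch that generates them with sed and runs the result against a whole-line match with grep -P. It assumes GNU grep with PCRE support, one word per line in the list, and an input word made of plain letters (no regex metacharacters). The variable names and the sample list are made up for illustration.

```shell
# Hypothetical word list, one word per line.
list=$(mktemp)
printf '%s\n' honey hoooo yeehe heony money > "$list"

word=hyone
# Turn each letter into a lookahead: h -> (?=.*h), y -> (?=.*y), ...
lookaheads=$(printf '%s' "$word" | sed 's/./(?=.*&)/g')
# Anchor to the whole line and require exactly as many characters as the word has.
pattern="^${lookaheads}.{${#word}}\$"

matches=$(grep -P "$pattern" "$list")
printf '%s\n' "$matches"

rm -f "$list"
```

This only checks that every letter is present, so a word with repeated letters (for example "hello") would also accept candidates that use one l and two o's; counting letters needs a different approach.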

I'm struggling with Bash regular expressions

I'm frustrated trying to find out how to use regex to do anything useful. I'm completely uncertain on everything that I do, and I've resorted to trial and error; which has not been effective.
I'm trying to list files in the current directory that start with a letter, contain a number, end with a dot followed by a lowercase character, etc.
So I know starts with a letter would be:
^[a-zA-Z]
but I don't know how to follow that up with CONTAINS a number. I know ends with a dot can be [\.]*, but I'm not sure. I'm seeing that $ is also used to match strings at the end of the word.
I have no idea if I should be using find with regex to do this, or ls | grep .... I'm completely lost. Any direction would be appreciated.
I guess the specific question I was trying to ask was how do I glue the expressions together. For example, I tried ls | grep ^[a-zA-Z][0-9] but this only shows files that start with a letter, followed by a number. I don't know how to write a regex that starts with a letter, and then add the next requirement, i.e. contains a number.

Starts with a letter: ^[a-zA-Z]
Contains a number: .*[0-9].*
Ends with a dot and lowercase letter: \.[a-z]$
Together:
^[a-zA-Z].*[0-9].*\.[a-z]$
The best way to find files that match a regex is with find -regex. When you use it, the ^ and $ anchors are implied, so you can omit them. You'll need to tack on .*/ at the front to match any directory, since -regex matches against the whole path, not just the file name.
find -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]'
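Here is a hedged sketch of find -regex on a throwaway directory (the file names are made up). It makes two adjustments that are my own and not part of the command above: -maxdepth 1 keeps the search in the current directory, and [^/]* replaces .* so that a directory name cannot satisfy the "starts with a letter" or "contains a number" part on behalf of a file. Both -regex and -maxdepth are extensions found in GNU and BSD find.

```shell
# Throwaway directory with hypothetical file names.
demo=$(mktemp -d)
touch "$demo/a1.c" "$demo/report2024.x" "$demo/1abc.c" "$demo/notes.txt" "$demo/b2.C"

# The regex must match the whole path, which starts with "./" here.
# [^/]* means "any characters except a slash", so the match stays in the file name.
found=$(cd "$demo" && find . -maxdepth 1 -type f \
  -regex '\./[a-zA-Z][^/]*[0-9][^/]*\.[a-z]' | LC_ALL=C sort)
printf '%s\n' "$found"

rm -rf "$demo"
```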

There's plenty of documentation online, e.g. GNU's Reference Manual.
Your particular example would require something like:
^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$
or if POSIX classes are not available:
^[a-zA-Z].*[0-9].*\.[a-z]$
You can read either as:
start of line
a letter (upper or lower case)
any character, zero or more times
a digit
any character, zero or more times
a dot (must be escaped with a backslash)
a lower case letter
end of line
Once you settle on a regular expression, you can use it with ls within the directory you wish to find the files in:
cd <dir>
ls -1 | grep '^[a-zA-Z].*[0-9].*\.[a-z]$'
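On the "how do I glue the requirements together" part, a hedged sketch: the conditions can either be written as one regex, or each condition can be its own grep in a pipeline, which is sometimes easier to read while learning. The directory and file names below are made up, and the POSIX classes are written inside an extra pair of brackets, which is how grep expects them.

```shell
# Throwaway directory with hypothetical file names.
demo=$(mktemp -d)
touch "$demo/a1.c" "$demo/report2024.x" "$demo/1abc.c" "$demo/notes.txt" "$demo/b2.C"

# One regex: letter first, a digit somewhere after it, then ".<lowercase>" at the end.
combined=$(ls -1 "$demo" | grep '^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$')

# Same conditions as three separate filters, one requirement per grep.
chained=$(ls -1 "$demo" | grep '^[[:alpha:]]' | grep '[[:digit:]]' | grep '\.[[:lower:]]$')

printf '%s\n' "$combined"
rm -rf "$demo"
```

For these three conditions both forms select the same names, because the digit can only sit between the leading letter and the final dot.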
NOTE: I tried to improve my answer based on some of the comments.

Search and replace multiple words with notepad++ (the text uses pipe to separate words)

I have a file with thousands of lines and I need to find two terms that will always be on the same line, but there will be several lines of it in this file.
First, the file itself uses pipe | to separate data, like this:
|C485|01|2,50||0,0000|||0,00|1052|62103|
What I need to find is the lines that contain:
|C481|01| and |0,0000|
and replace the first term with:
|C481|04|
I found an answer to this question but when I chose to do the following, it did not work.
Using regular expression:
(|C481|01|)|(|0,0000|)
and
|C481|01|.*|0,0000|
I don't know much about regular expressions. How can I find the two terms when they contain | ?

| is a special character in regex, so you have to escape it as \|.
In your case, replacing \|C481\|01\|(.*\|0,0000\|) by |C481|04|$1 should suit your needs.
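Outside Notepad++, the same escaping idea can be tried with sed. This is a command-line sketch rather than the Notepad++ procedure: it assumes the prefix to change is the |C481|01| named in the question, uses sed -E (extended regex, where \| is a literal pipe), and writes \1 where Notepad++ uses $1. The two sample lines are made up from the format shown in the question.

```shell
# Hypothetical sample data: only the first line contains |0,0000|.
data=$(mktemp)
printf '%s\n' \
  '|C481|01|2,50||0,0000|||0,00|1052|62103|' \
  '|C481|01|2,50||1,0000|||0,00|1052|62103|' > "$data"

# \| matches a literal pipe; the group captures everything up to |0,0000|
# and \1 puts it back after the new prefix.
result=$(sed -E 's/\|C481\|01\|(.*\|0,0000\|)/|C481|04|\1/' "$data")
printf '%s\n' "$result"

rm -f "$data"
```

Only the line that contains both terms is rewritten; the second line is printed unchanged.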

sed issue - Extract specific words from file

I would like to get some help with SED.
I'm trying to extract some words from a file. All the words that I need start like this:
39;,bugs.pr~%3D~'TEXT23
I need to get TEXT23 for example.
What I did was, first, change 39;,bugs.pr~%3D~' to IDEX, which is my flag, then search for IDEX and extract 8 characters from that word.

The following sed command should strip the prefix and print only what remains, dropping every other line. The script is in double quotes because the prefix itself contains a single quote.
sed "s/^39;,bugs\.pr~%3D~'//p;d" file
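If the marker can appear in the middle of a line with other text after the word, a capture group helps. A hedged sketch, assuming the wanted word consists of letters and digits only; the sample input lines are made up:

```shell
# Hypothetical input: one line with the marker mid-line, one without it.
input=$(mktemp)
printf '%s\n' "id=39;,bugs.pr~%3D~'TEXT23&next=1" "no marker on this line" > "$input"

# -n with the p flag prints only lines where the substitution happened.
# \([[:alnum:]]*\) captures the run of letters/digits right after the marker,
# and \1 replaces the whole line with just that capture.
extracted=$(sed -n "s/.*39;,bugs\.pr~%3D~'\([[:alnum:]]*\).*/\1/p" "$input")
printf '%s\n' "$extracted"

rm -f "$input"
```

This avoids the intermediate IDEX flag and the fixed 8-character cut, since the capture stops at the first character that is not a letter or digit.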