Finding word permutations in text - regex

I am trying to find words in a text file that contain the same characters as a given word, but in a different order. For example, I input a word like "hyone" and I want to find the words in the text file that have the same length and the same characters, in this case "honey" or "heony".
I have already tried using grep with a regex, but the command I used returns words that have the same length but don't contain all of the same characters.
I used this command:
grep -E "^[hyone]{5}$" list.txt
This command returns words that are 5 characters long, but they include words that are not made from all of the characters, like "hoooo" or "yeehe".
Please note that the examples given are made up but they summarize the problem.

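It's not the best-looking regexp, but it works for your example:
\b(?=.*h)(?=.*y)(?=.*o)(?=.*n)(?=.*e).{5}\b
This one matches 5 characters (change . to \w for letters, digits and underscore only, or use [a-z] for lowercase ASCII letters) and uses lookaheads to check that h, y, o, n, and e each occur somewhere after the starting position. Because .* lets each lookahead scan to the end of the line, the check is only reliable when every word is on its own line. With grep this needs PCRE support (grep -P in GNU grep), since -E has no lookaheads.
It might not work on other examples, though, such as words with repeated letters. And as a one-liner, building the pattern for a different set of characters could be a bit tricky. So regexps might not be the best solution for such problems. Levenshtein distance (as suggested by Thomas; maybe in addition to Soundex) could work a lot better, although those are a bit more complicated.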
You can test the given regexp online at: https://regex101.com/r/7Cdu03/3/
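As a rough illustration of the non-regex route hinted at above, here is a minimal Python sketch. It swaps in a different technique from the answer's regex: comparing letter counts. The function names `find_anagrams` and `anagram_regex` are made up for this example. The second function builds a lookahead pattern in the spirit of the regex above, anchored to a whole word, and it assumes all input letters are distinct.

```python
import re
from collections import Counter


def find_anagrams(letters, words):
    """Return the words that use exactly the same letters as `letters`."""
    target = Counter(letters)
    # Two words are anagrams when their letter counts are identical.
    return [w for w in words if Counter(w) == target]


def anagram_regex(letters):
    """Build a lookahead regex for `letters` (assumes no repeated letters)."""
    # One lookahead per required letter, then exactly len(letters)
    # characters drawn from the allowed set, anchored to the whole string.
    lookaheads = "".join(f"(?=.*{re.escape(c)})" for c in letters)
    allowed = "[" + re.escape(letters) + "]"
    return re.compile(f"^{lookaheads}{allowed}{{{len(letters)}}}$")


if __name__ == "__main__":
    candidates = ["honey", "heony", "hoooo", "yeehe", "money"]
    print(find_anagrams("hyone", candidates))
    pattern = anagram_regex("hyone")
    print([w for w in candidates if pattern.match(w)])
```

The counting version also handles repeated letters, which the lookahead version does not.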

How do I create a RegEx which has multiple criteria?

I am working through a lab on RegEx which asks me to:
Search the 'countries' file for all the words with nine characters and the letter i. How many results are found?
I am working in a generic Linux command prompt in an online emulated environment. I am allowed to use grep, awk or sed, though I have a preference for grep.
(I am 100% a noob when it comes to RegEx so please explain it to me like I'm 5)
Per a previous lab I already used something like below which finds me all countries which have 9 characters, however I cannot find a way to make it find all words which have 9 characters AND contain the letter i in any position.
grep -E '\b\w{9}\b' countries
The | operator does not help because it's an OR operator: it finds every word that contains an i as well as every word that is 9 characters long, and I need both conditions to hold at the same time. I tried multiple grep statements as well, and it seems the emulator may not accept that.
I am also trying to stick to [] character sets as the next question asks for multiple letters within the 9 letter word.
One way of solving this problem is to use grep twice, and pipe one result to the next.
First, we find all words with length 9, like you did on the previous exercise:
grep -Eo '\b\w{9}\b' countries
I'm using the -o flag, which lists only the matching words, printing one word per line.
Next, we use a shell pipe (not the regex OR) to feed the output of the first grep to a second grep:
grep -Eo '\b\w{9}\b' countries | grep 'i'
The final output will be all words with nine characters and i.
Depending on your requirements, this approach may be considered "cheating" if you're more focused on Regex, but a good solution if you're also learning Linux.
The fact that you are looking for words (as opposed to whole lines in the file) complicates the regex, but it is still possible to come up with a single regex to match these words.
\b(?=\w*i)\w{9}\b
This builds on \b\w{9}\b you already have. (?=\w*i) is the AND condition. After we find the beginning of the word (\b), we look ahead for \w*i (zero or more letters, and then our i). We're using \w* in the lookahead, not .*, so we are looking at the same word. (?=.*i) would have matched any i also after the nine characters.
After finding the i, we continue to make sure the word is only 9 letters.
Working example: https://regex101.com/r/G5EVdM/1
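For comparison, both approaches from this answer can be sketched in Python, whose `re` module supports the same lookahead syntax. The function names and the sample text are invented for illustration; the matching is case-sensitive, so only a lowercase i counts.

```python
import re

# Same pattern as above: a whole word of nine word characters
# that contains an "i" somewhere inside that word.
NINE_WITH_I = re.compile(r"\b(?=\w*i)\w{9}\b")


def single_regex(text):
    """One pass: the lookahead acts as the AND condition."""
    return NINE_WITH_I.findall(text)


def two_steps(text):
    """Equivalent of piping one grep into another."""
    nine_letter_words = re.findall(r"\b\w{9}\b", text)
    return [word for word in nine_letter_words if "i" in word]


if __name__ == "__main__":
    sample = "Argentina Venezuela France Singapore Indonesia"
    print(single_regex(sample))
    print(two_steps(sample))
```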

Pattern matching software builds

I need to check whether an XML document contains some software build numbers.
Normally they look like ######, where:
The first two characters are digits
The third character is an uppercase letter
The last three characters are digits
Example: 10B329 or 11A465.
But there could be some exceptions, like 8L1 or 11B465a. (if there's another character after the sixth, it's always a letter in lowercase).
I think they're always with a minimum length of 3 characters and a maximum length of 7 characters.
So what could be the best pattern to match? I tried this but it doesn't work since it takes also words...
Dim BuildPattern As String = "<key>[0-9A-Z]*</key>"
Try this Regex: \d{2}\w\d{3}
You can see a live demo here.
You can add the <key></key> tags too: <key>(\d{2}\w\d{3})<\/key>. This way, your match will be in group 1 of the match. Changed demo.
Note that you should rather use an XML parser for this, as it's safer and more accurate than working with regex on XML files.
EDIT: I can't help you with the non-standard lengths though; my knowledge of regexes is still too limited. Perhaps you really should try an XML parser instead?
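One possible way to cover the non-standard lengths is sketched below in Python. The quantifiers are an assumption inferred only from the examples in the question (8L1, 10B329, 11B465a): one or two digits, one uppercase letter, one to three digits, and an optional lowercase letter, giving 3 to 7 characters in total. It uses [A-Z] rather than \w, since \w would also accept digits, lowercase letters and the underscore. The function names and the `<dict>` wrapper in the sample are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of a build number (inferred from the question's examples).
BUILD = r"\d{1,2}[A-Z]\d{1,3}[a-z]?"
KEY_RE = re.compile(rf"<key>({BUILD})</key>")
BUILD_RE = re.compile(BUILD)


def builds_by_regex(xml_text):
    """Pull build numbers straight out of the raw text."""
    return KEY_RE.findall(xml_text)


def builds_by_parser(xml_text):
    """Parse the XML first, then validate the text of each <key> element."""
    root = ET.fromstring(xml_text)
    return [
        element.text
        for element in root.iter("key")
        if element.text and BUILD_RE.fullmatch(element.text)
    ]


if __name__ == "__main__":
    sample = (
        "<dict><key>10B329</key><key>8L1</key><key>11B465a</key>"
        "<key>ProductName</key><key>123456</key></dict>"
    )
    print(builds_by_regex(sample))
    print(builds_by_parser(sample))
```

The parser variant follows the advice above: the regex only has to validate the text of each element, not find its way through the markup.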

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
I just discovered that the string
'+33. ... ... ..' can be found with the following regex:
'\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file contains only the following:
"Binary file Samsung GT-i9400 .xry matches"
and no actual results.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax changes, and you need to backslash the plus to match it literally. Without that option the backslash is superfluous, and actually wrong in some dialects, including GNU grep, where a backslashed plus selects the extended meaning (one or more repetitions). That is of course a syntax error at the beginning of a pattern, where there is no preceding expression to repeat, but GNU grep will just silently ignore it rather than report an error.
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension, which enables + to match one or more repetitions of the previous expression. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
The message means that grep found a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt
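The same idea (search the raw bytes and keep only the matching substrings, like grep -a -o) can be sketched in Python with a bytes pattern. This is only an illustration under assumptions: it uses the punctuation class [ ._-] described in the first answer, the +33 prefix from the question's template, and made-up function names.

```python
import re
from pathlib import Path

# Bytes pattern: literal +33, one or more digits, then three further
# digit groups, each optionally preceded by a space, dot, underscore or dash.
PHONE_RE = re.compile(rb"\+33[0-9]+(?:[ ._-]?[0-9]+){3}")


def extract_numbers(data):
    """Return every phone-number-like substring found in raw bytes."""
    return [match.group().decode("ascii") for match in PHONE_RE.finditer(data)]


def extract_from_file(path):
    """Read a (possibly binary) file and extract the matches from it."""
    return extract_numbers(Path(path).read_bytes())


if __name__ == "__main__":
    blob = b"\x00\x01abc+336 123 456 78\xff\xfe+33612345678\x00+31 20"
    for number in extract_numbers(blob):
        print(number)
```

Working on bytes avoids any decoding errors from the non-text parts of the file; only the matched substrings, which are pure ASCII by construction, are decoded.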

Finding words which match regex in multiple text files

So, I'm new to manipulating data from the command line, and also a beginner at regex.
I have multiple .txt files in multiple subdirectories. What I want to do is to find all words which have a certain number of consecutive consonants.
What I've tried so far is something like this:
find . | grep -orhn '[bdfghjklmnprstvxzþ]\{2\}' > ../words.txt
Which only prints out something like:
2:rt
2:gr
2:xl
3:gr
3:st
3:kk
I want to get the whole word, not just the two consecutive consonants (and the numbers and colon. I don't know where that comes from since it's not in the original data, but it really doesn't matter for what I am trying to do).
Do you have a tip?
The -n option prints the line number of each match; that is where the numbers and the colon come from.
My suggestion is to also match the word characters before and after the consonants.
This is what I tried, and it seemed to work:
grep -orh '\w*[bdfghjklmnprstvxzþ]\{2\}\w*'
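The -o option shows only the part of each line that matches, which is now the entire word.
The -r option makes grep search recursively. Note that piping find into grep does not make grep read those files: it would only receive the list of file names on standard input. With GNU grep, -r without a file operand searches the current directory, so the find is not needed here.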
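For readers who prefer a script, here is a hedged Python sketch of the same search. The consonant list is copied from the question as-is, and the matching is case-sensitive. The function names and the choice to scan only *.txt files as UTF-8 are assumptions for this example. It uses \w* on both sides so that words beginning or ending with the consonant run are also returned whole.

```python
import re
from pathlib import Path

# Consonant set exactly as given in the question.
CONSONANTS = "bdfghjklmnprstvxzþ"


def words_with_consonant_run(text, run_length=2):
    """Return whole words containing `run_length` consecutive consonants."""
    pattern = re.compile(rf"\w*[{CONSONANTS}]{{{run_length}}}\w*")
    return pattern.findall(text)


def scan_tree(root, run_length=2):
    """Collect matching words from every .txt file below `root`."""
    found = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        found.extend(words_with_consonant_run(text, run_length))
    return found


if __name__ == "__main__":
    print(words_with_consonant_run("stop a eye extra amma oak"))
    print(words_with_consonant_run("stop a eye extra amma oak", run_length=3))
```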

Regex : Find a number between space

I am trying to extract a six-digit zip code starting with the digit 4 from a string. Right now I am using [4][0-9]{5}, but it also matches inside longer numbers: for 020-25468811 it returns 468811. I don't want it to search in the middle of a number, only full numbers.
Try to use the following:
(?<!\d)4\d{5}(?!\d)
I.e. find a 6-digit number starting with 4 that is neither preceded nor followed by a digit.
Your expression currently matches a 4 followed by any five digits anywhere in the string, including in the middle of a longer number. To fix this behavior you should add word boundaries, as per Jon's suggestion.
\b[4][0-9]{5}\b
More on word boundaries here: http://www.regular-expressions.info/wordboundaries.html
You could simply add a space to the beginning of your regular expression: " 4[0-9]{5}". If you need a more universal way of finding the beginning of the number (could it also be preceded by a tab, a newline, etc.?) you should have a look at the predefined character class \s. Also have a look at boundary matchers. I don't know which language you are using, but regexes work very similarly in most languages. Check this Java regex documentation.
There is a start of line character in regex: ^
You could do:
^4[0-9]{5}
If the numbers are not always in the beginning of a line, you can more generally use:
\<4[0-9]{5}\>
To match only whole words.
Both examples work with egrep.
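To make the difference between the suggestions concrete, here is a small Python sketch comparing the original pattern, the lookaround version and the word-boundary version. The sample strings and the names are invented for illustration.

```python
import re

NAIVE = re.compile(r"4[0-9]{5}")                  # the original attempt
LOOKAROUND = re.compile(r"(?<!\d)4\d{5}(?!\d)")   # no digit on either side
WORD_BOUNDARY = re.compile(r"\b4[0-9]{5}\b")      # no word character on either side


def compare(text):
    """Return the matches of each pattern for the same input."""
    return {
        "naive": NAIVE.findall(text),
        "lookaround": LOOKAROUND.findall(text),
        "word_boundary": WORD_BOUNDARY.findall(text),
    }


if __name__ == "__main__":
    print(compare("Call 020-25468811 or write to Pune 411001."))
    # The two stricter patterns differ when a letter touches the number:
    print(compare("PIN411001"))
```

The lookaround version only forbids adjacent digits, so it still finds the code in "PIN411001", while the word-boundary version rejects it because a letter is also a word character.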