Find words with 3 consecutive consonants except specific combinations - regex

I have a large list of words and I want to select (filter) those words that have 3 or more consecutive consonants, except some specific combinations.
For example:
...
ikxzop
contribution
...
In that list I want to select the word ikxzop (it has kxz) but not contribution (it has ntr).
I was trying something like this:
\w*[^aeiou]{3,}\w*\n
But that also selects the word contribution, and I don't know how to exclude the ntr combination (and other common combinations such as mpl, bst or rpr).
Regards.

How about:
\w*(?!ntr)(?!bst)(?!mpl)(?!rpr)[b-df-hj-np-tv-z]{3,}\w*
This will match any word containing at least three consecutive consonants, provided the run does not start with ntr, bst, mpl or any other combination you exclude.
[b-df-hj-np-tv-z] denotes consonants; it is used instead of [^aeiou] because the latter also allows line terminators, symbols, etc.
(?!ntr) is a negative lookahead ensuring that the three consecutive consonants are not ntr.
Regex101 Demo
Matches ikxzop
Doesn't match contribution
Note that it will match a string such as ntrd even though it contains ntr, because there is an alternative run of 3 consecutive consonants, trd, which is acceptable.
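To check this behaviour outside Regex101, here is a minimal Python sketch. The word list and the helper name has_odd_cluster are made up for illustration; the pattern is the one above, and words are lowercased first because the character class only lists lowercase consonants.

```python
import re

# Pattern from the answer: a run of 3+ consonants that does not START with
# one of the excluded combinations.
PATTERN = re.compile(r"\w*(?!ntr)(?!bst)(?!mpl)(?!rpr)[b-df-hj-np-tv-z]{3,}\w*")


def has_odd_cluster(word):
    """Hypothetical helper: True if the whole word matches the pattern."""
    return PATTERN.fullmatch(word.lower()) is not None


words = ["ikxzop", "contribution", "simple", "ntrd"]
selected = [w for w in words if has_odd_cluster(w)]
print(selected)  # ['ikxzop', 'ntrd']
```

ntrd is selected for the reason given in the note: the lookaheads only reject a run at the position where it starts with an excluded combination, and trd starts one character later.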

Related

Regex: match at least N number of search terms but with patterns dependent on position

My question is similar to that in regex: Match at least two search terms, but with added complexity:
Given a set of M numerical strings of same length:
11001100
11101010
10010010
00101101
And given substring patterns of the type "11 at position 0" or "10 at position 6" (with the position being any multiple of 2), how can I search for strings matching at least N of these patterns?
For example: ^(11|\d{2}10|\d{6}10) matches all strings. However if I add {3,} to the regex to match "11101010" only (because it satisfies three out of three of those OR cases), it fails. Does anyone know how I can structure a regex like this?
If it matters, the patterns can also cover the same substring position, so for example it could be (11|\d{6}10|\d{6}00), and this ideally would match both the first and second lines in my example if I wanted to only catch strings with two or more matches.
Is this the expected result?
(\b(11\d{6}|10\d{6}|\d{6}01)\n?){3,}
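If a single regex is not a hard requirement, one alternative (a different technique from the regex above) is to test each positional pattern separately and count the hits. The sketch below is Python, and the function names count_hits and at_least are hypothetical; re.match anchors each pattern at the start of the string, which plays the role of ^ in the question.

```python
import re


def count_hits(s, patterns):
    """Count how many of the start-anchored patterns match the string s."""
    return sum(1 for p in patterns if re.match(p, s))


def at_least(strings, patterns, n):
    """Keep the strings that satisfy at least n of the patterns."""
    return [s for s in strings if count_hits(s, patterns) >= n]


strings = ["11001100", "11101010", "10010010", "00101101"]

# First example from the question: three patterns, all three required.
print(at_least(strings, [r"11", r"\d{2}10", r"\d{6}10"], 3))  # ['11101010']

# Second example: two patterns cover the same position, two hits required.
print(at_least(strings, [r"11", r"\d{6}10", r"\d{6}00"], 2))  # ['11001100', '11101010']
```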

Check ICD10 via regex

I need to check ICD-10 codes. A code is generated according to a few conditions:
The minimum length is 3.
The first character is a letter other than 'U'.
The second and third characters are digits.
The fourth character is a dot (.).
The fifth to eighth characters are letters or digits.
Ex.:
Right : "A18.32","A28.2","A04.0","A18.R252", "A18", "A18.52", "R18", "R18."
Wrong : "A184.32","U18","111."
Is this an ICD-10-CM code you are looking to verify?
If so, I believe that the third character can be alphabetic or numeric,
according to page 7 of
https://www.cms.gov/Medicare/Coding/ICD10/downloads/032310_ICD10_Slides.pdf
If so, the following regular expression should validate it:
^([a-tA-T]|[v-zV-Z])\d[a-zA-Z0-9](\.[a-zA-Z0-9]{1,4})?$
Otherwise you can edit the above regular expression to check that characters 2 and 3 are numeric:
^([a-tA-T]|[v-zV-Z])\d{2}(\.[a-zA-Z0-9]{1,4})?$
You could try something like so: ^[A-TV-Z]\d{2}(\.[A-Z\d]{0,4})?$. An example is available here.
This is how the answer satisfies your condition:
Min length is 3: ^[A-TV-Z]\d{2}...$ attempts to match a letter and 2 digits. The ^ and $ ensure that the string contains nothing else that does not satisfy the regular expression. The segment (\.[A-Z\d]{0,4}) is followed by the ? quantifier: (...)?. This means that the content within the round brackets may or may not be present.
First character is a letter and is not 'U': This is satisfied by [A-TV-Z], which matches all the upper case letters from A to T and from V to Z inclusive. This omits the letter U.
Second and third are digits: \d{2} means match two digits.
Fourth is dot (.): This is satisfied by \.. The extra \ is needed because the period is a special character in regular expressions, meaning "match any character" (except newlines, unless a special option is passed).
Fifth to eighth characters are letters or digits: [A-Z\d]{0,4} means any letter or digit, repeated between 0 and 4 times.
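As a quick check, the pattern can be run against the "Right" and "Wrong" examples from the question. This is a minimal Python sketch; the helper name is_valid is hypothetical.

```python
import re

# Pattern from the answer above.
ICD10 = re.compile(r"^[A-TV-Z]\d{2}(\.[A-Z\d]{0,4})?$")

right = ["A18.32", "A28.2", "A04.0", "A18.R252", "A18", "A18.52", "R18", "R18."]
wrong = ["A184.32", "U18", "111."]


def is_valid(code):
    """Hypothetical helper: True if the code has the expected shape."""
    return ICD10.match(code) is not None


print(all(is_valid(c) for c in right))  # True
print(any(is_valid(c) for c in wrong))  # False
```

"R18." only passes because the quantifier is {0,4}; with {1,4} a trailing dot would be rejected.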
Try this:
\b[a-tv-zA-TV-Z]\d{2}(\.[a-zA-Z0-9]{0,4})?\b
Judging by your examples, I assume the dot and everything after it is optional.
This regex will match a word boundary \b, a letter other than u or U [a-tv-zA-TV-Z], two digits \d{2}, then an optional dot followed by 0-4 letters or digits (\.[a-zA-Z0-9]{0,4})?, and a second word boundary \b.
This question is old, but I had the same issue of validating ICD-10 codes, so it seemed worth an updated answer.
As it turns out, there are two flavors of ICD-10 codes: ICD-10-CM and ICD-10-PCS. From their usage guidelines:
The ICD-10-CM is a morbidity classification published by the United
States for classifying diagnoses and reason for visits in all health
care settings.
and
The ICD-10-PCS is a procedure classification published by the United
States for classifying procedures performed in hospital inpatient
health care settings.
Both Sets
In both the ICD-10-CM and ICD-10-PCS coding systems, you can validate the structure of a code with a regular expression. Validating the content (which specific combinations of letters and numbers are valid), while technically possible, is practically infeasible with a regex. A lookup table would be a better bet.
ICD-10-CM
From the Conventions section of the guidelines:
Format and Structure:
The ICD-10-CM Tabular List contains categories, subcategories and
codes. Characters for categories, subcategories and codes may be
either a letter or a number. All categories are 3 characters. A
three-character category that has no further subdivision is equivalent
to a code. Subcategories are either 4 or 5 characters. Codes may be 3,
4, 5, 6 or 7 characters. That is, each level of subdivision after a
category is a subcategory. The final level of subdivision is a code.
Codes that have applicable 7th characters are still referred to as
codes, not subcategories. A code that has an applicable 7th character
is considered invalid without the 7th character.
According to this specification, you'd expect a valid regular expression to look like this:
^\w{3,7}$
However, a review of the actual values shows that, in all cases, the first character is an upper case letter, the second character is a digit, and any alphabetic characters in the remaining available positions are upper case as well. As such, you can use this information to more precisely specify what you're validating:
^[A-Z]\d[A-Z\d]{1,5}$
If you want to allow for a possible period in the fourth position followed by up to four more characters as specified by the OP:
^[A-Z]\d[A-Z\d](\.[A-Z\d]{0,4})?$
ICD-10-PCS
From the Conventions section of the guidelines:
One of 34 possible values can be assigned to each axis of
classification in the seven character code: they are the numbers 0
through 9 and the alphabet (except I and O because they are easily
confused with the numbers 1 and 0). The number of unique values used
in an axis of classification differs as needed...As with words in their
context, the meaning of any single value is a combination of its axis
of classification and any preceding values on which it may be
dependent...Within a PCS table, valid codes include all combinations
of choices in characters 4 through 7 contained in the same row of the
table. [For example], 0JHT3VZ is a valid code, and 0JHW3VZ is
not a valid code.
So to validate the structure of an ICD-10-PCS code:
^[A-HJ-NP-Z\d]{7}$
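The difference between validating structure and validating content can be sketched in Python. The structural patterns are the ones given above (anchoring is done with fullmatch instead of ^ and $). The set KNOWN_PCS_CODES is a placeholder standing in for a real lookup table, which would have to be loaded from the published code set, and the function names are hypothetical.

```python
import re

# Structural patterns from the answer; re.ASCII keeps \d to 0-9 only.
CM_STRUCTURE = re.compile(r"[A-Z]\d[A-Z\d]{1,5}", re.ASCII)
PCS_STRUCTURE = re.compile(r"[A-HJ-NP-Z\d]{7}", re.ASCII)

# Placeholder lookup table (assumption): only the one code the guidelines
# quote as valid. A real table would hold the full published code set.
KNOWN_PCS_CODES = {"0JHT3VZ"}


def cm_structure_ok(code):
    return CM_STRUCTURE.fullmatch(code) is not None


def pcs_structure_ok(code):
    return PCS_STRUCTURE.fullmatch(code) is not None


def pcs_code_ok(code):
    """Structure check first, then the content check against the table."""
    return pcs_structure_ok(code) and code in KNOWN_PCS_CODES


print(pcs_structure_ok("0JHT3VZ"), pcs_structure_ok("0JHW3VZ"))  # True True
print(pcs_code_ok("0JHT3VZ"), pcs_code_ok("0JHW3VZ"))            # True False
```

Both example codes from the guidelines pass the structural check, so only the lookup can tell them apart.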
Use this simple expression (note that the dot must be escaped as \., otherwise it matches any character):
'^([A-TV-Za-tv-z]{1}[0-9]{1}[A-Za-z0-9]{1}|[A-TV-Za-tv-z]{1}[0-9]{1}[A-Za-z0-9]{1}\.[A-Za-z0-9]{1,4})$'

How Can I Create a RegEx Pattern that will Get N Words Using Custom Word Boundary?

I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Let's call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
#maraca definitely answered my question as originally stated.
But what I actually need is to return at most nNumWordsToFind words.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} as follows, which also covers some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. ^[\s.,;:!?*_-]* : Match any number of delimiters before the first word
2. [^\s.,;:!?*_-]+ : Match a word (= at least one non-delimiter)
3. [\s.,;:!?*_-]+ : The word has to be followed by at least one delimiter
4. |$ : Or it can be at the end of the string (in case no delimiter follows at the end)
5. {<NumWordsOut>} : Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
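To also cover the case from EDIT #2, where the text has fewer words than requested, one option is to turn the fixed repeat count into a range, {1,N}. The sketch below is Python rather than Keyboard Maestro, so treat it only as an illustration of the pattern. The function name first_n_words is hypothetical, and re.escape is used so the hyphen in the delimiter set is safe anywhere in the character class.

```python
import re


def first_n_words(text, n, delimiters=".,;:!?-*_"):
    """Return the leading part of text holding at most n words.

    A word is any run of characters that are neither whitespace nor in
    `delimiters`. Returns "" if the text contains no word at all.
    """
    cls = r"\s" + re.escape(delimiters)  # body of the delimiter class
    pattern = rf"^[{cls}]*(?:[^{cls}]+(?:[{cls}]+|$)){{1,{n}}}"
    m = re.match(pattern, text)
    return m.group(0) if m else ""


sample = "one,two;three-four_five.six:seven eight nine! ten"
print(first_n_words(sample, 5))   # one,two;three-four_five.
print(first_n_words(sample, 20))  # the whole string, since it has only 10 words
```

Because the quantifier is greedy, the match still takes N words whenever that many are available and falls back to fewer only when the text runs out.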
Thanks to #maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using #maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx

How would I write a regular expression to parse a string with an optional component, a random structure, symbols and a lot of noise

I need to parse a string that is comprised of codes and symbols designed to represent the performance of a horse in a horse race. I have provided some samples below. The string is comprised of three components: Prefix, Score and Suffix. The score and suffix are always present; the prefix, however, may be absent depending on the particular circumstances and race conditions. The prefix and suffix are comprised of codes and symbols that represent things like racing surface, race conditions, equipment used, etc. There is a legend explaining all the codes. There are also some random characters mixed in that should not be extracted.
My objective is to extract the three components as well as the individual codes that may be present in the prefix and suffix.
1. 20- v[ 20Sr25A A UUU GGG
2. =19- V20Sr28 JJJ
3. 21+ VAWGP30
4. 16+ Yw16MT25
5. = 18 Vtf 75GP22 AAA
Here are explanations of the five examples above:
1. has no prefix, the score is a 20-, the suffix is v [ 20Sr25, nothing else is extracted
2. the prefix is = (turf race), the score is 19- and the suffix is V20Sr28, nothing else is extracted
3. no prefix, score is 21+, suffix is VAWGP30
4. no prefix, score 16+, suffix is Yw16MT25
5. prefix is =, score is 18, suffix is Vtf 75GP22
Here are some general rules for the components:
Prefix - Generally simply a collection of symbols. Most symbols are one character but some are two (examples separated by commas): =,.,F,..,:,G,^
Score - the score is highly structured and comprised of the following characters [0-9,+-"]
Suffix - the suffix also has some structure. It's generally comprised of two parts: some optional symbols followed by the far-right section. The far-right section follows one of two patterns: vvLLdd, where vv is the value of the race, LL is the location and dd is the day; or TTLLdd, where TT is the type of race, LL is the location and dd is the day.
My questions are:
1. How would I capture the three components, given the optional nature of the prefix? - 3 Capture Groups
2. Do I need to include every possible symbol from the legend in the brackets [ ]?
3. How would I turn a suffix like Vtf 75GP22 into six pieces of info: V, t, f, 75, GP, 22?
Any suggestions, guidance or code samples appreciated. - Thanks.
You can try with this pattern:
(?i)(?<prefix>\S*?)\h*(?<score>\d+[+-]?)\s*(?<suffix>.*?(?:[A-Z]{2}|\d{2})[A-Z]{2}\d{2})
online demo
pattern details:
(?i) # make the pattern case-insensitive
(?<prefix>\S*?) # use a lazy * quantifier to allow an empty prefix
\h* # zero or more horizontal spaces
(?<score>\d+[+-]?) #
\s* # optional spaces (can be replaced with \h too)
(?<suffix> # suffix
.*? # all until two letters or two digits
(?: # two letters or two digits
[A-Z]{2}
|
\d{2}
)
[A-Z]{2}\d{2} # two letters and two digits
)
As you can see, the approach is relatively general and does not use a comprehensive list of predefined symbols. However, if you know the exact list of possible prefixes, you can write the prefix group like this: (?<prefix>sub1|sub2|sub3...)??
To extract the content of the suffix part, you only need to extract the last 6 characters (then split them 2 by 2) and split the beginning with \s*. It's possible to do it in one regex, but that is neither handy nor efficient. (example)
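The two steps (capture the three components, then cut the suffix apart) might look like this in Python. This is a sketch under stated assumptions: Python's re module has no \h and writes named groups as (?P<name>...), so the pattern is adapted with [ \t] in place of \h; the function names parse_line and split_suffix are hypothetical; and split_suffix assumes every symbol before the last six characters is a single character, which the legend may not guarantee.

```python
import re

# Adapted from the pattern above: [ \t] replaces \h, (?P<...>) replaces (?<...>).
RACE_LINE = re.compile(
    r"(?P<prefix>\S*?)[ \t]*(?P<score>\d+[+-]?)\s*"
    r"(?P<suffix>.*?(?:[A-Z]{2}|\d{2})[A-Z]{2}\d{2})",
    re.IGNORECASE,
)


def parse_line(line):
    """Return a dict with prefix, score and suffix, or None if no match."""
    m = RACE_LINE.match(line)
    return m.groupdict() if m else None


def split_suffix(suffix):
    """Split e.g. 'Vtf 75GP22' into ['V', 't', 'f', '75', 'GP', '22']."""
    head, tail = suffix[:-6], suffix[-6:]
    # Assumption: each leading symbol is exactly one character.
    symbols = [ch for ch in head if not ch.isspace()]
    return symbols + [tail[0:2], tail[2:4], tail[4:6]]


parsed = parse_line("= 18 Vtf 75GP22 AAA")
print(parsed)                          # prefix '=', score '18', suffix 'Vtf 75GP22'
print(split_suffix(parsed["suffix"]))  # ['V', 't', 'f', '75', 'GP', '22']
```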
You won't be able to accomplish all you want with just one single regex and nothing else.
Basically, you need to write a rather short script in a language of your choice (scripting languages like Perl, Ruby, Python are likely the fastest option) which uses regex match tests, regex match value extraction, conditional structures and data structures.

Excel Sort by 2nd character in alphanumeric string

I have a column in an Excel spreadsheet that contains the following:
### - 3-digit number
#### - 4-digit number
A### - character with 3-digits
#A## - digit followed by character then 2 more digits
There may also be superfluous characters to the right of these strings.
I would like to sort the entire spreadsheet by this column in the following order (ascending or descending):
the first three types of strings alphabetically as expected (NOT ASCII-Betically!)
Then the #A## by the character first, then by the first digit.
Example:
000...999, 0000...9999, A000...Z999, 0A00...9A99, 0B00...9B99...9Z99
I feel there is a very simple solution using a regular expression or macro, but my VBA and RegExp are pretty rusty (a friend asked me for this but I'm more of a C guy these days). I have read some solutions which involve splitting the data into additional columns, which I would be fine with.
I would settle for a link to a good guide. Eternal thanks in advance.
If you want to sort by second character regardless of the content ahead and behind, then regex ^.(.) represents second character match...