A confusion about the porter stemming algorithm

I am trying to implement the Porter stemming algorithm, but I got stuck at this passage:
where the square brackets denote
arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m
times, this may again be written as
[C](VC){m}[V].
m will be called the \measure\ of any
word or word part when represented in
this form. The case m = 0 covers the
null word. Here are some examples:
m=0 TR, EE, TREE, Y, BY.
m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
I don't understand what this "measure" is or what it stands for.

It looks like the measure is the number of times a run of vowels is immediately followed by a run of consonants. For example,
"TROUBLES" has:
Optional initial consonants [C] = "TR".
First vowels-consonants group (VC) = "OUBL".
Second vowels-consonants group (VC) = "ES".
Optional ending vowels [V] is empty.
So the measure is two, the number of times (VC) was "matched".
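
A minimal Python sketch of this counting, assuming Porter's usual rule that Y counts as a vowel only when it follows a consonant. The helper names is_consonant and measure are made up for illustration:

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's definition."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start of the word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the (VC) groups in the form [C](VC){m}[V]."""
    # Label every letter as C or V, e.g. TROUBLES -> CCVVCCVC.
    labels = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    # Each vowel-to-consonant transition closes one (VC) group.
    return labels.count("VC")


for example in ["TREE", "TROUBLE", "TROUBLES"]:
    print(example, measure(example))
```

Counting the V-to-C transitions in the letter labels gives the same number as counting the (VC) groups, because each group ends exactly where a vowel run meets a consonant run.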

Related

Regex for validation of a street number

I'm using an online tool to create contests. In order to send prizes, there's a form in there asking for user information (first name, last name, address,... etc).
There's an option to use regular expressions to validate the data entered in this form.
I'm struggling with the regular expression to put for the street number (I'm located in Belgium).
A street number can be the following:
1234
1234a
1234a12
begins with a number (max 4 digits)
can have letters as well (max 2 char)
Can have numbers after the letter(s) (max3)
I came up with the following expression:
^([0-9]{1,4})([A-Za-z]{1,2})?([0-9]{1,3})?$
But the problem is that, as the letters and the second group of digits are both optional, it allows numbers with up to 7 digits to be entered, which is not what I want.
1234 (first group)(no letters in the second group) 567 (third group)
If one of you can tip me on how to achieve the expected result, it would be greatly appreciated !

You might use this regex:
^\d{1,4}([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|)$
where:
\d{1,4} - 1-4 digits
([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|) - optional group, which can be
[a-zA-Z]{1,2}\d{1,3} - 1-2 letters + 1-3 digits
or
[a-zA-Z]{1,2} - 1-2 letters
or
empty
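
To see the difference, here is a small Python sketch that runs the question's pattern and the pattern above against the same inputs. The variable and function names are only illustrative:

```python
import re

# Pattern from the question: letters and trailing digits are independently optional.
original = re.compile(r"^([0-9]{1,4})([A-Za-z]{1,2})?([0-9]{1,3})?$")
# Pattern from this answer: trailing digits are only allowed after letters.
fixed = re.compile(r"^\d{1,4}([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|)$")


def is_valid(pattern, text):
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


for text in ["1234", "1234a", "1234a12", "1234567"]:
    print(text, is_valid(original, text), is_valid(fixed, text))
```

The last input, 1234567, is accepted by the original pattern (1234 followed by 567) and rejected by the fixed one.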

\d{0,4}[a-zA-Z]{0,2}\d{0,3}
\d{0,4} The first group matches a number with at most 4 digits
[a-zA-Z]{0,2} The second group matches at most 2 letters
\d{0,3} The third group matches a number with at most 3 digits

You have to keep the last two groups together, not allowing the last one to be present, if the second isn't, e.g.
^\d{1,4}(?:[a-zA-Z]{1,2}\d{0,3})?$
or a little less optimized (but showing the approach a bit better)
^\d{1,4}(?:[a-zA-Z]{1,2}(?:\d{1,3})?)?$
As you are using this for a validation I assumed that you don't need the capturing groups and replaced them with non-capturing ones.
You might want to change the first number check to [1-9]\d{0,3} to disallow leading zeros.

Thank you so much for your answers ! I tried Sebastian's solution :
^\d{1,4}(?:[a-zA-Z]{1,2}\d{0,3})?$
And it works like a charm ! I still don't really understand what the ":" stands for, but I'll try to figure it out next time I have to fiddle with Regex !
Have a nice day,
Stan

The first digit cannot be 0.
There shouldn't be other symbols before and after the number.
So:
^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$
The ?: at the start of a group makes it non-capturing: the () still groups the pattern but does not create a captured substring.
Here is the regex with tests for it.
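
As a rough illustration in Python (the names STREET_NUMBER, CAPTURING and is_street_number are invented for this sketch), the non-capturing version yields no groups, while the same pattern with plain parentheses captures the suffix:

```python
import re

# Non-capturing group: (?: ... )
STREET_NUMBER = re.compile(r"^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$")
# Same pattern with an ordinary capturing group, for comparison only.
CAPTURING = re.compile(r"^[1-9]\d{0,3}([a-zA-Z]{1,2}\d{0,3})?$")


def is_street_number(text):
    """Return True if text is a street number in the described format."""
    return STREET_NUMBER.fullmatch(text) is not None


print(STREET_NUMBER.fullmatch("1234a12").groups())  # no captured groups
print(CAPTURING.fullmatch("1234a12").groups())      # the suffix is captured
print(is_street_number("0123"))                     # leading zero
```

Both patterns accept exactly the same strings; the only difference is what the match object exposes afterwards.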

Find words with 3 consecutive consonants except specific combinations

I have a large list of words and I want to select (filter) those words that have 3 or more consecutive consonants, except some specific combinations.
For example:
...
ikxzop
contribution
...
In that list I want to select the word ikxzop (it has kxz) but not contribution (it has ntr).
I was trying something like this:
\w*[^aeiou]{3,}\w*\n
But that also selects the word contribution, and I don't know how to omit the ntr combination (and other common combinations such as mpl, bst or rpr).
Regards.

How about:
\w*(?!ntr)(?!bst)(?!mpl)(?!rpr)[b-df-hj-np-tv-z]{3,}\w*
This will match any word containing at least three consecutive consonants, as long as that run does not start with ntr, bst, mpl or any other excluded combination.
[b-df-hj-np-tv-z] denotes consonants; it is used instead of [^aeiou] because the latter also allows line terminators, symbols, etc.
(?!ntr) is a negative lookahead which ensures that the three consecutive consonants are not ntr.
Regex101 Demo
Matches ikxzop
Doesn't match contribution
Note that it will match a string such as ntrd even though it contains ntr, because it also contains another run of 3 consecutive consonants, trd, which is acceptable.
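
A small Python sketch of the same filter, building the lookaheads from a list so that more excluded combinations can be added. The names EXCLUDED, PATTERN and has_unusual_cluster are illustrative only:

```python
import re

# Combinations that should not count as an unusual consonant cluster.
EXCLUDED = ["ntr", "bst", "mpl", "rpr"]
CONSONANT = "[b-df-hj-np-tv-z]"

# One negative lookahead per excluded combination, then 3+ consonants.
lookaheads = "".join(f"(?!{combo})" for combo in EXCLUDED)
PATTERN = re.compile(rf"\w*{lookaheads}{CONSONANT}{{3,}}\w*")


def has_unusual_cluster(word):
    """Return True if word has a 3+ consonant run not starting with an excluded combo."""
    return PATTERN.fullmatch(word) is not None


words = ["ikxzop", "contribution", "ntrd"]
selected = [w for w in words if has_unusual_cluster(w)]
print(selected)
```

The character class only covers lowercase letters, so this sketch assumes the word list is lowercase.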

Match list of incrementing integers using regex

Is it possible to match a list of comma-separated decimal integers, where the integers in the list always increment by one?
These should match:
0,1,2,3
8,9,10,11
1999,2000,2001
99,100,101
These should not match (in their entirety - the last two have matching subsequences):
42
3,2,1
1,2,4
10,11,13

Yes, this is possible when using a regex engine that supports backreferences and conditionals.
First, the list of consecutive numbers can be decomposed into a list where each adjacent pair of numbers is consecutive:
(?=(?&cons))\d+
(?:,(?=(?&cons))\d+)*
,\d+
Here (?=(?&cons)) is a placeholder for a predicate that ensures that two numbers are consecutive. This predicate might look as follows:
(?<cons>\b(?:
(?<x>\d*)
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?:9(?= 9*,\g{x}\d (?<y>\g{y}?+ 0)))*
,\g{x}
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
(?(y)\g{y})
# handle the 999 => 1000 case separately
| (?:9(?= 9*,1 (?<z>\g{z}?+ 0)))+
,1\g{z}
)\b)
For a brief explanation, the second case handling 999,1000 type pairs is easier to understand -- there is a very detailed description of how it works in this answer concerned with matching a^n b^n. The connection between the two is that in this case we need to match 9^n ,1 0^n.
The first case is slightly more complicated. The largest part of it handles the simple case of incrementing a decimal digit, which is relatively verbose due to the number of said digits:
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
The first block will capture whether the digit is N into group aN, and the second block then uses conditionals to check which of these groups was used. If group aN is non-empty, the next digit should be N+1.
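
The capture-then-conditional idea can be tried in isolation. The sketch below ports only this single-digit fragment to Python's re module, which supports (?(name)...) conditionals but not the (?&cons) subroutine calls used in the full regex. The names NEXT_DIGIT and is_next_digit are made up:

```python
import re

# First block: remember which digit 0-8 was seen, one named group per digit.
capture = "|".join(f"(?P<a{d}>{d})" for d in range(9))
# Second block: if group aN took part in the match, require the digit N+1.
check = "".join(f"(?(a{d}){d + 1})" for d in range(9))

NEXT_DIGIT = re.compile(f"(?:{capture}),{check}")


def is_next_digit(pair):
    """Return True for strings like '3,4' where the second digit is one larger."""
    return NEXT_DIGIT.fullmatch(pair) is not None


for pair in ["3,4", "8,9", "3,5", "9,0"]:
    print(pair, is_next_digit(pair))
```

A conditional whose group did not take part in the match has no "else" branch here, so it matches the empty string and only the one relevant conditional consumes a digit.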
The remainder of the first case handles cases like 1999,2000. This again falls into the pattern N 9^n, N+1 0^n, so this is a combination of the method for matching a^n b^n and incrementing a decimal digit. The simple case of 1,2 is handled as the limiting case where n=0.
Complete regex: https://regex101.com/r/zG4zV0/1
Alternatively the (?&cons) predicate can be implemented slightly more directly if recursive subpattern references are supported:
(?<cons>\b(?:
(?<x>\d*)
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?<y>
,\g{x}
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
| 9 (?&y) 0
)
# handle the 999 => 1000 case separately
| (?<z> 9,10 | 9(?&z)0 )
)\b)
In this case the two grammars 9^n ,1 0^n, n>=1 and prefix N 9^n , prefix N+1 0^n, n>=0 are pretty much just written out explicitly.
Complete alternative regex: https://regex101.com/r/zG4zV0/3
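
For cross-checking, the two grammars can also be written out as ordinary string handling. This Python sketch is not a port of the regex; it only mirrors the decomposition into "9^n , 1 0^n" and "prefix N 9^n , prefix N+1 0^n", and the function names are hypothetical:

```python
import re


def is_successor(a, b):
    """Return True if decimal string b is a + 1, using the two grammars."""
    # Case 9^n , 1 0^n with n >= 1 (for example 999 -> 1000).
    if set(a) == {"9"}:
        return b == "1" + "0" * len(a)
    # Case prefix N 9^n , prefix N+1 0^n with n >= 0 (for example 1999 -> 2000).
    m = re.fullmatch(r"([0-9]*)([0-8])(9*)", a)
    if m is None:
        return False
    prefix, digit, nines = m.groups()
    return b == prefix + str(int(digit) + 1) + "0" * len(nines)


def is_incrementing_list(text):
    """Return True if text is a comma-separated list of 2+ integers rising by one."""
    parts = text.split(",")
    if len(parts) < 2 or not all(re.fullmatch(r"[0-9]+", p) for p in parts):
        return False
    return all(is_successor(a, b) for a, b in zip(parts, parts[1:]))


for text in ["8,9,10,11", "1999,2000,2001", "1,2,4", "42"]:
    print(text, is_incrementing_list(text))
```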

Check ICD10 via regex

I need to validate an ICD-10 code. The code is generated under a few conditions:
The minimum length is 3.
The first character is a letter and is not 'U'.
The second and third characters are digits.
The fourth character is a dot (.).
The fifth to eighth characters are letters or digits.
Ex.:
Right : "A18.32","A28.2","A04.0","A18.R252", "A18", "A18.52", "R18", "R18."
Wrong : "A184.32","U18","111."

Is this an ICD-10-CM code you are looking to verify?
If so, I believe that the third character can be alphabetic or numeric,
taken from page 7 of
https://www.cms.gov/Medicare/Coding/ICD10/downloads/032310_ICD10_Slides.pdf
If so, the following regular expression should validate it.
^([a-tA-T]|[v-zV-Z])\d[a-zA-Z0-9](\.[a-zA-Z0-9]{1,4})?$
Otherwise, you can edit the above regular expression to check that characters 2 and 3 are numeric.
^([a-tA-T]|[v-zV-Z])\d{2}(\.[a-zA-Z0-9]{1,4})?$

You could try something like so: ^[A-TV-Z]\d{2}(\.[A-Z\d]{0,4})?$. An example is available here.
This is how the answer satisfies your condition:
Min length is 3: ^[A-TV-Z]\d{2}...$ attempts to match a letter and 2 digits. The ^ and $ ensure that there is nothing else in the string which does not satisfy the regular expression. The segment (\.[A-Z\d]{0,4}) is followed by the ? operator: (...)?. This means that the content within the round brackets may or may not be there.
First character is letter and not is 'U'. This is satisfied by [A-TV-Z], which matches all the upper case letters which are between A and T, V and Z inclusive. This omits the letter U.
Second and third is digit. \d{2} means match two digits.
Fourth is dot(.): This is satisfied by \.. The extra \ is needed because the period character is a special character in regular expressions, which means match any character (except newlines, unless a special option is passed along).
Fifth to eighth character is letter or digit. [A-Z\d]{0,4} means any letter or digit, repeated between 0 and 4 times.
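
A quick way to check the pattern against the examples from the question is a short Python sketch like the following (the names ICD10 and is_icd10 are only for illustration):

```python
import re

ICD10 = re.compile(r"^[A-TV-Z]\d{2}(\.[A-Z\d]{0,4})?$")

# Examples taken from the question.
right = ["A18.32", "A28.2", "A04.0", "A18.R252", "A18", "A18.52", "R18", "R18."]
wrong = ["A184.32", "U18", "111."]


def is_icd10(code):
    """Return True if code has the structure described in the question."""
    return ICD10.fullmatch(code) is not None


for code in right + wrong:
    print(code, is_icd10(code))
```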

Try this:
\b[a-tv-zA-TV-Z]\d{2}(\.[a-zA-Z0-9]{0,4})?\b
I assume by your example the dot and everything after it is optional.
This regex will match a word boundary \b, a letter other than u or U [a-tv-zA-TV-Z], two digits \d{2} and then an optional dot followed by 0-4 letters or digits (\.[a-zA-Z0-9]{0,4})? and a second word boundary \b

This question is old, but I had the same issue of validating ICD-10 codes, so it seemed worth an updated answer.
As it turns out, there are two flavors of ICD-10 codes: ICD-10-CM and ICD-10-PCS. From their usage guidelines:
The ICD-10-CM is a morbidity classification published by the United
States for classifying diagnoses and reason for visits in all health
care settings.
and
The ICD-10-PCS is a procedure classification published by the United
States for classifying procedures performed in hospital inpatient
health care settings.
Both Sets
In both the ICD-10-CM and ICD-10-PCS coding systems, you can validate the structure of a code with a regular expression. Validating the content (in terms of which specific combinations of letters and numbers are valid) may be technically possible, but is practically infeasible; a lookup table would be a better bet.
ICD-10-CM
From the Conventions section of the guidelines:
Format and Structure:
The ICD-10-CM Tabular List contains categories, subcategories and
codes. Characters for categories, subcategories and codes may be
either a letter or a number. All categories are 3 characters. A
three-character category that has no further subdivision is equivalent
to a code. Subcategories are either 4 or 5 characters. Codes may be 3,
4, 5, 6 or 7 characters. That is, each level of subdivision after a
category is a subcategory. The final level of subdivision is a code.
Codes that have applicable 7th characters are still referred to as
codes, not subcategories. A code that has an applicable 7th character
is considered invalid without the 7th character.
According to this specification, you'd expect a valid regular expression would look like this:
^\w{3,7}$
However, a review of the actual values shows that, in all cases, the first character is an upper case letter, the second character is a digit, and any alphabetic characters in the remaining available positions are upper case as well. As such, you can use this information to more precisely specify what you're validating:
^[A-Z]\d[A-Z\d]{1,5}$
If you want to allow for a possible period in the fourth position followed by up to four more characters as specified by the OP:
^[A-Z]\d[A-Z\d](\.[A-Z\d]{0,4})?$
ICD-10-PCS
From the Conventions section of the guidelines:
One of 34 possible values can be assigned to each axis of
classification in the seven character code: they are the numbers 0
through 9 and the alphabet (except I and O because they are easily
confused with the numbers 1 and 0). The number of unique values used
in an axis of classification differs as needed...As with words in their
context, the meaning of any single value is a combination of its axis
of classification and any preceding values on which it may be
dependent...Within a PCS table, valid codes include all combinations
of choices in characters 4 through 7 contained in the same row of the
table. [For example], 0JHT3VZ is a valid code, and 0JHW3VZ is
not a valid code.
So to validate the structure of an ICD-10-PCS code:
^[A-HJ-NP-Z\d]{7}$
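
To illustrate the difference between structure and content, here is a hedged Python sketch using the two structural patterns above. The function names are invented, and no lookup table is included:

```python
import re

ICD10_CM = re.compile(r"^[A-Z]\d[A-Z\d]{1,5}$")
ICD10_PCS = re.compile(r"^[A-HJ-NP-Z\d]{7}$")


def has_cm_structure(code):
    """Structural check only: says nothing about whether the code exists."""
    return ICD10_CM.fullmatch(code) is not None


def has_pcs_structure(code):
    """Structural check only: 7 characters, digits or letters except I and O."""
    return ICD10_PCS.fullmatch(code) is not None


# Both pass the structural check, although the guidelines quoted above
# say that only the first one is a valid code.
print(has_pcs_structure("0JHT3VZ"), has_pcs_structure("0JHW3VZ"))
# Rejected on structure alone: contains the letter I.
print(has_pcs_structure("0JHI3VZ"))
```

This is why a lookup table is still needed on top of the regex: 0JHW3VZ is well-formed but not a valid code.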

Use this simple expression:
'^([A-TV-Za-tv-z]{1}[0-9]{1}[A-Za-z0-9]{1}|[A-TV-Za-tv-z]{1}[0-9]{1}[A-Za-z0-9]{1}\.[A-Za-z0-9]{1,4})$'

Regex Verification of String in Correct Order with Delimiters in PHP

I'm trying to make a expression to verify that the string supplied is a valid format, but it seems that if I don't use regex in a few months, I forget everything I learned and have to relearn it.
My expression is supposed to match a format like this: 010L0404FFCCAANFFCC00M000000XXXXXX
The four delimiters (L, N, K, M), which aren't in the 0-9A-F hexadecimal range so that they are unambiguous, must appear in that order or not at all. Each delimiter can only occur once!
It breaks down to this:
Starts off with a 3-digit number, which is simply ^([0-9]{3}) and is always required
Second set begins with L, and must be 2 digits + 2 digits + 6 hexadecimal digits, and is optional
Third set begins with N and must be a 6-digit hexadecimal value, and is optional
The fourth set begins with K, is simply any number of digits, and is optional
The fifth set begins with M and can be any 6 hexadecimal digits or XXXXXX to indicate nothing; its length must be a nonzero multiple of 6, like 336699 (6) or 336699XXXXXXFFCC00 (18), and it is optional
The hardest part, which I can't figure out, is requiring the sets in that order and in multiples; for example, the L delimiter must always come before K if it's there (so that I don't get variations of the same string that mean the same thing with the delimiters swapped). I can already parse it; I just want to verify the string is in the correct format.
Any help would be appreciated, thanks.

Requiring the order isn't too bad. Just make each set optional. The regex will still match in order, so if the L section, for example, isn't there and the next character is N, it won't let L occur later since it won't match any of the rest of the regex.
I believe a direct translation of your requirements would be:
^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6}|X{6})+)?$
No real tricks, just making each group optional except for the first three digits, and adding an internal alternative for the two patterns of six digits in the M block.
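
The question is about PHP, but the pattern uses no PHP-specific syntax, so it can be sanity-checked with a short Python sketch. The names FORMAT and is_valid_format are illustrative:

```python
import re

FORMAT = re.compile(
    r"^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6}|X{6})+)?$"
)


def is_valid_format(text):
    """Return True if text follows the 3-digit + optional L/N/K/M sections format."""
    return FORMAT.fullmatch(text) is not None


samples = [
    "010L0404FFCCAANFFCC00M000000XXXXXX",  # example from the question
    "010K123",                             # only the K section
    "010NFFCC00L0404FFCCAA",               # L after N: wrong order
    "010M33669",                           # M section not a multiple of 6
]
for text in samples:
    print(text, is_valid_format(text))
```

Because every optional section appears exactly once in the pattern and in a fixed position, a delimiter that is out of order or repeated leaves unmatched text before the final $ and the whole match fails.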

^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6})+|MX{6})$
