Extracting Customer Unique IDs from Text - regex

I need to extract customer IDs which are unique alphanumeric character sequences from text. They can contain digits only or digits and alphabetic characters or only alphabetic characters. We can assume that they are longer than 5 characters. They might be capitalized or not.
I thought about using a dictionary: if a character sequence is not a dictionary word and is longer than 5 characters, it is a good candidate.
Any ideas or sample Java code would help. Thanks.

Here is a simple regular expression that will match alphanumeric sequences of 6 characters or more:
(?<![A-Za-z0-9])[A-Za-z0-9]{6,}
I used a negative lookbehind here instead of a word boundary (\b) in case there are underscores in your text: \b treats the underscore as a word character, so it sees no boundary between an underscore and the start of an ID. If your regex flavor doesn't have lookbehind, you'll want to use the word boundary instead (but you mentioned Java in your question, and Java does have lookbehind).
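To make the underscore point concrete, here is a minimal Java sketch. The sample string "id_123456" and the helper name firstMatch are made up for illustration; only the first pattern comes from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryVsLookbehind {
    // Hypothetical helper: returns the first match of regex in text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "id_123456"; // made-up sample input
        // Lookbehind version: '_' is not alphanumeric, so a match may start after it.
        System.out.println(firstMatch("(?<![A-Za-z0-9])[A-Za-z0-9]{6,}", text)); // 123456
        // Word-boundary version: '_' and '1' are both word characters, so no \b there.
        System.out.println(firstMatch("\\b[A-Za-z0-9]{6,}", text)); // null
    }
}
```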
If the customer ID must contain a number, then a regular expression to match these would look like this:
(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,}
See Regex101 demo.
Is there a limit to how long your customer IDs can be? If so, then putting that limit in would probably be helpful - any alphanumeric character sequence longer than that number obviously won't be a match. If the limit is 25 characters, for example, the regex would look like this:
(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])
(I added the lookahead at the end, otherwise this could simply match the first 25 characters of a long alphanumeric sequence!)
Once you have the matches extracted from your text, then you could do a dictionary lookup. I know there are questions and answers on StackOverflow on this subject.
To actually use this regex in Java, you would use the Pattern and Matcher classes. For example,
String mypattern = "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])";
Pattern tomatch = Pattern.compile(mypattern);
Etc. Hope this helps.
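Filling in the "Etc.", a complete extraction loop might look like the sketch below. The class name, method name, and sample sentence are assumptions for illustration; the pattern is the 6-to-25-character one from above, which requires at least one digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CustomerIdExtractor {
    // Pattern taken from the answer: 6-25 alphanumerics containing at least one digit.
    private static final Pattern ID = Pattern.compile(
        "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])");

    // Hypothetical helper: collects every candidate ID found in the text.
    static List<String> extract(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(text);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample text; "customer" and "Jonathan" contain no digit, so they are skipped.
        System.out.println(extract("Order for customer AB12345, ref 998877, name Jonathan."));
        // prints [AB12345, 998877]
    }
}
```

The resulting list is what you would then feed into the dictionary lookup.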
UPDATE
This just occurred to me: rather than trying a dictionary match, it might be better to store the extracted values in a database table and then compare them against your customers table.

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which discards from the reported match everything that has been matched by the regex up to that point, so only what follows is returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that the pattern \w matches only unaccented Latin letters (plus digits and the underscore), so if your documents contain text in a different script, those characters will be consumed by \W, which is supposed to match the separators. If you want to capture text in non-Latin scripts, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero-width group, such as a positive lookahead. A regex usually "consumes" the characters it matches, but with a positive lookahead or lookbehind you assert that characters exist without preventing later matches from using those characters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches once for each pair of adjacent words (note that it won't match the last word on its own, since it has no partner). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters, \W+, followed by another run of word characters, \w+. Because that part sits inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, group 1 holds the first word and group 2 holds the separators plus the second word.
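As an illustration outside RapidMiner, the lookahead pattern can be exercised from plain Java. This is only a sketch: the class and method names are made up, and it rebuilds each pair by concatenating group 1 with group 2, so the original separators are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // The pattern from the answer: a word, then a lookahead capturing separators + next word.
    private static final Pattern PAIR = Pattern.compile("(\\w+)(?=(\\W+\\w+))");

    // Hypothetical helper: returns overlapping two-word tokens.
    static List<String> pairs(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            // Only group 1 is consumed; group 2 is seen again as the start of the next match.
            out.add(m.group(1) + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("That was a good movie."));
        // prints [That was, was a, a good, good movie]
    }
}
```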
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by this SO question.
My question looks misframed once you see that this is really a problem of overlapping regex matches.

regex for hashtags

I have found a lot of regex examples to retrieve hashtags from a text. Unfortunately, none of examples is what I need.
This is almost what I need but...
function hashtags(text) {
return text.replace(/(^|\s)#(\w*[a-zA-Z]+\w{2,50})/g,
"$1<a href='/h/$2' target='_blank'>#$2</a>");
}
Hashtags cannot start with a number, to avoid situations where, for example, Section #12 gets hashtagged.
The example above checks that, but it does not allow characters like ÁÉÍÚ, it does not check the hashtag length correctly, and it does not allow the '-' character.
So, I need the following:
1. A hashtag may start with any letter (A, z, B, Ñ, ó, Ú, etc.), but not with a number and not with a special sign such as & % $ - _
2. The total length of a hashtag must be 3-50 characters. The regex must accept only full words as hashtags and must not cut them off after the first 50 characters. So words that start with # but contain more than 50 characters must be ignored, rather than having their first 50 characters converted into a hashtag link. In my example, {2,50} does not work correctly.
3. The rest of a hashtag (once it is established that it does not start with a number or a special sign) may contain numbers, any letters, and the _ and - signs. \w allows _ but not -
Is it possible?
For 1 - you need a character class, which you define with square brackets. PCRE defines \w, but that includes digits too.
For 2 - you can either require the word to be followed by whitespace (PCRE: \s) or use a lookaround pattern such as (?![A-Z0-9]), meaning 'not followed by this'.
And for 3 - non-whitespace may be what you want: \S in PCRE definitions.
/(?<!\w)#[A-Z]\S{1,49}(?!\w)/i
Demo
Edit: Given this may be JavaScript-specific, and you may not be able to use lookbehind, the above may not work for you. If you are tying your regex query to a particular language, it is useful to specify that constraint in the question.
Try this one:
/(^|\s)#([^\d&%$_-]\S{2,49})\b/g
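Explanation:
(^|\s) # start of the string or a whitespace character
#([^\d&%$_-] # the first character must not be one of the characters you mentioned
\S{2,49}) # 2 to 49 more characters, since the first character was already matched
\b # a word boundary, to avoid matching only the first 50 characters of a longer word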
Hope it helps.
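For comparison, here is a rough sketch of the three requirements in Java, whose regex engine supports \p{L}, \p{N}, and lookbehind. It is not a drop-in replacement for the JavaScript function in the question; the pattern, class name, and sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagFinder {
    // Assumed pattern: '#' at the start or after whitespace, one letter, then 2-49
    // letters, digits, '_' or '-', not followed by another such character.
    private static final Pattern TAG = Pattern.compile(
        "(?<!\\S)#(\\p{L}[\\p{L}\\p{N}_-]{2,49})(?![\\p{L}\\p{N}_-])");

    // Hypothetical helper: returns the hashtag bodies (without '#') found in the text.
    static List<String> find(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        // \u00d1 and \u00fa are the letters N-tilde and u-acute, escaped to stay encoding-safe.
        System.out.println(find("Section #12 and #\u00d1and\u00fa-2024 plus #ab"));
    }
}
```

With this sample text, only the accented tag is returned: #12 starts with a digit and #ab is shorter than 3 characters. A tag longer than 50 characters is skipped entirely rather than truncated, because the final lookahead fails at every shorter length.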

Regular expression start with specific letter

I am using ^[S-s][0-9]{4}$ to validate my string, but it is not working properly. My string has to be the letter S (upper-case or lower-case) followed by 4 digits, e.g. S1234. It looks like it works for letters above S, meaning if I enter w1234 it validates as correct, but if I enter a letter below s, like a1234, it doesn't validate. Thanks.
You need to get rid of the dash:
^[Ss][0-9]{4}$
Dashes within [...] denote character ranges. Thus S-s in a regex means "every character in the Unicode character table between S and s", and as those two are not adjacent, you end up matching a whole range of characters.
This does not directly answer the details of the question, but it is for those who arrive here from the title, looking for a regex that matches words beginning with a specific letter in text like:
This is a Zone
You should use this regex:
\bd[a-zA-Z]+
Replace [a-zA-Z] with whatever you expect the rest of the word to contain.
Take a look at this link
[S-s] means the range of all characters between capital S and lowercase s. Try ^[Ss][0-9]{4}$ instead. Or better yet, ^s\d{4}$ with a case-insensitivity modifier (/i in many languages).
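The effect of the range is easy to check. The sketch below (Java, with a made-up class name and sample inputs) compares the range version with the two-letter class; S is code point 0x53 and s is 0x73, so the range also covers T-Z, several punctuation characters, and a-r.

```java
import java.util.regex.Pattern;

class CharClassRange {
    static final Pattern RANGE = Pattern.compile("^[S-s][0-9]{4}$"); // range from 'S' to 's'
    static final Pattern FIXED = Pattern.compile("^[Ss][0-9]{4}$");  // only 'S' or 's'

    public static void main(String[] args) {
        // Made-up sample inputs.
        for (String s : new String[] {"S1234", "s1234", "a1234", "_1234"}) {
            System.out.println(s + " range=" + RANGE.matcher(s).matches()
                + " fixed=" + FIXED.matcher(s).matches());
        }
    }
}
```

Both patterns accept S1234 and s1234, but only the range version also accepts a1234 and _1234.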

Regular Expression for matching a phone number

I need a regular expression to match phone numbers. I just want to know if the number is probably a phone number and it could be any phone format, US or international. So I developed a strategy to determine if it matches.
I want it to accept the following characters: 0-9 as well as ,.()- and optionally start with a + (for international numbers). The string should not match if it has any other characters.
I tried this:
/\+?[0-9\/\.\(\)\-]/
But it matches phone numbers that have + in the middle of the number. And it matches numbers that contain alpha chars (I don't want that).
Lastly, I want to set the minimum length to 9 characters.
Any thoughts?
Thanks for any help, I'm obviously not too swift on RegEx stuff :)
Well, you're pretty close. Try this:
^\+?[0-9\/.()-]{9,}$
Without the start and end anchors you allow partial matching, so it can match +123 from the string :-)+123.
If you want a minimum of 9 digits, rather than any characters (so ---.../// isn't valid), you can use:
^\+?[\/.()-]*([0-9][\/.()-]*){9,}$
or, using a lookahead: before matching the string against [0-9/.()-]*, the regex engine checks for (\D*\d){9}, which is a sequence of 9 digits, each digit possibly preceded by other characters (which we will validate afterwards).
^\+?(?=(\D*\d){9})[0-9\/.()-]*$
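Here is a small Java sketch of the lookahead variant, to show how the two checks combine. The class name, method name, and sample numbers are invented for illustration.

```java
import java.util.regex.Pattern;

class PhoneCheck {
    // Lookahead variant from the answer: at least 9 digits overall, and only
    // digits and / . ( ) - after an optional leading '+'.
    private static final Pattern PHONE =
        Pattern.compile("^\\+?(?=(\\D*\\d){9})[0-9/.()-]*$");

    // Hypothetical helper: true if the whole input looks like a phone number.
    static boolean looksLikePhone(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("+1(555)123-4567")); // true
        System.out.println(looksLikePhone("---...///"));       // false: no digits
        System.out.println(looksLikePhone("12+3456789"));      // false: '+' in the middle
    }
}
```

The lookahead only counts digits; the character class after it is what rejects letters and a misplaced +.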
The reason why it matches alpha characters is because of the period. You have to escape it. I don't know what editor you are using for this; this is what I would use in Vim:
^+\?[()\-\.]\?\([0-9][\.()\-]\?\)\{3,\}$
jQuery has a plugin for US phone validation. Check this link. You can also see the regular expression in the source code.

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see it passes any alphanumeric ASCII string that contains at least one non-numeric ASCII character.
The key to this is the \d* at the front. Without it the regex gets much more awkward to do.
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumeric characters, but only if the lookahead group at the start also succeeds, i.e. the sequence contains a letter somewhere. It's a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
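The lookahead trick is not specific to ASP.NET. As a rough check, the sketch below runs an equivalent pattern in Java; the class name, method name, and inputs are made up, and the redundant trailing .* and the capture group are dropped.

```java
import java.util.regex.Pattern;

class AlnumWithLetter {
    // Lookahead requires a letter somewhere; the main part enforces
    // alphanumeric characters only and a length of 6 to 12.
    private static final Pattern RULE =
        Pattern.compile("^(?=.*[a-zA-Z])[a-zA-Z0-9]{6,12}$");

    // Hypothetical helper: true if the input satisfies both constraints.
    static boolean isValid(String input) {
        return RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc123"));  // true
        System.out.println(isValid("123456"));  // false: no letter
        System.out.println(isValid("a1"));      // false: too short
        System.out.println(isValid("abc 123")); // false: space is not alphanumeric
    }
}
```

Each further rule (for example "at least one digit") can be added as one more lookahead without touching the rest of the pattern.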
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
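To see where the Unicode and ASCII versions differ, here is a short sketch. It uses Java rather than .NET, purely as an illustration; Java's regex engine also understands \p{L} and \p{N}. The class name and sample strings are assumptions.

```java
import java.util.regex.Pattern;

class UnicodeAlnum {
    static final Pattern UNICODE =
        Pattern.compile("^[\\p{L}\\p{N}]*\\p{L}[\\p{L}\\p{N}]*$");
    static final Pattern ASCII =
        Pattern.compile("^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$");

    public static void main(String[] args) {
        // "caf\u00e91" is "cafe1" with an accented e (U+00E9), escaped to stay encoding-safe.
        String[] samples = {"abc1", "12345", "caf\u00e91"};
        for (String s : samples) {
            System.out.println(s + " unicode=" + UNICODE.matcher(s).matches()
                + " ascii=" + ASCII.matcher(s).matches());
        }
    }
}
```

Both versions accept abc1 and reject 12345; only the Unicode version accepts the accented string.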
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the string, so no other characters are allowed. You could replace the [0-9A-Za-z] block with \w (which would also allow the underscore), but I prefer the more verbose form because it's easier to extend with other characters if you want.
Add a regular expression validator to your asp.net page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, digits, and connector punctuation such as the underscore, which you might not want), that contains one letter (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude that punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
This can be:
any run of digits ending with a letter,
or an alphanumeric expression starting with a letter,
or an alphanumeric expression starting with a digit, followed by a letter and ending with an alphanumeric subexpression