Our email host has a back-end archiving platform. I asked them whether we could be alerted when certain formats appear in an email, and they asked me to submit a regex query string that they can put into their system. I know their main email system is Exchange; however, I am not sure what their back end is. I am assuming Exchange, and I apologize that I did not think to ask for that piece of information.
NEED TO LOOK FOR THE FOLLOWING DIGIT PATTERNS (which is for Social Security # Formats)
9 digit formats:
XXX-XX-XXXX
XXXXXXXXX
XX-XXXXXXX
NEED TO LOOK FOR THE FOLLOWING DIGIT PATTERNS (which is for Customer Acct # Formats)
8 digit formats:
xxxxxxxx
I have not tried anything because I am not sure how to test without submitting to my host. But this is what I came up with: two strings to accomplish the task for the 9-digit and 8-digit patterns.
Regex.Match([/d]+[/d]+\-?[/d]+\-?[/d]+[/d-]+\-?[/d]+[/d]+[/d]+[/d]);
Regex.Match([/d]+[/d]+[/d]+[/d]+[/d]+[/d]+[/d]+[/d]);
I would expect the output to be any email that matches the numeric sequence pattern
I would use the following regex pattern, which is an alternation of three terms:
\d{8}\d?|\d{3}-\d{2}-\d{4}|\d{2}-\d{7}
Note that a single digit in regex is represented by \d, with a single backslash, not a forward slash.
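Since the pattern cannot be tested on the host's system directly, it can be tried locally first. The following is a minimal sketch in Python, which is an assumption: the archiving platform's regex flavor is unknown, although `\d`, `{n}` and `|` behave the same way in most engines. The sample email body is made up for illustration.

```python
import re

# The three-way alternation from above: 8 or 9 plain digits,
# NNN-NN-NNNN, or NN-NNNNNNN.
pattern = re.compile(r"\d{8}\d?|\d{3}-\d{2}-\d{4}|\d{2}-\d{7}")

# Hypothetical email body containing each format once.
body = "SSN 111-22-3333, alt 111223333, tax 11-1234567, acct 12345678."

# With no capture groups, findall returns the full text of each match.
matches = pattern.findall(body)
print(matches)
```

Note that a longer run of digits would also match in part; if the host's engine supports word boundaries, wrapping the whole alternation as `\b(?:...)\b` is one way to narrow it.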
For simplicity, you might want to design an expression that is easy to change whenever you wish, maybe something similar to this:
([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7})|([0-9]{8})
It has four groups, one for each of your number formats:
([0-9]{3}-[0-9]{2}-[0-9]{4}) for SSN type 1
([0-9]{9}) for SSN type 2
([0-9]{2}-[0-9]{7}) for SSN type 3
([0-9]{8}) which is for customer account number
You can simply join these groups with an OR:
([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7})|([0-9]{8})
If you wish to have two separate expressions, you might want to join the first three:
([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7})
and your second expression would work as:
([0-9]{8})
You can add additional boundaries to it if you wish. For example, you can bound them with start and end anchors. Because ^ and $ bind more tightly than |, wrap the alternation in a non-capturing group so that the anchors apply to every alternative:
Both RegEx:
^(?:([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7})|([0-9]{8}))$
SSN RegEx:
^(?:([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7}))$
Customer Account RegEx:
^([0-9]{8})$
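One detail worth checking when anchoring an alternation: `^` and `$` have higher precedence than `|`, so without a surrounding group they attach only to the first and last alternatives. The sketch below compares the two forms in Python (chosen here only for illustration; the input line is hypothetical).

```python
import re

# Anchors attached directly to the alternation: ^ applies only to the first
# alternative and $ only to the last one.
loose = re.compile(r"^([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7})$")

# Alternation wrapped in a non-capturing group: both anchors apply to all of it.
strict = re.compile(r"^(?:([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7}))$")

line = "ref 123456789 attached"  # hypothetical input, not a bare SSN

loose_hit = loose.search(line) is not None    # middle alternative is unanchored
strict_hit = strict.search(line) is not None  # whole line must be one SSN format
print(loose_hit, strict_hit)
```

For scanning email bodies, the unanchored expression is probably the one to submit; the anchored form only suits input where each line holds exactly one number.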
JavaScript Testing
const regex = /([0-9]{3}-[0-9]{2}-[0-9]{4})|([0-9]{9})|([0-9]{2}-[0-9]{7})|([0-9]{8})/gm;
const str = `111-22-3333
111223333
11-1234567
12345678`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit: As I feared, and as pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion, which the Rust regex crate does not currently support. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
    // Index where the run of `group2` begins; everything before it is `group1` repeats.
    let group2_start = string.find(group2).unwrap();
    // The part before `group2_start` belongs to group1, the rest to group2.
    let group1_reps = group2_start / group1.len();
    let group2_reps = (string.len() - group2_start) / group2.len();
    group1_reps == group2_reps
}
fn main() {
    assert_eq!(matching_reps("abc123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
    assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is, however, a simple way to compute it:
Use a regexp to make sure your string has the form ^((abc)*)((123)*)$
If your string matches, take the two captured substrings and compare their lengths.
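The two steps above can be sketched as follows. This is an illustration in Python rather than Rust, and it uses `+` instead of `*` so that the empty string is rejected; both choices are assumptions, and the function name is made up.

```python
import re

# Step 1: check the overall shape and capture the two runs.
SHAPE = re.compile(r"^((?:abc)+)((?:123)+)$")

def same_repetitions(text):
    m = SHAPE.match(text)
    if m is None:
        return False
    # Step 2: compare how many times each unit was repeated.
    return len(m.group(1)) // len("abc") == len(m.group(2)) // len("123")

print(same_repetitions("abcabc123123"))
```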
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
(?:...)? matches an optional pattern. The innermost (?:)? is redundant, since it matches no text.
Rust snippet:
fn main() {
    let a = "abc"; // Prefix
    let b = "123"; // Suffix
    let level = 3; // Recursion (repetition) level
    let mut result = String::new();
    for _ in 0..level {
        // Wrap the pattern built so far in an optional group between a and b.
        result = format!("{}(?:{})?{}", a, result, b);
    }
    println!("{}", result);
    // abc(?:abc(?:abc(?:)?123)?123)?123
}
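To see the generated pattern in action, here is a small sketch of the same idea in Python. It is a variant, not a translation: it starts from the innermost pair so that the redundant empty group is never produced, and the helper name `nested_pattern` is hypothetical.

```python
import re

def nested_pattern(prefix, suffix, max_level):
    """Build a pattern matching 1..max_level balanced repetitions."""
    p, s = re.escape(prefix), re.escape(suffix)
    pattern = p + s  # innermost level: one prefix, one suffix
    for _ in range(max_level - 1):
        # Each extra level wraps the previous pattern in an optional group.
        pattern = f"{p}(?:{pattern})?{s}"
    return pattern

pattern = nested_pattern("abc", "123", 3)
balanced = re.compile(pattern)
print(pattern)
print(bool(balanced.fullmatch("abcabc123123")))
```

The pattern grows linearly with the maximum level, so this only suits inputs whose repetition count has a known upper bound.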
There's an extension to regexp libraries, dating back to early Unix, that lets you match again, literally, the text that a group has already matched: the backreference.
For example, say you have a number that must be equal to another (e.g. the score of a soccer game, where you are interested only in draws between the two teams). You can use the following regexp:
([0-9][0-9]*) - \1
Suppose we feed it "123 - 123": it will match. But "123 - 12" will not match, as \1 is not the same string as what was matched in the first group. Once the first group has matched, the engine treats \1 as the literal sequence of characters that the first group matched.
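A quick way to try the draw example is the sketch below, written in Python because its `re` module supports backreferences. Requiring the whole string to match via `fullmatch` is an assumption made for the test, and the function name is hypothetical.

```python
import re

# \1 must repeat exactly the text captured by the first group.
draw = re.compile(r"([0-9][0-9]*) - \1")

def is_draw(score):
    return draw.fullmatch(score) is not None

print(is_draw("123 - 123"), is_draw("123 - 12"))
```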
But there's a problem with your sample: a backreference repeats the exact text of a group, not the number of times the group repeated.
([0-9][0-9]*)\1
does match 123123 (a backtracking engine shortens the first group until the backreference fits), but there is no comparable way to make (123) repeat as many times as (abc) did.
But for example, this means that you can use:
\+\(([0-9][0-9]*)\)\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(The number between the parentheses is captured as a sample, and the backreference is then used to build a sequence of that number separated by dashes.)
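The phone-number example can be checked with a short sketch. It assumes the parentheses are meant literally and only the digits between them are captured, i.e. the pattern `\+\(([0-9][0-9]*)\)\1(-\1)*`; the third sample string is made up as a counterexample.

```python
import re

# Literal "+(", captured digits, literal ")", then the same digits again,
# optionally followed by more dash-separated copies.
phone = re.compile(r"\+\(([0-9][0-9]*)\)\1(-\1)*")

samples = ["+(358)358-358-358", "+(1)1-1-1-1-1-1-1", "+(358)358-359"]
results = {s: phone.fullmatch(s) is not None for s in samples}
print(results)
```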
I have a regex formula that I'm using to find specific patterns in my data. Specifically, it starts by looking for characters between "{}" brackets, and looks for "p. " and grabs the number after. I noticed that, in some instances, if there's not a "p. " value shortly after the brackets, it will continue to go through the next brackets and grab the number after.
For example, here is my sample data:
{Hello}, [1234] (Test). This is sample data used to answer a question {Hello2} [Ch.8 p. 87 gives more information about...
Here is my code:
\{(.*?)\}(.*?)p\. ([0-9]+)
I want it to return this only:
{Hello2} [Ch.8 p. 87
But it returns this:
{Hello}, [123:456] (Test). This is stample data used to answer a
question {Hello2} [Ch.8 p. 87
Is there a way to exclude strings that contain, let's say, "{"?
Your pattern first matches from { to }, and then the non-greedy .*? keeps expanding until it can match a p, a dot, a space and 1+ digits.
It can do that because the dot can also match { and }.
You could use a negated character class [^{}] so that { and } are not matched:
\{[^{}]*\}[^{}]+p\. [0-9]+
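As a rough check, the negated-class pattern can be run against the sample data from the question. The sketch uses Python, which is an assumption, since the question does not name a language.

```python
import re

pattern = re.compile(r"\{[^{}]*\}[^{}]+p\. [0-9]+")

text = ("{Hello}, [1234] (Test). This is sample data used to answer a "
        "question {Hello2} [Ch.8 p. 87 gives more information about...")

# [^{}]+ cannot cross the next "{", so a match starting at {Hello} fails.
m = pattern.search(text)
found = m.group(0) if m else None
print(found)
```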
Your expression seems to work; my guess is that you wish to capture only the desired output and not the rest, which can be done with a slight modification of your original expression:
(?:[\s\S]*)(\{(.*?)\}(.*?)p\. [0-9]+)
or this expression:
(?:[\s\S]*)(\{.*)
Test
const regex = /(?:[\s\S]*)(\{.*)/gm;
const str = `{Hello}, [123:456] (Test). This is stample data used to answer a
question {Hello2} [Ch.8 p. 87`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Here's how you do it in Java. The regex should be fairly universal.
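import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String test = "{Hello2} [Ch.8 p. 87 gives more information about..";
        String pat = "(\\{.*?\\}.*p.*?\\d+)";
        Matcher m = Pattern.compile(pat).matcher(test);
        if (m.find()) {
            System.out.println(m.group(1));
        }
    }
}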
More specific patterns can be provided if more is known about your data. For example, does each {} block of information start on a separate line? What does the data look like, and what do you want to ignore?
Based on your example text, you may be able to simplify your regex a bit and avoid matching a second open curly brace before you match the page number (unless you have some other purpose for the capture groups). For example:
\{[^{]*p\.\s\d+
\{ match an open curly brace (escaped, since some engines reject a bare {)
[^{]* match all following characters except for another open curly brace
p\.\s\d+ match "p" followed by a period, a whitespace character and one or more digits
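The breakdown above can be tried against the two-line output from the question. This is a sketch in Python (an assumption, as the question names no language), with the opening brace escaped.

```python
import re

pattern = re.compile(r"\{[^{]*p\.\s\d+")

text = """{Hello}, [123:456] (Test). This is stample data used to answer a
question {Hello2} [Ch.8 p. 87"""

# [^{]* also matches newlines, but it stops before the next "{",
# so only the second brace can reach the page number.
matches = pattern.findall(text)
print(matches)
```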
I have to parse file data into good and bad records. The data should be in the format
Patient_id::Patient_name (year of birth)::disease
The diseases are pipe separated and are selected from the following:
1.HIV
2.Cancer
3.Flu
4.Arthritis
5.OCD
Example: 23::Alex.jr (1969)::HIV|Cancer|flu
The regex I have written is
\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*
But it also accepts records with redundant entries:
24::Robin (1980)::HIV|Cancer|Cancer|HIV
How can I handle these kinds of records, and how can I write a better expression if the list of diseases is very large?
Note: I am using a Hadoop map-only job for parsing, so please answer in the context of Java.
What you might do is capture the last part with all the diseases in one group (a named capturing group, disease), then use split to get the individual ones, and then make the list unique.
^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$
For example:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
        String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(string);
        if (matcher.find()) {
            String[] parts = matcher.group("disease").split("\\|");
            Set<String> uniqueDiseases = new HashSet<String>(Arrays.asList(parts));
            System.out.println(uniqueDiseases);
        }
    }
}
Result:
[HIV, Cancer]
You need the negative lookahead.
Try using this regex:
^\d*::[^(]+?\s*\(\d{4}\)::(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$
Explanation:
The initial string ^\d*::[^(]+?\s*\(\d{4}\):: is just an optimized one to match the Alex.jr example (your version did not respect any non-alphabetic symbols in names).
The negative lookahead block (?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1) stands for "look ahead for any disease name that is encountered twice, and reject the string if one is found". Its distinctive feature is the (?! ... ) signature.
Finally, ((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$ is also an optimized version of your block (HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*, written to avoid listing the names twice.
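The effect of the lookahead can be checked against the two records from the question. The sketch below uses Python instead of Java purely for brevity, which is an assumption about what is convenient for testing; the regex itself is unchanged.

```python
import re

UNIQUE = re.compile(
    r"^\d*::[^(]+?\s*\(\d{4}\)::"
    # Reject the record if any disease name is followed later by "|" + itself.
    r"(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)"
    r"((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$"
)

good = UNIQUE.match("23::Alex.jr (1969)::HIV|Cancer|flu") is not None
bad = UNIQUE.match("24::Robin (1980)::HIV|Cancer|Cancer|HIV") is not None
print(good, bad)
```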
Probably the easiest method to maintain is to use a slightly changed regex, like below:
^\d*::[a-zA-Z.]+\s\(\d{4}\)::((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$
It contains:
^ and $ anchors (you want the entire string to match, not just part of it).
A capturing group, including a repeated non-capturing group (a container for alternatives). One of these alternatives is |, but with a negative lookahead for an immediately following | (this way you disallow 2 or more consecutive |).
Then, if this regex matches a particular row, you should:
Split group No 1 by |.
Check the resulting string array for uniqueness (it should not contain repeating entries).
Only if this check succeeds should you accept the row in question.
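The match, split and uniqueness steps above might look like the following sketch. It is written in Python rather than Java for brevity, and it adds one check that is not in the answer: every split item must be in an allowed set (the hypothetical `ALLOWED`), since the regex on its own also lets names run together without a separator.

```python
import re

ROW = re.compile(
    r"^\d*::[a-zA-Z.]+\s\(\d{4}\)::((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$"
)
ALLOWED = {"HIV", "Cancer", "flu", "Arthritis", "OCD"}  # assumed lookup set

def is_good_record(row):
    m = ROW.match(row)
    if m is None:
        return False
    diseases = m.group(1).split("|")
    # Extra check (an assumption): each item must be a known disease name.
    if not all(d in ALLOWED for d in diseases):
        return False
    # Accept the row only if no disease is listed twice.
    return len(set(diseases)) == len(diseases)

print(is_good_record("23::Alex.jr (1969)::HIV|Cancer|flu"))
```

Keeping the names in a set also helps when the list of diseases is large, because the uniqueness and membership checks no longer depend on the regex.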
I have 2 input files. One is a list of prefix and lengths, like this:
101xxx
102xxx
30xx
31xx
(where x is any number)
And another is a list of numbers.
I want to iterate through the second file, matching each number against any of the prefix/lengths. This is fairly easy. I build a list of regexps:
my @regexps = ('101...', '102...', '30..', '31..');
Then:
foreach my $regexp (@regexps) {
    if (/$regexp/) {
        # do something
    }
}
But, as you can guess, this is slow for a long list.
I could convert this to a single regexp:
my $super_regexp = '101...|102...|30..|31..';
...but, what I need is to know which regexp matched the item, and what the ..s matched.
I tried this:
my $catching_regexp = '(101)(...)|(102)(...)|(30)(..)|(31)(..)';
but then I don't know whether to look in $1, $3, $5 or $7.
Any ideas? How can I match against any of these prefix/lengths and know which prefix matched, and what the remaining digits were?
You can use the branch reset pattern, if your Perl is sufficiently recent (5.10 and newer):
my $regex = qr/^(?|(101)(...)|(102)(...)|(30)(..)|(31)(..))$/;
while (<>) {
    print "$1, $2\n" if /$regex/;
}
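For engines without branch reset (Python's built-in `re` module, for instance, does not have `(?|...)`), a similar effect can be sketched with named groups and the match object's `lastgroup`. Everything here is an illustration: the group names `p0`/`s0`, the `prefixes` table and the use of `\d` for "x" are assumptions.

```python
import re

# (prefix, number of trailing digits) pairs, as in the first input file.
prefixes = [("101", 3), ("102", 3), ("30", 2), ("31", 2)]

# One alternative per prefix, each with its own pair of named groups.
parts = [
    f"(?P<p{i}>{re.escape(p)})(?P<s{i}>\\d{{{n}}})"
    for i, (p, n) in enumerate(prefixes)
]
combined = re.compile("^(?:" + "|".join(parts) + ")$")

def split_number(number):
    m = combined.match(number)
    if m is None:
        return None
    # lastgroup names the last group that matched, e.g. "s2" for the third alternative.
    i = m.lastgroup[1:]
    return m.group("p" + i), m.group("s" + i)

print(split_number("101234"), split_number("3099"))
```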
Update:
I think I missed some of what you were going for. If different prefixes have different sub-expressions (... vs ..) and you want to capture/reference what the sub-expression matched...you can use a lookbehind:
((?<=101|102).{3}|(?<=30|31).{2})
This will capture everything, and if it is prefixed by 101|102 it will match 3 characters; if it is prefixed by 30|31, it will match 2 characters. We only use one capture group, so your xxx's will always be in $1.
And if you also want to capture the prefix, you can include a lazy capture group before the secondary grouping of lookbehinds:
(.*?)((?<=101|102).{3}|(?<=30|31).{2})
Your prefixes will be in group 1, and your suffixes in group 2.
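The lookbehind version can be tried as in the sketch below. Python is used as an assumption; its `re` module accepts these lookbehinds because all alternatives inside each one have the same width. Matching the whole string with `fullmatch` and the function name are also choices made for the example.

```python
import re

pattern = re.compile(r"(.*?)((?<=101|102).{3}|(?<=30|31).{2})")

def prefix_and_rest(number):
    # Group 1 lazily grows until one of the lookbehinds sees a known prefix.
    m = pattern.fullmatch(number)
    return (m.group(1), m.group(2)) if m else None

print(prefix_and_rest("101234"), prefix_and_rest("3099"))
```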
Use alternation within a group:
(101|102|30|31)...
This will create an extra captured group, though, so you can also use a "non-capturing" group:
(?:101|102|30|31)...
You can do as much logic as you want to with this mentality. It's similar to how you would need to group conditionals in any language:
if(a === true && (b === false || b === null)) {}
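The difference between the capturing and the non-capturing group from this answer shows up in what the match object exposes. A small sketch in Python (an assumption; the question itself uses Perl):

```python
import re

capturing = re.compile(r"(101|102|30|31)...")
non_capturing = re.compile(r"(?:101|102|30|31)...")

number = "101234"
cap_groups = capturing.match(number).groups()         # the prefix is captured
plain_groups = non_capturing.match(number).groups()   # nothing is captured
whole = non_capturing.match(number).group(0)          # full match is the same
print(cap_groups, plain_groups, whole)
```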
I have a list of email addresses which take various forms:
john@smith.com
Angie <angie@aol.com>
"Mark Jones" <mark@jones.com>
I'm trying to cut only the email portion from each. Ex: I only want the angie@aol.com from the second item in the list. In other words, I want to match everything between < and > or match everything if it doesn't exist.
I know this can be done in 2 steps:
Capture on (?<=\<)(.*)(?=\>).
If there is no match, use the entire text.
But now I'm wondering: Can both steps be reduced into one simple regular expression?
What about:
(?<=\<).*(?=\>)|^[^<]*$
^[^<]*$ will match the entire string, but only if it doesn't contain a <. And that's OR'ed (|) with what you had.
Explanation:
^ - start of string
[^<] - not-< character
[^<]* - zero or more not-< characters
$ - end of string
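Here is a short sketch of the combined expression applied to the three address forms from the question. Python is an assumption, and `<` and `>` are written unescaped, which is equivalent in this engine.

```python
import re

# Either the text between < and >, or the whole string when it has no "<".
pattern = re.compile(r"(?<=<).*(?=>)|^[^<]*$")

entries = [
    'john@smith.com',
    'Angie <angie@aol.com>',
    '"Mark Jones" <mark@jones.com>',
]
emails = [pattern.search(e).group(0) for e in entries]
print(emails)
```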
You're after an exclusive-or operator.
(\<.+\@.+\..+\>) matches only those email addresses inside <>...
(\<.+\@.+\..+\>)|(.+) matches everything, instead of matching the first condition in the OR and then skipping the second.
Depending on what language you are using to implement this regex, you might be able to use a built-in exclusive-or operator. Otherwise, you might need to put a bit of logic in there to use the string if no matches are found. E.g. (pseudo-code):
string = 'your data above';
if (regex_finds_match('(\<.+\@.+\..+\>)', string)) {
    // found match, use the match
    str_to_use = regex_match(es);
} else {
    // didn't find a match:
    str_to_use = string;
}
It is possible, but your current logic is probably simpler. Here is what I came up with; the email address will always be in the first capturing group:
^(?:.*<|)(.*?)(?:>|$)
Example: http://rubular.com/r/8tKHaYYY4T
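The single-pattern version can be checked the same way. The sketch below is in Python (an assumption; the linked example uses Ruby's flavor), and the helper name `extract` is hypothetical.

```python
import re

# Optional "anything up to <", then the lazily captured address, then ">" or end.
pattern = re.compile(r"^(?:.*<|)(.*?)(?:>|$)")

def extract(entry):
    return pattern.match(entry).group(1)

entries = [
    'john@smith.com',
    'Angie <angie@aol.com>',
    '"Mark Jones" <mark@jones.com>',
]
extracted = [extract(e) for e in entries]
print(extracted)
```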