Regex masking all phone numbers except a specific range

I'm not 100% sure this is possible, but I would like to rewrite the calling number (ANI) of any outbound call that does not fall within my DID range to a set phone number.
With our carrier in Australia, if the ANI is not from their supplied range, the call is blocked as part of new regulations.
What I am looking for is something like this. 
if not +61 2 XXXX XXXX - +61 2 XXXX  XXXX  then send as +612XXXX XXXX
I apologise: I have no real understanding of regex and do not even know where to begin.
I am starting to work on my knowledge of it, though, so please be kind. If anyone can point me to an "idiot's guide" link, I would be appreciative, as I am just getting into this.

Of course it's possible. It's just a matter of how much work you want to do. I'm not quite sure what you want to mask and what you want to pass on unmutilated. A couple of particular examples would help. How many different formats, countries, and so on do you need to support?
With these problems, I tend to follow this approach:
1. Normalize the data. Make all the numbers look the same, for example by removing all non-digits, so +61 2 XXXX XXXX turns into 612XXXXXXXX. In this step, you'd also fill in implicit information, such as the country code for a local number that omits it. Number::Phone may be interesting, but note that it was the largest distro on CPAN for a while.
2. Now it should be easier to recognize the number and its components (because if it isn't, you didn't do Step 1 right). Instead of a regex, you might use a parser. That is, get the country code, and then from that, decide what has to happen next. That's the sort of thing I have to do with ISBNs in Business::ISBN, which have a group code then a publisher code (both of which are variable length).
3. Once you can recognize the number, it's easy to select a range. If it's in the range, you know what to replace.
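As a rough illustration of the normalize-then-compare approach, here is a minimal Python sketch. The range boundaries and the fallback number are made up for the example (the question does not give the real ones), and the function names are hypothetical. It also assumes that a leading 0 means an Australian national-format number.

```python
import re

# Hypothetical values for illustration only; substitute the carrier-supplied range.
DID_RANGE_START = 61255500100
DID_RANGE_END = 61255500199
FALLBACK_ANI = "+61255500100"


def normalize(number: str, default_country_code: str = "61") -> str:
    """Strip every non-digit and fill in the country code for national-format numbers."""
    digits = re.sub(r"\D", "", number)
    if digits.startswith("0"):
        # Assumption: a leading 0 is the national trunk prefix, e.g. 02 5550 0150.
        digits = default_country_code + digits[1:]
    return digits


def outbound_ani(number: str) -> str:
    """Return the number unchanged (normalized) if it is in the DID range, else the fallback."""
    digits = normalize(number)
    if digits and DID_RANGE_START <= int(digits) <= DID_RANGE_END:
        return "+" + digits
    return FALLBACK_ANI
```

Once the number is reduced to digits, the range test is a plain integer comparison, so no regex is needed to express "between X and Y".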

Regular expression to allow only the valid time offset values

I am looking for a regular expression which matches only valid time offset values.
I have used:
^(?:[+-](?:2[0-3]|[01][0-9]):[0-5][0-9])$
The ONLY strings I need to match:
-12:00
+14:00
-11:00
-10:00
-09:30
-09:00
-08:00
-07:00
-06:00
-05:00
-04:00
-03:30
-03:00
-02:00
-01:00
00:00
+01:00
+04:00
+03:30
+03:00
+02:00
+04:30
+05:00
+05:30
+05:45
+06:00
+06:30
+07:00
+08:00
+08:45
+09:00
+09:30
+10:00
+10:30
+11:00
+12:00
+12:45
+13:00
+14:00
Please check here for what I have tried so far, and the values I want it to allow.
It works fine for all the values except 00:00.
Also, it allows some extra values, such as -19:30, +23:00, 22:30 and 21:00, which should not be allowed.
I want it to allow only those values which have been mentioned in my aforesaid link.
I was able to achieve the results you wanted by slightly tweaking your regex.
This is also short and concise.
^(?:(?:[+-](?:1[0-4]|0[1-9]):[0-5][0-9])|00:00)$
You can check the results and test it further here
One point to note is that this pattern also accepts values that fall within the current range of time zone offsets (-12:00 to +14:00) but are not in your list, such as +07:15. From the comments on the question, I feel it is better to have it this way, for future-proofing in case the offsets change. (You would need to tweak the regex to allow values beyond 14:00.)
If you strictly want to limit it to the values which you have listed, enumeration would be a better approach to go about it.
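To make the trade-off concrete, here is a small Python sketch that puts the range-based pattern above next to a plain enumeration of the offsets listed in the question. The function and constant names are just for illustration.

```python
import re

# The offsets listed in the question (duplicates removed).
ALLOWED_OFFSETS = frozenset("""
-12:00 -11:00 -10:00 -09:30 -09:00 -08:00 -07:00 -06:00 -05:00 -04:00
-03:30 -03:00 -02:00 -01:00 00:00 +01:00 +02:00 +03:00 +03:30 +04:00
+04:30 +05:00 +05:30 +05:45 +06:00 +06:30 +07:00 +08:00 +08:45 +09:00
+09:30 +10:00 +10:30 +11:00 +12:00 +12:45 +13:00 +14:00
""".split())

# The range-based pattern from above.
RANGE_PATTERN = re.compile(r"^(?:(?:[+-](?:1[0-4]|0[1-9]):[0-5][0-9])|00:00)$")


def is_listed_offset(value: str) -> bool:
    """Strict check: only the enumerated offsets pass."""
    return value in ALLOWED_OFFSETS


def matches_range(value: str) -> bool:
    """Loose check: any signed hh:mm with hours 01 to 14, or 00:00."""
    return RANGE_PATTERN.match(value) is not None
```

Every listed offset passes both checks, but a value such as +07:15 passes only the range pattern, which is the future-proofing behaviour described above.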
To only match these strings of yours, you can use
^(?:\+(?:0(?:[12]:00|[34]:[03]0|5:(?:[03]0|45)|6:[03]0|7:00|8:(?:00|45)|9:[03]0)|1(?:0:[03]0|1:00|2:(?:00|45)|[34]:00))|-(?:0(?:[12]:00|3:[03]0|[4-8]:00|9:[03]0)|1[0-2]:00)|00:00)$
See the regex demo
Use online/external tools to build word list regexps like this (e.g. My Regex Tester, etc.).

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers(max 6 digits)] [word(minimum one to max 8 words with
spaces in between and numbers and # allowed)], [words(minimum one to max four words, only letters)], [2
capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry of the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus, the addresses are all in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.
Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
See Demo
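For anyone who wants to try the pattern above outside Google Forms, here is a minimal Python sketch. The function name is hypothetical, and it assumes the whole input must match, so it uses fullmatch instead of adding ^ and $ to the pattern.

```python
import re

# The pattern from above, split over two lines for readability.
ADDRESS_PATTERN = re.compile(
    r"\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*"
    r"(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}"
)


def is_valid_format(address: str) -> bool:
    """Check the address format only; this says nothing about whether the address exists."""
    # fullmatch requires the entire string to match, not just a part of it.
    return ADDRESS_PATTERN.fullmatch(address) is not None
```

With this, "123 test st, test city, TT 12345" is accepted, while the same text without commas or without a city is rejected.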
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

Regex to capture transcript speaker names before colon

From text transcripts, I want to capture all names of speakers.
The target names start at the start of a line and should end at a ": " (ie. colon and space).
Optionally, for even finer control, it may be safe to assume the first colon and two spaces.
Example text:
Julian Z.: What's really exciting is the opportunity to be more intelligent about how you approach trying to reach your consumer. In a world where digital and the use of digital has exploded, to be able to have one-on-one conversations in the digital world, and to be able to eventually translate that into the TV space, whether that be addressable or data-driven, is really fantastic. Because at the end of the day, you want your brand, in our case, our networks, to be able to have a relationship with the consumer. Data is a proxy to allow for that to occur.
From an advertiser perspective, obviously now the ability to go to the broadcast networks and have a data-driven buy has absolutely blown up and proliferated. That's with us. That's with some of our competitors. Obviously, we think we're the best at it, but neither here nor there. I think it's a really wonderful foundational approach for advertisers to take. I think it's a great advancement in the market.
As a spender of money, and as somebody who is trying to get people to engage with our brands, the ability to use data to really have, again, these really one-on-one, unique conversations, and to be able to deliver creative content that's relevant for individual consumers, that's driven by what we know about the consumer, now, ultimately, where we can reach them effectively and in environments where we know they're engaged, is really a great, tremendous advancement. You'll see by our ratings numbers, which are on the upswing, that approach has really had a direct impact on what our linear ratings have resulted in.
Speaker 2: Great. Tell us a little bit about Viacom. It's a lot of fans, a lot of passion in people. How do you define the audience in broad strokes? How do they respond to advertising and what are some of the concerns that consumers have around ads?
Julian Z.: Well, I think, again, when you're talking about how we're reaching fans, it is using intelligence, and information, and data, not only to profile who our fans are, but ultimately where they're best reached. Our job is to deliver great, compelling content, which we believe we're really, really good at.
In order to do that, there's the linear side of the equation, but of course we want to make sure that we're reaching our fans in digital as well, and that there's a 360 kind of fan experience. We believe holistically that our fans are really the base of what we're trying to do. We're trying to please and create value for our fans. The more we engage with them, and the more we know about them, the better we're able to deliver customized content that fits their need.
Ultimately, as a content creator, what's more exciting than to delivery really great content to people that they really, really engage with and they build relationships with? That's all you can really hope for is, somebody that creates content, is to be able to develop compelling content and content that your audience really wants to engage with.
Speaker 2: When you look at targeting, is that a cross-platform? Where does that targeting happen?
Julian Z.: It absolutely is cross-platform. Of course, there is natural addressability in the digital market, because it is much more of a one-to-one. But now you see a lot of the MVPDs have obviously opened up addressable inventory. A lot of the MVPDs now have matured their addressable footprint, which allows you now to have a digital-like, not exactly the same obviously, but a digital-like experience in the linear space, to deliver content to the consumer or advertising to the consumer when it's relevant and when it's going to have the most impact for your message.
Ultimately, it's absolutely cross-platform because addressability is all about having that conversation, having that direct one-to-one with your audience. Our partners on the MVPD side have really matured over the last several years as of regard to addressable, and now you can have that 360 experience of having a conversation in linear and in digital that really is addressable.
Example strings to be captured are: Julian Z. and Speaker 2. Names will vary from text to text. I need all/multiple names present. As you see, names may include a mixture of alpha case, punctuation characters and numbers.
I will want to deduplicate names, which are repeated in the text, but believe I should shelve that for now, focusing this question on the capture.
I have tried plenty, for the last day or two.
eg. ^[^:]+\s* with /g comes close, but only captures the first, single Julian Z., whereas I want everything. For now, I am out of ideas and need to learn how to do this.
Regex to match any characters up until the first colon:
/^.*?(?=:)/gm
https://regex101.com/r/3uyXMM/3
^: match from beginning of line
.: match anything
*?: non-greedy search, so it stops at first colon (see next line)
(?=:): positive lookahead meaning next character should be colon but it doesn't capture
g: global flag, so matching does not stop after the first match; all matches are returned
m: multiline flag, so ^ matches at the start of every line, not just the start of the text
You can use this regex based on a negated character class:
/^\w[^:\n]*/mg
RegEx Demo 1
RegEx Demo 2
RegEx Breakup:
^\w: Match a word character at the start
[^:\n]*: Match zero or more of any character that is not a colon and not a newline.
Code:
var names = inputData.transcript.match(/^\w[^:\n]*/mg) || [];
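The same idea can be sketched in Python, including the deduplication the question sets aside for later. This is only a sketch under two assumptions: a speaker label is whatever sits between the start of a line and the first colon, and that colon is followed by whitespace. The function name is hypothetical.

```python
import re

# Start of line, then everything up to the first colon, which must be
# followed by whitespace. Lines without a colon do not match at all.
NAME_PATTERN = re.compile(r"^([^:\n]+):\s", re.MULTILINE)


def speaker_names(transcript: str) -> list[str]:
    """Return the distinct speaker labels in order of first appearance."""
    names = NAME_PATTERN.findall(transcript)
    # dict.fromkeys keeps insertion order, so this deduplicates without sorting.
    return list(dict.fromkeys(names))


sample = (
    "Julian Z.: What's really exciting is the opportunity.\n"
    "From an advertiser perspective, obviously now the ability has blown up.\n"
    "Speaker 2: Great. Tell us a little bit about Viacom.\n"
    "Julian Z.: Well, I think, again, it is using intelligence.\n"
)
print(speaker_names(sample))  # ['Julian Z.', 'Speaker 2']
```

One limitation to keep in mind: a continuation paragraph that happens to contain a colon followed by a space would also produce a match, so the result may still need a sanity check, such as a maximum label length.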

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with regular expressions. But now that it is being migrated to Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 and 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]' + REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly, with a list of 1000 numbers or years, it will be much easier (and probably more efficient) to check whether the numeric value is in a table than to convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
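As a rough sketch of what validating before the data reaches the database could look like, here is a Python version of the credit card and US phone format checks above. Python is only an example of an application-side language, and the function names are hypothetical. Like the LIKE patterns, this checks the general format only, not whether the card number itself is valid.

```python
import re


def strip_separators(value: str) -> str:
    """Remove spaces, dashes and parentheses before checking the format."""
    return re.sub(r"[\s()-]", "", value)


# 15 digits starting with 3, or 16 digits starting with 4 to 7.
CARD_PATTERN = re.compile(r"3[0-9]{14}|[4-7][0-9]{15}")
# 10 digits where the area code does not start with 0 or 1.
US_PHONE_PATTERN = re.compile(r"[2-9][0-9]{9}")


def looks_like_card_number(value: str) -> bool:
    return CARD_PATTERN.fullmatch(strip_separators(value)) is not None


def looks_like_us_phone(value: str) -> bool:
    return US_PHONE_PATTERN.fullmatch(strip_separators(value)) is not None
```

Values that pass can then be stored in their stripped form, which keeps any later T-SQL checks as simple as the LIKE expressions shown above.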
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or make complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.

Regexp to parse out a person's name?

This might be a hard one (if not impossible), but can anyone think of a regular expression that will find a person's name, in say, a resume? I know this won't be 100% accurate, but I can't come up with something.
Let's assume the name only shows up once in the document.
No, you can't use regular expressions for this. The only chance you have is if the document is always in the same format and you can find the name based on the context surrounding it. But this probably isn't the case for you.
If you are asking your applicants to submit their résumé online you could provide a separate field for them to enter their name and any other information you need instead of trying to automatically parse résumés.
Forget it - seriously.
Or expect to get a lot of applications from a Mr C Vitae
In my experience, having written something very similar (but a very long time ago), about 95% of resumes have the person's name as the very first line. You could probably have a pretty loose regex checking for alpha, hyphens, periods, and assume that's the name.
Obviously there's no way to do this 100% accurately, as you said, but this would be close.
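A minimal Python sketch of that first-line heuristic might look like the following. The function name is hypothetical, and the pattern only covers ASCII letters, spaces, hyphens and periods, so accented names would need a wider character class.

```python
import re
from typing import Optional

# Loose pattern: starts with a letter, then letters, spaces, hyphens or periods.
NAME_LINE = re.compile(r"[A-Za-z][A-Za-z.\- ]*")


def guess_name(resume_text: str) -> Optional[str]:
    """Return the first non-empty line if it looks like a name, else None."""
    for line in resume_text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip leading blank lines
        # Only the first non-empty line is considered.
        return candidate if NAME_LINE.fullmatch(candidate) else None
    return None
```

As expected from a guess this loose, a resume that opens with the heading "Curriculum Vitae" would have that heading returned as the name.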
Unless you wanted to build an expression that contained every possible name, or-ed together, the expression you are referring to is not "Regular," with a capital R. A good guess might be to go looking for the largest-font words in the document. If they follow a pattern that looks like firstname-lastname, name-initial-name, etc., you could call it a good guess...
That's a really hairy problem to tackle. The regex has to match two words that could be someone's name. The problem with that is that some people, of Hispanic origin, for example, might have a name that's more than 2 words. Also, how would you define two words to match for a name? Would you use a database of common first and last name fields? That might work unless someone has an uncommon name.
I'm reminded of a story a COBOL teacher in college told me about an individual of Asian origin whose name would break every rule the programmers defined for a bank's internal system. His first name was "O", just the letter O.
The only remotely dependable way to nail down the regex would be if you had something to set off your search with; maybe if a line of text in the resume began with "Name: " then you'd know where to start looking.
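If the resumes did carry such a label, the lookup could be sketched in Python like this. The "Name:" label comes from the example above; the function name is hypothetical, and the sketch assumes the name sits on the same line as the label.

```python
import re
from typing import Optional

# A line that starts with "Name:", followed by the rest of that line.
NAME_LABEL = re.compile(r"^Name:[ \t]*(.+)$", re.MULTILINE)


def labelled_name(resume_text: str) -> Optional[str]:
    """Return the text after a 'Name:' label, or None if there is no usable label."""
    match = NAME_LABEL.search(resume_text)
    if match is None:
        return None
    name = match.group(1).strip()
    return name or None  # treat an empty value after the label as missing
```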
tl;dr: People's names and individual resumes are too heavily varied for a regular expression to pick apart.
You could do something like Amazon does for book overviews: SIPs. This would require some after-the-fact double checking by humans but you might find the person's name(s) in there.