google-speech-api and overriding phone number recognition - google-cloud-platform

Does anyone know if there is a way to manipulate the recognition of phone numbers when using the Google Speech API? I am trying to implement a transcription scenario where a caller will say a string of letters and numbers, but the logic out of the box seems to be to try to fit any sequence of numbers to a phone number scheme, even if it means rendering letters into numbers they may sound vaguely similar to (or not). I have tried using speech contexts to manipulate the values within the "phone number" by typing out and giving the entire thing as it should be as a speech context ("eight seven seven two bee three seven", for example), but it refuses to override the digits being interpreted as a phone number. Has anyone encountered this issue or is aware of any way in which this could be worked around?
Thanks!

I'm not aware of an easy way to do this. For the Web Speech API for JavaScript, the following seems to yield fewer results that are forced into phone number format:
Set maxAlternatives to 2, e.g.,
var recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.maxAlternatives = 2;
Then use the second result offered, e.g.,
const speechToText = event.results[0][1].transcript;
You can get pretty far by processing the result. A remaining challenge is that since the result often clumps digits together, you lose the distinction between a series of single digit numbers and one multi-digit number (e.g., '15' & '1', '5'). The utility of this approach depends on the specifics of the numbers your app is trying to capture.
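As a rough illustration of "processing the result", here is a minimal Python sketch that strips phone-style punctuation from a transcript. The function name and the sample string are made up for the example, and it does nothing about the digit-clumping problem described above.

```python
import re

def strip_phone_formatting(transcript: str) -> str:
    """Remove phone-style punctuation and spacing from a transcript (hypothetical helper)."""
    # Drop parentheses, hyphens, plus signs, periods and whitespace.
    return re.sub(r"[()\-+.\s]", "", transcript)

print(strip_phone_formatting("(877) 2B3-7"))  # 8772B37
```

Whether this is enough depends on how the recognizer formats the strings in your case.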

In at least one case, setting the language to en-PH (English Philippines) seems to have fixed, or at least notably improved, this problem. Other English language options might work as well.
en-GB comes back as a UK formatted number where they put one digit first then the rest of the number.

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Everything annoyingly is in lowercase which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of consonant clusters, groups of them that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every single word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the process: use a hash map :)
You could also use an auto-correction API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
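A minimal Python sketch of the dictionary step, assuming you have already loaded a word list into a set (the tiny WORDS set below is only a stand-in, and the auto-correction step is left out):

```python
def remove_non_words(sentence: str, dictionary: set[str]) -> str:
    """Keep only the words that appear in the dictionary set."""
    # Set membership is a hash lookup, so this stays fast for large word lists.
    kept = [word for word in sentence.split() if word.lower() in dictionary]
    return " ".join(kept)

# Stand-in dictionary; in practice load one of the word-list files into a set.
WORDS = {"the", "cat", "sat", "on", "mat"}

print(remove_non_words("the cat kuashdixbkjshakd sat on the mat", WORDS))
# the cat sat on the mat
```

Note that this also drops real words that are missing from the word list, so the quality of the dictionary matters.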
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia) but it's hard to predict how precise exactly this will be.
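To make the sliding-window idea concrete, here is a small Python sketch that counts all 1- to 5-character sequences and then flags 3-grams that never occurred in the training text. The function names are my own, and training on a single word is obviously just for demonstration.

```python
from collections import Counter

def ngram_counts(text: str, max_n: int = 5) -> Counter:
    """Count every substring of length 1..max_n using a sliding window."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def unseen_trigrams(word: str, model: Counter) -> list[str]:
    """Return the 3-grams of `word` that never occurred in the training data."""
    trigrams = [word[i:i + 3] for i in range(len(word) - 2)]
    return [t for t in trigrams if t not in model]

model = ngram_counts("abracadabra")
print(model["a"], model["ab"], model["abrac"])  # 5 2 1
print(unseen_trigrams("abrax", model))          # ['rax']
```

With a real corpus you would flag a word when the number (or share) of unseen 3-grams exceeds some threshold that you tune on your own data.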
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words with more than 5 consecutive consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
See live demo.
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
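The && intersection inside a character class is Java regex syntax and is not supported by Python's re module, so a Python version has to spell the consonants out. A minimal sketch, assuming lowercase input split on whitespace and treating "y" as a vowel:

```python
import re

# Six or more consonants in a row anywhere inside a lowercase word ("y" is not counted).
CONSONANT_RUN = re.compile(r"\b[a-z]*[bcdfghjklmnpqrstvwxz]{6}[a-z]*\b")

def strip_consonant_runs(sentence: str) -> str:
    """Drop every word that contains a run of 6+ consonants."""
    return " ".join(w for w in sentence.split() if not CONSONANT_RUN.fullmatch(w))

print(strip_consonant_runs("witchcraft is kuashdixbkjshakd fun"))
# witchcraft is fun
```

"witchcraft" survives because its longest run ("tchcr") is only 5 consonants, while the junk string contains "xbkjsh".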

Pepper - Number recognition

I'd like to use Pepper as a calculator...
Now the first problem is the number recognition...
Considering that "*" does not work, and that it is not possible to add all numbers as concept....It is a problem.
I can write a concept composition to recognize numbers spelled out in words (not in digits), e.g. "one hundred twenty one" instead of 121. Now I don't know how to convert the spelled-out number to digits in a simple way... (the only way I know is using a parser in a remote Python function)
Another problem is that I cannot make a "sum" in qiChat language.
Is there a way to make a sum in qichat without using a %script?
If I use a script I cannot assign the result to a qichat variable, the only way is using events...
Thanks if you can suggest some simpler way to proceed.
Debora
Starting from the 100s, the numbers are expressed in a systematic way, like in "three million two hundreds fifty two thousands six hundreds and ninety one" for "3,252,691".
As you can see, you need neither to capture * nor to list "all numbers"; a combination of just a few possible chunks is enough.
concept:(digits) ["one", "two", ...]
concept:(tens) ["ten", "twenty", ...]
concept:(number_tens) {~tens} {~digits}
concept:(number_hundreds) {~digits hundred{s}} ~number_tens
concept:(number) {~number_hundreds million{s}} {~number_hundreds thousand{s}} ~number_hundreds
u:(_~number)
$number_out=$1
An ALMemory event named number_out should be raised with the value when a number is matched. You can subscribe to it and process it with a script to convert the natural language into numbers, as proposed here, for example.
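For the conversion step, a minimal Python sketch could look like the following. It is not tied to any Pepper or NAOqi API: it only assumes the matched text arrives as a plain string, and the function name is my own. Plural forms such as "hundreds" are accepted because the concepts above allow them.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(text: str) -> int:
    """Convert a spelled-out English number (up to the millions) to an int."""
    total = 0    # sum of the completed million/thousand groups
    current = 0  # value of the group currently being read (0..999)
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in ("hundreds", "thousands", "millions"):
            word = word[:-1]  # accept the plural forms
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unknown number word: {word!r}")
    return total + current

print(words_to_number("one hundred twenty one"))  # 121
```

The sum itself can then be done in the same script, with the result sent back to the dialog through an event.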

Understanding SpamAssassin HK_RANDOM regex

SpamAssassin has several rules that attempt to detect "random looking" values. For example:
/^(?!(?:mail|bounce)[_.-]|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)|[^@]{26}|.*?@.{0,20}\bcmp-info\.com$)[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi
I understand that the first part of the regex prevents certain cases from matching:
(?!(?:mail|bounce)[_.-]|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)|[^@]{26}|.*?@.{0,20}\bcmp-info\.com$)
However, I am not able to understand how the second part detects "randomness". Any help would be greatly appreciated!
/[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi
It will match strings containing 5 consecutive consonants (excluding h and s for some reason) :
[bcdfgjklmnpqrtvwxz]{5}
or 5 consecutive vowels :
[aeiouy]{5}
or the same letter or pair of letters occurring 4 times in a row (the first occurrence plus 3 repetitions) :
([a-z]{1,2})(?:\1){3}
Here are a few examples of strings it will match :
somethingmkfkgkmsomething
aiaioe
totototo
aaaa
It obviously can't detect randomness; however, it can identify patterns that don't often occur in meaningful strings, and flag those patterns as looking random.
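If you want to experiment with it, here is a small Python sketch that applies only the three alternatives of the second part (the leading negative lookahead and the address prefix are left out) to the example strings. The helper name is my own.

```python
import re

# The three "random looking" alternatives, case-insensitive like the original /i flag.
RANDOM_PART = re.compile(
    r"[bcdfgjklmnpqrtvwxz]{5}"   # 5 consecutive consonants (h, s and y are not in the class)
    r"|[aeiouy]{5}"              # 5 consecutive vowels
    r"|([a-z]{1,2})(?:\1){3}",   # 1 or 2 letters followed by 3 repetitions of themselves
    re.IGNORECASE,
)

def looks_random(text: str) -> bool:
    """True if any of the three patterns occurs somewhere in the text."""
    return RANDOM_PART.search(text) is not None

for sample in ["somethingmkfkgkmsomething", "aiaioe", "totototo", "aaaa", "hello world"]:
    print(sample, looks_random(sample))
```

Only "hello world" comes back False here; the other four each hit one of the alternatives.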
It is also possible that these patterns are constructed "from experience", after analysis of a number of emails crafted by spammers, and would actually reflect the algorithms behind the tools used by these spammers or the process they use to create these emails (e.g. some degree of keyboard mashing ?).
The bottom line is that you can't detect randomness in a single piece of data. What you can do, however, is try to detect purpose, and if you don't find any, then assume that to the best of your knowledge it is random. SpamAssassin assumes a few rules about human communication (which might fit different languages better or worse: as is, it will flag a few forms of the French imperfect tense such as "échouaient"), and if the content doesn't match them it reports it as "random".

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with regular expressions. But now that it is being migrated to Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 and 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]' + REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
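For the actual number check mentioned above (as opposed to the format check), the usual algorithm is the Luhn checksum. If you end up validating outside the database, a minimal Python sketch might look like this; the function name is my own, and passing the checksum does not mean the card exists.

```python
def luhn_is_valid(number: str) -> bool:
    """Luhn checksum over the digits of `number` (dashes and spaces are ignored)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_is_valid("4111 1111 1111 1111"))  # True (a common test number)
print(luhn_is_valid("4111 1111 1111 1112"))  # False
```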
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years, these are things that will be much easier (and probably more efficient) to check by seeing if the numeric value is in a table rather than converting it to a string and seeing if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or make complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.

US Phone Number Verification

I have a website form that requires a US phone number for follow-up purposes, and this is very necessary in this case. I want to try to stop users from entering junk data such as 330-000-0000. I have seen some third-party services that validate phone numbers for you, but I don't know if that is the best option for this situation. If you have ever used one of these third parties and can make a recommendation, that would also be greatly appreciated.
However I am considering checking the number against a set of rules to just try to narrow down the junk phone numbers received.
not a 555 number
does not contain 7 identical digits
valid area code (this is readily available)
not a 123-1234 or 123-4567
I guess I could also count out 867-5309 (heh*)
Would this result in any situations that you can think of that would not allow a user to enter their phone number? Could you think of any other rules that a phone number should not contain? Any other thoughts?
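For reference, the rules listed above could be sketched in Python roughly as follows. The function name is made up, "7 identical digits" is read as seven identical digits in a row, and the area-code lookup is left out because it needs an external list.

```python
import re

def looks_like_junk(phone: str) -> bool:
    """Apply the simple junk-number rules to a US phone number string."""
    digits = re.sub(r"\D", "", phone)           # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                     # drop a leading country code
    if len(digits) != 10:
        return True
    exchange, local = digits[3:6], digits[3:]
    if exchange == "555":                       # not a 555 number
        return True
    if re.search(r"(\d)\1{6}", digits):         # seven identical digits in a row
        return True
    if local in {"1231234", "1234567"}:         # 123-1234 or 123-4567
        return True
    return False

print(looks_like_junk("330-000-0000"))  # True
print(looks_like_junk("216-234-5678"))  # False
```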
It seems to me that you're putting more effort into this than it warrants. Consider:
If your purpose is to guard against mis-entered phone numbers, then you can probably catch well over 90% of them with just a very simple check.
If your purpose is to try to force users to provide a valid number whether they want to give that information out or not, then you've taken on a hopeless task - even if you were able to access 100% accurate, up-to-the-second telco databases to verify that the exact number entered is currently live, you still don't gain any assurance that the number they gave you is their own. Once again, a simple check will foil the majority of people entering bogus numbers, but those who are willing to try more than two or three times will find a way to defeat your attempts to gain their numbers.
Either way, a simple test is going to get you good results and going into more complex rule sets will take up increasingly more time while providing increasingly little benefit to you (while also potentially adding false positives, as already shown with the "seven of the same digit" and 867-5309 cases).
You can do phone number validation internally in your app using regular expressions. Depending on your language you can call a function that will return true if a supplied phone number matches the expression.
In PHP:
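function phone_number_is_valid($phone) {
    // preg_match is used because eregi was removed in PHP 7 and its POSIX syntax does not support (?: ) or \d
    return (bool) preg_match('/^(?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4}$/', $phone);
}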
You can look up different regular expressions online. I found the one above at http://regexlib.com/DisplayPatterns.aspx?categoryId=7&cattabindex=2
Edit: Some language specific sites for regular expressions:
PHP at php.net: http://php.net/regex
C# at MSDN
Java: http://java.sun.com/developer/technicalArticles/releases/1.4regex/
867-5309 is a valid phone number that is assigned to people in different area codes.
If you can verify the area code then unless you really, really need to know their phone number you're probably doing as much as is reasonable.
In Django there is a nice little contrib package called localflavor which has a lot of country-specific validation code, for example for postal codes or phone numbers. You can look in the source to see how Django handles these for the country you would like to use; for example: US Form validation. This can also be a great resource for information about countries you know little about.
Your customers can still do what I do, which is give out the local moviefone number.
Also, 123-1234 or 123-4567 are only invalid numbers because the prefix begins with a 1, but 234-5678 or 234-1234 would actually be valid (though it looks fake).
Maybe take a look at the answers to this question.
If you're sticking with just US- and Canada-format numbers, I think the following regex might work:
[2-9][0-9][0-9]-[2-9][0-9][0-9]-[0-9][0-9][0-9][0-9] & ![2-9][0-9][0-9]-555-[0-9][0-9][0-9][0-9]
You also need to take into account ten-digit dialing, which is used in some areas now: this is different from long-distance dialing (ie, 303-555-1234, as opposed to 1-303-555-1234). In some places, a valid phone number is ten digits long; in others, it is seven.
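The "matches the first pattern and not the 555 pattern" idea can be folded into one expression with a negative lookahead. A minimal Python sketch, assuming the number is always entered as NNN-NNN-NNNN (seven-digit entry is not handled, and the names are my own):

```python
import re

# Area code and exchange start with 2-9; (?!555) rejects a 555 exchange.
NANP = re.compile(r"^[2-9][0-9]{2}-(?!555)[2-9][0-9]{2}-[0-9]{4}$")

def is_plausible_nanp(phone: str) -> bool:
    """True for NNN-NNN-NNNN numbers that pass the basic digit rules."""
    return NANP.match(phone) is not None

print(is_plausible_nanp("303-234-5678"))  # True
print(is_plausible_nanp("303-555-1234"))  # False
```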
This is a quick function that I use (below). I do have access to a zipcode database that contains areacode and prefix data which is updated monthly. I have often thought about doing a data dip to confirm that the prefix exists for the area code.
public static bool isPhone(string phoneNum)
{
    // Requires: using System.Text.RegularExpressions;
    Regex rxPhone1, rxPhone2;
    rxPhone1 = new Regex(@"^\d{10,}$");
    rxPhone2 = new Regex(@"(\d)\1\1\1\1\1\1\1\1\1");
    if (phoneNum.Trim() == string.Empty)
        return false;
    if (phoneNum.Length != 10)
        return false;
    // Check that the phone number consists only of digits (at least 10 of them)
    if (!rxPhone1.IsMatch(phoneNum))
        return false;
    // Check for repeating characters (ex. 9999999999)
    if (rxPhone2.IsMatch(phoneNum))
        return false;
    // Make sure first digit is not 1 or zero
    if (phoneNum.Substring(0, 1) == "1" || phoneNum.Substring(0, 1) == "0")
        return false;
    return true;
}
I don't know if this is the right place, since it's a formatting function rather than a validation function, but I thought I'd share it with the community; maybe one day it will be helpful.
Private Sub OnNumberChanged()
    Dim sep = "-"
    ' Keep only the digits of the entered number
    Dim num As String = Number.ToCharArray.Where(Function(c) Char.IsDigit(c)) _
        .ToArray
    Dim ext As String = Nothing
    ' Anything beyond the first 10 digits is treated as an extension
    If num.Length > 10 Then ext = num.Substring(10)
    ext = If(String.IsNullOrEmpty(ext), "", " x" & ext)
    _Number = Left(num, 3) & sep & Mid(num, 4, 3) & sep & Mid(num, 7, 4) & ext
End Sub
My validation function is like so:
Public Shared Function ValidatePhoneNumber(ByVal number As String) As Boolean
    Return number IsNot Nothing AndAlso number.ToCharArray. _
        Where(Function(c) Char.IsNumber(c)).Count >= 10
End Function
I call this last function in the OnNumberChanging(number As String) method of the entity.
For US and International Phone validation I found this code the most suitable:
((\+[1-9]{1,4}[ \-]*)|(\([0-9]{2,3}\)[ \-]*)|([0-9]{2,4})[ \-]*)*?[0-9]{3,4}?[ \-]*[0-9]{3,4}?$
You can find an (albeit somewhat dated) discussion here.
Those parameters look pretty good to me, I might also avoid numbers starting with 911 just to be safe.