Pepper - Number recognition

I'd like to use Pepper as a calculator...
Now the first problem is the number recognition...
Considering that "*" does not work, and that it is not possible to add every number as a concept, this is a problem.
I can write a concept composition to recognize numbers spelled out in words (not in digits), e.g. "one hundred twenty one" instead of 121. But I don't know a simple way to convert the spelled-out number to its digit form... (the only way I know is using a parser in a remote Python function).
Another problem is that I cannot compute a "sum" in the qiChat language.
Is there a way to make a sum in qiChat without using a %script?
If I use a script, I cannot assign the result to a qiChat variable; the only way is using events...
Thanks if you can suggest some simpler way to proceed.
Debora

Starting from the hundreds, numbers are expressed in a systematic way, as in "three million two hundred fifty two thousand six hundred and ninety one" for "3,252,691".
As you can see, you need neither to capture * nor to list "all numbers", but only a combination of a few possible chunks.
concept:(digits) ["one" "two" ...]
concept:(tens) ["ten" "twenty" ...]
concept:(number_tens) {~tens} {~digits}
concept:(number_hundreds) {~digits hundred{s}} ~number_tens
concept:(number) {~number_hundreds million{s}} {~number_hundreds thousand{s}} ~number_hundreds
u:(_~number) $number_out=$1
An ALMemory event named number_out will be raised with the matched text whenever a number is recognized. You can subscribe to it and process it with a script that converts the natural-language number into digits, like the one proposed here, for example.
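For the conversion itself, a small word-to-number parser is usually enough. Here is a minimal sketch in Python (the function name words_to_int and the word tables are my own illustration, not part of NAOqi) that you could call from the Python script subscribed to the number_out event:

# Minimal spelled-out-number to integer converter (illustrative sketch).
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
         "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"hundred": 100, "hundreds": 100, "thousand": 1000,
          "thousands": 1000, "million": 1000000, "millions": 1000000}

def words_to_int(text):
    total, current = 0, 0
    for word in text.lower().split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word in SCALES:
            if SCALES[word] == 100:
                current *= 100        # "two hundred" -> 200
            else:
                total += current * SCALES[word]   # close the thousand/million group
                current = 0
    return total + current

print(words_to_int("three million two hundred fifty two thousand six hundred and ninety one"))
# -> 3252691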

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Annoyingly, everything is in lowercase, which makes it more difficult, but since I only care about English, I'm essentially looking for groups of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me:
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every single word in the sentence to a function that checks whether the word is listed in a dictionary. There are plenty of dictionary text files on GitHub. To speed up the process, use a hash map :)
You could also use an auto-correction API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
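As a rough sketch of that dictionary step in Python (the file name words.txt is a placeholder for whichever word list you download; the set gives the hash-map lookup mentioned above):

import re

# Load a dictionary word list into a set (hash-based) for fast membership tests.
# "words.txt" is a placeholder for whichever word list you use.
with open("words.txt", encoding="utf-8") as f:
    DICTIONARY = {line.strip().lower() for line in f if line.strip()}

def strip_non_words(sentence):
    """Keep only tokens that appear in the dictionary."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return " ".join(t for t in tokens if t in DICTIONARY)

print(strip_non_words("please ignore kuashdixbkjshakd in this sentence"))
# -> "please ignore in this sentence"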
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data you could build a model which identifies unusual patterns (my suggestion would be to start with a window size of 3 or smaller, and train it on several human languages from, say, Wikipedia), but it's hard to predict exactly how precise this will be.
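To make the sliding-window idea concrete, here is a minimal character n-gram sketch in Python (this is not the SpamAssassin/TextCat code; the toy training string and the flagging rule are illustrative assumptions):

from collections import Counter

def char_ngrams(text, max_n=3):
    """Collect all 1..max_n character grams of a string."""
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

# Train on real text (here a toy string; use a large corpus such as Wikipedia in practice).
model = char_ngrams("abracadabra")

def looks_gibberish(word, n=3):
    """Flag a word if it contains any n-gram never seen in the training data."""
    unseen = [word[i:i + n] for i in range(len(word) - n + 1)
              if word[i:i + n] not in model]
    return len(unseen) > 0

print(looks_gibberish("abra"))              # False: all trigrams appear in the training string
print(looks_gibberish("kuashdixbkjshakd"))  # True with this toy model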
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover at least all the 3-grams you find in your training data.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
The threshold was chosen because I believe "witchcraft" has the longest run of consonants (five) of any English word, so the regex flags runs of six or more. You could dial the "6" back to 5 or even 4 if you don't mind matching some outliers.
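If you are not working in Java (the && character-class intersection above is Java regex syntax), the same idea works with an explicit consonant class; a minimal sketch in Python (the sample sentence is just an illustration):

import re

# Explicit consonant class (y excluded, as suggested above); flags runs of 6+ consonants.
GIBBERISH = re.compile(r"\b[a-z]*[bcdfghjklmnpqrstvwxz]{6}[a-z]*\b")

sentence = "i want to remove kuashdixbkjshakd but keep witchcraft"
print([w for w in sentence.split() if GIBBERISH.search(w)])
# -> ['kuashdixbkjshakd']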

google-speech-api and overriding phone number recognition

Does anyone know if there is a way to manipulate the recognition of phone numbers when using the Google Speech API? I am trying to implement a transcription scenario where a caller will say a string of letters and numbers, but the logic out of the box seems to be to try to fit any sequence of numbers to a phone number scheme, even if it means rendering letters into numbers they may sound vaguely similar to (or not). I have tried using speech contexts to manipulate the values within the "phone number" by typing out and giving the entire thing as it should be as a speech context ("eight seven seven two bee three seven", for example), but it refuses to override the digits being interpreted as a phone number. Has anyone encountered this issue or is aware of any way in which this could be worked around?
Thanks!
I'm not aware of an easy way to do this. For the Web Speech API in JavaScript, doing the following seems to yield fewer results that are forced into phone-number format:
Set maxAlternatives = 2, e.g.,
var recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.maxAlternatives = 2;
Then use the second result offered, e.g.,
const speechToText = event.results[0][1].transcript;
You can get pretty far by post-processing the result. A remaining challenge is that the result often clumps digits together, so you lose the distinction between a series of single-digit numbers and one multi-digit number (e.g., '15' vs. '1', '5'). The utility of this approach depends on the specifics of the numbers your app is trying to capture.
In at least one case, setting the language to en-PH (English Philippines) seems to have fixed, or at least notably improved, this problem. Other English language options might work as well.
en-GB comes back as a UK formatted number where they put one digit first then the rest of the number.

Google-speech-api transcribing spoken numbers incorrectly

I started using google speech api to transcribe audio.
The audio being transcribed contains many numbers spoken one after the other.
E.g. 273 298
But the transcription comes back 270-3298
My guess is that it is interpreting it as some sort of phone number.
What I want is unparsed output, e.g. "two seventy three two ninety eight", which I can deal with and parse on my own.
Is there a setting or support for this kind of thing?
thanks
So I had this exact same problem and I think we found a solution. If you're using English as input, switch to en-PH just when working with numbers. Google will then not format the result as a U.S. phone number or try to stick an extra digit in there.
Try passing a speech context with some phrase hints. How to use it is documented here: https://cloud.google.com/speech/docs/basics#phrase-hints
Give it the spelled out numbers that you want recognized.
"speech_context": {
"phrases":["zero", "one", "two", ... "nine", "ten", "eleven", ... "twenty", "thirty,..., "ninety"]
}
This isn't guaranteed to work, but it may help.
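For reference, the same phrase hints can be passed through the Cloud Speech-to-Text Python client library; a minimal sketch (the file name, sample rate and language code are assumptions about your audio):

from google.cloud import speech

client = speech.SpeechClient()

# Phrase hints: bias recognition toward spelled-out numbers.
context = speech.SpeechContext(
    phrases=["zero", "one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine"]
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,          # assumption about the recording
    language_code="en-US",
    speech_contexts=[context],
)

with open("numbers.wav", "rb") as f:   # placeholder file name
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)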
For the record, I tried blambert's solution above and it doesn't work, unfortunately. I posted another question recently seeing if anyone has found a way to defeat this behavior, as it is preventing me from implementing a transcription service that I had planned.
Have you tried Google Speech customClass?
There are some class tokens you can use to tell the API that you are not expecting a phone number but a different type of number.
For instance, if you choose OOV_CLASS_AM_RADIO_FREQUENCY, you tell the API to interpret numbers like this:
"twelve twenty" --> 1220
"seven hundred and thirty" --> 730
Probably (I haven't verified this) the API uses the FULLPHONENUM class by default for numbers:
"one eight hundred five five five four oh oh one" --> +1-800-555-4001
"seven one eight five five five six one oh one" --> 718-555-6101

Complex regex to check for two words and a quantity

Ok what I'm trying to do is to check for the presence of
"TestItem-1"
a number greater than 1
one of the possible words in the list of "KG. Kg, kg, Kilo(s) or Kilogram(s)"
Where any of the items could be in any order and within a 6 word limit of each other.
Has to be done in regex, as there is no access to the underlying scripting engine.
This is what I've got; as there isn't a way of checking "greater than", I decided to use a range of 1-999 for the number check.
\b(?:[T|t]estItem-1\W+(?:\w+\W+){1,6}(^[0-9]|[1-9][0-9]|[1-9][0-9][0-9])$)\W+(?:\w+\W+){1,6}[K|k]il[o|os]|[K|k][[G|GS]|[g|gs]]|[|K|k]ilogra[m|ms]\b
Examples of what I need to find would be like -
"TestItem-1 is unstable in quanties above 12 Kilograms"
"1 Kilogram of TestItem-1"
While I wouldn't want to find -
"15 units of TestItem-1"
I know that what I've got isn't working; each section appears to work independently, but not together.
I pass this over to far greater minds than mine :)
You can try something like this:
\b(?:[2-9]|\d\d+)\s+(?:KG|[Kk]g|[Kk]ilos?|[Kk]ilograms?)\b(?:\s+\S+){0,6}\s+TestItem-1\b|\bTestItem-1\b(?:\s+\S+){0,6}\s+(?:[2-9]|\d\d+)\s+(?:KG|[Kk]g|[Kk]ilos?|[Kk]ilograms?)\b
Not ideal with the duplication but without lookarounds that's the best I could think of. I'll try and improve it in a bit.
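A quick way to sanity-check the pattern against the examples from the question, sketched here in Python (the expected True/False outputs are for the pattern above; most PCRE-style engines behave the same):

import re

PATTERN = re.compile(
    r"\b(?:[2-9]|\d\d+)\s+(?:KG|[Kk]g|[Kk]ilos?|[Kk]ilograms?)\b(?:\s+\S+){0,6}\s+TestItem-1\b"
    r"|\bTestItem-1\b(?:\s+\S+){0,6}\s+(?:[2-9]|\d\d+)\s+(?:KG|[Kk]g|[Kk]ilos?|[Kk]ilograms?)\b"
)

tests = [
    "TestItem-1 is unstable in quantities above 12 Kilograms",  # should match
    "15 units of TestItem-1",                                   # should not match
]
for t in tests:
    print(bool(PATTERN.search(t)), "-", t)
# -> True - TestItem-1 is unstable in quantities above 12 Kilograms
# -> False - 15 units of TestItem-1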

Can someone provide a regex for validating and parsing a csv of integers and reals

I am new to regex and struggling to create an expression to parse a csv containing 1 to n values. The values can be integers or real numbers. The sample inputs would be:
1
1,2,3,4,5
1,2.456, 3.08, 0.5, 7
This would be used in c#.
Thanks,
Jerry
Use a CSV parser instead of RegEx.
There are several options - see this SO question and its answers, and this one, for the different options (built into the BCL and in third-party libraries).
The BCL provides the TextFieldParser (within the VisualBasic namespace, but don't let that put you off it).
A third party library that is liked by many is filehelpers.
Using REGEX for CSV parsing has been a 10 year jihad for me. I have found it remarkably frustrating, due to the boundary cases:
Numbers come in a variety of forms (here in the US, Canada):
1
1.
1.0
1000
1000.
1,000
1e3
1.0e3
1.0e+3
1.0e+003
-1
-1.0 (etc)
But of course, Europe has traditionally been different with regard to commas and decimal points:
1
1,0
1000
1.000e3
1e3
1,0e3
1,0e+3
1,0e+003
Which just ruins everything. So we ignore the German, French and Continental standard, because with commas it is impossible to work out whether a comma is separating values or is part of a value. (The Continent likes TAB instead of COMMA.)
I'll assume that you're "just" looking for numerical values separated from each other by commas and possible space-padding. The expression:
\s*(\-?\d+(?:\.\d*)?(?:[eE][\-+]?\d*)?)\s*
is a pretty fair parser of A NUMBER. It catches just about every reasonable case. It doesn't deal with embedded commas, though! It also trims off spaces on either side of the number.
From there, you can either build an iterative CSV string decomposer (walking each field, absorbing commas, assigning to an array, say), or use the scanf type function to do the same thing. I do prefer the iterative decomposition method - as it also allows you to parse out strings, hexadecimal, and virtually any other pattern you find in the data.
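Here is a sketch of that iterative decomposition, in Python purely for illustration (the same number pattern and loop port directly to C# with Regex and a foreach):

import re

# The number pattern from above: optional sign, digits, optional fraction and exponent.
NUMBER = re.compile(r"\s*(-?\d+(?:\.\d*)?(?:[eE][-+]?\d*)?)\s*")

def parse_csv_numbers(line):
    """Walk the line field by field, absorbing commas, and collect the numbers."""
    values = []
    for field in line.split(","):
        m = NUMBER.fullmatch(field)
        if m:
            values.append(float(m.group(1)))
        else:
            raise ValueError("not a number: %r" % field)
    return values

print(parse_csv_numbers("1,2.456, 3.08, 0.5, 7"))
# -> [1.0, 2.456, 3.08, 0.5, 7.0]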
The regex you want is
#"([+-]?\d+(?:\.\d+)?)(?:$|,\s*)"
...from which you'll want capture group 1. However, don't use regex for something like this. String manipulation is much better when the input is in a very static, predictable format:
string[] nums = strInput.Split(", ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
List<float> results = (from n in nums
                       select float.Parse(n)).ToList();
If you do use regex, make sure you do a global capture.
I think you would have to loop it to check for an unknown number of ints... or else something like this:
/ *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) */
and you could keep that going ",?([0-9]*)" as far as you wanted to, to account for a lot of numbers. The result would be an array of numbers....
http://jsfiddle.net/8URvL/1/