I have two utterances:
SampleQuestion what's the {Currency} {Currencytwo} rate
SampleQuestion what is the {Currency} {Currencytwo} rate
The first one works (what's), while the second one doesn't (what is)
What could be a possible reason?
Voice recognition is something that is very hard to test. What is and is not recognized can vary depending on the person speaking, background noise, etc. There are a few things to try to debug your problem.
In the companion app Alexa often types "what it thought it heard". You might check this to see what Alexa thinks it heard when it didn't recognize something.
You can type specific phrases into the simulator on the development page for your skill. This can test specific renderings. However, because it bypasses the voice recognition layer it is only good for debugging the specifics of your interaction model.
Alexa performs poorly when you have two lots that are not separated by static text. You might consider if you can re-phrase your utterance to have a separating words between the two, or to ask for it as two separate utterances.
If either of your slots are custom slots, you might consider what their content is. Alexa doesn't recognize things one word at a time. It looks at the entire sequence of sounds holistically. It matches each possibility against what it heard, and picks the one with the highest confidence value. Since currencies are often foreign words, that might be perturbing things. Try cutting down your list and see if that improves things.
Related
For example, scanning the wine to tell you its name by Vivino.
But, seriously, scanning its barcode is faster and easier with high accuracy. Why do they design to use AI to scan the product by computer vision (object identification)? Is it necessary, mate?
This is not a question that is appropriate for this site.
Your question is likely to be a statement of opinion.
The answer is, not which is better, but the issue of segregation.
The product scanner does not replace the barcode scanner. At least soon.
For example, there are products that do not have source marking and are difficult to apply in-store marking.
In order to save labor, reduce time, standardize, and mechanize these products, product scanners that use AI will be good parts.
In addition, even for products that already have barcodes, it is possible to acquire the information of the target actual product and display it additionally.
At first, every product has a part that seems useless and exaggerated.
However, according to market evaluations, some of them will disappear and some will be sophisticated and essential.
That's always happening and your question seems to be one of them.
I have an important question, at the moment i am writing my last essay before starting with my bachelor thesis. It is about voice apps, which includes the alexa skills for sure.
But i need some informations about the word tolerance of the utterances. And I was not able to find some information on the internet yet. Does Alexa only recognize the utterances typed in by the developer or does Alexa uses machine learning like Google Assistant to learn about new utterances ? It is really important for my essay. So I would be very happy if you can help me with this question.
Thank you!
Alexa also recognize slightly different sentences than what you defined as utterances. But if your intent is matched also depends on how many intents you have and how similar they are.
So what happens on Amazon side is behind the scenes and I don't think they use machine learning to get your utterances to intent connection right. Because you would need to train the algorithm somehow what is right and what was a wrong connection from phrase to intent.
In their documentation they suggest to use as many utterances as possible:
It is better to provide too many samples than to provide too few
https://developer.amazon.com/de/docs/custom-skills/best-practices-for-sample-utterances-and-custom-slot-type-values.html
It will be too difficult to develop an alexa app if you need to configure all the possible variations of intent. Alexa learns from the phrases that you provide for an intent and uses machine learning to not just recognize the intents that you have configured but also the subtle variations too.
You can easily verify this by setting up a basic alexa app and testing it on online simulator.
Based on what I saw using the echo device to test the skill and not only the online simulator (they are way too different, so be sure to test the skill with the real device because the behaviour is completely different between simulator and echo) I think that yes, Alexa use ML to understand what you have say to "enforce" the understanding into something that you have put into the slot.
This is a strange behaviour, because yes, you can say something different to fill the slot but there is no guarantee that Alexa will understand correctly what you have say and will trigger the correct slot.
You can try this behaviour simply putting some random or non-real word into the slots. If you say to Alexa something similar to that word, even if it doesn't exists, you will get a match, but if you say something that is completely different, there is no guarantee that the intent will be triggered.
(eg. if you put in the slot the word "blues", even if you say "blue" Alexa try to enforce her understanding into "blues". Or even better, try putting a completely random string like "asdajhfjkak" and say to Alexa something that is similar to that and you will get a match)
I'm trying to design a desktop UI for schematics, layout, drawing stuff. Just looking for high level advice from actual software designers.
Assuming an in-memory "database", (clojure map of arbitrary depth for all user data, and possibly another one for application preferences, etc.), I'm examining how to do the model-view-controller thing on these, where the data may be rendered and modified by any one or more of:
A standalone text field that shows a single parameter, such as box width.
An "inspector" type of view that shows multiple parameters of a selected object, such as box width, height, color, checkboxes, etc.
A table/spreadsheet type of view that shows multiple parameters of multiple objects, potentially the whole database
A graphical rendering of the whole thing, such as both schematic and layout view.
Modifying any one of these should show up immediately in every other active view, both text and graphical, not after clicking "ok"... so no modal boxes allowed. If for some reason the table view, an inspector view, and a graphical rendering are all in view, dragging the corner of the box graphically should immediately show up in the text, etc.
The platform in question is JavaFX, but I'd like a clean separation between UI and everything else, so I want to avoid binding in the JFX sense, as that ties my design data very tightly to JFX Properties, increases the graininess of the model, and forces me to work outside the standard clojure functions for dealing with data, and/or deal heavily with the whole getValue/setValue world.
I'm still assuming at least some statefulness/mutability, and the use of built-in Clojure functionality such as the ability to add-watch on an atom/var/ref and let the runtime signal dependent functions.
Platform-specific interaction will rest tightly with the actual UI, such as reifying ActionListeners, and dealing with ObservableValues etc., and will attempt to minimize the reliance on things like JavaFX Property for actual application data. I'm not entertaining FRP for this.
I don't mind extending JFX interfaces or making up my own protocols to use application-specific defrecords, but I'd prefer for the application data to remain as straight Clojure data, unsullied by the platform.
The question is how to set this all up, with closest adherence to the immutable model. I see a few options:
Fine-grain: Each parameter value/primitive (ie Long, Double, Boolean, or String) is an atom, and each view which can modify the value "reaches in" as far as it needs to in the database to change the value. This could suck as there could potentially be thousands of individual values (for example points on a hand-drawn curve), and will require lots of (deref...) junk. I believe this is how JFX would want to do this, with giant arrays of Properties at the leaf nodes, etc., which feels bloated. With this approach it doesn't seem much better than just coding it up in Java/C++.
Medium-grain: Each object/record in the database is an atom of a Clojure map. The entire map is replaced when any one of its values changes. Fewer total atoms to deal with, and allows for example long arrays of straight-up numbers for various things. But this gets complicated when some objects in the database require more nesting than others.
Coarse-grain: There is just one atom: the database. Any time anything changes, the entire database is replaced, and every view needs to re-render its particular portion. This feels a bit like using a hammer to swat a fly, and a naive implementation would require everything to re-render all the time. But I still think this is the best trade off, as any primitive has a clear access path from the root node, whether it is accessed on a per-primitive level or per-record level.
I also need the ability for one data template to be instantiated many times. So for example if the user changes a symbol or shape which is used in multiple places, a single edit will apply everywhere. I believe this also requires some type of "pointer"-like behavior. I think I can store a atom to the model, then instantiate as needed, and it can work in any of the above grain models.
Any other approaches? Is trying to do a GUI editor-like tool in a functional language just stupid?
Thanks
I don't think is stupid to use a functional language to do a GUI editor-like tool. But I can't claim to have an answer to your question. Here are some links that might help you in your journey:
Stuart Sierra - Components Just Enough Structure
Chris Granger - Light Table: Explains how Light Table (source) is structured.
Chris Granger - The IDE as a Value: blog post related to the video above
Conal Elliott - Tangible Functional Programming: Using Functional Reactive Programming to create a composable UI, but his code is in Haskell.
Nathan Herzing & Chris Shea - Helping voters with Pedestal, Datomic, Om and core.async
David Nolen - Comparative Literate Programming: Shows all to use core.async to simplify UI programming in ClojureScript. The ideas here can be used in a desktop UI.
Rich Hickey - The Language of the System: Amazing talk about system programming by the creator of Clojure.
Erik Meijer has a good quote about functional vs imperative code:
...no matter whether it's Haskell, C# Java, F#, Scala, Python, PHP think about the idea of having a sea of imperative code that interacts with the outside world and in there have islands of pure code where you write your functions in a pure way.
But you have to decide how big the islands are and how big the sea is. But the answer is never that there are only islands or only sea. A good programmer knows exactly the right balance.
I'm on a project that among other video related tasks should eventually be capable of extracting the audio of a video and apply some kind of speech recognition to it and get a transcribed text of what's said on the video. Ideally it should output some kind of subtitle format so that the text is linked to a certain point on the video.
I was thinking of using the Microsoft Speech API (aka SAPI). But from what I could see it is rather difficult to use. The very few examples that I found for speech recognition (most are for Text-To-Speech which mush easier) didn't perform very well (they don't recognize a thing). For example this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx
Some examples use something called grammar files that are supposed to define the words that the recognizer is waiting for but since I haven't trained the Windows Speech Recognition thoroughly I think that might be adulterating the results.
So my question is... what's the best tool for something like this? Could you provide both paid and free options? Well the best "free" (as it comes with Windows) option I believe it's SAPI, all the rest should be paid but if they are really good it might be worth it. Also if you have any good tutorials for using SAPI (or other API) on a context similar to this it would be great.
On the whole this is a big ask!
The issue with any speech recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what does each voice sound like). This might be possible in some cases, such as a TV series if you wanted to churn through hours of speech -separated for each character- to train it. There's a lot of work there though. For something like a film there's probably no hope of training a recogniser unless you can get hold of the actors.
Most film and TV production companies just hire media companies to transcribe the subtitles based on either direct transcription using a human operator, or converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.
In video you have a plethora of things that make you life difficult, pretty much spanning huge swathes of current speech technology research:
-> Multiple speakers -> "Speaker Identification" (can you tell characters apart? Also, subtitles normally have different coloured text for different speakers)
-> Multiple simultaneous speakers -> The "cocktail party problem" - can you separate the two voice components and transcribe both?
-> Background noise -> Can you pick the speech out from any soundtrack/foley/exploding helicopters.
The speech algorithm will need to be extremely robust as different characters can have different gender/accents/emotion. From what I understand of the current state of recognition you might be able to get a single speaker after some training, but asking a single program to nail all of them might be tough!
--
There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont that's specifically designed for legibility in these circumstances, and use a lookup table to cross-reference images against video timecode (remembering NTSC/PAL/Cinema use different timing formats).
--
There's a bunch of proprietary speech recognition systems out there. If you want the best you'll probably want to license a solution off one of the big boys like Nuance. If you want to keep things free the universities of RWTH and CMU have put some solutions together. I have no idea how good they are or how well they might be suited to the problem.
--
The only solution I can think of similar to what you're aiming at is the subtitling you can get on news channels here in the UK "Live Closed Captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained, I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed. Dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to miss a lot of content out to catch up.
Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.
--
As much as I hate citing the big W there's a goldmine of useful links here!
Good luck :)
This falls into the category of dictation, which is a very large vocabulary task. Products like Dragon Naturally Speaking are amazingly good and that has a SAPI interface for developers. But it's not so simple of a problem.
Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
It would be interesting though to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon Naturally Speaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech or any other ASR engine that supports SAPI. IBM would be another vendor to look at but I don't think you will do much better than Dragon.
But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:
1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.
For 1) if you had a way of separating each actor on a separate track that would be ideal. But there's no reliable way of separating speakers automatically in a way that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (some free software out there for that) and separate based on that, but this is a sophisticated and error prone task.) The best thing would be hand editing the speakers apart, but you might as well just manually transcribe the speech at that point! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
For music (2) you'd either have to hope for the best or try to filter it out. Speech is more bandlimited than music so you could try a bandpass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs but I would guess 100Hz to 2-3KHz would keep the speech intelligible.
For (3), there's no solution. The ASR engine should return confidence scores so at best I would say if you can tag low scores, you could then go back and manually transcribe those bits of speech.
(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.
Hope this helps.
I want to allow my users to input HTML.
Requirements
Allow a specific set of HTML tags.
Preserve characters (do not encode ã into ã, for example)
Existing options
AntiSamy. Unfortunately AntiSamy encodes special characters and breaks requirement 2.
Native ColdFusion functions (HTMLCodeFormat() etc...) don't work as they encode HTML into entities, and thus fail requirement 1.
I found this set of functions somewhere, but I have no way of telling how secure this is: http://pastie.org/2072867
So what are my options? Are there existing libraries for this?
Portcullis works well for Cold Fusion for attack-specific issues. I've used a couple of other regex solutions I found on the web over time that have worked well, though they haven't been nearly as fleshed out. In 15 years (10 as a CMS developer) nothing I've built has been hacked....knock on wood.
When developing input fields of any type, it's good to look at the problem from different angles. You've got the UI side, which includes both usability and client-side validation. Yes, it can be bypassed, but javascript-based validation is quicker, more responsive, and rates higher on the magical UI scale than backend-interruption method or simply making things "disappear" without warning. It will speed up the back-end validation because it does the initial screening. So, it's not an "instead of" but an "in-addition to" type solution that can't be ignored.
Also on the UI front, giving your users a good quality editor also can make a huge difference in the process. My personal favorite is CKeditor simply because it's the only one that can handle Microsoft Word code on the front-side, keeping it far away from my DB. It seems silly, but Word HTML is valid, so it won't setoff any red flags....but on a moderately sized document it will quickly overload a DB field insert max, believe it or not. Not only will a good editor reduce the amount of silly HTML that comes in, but it will also just make things faster for the user....win/win.
I personally encode and decode my characters...it's always just worked well so I've never changed practice.