I'm trying to dive into building an Alexa skill, and I have a specific game in mind that involves Alexa and the user taking turns counting up. Alexa would begin by saying one, and the user would then say two. Alexa would in turn make sure the input number is correct before outputting the next number. I'm having a hard time understanding where to start. From what I've read, it seems like each user input links to an intent. Is that the only way of going about this? Sorry if this question isn't very clear due to my lack of understanding.
For this example, you can create an intent with a slot of type AMAZON.NUMBER. That way, whatever number the user says will invoke the same intent.
Additionally, you can keep track of the game's state across turns using sessionAttributes, and handle the same intent conditionally based on that state.
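As a rough sketch (independent of the ASK SDK; in a real skill this logic would live inside your number intent's handler, and the state object would come from handlerInput.attributesManager), the turn logic could look like this:

```javascript
// Simplified counting-game turn logic. Alexa opens by saying "one",
// so the first number the user is expected to say is 2.
function handleNumberIntent(sessionAttributes, spokenNumber) {
  const expected = sessionAttributes.expectedNumber || 2;

  if (spokenNumber !== expected) {
    // Wrong number: tell the user and restart the game.
    return {
      speech: `I was expecting ${expected}. Let's try again from one.`,
      sessionAttributes: { expectedNumber: 2 },
    };
  }

  // Correct number: Alexa says the next one, and the user should
  // then say the one after that.
  const alexaSays = spokenNumber + 1;
  return {
    speech: String(alexaSays),
    sessionAttributes: { expectedNumber: alexaSays + 1 },
  };
}
```

The returned speech would go into the response builder, and the returned attributes back into the session, so each invocation of the same intent continues the game where it left off.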
I have an important question. At the moment I am writing my last essay before starting my bachelor thesis. It is about voice apps, which of course includes Alexa skills.
I need some information about the word tolerance of utterances, and I have not been able to find anything about it on the internet yet. Does Alexa only recognize the utterances typed in by the developer, or does Alexa use machine learning, like Google Assistant, to learn new utterances? It is really important for my essay, so I would be very happy if you could help me with this question.
Thank you!
Alexa also recognizes sentences that differ slightly from the utterances you defined. But whether your intent is matched also depends on how many intents you have and how similar they are.
What happens on Amazon's side is behind the scenes, and I don't think they use machine learning to map your utterances to intents, because you would need some way to train the algorithm on which phrase-to-intent connections were right and which were wrong.
In their documentation they suggest using as many utterances as possible:
It is better to provide too many samples than to provide too few
https://developer.amazon.com/de/docs/custom-skills/best-practices-for-sample-utterances-and-custom-slot-type-values.html
It would be too difficult to develop an Alexa skill if you needed to configure every possible variation of an utterance. Alexa learns from the phrases you provide for an intent and uses machine learning to recognize not just the utterances you configured but subtle variations too.
You can easily verify this by setting up a basic Alexa skill and testing it in the online simulator.
Based on what I saw testing the skill on an Echo device, not just the online simulator (they are very different, so be sure to test with a real device, because the behaviour differs completely between the simulator and the Echo), I think that yes, Alexa uses ML to coerce what you said into one of the values you put in the slot.
This is strange behaviour, because while you can say something different to fill the slot, there is no guarantee that Alexa will understand what you said correctly and resolve it to the right slot value.
You can try this behaviour by simply putting some random or made-up words into the slot values. If you say something similar to such a word, even though it doesn't exist, you will get a match; but if you say something completely different, there is no guarantee that the intent will be triggered.
(E.g., if you put the word "blues" in the slot, even if you say "blue", Alexa tries to coerce her understanding into "blues". Or even better, put in a completely random string like "asdajhfjkak", say something similar to it, and you will get a match.)
So I am basically trying to tell an interactive story with Alexa. However, I am not sure how to update an intent's response while Alexa is saying it, so that the response keeps updating while Alexa tells the story.
In my scenario, Alexa is just a daemon fetching strings from DynamoDB; as new story lines are generated, she is supposed to read and speak them as soon as they are processed. It seems that Alexa needs a completed string as the return value for its response.
Here an example:
User: Alexa, tell me a story.
Alexa: (checks DynamoDB table for a new sentence, if found) says sentence
Other Device: updates story
Alexa: (checks DynamoDB table for a new sentence, if found) says sentence
...
This would keep going until the other device puts an end signifier into the DynamoDB table, making Alexa respond with a final "This is how the story ends" output.
Has anyone experience with such a model, or an idea of how to solve it? If possible, I do not want the user to interact more than once per story.
I am thinking of a solution where I would 'fake' user intents by simply producing JSON strings and pushing them through the speechlet, requesting the new story sentences hidden from the user... Anyhow, I am not sure whether this is even possible, not to mention how messy that solution would be.. :D
Thanks in advance! :)
The Alexa skills programming model was definitely not designed for this. As you can tell, there is no way to know, from a skill, when a text to speech utterance has finished, in order to be able to determine when to send the next.
Alexa also puts a restriction on how long a skill may take to respond, which I believe is somewhere in the 5-8 seconds range. So that is also a complication.
The only way I can think of accomplishing what you want to accomplish would be to use the GameEngine interface and use input handlers to call back into your skill after you send each TTS. The only catch is that you have to time the responses and hope no extra delays happen.
Today, the GameEngine interface requires you to declare support for Echo Buttons but that is just for metadata purposes. You can certainly use the GameEngine based skills without buttons.
Have a look at the GameEngine interface docs and the "timed out" recognizer, which handles the timeout from an input handler.
When you start your story, you'll start an input handler. Send your TTS and set a timeout in the input handler of however long you expect the TTS to take Alexa to say. Your skill will receive an event when the timeout expires, and you can continue with the story.
You’ll probably want to experiment with setting the playBehavior to ENQUEUE.
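As a sketch, the directive could look something like this (the event name storyChunkFinished is arbitrary, "timed out" is the built-in recognizer name, and 8000 ms is just a placeholder for your measured TTS duration):

```javascript
// GameEngine.StartInputHandler directive that fires an event when the
// timeout elapses, i.e. roughly when the TTS should have finished.
const startInputHandlerDirective = {
  type: 'GameEngine.StartInputHandler',
  timeout: 8000,           // estimated TTS duration in milliseconds
  recognizers: {},         // no button recognizers needed for this use case
  events: {
    storyChunkFinished: {
      meets: ['timed out'],     // built-in "timed out" recognizer
      reports: 'history',
      shouldEndSession: false,  // keep the session open for the next chunk
    },
  },
};
```

You would attach this directive to each response alongside the TTS for that story chunk, then fetch and speak the next chunk when the GameEngine.InputHandlerEvent request arrives.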
I want to write an Alexa skill that would read a list of items out to me and let me interrupt when I wanted and have the backend know where I was in the list that was interrupted.
For example:
Me: Find me a news story about pigs.
Alexa: I found 4 news stories about pigs. The first is titled 'James the pig goes to Mexico', the second is titled 'Pig Escapes Local Farm' [I interrupt]
Me: Tell me about that.
Alexa: The article is by James Watson, is dated today, and reads, "Johnny the Potbelly Pig found a hole in the fence and..."
I can't find anything to indicate that my code can know where an interruption occurs. Am I missing it?
I believe you are correct: the ASK does not provide any way to know when you were interrupted. However, this is all happening in real time, so you could figure it out by observing the amount of time that passes between sending the first ASK 'tell' (i.e., where you call context.succeed(response)) and receiving the "Tell me about that" intent.
Note that the time it takes to read in en-US could be different than in en-GB, so you'll have to do separate calibrations. Also, you might have to add some pauses into your speech text to improve accuracy, since there will of course be some variability in the results due to processing times.
If you are using a service like AWS Lambda or Google App Engine that adds extra latency when no warm instances are available, then you will probably need to take that into account.
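A minimal sketch of the timing idea (the per-item durations would come from your own calibration runs; the function name and numbers are illustrative):

```javascript
// Estimate which list item Alexa was reading when the user interrupted,
// given calibrated per-item speech durations and the elapsed time between
// sending the response and receiving the follow-up intent.
function itemAtElapsed(durationsMs, elapsedMs) {
  let cumulative = 0;
  for (let i = 0; i < durationsMs.length; i++) {
    cumulative += durationsMs[i];
    if (elapsedMs < cumulative) return i; // interrupted during item i
  }
  return durationsMs.length - 1; // elapsed past the end: the last item
}
```

So if the first title takes about 2 seconds to read and the second about 2.5, an interruption 3 seconds in would point at the second story, which matches the "Tell me about that" follow-up in the example dialog.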
I have two utterances:
SampleQuestion what's the {Currency} {Currencytwo} rate
SampleQuestion what is the {Currency} {Currencytwo} rate
The first one works (what's), while the second one doesn't (what is)
What could be a possible reason?
Voice recognition is something that is very hard to test. What is and is not recognized can vary depending on the person speaking, background noise, etc. There are a few things to try to debug your problem.
In the companion app, Alexa often shows what it thought it heard. You might check this to see what Alexa thinks it heard when it didn't recognize something.
You can type specific phrases into the simulator on the development page for your skill. This can test specific renderings. However, because it bypasses the voice recognition layer it is only good for debugging the specifics of your interaction model.
Alexa performs poorly when you have two slots that are not separated by static text. You might consider whether you can re-phrase your utterance to have a separating word between the two, or ask for the values as two separate utterances.
If either of your slots are custom slots, you might consider what their content is. Alexa doesn't recognize things one word at a time. It looks at the entire sequence of sounds holistically. It matches each possibility against what it heard, and picks the one with the highest confidence value. Since currencies are often foreign words, that might be perturbing things. Try cutting down your list and see if that improves things.
If I have fields that will only ever be displayed to the user that enters them, is there any reason to sanitize them against cross-site scripting?
Edit: So the consensus is clear, that it should be sanitized. What I'm trying to understand is why? If the only user that can ever view the script they insert into the site is the user himself, then the only thing he can do is execute the script himself, which he could already do without my site being involved. What's the threat vector here?
Theoretically: no. If you are sure that only they will ever see this page, then let them script whatever they want.
The problem is that there are a lot of ways in which they can make other people view that page, ways you do not control. They might even open the page on a coworker's computer and have them look at it. It is undeniably an extra attack vector.
Example: a pastebin without persistent storage; you post, you get the result, that's it. A script can be inserted that inconspicuously adds a "donate" button linking to your PayPal account. Get it onto enough people's computers, hope someone donates, ...
I agree that this is not the most shocking and realistic of examples. However, once you have to defend a security-related decision with "that is possible but it does not sound too bad," you know you crossed a certain line.
Otherwise, I do not agree with answers like "never trust user input." That statement is meaningless without context. The point is how to define user input, which was the entire question. Trust how? Semantically? Syntactically? To what level; just size? Proper HTML? A subset of Unicode characters? The answer depends on the situation. A bare webserver "does not trust user input", yet plenty of sites get hacked today, because the boundaries of "user input" depend on your perspective.
Bottom line: avoid allowing anybody any influence over your product unless it is clear to a sleepy, non-technical consumer what they are allowing and to whom.
That rules out almost all JS and HTML from the get-go.
P.S.: In my opinion, the OP deserves credit for asking this question in the first place. "Do not trust your users" is not the golden rule of software development. It is a bad rule of thumb because it is too destructive; it detracts from the subtleties in defining the frontier of acceptable interaction between your product and the outside world. It sounds like the end of a brainstorm, while it should start one.
At its core, software development is about creating a clear interface to and from your application. Everything within that interface is Implementation, everything outside it is Security. Making a program do the things you want it to is so preoccupying one easily forgets about making it not do anything else.
Picture the application you are trying to build as a beautiful picture or photo. With software, you try to approximate that image. You use a spec as a sketch, so already here, the more sloppy your spec, the more blurry your sketch. The outline of your ideal application is razor thin, though! You try to recreate that image with code. Carefully you fill the outline of your sketch. At the core, this is easy. Use wide brushes: blurry sketch or not, this part clearly needs coloring. At the edges, it gets more subtle. This is when you realize your sketch is not perfect. If you go too far, your program starts doing things that you do not want it to, and some of those could be very bad.
When you see a blurry line, you can do two things: look closer at your ideal image and try to refine your sketch, or just stop coloring. If you do the latter, chances are you will not go too far. But you will also make only a rough approximation of your ideal program, at best. And you could still accidentally cross the line anyway! Simply because you are not sure where it is.
You have my blessing in looking closer at that blurry line and trying to redefine it. The closer you get to the edge, the more certain you are where it is, and the less likely you are to cross it.
Anyway, in my opinion, this question was not one of security, but one of design: what are the boundaries of your application, and how does your implementation reflect them?
If "never trust user input" is the answer, your sketch is blurry.
(and if you don't agree: what if OP works for "testxsshere.com"? boom! check-mate.)
(somebody should register testxsshere.com)
Just because you don't display a field to someone, doesn't mean that a potential Black Hat doesn't know that they're there. If you have a potential attack vector in your system, plug the hole. It's going to be really hard to explain to your employer why you didn't if it's ever exploited.
I don't believe this question has been answered entirely. He wants to see an actual XSS attack where the user can only attack himself. This is in fact possible through a combination of CSRF and XSS.
With CSRF you can make a user send a request carrying your payload. So if a user can attack himself using XSS, you can make him attack himself (make him send a request containing your XSS).
A quote from The Web Application Hacker's Handbook:
COMMON MYTH:
“We’re not worried about that low-risk XSS bug. A user could exploit it only to attack himself.”
Even apparently low-risk vulnerabilities can, under the right circumstances, pave the way for a devastating attack. Taking a defense-in-depth approach to security entails removing every known vulnerability, however insignificant it may seem. The authors have even used XSS to place file browser dialogs or ActiveX controls into the page response, helping to break out of a kiosk-mode system bound to a target web application. Always assume that an attacker will be more imaginative than you in devising ways to exploit minor bugs!
Yes, always sanitize user input:

1. Never trust user input
2. It does not take a lot of effort to do so.

The key point being 1.
If the script, or service, that the form submits the values to is available via the internet then anyone, anywhere, can write a script that will submit values to it. So: yes, sanitize all inputs received.
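As a minimal sketch of what "sanitize" means at output time (for real projects, prefer your framework's template auto-escaping, or a maintained sanitizer such as DOMPurify when rich HTML must be allowed, over hand-rolled escaping):

```javascript
// Escape the five HTML-significant characters so that user input is
// rendered as text rather than interpreted as markup or script.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

With this applied before the field is written into the page, a submitted `<script>` payload displays as literal text instead of executing, regardless of who ends up viewing the page.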
The most basic model of web-security is pretty simple:
Do not trust your users
It's also worth linking to my answer in another post: Steps to become web-security savvy.
I can't believe I answered without referring to the title-question:
Is there any reason to sanitize user input to prevent them from cross site scripting themself?
You're not preventing the user from being cross-site scripted; you're protecting your site (or, more importantly, your client's site) from being the victim of cross-site scripting. If you don't close known security holes because you couldn't be bothered, it will become very hard to get repeat business, or good word-of-mouth advertising and recommendations from previous clients.
Think of it less as protecting your client and more, if it helps, as protecting your business.