I have a buffer of audio and I'd like to perform speech recognition/transcription on it. I have limited CPU and RAM locally so I want to perform recognition on a server.
Are there any (web) services that allow me to do this?
My searches so far have led nowhere...
Google has just introduced browser-based access to its speech engine through HTML5.
http://slides.html5rocks.com/#speech-input
To get this page to work, I launched the Chromium browser as follows in Ubuntu:
$ chromium-browser --enable-speech-input
I believe that the idea is to be able to build applications that use Google's speech recognizer, but I haven't had a chance to look deeply into it.
Another interesting project is WAMI from MIT:
http://wami.csail.mit.edu
LumenVox offers such a service, but it seems expensive for your needs.
I am working on an AWS cloud project that involves video streaming. The media player to be used is ExoPlayer, along with the required AWS services, but I am not able to find anything about using ExoPlayer on websites. Can anyone help me with using ExoPlayer in web applications?
ExoPlayer is an Android player. For websites, you would typically use an HTML5/JavaScript player, assuming your service uses a typical video streaming technology such as HLS or DASH.
Open-source HTML5/JavaScript players are readily available:
https://github.com/videojs/video.js
https://github.com/shaka-project/shaka-player
There are also several commercial players, such as Bitmovin, THEOplayer, and JW Player.
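Whichever player you pick, it usually needs to know whether the stream is HLS or DASH. Here is a minimal sketch of that check, assuming the manifest URL ends in `.m3u8` (HLS) or `.mpd` (DASH). The helper names `streamType` and `toSource` are hypothetical and not part of any player's API. The `{ src, type }` shape is the kind of source descriptor players such as video.js accept, but check your player's documentation for the exact format.

```javascript
// Hypothetical helper: guess a stream's MIME type from its manifest URL.
function streamType(url) {
  // Look only at the path so query strings (e.g. signed-URL tokens) are ignored.
  const path = new URL(url).pathname.toLowerCase();
  if (path.endsWith('.m3u8')) return 'application/x-mpegURL'; // HLS playlist
  if (path.endsWith('.mpd')) return 'application/dash+xml';   // DASH manifest
  return null; // unknown: let the caller decide what to do
}

// Hypothetical helper: build a { src, type } source descriptor for a player.
function toSource(url) {
  const type = streamType(url);
  if (type === null) throw new Error(`Unrecognized manifest: ${url}`);
  return { src: url, type };
}
```

For example, `toSource('https://example.com/stream/master.m3u8')` (a placeholder URL) yields a descriptor with type `application/x-mpegURL`.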
Bottom Line: Cloud Run and Cloud Functions seem to have bizarrely limited bandwidth to the Google Drive API endpoints. Looking for advice on how to work around this, or, ideally, #Google support to fix the underlying issue(s), as mine will not be the only such use case.
Background: I have what I think is a really simple use case. We're trying to let Google Drive users on our private domain submit existing audio recordings to the Speech API for ad hoc transcription, with the transcript dumped back into the same Drive folder and an email notification sent to the submitter. Easy, right? The only hard part is that the Speech API will only read from Google Cloud Storage, so the 'hard part' should be moving the file over. 'Hard' doesn't really cover it...
Problem: Writing in nodejs and using the latest versions of the official modules for Drive and GCS, the file copying was extremely slow. When we broke things down, it became apparent that the GCS speed was acceptable (mostly -- honestly it didn't get a robust test, but was fast enough in limited testing); it was the Drive ingress that was causing the real problem. Even the sample Google Drive Download app from the repo was slow as can be. Thinking the issue might be either my code or the library, though, I ran the same thing from the Cloud Console, and it was fast as lightning. Same with GCE. Same locally. But in Cloud Functions or Cloud Run, it's like molasses.
Request:
Has anyone in the community run into this or a like issue and found a workaround?
#Google -- Any chance that whatever the underlying performance bottleneck is, you can fix it? This is a quintessentially 'serverless' use case, and it's hard to believe that the folks who've been doing this the longest can't crack it.
Thank you all in advance!
Updated 1/4/19 -- GCS is also slow following more robust testing. Image base also makes no difference (tried nodejs10-alpine, nodejs12-slim, nodejs12-alpine without impact), and memory limits equally do not impact results locally or on GCP (256m works fine locally; 2Gi fails in GCP).
Google Issue at: https://issuetracker.google.com/147139116
Self-inflicted wound. Google-provided code seeks to be asynchronous and do work in the background. Cloud Run and Cloud Functions do not support that model (for now at least). Move to promise-chaining and all of a sudden it works like it should -- so long as the CPU keeps the attention it needs. Limits what we can do with CR / CF, but hopefully that too will evolve.
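To illustrate the difference, here is a minimal sketch, not the actual code from this project. `copyFile` is a hypothetical stand-in for the Drive-to-GCS copy, and `send` stands in for whatever sends the HTTP response. My understanding is that these platforms throttle CPU once the response has gone out, so any work still pending at that point crawls.

```javascript
// Hypothetical stand-in for the Drive-to-GCS copy; resolves after a short delay.
function copyFile(name) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`copied ${name}`), 10);
  });
}

// Background style: the response goes out while the copy is still pending,
// so the remaining work runs after the request has ended.
function handleFireAndForget(name, send) {
  copyFile(name); // promise is neither returned nor chained
  send('accepted');
}

// Promise chaining: the response is sent only after the copy has resolved,
// so all of the work happens while the request is still active.
function handleChained(name, send) {
  return copyFile(name).then((result) => {
    send(result);
    return result;
  });
}
```

With `handleChained`, the caller (or the framework) can wait on the returned promise, and nothing is left running once the response is sent.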
I'm in need of a plug-and-play text recognition system. After trying some solutions such as Tesseract OCR, I found that Google's Vision API produced the best results for me.
However, I have never used any of their cloud APIs before, and I've noticed that it appears to be able to work offline. How would billing work for this? As I understand it, the online version charges per 1000 images; wouldn't the offline library circumvent this? What is the quality difference between online and offline?
Both online and offline charge based on the features used. Here is the pricing chart: https://cloud.google.com/vision/pricing
Quality should be similar for online and offline. You could run a small experiment with your own files to verify this.
I have read the official Microsoft SAPI documentation, but I couldn't find out whether the API can be used in offline mode.
It says that Microsoft SAPI is a server-based speech recognition API, so it seems that offline use isn't supported, but I want to make sure.
Can I use Microsoft SAPI offline, just like System.Speech?
That link does not say what you think it says. Both Microsoft.Speech.Recognition (server engine) and System.Speech.Recognition (desktop engine) run entirely on the host CPU. The underlying SR engines are different, however.
The reason why the Microsoft.Speech.Recognition engine is called "Server SR" is that it was designed to run as part of Microsoft Speech Server, which ran on an on-premises server.
If you want online (network) SR, you would need to use Windows.Media.SpeechRecognition, which has both online and offline recognition.
I've developed with SAPI using MS's stock recognizer and synthesizers for 2+ years now. I don't think I've ever needed to have a network connection for my projects to work.
Microsoft's Speech API Overview states directly:
"The SAPI application programming interface (API) dramatically reduces the code overhead required for an application to use speech recognition and text-to-speech, making speech technology more accessible and robust for a wide range of applications."
So, between my personal experience and the overview, it's safe to say you can recognize/synthesize speech in an offline mode.
What is a better mBaaS that supports offline sync and caching?
I am evaluating several mBaaS solutions for a hybrid mobile app under development. I looked at Kinvey, Kii, Buddy, and the Telerik Backend platform. I have also come across some open-source solutions like OpenMobster and DreamFactory. I am looking to store data in SQLite in the mobile app and then sync it back to an online data store. Kinvey supports this, but its pricing model (per user) is not suitable for my scenario. I can see that OpenMobster does this, but I need to understand how. Can I host it on an Azure VM or something similar? Also, please suggest any other solution, commercial or open source, capable of offline sync and caching with push notifications and data storage.
DreamFactory could be a good fit for your scenario. It is open source and comes with a full 30 days of free support, after which it's only about $25/month for a developer account - and this isn't even a requirement to use the product; it's specifically a support package.
To address your question a little more in-depth... I don't believe DreamFactory supports offline syncing at the moment, though they plan to very soon. As for SQLite, DreamFactory's DSP (DreamFactory Service Platform) has a built-in SQLite driver to connect to that DB. However, it hasn't been tested enough for them to call it a fully supported RDBMS. One of the beautiful things about DreamFactory is that you can host the DSP on Azure or Amazon EC2 instances (cloud solutions), host it locally on your own server, or even use its free hosted edition!
I would definitely take a little time to look into DF. It doesn't seem to me like you have much to lose, especially considering it's a free open-source product!
Feel free to ask me any questions you may have about DreamFactory!
-Mark