AWS transcribe speaker diarization, segments single speaker sentences into multiple different speakers

AWS transcribe speaker diarization, segments single speaker sentences into multiple different speakers - amazon-web-services

A bit of context, we have been using AWS Transcribe for English transcription since last one year. When the number of speakers is unknown, transcribe asks you to provide max number of speakers, by default we are passing 5.
Since last month we observed that the ability to differentiate between speakers has gone down drastically. Spoken words from a single speaker gets broken into multiple speaker sentences. Even when there are only 2 speakers.
Any pointers would be helpful.

Related

Dialogflow cuts off number input at 3-4 digits

I'm currently making a voice assistent and I'm running into the following issue:
The voice detection of Google Dialogflow severely butchers my numeric inputs spoken into the assistent.
Like when I clearly pronounce "One nine, eight, seven, two, three" it will turn it into 987 or 1987. It just seems to cut off listening and continues straight away when it thinks it has a full entity.
I have made a custom composite entity that is built up out of three different recognition patterns.
#sys.number-integer:number-integer
NumberRegex ^([1-9]{1}([0-9]){1,4}[0-9]{1})$
NumberCardinals #sys.cardinal:cardinal (repeating 3-6 times as composite entity)
Basically what I want to detect is a numeric input consisting of 3 numbers minimum and 6 maximum.
Typing works great, it detects all combinations flawlessly whether it's cardinals or numbers...
But speech is just a huge problem, and it cuts off before the user has finished speaking.
Anyone got any suggestions on how to overcome this? And force DialogFlow to listen to the max amount of numbers?

Regex to identify Store Credit Card numbers

There are very detailed regex expressions to identify Visa, MasterCard, Discover and other popular credit card numbers.
However, there are tons of other credit cards; termed popularly as Store Credit Cards (these are not the Visa or Amex powered cards). Examples of these cards are Amazon, GAP brands, Williams Sonoma, Macy's and so on. Most of these are Synchrony Bank Credit Cards.
Is there a regex to identify these different brand credit card numbers?

It's ludicrous to use a regex to identify the network. All it takes is a prefix matching at most.
A card number has 16 digits. The first few identify the network and the bank.
Some people would say that Visa starts with 4 and MasterCard starts with 5 but that's a broad approximation at best. You can have a look at your card, should be right most of the time.
It would be easy to figure out what a card is if one could get a registry of known prefixes, but there is no public registry to my knowledge. I highly doubt that any of the parties involved would like to publish that information.

The first eight digits (until recently this was six digits) of an international card number are known as the Issuer Identification Number (IIN) and the registry that maintains this index is the American Bankers Association
The list of IINs is updated monthly and spans tens of thousands of rows. Unfortunately a fixed Regex isn't going to be accurate for any length of time.

How to mask credit card number mask in a text?

I have a form on my website and my customers send message to me with this form. Sometimes they write their credit card number on the message. So this is really critical. I want to mask these credit card numbers. But of course card numbers don't come on a regular basis.
Example 1: 1111222233334444
Example 2: 4444 3333 2222 1111
Example 3: 4444-3333-2222-1111
Example 4: 4444 - 3333 - 2222 - 1111
Example 5: 4444--3333--2222--1111
So I can mask for example 1, 2 and 3. But if there are more than one space or dash between numbers I can't.
And this is my last regex:
preg_replace("/(?:\b| )([3456]\d{3})([ -]+){0,1}\d{4}([ -]+){0,1}\d{4}([ -]+){0,1}(\d{0})/", "$1********$2", $a1);
And results for this regex:
Result 1: 4444********1111
Result 2: 4444******** 1111
Result 3: 4444********-1111
Result 4: 4444******** - 1111
Result 5: 4444********--1111
So what should I do in regex? Thanks.

May I suggest that you separate validation of your credit card number from the presentation of that number to your users via the UI? Assuming you have only stored valid credit card numbers, then it is probably safe to assume that every number has at least 8 digits. If so, then you can just use a blanket regex to only display the first 4 and last 8 digits:
$cc = "4444--3333--2222--1111";
echo preg_replace("/(\d{4}).*(\d{4})/", "$1********$2", $cc);
4444********1111
Demo
You might point out that this puts the same number of stars in between every card number. But, then again, this is a good thing, because it makes it even harder for a snooper to fish out what the real unmasked number actually is.
Edit:
Here is a smarter regex which will star out the middle portion of any number, leaving only the first and last 4 characters visible:
$cc = "4444--3333--2222--1111";
echo preg_replace("/(?<=.{4}).(?=.{4})/", "*", $cc);
4444**************1111
Note that this solution would not remove anything from 11114444 as a theoretical input.

How to mask credit card number mask in a text [with regex]?
Don't.
Sometimes they write their credit card number on the message.
They really shouldn't. Don't encourage this behavior. It is not PCI compliant:
What is PCI Compliance?
The Payment Card Industry Data Security Standard (PCI DSS) applies to companies of any size that accept credit card payments. If your company intends to accept card payment, and store, process and transmit cardholder data, you need to host your data securely with a PCI compliant hosting provider.
When you accept credit card data via a website, do so using an approved service provider like Stripe, PayPal, BlueSnap, SecurionPay, etc. These services are immensely popular not because it's hard to make payment systems, but because they're hard to make right (and legal). They all have PHP API's, so you can have people enter credit card data that you never see, and still charge them for amounts that you agree upon.
For example, if you were using Stripe and you wish to inform your customer what credit card they signed up with, their card object has a last4 property that gives the last four digits of the card: At this point you never knew the full credit card number, and you didn't even have to consider whether giving the first four and the last four was a security violation.
Further guidelines:
Never store electronic track data or the card security number in any form
While you may have a business reason for storing credit card information, processing regulations specifically forbid the storage of a card’s security code or any “track data” contained in the magnetic strip on the back of a credit card.
The card security number, called by many acronyms including CVV2, CID, and CSC, is the three digit number on the back of Visa/MasterCard/Discover cards or the 4 digit number on the front of American Express cards. It is designed to provide a way for merchants to know whether a customer authorizing a transaction over the phone or via the Internet actually has the card in their possession. This approach only works if the security code is never stored with the card number. Electronic storage makes this easy. You simply do not create a field for the security code. For paper storage, you need to redact (cross out with a dark pen to make unreadable) the security code after you successfully process the transaction and before you store a paper authorization form. [...]
Clearly you should store neither security codes nor track data purposely. But, you need to make sure you don’t store it inadvertently as well. To do this, be certain to use only approved hardware and software. [...]
Make sure all electronic storage of credit card account numbers is encrypted and all paper storage is secured
[...] Electronic storage of credit card numbers is also common if, for example, you process recurring or repeat transactions. If you do this, you need to make certain that you never store these files unencrypted. You need to make certain that any electronic storage is encrypted using a robust encryption algorithm. That way, if your computer is stolen or if someone in your office gains unauthorized access, you have some level of protection for the credit card numbers.
There are many service providers that offer secure storage—either as a standalone service or as part of a payment processing package. These services typically provide you with a “Token” for a card number they store. You can store the token in any unsecured file. When you’re ready to process a payment, you simply send the service provider the token and it retrieves the full card number for the sole purpose of processing the payment. (It’s technically more complicated than that, but you get the idea.) Just be certain to use a PCI DSS Verified provider [...]

Check the next regex \b([3-6]\d{3})(?: *-* *\d{4}){2} *-* *(\d{4})\b.

How can I restrict the output of an Amazon Machine Learning model? (Predicting cricket team results)

I am trying to predict match winner based on the historical data set as shown below,
The data set comprises of IPL seasons and Team_Name_id vs Opponent Team are the team names in IPL. I have set the match id as Row id and created the model. When running realtime testing, the result is not as expected (shown below)
Target is set as Match_winner_id.
Am I missing any configurations? Please help

The model is working perfectly correctly. There's just two problems:
Your input data is not very good
There's no way for the model to know that only one of those two teams should win
Data Quality
A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.
For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.
However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:
Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
Match date: Useless information unless the outcome varies in proportion to time (eg a team continually gets better)
Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
Venue: Only relevant if a particular team always wins at a given venue
Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
Score: Again, not known until the actual game, so no good for future predictions.
Man of the Match: Not known for future games.
Umpire: How does an umpire influence the result of a game?
City: Yes, given that home teams often have an advantage.
You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
Picking only one of the two teams
When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
You are predicting for two given teams and you are listing the teams in either Column C or Column D and this makes no sense -- the result is the same if you swapped the teams between columns, but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.
What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome -- either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask them model to predict the result for that team, given the input data, which would be win/lose or a points above/below the other team.
But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.

How to use Amazon MWS to indicate two different shipping times on items?

I have a bit of a unique problem here. I currently have two warehouses that I ship items out of for selling on Amazon, my primary warehouse and my secondary warehouse. Shipping out of the secondary warehouse takes significantly longer than shipping from the main warehouse, hence why it is referred to as the "secondary" warehouse.
Some of our inventory is split between the two warehouses. Usually this is not an issue, but we keep having a particular issue. Allow me to explain:
Let's say that I have 10 red cups in the main warehouse, and an additional 300 in the secondary warehouse. Let's also say it's Christmas time, so I have all 310 listed. However, from what I've seen, Amazon only allows one shipping time to be listed for the inventory, so the entire 310 get listed as under the primary warehouse's shipping time (2 days) and doesn't account for the secondary warehouse's ship time, rather than split the way that they should be, 10 at 2 days and 300 at 15 days.
The problem comes in when someone orders an amount that would have to be split across the two warehouses, such as if someone were to order 12 of said red cups. The first 10 would come out of the primary warehouse, and the remaining two would come out of the secondary warehouse. Due to the secondary warehouse's shipping time, the remaining two cups would have to be shipped out at a significantly different date, but Amazon marks the entire order as needing to be shipped within those two days.
For a variety of reasons, it is not practical to keep all of one product in one warehouse, nor is it practical to increase the secondary warehouse's shipping time. Changing the overall shipping date for the product to the longest ship time causes us to lose the buy box for the listing, which really defeats the purpose of us trying to sell it.
So my question is this: is there some way in MWS to indicate that the inventory is split up in terms of shipping times? If so, how?
Any assistance in this matter would be appreciated.

Short answer: No.
There is no way to specify two values for FulfillmentLatency, in the same way as there is no way to specify two values for Quantity in stock. You can only ever have one inventory with them (plus FBA stock)
Longer answer: You could.
Sign up twice with Amazon:
"MySellerName" has an inventory of 10 and a fulfillment latency of 2 days
"MySellerName Overseas Warehouse" has an inventory of 300 and a fulfillment latency of 30 days
I haven't tried by I believe Amazon will then automatically direct the customer to the best seller for them, which should be "MySellerName" for small orders and "MySellerName Overseas Warehouse" for larger quantities.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js