Represent international phone numbers in Cloud Spanner - google-cloud-platform

What would be the best way to represent international phone numbers in Google Cloud Spanner? I would like it to be consistent with the protos defined here.
I am going on the fact that the phone numbers that I need to capture are of the e164_number format and I have represented the extension (as in the proto) as a STRING column. This is a potential fix, but I feel like there could be a more correct solution. Any help would be appreciated.

It sounds like your use case is a good fit for JSON type which Cloud Spanner now supports. You should be able to create a corresponding JSON representation for the phone_number.proto that you have linked. Please see https://cloud.google.com/spanner/docs/working-with-json for how-to as well as examples of how to query individual fields during reads.

Related

Google Natural Language Sentiment Analysis incorrect result

We have Google Natural AI integrated into our product for Sentiment Analysis (https://cloud.google.com/natural-language). One of the customers complained that when they write "BAD" then it shows a positive sentiment.
On further investigation, we found that when google Sentiment Analysis Natural Language API is called with input as BAD or Bad (pls see its in all caps or first letter caps ), it identifies text as an entity (a location or consumer good) & sends back the result as Positive while when we write "bad" in all small case, it sends negative.
Has anyone faced a similar problem? How did you solve it?
One obvious way looks like converting text into a small case but that may break some other use cases (maybe where entities do not get analyzed due to a small case text). Another way which we are building is to use our own dictionary of words with sentiments before calling google APIs but that doesn't answer the said problem, which may occur with any other text.
Inputs will help us. Thank you!
The NLP API uses an underlying model that is neural in nature. The knowledge comes from training on real world text. It is normal to get different results for different capitalizations as they can relate to different uses of the same trigram, e.g. Mike (person), mike (microphone, slang), MIKE (military alphabet entry).
The second key aspect is that the model is tuned and meant to be used on larger pieces of text and not on single words, hence good results can not be expected in this case.

Working with strings in GCP-Workflows and GCP-Admin

I'm integrating a project in GCP-Workflows with GCP-Admin, but I'm having trouble working with some data, when extracting a date it is delivered in this format: 2020-12-28T11: 20: 05.000Z, so I can't turn the string into int, and apparently there is no function in GCP like substring() either. I need to use the date with an IF, checking if it is greater or less than the reference.
How can I do this?
There is some lack of function implementation for now in Workflows. New ones are coming very soon. But I don't know if they will solve your problem
Anyway, with workflows, the correct pattern, if a built-in function isn't implemented, is to call an endpoint, for example a Cloud Function or a Cloud Run, which perform the transformation for you and return the expected result.
Quite boring to do, but don't hesitate to open feature request on the issues tracker product team is very reactive and love user feedbacks!
The Workflows standard library now includes a text module with functions for searching (including regular expressions), splitting, substrings and case transformations.

Google speech adds extra digits and mis-transcribes 9 and 10 digit strings

Scenario: a user speaks a 9 or 10 digit ID and Google speech is used to transcribe it.
Google STT sometimes forces the number into a phone number format, adding mystery digits to make it fit (and thus failing to capture the number accurately).
For example if the caller says "485839485", it may come out as "485-839-4850", with an extra digit that the caller never said. Digits are sometimes added in the middle of the number as well.
This happens even with added hints such as "one,two,three,four,five,six,seven,eight,nine,zero"
Has anyone found a workaround to this issue?
There are many open source speech recognition toolkits which will recognize number sequences reliably and for free, you just need to spend an hour to setup them.
This behavior seems to be related to the logic used by the API's model when performing the transcription tasks. Since this issue is part of an internal process that tries to fit the transcribed numbers into a phone format, I don't think there is a current workaround for this scenario; however, I recommend you to take a look on this ticket that has been created to review this issue, as well as the Release Notes documentation of Speech-to-Text API to keep the track of the new functionalities added to the service.

Text document classification with AWS Machine Learning

I've been looking at using AWS Machine Learning to implement a categorizer for my project. I have something on the order of 40,000 documents that have a several text-only features. For example: Name (< 200 chars) and Description (potentially hundreds / thousands of words).
In a nutshell, I'm looking to assign categories (0 or more) to each document based on it's content.
I've read through the AWS ML tutorial and checked out a few other sources but the available material seems to deal with feature fields that are numeric, boolean, datetime, or otherwise non-textual.
Is AWS Machine Learning capable of performing multi-class categorization on documents based primarily (or possibly only) on text fields? And if so, is there any reference material available for this particular avenue?
Mainly, you don`t need "text fields", first you have to create a vector space model (VTM) from your corpus (texts), than you can weight your VTM with tf-idf, and you can use numeric field.
Are you sure that do you want to apply AWS ML to train a corpus with only 40.000 documents?

OpenLDAP regex search with shell script

For an OpenLDAP database I need to find all users that have a telephone number matching a regex pattern, and are in a given Organizational Unit.
According to this: LDAP search using regular expression it is impossible by an ldapsearch (what would have been my first choice otherwise).
I would like to do the least possible work on clientside, and querying all users from an organizational unit and filter them by a grep or something similar seems too resource consuming. Is there a better way to do it?
Also I'm not very familiar with shell, so I'm a little afraid of "sed", but I heard it's powerful and performs well in a regex filtering. If I'd need to do the filtering client side which would be the easiest way (not compromising performance)?
And about batched inputs. If I get a lot of partial phone numbers in a CSV file, and each partial number could have the type "prefix"/"postfix"/"regex" (so it's tow coloumns: type, and partialnumber), what would be the best performance-wise?
Should I just get all the users in the organization unit and filter them by the shell script (iterating through all the users and trying to match any of the numbers)?
Or should I make a query for every number (this is only a viable option if regex filter for attributes is possible in an ldap query).
At my level of knowledge the first one is the way to go but is there a better solution?
I'm using OpenLDAP 2.4.23 it that matters in any way.
The results of using regular expressions with LDAP data might not be what you expect. LDAP data is not strings, but specific types of data defined by the schema, and application must always retrieve the schema to learn how to deal with attribute values. The telephoneNumber attribute has a specific syntax, and regular expressions may not work. In general, matching rules must be used by LDAP clients to compare and match data sooted in a directory server. In fact, best practices are that applications must alway sure matching rules, not native language comparison operators or regular expressions. For more information, please see LDAP: Programming Practices and LDAP: Using Matching Rules.