NLP with less than 20 Words on google cloud - google-cloud-platform

According to this documentation: the classifyText method requires at least 20 words.
https://cloud.google.com/natural-language/docs/classifying-text#language-classify-content-nodejs
If I send in less than 20 words I get this no matter how clear the content is:
Invalid text content: too few tokens (words) to process.
Looking for a way to force this without disrupting the NLP too much. Are there neutral vector words that can be appended to short phrases that would allow the classifyText to process anyways?
ex.
async function quickstart() {
const language = require('#google-cloud/language');
const client = new language.LanguageServiceClient();
//less than 20 words. What if I append some other neutral words?
//.. a, of , it, to or would it be better to repeat the phrase?
const text = 'The Atlanta Braves is the best team.';
const document = {
content: text,
type: 'PLAIN_TEXT',
};
const [classification] = await client.classifyText({document});
console.log('Categories:');
classification.categories.forEach(category => {
console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
});
}
quickstart();

The problem with this is you're adding bias no matter what kind of text you send.
Your only chance is to fill up your string up to the minimum word limit with empty words that will be filtered out by the preprocessor and tokenizer before they go to the neural network.
I would try to add a string suffix at the end of the sentence with just stopwords from NLTK like this:
document.content += ". and ourselves as herserf for each all above into through nor me and then by doing"
Why the end? Because usually text has more information at the beginning.
In case Google does not filter stopwords behind the scenes (which I doubt), this would add just white noise where the network has no focus or attention.
Remember: DO NOT add this string when you have enough words because you are billed for 1K character blocks before they are filtered.
I would also add that string suffix to sencences in your train/test/validation set that have less than 20 words and see how it works. The network should learn to ignore the whole sentence.

Related

How to get the most accurate term in regex?

I have an angular app using the mongodb sdk for js.
I would like to suggest some words on a input field for the user from my words collection, so I did:
getSuggestions(term: string) {
var regex = new stitch.BSON.BSONRegExp('^' +term , 'i');
return from(this.words.find({ 'Noun': { $regex: regex } }).execute());
}
The problem is that if the user type for example Bie, the query returns a lot of documents but the most accurated are the last ones, for example Bier, first it returns the bigger words, like Bieberbach'sche Vermutung. How can I deal to return the closests documents first?
A regular-expression is probably not enough to do what you are intending to do here. They can only do what they're meant to do – match a string. They might be used to give you a candidate entry to present to the user, but can't judge or weigh them. You're going to have to devise that logic yourself.

How to find a substring anywhere in a string

This should be easy, but I'm finding it difficult.
I just want to find whether a substring exists anywhere in a string. In my case, whether the name of a website exists in the title of a product.
My code is like this:
#FindNoCase("Amazon.com", "Google Chromecast available at Amazon")#
The above returns a 0 which is correct because the entire substring "Amazon.com" doesn't exist in the main string. But some of it does, namely the "Amazon" part.
How could I achieve what I'm trying to do which is just see if ANY of the substring (at least more than 2 character in length) exists in the main string?
So I need something like FindOneOf() but actually "find at least three of". It should then look at the word "Amazon" in the product title and check if at least 3 characters in the sequence of "Amazon.com" exists. When it sees that "Ama" exists, then it just needs to return a true value. Can it be done using the existing built-in functions somehow?
Update: Very simple solution. I used Left("amazon", 3).
There's a lot of danger in false positives, like if someone was buying the Alabama state flag.
Because of store names that contain spaces, this is a little tricky (Wal Mart is often written with a space).
If your string always contains at [store], you can extract the store name by finding the last at in the sentence and creating a string by chopping off everything else.
Because it looks for occurrences of at only as a whole word, there's no danger with store names such as Beats Audio, or Sam's Meat Shop. I can't think of any any stores with the word at in the name. While that would technically trip it up, there's much lower risk, and you can do a pre-replace on such store names.
<cfset mystring = "Google Chromecast available at Amazon">
<cfset SellerName = REReplaceNoCase(mystring,".*\b(?:at)\b(?!.*\b(?:at)\b)\s*","")>
<cfoutput>Seller: #Sellername#</cfoutput>
You can then do your comparisons much more safely.
Per your comment, If you know all possible patterns, you can still obtain the data if you want to (false positives can either be embarrassing or catastrophic, depending on the action). If you know the stores you're working with, you can use a regex to pull out the string like this
<cfset mystring = "Google Chromecast available at Amazon.co.uk">
<cfset SellerName = REReplaceNoCase(mystring,".*\b((Google|Amazon|Wal[\W]*Mart|E[\W]*bay)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #Sellername#</cfoutput>
The only part you need to update is the pipe-delimited list You might add K-Mart as K[\W]*Mart the [\W]* permits any special character or space so it covers kMart, K-Mart, k*Mart, but not Kwik-E-Mart.
Update #2, per more comments
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset SellerNameRE = REReplace(rsProduct.sellername,"[\W]+","[\W]*","ALL")>
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#sellernameRE#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
This replaces any symbols with the wildcard character so that symbols aren't required so that if something says Wal*Mart, it will still match WalMart.
You could also load a seperate column with "Regex Names" so that you're not doing this each time.
So your table would look something like
SellerID SellerName RegexName
1 Wal-Mart Wal[\W]*Mart
2 Toys-R-US Toys[\W]*R[\W]*US
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#rsProduct.RegexName#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
Solved it by doing this
#FindNoCase(left("Amazon.com", 3), "Google Chromecast available at Amazon")#
Yes there is potential it won't do what I need in cases where the seller name less than 3 characters long. But I think its rare enough to be ok.

Best way to compare phone numbers using Regex

I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));
Comparing phone numbers is a bit beyond the capability of regex by design. As you've discovered there are many ways to represent a phone number with and without things like area codes and formatting. Regex is for pattern matching so as you've found using the regex to strip out all formatting and then comparing strings is doable but putting logic into regex which is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes it's more work but it keeps the code more understandable going forward and helps cleanup crap data.

map-reduce function in CouchDB

I have a java program, that reads all words of a PDF file. I saved the words with the pagenumbers in a database (couchDB). Now I want to write a map and a reduce function, which list each word with the pagenumbers where the word occurs, but if words occur more than once on a page, I want just one entry. The result should be a row with word and a second row with a list (String separated with comma) of pagenumbers. Each word with the pagenumber is a separate document in couchDB.
How can I do this with a map-reduce function (filter same entries of pagenumbers)?
Thanks for help.
Surely there is more than one way of doing it. I'd go for something simple. Lets say your documents look somewhat like this:
{ 'type': 'word-index', 'word': 'Great', 'page_number': 45 }
This is a result of finding the word 'Great' on page 45. Now your view index is created by a view function:
function map(doc) {
if (doc.type == 'word-index') {
emit([doc.word, doc.page_number], null);
}
}
For reduce part just use the "_count" builtin.
Now to get the list of all the occurrences of word "Great" in your book, just query your view with startkey=["Great"] and endkey=["Great", {}]. Now the result would look somewhat like:
["Great", 45], 4
["Great", 70], 7
Which means that world "Great" appeared 4 times on page 45 and 7 times on page 70. You can extract your comma separated list you needed from it. The number of occurrences is a bonus.
--edit--
You also have to use group_level=2 in your query. If you don't the result of the query would simply be a single row with the count off all the documents you have.

Delphi - User specified string manipulation

I have a problem in Delphi7. My application creates mpg video files according to a set naming convention i.e.
\000_A_Title_YYYY-MM-DD_HH-mm-ss_Index.mpg
In this filename the following rules are enforced:
The 000 is the video sequence. It is incremented whenever the user presses stop.
The A (or B,C,D) specifies the recording camera - so video files are linked with up to four video streams all played simultaneously.
Title is a variable length string. In my application it cannot contain a _.
The YYYY-MM-DD_HH-mm-ss is the starting time of the video sequence (not the single file)
The Index is the zero based ordering index and is incremented within 1 video sequence. That is, video files are a maximum of 15 minutes long, once this is reached a new video file is started with the same sequence number but next index. Using this, we can calculate the actual start time of the file (Filename decoded time + 15*Index)
Using this method my application can extract the starting time that the video file started recording.
Now we have a further requirement to handle arbitrarily named video files. The only thing i know for certain is there will be a YYYY-MM-DD HH-mm-ss somewhere in the filename.
How can i allow the user to specify the filename convention for the files he is importing? Something like Regular expressions? I understand there must be a pattern to the naming scheme.
So if the user inputs ?_(Camera)_*_YYYY-MM-DD_HH-mm-ss_(Index).mpg into a text box, how would i go about getting the start time? Is there a better solution? Or do i just have to handle every single possibility as we come accross them?
(I know this is probably not the best way to handle such a problem, but we cannot change the issue - the new video files are recorded by another company)
I'm not sure if your trying to parse the user input into components '?(Camera)*_YYYY-MM-DD_HH-mm-ss_(Index).mpg` but if your just trying to grab the date and time something like this, the date is in group 1, time in group 2
(\d{4}-\d{2}-\d{2})_(d{2}-\d{2}-\d{2})
Otherwise, not sure what your trying to do.
Possibly you can use the underscores "_" as your positional indicator since you smartly don't allow them in the title.
In your example of a filename convention:
?_(Camera)_*_YYYY-MM-DD_HH-mm-ss_(Index).mpg
you can parse this user-specified string to see that the date YYYY-MM-DD is always between the 3rd and 4th underscore and the time HH-mm-ss is between the 4th and 5th.
Then it becomes a simple matter when getting the actual filenames following this convention, to find the 3rd underscore and know the date and time follow it.
If you want phone-calls 24/7, then you should go for the RegEx-thing and let the user freely enter some cryptography in a TEdit.
If you want happy users and a good night sleep, then be creative and drop the boring RegEx-approach. Create your own filename-decoder by using an Angry bird approach.
Here's the idea:
Create some birds with different string manipulation personalities.
Let the user select and arrange these birds.
Execute the user generated string manipulation.
Sample code:
program AngryBirdFilenameDecoder;
{$APPTYPE CONSOLE}
uses
SysUtils;
procedure PerformEatUntilDash(var aStr: String);
begin
if Pos('-', aStr) > 0 then
Delete(aStr, 1, Pos('-', aStr));
WriteLn(':-{ > ' + aStr);
end;
procedure PerformEatUntilUnderscore(var aStr: String);
begin
if Pos('_', aStr) > 0 then
Delete(aStr, 1, Pos('_', aStr));
WriteLn(':-/ > ' + aStr);
end;
function FetchDate(var aStr: String): String;
begin
Result := Copy(aStr, 1, 10);
Delete(aStr, 1, 10);
WriteLn(':-) > ' + aStr);
end;
var
i: Integer;
FileName: String;
TempFileName: String;
SelectedBirds: String;
MyDate: String;
begin
Write('Enter a filename to decode (eg. ''01-ThisIsAText-Img_01-Date_2011-03-08.png''): ');
ReadLn(FileName);
if FileName = '' then
FileName := '01-ThisIsAText-Img_01-Date_2011-03-08.png';
repeat
TempFileName := FileName;
WriteLn('Now, select some birds:');
WriteLn('Bird No.1 :-{ ==> I''ll eat letters until I find a dash (-)');
WriteLn('Bird No.2 :-/ ==> I''ll eat letters until I find a underscore (_)');
WriteLn('Bird No.3 :-) ==> I''ll remember the date before I eat it');
WriteLn;
Write('Chose your birds: (eg. 112123):');
ReadLn(SelectedBirds);
if SelectedBirds = '' then
SelectedBirds := '112123';
for i := 1 to Length(SelectedBirds) do
case SelectedBirds[i] of
'1': PerformEatUntilDash(TempFileName);
'2': PerformEatUntilUnderscore(TempFileName);
'3': MyDate := FetchDate(TempFileName);
end;
WriteLn('Bird No.3 found this date: ' + MyDate);
WriteLn;
WriteLn;
Write('Check filename with some other birds? (Y/N): ');
ReadLn(SelectedBirds);
until (Length(SelectedBirds)=0) or (Uppercase(SelectedBirds[1])<>'Y');
end.
When you'll do this in Delphi with GUI, you'll add more birds and more checking of course. And find some nice bird glyphs.
Use two list boxes. One one the left with all possible birds, and one on the right with all the selected birds. Drag'n'drop birds from left to right. Rearrange (and remove) birds in the list on the right.
The user should be able to test the setup by entering a filename and see the result of the process. Internally you store the script by using enumerators etc.