Amazon CloudSearch matching long strings against domain documents - amazon-web-services

I'm implementing Amazon's CloudSearch API and was wondering how well it would work for "vague" queries.
We basically have records that contain descriptions. We want to find matches based on the content of that description. For example, our domain dataset has the following strings (where each string is a different document):
"The sun is shining bright today"
"The moon is shining in the sky tonight"
"The rain is pouring outside today"
If I were to submit a description to the server like this:
"The sun and moon are shining bright lately"
is there a search method that would return a match for the first two documents (albeit with a low score)? There are keywords that are important, ignoring the "the" and "is" type of words. If so, how is that search constructed?

I was eventually able to get those strings to be returned with a query based on "The sun and moon are shining bright lately". I accomplished this by boolean OR'ing the terms together like this:
(or name:'sun' name:'moon' name:'shining' name:'bright' name:'lately')
I also removed stop words, but I don't think you need to.
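For reference, here's a minimal Java sketch of how such a query string could be assembled; the stop word list is a placeholder and the 'name' field comes from my example above, so adjust both for your schema:
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

public class OrQueryBuilder {
    // A small stop word list for illustration; use whatever list fits your data.
    private static final Set<String> STOP_WORDS =
            new LinkedHashSet<>(Arrays.asList("the", "is", "are", "and", "a", "an", "in", "on"));

    // Builds a CloudSearch structured query that ORs each remaining term
    // against the 'name' field, e.g. (or name:'sun' name:'moon' ...)
    static String buildOrQuery(String description) {
        StringJoiner terms = new StringJoiner(" ", "(or ", ")");
        for (String word : description.toLowerCase().split("\\W+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                terms.add("name:'" + word + "'");
            }
        }
        return terms.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildOrQuery("The sun and moon are shining bright lately"));
        // -> (or name:'sun' name:'moon' name:'shining' name:'bright' name:'lately')
    }
}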
It's definitely not pretty. The problem I had with other approaches was that CloudSearch seems pretty heavy-handed about penalizing results that don't contain a word from your query, so a word like "lately" in the query would cause it to match none of the test strings. I was hoping to fix that with a rank expression, but I think you can only rank results, not docs that didn't even match your query.
I also played around with sloppy phrase search but that still requires that the words are found some distance from each other, where in this case certain words aren't found at all.
The only other thing I can think to try is looking at the Lucene and DisMax query parsers. They won't change the underlying search engine, but they may give you a different means of specifying a query that would work better.

Related

Regex - How can you identify strings which are not words?

I've got an interesting one and can't come up with any solid ideas, so I thought someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Annoyingly, everything is in lowercase, which makes it more difficult, but since I only care about English, I'm essentially looking for clusters of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try to pass every single word inside the sentence to a function that checks whether the word is listed in a dictionary. There is a good number of dictionary text files on GitHub. To speed up the process: use a hash map :)
You could also use an auto-correction API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and non-existent words; a rough sketch of the dictionary step follows.
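Here's a minimal Java sketch of that dictionary step, assuming a words.txt file (a placeholder for whichever word list you pick up from GitHub):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class DictionaryFilter {
    public static void main(String[] args) throws IOException {
        // Load the dictionary into a hash set for O(1) lookups.
        Set<String> dictionary = new HashSet<>(
                Files.readAllLines(Paths.get("words.txt"), StandardCharsets.UTF_8));

        String sentence = "this sentence contains kuashdixbkjshakd gibberish";
        StringJoiner cleaned = new StringJoiner(" ");
        for (String word : sentence.split("\\s+")) {
            if (dictionary.contains(word)) {   // keep only words found in the dictionary
                cleaned.add(word);
            }
        }
        System.out.println(cleaned);
    }
}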
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia), but it's hard to predict exactly how precise this will be.
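A bare-bones Java sketch of that idea with a fixed window of 3 (this is just the gist, not SpamAssassin's actual logic, and the tiny training corpus is for illustration only):
import java.util.HashSet;
import java.util.Set;

public class TrigramFlagger {
    private final Set<String> seen = new HashSet<>();

    // Record every 3-character window from the training text.
    void train(String text) {
        String t = text.toLowerCase();
        for (int i = 0; i + 3 <= t.length(); i++) {
            seen.add(t.substring(i, i + 3));
        }
    }

    // A word is suspicious if it contains any trigram never seen in training.
    boolean looksSuspicious(String word) {
        String w = word.toLowerCase();
        for (int i = 0; i + 3 <= w.length(); i++) {
            if (!seen.contains(w.substring(i, i + 3))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TrigramFlagger model = new TrigramFlagger();
        model.train("abracadabra");                         // tiny corpus; use real text in practice
        System.out.println(model.looksSuspicious("abra"));  // false: all its trigrams were seen
        System.out.println(model.looksSuspicious("zzzqx")); // true: contains unseen trigrams
    }
}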
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover at least all the 3-grams you find in your training data.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives, e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag; I would think that GmbH should be okay even if you didn't train on German, but an unrecognized string of 7 or more letters should probably be flagged and manually inspected.
This will match words containing a run of more than 5 consecutive consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
5 was chosen because I believe "witchcraft" has the longest run of consecutive consonants of any English word. You could dial the "6" in the regex back to 5 or even 4 if you don't mind matching some outliers.
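The class intersection [b-z&&[^aeiouy]] is Java regex syntax, so here's a small sketch of applying the pattern (the sample words are just for illustration):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsonantRun {
    public static void main(String[] args) {
        // 6+ consecutive consonants, with y treated as a vowel.
        Pattern p = Pattern.compile("\\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\\b");

        for (String word : new String[]{"witchcraft", "kuashdixbkjshakd"}) {
            Matcher m = p.matcher(word);
            System.out.println(word + " -> " + (m.find() ? "flagged" : "ok"));
        }
        // witchcraft -> ok (only 5 consonants in a row)
        // kuashdixbkjshakd -> flagged (x-b-k-j-s-h)
    }
}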

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at least three and no more than nine characters long, the alphabetic and numeric portions of the string are treated as separate tokens. For example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric suffix separately from the alphabetic prefix, but that's additional work that may be unnecessary.
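If you go that route, the split itself is easy; something like this at document-upload time (sid_alpha and sid_num are hypothetical field names you would add to your indexing options):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SidSplitter {
    // Two-letter code followed by digits, e.g. LC12345.
    private static final Pattern SID = Pattern.compile("([A-Za-z]+)(\\d+)");

    public static void main(String[] args) {
        Matcher m = SID.matcher("LC12345");
        if (m.matches()) {
            String sidAlpha = m.group(1); // "LC"    -> index into a sid_alpha field
            String sidNum   = m.group(2); // "12345" -> index into a sid_num field
            System.out.println(sidAlpha + " / " + sidNum);
        }
    }
}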
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stopwords in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
The word "jo", for example, is a stop word in Danish and other languages, and each supported language has a dictionary of very common stop words. If you don't specify a language for your text field, it defaults to English. You can see the lists here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server, and some web services require an input ID called a "CUID" (Cluster Unique ID). For example, there's a web service getObjectById which requires a CUID as input.
I'm trying to make my code more robust by checking whether the CUID entered by a user makes sense, but I can't find a regular expression that properly describes what a CUID looks like. There is a lot of documentation for GUIDs, but they're not the same thing. Below are some examples of CUIDs found in our system; they look well-formatted, but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would be: start with a capital A, then add 22 random characters in the range [0-9A-Za-Z_.]. But perhaps the A means something else, and after a while it would roll over to B...
Is anyone familiar with this type of ID and how it is formatted?
(quick side question: do I need to escape the "dot" inside the square brackets, like \., to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting of up to 35 characters and beginning with the letters A, B, C, F, k, and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regex to match these would be [a-zA-Z0-9._]+ (as for the side question: inside a character class the dot is already literal, so escaping it is harmless but unnecessary).
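As a quick Java sketch of validating against that (deliberately loose) pattern:
import java.util.regex.Pattern;

public class CuidCheck {
    // Loose match: alphanumeric plus '.' and '_'. Inside [...] the dot is
    // literal, so "." and "\\." behave the same here.
    private static final Pattern CUID = Pattern.compile("[a-zA-Z0-9._]+");

    public static void main(String[] args) {
        System.out.println(CUID.matcher("AQA9CNo0cXNLt6sZp5Uc5P0").matches()); // true
        System.out.println(CUID.matcher("not a cuid!").matches());             // false
    }
}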

Validate Proper Names (with Perl)

I have a census list of 150k last names and am trying to use it to validate the spelling of person names in an existing database.
Obviously there are many ethnic names in my database that don't match the census list, but are clearly not misspelled (Italian names like "Petroni", Swedish names like "Magnusdotter").
I would like to create a function (in Perl) to detect slight variations, i.e. likely misspellings, between names in the database and other very popular names in the census list (a frequency number is available).
I can imagine the algorithm, but before I dive in: any suggestions for doing this in a reliable way, i.e. one that doesn't throw too many false positives?
Thanks!!
Essentially, you're writing a spell checker. You may want to look into an Open Source, multi-lingual spell checker such as Aspell and see what they do. You might even be able to implement what you want as an aspell dictionary.
There are many algorithms for doing approximate string matching. The Levenshtein distance between words is one algorithm, and there are several Perl modules to calculate it, but Text::Fuzzy looks pretty good.
That's great for comparing a few words, but here you have to compare against 150k names. You could just see if it's fast enough, and try caching the results, but it remains an O(n) algorithm. Instead (or in addition), you can create an index using a phonetic matching algorithm. These generally index words by what they sound like, to allow matching on misspelled words. Once you've generated the index for each word, you can match a new word against the index very quickly. Obviously this is subject to cultural ideas of what words sound like, which is why there are many algorithms, each with different optimizations. You can create several indexes using different algorithms and try them all.
You can even combine the two and do approximate string matching on the phonetic indexes.
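To make the edit-distance half concrete, here's a self-contained sketch (in Java rather than Perl, hand-rolled instead of Text::Fuzzy; the threshold of 1 edit and the sample names are purely illustrative):
public class NameChecker {
    // Classic dynamic-programming Levenshtein distance, two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[] census = {"Peterson", "Magnuson", "Smith"};
        String candidate = "Petersen";
        for (String name : census) {
            if (levenshtein(candidate, name) <= 1) { // threshold: 1 edit
                System.out.println(candidate + " may be a misspelling of " + name);
            }
        }
    }
}
In practice you would run this only against the handful of candidates returned by the phonetic index, not against all 150k names.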

Is there a way to search terms in order with RegexpQuery in lucene?

I would like to search my indexed documents in order using RegexpQuery.
For example, I have 2 documents:
text: Oracle unveils better than expected quarterly results.
text: Research In Motion shares gained almost 13 per cent on the Toronto Stock Exchange Friday, a day after the smartphone maker posted better than expected quarterly results.
So far I have tried this, but with no luck:
Query regexq = new RegexpQuery(new Term("text", "^.+better.+quarterly.+results"));
Is there another way of implementing this?
Thanks
I believe a PhraseQuery fits what you are looking for better. You can use PhraseQuery.setSlop(int) to allow terms to appear between the terms of the query. This would look like:
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("text", "better"));
pq.add(new Term("text", "quarterly"));
pq.add(new Term("text", "results"));
pq.setSlop(10); //Or whatever is an appropriate slop value for you.
This sort of query is also supported by the standard QueryParser:
text:"better quarterly results"~10
I think a PhraseQuery is most definitely the better implementation here, but...
Regarding RegexpQuery:
I believe it is intended to compare individual terms against the regex, and since the phrase you are searching for is (I am assuming) tokenized, no single term matches your whole regex. You would need to index the entire field as a single term to make this work, using a StringField, the KeywordAnalyzer, or similar.
I believe it works like Matcher.matches() rather than Matcher.find(), which is to say it must match the entire input term rather than a portion of it. So, if you had indexed "text" as a StringField, you would need to add a .* to the end to consume the rest of the input.
On a similar note, I'm not sure whether it supports the character "^" as start-of-input, since it would be redundant in that case. I don't see it specified in Lucene's Regexp syntax, but I have seen references to its use, so I'm not sure whether it would be accepted or not.
To summarize, a RegexpQuery could work like:
Query regexq = new RegexpQuery(new Term("text", ".+better.+quarterly.+results.*"));
That is, provided you used a StringField or the KeywordAnalyzer to index the entire field as a single term.
With the leading wildcard in your regexp, though, you could expect very poor performance from it (See the warning at the top of the RegexpQuery documentation).