Lucene Syntax For More Complicated Queries - regex

I am developing a website for my company that allows users to query a database in order to get the information they need.
The users are accustomed to a particular query format, and I don't want to force them to change the way they work. Therefore, I need to convert their queries to Lucene's query syntax.
There are some cases where I'm not sure of the best way to implement them in Lucene syntax; I was wondering whether you have better ideas:
"Current Query" : serverRole=~'(ServerOne|ServerTwo|ServerThree)'
"Lucene Suggested": (serverRole:*ServerOne* OR serverRole:*ServerTwo* OR serverRole:*ServerThree*)
Take into account that I'm using regex to convert these queries, so one of the difficulties I'm facing, for example, is how to do it when the number of elements (ServerOne|ServerTwo|ServerThree...) is dynamic:
luceneQuery = currentQuery
    // handles exactly two alternatives; a dynamic version is sketched below
    .replace(/(\w+)(==~|=~)('|")\(?([a-zA-Z0-9]+)\|([a-zA-Z0-9]+)\)?\3/g,
             '($1:*$4* OR $1:*$5*)')
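For a dynamic number of alternatives, one option is to switch from a fixed replacement string to a replacement callback that splits the alternation body itself. A minimal sketch in Python's re (the same idea works with a function replacement in JavaScript); it assumes single-quoted, parenthesised values as in the example above:

import re

def alternation_to_lucene(match):
    field, body = match.group(1), match.group(3)
    # Split on '|' so any number of alternatives works
    clauses = ' OR '.join('{}:*{}*'.format(field, value) for value in body.split('|'))
    return '({})'.format(clauses)

current = "serverRole=~'(ServerOne|ServerTwo|ServerThree)'"
lucene = re.sub(r"(\w+)(==~|=~)'\(([^)]+)\)'", alternation_to_lucene, current)
print(lucene)  # (serverRole:*ServerOne* OR serverRole:*ServerTwo* OR serverRole:*ServerThree*)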
Another query, for example:
"Current Query" : OS=~'SLES1[12]'
"Lucene Suggested": (OS:*SLES11* OR OS:*SLES12*)

I would recommend looking at Lucene's BooleanQuery to create more complex queries. Wildcard, Term, and Fuzzy queries can all be combined by using the Occur parameter while you build your query. As an example:
// Each wildcard query matches one alternative; Occur.SHOULD makes the
// clauses behave like OR, and more can be added in a loop for a dynamic list
Query query1 = new WildcardQuery(new Term("contents", "*ServerOne*"));
Query query2 = new WildcardQuery(new Term("contents", "*ServerTwo*"));
BooleanQuery booleanQuery = new BooleanQuery.Builder()
    .add(query1, BooleanClause.Occur.SHOULD)
    .add(query2, BooleanClause.Occur.SHOULD)
    .build();
There are also regexp queries (RegexpQuery) you can run directly, but when your indexed field is complicated, finding a regex match can take time.

Related

Bigquery struct introspection

Is there a way to get the element types of a struct? For example something along the lines of:
SELECT #TYPE(structField.y)
SELECT #TYPE(structField)
...etc
Is that possible to do? The closest I can find is via the query editor and the web call it makes to validate a query.
As I mentioned already in the comments, one option is to mimic that same dry-run call, with the query built in such a way that it fails with an error message containing exactly the information you are looking for. Obviously this assumes your use case can be implemented in whatever scripting language you prefer; it should be relatively easy to do.
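For illustration, a minimal sketch of that option with the google-cloud-bigquery Python client; the table name and the deliberately failing expression are hypothetical:

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()
# Dry run: the query is only validated, nothing is executed or billed
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Concatenating '' onto a non-STRING field fails validation, and the error
# message names the actual type of structField.y (hypothetical names here)
sql = "SELECT structField.y || '' FROM `your-project.your-dataset.your-table`"
try:
    client.query(sql, job_config=job_config)
except BadRequest as error:
    print(error.message)  # e.g. mentions INT64 if structField.y is an integer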
Meantime, I was looking at doing this within the SQL query itself.
Below is an example of another option.
It is limited to the types below, which may or may not fit your particular use case:
object, array, string, number, boolean, null
So the example is:
select
  s.birthdate, json_type(to_json(s.birthdate)),
  s.country, json_type(to_json(s.country)),
  s.age, json_type(to_json(s.age)),
  s.weight, json_type(to_json(s.weight)),
  s.is_this, json_type(to_json(s.is_this))
from (
  select struct(date '2022-01-01' as birthdate, 'UA' as country, 1 as age, 2.5 as weight, true as is_this) s
)
with output showing the JSON type next to each field.
You can try the below approach.
SELECT COLUMN_NAME, DATA_TYPE
FROM `your-project.your-dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE TABLE_NAME = 'your-table-name'
AND COLUMN_NAME = 'your-struct-column-name'
ORDER BY ORDINAL_POSITION
You can check the INFORMATION_SCHEMA documentation for BigQuery for more details.

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for an e-commerce site.
One index will store products; in the products index I'll store the categories along with the other attributes. A product can have multiple categories, but each attribute will have a single field value (e.g. color).
Let's say a user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a request body search.
I have thought of the following options:
Applying a regex to the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities)
Using the OpenNLP extension (but this one only works at indexing time; in the above scenario we want it on the query side)
Using the NER of any good NLP framework (this is not time- and cost-effective, because I'll have millions of products in the engine and they get updated/added frequently)
What's the best way to solve the above issue?
Edit:
I found a couple of libraries that allow fuzzy text matching in regex. But there will be many entities to find, so what's the best way to optimise that?
I'm still not sure about OpenNLP.
NER won't work in this case, because there is a fixed set of entities, so the prediction is not right when no entity is present in the query.
If you cannot achieve the desired results by tuning ElasticSearch's built-in scoring/boosting, most likely you'll need some kind of 'natural language query' processing:
Tokenize the free-form query. A regex can be used for splitting lexemes, but very often it is better to write a custom tokenizer for this.
Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name), etc. In fact you don't need OpenNLP for this, as it should be just an index (keyword -> field(s)); you can also try to use ElasticSearch's 'suggest' API for this purpose.
(optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
Generate the possible combinations of matches, and with the help of a special scoring function determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or it may be the result of a machine learning algorithm.
From the recognition result (match metadata), produce a formal query that produces the search results - this may be an ElasticSearch query with field hints, or even a SQL query.
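For illustration, a minimal Python sketch of the first two steps, assuming a hand-built keyword -> field(s) index (in practice this mapping would be generated from the catalog, or served by ElasticSearch's 'suggest' API):

import re

# Hypothetical keyword -> candidate field(s) index built from the catalog
KEYWORD_FIELDS = {
    'black': ['color', 'product_name'],
    'nike': ['brand'],
    'shoes': ['category'],
}

def recognize(query):
    # Step 1: tokenize the free-form query
    tokens = re.findall(r"[a-z0-9$']+", query.lower())
    # Step 2: associate each token with its candidate fields;
    # unknown tokens fall back to a plain full-text match
    return [(token, KEYWORD_FIELDS.get(token, ['full_text'])) for token in tokens]

print(recognize('Black Nike shoes'))
# [('black', ['color', 'product_name']), ('nike', ['brand']), ('shoes', ['category'])]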
In general, efficient NLQ processing needs significant development effort - I don't recommend implementing it from scratch unless you have enough resources & time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely this will be a commercial product (I don't know of any good free/open-source NLQ components that are really ready for production use).
I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach for this problem would be as follows:
Create an annotated dataset of queries, with each word tagged to one of the tags, say {color, brand, category}.
Train a NER model (CRFs/LSTMs).
"This is not time & cost effective because I'll have millions of products in engine also they get updated/added on frequent basis"
To handle this situation, I suggest you don't use the words in the query as features, but rather use the attributes of the words as features. For example, create an indicator function f(x', y) for word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function would be:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word previous to 'blue' is in the `product attribute` column of the DB, and y is `color`; else 0.
Create lots of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model using CRFs or LSTMs. Finally, we use the Viterbi algorithm to find the best tagging sequence for your query. For CRFs you can use packages like CRFSuite or CRF++; with these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use the model to predict the best sequence for your queries. CRFs are very fast.
This way of training, without using vector representations of the words, will let your model generalise without the need for retraining. [Look at NER using CRFs]
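For illustration, a minimal sketch of such attribute-based features with sklearn-crfsuite; the COLORS/BRANDS sets are hypothetical stand-ins for the DB attribute columns:

import sklearn_crfsuite

# Hypothetical stand-ins for the DB attribute columns
COLORS = {'black', 'blue'}
BRANDS = {'nike', 'adidas'}

def word_features(tokens, i):
    word = tokens[i].lower()
    # Attributes of the word (DB membership), not the word itself,
    # so new products don't require retraining the model
    return {
        'in_color_column': word in COLORS,
        'in_brand_column': word in BRANDS,
        'prev_in_brand_column': i > 0 and tokens[i - 1].lower() in BRANDS,
        'is_last_token': i == len(tokens) - 1,
    }

# One annotated query; a real training set would contain many
X = [[word_features(['Black', 'Nike', 'shoes'], i) for i in range(3)]]
y = [['color', 'brand', 'category']]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X, y)
# predict() runs Viterbi decoding internally to find the best tag sequence
print(crf.predict([[word_features(['Blue', 'Adidas', 'shoes'], i) for i in range(3)]]))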

Convert Django ORM query with large IN clause to table value constructor

I have a bit of Django code that builds a relatively complicated query programmatically, with various filters getting applied to an initial dataset through a series of filter and exclude calls:
for filter in filters:
    if filter['name'] == 'revenue':
        accounts = accounts.filter(account_revenue__in=filter['values'])
    if filter['name'] == 'some_other_type':
        if filter['type'] == 'inclusion':
            accounts = accounts.filter(account__some_relation__in=filter['values'])
        if filter['type'] == 'exclusion':
            accounts = accounts.exclude(account__some_relation__in=filter['values'])
    ...etc
return accounts
For most of these conditions, the possible values of the filters are relatively small and contained, so the IN clauses that Django's ORM generates are performant enough. However there are a few cases where the IN clauses can be much larger (10K - 100K items).
In plain Postgres I can make this query much faster by using a table value constructor, e.g.:
SELECT domain
FROM accounts
INNER JOIN (
  VALUES ('somedomain.com'), ('anotherdomain.com'), ...etc 10K more times
) VALS(v) ON accounts.domain = v
With a 30K+ item IN clause, the original query can take 60+ seconds to run, while the table value version takes 1 second - a huge difference.
But I cannot figure out how to get the Django ORM to build the query the way I want, and because of the way all the other filters are constructed from ORM filters, I can't really write the entire thing as raw SQL.
I was thinking I could take the raw SQL that Django's ORM is going to run and regexp-parse it, but that seems very brittle (and it's surprisingly difficult to get the actual SQL that is about to be run, because of parameter handling etc.). I don't see how I could annotate with RawSQL, since I don't want to add a column to the select but instead want to add a join condition. Is there a simple solution I am missing?
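For what it's worth, a sketch of one way to get the parametrized SQL without the parameter-handling guesswork, via Django's compiler API (internal and version-dependent):

# sql contains %s placeholders; params holds the IN-clause values separately,
# so nothing needs to be re-escaped or regexp-extracted from a formatted string
sql, params = accounts.query.get_compiler(using='default').as_sql()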

How to order django query set filtered using '__icontains' such that the exactly matched result comes first

I am writing a simple app in Django that searches for records in a database.
A user inputs a name in the search field, and that query is used to filter records on a particular field, like:
result = Users.objects.filter(name__icontains=query_from_searchbox)
E.g.:
The database consists of the names Shiv, Shivam, Shivendra, Kashiva, Varun, etc.
A search query 'shiv' returns records in the following order:
Kashiva, Shivam, Shiv and Shivendra
ordered by primary key.
My question is: how can I achieve the order
Shiv, Shivam, Shivendra and Kashiva,
i.e. the most relevant results first, then the less relevant ones?
It's not possible to do that with standard Django, as that type of thing is outside its scope & specific to a search app.
When you're interacting with the ORM, consider what you're actually doing with the database - it's all just SQL queries.
If you wanted to rearrange the results, you'd have to manipulate the queryset: check for exact matches first, then use regular expressions to check for partial matches.
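For illustration, a minimal sketch of that kind of rearranging in Python, using substring positions rather than regular expressions for brevity:

results = Users.objects.filter(name__icontains=query_from_searchbox)
q = query_from_searchbox.lower()
# Note: this sorts in Python, so it loads the full result set into memory
ranked = sorted(
    results,
    key=lambda user: (user.name.lower() != q,      # exact matches first
                      user.name.lower().find(q)),  # then earlier occurrences first
)
# For 'shiv' this yields Shiv, Shivam, Shivendra, Kashiva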
Search isn't really the kind of thing best suited to the ORM, however, so you may wish to consider dedicated search applications. They usually maintain an index, which avoids database hits, and may also offer the kind of percentage-match ordering you're looking for.
A good place to start may be with Haystack

Group by date field in solrj

I want to group the output I am getting by date, but I am storing the data in Solr using a datetime type. The date format I am using is:
Date format :: "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
E.g., a date is stored in Solr as "2013-03-01T20:56:45.000+00:00".
What I want as output is a count of dates, e.g.:
Date1:: "2013-03-01T20:56:45.000+00:00"
Date2:: "2013-03-01T21:56:45.000+00:00"
Date3:: "2013-03-01T22:56:45.000+00:00"
Date4:: "2013-03-02T22:56:45.000+00:00"
Date5:: "2013-03-02T23:56:45.000+00:00"
So I want the output as two columns:
Date Count
2013-03-01 3
2013-03-02 2
Here is the code I am using:
String url = "http://192.168.0.4:8983/solr";
SolrServer server = new HttpSolrServer(url);
SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addFilterQuery("sessionStartTime:[2013-03-01T00:00:00Z TO 2013-03-04T24:00:00Z]");
query.add("group", "true");
query.add("group.field","uniqueId"); // uniqueId is grouping the data
query.add("group.main","true");
query.setRows(9999);
QueryResponse rs=server.query(query);
Iterator<SolrDocument> iter = rs.getResults().iterator();
Help is appreciated.
I know that this is an older question, but I am working on something related to it, so I thought I would share my solution. Since you are using grouping, rs.getResults() will likely be null. After reading through the SolrJ API and doing some testing on my end, you will find that the results are indeed grouped the way you want them to be. To access them, create a variable like this:
List<Group> groupedData = rs.getGroupResponse().getValues().get(0).getValues();
Note that Group is the class org.apache.solr.client.solrj.response.Group.
Then, iterate through groupedData, using groupedData.get(i).getResult() to get a SolrDocumentList of results for each grouped value. In your example (assuming the data is ordered as you said it would be), groupedData.get(0) would give you a SolrDocumentList of the three matches that have the date 2013-03-01.
I understand that this is quite a chain of method calls, but it does end up getting you the results. If anyone knows a faster way to get at the data, please feel free to let me know, as I would like to know as well.
Refer to the API documentation for GroupResponse for more information.
Note that this answer is based on Solr 5.4.0.
The output you are trying to achieve is, I believe, better suited to faceting than grouping. Check out the documentation on date faceting specifically; SolrJ fully supports faceting - see SolrJ - Advanced Usage. For an introduction to faceting I would recommend reading Faceted Search with Solr.
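For illustration, a sketch of an equivalent date-range facet request over plain HTTP in Python; the same parameters map onto query.add(...) calls in SolrJ, and the core name collection1 is an assumption:

import requests

# facet.range groups documents into day-sized buckets server-side,
# instead of fetching rows and counting on the client
params = {
    'q': '*:*',
    'rows': 0,
    'wt': 'json',
    'facet': 'true',
    'facet.range': 'sessionStartTime',
    'facet.range.start': '2013-03-01T00:00:00Z',
    'facet.range.end': '2013-03-05T00:00:00Z',
    'facet.range.gap': '+1DAY',
}
response = requests.get('http://192.168.0.4:8983/solr/collection1/select', params=params)
counts = response.json()['facet_counts']['facet_ranges']['sessionStartTime']['counts']
print(counts)  # flat list alternating bucket start and count, e.g. ['2013-03-01T00:00:00Z', 3, ...]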