I would like to set up Elasticsearch on a table "content" which also has a translation table "content_translation" for localization purposes (Globalize gem). We have 10 languages.
I want to implement Elasticsearch for my data model. In SQL I would search like:
SELECT id, content_translation.content
FROM content
LEFT JOIN content_translation on content.id = content_translation.content_id
WHERE content_translation.content LIKE '%???????%'
I wonder what is the best strategy to do a "left join"-like search with Elasticsearch?
Should I create just a "content" index with all the translation data in it?
{"id":21, "translations":{"en":{"content":"Lorem..."}, "de":{"content":"Lorem..."} ..}
Should I create "content_translation" index and just filter results for specific locale?
{"content_id":21, "locale":"en", "content": "Lorem ..."}
Are there any good practices for how to do this?
Should I take care of maintaining the index myself, or should I use something like the Tire gem, which takes care of indexing by itself?
I would recommend the second alternative (one document per language), assuming that you wouldn't need to show content from multiple languages.
i.e.
{"content_id":21, "locale":"en", "content": "Lorem ..."}
I recommend a gem like Tire; exploit its DSL to your advantage.
You could have your content model look like:
class Content < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
  ...
end
Then you could have a search method that does something like:
Content.search do
  query do
    ...
  end
  filter :terms, :locale => [I18n.locale.to_s]
end
Your application would need to keep track of the locale at all times to serve the respective localized content. You could just use I18n's locale to look up the data, pass it in as a filter, and you get the separation you wish. Bonus: you get fallback behaviour for free if you have enabled it in Rails i18n.
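For reference, the raw query behind that DSL is roughly the following, sketched here with the elasticsearch-py client against the hypothetical index and field names from above (the Tire-era server expressed this as a "filtered" query, but the idea is identical):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Full-text search on "content", restricted to the current locale
result = es.search(index="content_translation", body={
    "query": {
        "bool": {
            "must": {"match": {"content": "lorem"}},
            "filter": {"term": {"locale": "en"}}
        }
    }
})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["content_id"], hit["_source"]["content"])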
However, if you have a use case where you need to show multi-lingual content side by side, then this fails and you could look at a single document holding all language content.
I'm working on an NLP/NER script using transformers/BERT and I'm having an issue extracting the name of a company from a set of texts.
In all the texts the script will be used on, the company's name will be presented like this:
"COMPANY NAME: the company's name is XXX"
or
"NAME: the company's name is XXX"
this is my code:
import re

def get_company_info(text, tokenizer_1, model_1, tokenizer_2, model_2):
    company_info = {"name": None}
    try:
        # Find the first occurrence of "name" (case-insensitive)
        start_company_index = re.search('name', text, re.I).span()[0]
        # Run NER over a 100-character window starting there
        info = NLP_2(
            text[start_company_index:start_company_index + 100], tokenizer_2, model_2)
        for data in info:
            if data['entity_group'] == 'ORG':
                company_info['name'] = data['word']
                break
    except AttributeError:
        # re.search returned None: "name" does not occur in the text
        pass
    return company_info
However, the script returns the word "company", since BERT finds it in the text and correctly assumes it is the subject I'm looking for, but I want to extract the actual name of the company instead.
Is there a simple way to avoid this or do I have to fine-tune the model?
I'm using a regex to delimit the search window, but I cannot simply use re.search("company") to start the search after the word "company", because sometimes there will be two consecutive mentions of the word.
You cannot avoid that from a model perspective, unfortunately.
You have to:
Either, as suggested in your comments, perform some regex parsing or try to use another type of logic to eliminate paragraph titles (if it suits you).
Retrain on your own dataset, giving many examples of situations like the one above, so that the network would eventually learn to distinguish them and not detect "COMPANY NAME" or other similar strings as entities.
For (1), things can get complicated, since what you gave here is just one instance where the network fails; it may very well be the case that, as you see more documents, you discover more error-prone situations, and the post-processing becomes more and more difficult.
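As a rough illustration of option (1), here is a minimal post-processing sketch. It assumes the Hugging Face pipeline output format (dicts with entity_group and word keys, i.e. aggregation_strategy="simple"), and the blocklist of generic words is purely hypothetical:

import re

# Hypothetical blocklist of generic words the model tends to mis-tag as ORG
GENERIC_WORDS = {"company", "name", "company name"}

def extract_company_name(text, ner_pipeline):
    # Skip past the "COMPANY NAME: the company's name is" boilerplate by
    # searching from the last occurrence of "name is" onwards
    matches = list(re.finditer(r"name\s+is", text, re.I))
    if not matches:
        return None
    window = text[matches[-1].end():matches[-1].end() + 100]
    for entity in ner_pipeline(window):
        word = entity["word"].strip()
        if entity["entity_group"] == "ORG" and word.lower() not in GENERIC_WORDS:
            return word
    return None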
For (2), you can actually start predicting on your new data, clean the output, and build up a cleaned dataset of your own for fine-tuning purposes.
One other approach would be to search for pre-trained BERT-style models that were pre-trained on specific corpora. For example, SciBERT is a language model pre-trained on scientific text, and, given that you would presumably be working with scientific texts, it would perform better than a basic BERT. However, I do not know whether you will find a model catering exactly to your needs, as in the above example.
I'm building an advanced search page for a scientific database using Django. The goal is to be able to allow some dynamically created sophisticated searches, joining groups of search terms with and & or.
I got part-way there by following this example, which allows multiple terms anded together. It's basically a single group of search terms, dynamically created, that I can either and-together or or-together. E.g.
<field1|field2> <is|is not|contains|doesn't contain> <search term> <->
<+>
...where <-> will remove a search term row and <+> will add a new row.
But I would like the user to be able to either add another search term row, or add an and-group and an or-group, so that I'd have something like:
<and-group|or-group> <->
<field1|field2> <is|is not|contains|doesn't contain> <search term> <->
<+term|+and-group|+or-group>
A user could then add terms or groups. The resulting search might end up like:
and-group
    compound is lysine
    or-group
        tissue is brain
        tissue is spleen
    feeding status is not fasted
Thus the resulting filter would be like the following.
Data.objects.filter(Q(compound="lysine") & (Q(tissue="brain") | Q(tissue="spleen")) & ~Q(feeding_status="fasted"))
Note - I'm not necessarily asking how to get the filter expression above correct; it's just the dynamic hierarchical construction component that I'm trying to figure out. Please excuse me if I got the Q and/or filter syntax wrong. I've made these queries before, but I'm still new to Django, so getting it right off the top of my head is pretty much guaranteed to be zero-chance. I also skipped the model relationships I spanned here, so let's assume these are all fields in the same model, for simplicity.
I'm not sure how I would dynamically add parentheses to the filter expression, but my current code could easily join individual Q expressions with and or or.
I'm also not sure how I could dynamically create a hierarchical form to create the sub-groups. I'm guessing any such solution would have to be a hack and that there are no established mechanisms for doing something like this...
Here's a screenshot example of what I've currently got working:
UPDATE:
I got really far following this example I found. I forked that fiddle and got this proof of concept working before incorporating it into my Django project:
http://jsfiddle.net/hepcat72/d42k38j1/18/
The console spits out exactly the object I want, and there are no errors. Clicking the search button triggers form validation: any field I leave empty causes a prompt to fill it in. Here's a demo gif:
Now I need to process the POST input to construct the query (which I think I can handle) and restore the form above the results - which I'm not quite sure how to accomplish - perhaps a recursive function in a custom tag?
Although, is there a way to snapshot the form and restore it when the results load below it? Or maybe have the results load in a different frame?
I don't know if I'm teaching a grandmother to suck eggs, but in case not, one of the features of the Python language may be useful.
foo(bar=27, baz=None)
can instead be coded
args = {}
a1, a2 = 'bar', 'baz'
args[a1] = 27
args[a2] = None
foo(**args)
so an arbitrary Q object specified by runtime keys and values can be constructed with q1 = Q(**args).
IIRC, q1 & q2 and q1 | q2 are themselves Q objects, so you can build up a filter of arbitrary complexity.
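To sketch how that composes into your and/or groups (the nested-dict encoding and every name below are hypothetical, just one way to represent the tree your form produces):

from functools import reduce
from operator import and_, or_

from django.db.models import Q

def build_q(node):
    # Group node: recursively build each child, then fold with & or |
    if node.get("type") in ("and", "or"):
        combiner = and_ if node["type"] == "and" else or_
        return reduce(combiner, (build_q(child) for child in node["children"]))
    # Leaf node: build a single Q from a runtime field/operator/value triple
    lookup = {"is": "", "contains": "__icontains"}[node["op"]]
    q = Q(**{node["field"] + lookup: node["value"]})
    return ~q if node.get("negate") else q

# The example query from the question, encoded as such a tree:
tree = {"type": "and", "children": [
    {"field": "compound", "op": "is", "value": "lysine"},
    {"type": "or", "children": [
        {"field": "tissue", "op": "is", "value": "brain"},
        {"field": "tissue", "op": "is", "value": "spleen"},
    ]},
    {"field": "feeding_status", "op": "is", "value": "fasted", "negate": True},
]}
# Data.objects.filter(build_q(tree))

The parentheses you were worried about come for free: each recursive call returns a single Q object, so the grouping is carried by the object tree rather than by the filter expression's syntax.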
I'll also include a mention of Django-filter, which is usually my answer to filtering questions like this one, but I suspect in this case you are after greater power than it easily provides. Basically, it will "and" together a list of filter conditions specified by the user. The built-in ones are simple .filter(key=value), but by adding code you can create custom filters with complex Q expressions related to a user-supplied value.
As for the forms, a Django form is a linear construct, and a formset is a list of similar forms. I think I might resort to JavaScript to build some sort of tree representing a complex query in the browser, and have the submit button encode it as JSON and return it through a single text field (or just pick it out of request.POST without using a form). There may be some JavaScript out there already written to do this, but I'm not aware of it. You'd need to be sure that malicious submission of field names and values you weren't expecting doesn't result in security issues. For a pure filtering operation, this basically amounts to being sure that the user is entitled to see all the data in the database table in any case.
There's a form JSONField in the Django PostgreSQL extensions which validates that user-supplied (or JavaScript-generated) text is indeed JSON, and supplies it to you as Python dicts and lists.
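A minimal sketch of that last idea (the form and field names are hypothetical; since Django 3.1, JSONField also lives in plain django.forms rather than only in the PostgreSQL extensions):

from django import forms  # on older versions: from django.contrib.postgres.forms import JSONField

class AdvancedSearchForm(forms.Form):
    # The JavaScript tree builder serialises the whole query into this one field
    query = forms.JSONField(widget=forms.HiddenInput)

# In the view:
#   form = AdvancedSearchForm(request.POST)
#   if form.is_valid():
#       tree = form.cleaned_data["query"]  # already plain dicts and lists
#       results = Data.objects.filter(build_q(tree))  # build_q from the sketch above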
I've been working with regular SQL databases and now want to start a new project using AWS services. I want the back-end data storage to be DynamoDB, and what I want to store is a tiered document, like an instruction booklet of all the programming tips I've learned, that can be pulled up and displayed via a React frontend.
So the data will be in a format like Python -> Classes -> General -> "Information on Classes Text Wall"
There will be more than one subdirectory at times.
Future plans would be to be able to add new subfolders, move data to different folders, "thumbs up" entries, and eventually support multiple accounts with read access to each other's data.
I know how to do this in a SQL DB, but have never used a NoSQL before and figured this would be a great starting spot.
I am also thinking about how to lay out the partition key, and while I doubt this side project would ever grow to more than one cluster, I know that with NoSQL you have to plan your layout ahead of time.
If NoSQL is just a horrible fit for this style of data, let me know as well. This is mostly for practice and to practice AWS systems.
DynamoDB is a key-value database with the option to add secondary indices. It's good for storing documents that don't require full-scan or aggregation queries. If you design your tiered-document application to show only one document at a time, then DynamoDB would be a good choice. You can put the documents in a structure like this:
DocumentTable:
{
    "title": "Python",
    "parent_document": "root",
    "child_documents": ["Classes", "Built In", ...],
    "content": "text"
}
Where:
parent_document - the "title" of the parent document; it may be empty for "Python" in your example, and would be "Python" for a document titled "Classes"
content - text or an unstructured document with notes, thumbs up, etc. This assumes you don't plan to execute conditional queries over it; otherwise you need a global secondary index. But as you won't have many documents, a full scan of the table won't take long.
You can also have another table holding a table of contents for a user's tiered document, which makes navigating over the documents easier; however, in that case you need to take care of keeping this table consistent.
Example:
ContentsTable:
{
    "user": ...,  -- primary key for this table in case you have many users
    "root": {
        "Python": {
            "Classes": {
                "General": [
                    "Information on Classes Text Wall"
                ]
            }
        }
    }
}
Where Python, Classes, General and Information on Classes Text Wall are keys for DocumentTable.title. You can also use something else instead of titles to keep the keys unique. DynamoDB's maximum item size is 400 KB, so this should be enough for a pretty large table of contents.
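For instance, writing and reading one node of the DocumentTable could look like this with boto3 (a sketch; the table and attribute names are the hypothetical ones from above, with "title" as the partition key):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("DocumentTable")  # assumes "title" is the partition key

# Store one node of the tiered document
table.put_item(Item={
    "title": "Classes",
    "parent_document": "Python",
    "child_documents": ["General"],
    "content": "Information on Classes Text Wall",
})

# Fetch a single document by its key; no scan needed
response = table.get_item(Key={"title": "Classes"})
print(response["Item"]["content"])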
We have a ColdFusion based site that involves a large number of 'document authors' who have little or no knowledge of HTML. The 'documents' they create consist of HTML stored in a table in the database. They use a CKEditor interface. The content that they create is output into a specific area of the page. The documents frequently contain technical terms that readers may not be familiar with, and we would like tooltips for those terms to show up automatically.
I and the other programmer want to have some code insert 'tooltip' code into the page based on a list of words in a table on our SQL server. The 'dictionary' table in our database has a unique ID, the word/phrase we will look for and a corresponding definition that would be displayed in the tooltip.
For instance, one of the words/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to see if certain conditions exist. Are the words within an anchor tag? If yes, is there already a title value for the tag (the title attribute is used to contain the info displayed in a tooltip)? If a title attribute exists, don't do anything. If the words are not in an anchor tag, then we would put anchor tags around the words along with a title containing the definition.
The tooltip code we use is via jQuery (http://jqueryui.com/tooltip/). It is quick and simple to use. We just need to figure out how to use it dynamically based on our dictionary table.
Do you have any suggestions of how to go about this?
I was hoping that jsoup might have a function I could use, but that doesn't seem to be the right technology for what I want to do; then again, I could be wrong and I am happy to be corrected!
We have a large number of these documents and so manually inserting and maintaining the tooltip code is just not an option.
Update your content with something like:
strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));
Since words are delimited by spaces, the values in qryTT.find need to include the spaces. The replace column is going to need to include some of the original content. You are also going to have to be careful with words followed by a comma or a period.
I would cache the results because I would expect it to be memory intensive.
I'm having an issue with querying an index where a common search term also happens to be part of a company name interspersed throughout most of the documents. How do I exclude the business name in results without affecting the ranking on a search that includes part of the business name?
example: Bobs Automotive Supply is the business name.
How can I include relevant results when someone searches automotive or supply without returning every document in the index?
I tried "-'Bobs Automotive Supply' +'search term'" but this seems to exclude any document with Bobs Automotive Supply and isn't very effective on searching 'supply' or 'automotive'
Thanks in advance.
Second answer here, based on additional clarification from the first answer.
A few options.
Add the business name's words as stop words via Solr's stop filter. This will stop Solr from indexing them at all. Searches that use them will effectively only search for the words that aren't in the business name.
Rely on the inherent scoring that Solr will apply due to term frequency. It sounds like these terms will appear in the index frequently. Queries for them will still return the documents, but if the user queries for other, less common terms, those will get a higher score.
Apply a low query boost (not quite negative, but less than other documents) to documents that contain the business name. This is covered in the Solr Relevancy FAQ http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
Do you know up front that the article is tied to the business name, or can you derive this? If so, you could create another field and then just exclude entities that match on the business name using a filter query, something like:
q=search_term&fq=business_name:(NOT search_term)
It may be helpful to use subqueries for this or to just boost down rather than filter out results.
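Sketched with the pysolr client (the core URL and field names are hypothetical, and -business_name:... is an equivalent spelling of the NOT clause above):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore")

# Search for the term, but filter out documents whose dedicated
# business_name field matches it
results = solr.search("automotive", fq='-business_name:"automotive"')
for doc in results:
    print(doc["id"])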
EDIT: An update to the question makes this irrelevant. Leaving it here for posterity. :)
This is why Solr Documents have different fields.
In this case, it sounds like there is a "Footer" field that is separate from the "Body" field in your documents. When searches are performed, they would only be done against the Body, which won't include data from the Footer. You could even have a third field, "OriginalContent", which contains the original copy for display purposes. You wouldn't search that, just store it for later.
The important part is to create the two separate fields in your schema and make sure that you index the fields that you want to be able to search.