Why is the cosine_similarity of a pretrained fastText model high between two sentences that are not related at all? - word2vec

I am wondering why a pre-trained fastText model (Korean Wikipedia) seems not to work well! :(
model = fasttext.load_model("./fasttext/wiki.ko.bin")
model.cosine_similarity("테스트 테스트 이건 테스트 문장", "지금 아무 관계 없는 글 정말로 정말로")
(in English)
model.cosine_similarity("test test this is test sentence", "now not all relative docs really really ")
0.99....??
Those sentences are not related at all in meaning, so I think the cosine similarity should be much lower. However, it was 0.997383...
Is it impossible to compare whole sentences with fastText?
If so, is using doc2vec the only way?

Which 'fasttext' code package are you using?
Are you sure its cosine_similarity() is designed to take such raw strings and automatically tokenize/combine the words of each example to give sentence-level similarities? (Is that capability implied by its documentation or illustrative examples? Or does it perhaps expect pre-tokenized lists of words?)
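If the package doesn't do that for you, a common baseline is to average each sentence's word vectors yourself and compare the averages. Here's a minimal sketch, assuming gensim is available and using naive whitespace tokenization (real Korean text needs a proper tokenizer):

import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Load the same .bin file as in the question (the path is an assumption).
wv = load_facebook_vectors("./fasttext/wiki.ko.bin")

def sentence_vector(sentence):
    # Naive whitespace tokenization, then average the word vectors.
    return np.mean([wv[word] for word in sentence.split()], axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(sentence_vector("test test this is test sentence"),
             sentence_vector("now not all relative docs really really")))

Note that averaged vectors of short sentences tend to share a large common component, so even unrelated sentences can score a high cosine similarity with this method too.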

Related

Universe Dictionary Item multipart key, second part is multi-valued

Trying to use
TRANS(SPEC.ORDR,#ID:"*":#RECORD<27,#CNT>,27,'X')
Universe is returning an error message on #CNT. Assuming it's not supported in our Universe flavour, does anyone know what the actual TRANS code would be?
I just realized that CATS is base, so this should work for you. A similarly built I-descriptor on my system certainly does the concatenation correctly, but I don't have an easy-to-find example to test the TRANS on top. It does bring back the expected number of blanks based on the 'X'.
TRANS(SPEC.ORDR,SUBR('-CATS',REUSE(#ID:'*'),#RECORD<27>),27,'X')
Good luck

l10n/i18n: how to handle phrases with dynamic list of items?

What's the sanest way to handle translation and localization of dynamic lists?
Let's say I've queried the database and got a list ["Foos", "Bars", "Bazes"]. Let's also assume the list always contains at least two items - I'll be sure to use a different translation for the single-item case.
What should I do if I need a phrase like "We have a wide choice of Foos, Bars and Bazes in our code"? (The list items are dynamic, so I can't just pre-translate all the possible permutations and need to do things at runtime.)
I see at least the following issues:
I need to inflect all the items to the correct form (are there languages where different forms are required depending on the position in the list?)
Different locales may have drastically different rules for how to join items.
E.g. CJK locales need "、" instead of ",".
And AFAIK in Chinese there will be "及" or "和" - depending on the full phrase - before the last item, so I guess there's some ambiguity with translating "and".
And, as I've read, some languages may avoid punctuation the way it's used in English and have other conventions instead; e.g. an Arabic translator may prefer to use "و" before every item (although Arabic also has commas, "،"). Not sure whether that's true - I don't know Arabic, I just saw it mentioned.
My problem is, I don't even know what tooling might help me here. I don't have any particular programming-language requirements, although Python or JavaScript would be best. But I guess I can run just about anything, as I can probably build an l10n microservice and query it from my project.
I had used GNU gettext before I encountered this, but I haven't found anything in its APIs and data formats that would help me. The best I can imagine is _("We have a wide choice of %s in our code", list_text) and generating list_text with some DIY hacks. I'm not sure the XLIFF format has anything for this either. I've found i18n-list-generator on npm, but it's way too simplistic.
Has anyone dealt with something like this? What did you do? Is there any library out there that handles this - so I can take a look at its API and learn how it does things?
Here's how I would approach it:
No concatenation. All string joining needs to be done via format strings with placeholders.
Only use format strings that support named/numbered placeholders. E.g. {FOO} or $1 instead of %s (this is to allow for parameter reordering). Named placeholders are also better since they give more context to translators. Let's assume we're using {FOO}-style placeholders.
To render a list, I would use a couple of format strings, e.g. joinItem = "{LIST}, {ITEM}" to append items to the list and joinLastItem = "{LIST} and {ITEM}" to append the last item. This will allow one to render strings like Foos, Bars and Bazes, change the punctuation, and even reverse the ordering of the list if necessary.
Finally, you can use the outer format string, e.g. weHaveTheseItems = "We have a wide choice of {ITEMS} in our code", where {ITEMS} gets replaced with the previously rendered string (a minimal sketch of the whole rendering follows).
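Here is that rendering step sketched in Python; the names are illustrative, and in a real app the two join patterns would come from the locale's translation catalog rather than being hard-coded:

join_item = "{LIST}, {ITEM}"          # per-locale pattern, e.g. from the catalog
join_last_item = "{LIST} and {ITEM}"  # per-locale pattern, e.g. from the catalog

def render_list(items):
    # Fold the items together with the locale-specific join patterns;
    # the question guarantees at least two items.
    result = items[0]
    for item in items[1:-1]:
        result = join_item.format(LIST=result, ITEM=item)
    return join_last_item.format(LIST=result, ITEM=items[-1])

print("We have a wide choice of {ITEMS} in our code"
      .format(ITEMS=render_list(["Foos", "Bars", "Bazes"])))
# -> We have a wide choice of Foos, Bars and Bazes in our code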
Shameless self-promotion: you may want to have a look at the Plurr library that supports such {FOO}-style placeholders, as well as plurals (something you will likely need for such messages). It supports JavaScript among other languages.
This is a pain; as you point out, not all locales can be expected to support the "a, b, c, and d" form.
Inspired by @GSerg and @Igor Afanasyev, I came up with a GNU gettext-based solution like the following (pseudo gettext invocation):
GettextPlural(
    // TRANSLATORS: For multiple "choices", each will be prefixed with a new-line (\n)
    "We have a wide choice of {choices} in our code",
    "In our code we have a wide choice of{choices}",
    choices.Count)
should print like:
"We have a wide choice of FOOs in our code"
"In our code we have a wide choice of
FOOs
BARs
BAZs"
Remember to add --add-comments=TRANSLATORS to your xgettext invocation.
For Web purposes you could use <ul><li>...</li>... </ul> or whatever instead of \n.
The benefit is that the layout is at least as universal as the UI layout, but you still allow for non-English-like plural forms in other locales.
Some languages have only one plural form, so their translation must work for both a single choice and multiple choices; in particular, it cannot contain a conditional new-line.
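Here is a minimal sketch of the same idea using Python's standard gettext module; the catalog name, directory and locale are assumptions, and with no catalog installed fallback=True simply returns the English strings:

import gettext

t = gettext.translation("messages", localedir="locale",
                        languages=["de"], fallback=True)

choices = ["FOOs", "BARs", "BAZs"]

# ngettext() picks the catalog's correct plural form for len(choices).
template = t.ngettext(
    "We have a wide choice of {choices} in our code",
    "In our code we have a wide choice of{choices}",
    len(choices),
)

# A single choice stays inline; several choices each go on their own line.
rendered = choices[0] if len(choices) == 1 else "\n" + "\n".join(choices)
print(template.format(choices=rendered))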

Is there an easy way to tell which URLs are unused?

I have a Django project which is entering its fourth year of development. Over those four years, a number of URLs (and therefore view functions) have become obsolete. Instead of deleting them as I go, I've left them in with a "just in case" attitude. Now I'm looking to clean out the unused URLs and unused view functions.
Is there any easy way to do this, or is it just a matter of grunting through all the code to figure it out? I'm nervous about deleting something and not realizing it's important until a few weeks/months later.
The idea is to iterate over urlpatterns and check that the status_code is 200:
from django.test import TestCase
from django.urls import reverse

from myproject import urls  # the project's root URLconf (module name is an assumption)

class BasicTestCase(TestCase):
    def test_url_availability(self):
        for url in urls.urlpatterns:
            self.assertEqual(
                self.client.get(reverse('%s' % url.name)).status_code,
                200)
I understand that it might not work in all cases, since there could be URLs with dynamic parts, but it should give you the basic idea. Note that reverse() also accepts arguments in case you need to pass arguments to the underlying view.
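For example (the URL name 'entry-detail' and its id are made up for illustration), inside the test above you could do:

# reverse() fills in the dynamic part of the URL for you.
url = reverse('entry-detail', args=[42])  # e.g. resolves to '/entries/42/'
response = self.client.get(url)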
Also see: Finding unused Django code to remove.
Hope that helps.
You are looking for coverage, which is basically "how much of the code written is actually being used".
Coverage analysis is usually done with tests - to find what percentage of your code is covered by tests - but there is nothing stopping you from doing the same to find out which views are orphaned.
Rather than repeat the process: @sk1p already answered it at How to get coverage data from a django app when running in gunicorn.
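As a rough illustration of the idea (assuming the coverage.py package; myapp.views is a placeholder module name), you can start measurement, exercise the site, and then see which view code never ran:

import coverage

# Restrict measurement to the views module(s) you want to audit.
cov = coverage.Coverage(source=["myapp.views"])  # placeholder module name
cov.start()

# ... exercise the site here: run the test suite, click through pages, etc. ...

cov.stop()
cov.save()
cov.report()  # view functions with no executed lines are removal candidates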

How do you access/configure summaries/snippets in Django Haystack

I'm working on getting django-haystack set up on my site, and am trying to have snippets in my search results roughly like so:
Title of result one about Wikis ...this special thing about wiki values is that...I always use a wiki when I walk...snippet value three talks about wikis too...and here's another snippet value
about wikis.
I know there's a template tag that uses Haystack code to do the highlighting, but the snippets it generates are pretty limited:
they always start with the query word
there's only one snippet value
they don't support asterisk queries
and other stuff?
Is there a way to use the Solr backend to generate proper snippets as shown above?
Bottom line: the Solr highlighting can't really be used by Haystack in a flexible way. I spoke to the main developer of Haystack on IRC, and he said, basically, that if I want the kind of highlighting I'm looking for, the only way to get it is to extend the Solr backend that Haystack uses.
I dabbled in that for about half a day, but couldn't get Haystack to recognize my custom backend. Haystack has some magic backend-loading code that just wasn't working for me.
Consequently, I've switched over to sunburnt, which provides a lighter-weight and more extensible wrapper around Solr. I'm hoping it will fare better.
from haystack.utils import Highlighter

my_text = 'This is a sample block that would be more meaningful in real life.'
my_query = 'block meaningful'

highlight = Highlighter(my_query)
# Returns the text with the query words wrapped in <span class="highlighted"> tags.
highlight.highlight(my_text)
http://docs.haystacksearch.org/dev/highlighting.html

Inexact full-text search in PostgreSQL and Django

I'm new to PostgreSQL, and I'm not sure how to go about doing an inexact full-text search. Not that it matters too much, but I'm using Django. In other words, I'm looking for something like the following:
q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'],
    params=[q])
for entry in queryset:
    print(entry.title)
where the list of entries should contain either exactly 'hello world' or something similar. The listings should then be ordered according to how far their value is from the specified string. For instance, I would like the query to include entries containing "Hello World", "hEllo world", "helloworld", "hell world", etc., with some sort of ranking indicating how far each item is from the perfect, unchanged query string.
How would you go about doing this?
Your best bet is to use Django raw querysets; I use them with MySQL to perform full-text matching. If the data is all in the database and Postgres provides the matching capability, then it makes sense to use it. Plus, Postgres offers some really useful things in terms of stemming etc. with full-text queries.
Basically it lets you write the actual query you want yet returns models (as long as you are querying a model table, obviously).
The advantage this gives you is that you can test the exact query you will be using first in Postgres; the documentation covers full-text queries pretty well.
The main gotcha with raw querysets at the moment is that they don't support count(). So if you will be returning lots of data and have memory constraints on your application, you might need to do something clever.
"Inexact" matching, however, isn't really part of the full-text searching capabilities. Instead you want the Postgres fuzzystrmatch contrib module. Its use is described here, with indexes.
The best option would be to use a search engine for this purpose. django-haystack supports the integration of three different search engines.
As of 2022, Django supports full-text search with Postgres out of the box. Full documentation here: https://docs.djangoproject.com/en/4.0/ref/contrib/postgres/search/
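For the "inexact" part specifically, django.contrib.postgres also exposes trigram similarity; a minimal sketch, assuming the pg_trgm extension is enabled and reusing the Entry model from the question:

from django.contrib.postgres.search import TrigramSimilarity

# Rank entries by how closely their title matches the query string.
Entry.objects.annotate(
    similarity=TrigramSimilarity('title', 'hello world'),
).filter(similarity__gt=0.3).order_by('-similarity')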