I uploaded a JSON document to CloudSearch with one field of type 'text' that is searchable. It contains the word 'Residential'.
However, if I search for 'Residentia*' I get no results, while 'Residenti*' or 'Residential' works fine.
Does anyone know why? Thanks heaps!
I ran into similar issues with CloudSearch and searched everywhere for the answer. I eventually came across a piece about "Algorithmic Stemming": https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html.
The default stemming level for English text is "full", which means CloudSearch indexes the stemmed form of each word rather than the literal text; wildcard queries are matched against those indexed terms, which is why a longer prefix like 'Residentia*' can miss while the shorter 'Residenti*' still hits. I created a custom analysis scheme with stemming set to "none" and applied it to most of the fields in my documents, and that solved my problems.
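For reference, here is roughly how defining such a scheme looks with the AWS SDK for Java (CloudSearch 2013-01-01 API). The domain, scheme and field names are placeholders, and the builder method names are my best reading of the generated model classes, so double-check them against your SDK version:

import com.amazonaws.services.cloudsearchv2.AmazonCloudSearchClient;
import com.amazonaws.services.cloudsearchv2.model.AnalysisOptions;
import com.amazonaws.services.cloudsearchv2.model.AnalysisScheme;
import com.amazonaws.services.cloudsearchv2.model.DefineAnalysisSchemeRequest;
import com.amazonaws.services.cloudsearchv2.model.DefineIndexFieldRequest;
import com.amazonaws.services.cloudsearchv2.model.IndexDocumentsRequest;
import com.amazonaws.services.cloudsearchv2.model.IndexField;
import com.amazonaws.services.cloudsearchv2.model.TextOptions;

public class DisableStemming {
    public static void main(String[] args) {
        // Picks up credentials/region from the default provider chain
        AmazonCloudSearchClient client = new AmazonCloudSearchClient();

        // 1. Define an English analysis scheme with algorithmic stemming turned off
        client.defineAnalysisScheme(new DefineAnalysisSchemeRequest()
                .withDomainName("my-domain")
                .withAnalysisScheme(new AnalysisScheme()
                        .withAnalysisSchemeName("nostem")
                        .withAnalysisSchemeLanguage("en")
                        .withAnalysisOptions(new AnalysisOptions()
                                .withAlgorithmicStemming("none"))));

        // 2. Point the text field at the new scheme
        client.defineIndexField(new DefineIndexFieldRequest()
                .withDomainName("my-domain")
                .withIndexField(new IndexField()
                        .withIndexFieldName("text")
                        .withIndexFieldType("text")
                        .withTextOptions(new TextOptions()
                                .withAnalysisScheme("nostem"))));

        // 3. Rebuild the index so the new scheme takes effect
        client.indexDocuments(new IndexDocumentsRequest().withDomainName("my-domain"));
    }
}

After the re-index finishes, 'Residentia*' should match again, because the full unstemmed token is back in the index.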
When I process 56 files with Solr, it reports 'numDocs: 74'. I have no clue why there would be more indexed documents than files processed. One explanation I came up with is that the indexes of a couple of the processed files are too big, so they get split into multiple index entries (I use rich content extraction on all processed files). It was just a thought, so I don't want to take it as true right off the bat. Can anyone confirm this or offer an alternate explanation?
I'm using Django + Haystack + Solr.
Many thanks
Your terminology is unfortunately all incorrect, but the troubleshooting process should be simple enough. Solr comes with an admin console, usually at http://[localhost or domain]:8983/solr/. Go there, find your collection in the drop-down (I am assuming Solr 4) and run the default query on the Query screen. That should give you all your documents, and you can see what the extras are.
I suspect you may have some issues with your unique ids and/or reindexing. But with such a small number of documents you can really just review what you are actually storing in Solr and figure out what is not correct.
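If you would rather check from code than from the admin UI, a quick SolrJ sketch like this will dump every id. The core name "collection1" and the Haystack field names are assumptions; adjust them to your setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ListSolrIds {
    public static void main(String[] args) throws Exception {
        // Point at the same core that Haystack writes to
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery query = new SolrQuery("*:*");   // match everything
        query.setFields("id", "django_ct");       // django_ct is Haystack's content-type field
        query.setRows(200);                       // more than enough to cover all 74 docs

        QueryResponse response = solr.query(query);
        System.out.println("numFound: " + response.getResults().getNumFound());
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + "  " + doc.getFieldValue("django_ct"));
        }
        solr.shutdown();
    }
}

Duplicate or stale ids in that listing will usually point straight at the reindexing problem.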
I am using Alex Shyba's Advanced Database Crawler to index data from Sitecore, and Lucene.NET queries to run searches. I have it working solidly for the most part, but I'm having issues with the _language field when I try to do a term match on values such as en-US, zh-CN and de-DE.
It returns all results for the 'en' culture, but in the zh-CN culture, for example, it returns about 99% of the results and leaves out 2-3 articles from each set. The en and zh-CN items are different versions of the same item, and I can see information about the item in both cultures in the index via Luke.
I am using a TermQuery on the language field to return data. I tried PhraseQuery and WildcardQuery, but I got the same results every time.
I also tried escaping the hyphen with a backslash, since StandardAnalyzer doesn't like hyphens, but that didn't work either.
At this point I am out of ideas. How can I have my queries return all the matching documents?
Thanks
The ADC has its own query objects to define search parameters. Simply use the Language property on the SearchParam object to search by a language.
I have a ColdFusion website on which I want to offer multiple languages, but I have no idea what the best way to do this is.
I have two plans:
1:
Of course all content (text) is in a database.
If a user wants a different language, they click a link/flag, which puts the requested language in a session variable, for example: session.language = "es"
In the database I would have two columns (one column per language) and then select the text that belongs to 'es'.
Every page would then query the database for the text belonging to session.language.
PROS: Relatively simple to implement
CONS: SEO-wise I don't think this is very good. http://www.domain.com/page.cfm would serve English or Spanish text (or another language) at the same URL, and Google will not index duplicate URLs.
2:
Do something like http://www.domain.com/en/page.cfm for English and http://www.domain.com/es/page.cfm for Spanish.
With a URL rewrite rule, the language value in http://www.domain.com/en/page.cfm would actually map to http://www.domain.com/page.cfm?language=en.
The url.language variable would then select the correct language from the database.
PROS: Unique URL for each language. Good for SEO and Google indexing.
CONS: A bit more difficult to implement. (I think)
Or does anyone have other / better ideas?
Thanks!!
You should always first check the browser header "Accept-Language" for the default language(s) (the correct, standard way to do it), and offer links (the intuitively appealing way) only as an alternative.
Doing it in a database doesn't seem very standard. Let's assume you would like to use an MVC architecture (model-view-controller). Most software uses keys in the presentation layer (view) (e.g. HTML), and alongside the presentation layer you have language files (in Java, typically properties files) which are mapped simply by their filenames and can be edited by regular users without any special technical skills, such as professional translators who are not programmers. Certainly you could put it in a database, but then it is just more work, and it moves the information out of the presentation layer.
There are various libraries for doing this. You should find the normal one for your platform. Please edit your question to include what you are using to develop the application (e.g. JSP, Tapestry, Wicket, ASP, PHP, etc.). For example, if you were using JSPs, I would suggest the JSTL tag library's language support; if you were using Tapestry, I would point you to http://tapestry.apache.org/localization.html or http://tapestry.apache.org/tapestry4.1/UsersGuide/localization.html
To read up on this, search for the terms "internationalization" (aka "i18n") or "localization". (The terms don't mean the same thing, but few people use them correctly, so either works: http://www.w3.org/International/questions/qa-i18n)
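For illustration, this is roughly what the properties-file lookup looks like in plain Java; the bundle name "messages" and the key "page.title" are made up for the example:

import java.util.Locale;
import java.util.ResourceBundle;

public class GreetingExample {
    public static void main(String[] args) {
        // In a servlet you would take the locale from the Accept-Language
        // header via request.getLocale(); here we hard-code Spanish.
        Locale locale = Locale.forLanguageTag("es");

        // Looks for messages_es.properties, falling back to messages.properties
        ResourceBundle messages = ResourceBundle.getBundle("messages", locale);

        System.out.println(messages.getString("page.title"));
    }
}

messages_es.properties would then contain a line such as page.title=Bienvenido, and translators only ever touch those flat text files.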
I would go for option 2. Every translation should have its own URL, and links to your website will then already be in the intended translation.
To store translations in a database, I wouldn't put every translation in a separate column, but rather put them in a separate table:
Table Posts:
- id
- title_id
- ...
Table Translations:
- label_id
- value
- country_code
- language_code
where title_id in Posts matches label_id in Translations.
This way you won't have to alter your table structure when a new language is added, and you can have any number of translations for any label or text.
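For example, fetching the Spanish title of a post is then a single join. A small JDBC sketch using the table and column names above (the connection URL and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TranslationLookup {
    public static String getTitle(long postId, String languageCode) throws Exception {
        String sql = "SELECT t.value FROM Posts p "
                   + "JOIN Translations t ON t.label_id = p.title_id "
                   + "WHERE p.id = ? AND t.language_code = ?";
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mysite", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, postId);
            ps.setString(2, languageCode);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("value") : null;
            }
        }
    }
}

Adding a new language is then just new rows in Translations, with no schema change.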
To do a multilingual site effectively, you need to set a rule for yourself that NO TEXT is ever hard-coded in the source. It all needs to come from the database and/or a resource bundle.
Text from the database
You need to make sure that the column you are storing your data in is Unicode, otherwise you'll have issues with accented characters. Also, don't have a column per language, as this is not scalable; do what #jan suggests and have a translations table where the items are keyed on a reference as well as a language.
Resource bundles
You are not going to want to fetch every last little bit of text from the database, so for those strings you can use a resource bundle. This is an admittedly old link, http://www.sustainablegis.com/blog/cfg11n/index.cfm?mode=entry&entry=FD48909C-50FC-543B-1FE177C1B97E8CC1, from Paul Hastings's blog about some approaches to resource bundles. To be honest, his blog is an excellent resource on this very subject.
With regard to how you handle the URLs, do not do option 1: as you quite rightly identified, it will hurt the SEO ranking of the page and it will mean that users cannot correctly share or return to the page.
One approach is having the language code in the URL, as you identified in option 2.
Pros
Simpler to configure
Cons
You have one application, which means that as you add more languages you add more complexity and memory weight to that one app
Or you can have a different subdomain or domain per language, e.g. es.yourdomain.com or yourdomain.es; they can all run the same codebase.
Pros
Each language is a standalone application, meaning it has its own memory
Cons
More effort to configure
http://i18n.riaforge.org/ has a download for i18n. It can be used to make sure that all string labels match; that way, if someone wants to change "Save" to "Update", it can all be done in one spot.
It is also important to consider the technical background of those who will be doing the translation. It is often easier to get the translation team to edit files in Notepad than to update a database, and text files work well with version control.
The best way I have found is to use an XML file to hold just that page's language strings, one XML file per page, and then vary it per language. When the page loads, just load a different XML file from the database or the file system; there are many ways to do this. All the other methods I have tried have their issues, and at least this one lets you take a language XML file, hand it to someone who copies it and changes the values, and then put it back in the database to serve it.
You can also do this for plain text, and have the database build the XML for just that page's text by using a list of items to include in the XML for the page.
Once you get the idea, the rest becomes very easy.
And given ColdFusion's way of accessing such data with dot notation, it is easy to use.
Say you have "Load Images":
In the English XML it may be <LoadIMGS>Load Images</LoadIMGS>
In the Chinese XML it may be <LoadIMGS>加载图像</LoadIMGS>
or <LoadIMGS>Jiā zǎi túxiàng</LoadIMGS>
Regardless, in your CFM code you would just put #variablename.LoadIMGS# in that spot. I would also suggest putting in the LoadIMGS tag the font size to adjust to if it should not be the normal size; that way, when a translation is too long, you can shrink the font just for that label.
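For what it's worth, the lookup itself is trivial in any language. Here is a small Java DOM sketch reading one label from such a per-page file (the file name en.xml is made up; in ColdFusion you would just use xmlParse() and dot notation as described above):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class PageLabels {
    public static void main(String[] args) throws Exception {
        // Parse the per-page language file, e.g. the English one
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("en.xml");

        // Grab the <LoadIMGS> label
        String loadImages = doc.getElementsByTagName("LoadIMGS")
                .item(0)
                .getTextContent();

        System.out.println(loadImages); // "Load Images" or "加载图像"
    }
}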
enjoy!!!
I use OWASP AntiSamy with the eBay policy file to prevent XSS attacks on my website.
I also use Hibernate search to index my objects.
When I use this code:
String html = "special word: été";
// use the eBay policy file
Policy policy = Policy.getInstance(xssPolicyFile.getInputStream());
AntiSamy as = new AntiSamy();
CleanResults cr = as.scan(html, policy);
// result is now: "special word: &eacute;t&eacute;"
String result = cr.getCleanHTML();
As you can see, every "é" character has been transformed into its HTML entity equivalent "&eacute;".
My page is served as UTF-8, so I don't need this transformation. Moreover, when I index this text with Hibernate Search, it indexes the word with the HTML entities, so I can't find the word "été" in my index.
How can I force AntiSamy not to transform special characters into their HTML entity equivalents?
thanks
PS: an issue has been opened: http://code.google.com/p/owaspantisamy/issues/detail?id=99
I ran into the same problem this morning.
I have encapsulated AntiSamy in a class and use Apache's StringEscapeUtils from commons-lang to restore the special characters.
CleanResults cleanResults = antiSamy.scan(taintedHtml);
cleanedHtml = cleanResults.getCleanHTML();
// StringEscapeUtils comes from org.apache.commons.lang (commons-lang)
return StringEscapeUtils.unescapeHtml(cleanedHtml);
The result is cleaned-up HTML without the HTML escaping of special characters.
Hope this helps.
As Mohamad said in a comment, AntiSamy has just released a new directive named entityEncodeIntlChars.
Here are the details: http://code.google.com/p/owaspantisamy/source/detail?r=240
It seems that this directive solves the problem.
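I haven't tried it myself, but judging from the commit it is a policy-file directive, so presumably something along these lines in your eBay policy XML (the value here is my assumption; check the linked change for the exact semantics):

<directives>
    <!-- assumed: "false" keeps "é" as-is instead of emitting "&eacute;" -->
    <directive name="entityEncodeIntlChars" value="false"/>
</directives>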
After scouring the AntiSamy source code, I found no way of changing this behavior apart from modifying AntiSamy.
Check out this one: http://code.google.com/p/owaspantisamy/source/browse/#svn/trunk/dotNet/current/source/owaspantisamy/html/scan
Grab the source and notice that the key classes (AntiSamyDOMScanner, CleanResults) use standard framework objects (like XmlDocument). Compile it and run with the binary you built, so that you can see in a debugger which of the major classes actually corrupts your data. With that in hand, you'll be able either to change a few properties on the major objects to make it stop, or to inject your own post-processing that reverts the damage (say with a regexp). Later you can expose that as an additional top-level property, say one named NoMess :-)
Chances are the behavior differs between the language ports (there are three in that trunk), but the same tactic will work no matter which one you have to deal with.