In need of a SaaS solution for semantic thesaurus matching - sas

I'm currently building a web application. In one of its key processes, the application needs to match short phrases to other, similar ones available in the DB.
The application needs to be able to match the phrase:
Looking for a second hand car in good shape
To other phrases which basically have the same meaning but use different wording, such as:
2nd hand car in great condition needed
or
searching for a used car in optimal quality
The phrases are length-limited (say, 250 characters), user-generated, and unstructured.
I'm looking for a service, company, or other solution that can make these connections for me.
Can anyone give any ideas?

Have you looked at SAS Text Miner? It may be suited to this kind of application. I have only seen a demo of it, but it should be able to tokenize the data just fine. You may need some custom programming around the synonyms, though.
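To give a concrete feel for the kind of custom synonym handling that would be needed, here is a rough sketch in plain Python (no SAS involved); the synonym map, canonical forms, and scoring are made-up illustrations, not a tuned solution:

```python
# Illustrative sketch: match a query phrase against stored phrases by normalizing
# tokens through a small synonym map and ranking by token overlap (Jaccard).
# The synonym map and the scoring are placeholder assumptions, not a recommendation.
import re

SYNONYMS = {
    "2nd": "second",
    "used": "second",      # crude: collapse "used" toward "second(-hand)"
    "searching": "looking",
    "needed": "looking",
    "condition": "shape",
    "quality": "shape",
    "great": "good",
    "optimal": "good",
}

def normalize(phrase: str) -> set[str]:
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return {SYNONYMS.get(t, t) for t in tokens}

def similarity(a: str, b: str) -> float:
    ta, tb = normalize(a), normalize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

query = "Looking for a second hand car in good shape"
candidates = [
    "2nd hand car in great condition needed",
    "searching for a used car in optimal quality",
    "selling a brand new motorbike",
]
for c in sorted(candidates, key=lambda c: similarity(query, c), reverse=True):
    print(f"{similarity(query, c):.2f}  {c}")
```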

Related

Google Analytics search and replace according to a list

Scenario:
In Google Analytics, I notice that it is possible to replace certain URI parameters with the words you want by using a search-and-replace filter, as in the example below.
e.g. www.example.com/abc/product_id=3 -----> www.example.com/abc/product_name=shampoo
Problems:
Currently I've got a list of over 1,000 products in hand. Instead of creating 1,000 search-and-replace filters, what would be the most efficient and maintainable way to solve the problem?
I've done some digging and noticed that a custom dimension could be the solution; however, it would require me to modify the JS code on the FTP server, which I don't have permission to do. What other solutions do I have?
If it is not possible to explain it here, is there any kind of tutorial that I could follow?
I really appreciate the help. Many thanks!
This is not a complete answer, but it's certainly more than a comment.
Besides the tedium of writing this out by hand, I can think of two options available to you.
Firstly, you could use the Google Analytics Management API (https://developers.google.com/analytics/devguides/config/mgmt/v3/). By constructing a set of commands, you could quickly iterate through your list and create the required 1,000 search and replace filters.
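A hedged sketch of that first option in Python with google-api-python-client: the account id, key file, and CSV layout below are placeholder assumptions, and the endpoint and filter body should be double-checked against the Management API v3 documentation before you rely on them.

```python
# Sketch only: bulk-create Search & Replace filters via the GA Management API v3.
# Verify the filter body and quotas against the official docs before running.
import csv

from google.oauth2 import service_account   # assumes a service account with Edit rights
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/analytics.edit"]
ACCOUNT_ID = "12345678"                      # placeholder: your GA account id

creds = service_account.Credentials.from_service_account_file("key.json", scopes=SCOPES)
analytics = build("analytics", "v3", credentials=creds)

# products.csv: one "product_id,product_name" row per product (hypothetical file)
with open("products.csv", newline="") as fh:
    for product_id, product_name in csv.reader(fh):
        body = {
            "name": f"Rewrite product_id={product_id}",
            "type": "SEARCH_AND_REPLACE",
            "searchAndReplaceDetails": {
                "field": "PAGE_REQUEST_URI",
                "searchString": f"product_id={product_id}",
                "replaceString": f"product_name={product_name}",
                "caseSensitive": False,
            },
        }
        analytics.management().filters().insert(accountId=ACCOUNT_ID, body=body).execute()
        # Each filter still has to be linked to a view (profile) afterwards,
        # e.g. via management().profileFilterLinks().insert(...), as I recall.
```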
Secondly, if you were to use Google Tag Manager you would be able to create a Custom JavaScript Variable that takes the page path and compares it to your list. This variable could then replace the Page field before the hit data is sent to Google Analytics. This may sound more complicated, but it would allow you to pull your solution out of Google Analytics and into the flexible world of JavaScript.
Note that if you rewrite the product_id to a product_name once, you will have to maintain that cross-reference every day and keep it in sync with what appears on the website. Make sure you have an automated solution, or it will quickly get out of sync and become more of a mess than before.
An alternative is to do the search-and-replace on the reporting side.
I know the Analytics Edge and Analytics Canvas products could easily do this, or you could just download the data into Excel or Google Sheets and use a series of lookup formulas.

How to decide if a webpage is about a specific topic or not? [closed]

I am trying to write code that takes the source HTML of a web page and then decides what kind of web page it is. I am interested in deciding whether the web page is about academic courses or not. A naive first approach I have is to check whether the text contains related words (course, instructor, teach, ...) and to decide that it is about an academic course if there are enough hits.
However, I need some ideas on how to achieve this more effectively.
Any ideas would be appreciated.
Thanks in advance :)
Sorry for my English.
There are many approaches to classifying a text, but first: a web page should be converted to plain text, either in a dumb way by removing all the HTML tags and reading what's left, or in smarter ways that identify the parts of the page that contain the useful text. In the latter case you can use some HTML5 elements like <article>; read about the HTML5 structural elements here.
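As a rough illustration of that first step (assuming the third-party BeautifulSoup package is available), a minimal extractor might look like this:

```python
# Sketch of the "convert the page to plain text first" step, using bs4.
# Prefer the <article> element when the page uses HTML5 structural markup,
# otherwise fall back to the whole body.
from bs4 import BeautifulSoup

def page_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()                      # drop non-content markup
    root = soup.find("article") or soup.body or soup
    return root.get_text(separator=" ", strip=True)
```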
Then you can try any of the following methods, depending on really how far you are willing to go with your implementation:
Like you mentioned, a simple search for related words, but that would give you a very low success rate.
Improve on that by passing the tokens of the text to a lexical analyzer and focusing on the nouns; nouns usually carry the most weight (I will try to find the source for this, but I'm sure I read it somewhere while implementing a similar project). This might improve the rate a little.
Improve further by looking at the root of each word; you can use a morphological analyzer to do so, and this way you can tell that the word "papers" is the same as "paper". That can improve things a little more.
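A minimal sketch of the keyword-hit idea with stemming folded in (assuming the NLTK package for the stemmer; the keyword list and threshold are just placeholders):

```python
# Minimal sketch: stemmed keyword-hit density as an "is this academic?" signal.
# The keyword list and the threshold are illustrative assumptions to tune.
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
KEYWORDS = {stemmer.stem(w) for w in
            ["course", "instructor", "teach", "lecture", "syllabus", "semester"]}

def academic_score(text: str) -> float:
    tokens = [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    hits = sum(1 for t in tokens if t in KEYWORDS)
    return hits / max(len(tokens), 1)        # hit density, not a raw count

def looks_academic(text: str, threshold: float = 0.01) -> bool:
    return academic_score(text) > threshold
```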
You can also use an ontology of words like WordNet, and then start checking whether the words in the document are descendants of one of the words you're looking for, or the other way around; but going up means generalizing, which would hurt precision. E.g., you can tell that the word "kitten" is related to the word "cat", and so you can assume that since the document talks about "kittens" it talks about "cats".
All of the above depends on you setting a defined list of keywords to base your decision on. But life usually doesn't work that way, and that's why we use machine learning. The basic idea is that you get a set of documents and manually tag/categorize/classify them, then feed those documents to your program as a training set and let it learn from them; afterwards your program will be able to apply what it learned to tag other, untagged documents. If you decide to go with this option, you can check this SO question and this Quora question, and the possibilities are endless.
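For a concrete starting point on the machine-learning route (assuming scikit-learn; the tiny hand-labelled training set below is only a stand-in for a real corpus):

```python
# Hedged sketch of the supervised route: TF-IDF features + Naive Bayes.
# Replace the toy training data with a real labelled set of pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "This course covers data structures; the instructor holds office hours weekly.",
    "Syllabus, lectures and exams for the introductory chemistry course.",
    "Best second hand cars for sale in your area, great prices.",
    "Celebrity gossip and entertainment news updated daily.",
]
train_labels = ["academic", "academic", "other", "other"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Enroll now: the instructor posted the course syllabus."]))
```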
And if you speak Arabic, I would share a paper from the project I worked on here if you're interested, but it is in Arabic and deals with the challenges of classifying Arabic text.
I know nothing about web programming, as a C programmer, but I would make sure it checks for different domain name suffixes: .edu is the one most universities use, .gov is for government pages, and so on; then there is no need to scan the page. But surely the way to achieve the highest accuracy is to use these methods and also create a way for users to correct the app; this info can be hosted on a web server, and a page can be cross-referenced against that database. It's always great to use your customers as an improvement tool!
Another way would be to see if you can cross-reference it with search engines that categorize pages in their index. For example, Google collates academic abstracts in Google Scholar. You could see if the web page is present in that database.
Hope this helped! If I have any other ideas you will be the first to know!
Run the text through a sequence-finding algorithm.
Basics of the algorithm: take a number of web pages that are definitely related to academic courses, clean them, and search them for frequently occurring word sequences (2-5 words). Then remove, by hand, the common word sequences that are not directly related to academic courses. By examining how many of those sequences appear in a given web page, you can determine with some precision whether its content is related to the source of the test word sequences.
Note: the tested web pages must be properly cleaned up first. Strip the page contents of anything unrelated: delete links, script tags and their contents, and remove the tags themselves (but leave the text in images' alt/title attributes), and so on. The context to examine should be the title, meta keywords and description, plus the cleaned contents of the page. The next step is to stem the text.
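A rough sketch of this sequence-mining idea in plain Python; the 2-5 word range and the frequency cut-off are arbitrary assumptions to tune by hand:

```python
# Mine frequent 2-5 word sequences from known academic-course pages,
# then score new pages by how many of the mined sequences they contain.
import re
from collections import Counter

def sequences(text: str, n_min: int = 2, n_max: int = 5):
    words = re.findall(r"[a-z]+", text.lower())
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def mine_sequences(training_pages: list[str], min_count: int = 3) -> set[str]:
    counts = Counter(s for page in training_pages for s in sequences(page))
    # Keep only frequently seen sequences; prune unrelated ones by hand afterwards.
    return {seq for seq, c in counts.items() if c >= min_count}

def score(page_text: str, mined: set[str]) -> int:
    return sum(1 for s in set(sequences(page_text)) if s in mined)
```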

How to design a full-text search algorithm where the keyword quantity is huge (like Google Alerts)?

I am building something very similar to Google Alerts. If you don't know what it is, consider the following scenario:
Thousands of new articles and blog posts come in every day
Each user has a list of favorite "keywords" that he'd like to subscribe to
There are millions of users with millions of keywords
We scan every article/blog post looking for every keyword
Notify each user if a specific keyword matches.
For one keyword, doing a basic full-text search against thousands of articles is easy, but how do I make full-text search effective with millions of keywords?
Since I don't have a strong CS background, the only idea I came up with is compiling all the keywords into a regex or an automaton (like Google's RE2); will this work well?
I think I am missing something important here, like compiling those keywords into some advanced data structure, since many keywords are alike (e.g. plural forms, simple AND/NOT logic, etc.). Is there any prior theory I need to know before heading into this?
All suggestions are welcome, thanks in advance!
I can think of the following: (1) Make sure each search query is really fast. Millisecond performance is very important. (2) Group multiple queries with the same keywords and do a single query for each group.
Since different queries are using different keywords and AND/OR operations, I don't see other ways to group them.
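For what it's worth, the classical data structure for matching many keywords against a text in a single pass is an Aho-Corasick automaton. Below is a simpler, hedged stand-in in Python that already avoids the one-scan-per-keyword trap by putting every keyword phrase into a single hash map keyed by token n-grams; the data shapes are illustrative assumptions, not a production design.

```python
# Sketch of "compile all keywords into one structure": index every keyword phrase
# by its token tuple, then scan each article's n-grams once against that index.
import re
from collections import defaultdict

def build_index(user_keywords: dict[str, set[str]]):
    """user_keywords: user id -> set of keyword phrases (illustrative shape)."""
    phrase_to_users = defaultdict(set)
    max_len = 1
    for user, phrases in user_keywords.items():
        for p in phrases:
            tokens = tuple(re.findall(r"\w+", p.lower()))
            phrase_to_users[tokens].add(user)
            max_len = max(max_len, len(tokens))
    return phrase_to_users, max_len

def match_article(text: str, phrase_to_users, max_len: int) -> set[str]:
    words = re.findall(r"\w+", text.lower())
    notified = set()
    for n in range(1, max_len + 1):              # all n-grams up to the longest phrase
        for i in range(len(words) - n + 1):
            notified |= phrase_to_users.get(tuple(words[i:i + n]), set())
    return notified

users = {"alice": {"used car"}, "bob": {"google alerts", "car"}}
index, max_len = build_index(users)
print(match_article("Looking for a used car near me", index, max_len))  # {'alice', 'bob'}
```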

A tool which checks that a local version of a site is fully translated (for continuous integration)

I'm working on a project in which we are designing a localized version of an existing site (written in English) for another country (which is not English-speaking). The business requirement is "no English text, for all possible and impossible cases".
Does anyone know of checker software or a service that can verify that a site is fully translated, that is, that there is no English text in it?
I know there are sites for checking broken links, HTML validity, etc.; I need something like http://validator.w3.org/checklink, but for checking that there is no English text on any page of the site.
The reasons I think this way is needed are:
1. There is a lot of code that is common (both on the backend and the frontend) to all countries.
2. If someone commits anything to the common code, I need to be sure that this will not lead to English text appearing in the localized version.
3. From a business point of view, it is preferable that the site not support some functionality rather than show English text (legal matters).
4. The code, both on the frontend and the backend, changes a lot.
5. There are a lot of files that affect the text on the client's screen, not just one file with messages, unfortunately. Some of the messages come from the backend, but most of them are in the frontend.
6. Because of all those facts, currently someone manually fills in all the forms and checks with his own eyes, and that is before each deploy...
I think you're approaching the problem from the wrong direction. You're looking for an algorithm or web crawler that can detect whether any given text is English or not? I don't know, but I doubt such a thing even exists.
If you have translated the website, you have full access to the codebase and/or translation texts, right? Can't you just open both the English and non-English string files (.resx or whatever you are using) in a compare tool like Notepad++ and check the differences to see if there are any missing strings? And check the source code to verify that all parts that can output user-displayable text use the meta:resourceKey property (or whatever you are using).
If you want to go the way of crawling, I'm not aware of an existing crawler that does this, but it sounds like a combination of two simple issues:
Finding existing open-source code for a web crawler should be dead simple
Identifying a language through n-gram analysis is trivial if there's a limited number of languages the text can be in.
The only difficult part would be to ensure that the analyzer always has a decent chunk of text to work with. You could extract stuff paragraph by paragraph. For forms you'd probably have to combine the text of several form labels.
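As a crude stand-in for a real n-gram language identifier, even a stopword-ratio heuristic in Python illustrates the idea (the word list and threshold are rough assumptions):

```python
# Flag a chunk of text as "probably English" when common English stopwords
# make up a noticeable share of its words. A real setup would use an n-gram
# language identifier instead; this is only a rough illustration.
import re

ENGLISH_STOPWORDS = {
    "the", "and", "of", "to", "in", "is", "for", "with", "that", "this",
    "are", "on", "you", "your", "it", "be", "or", "not", "have", "from",
}

def looks_english(text: str, threshold: float = 0.15) -> bool:
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 5:                 # too little text to judge reliably
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold
```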

What are my options for white-listing HTML in ColdFusion?

I want to allow my users to input HTML.
Requirements
Allow a specific set of HTML tags.
Preserve characters (do not encode ã as an HTML entity, for example)
Existing options
AntiSamy. Unfortunately AntiSamy encodes special characters and breaks requirement 2.
Native ColdFusion functions (HTMLCodeFormat() etc...) don't work as they encode HTML into entities, and thus fail requirement 1.
I found this set of functions somewhere, but I have no way of telling how secure this is: http://pastie.org/2072867
So what are my options? Are there existing libraries for this?
Portcullis works well for ColdFusion for attack-specific issues. I've used a couple of other regex solutions I found on the web over time that have worked well, though they haven't been nearly as fleshed out. In 15 years (10 as a CMS developer) nothing I've built has been hacked... knock on wood.
When developing input fields of any type, it's good to look at the problem from different angles. You've got the UI side, which includes both usability and client-side validation. Yes, it can be bypassed, but JavaScript-based validation is quicker, more responsive, and rates higher on the magical UI scale than back-end interruption or simply making things "disappear" without warning. It also speeds up the back-end validation because it does the initial screening. So it's not an "instead of" but an "in addition to" kind of solution that can't be ignored.
Also on the UI front, giving your users a good-quality editor can make a huge difference in the process. My personal favorite is CKEditor, simply because it's the only one that can handle Microsoft Word markup on the front end, keeping it far away from my DB. It seems silly, but Word HTML is valid, so it won't set off any red flags... but on a moderately sized document it will quickly exceed a DB field's insert maximum, believe it or not. Not only will a good editor reduce the amount of silly HTML that comes in, but it will also just make things faster for the user... win/win.
I personally encode and decode my characters...it's always just worked well so I've never changed practice.
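The original question is ColdFusion-specific, but since the thread is really about the allowlist idea, here is an illustrative sketch in Python of what a tag allowlist that leaves characters like "ã" untouched looks like. It is a teaching sketch under those assumptions, not a vetted sanitizer; a real deployment should rely on a maintained library.

```python
# Illustrative tag-allowlist sanitizer: keep only allowed tags/attributes,
# drop script/style contents entirely, and leave non-ASCII text untouched.
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}          # per-tag attribute allowlist
DROP_CONTENT = {"script", "style"}       # drop these tags *and* their contents

class Whitelister(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return                       # drop the tag but keep its inner text
        kept = [(k, v) for k, v in attrs if k in ALLOWED_ATTRS.get(tag, set())]
        attr_text = "".join(f' {k}="{escape(v or "")}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # escape &, <, > but leave characters like "ã" exactly as they are
            self.out.append(escape(data, quote=False))

def sanitize(html_in: str) -> str:
    parser = Whitelister()
    parser.feed(html_in)
    parser.close()
    return "".join(parser.out)

print(sanitize('<p onclick="evil()">Olá, <b>mundo</b> ã <script>bad()</script></p>'))
# -> <p>Olá, <b>mundo</b> ã </p>
```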