I need a multilingual dictionary accessible through C++ which is capable of performing the following operation:
inputs: Language of Input Word, Input Word, Language of Output Definition
output: A string definition of the input word in the desired output language (NULL if the word is not found)
Some restrictions: This function needs to run in under 0.5 seconds on an iPhone 6, so only fast and slim web-based solutions or highly optimized local dictionary searches are suitable.
I have considered using the Bing Translate API to translate the definition of the word to the desired destination language. However, I have been unable to find a dictionary which will return a definition of a word given the language of the input word. Does such a system exist? If not, how could I go about implementing the system outlined here? Any and all suggestions and information are greatly appreciated.
Thanks in advance!
Here is how I solved this. I downloaded word lists for all of the supported languages. I checked the word list for the language of the given input word and, if the input word existed in that list, I used the Bing Translate API to get its definition in the destination language. Otherwise I returned NULL, as specified.
Here is a link to an English word list similar to the one I used:
http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt
This Microsoft site contains information about the Bing Translator API costs and how to get started:
https://datamarket.azure.com/dataset/bing/microsofttranslator
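A minimal sketch of the local lookup step in C++, assuming a newline-separated word list like the one linked above (the file name and the translation hand-off are placeholders; the actual Bing Translate call is omitted):

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Load a newline-separated word list into a hash set for O(1) average lookups.
std::unordered_set<std::string> loadWordList(const std::string& path) {
    std::unordered_set<std::string> words;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // strip CR from Windows line endings
        if (!line.empty()) words.insert(line);
    }
    return words;
}

int main() {
    // "wordsEn.txt" is the English list from the link above; each supported
    // language would get its own file loaded the same way.
    const auto english = loadWordList("wordsEn.txt");

    const std::string input = "dictionary";
    if (english.count(input)) {
        // Word exists: hand it off to the translation API of your choice here.
        std::cout << input << " found; request its definition via the API\n";
    } else {
        std::cout << input << " not found; return NULL to the caller\n";
    }
}
```

Loading each list once at startup and keeping it in an in-memory hash set keeps the per-word check far below the 0.5-second budget; only the network call to the translation API should take noticeable time.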
How can we train LUIS patterns to capture both plural and singular words?
I tried adding patterns like the one below:
I am looking for {coursename} course[s]
But it is not working.
Consider:
Intent: Training
Trained pattern: I am looking for {coursename} course
If the query is:
"I am looking for python courses"
I want LUIS to classify it under the "Training" intent. That is not happening, because "courses" is plural in the query while the pattern was trained on the singular form (course).
I need a suggestion for adjusting the trained pattern so that it handles plural forms of words.
Thank you
I generally wouldn't recommend training by patterns for this exact reason. They are either too restrictive, or if you try to account for more variations, they can become too broad. Is there a reason you are not just training it by utterances? In other words, for your training intent you can and should have phrases like:
I am looking for python courses
I am looking for python course
I want to take a class on nodejs
Do you have any java classes?
I want to learn C#
Can you teach me javascript?
Really, the point of LUIS is that you can continue to add phrases over time so that it becomes better at recognizing user intent. It's not looking for exact matches either, so in this case something like "I am looking for classes on ruby" should be recognized even though that exact combination was never specified.
It is unclear if you are using the pattern for entity detection as well, but again, you will fare better using other methods. If you only have a few values, a list entity works fine. If your list is large, varied, and/or may expand in the future, I would recommend using machine-learned entities. Basically, you would create a machine-learned entity and then go back through your utterances and label these entities. LUIS can then pick them up in the future based not only on the value but on the context of how it is used in the sentence. If you don't plan to scale, a list entity can be better, since you won't get false positives (e.g. you won't recognize puppies as an entity if someone says "I want to learn about puppies").
Most of this is basic functionality of LUIS, so if you browse the Microsoft documentation for LUIS (or Google it) you should find tons of additional information on how to use LUIS most effectively.
I want to incorporate machine learning into a project I've been working on, but I haven't seen anything about my intended use case. It seems like the old Pandora's Box project did something like this, but with textual input and output.
I want to train a model in real time as well as use it (and then switch it from testing to live API endpoints when it works well enough).
But every library I've found works like "feed in a data blob, get an answer".
I want to be able to stream data into it:
Instead of giving it "5,4,3,4,3,2,3,4,5" and having it say "1", "-1", or "0",
I want to give it "5", then "4", then "3", then "4", and so on, and have it respond each time.
I'm not even sure if "streaming" is the right word for this. Please help!
It sounds like a use case for recurrent neural networks, which translate sequences (your stream) into single outputs or other sequences. This is a well-explored approach, e.g., in natural language processing. TensorFlow has support for different flavors of such nets.
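To make the "one value in, one response out" idea concrete, here is a toy, hand-rolled sketch of a recurrent cell in C++. It is not TensorFlow, and the weights are arbitrary, untrained placeholders; it only illustrates how a recurrent model carries hidden state between calls so it can answer after every single input:

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// A toy Elman-style recurrent cell with a single hidden unit. The point is
// only the statefulness: each call consumes one input value, updates the
// hidden state, and immediately returns an output.
class TinyRecurrentCell {
public:
    double step(double x) {
        // h_t = tanh(w_x * x + w_h * h_{t-1} + b); the weights 0.5, 0.8, 0.1
        // are made-up placeholders -- a real model would learn them.
        hidden_ = std::tanh(0.5 * x + 0.8 * hidden_ + 0.1);
        return hidden_;  // in a trained net this would feed an output layer
    }

private:
    double hidden_ = 0.0;  // carried across calls: this is the "memory"
};

int main() {
    TinyRecurrentCell cell;
    const std::vector<double> stream = {5, 4, 3, 4, 3, 2, 3, 4, 5};
    for (double value : stream) {
        // One value in, one response out -- no need to batch the whole series.
        std::cout << "input " << value << " -> output " << cell.step(value) << '\n';
    }
}
```

In practice you would use a trained recurrent layer (e.g. an LSTM or GRU in TensorFlow) run in a stateful mode, which lets you feed one timestep at a time in exactly this fashion.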
Good morning,
I want to write a small tool (possibly with Java or C/C++?) that queries the entries listed on https://cve.mitre.org/data/downloads/index.html and filters out only certain relevant ones.
My questions:
1. Which format is the best one for parsing the data? In the text file, for example, all the information runs together, so I think a filter that searches for specific headers and specific lines will not work.
2. How do I get the information from one of the files locally onto my PC or onto a server?
3. How do I read and filter this information?
I'd recommend the JSoup Java library for fetching and parsing web pages. You can use a syntax very similar to jQuery for extracting data from the pages you've fetched.
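If you end up going the C/C++ route instead, the simplest workflow is to download one of the offered exports and filter it locally. A minimal sketch, assuming a local copy named allitems.csv (the file name and keyword here are only examples):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Reads a locally downloaded CVE export line by line and prints only the
// lines that mention a given keyword.
int main() {
    const std::string path = "allitems.csv";        // assumed local copy of a CVE export
    const std::string keyword = "buffer overflow";  // example filter

    std::ifstream in(path);
    if (!in) {
        std::cerr << "Could not open " << path << '\n';
        return 1;
    }

    std::string line;
    while (std::getline(in, line)) {
        if (line.find(keyword) != std::string::npos) {
            std::cout << line << '\n';  // a matching line
        }
    }
}
```

For anything beyond simple keyword matching, a proper CSV parser would be the natural next step.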
I am trying to write code that can take the source HTML of a web page and then decide what kind of web page it is. I am interested in deciding whether the page is about academic courses or not. A naive first approach I have is to check whether the text contains related words (course, instructor, teach, ...) and decide that it is about an academic course if there are enough hits.
Even so, I need some ideas on how to achieve that more effectively.
Any ideas would be appreciated.
Thanks in advance :)
Sorry for my English.
There are many approaches to classifying a text, but first the web page should be converted to plain text, either in a dumb way by removing all the HTML tags and reading what's left, or in a smarter way by identifying the main parts of the page that contain the useful text. In the latter case you can use HTML5 elements like <article>; read about the HTML5 structural elements here.
Then you can try any of the following methods, depending on how far you are really willing to go with your implementation:
Like you mentioned, a simple search for related words, but that would give you a very low success rate (a minimal sketch of this keyword-counting approach follows this list).
Improve the solution above by passing the tokens of the text to a lexical analyzer and focusing on the nouns; nouns usually carry the most value (I will try to find the source for this, but I'm sure I read it somewhere while implementing a similar project). This might improve the rate a little.
Improve further by looking at the origin of each word; you can use a morphological analyzer to do so, and this way you can tell that the word "papers" is the same as "paper". That can improve things a little.
You can also use a word ontology like WordNet, and then check whether the words in the document are descendants of one of the words you're looking for, or the other way around; but going up means generalizing, which would affect the precision. E.g. you can tell that the word "kitten" is related to the word "cat", so you can assume that since the document talks about "kittens" it also talks about "cats".
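A minimal sketch of the first approach above (plain keyword counting) in C++, assuming a hypothetical list of course-related keywords and a page that has already been stripped to plain text; the threshold is an arbitrary example value:

```cpp
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_set>

// Counts how many tokens of the (already tag-stripped) page text appear in a
// hand-picked keyword list and compares the count against a threshold.
bool looksLikeCoursePage(const std::string& plainText,
                         const std::unordered_set<std::string>& keywords,
                         int threshold) {
    std::istringstream tokens(plainText);
    std::string word;
    int hits = 0;
    while (tokens >> word) {
        // Lowercase and strip trailing punctuation so "Course," matches "course".
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        while (!word.empty() && std::ispunct(static_cast<unsigned char>(word.back())))
            word.pop_back();
        if (keywords.count(word)) ++hits;
    }
    return hits >= threshold;
}

int main() {
    const std::unordered_set<std::string> keywords = {
        "course", "courses", "instructor", "syllabus", "lecture", "teach"};
    const std::string page =
        "This course is taught by an instructor; see the syllabus for lecture times.";
    std::cout << (looksLikeCoursePage(page, keywords, 3) ? "course page" : "other page")
              << '\n';
}
```

As the later items in the list point out, stemming or an ontology would let "taught" count toward "teach" as well, which plain matching misses.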
All of the above depends on you defining a list of keywords to base your decision on. But life usually doesn't work that way, which is why we use machine learning. The basic idea is that you collect a set of documents and manually tag/categorize/classify them, then feed those documents to your program as a training set and let it learn from them; afterwards, your program will be able to apply what it learned when tagging other, untagged documents. If you decide to go with this option, you can check this SO question and this Quora question; the possibilities are endless.
Assuming you speak Arabic, I would share a paper from the project I worked on here if you're interested, but it is in Arabic and deals with the challenges of classifying Arabic text.
As a C programmer I know nothing about web programming, but I would make sure it checks the domain name suffix: .edu is the one most universities use, .gov is for government pages, and so on, so there may be no need to scan the page at all. But surely the way to achieve the highest accuracy is to use these methods and also create a way for users to correct the app; this info can be hosted on a web server, and a page can be cross-referenced against that database. It's always great to use your customers as an improvement tool!
Another way would be to see if you can cross-reference it with search engines that categorise pages in their index. For example, Google collates academic abstracts in Google Scholar. You could see if the web page is present in that database.
Hope this helped! If I have any other ideas you will be the first to know!
Run the text through a sequence-finding algorithm.
Basics of the algorithm: you take a number of web pages that are definitely related to academic courses, clean them, and search them for frequently occurring word sequences (2-5 words). Then, by hand, remove the common word sequences that are not directly related to academic courses. By examining how many of those sequences occur in some web page, you can determine with some precision whether its contents are closely related to the source of the test word sequences.
Note: tested web pages must be properly cleaned up. Strip the page contents of anything unrelated: delete link and script tags and their contents, remove the tags themselves (but keep the text in images' alt/title attributes), and so on. The context to examine should be the title, the meta keywords and description, plus the cleaned page contents. The next step is to stem the text.
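A minimal sketch of the sequence-counting idea in C++, assuming the pages have already been cleaned to plain text as described: it collects two-word sequences (bigrams) and their frequencies, which you would then prune by hand and compare against new pages.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits cleaned text into tokens and counts every adjacent two-word sequence.
// Sequences of 3-5 words work the same way, just with a wider window.
std::map<std::string, int> countBigrams(const std::string& cleanedText) {
    std::istringstream in(cleanedText);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) tokens.push_back(word);

    std::map<std::string, int> counts;
    for (size_t i = 0; i + 1 < tokens.size(); ++i) {
        counts[tokens[i] + " " + tokens[i + 1]] += 1;
    }
    return counts;
}

int main() {
    // Imagine this is the concatenated, cleaned text of known course pages.
    const std::string corpus =
        "course syllabus and lecture notes course syllabus for the spring term";
    for (const auto& [sequence, count] : countBigrams(corpus)) {
        if (count > 1) {  // keep only frequently met sequences for manual review
            std::cout << sequence << " -> " << count << '\n';
        }
    }
}
```

Scoring a new page is then just the reverse: count how many of the retained sequences appear in its cleaned text.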
Does anyone know of a way to get Word documents (including their images) into a wiki? I think one of the stumbling blocks for people to embrace a wiki is the fact that they need to upload the images to the wiki separately instead of just doing a simple copy/paste as they would in a Word document.
OpenOffice can save as a MediaWiki text file, ready to be pasted into the edit box online.
Microsoft has an add-on for Word 2007 and Word 2010 to save in MediaWiki markup. It can be downloaded from http://www.microsoft.com/download/en/details.aspx?id=12298
The guys at mindtouch.com told me they have a tool that imports Word docs into their wiki.
You did not specify which wiki you use; here are some links, mostly related to MediaWiki:
You can use wikEd, which allows you to paste formatted text.
Help:WordToWiki lists three alternative ways.
Extension:Word2MediaWikiPlus also includes automatic image upload.
The folks at webworks.com have a nice desktop software solution that automatically publishes Word documents to a wiki. Currently they support, out of the box: Confluence, MediaWiki, and MoinMoin.