What data structure would you use to implement a dictionary? - c++

So I had an interview question about what data structure I would use to implement a dictionary. I said a map because you can enter the word and get the definition in return.
The follow-up question was what data structure I would use to implement a system where the user only wants to look up a certain number of the starting letters of a word (i.e. the first 3 letters), so that the user can get a list of, for example, all the words in the dictionary that start with "fic".
I said something about a binary tree, but really I had NO idea; the interviewer said the answer was TYPEAHEAD, something like what Microsoft Visual Studio's IntelliSense does. I had no idea what that was, and I have tried to look it up on Google but am getting some weird search results. Even though I have no idea what this is and have never used it before, it is definitely an interesting question.
Does anyone know how to do this? What kind of data structure and what would be the implementation method?
Edit: I'm confused why this is being closed. Is this not a programming question?

Use a trie, a type of tree. Follow this link for a solution and better understanding: Trie
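For illustration, here is a minimal C++ sketch of a trie with a prefix query; the names (TrieNode, wordsWithPrefix) are made up for this example, and it assumes whole words are inserted as plain strings:

#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Minimal trie sketch: insert whole words, then collect every word that
// starts with a given prefix (e.g. "fic" -> fiction, fickle, ficus).
struct TrieNode {
    bool isWord = false;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWord = true;
}

// Depth-first walk that appends every complete word below 'node'.
void collect(const TrieNode& node, std::string& prefix, std::vector<std::string>& out) {
    if (node.isWord) out.push_back(prefix);
    for (const auto& [c, child] : node.children) {
        prefix.push_back(c);
        collect(*child, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> wordsWithPrefix(const TrieNode& root, std::string prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return {};   // no word starts with this prefix
        node = it->second.get();
    }
    std::vector<std::string> out;
    collect(*node, prefix, out);
    return out;
}

int main() {
    TrieNode root;
    for (const char* w : {"fiction", "fickle", "ficus", "fish"}) insert(root, w);
    for (const std::string& w : wordsWithPrefix(root, "fic")) std::cout << w << '\n';
}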


C++ - Is std::string suitable for storing a large text file, and if not what is the best data type for doing so? [closed]

I was just wondering, what is the best data type for storing the contents of a text-based file? Is std::string suitable for keeping the contents of a larger file in memory?
I'm making an editor of sorts right now, so I'd like to know; I can't seem to find a good answer.
Edit: Yeah, this was a very vague question and I didn't expect it to get quite as much attention. Saying it's an editor is kind of a bad description, and the question is quite vague; I was just wondering how to store read-only text in memory, and whether std::string is a bad way to do so, i.e. whether it is inefficient or not.
The "editor of sorts" is probably the important thing: if the text were read-only, you could consider using mmap. I don't have enough experience with memory mapped files to know if they're appropriate for text editors, however.
There are data structures more suited to modifying large chunks of text. A rope is a binary tree with short text strings at the leaf nodes... operations on a string such as appending some text might cause the leaf node to be split and the appended text added into the right-hand new node. This has the advantage that existing strings don't always need to be repeatedly moved or grown as the text document is modified.
Another alternative is a simpler structure called a gap buffer. This effectively uses three strings to hold your text, a prefix, a postfix and a pre-sized gap. When the user starts work on a section of text, the document is split into the prefix and postfix strings, and a new gap buffer is allocated. The text the user adds is pushed into the gap buffer which may be expanded as needed. When they move on to a different point in the document, the gap buffer is merged with the other strings and a new gap is created. The assumption here is that most of the document will be static, with most edits occurring around a specific location in the document at any given time, minimising string copies, moves and reallocations.
Emacs uses gap buffers, which suggests they're not a bad place to start. There's plenty of discussion (and comparison) of the two datastructures out there, and you may even be able to find perfectly useable implementations already available. Implementing your own gap buffer should be dead easy.
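To make the idea concrete, here is a minimal, illustrative gap-buffer sketch in C++ (the class and method names are my own, and a real editor would track line boundaries, undo, and so on on top of this):

#include <cstddef>
#include <iostream>
#include <string>

// Minimal gap-buffer sketch: the text lives in one std::string with an unused
// "gap" region in the middle. Typing fills the gap; moving the cursor slides
// the gap to the new position, so edits near the cursor are cheap.
class GapBuffer {
public:
    explicit GapBuffer(const std::string& text, std::size_t gapSize = 32)
        : buf_(text), gapStart_(text.size()), gapEnd_(text.size()) {
        buf_.resize(text.size() + gapSize);          // reserve the initial gap at the end
        gapEnd_ = buf_.size();
    }

    // Slide the gap so that it starts at 'pos' (an index into the logical text).
    void moveGapTo(std::size_t pos) {
        while (gapStart_ > pos) buf_[--gapEnd_] = buf_[--gapStart_];   // move gap left
        while (gapStart_ < pos) buf_[gapStart_++] = buf_[gapEnd_++];   // move gap right
    }

    // Insert a character at the gap, growing the gap if it is full.
    void insert(char c) {
        if (gapStart_ == gapEnd_) grow();
        buf_[gapStart_++] = c;
    }

    // Reassemble the logical text: prefix before the gap plus postfix after it.
    std::string text() const {
        return buf_.substr(0, gapStart_) + buf_.substr(gapEnd_);
    }

private:
    void grow() {
        const std::size_t extra = buf_.size() + 1;   // roughly double the buffer
        buf_.insert(gapEnd_, extra, '\0');           // widen the gap in place
        gapEnd_ += extra;
    }

    std::string buf_;
    std::size_t gapStart_, gapEnd_;
};

int main() {
    GapBuffer gb("hello world");
    gb.moveGapTo(5);                                 // put the cursor after "hello"
    for (char c : std::string(", dear")) gb.insert(c);
    std::cout << gb.text() << '\n';                  // prints: hello, dear world
}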
Possibly useful reading: Gap Buffers, or, Don’t Get Tied Up With Ropes? (which includes some profiling information), original SGI C++ library Rope docs
Well, for a vague question, my answer is that std::string will probably suit you well.
But there are many ways to store this; it depends on your development requirements.
Edit (complementary answer to the edited question): No, it's not inefficient at all. It's quite suitable for generic use and excellent for read-only access.
This is a vague question that is why you can't find a good answer. It is more about what you do with this text file. If the text file is small enough to be stored in memory then sure you can store it in a string. But then how are you going to use it? What does this do for you?
Are you going to use regex to find certain words? Sure, you can do that, but it may be slow.
Is the text file a webpage (source)? Sure, you can do that and search for the tags you are looking for. There might be better ways, like putting it into an XML tree and searching for the tags, but the ONE string should still work.
Anyway this is a tough question to answer because we don't know what you are using the string for in the first place.
If you just need it whole and intact, and you have enough memory to store it in a string, then sure.

AI: What kind of process would sites like Wit use to train Natural language [closed]

I am working on a project where I would like to achieve a sense of natural language understanding. However, I am going to start small and would like to train it on specific queries.
So for example, starting out I might tell it:
songs.
Then if it sees a sentence like "Kanye Wests songs" it can match against that.
BUT then I would like to give it some extra sentences that could mean the same thing, so that it eventually learns to be able to predict which trained set an unknown sentence belongs to.
So I might add the sentence: "Songs by
And of course would be a database of names it can match against.
I came across a neat website, Wit.ai, that does something like what I'm talking about. However, they resolve their matches to an intent, whereas I would like to match to a simplified query or, BETTER, a database-like query (like Facebook Graph Search).
I understand a context-free grammar would work well for this (anything else?). But what are good methods to train several CFGs that I say have similar meaning, so that when the system sees unknown sentences it can try to predict which one they match?
Any thoughts would be great.
Basically I would like to be able to take a natural language sentence and convert it to some form that can be better understood by my system and presented to the user in a nice way. Not sure if there is a better Stack Exchange for this!
To begin with, I think SO is quite well-suited for this question (I checked Area 51, there is no stackexchange for NLP).
Under the assumption that you are already familiar with the usual training of PCFG grammars, I am going to move into some specifics that might help you achieve your goal:
Any grammar trained on a corpus is going to be dependent on the words in that training corpus. The poor performance on unknown words is a well-known issue in not just PCFG training, but in pretty much any probabilistic learning framework. What we can do, however, is to look at the problem as a paraphrasing issue. After all, you want to group together sentences that have the same meaning, right?
Recent research on detecting sentences or phrases that have the same (or similar) meaning has employed a technique known as distributional similarity. It aims at improving probability estimation for unseen co-occurrences. The basic concept is:
words or phrases that share the same distribution—the same set of words in the same context in a corpus—tend to have similar meanings.
You can use only intrinsic features (e.g. production rules in PCFG) or bolster such features with additional semantic knowledge (e.g. ontologies like FreeBase). Using additional semantic knowledge enables generation of more complex sentences/phrases with similar meanings, but such methods usually work well only for specific domains. So, if you want your system to work well only for music, it's a good idea.
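As a toy illustration of the distributional idea (not the actual algorithms from the survey linked below), the following C++ sketch builds a small context-count vector for each target word from a hand-made corpus and compares targets with cosine similarity; the corpus and all names are invented for the example:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Bag-of-context-words vector for a target word: counts of words seen within
// a +/-2 token window around every occurrence of the target.
using ContextVector = std::map<std::string, double>;

ContextVector contextsOf(const std::string& target,
                         const std::vector<std::vector<std::string>>& sentences,
                         int window = 2) {
    ContextVector counts;
    for (const auto& sent : sentences) {
        for (int i = 0; i < (int)sent.size(); ++i) {
            if (sent[i] != target) continue;
            for (int j = std::max(0, i - window);
                 j < std::min<int>(sent.size(), i + window + 1); ++j) {
                if (j != i) counts[sent[j]] += 1.0;
            }
        }
    }
    return counts;
}

// Cosine similarity between two sparse context vectors.
double cosine(const ContextVector& a, const ContextVector& b) {
    double dot = 0, na = 0, nb = 0;
    for (const auto& [w, v] : a) {
        na += v * v;
        auto it = b.find(w);
        if (it != b.end()) dot += v * it->second;
    }
    for (const auto& kv : b) nb += kv.second * kv.second;
    return (na == 0 || nb == 0) ? 0.0 : dot / (std::sqrt(na) * std::sqrt(nb));
}

std::vector<std::string> tokenize(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> out;
    for (std::string tok; in >> tok; ) out.push_back(tok);
    return out;
}

int main() {
    std::vector<std::vector<std::string>> corpus = {
        tokenize("play songs by the artist"),
        tokenize("play tracks by the artist"),
        tokenize("songs by kanye west"),
        tokenize("tracks by kanye west"),
        tokenize("the weather is sunny today"),
    };
    auto songs   = contextsOf("songs", corpus);
    auto tracks  = contextsOf("tracks", corpus);
    auto weather = contextsOf("weather", corpus);
    std::cout << "songs~tracks:  " << cosine(songs, tracks)  << '\n';   // high
    std::cout << "songs~weather: " << cosine(songs, weather) << '\n';   // low
}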
Reproducing the actual distributional similarity algorithms will make this answer insanely long, so here's a link to an excellent article:
Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods by Madnani and Dorr.
For your work, you will only need to go through section 3.2: Paraphrasing Using a Single Monolingual Corpus. I believe the algorithm labeled as 'Algorithm 1' in this paper will be useful to you. I am not aware of any publicly available tool/code that does this, however.

How can we implement a C++ class such that it allows us to add data members at runtime

How can we implement a C++ class such that it allows us to add data members at runtime? This question was asked in an interview.
I find that to be quite an interesting interview question. If I were asking it, I would hope for it to be the starter of a back-and-forth conversation. The interviewee would have to know that you cannot add members dynamically, but I would hope for questions about the actual problem to solve (what do you want to solve? why do you need to add members?) and proposed solutions that would tackle those situations.
Note that there are two ways of looking at an interview question: looking for facts and looking for problem-solving abilities. This is one question where both of them can be exercised (then again, it is up to the interviewer to lead the conversation; i.e. if the person just answers that it's impossible, the interviewer can continue with "but I need to be able to...").
You cannot do that in C++; there aren't many languages you could do it in. You could have some sort of simulation of it with maps and suchlike, but it's not the same thing.
I hope you just answered no.
Obviously you cannot change the class vtable at runtime - that would be silly - but you can use a single member variable that allows you to add entries to it; an STL map would be a good choice here.
As a map stores name-value pairs, you would use it to store data comprising the 'data member' name and the corresponding data value.
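A minimal sketch of that simulation, assuming C++17 (the answer only mentions a map; using std::any for the values is my own addition, to hold differently typed 'members'):

#include <any>
#include <iostream>
#include <map>
#include <string>

// One way to simulate "adding data members at runtime" (a sketch, not real
// dynamic members): keep a map from member name to a type-erased value.
class DynamicObject {
public:
    template <typename T>
    void set(const std::string& name, T value) { members_[name] = std::move(value); }

    template <typename T>
    T get(const std::string& name) const {
        return std::any_cast<T>(members_.at(name));   // throws if missing or wrong type
    }

    bool has(const std::string& name) const { return members_.count(name) != 0; }

private:
    std::map<std::string, std::any> members_;
};

int main() {
    DynamicObject obj;
    obj.set("age", 42);                       // "add" an int member at runtime
    obj.set("name", std::string("Alice"));    // and a string member
    std::cout << obj.get<std::string>("name") << " is "
              << obj.get<int>("age") << '\n';
}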
I hope you asked "what do you mean?" to that interview question and got them to explain just what they were thinking.
You cannot add data members in C++. You should look into containers for C++, perhaps std::map (http://www.cplusplus.com/reference/stl/map/), which allows you to tag data with names.

Code for identifying programming language in a text file [closed]

I'm supposed to write code which, when given a text file (source code) as input, will output which programming language it is. This is the most basic definition of the problem. More constraints follow:
I must write this in C++.
A wide variety of languages should be recognized - HTML, PHP, Perl, Ruby, C, C++, Java, C#...
The number of false positives (wrong recognitions) should be low - it's better to output "unknown" than a wrong result (it will appear in the list of probabilities, for example as unknown: 100%; see below).
The output should be a list of probabilities for each language the code knows, so if it knows C, Java and Perl, the output should be for example: C: 70%, Java: 50%, Perl: 30% (note there is no need to have the probabilities sum up to 100%)
It should have a good ratio of accuracy/speed (speed is a bit more favored)
It would be very nice if the code could be written in a way that adding new languages for recognition is fairly easy and involves just adding "settings/data" for that particular language. I can use anything available - a heuristic, a neural network, black magic. Anything. I'm even allowed to use existing solutions, but: the solution must be free, open source, and allow commercial usage. It must come in the form of easily integrable source code or as a static library - no DLLs. However, I prefer writing my own code or just using fragments of another solution; I'm fed up with integrating the code of others. Last note: maybe some of you will suggest FANN (Fast Artificial Neural Network Library) - this is the only thing I cannot use, since this is the thing we use ALREADY and we want to replace it.
Now the question is: how would you handle such a task, what would you do? Any suggestions how to implement this or what to use?
EDIT: Based on the comments and answers I must emphasize some things I forgot: speed is crucial, since this will get thousands of files and is supposed to answer fast, so looking at a thousand files should produce answers for all of them in a few seconds at most (the files will of course be small, a few kB each). So trying to compile each one is out of the question. The thing is, I really want probabilities for each language - I'd rather know that a file is likely to be C or C++ but that the chance it is a bash script is very low. Due to code obfuscation, comments, etc., I think that looking for 100% accuracy is a bad idea and in fact is not the goal of this.
You have a document classification problem. I suggest you read about naive Bayes classifiers and support vector machines. The articles link to libraries which implement these algorithms, and many of them have C++ interfaces.
One simple solution I can think of is to just identify the keywords used in different languages. Each identified word scores +1. Then calculate the ratio = identified_words / total_words. The language that gets the highest score is the winner. Of course there are problems, like the usage of comments etc., but I think this is a very simple solution that should work in most cases.
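A rough C++ sketch of this keyword-scoring idea (the keyword lists are tiny invented samples, and tokenising on whitespace is obviously crude):

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Score each language by the ratio of tokens that match its keyword list.
// The lists below are illustrative samples, not a tuned set for real detection.
std::map<std::string, double> scoreLanguages(const std::string& source) {
    static const std::map<std::string, std::set<std::string>> keywords = {
        {"C++",  {"#include", "template", "namespace", "class", "std::cout"}},
        {"Java", {"import", "public", "class", "package", "System.out.println"}},
        {"Perl", {"my", "sub", "use", "print", "=~"}},
    };

    // Tokenise on whitespace and count hits per language.
    std::istringstream in(source);
    std::vector<std::string> tokens;
    for (std::string tok; in >> tok; ) tokens.push_back(tok);

    std::map<std::string, double> scores;
    for (const auto& [lang, words] : keywords) {
        int hits = 0;
        for (const auto& tok : tokens)
            if (words.count(tok)) ++hits;
        scores[lang] = tokens.empty() ? 0.0 : 100.0 * hits / tokens.size();
    }
    return scores;
}

int main() {
    const std::string sample = "#include <iostream>\nint main() { std::cout << 1; }";
    for (const auto& [lang, score] : scoreLanguages(sample))
        std::cout << lang << ": " << score << "%\n";   // C++ should score highest
}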
If you know that the source files will conform to standards, file extensions are unique to just about every language. I assume that you've already considered this and ruled it out based on some other information.
If you can't use file extensions, the best way would be to find the things between languages that are most different and use those to determine the file type. For example, for-loop syntax won't vary much between languages, but package include statements should. If you have a file including java.util.*, then you know it's a Java file.
I'm sorry, but if you have to parse thousands of files, then your best bet is to look at the file extension. Don't over-engineer a simple problem, or put burdensome requirements on a simple task.
It sounds like you have thousands of files of source code and you have no idea what programming language they were written in. What kind of programming environment do you work in? (Ruling out the possibility of an artificial homework requirement.) I mean, one of the basics of software engineering that I can always rely on is that C++ code files have the .cpp extension, Java code files have the .java extension, C code files have the .c extension, etc. Is your company playing fast and loose with these conventions? If so, I would be really worried.
As dmckee suggested, you might want to have a look at the Unix file program, whose source is available. The heuristics used by this utility might be a great source of inspiration. Since it is written in C, I guess that it qualifies for C++. :) You do not get confidence percentages directly, though; maybe they are used internally?
Take a look at nedit. It has a syntax highlighting recognition system, under Syntax Highlighting->Recognition Patterns. You can browse sample recognition patterns here, or download the program and check out the standard ones.
Here's a description of the highlighting system.
Since the list of languages is known upfront you know the syntax/grammar for each of them.
Hence you can, as an example, write a function to extract reserved words from the provided source code.
Build a binary tree that will have all reserved words for all languages that you support. And then just walk that tree with the extracted reserved words from the previous step.
If in the end you only have 1 possibility left - this is your language.
If you reach the end of the program too soon, then (from where you stopped) you can analyse your position in the tree to work out which languages are still possibilities.
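Here is one way that elimination idea might look, sketched in C++ with per-language reserved-word sets instead of an explicit tree; the word lists are small invented samples:

#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Candidate-elimination sketch: start with every supported language, and drop
// a language whenever the source uses a reserved word that some language has
// but that one does not. Word lists are illustrative samples only.
std::set<std::string> possibleLanguages(const std::string& source) {
    static const std::map<std::string, std::set<std::string>> reserved = {
        {"C",    {"int", "return", "struct", "printf"}},
        {"C++",  {"int", "return", "struct", "class", "template"}},
        {"Java", {"int", "return", "class", "package", "import"}},
    };

    std::set<std::string> candidates;
    for (const auto& entry : reserved) candidates.insert(entry.first);

    std::istringstream in(source);
    for (std::string tok; in >> tok; ) {
        // Is this token a reserved word of at least one language we know?
        bool known = false;
        for (const auto& entry : reserved)
            if (entry.second.count(tok)) { known = true; break; }
        if (!known) continue;

        // Eliminate candidates that do not have this reserved word.
        for (auto it = candidates.begin(); it != candidates.end(); )
            it = reserved.at(*it).count(tok) ? std::next(it) : candidates.erase(it);
    }
    return candidates;   // one entry left = confident answer; several = still ambiguous
}

int main() {
    for (const auto& lang : possibleLanguages("template class int return"))
        std::cout << lang << '\n';   // prints: C++
}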
You could try to think about the differences between languages and model these with a binary tree, like "is feature X found?" - if yes, proceed in one direction; if not, proceed in the other.
By constructing this search tree efficiently you could end up with rather fast code.
This one is not fast and may not satisfy your requirements, but it's just an idea. It should be easy to implement and should give a 100% correct result.
You could try to compile/execute the input text with different compilers/interpreters (open source or free) and check for errors behind the scenes.
The Sequitur algorithm infers context-free grammars from sequences of terminal symbols. Perhaps you could use that to compare against a set of known production rules for each language.

Regular expression generator/reducer? [closed]

I was posed an interesting question from a colleague for an operational pain point we currently have, and am curious if there's anything out there (utility/library/algorithm) that might help automate this.
Say you have a list of literal values (in our case, they are URLs). What we want to do is, based on this list, come up with a single regex that matches all of those literal items.
So, if my list is:
http://www.example.com
http://www.example.com/subdir
http://foo.example.com
The simplest answer is
^(http://www.example.com|http://www.example.com/subdir|http://foo.example.com)$
but this gets large for lots of data, and we have a length limit we're trying to stay under.
Currently we write the regexes manually, but this doesn't scale very well, nor is it a great use of anyone's time. Is there a more automated way of decomposing the source data to come up with a length-optimal regex that matches all of the source values?
The Aho-Corasick matching algorithm constructs a finite automaton to match multiple strings. You could convert the automaton to its equivalent regex but it is simpler to use the automaton directly (this is what the algorithm does.)
Today I was searching for that. I didn't find it, so I created a tool: kemio.com.ar/tools/lst-trie-re.php
You put a list on the right side, submit it, and get the regexp on the left.
I tried it with a 6 KB list of words, and it produced a 4 KB regexp (that I put in a JS file) like: var re=new RegExp(/..../,"mib");
Please don't abuse it.
The Emacs utility function regexp-opt (source code) does not do exactly what you want (it only works on fixed strings), but it might be a useful starting point.
If you want to compare against all the strings in a set and only against those, use a trie, or compressed trie, or even better a directed acyclic word graph. The latter should be particularly efficient for URLs IMO.
You would have to abandon regexps though.
I think it would make sense to take a step back and think about what you're doing, and why.
To match all those URLs, only those URLs and no other, you don't need a regexp; you can probably get acceptable performance from doing exact string comparisons over each item in your list of URLs.
If you do need regexps, then what are the variable differences you're trying to accommodate? I.e. which part of the input must match verbatim, and where is there wiggle room?
If you really do want to use a regexp to match a fixed list of strings, perhaps for performance reasons, then it should be simple enough to write a method that glues all your input strings together as alternatives, as in your example. The state machine doing regexp matching behind the scenes is quite clever and will not run more slowly if your match alternatives have common (and thus possibly redundant) substrings.
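For example, a small C++ sketch of that gluing step, which escapes regex metacharacters and joins the literals with '|' (no attempt is made to factor out common prefixes, so it is not length-optimal):

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Escape characters that are special in ECMAScript regex syntax so each
// literal matches itself exactly.
std::string escapeLiteral(const std::string& s) {
    static const std::string meta = R"(\.^$|()[]{}*+?)";
    std::string out;
    for (char c : s) {
        if (meta.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Glue the escaped literals together as anchored alternatives: ^(a|b|c)$
std::string buildAlternation(const std::vector<std::string>& literals) {
    std::string re = "^(";
    for (std::size_t i = 0; i < literals.size(); ++i) {
        if (i) re += '|';
        re += escapeLiteral(literals[i]);
    }
    return re + ")$";
}

int main() {
    std::vector<std::string> urls = {
        "http://www.example.com",
        "http://www.example.com/subdir",
        "http://foo.example.com",
    };
    const std::string pattern = buildAlternation(urls);
    std::cout << pattern << '\n';
    std::regex re(pattern);
    std::cout << std::boolalpha
              << std::regex_match("http://foo.example.com", re) << '\n';  // true
}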
Taking the cue from the other two answers: if all you need to match is only the strings supplied, you are probably better off doing a straight string match (slow) or constructing a simple FSM that matches those strings (fast).
A regex engine actually creates an FSM and then matches your input against it, so if the inputs are from a previously known set, it is possible and often easier to make the FSM yourself instead of trying to auto-generate a regex.
Aho-Corasick has already been suggested. It is fast, but can be tricky to implement. How about putting all the strings in a trie and then querying on that instead (since you are matching entire strings, not searching for substrings)?
An easy way to do this is to use Python's hachoir_regex module:
import hachoir_regex            # import implied by the hachoir_regex.parse() calls below
from functools import reduce    # needed on Python 3 (reduce is a builtin on Python 2)

urls = ['http://www.example.com', 'http://www.example.com/subdir', 'http://foo.example.com']
as_regex = [hachoir_regex.parse(url) for url in urls]   # one regex object per literal URL
reduce(lambda x, y: x | y, as_regex)                    # OR the regexes together; the library simplifies the union
creates the simplified regular expression
http://(www.example.com(|/subdir)|foo.example.com)
The code first creates a simple regex type for each URL, then concatenates these with | in the reduce step.