How to retrieve map nodes with keys not scalars? - c++

I want to use yaml-cpp intensively in a C++ project, as it fits my needs perfectly. But I want to update one node from another node, i.e. properly add non-existent nodes from one node to another, or replace the values of existing ones. I can't find how to do this simply with the current interface...
So I tried to do this with a simple loop over an iterator. I found out that the following does not work while traversing a map node:
if (node_1[it->first]) /*...*/
It does not find any node! For map nodes with scalars as keys, the test if (node_1[it->first.Scalar()]) /*...*/ works well. My problem is doing the same with sequence keys. How can I do that?
EDIT
Here is an example of YAML document:
---
#Fake entry
Time: 0.1.1.2
ID: 25814
Emitter: Me
Line : {
orig: 314,
st: 512
}
Message : |
  This is a fake error
#More difficult
[0,1,2,3] : my_address
[5, 6, 7, 8] : an_address
...
This document loads without any problem into a Node, say doc1. I now want to modify some entries according to another YAML document, such as:
---
Comment: what a dummy file!
Emitter: You
[0,1,2,3] : address changed
...
So I load this second document into a Node doc2, and I want to update doc1 with the nodes of doc2. The first key of doc2 is not present in doc1 and is a scalar, so I can do doc1[it->first.Scalar()] = it->second. The second key is present, so the same instruction will update doc1, replacing the value linked with key Emitter. My problem is that I cannot find the third key inside doc1, as it is a sequence.

yaml-cpp doesn't offer generic equality testing for nodes, which is why your initial solution (which would have the best bet of working) didn't work.
Instead, yaml-cpp relies on typed equality testing. E.g., node[5] will convert all keys to integers to check key equality; it won't convert 5 to a node and then check equality that way. This is why your other solution will usually work - most of your keys are simple scalars, so they can match using std::string equality.
It appears you really want to "merge" two nodes; this has been on the yaml-cpp issues list for a while: https://code.google.com/p/yaml-cpp/issues/detail?id=41, and there's some discussion there which explains why it's a hard problem.
As a possible workaround, if you know the type of each node, you can explicitly cast before doing your comparison, e.g.:
doc1[it->first.as<T>()] = it->second;
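The idea behind that workaround (convert each key to a concrete, comparable type before looking it up) can be sketched in a language-neutral way. The following is only an illustration in Python, not yaml-cpp code: the documents are modelled as plain lists of (key, value) pairs, and the helper names typed_key and merge are hypothetical.

```python
def typed_key(key):
    """Convert a key into a hashable value; sequence keys become tuples."""
    if isinstance(key, list):
        return tuple(typed_key(k) for k in key)
    return key


def merge(doc1, doc2):
    """Update doc1 with the entries of doc2, matching keys by typed value.

    Both documents are lists of (key, value) pairs, so that keys which are
    sequences (unhashable lists) can be represented as well.
    """
    # Map each typed key to its position in doc1 for quick lookup.
    index = {typed_key(k): i for i, (k, _) in enumerate(doc1)}
    for key, value in doc2:
        tk = typed_key(key)
        if tk in index:
            doc1[index[tk]] = (key, value)   # replace the existing value
        else:
            index[tk] = len(doc1)
            doc1.append((key, value))        # add the missing entry
    return doc1
```

With the two example documents, the Emitter and [0,1,2,3] entries would be replaced and the Comment entry appended. In yaml-cpp the equivalent of typed_key is the explicit as<T>() conversion, which you have to pick per key.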

Related

Adding a node and edge to a graph using Gremlin behaving strange

I'm new to using Gremlin (up until now I was accessing Neptune using openCypher and gave up due to how slow it was) and I'm getting really confused over some stuff here.
Basically what I'm trying to do is -
Let us say we have some graph A-->B-->C. There are multiple such graphs in the database, so I'm looking for the specific A,B,C nodes that have the property 'idx' equals '1'. I want to add a node D{'idx' = '1'} and an edge so I will end up having
A-->B-->C-->D
It is safe to assume A,B,C already exist and are connected together.
Also, we wish to add D only if it doesn't already exist.
So what I currently have is this:
g.V().
hasLabel('A').has('idx', '1').
out().hasLabel('B').has('idx', '1').
out().hasLabel('C').has('idx', '1').as('c').
V().hasLabel('D').has('idx', '1').fold().
coalesce(
unfold(),
addV('D').property('idx','1')).as('d').
addE('TEST_EDGE').from('c').to('d')
Now the problem is that this doesn't work, and I don't understand Gremlin well enough to see why. Neptune returns "An unexpected error has occurred in Neptune" with the code "InternalFailureException".
Another thing to mention is that if the node D does exist, I don't get an error at all, and in fact the node is properly connected to the graph as it should be.
Furthermore, I've seen in a different post that using ".as('c')" shouldn't work since there is a 'fold' step afterwards which makes it unusable (for a reason I still don't understand, probably because I'm not sure how .as, .store and .aggregate work).
That post suggests using ".aggregate('c')" instead, but doing so changes the returned error to "addE(TEST_EDGE) could not find a Vertex for from() - encountered: BulkSet". This, added to the fact that the code I wrote actually works and connects node D to the graph if it already exists, makes me even more confused.
So I'm lost
Any help or clarification or explanation or simplification would be much appreciated
Thank you! :)
A few comments before getting to the query:
If the intent is to have multiple subgraphs of (A->B->C), then you may not want to use this labeling scheme. Labels are meant to be of lower variation - think of labels as groups of vertices of the same "type".
A lookup of a vertex by an ID is the fastest way to find a vertex in a TinkerPop-based graph database. Just be aware of that as you build your access patterns. Instead of doing something like hasLabel('x').has('idx','y'), if both of those items combined make a unique vertex, you may also want to think of creating a composite ID of something like 'x-y' for that vertex for faster access/lookup.
On the query...
The first part of the query looks good. I think you have a good understanding of the imperative nature of Gremlin just up until you get to the second V() in the query. That V() is going to tell Neptune to start evaluating against all vertices in the graph again. But we want to continue evaluating beyond the 'C' vertex.
Unless you need to return an output in either case of existence or non-existence, you could get away with just doing the following without a coalesce() step:
g.V().
hasLabel('A').has('idx', '1').
out().hasLabel('B').has('idx', '1').
out().hasLabel('C').has('idx', '1').
where(not(out().hasLabel('D').has('idx','1'))).
addE('TEST_EDGE').to(
addV('D').property('idx','1'))
The where clause allows us to check for the non-existence of a downstream edge and vertex without losing our place in the traversal. Because the condition is wrapped in not(), the traversal only continues if no such 'D' vertex is found. In that case it continues from where we left off (the 'C' vertex), so we can feed that 'C' vertex directly into an addE() step to create our new edge and new 'D' vertex.

Finding a unique url from a large list of URLs in O(n) time in a single pass

Recently I was asked this question in an interview. I gave an answer in O(n) time but in two passes. Also he asked me how to do the same if the url list cannot fit into the memory. Any help is very much appreciated.
If it all fits in memory, then the problem is simple: Create two sets (choose your favorite data structure), both initially empty. One will contain unique URLs and the other will contain URLs that occur multiple times. Scan the URL list once. For each URL, if it exists in the unique set, remove it from the unique set and put it in the multiple set; otherwise, if it does not exist in the multiple set, add it to the unique set.
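A minimal sketch of the two-set approach described above, assuming the URLs arrive as any iterable of strings (the function name find_unique_urls is just an illustrative choice):

```python
def find_unique_urls(urls):
    """Return the set of URLs that occur exactly once, in a single pass."""
    unique = set()    # URLs seen exactly once so far
    multiple = set()  # URLs seen more than once
    for url in urls:
        if url in unique:
            # Second occurrence: move it from "unique" to "multiple".
            unique.remove(url)
            multiple.add(url)
        elif url not in multiple:
            # First occurrence.
            unique.add(url)
    return unique
```

Each URL costs a constant number of expected O(1) set operations, so the whole scan is O(n) on average.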
If the set does not fit into memory, the problem is difficult. The requirement of O(n) isn't hard to meet, but the requirement of a "single pass" (which seems to exclude random access, among other things) is tough; I don't think it's possible without some constraints on the data. You can use the set approach with a size limit on the sets, but this would be easily defeated by unfortunate orderings of the data and would in any event only have a certain probability (<100%) of finding a unique element if one exists.
EDIT:
If you can design a set data structure that exists in mass storage (so it can be larger than would fit in memory) and can do find, insert, and delete in O(1) (amortized) time, then you can just use that structure with the first approach to solve the second problem. Perhaps all the interviewer was looking for was to dump the URLs into a database with a UNIQUE index for URLs and a count column.
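The database idea could look roughly like the sketch below, using Python's built-in sqlite3 module. The table name urls, the column names url and n, and the function name are assumptions made for illustration; the default ":memory:" path keeps the example self-contained, while a file path would keep the table on disk.

```python
import sqlite3


def unique_urls_via_db(urls, path=":memory:"):
    """Count URL occurrences in an SQLite table and return those seen once."""
    db = sqlite3.connect(path)
    # PRIMARY KEY gives the unique index on url; n is the count column.
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for url in urls:
        # Try to bump the counter; insert a new row if the URL is not there yet.
        cur = db.execute("UPDATE urls SET n = n + 1 WHERE url = ?", (url,))
        if cur.rowcount == 0:
            db.execute("INSERT INTO urls (url, n) VALUES (?, 1)", (url,))
    rows = db.execute("SELECT url FROM urls WHERE n = 1 ORDER BY url").fetchall()
    db.close()
    return [row[0] for row in rows]
```

The input is still read only once; whether the index lookups behave like O(1) in practice depends on the database, so this trades the strict bound for simplicity.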
One could try using a trie to hold the data. Because common URL prefixes share nodes, it takes less memory than storing each URL in full.
The loop would look like:
add string s to trie;
check that added string is not finished in existing node
internal node -> compress path
leaf node -> delete path
For the "fits-in-memory" case, you could use two hash-tables as follows (pseudocode):
hash-table uniqueTable = <initialization>;
hash-table nonUniqueTable = <initialization>;
for-each url in url-list {
if (nonUniqueTable.contains(url)) {
continue;
}
else if (uniqueTable.contains(url)) {
nonUniqueTable.add(url);
uniqueTable.remove(url);
}
else {
uniqueTable.add(url)
}
}
if (uniqueTable.size() > 0)
return uniqueTable.first();
Python based
You have a list - not sure where it's "coming" from, but if you already have it in memory then:
from itertools import groupby
L.sort()
for key, vals in groupby(L):
    if len(list(vals)) == 1:
        print(key)
Otherwise use storage (possibly using):
import sqlite3
db = sqlite3.connect('somefile')
db.execute('create table whatever(key)')
Get your data into that, then execute "select key from whatever group by key having count(*) = 1"
This is actually a classic interview question and the answer they were expecting was that you first sort the urls and then make a binary search.
If it doesn't fit in memory, you can do the same thing with a file.

Searching data stored in a tree

I have this data that is hierarchical and so I store it in a tree. I want to provide a search function to it. Do I have to create a binary tree for that? I don't want to have thousands of nodes twice. Isn't there a kind of tree that allows me to both store the data in the order given and also provide me the binary tree like efficient searching setup, with little overhead?
Any other data structure suggestion will also be appreciated.
Thanks.
EDIT:
Some details: The tree is a very simple "hand made" tree that can be considered very very basic. The thing is, there are thousands of names and other text that will be entered as data that I want to search but I don't want to traverse the nodes in a traditional way and need a fast search like binary search.
Also, importantly, the user must be able to see the structure he has entered and NOT the sorted one. So I can't keep it sorted to support the search. That is why I said I don't want to have thousands of nodes twice.
If you don't want to change your tree's hierarchy, use a map to store pointers to vertices: std::map<SearchKeyType,Vertex*> M.
Every time you add a vertex to your tree, you need to add it to your map too. It's very easy: M[key]=&vertex. To find an element use M.find(key), or M[key] if you are sure that the key exists.
If your tree has duplicate keys, then you should use a multimap.
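The same idea (keep the tree in the order the user entered it, and maintain a separate lookup structure that only holds references to the nodes) can be sketched as follows. This is an illustration in Python with hypothetical class names; a dict is a hash table rather than the ordered tree behind std::map, and storing a list per key plays the role of the multimap.

```python
class Node:
    """A tree node; children keep the order in which they were added."""

    def __init__(self, key, parent=None):
        self.key = key
        self.children = []
        if parent is not None:
            parent.children.append(self)


class IndexedTree:
    """A tree plus a key -> nodes index, so nodes are never stored twice."""

    def __init__(self, root_key):
        self.index = {}
        self.root = self.add(root_key)

    def add(self, key, parent=None):
        node = Node(key, parent)
        # The index stores references only, not copies of the nodes.
        self.index.setdefault(key, []).append(node)
        return node

    def find(self, key):
        """Return all nodes with this key (empty list if there are none)."""
        return self.index.get(key, [])
```

The user-visible hierarchy is untouched, and the extra memory is one reference per node rather than a second copy of the data.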
Edit: If your key is too big, then you can use a pointer to the key instead of the key itself:
inline bool comparisonFunction(SearchKeyType * arg1,SearchKeyType * arg2);
// The comparator's type is the template argument; the function itself goes to the constructor.
std::map<SearchKeyType *, Vertex *, bool (*)(SearchKeyType *, SearchKeyType *)> M(comparisonFunction);
inline bool comparisonFunction(SearchKeyType * arg1,SearchKeyType * arg2)
{
return (*arg1)<(*arg2);
}
To search for an element with value V, write the following:
Vertex * v = M[&V]; // assuming that element V exists in M

How do I de-duplicate a list of nodes in XSLT - and return the last node encountered?

I've seen lots of "de-duplicate this xml" questions but everyone wants the first node or the nodes are identical. I have a bit of a bigger puzzle.
I have a list of articles in XML, a relevant snippet is shown:
<item><key>Article1</key><stamp>100</stamp></item>
<item><key>Article1</key><stamp>130</stamp></item>
<item><key>Article2</key><stamp>800</stamp></item>
<item><key>Article1</key><stamp>180</stamp></item>
<item><key>Article3</key><stamp>900</stamp></item>
<item><key>Article3</key><stamp>950</stamp></item>
<item><key>Article4</key><stamp>990</stamp></item>
<item><key>Article5</key><stamp>999</stamp></item>
I'd like a list of nodes where the keys are unique and where the last instance is returned, not the first: Stamp (integer) is always increasing for elements of a particular key. Ideally I'd like "largest stamp" but they're always in order so the shortcut is ok.
Desired result: (Order doesn't really matter.)
<item><key>Article2</key><stamp>800</stamp></item>
<item><key>Article1</key><stamp>180</stamp></item>
<item><key>Article3</key><stamp>950</stamp></item>
<item><key>Article4</key><stamp>990</stamp></item>
<item><key>Article5</key><stamp>999</stamp></item>
I'm somewhat confused on how to get this list. Any ideas?
I'm using the Saxon processor if it matters.
The short version:
Instead of using [1] in the Muenchian grouping, use [last()]
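To make the expected result concrete, here is a small Python sketch of "keep the last item per key". It is not XSLT and not Muenchian grouping, just the same selection expressed with a dictionary, and it assumes the items are wrapped in a single root element (called items here, which is not in the original snippet).

```python
import xml.etree.ElementTree as ET


def last_item_per_key(xml_text):
    """Return (key, stamp) pairs, keeping only the last item for each key."""
    root = ET.fromstring(xml_text)
    latest = {}
    for item in root.findall("item"):
        # A later item with the same key overwrites the earlier one.
        latest[item.findtext("key")] = item
    return [(key, int(item.findtext("stamp"))) for key, item in latest.items()]
```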

Making an index-creating class

I'm busy with programming a class that creates an index out of a text-file ASCII/BINARY.
My problem is that I don't really know how to start. I already had some tries but none really worked well for me.
I do NOT need to find the address of the file via the MFT. Just loading the file and finding stuff much faster by searching for the key in the index-file and going in the text-file to the address it shows.
The index-file should be built up as follows:
KEY ADDRESS
1 0xABCDEF
2 0xFEDCBA
. .
. .
We have a text-file with the following example value:
1, 8752 FW,
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++,
******************************************************************************,
------------------------------------------------------------------------------;
I hope that this explains my question a bit better.
Thanks!
It seems to me that all your class needs to do is store an array of pointers or file start offsets to the key locations in the file.
It really depends on what your Key locations represent.
I would suggest that you access the file through your class using some public methods. You can then more easily tie in Key locations with the data written.
For example, your Key locations may be where each new data block written into the file starts from, e.g. first block 1000 bytes, key location 0; second block 2500 bytes, key location 1000; third block 550 bytes, key location 3500; the next block will start at 4050, all assuming that 0 is the first byte.
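The arithmetic in that example is a running sum of the block sizes. A minimal sketch, assuming the block sizes are known in the order they were written (block_offsets is a hypothetical helper name):

```python
from itertools import accumulate


def block_offsets(sizes):
    """Return the start offset of each block, plus the offset of the next one."""
    # The first block starts at byte 0; every later block starts where the
    # previous one ended.
    return [0] + list(accumulate(sizes))
```

For block sizes 1000, 2500 and 550 this yields 0, 1000, 3500 and 4050, where the last value is where the next block would start.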
Store the Key values in a variable length array and then you can easily retrieve the starting point for a data block.
If your Key point is signified by some key character then you can use the same class, but with a slight change to store where the Key value is stored. The simplest way is to step through the data until the key character is located, counting the number of characters checked as you go. The count is then used to produce your key location.
Your code snippet isn't so much of an idea as it is the functionality you wish to have in the end.
Recognize that "indexing" merely means "remembering" where things are located. You can accomplish this using any data structure you wish... B-Tree, Red/Black tree, BST, or more advanced structures like suffix trees/suffix arrays.
I recommend you look into such data structures.
edit:
With the new information, I would suggest making your own key/value lookup. Build an array of keys, and associate their values somehow. This may mean building a class or struct that contains both the key and the value, or instead contains the key and a pointer to a struct or class with a value, etc.
Once you have done this, sort the key array. Now, you have the ability to do a binary search on the keys to find the appropriate value for a given key.
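A rough sketch of the sorted-array lookup, assuming the index entries are (key, address) pairs like the KEY / ADDRESS table in the question; the function names are illustrative only:

```python
from bisect import bisect_left


def build_index(entries):
    """Sort (key, address) pairs and split them into two parallel lists."""
    entries = sorted(entries)
    keys = [key for key, _ in entries]
    addresses = [address for _, address in entries]
    return keys, addresses


def lookup(keys, addresses, key):
    """Binary-search the sorted keys; return the address, or None if absent."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return addresses[i]
    return None
```

Building the index costs one sort; each lookup afterwards is O(log n).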
You could build a hash table in a similar manner. You could also build a BST or similar structure like I mentioned earlier.
I still don't really understand the question (work on your question asking skillz), but as far as I can tell the algorithm will be:
scan the file linearly, the first value up to the first comma (',') is a key, probably. All other keys occur wherever a ';' occurs, up to the next ',' (you might need to skip linebreaks here). If it's a homework assignment, just use scanf() or something to read the key.
print out the key and byte position you found it at to your index file
AFAIUI that's the algorithm, I don't really see the problem here?
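A sketch of that scan, under the same guess about the file format: a record starts with its key, the key ends at the first ',', and the record ends at ';'. The function name scan_keys is hypothetical, and the file content is assumed to fit in memory as bytes.

```python
def scan_keys(data: bytes):
    """Return (key, byte_offset) pairs for each record in the file content."""
    index = []
    pos = 0
    n = len(data)
    while pos < n:
        # Skip line breaks and spaces between the ';' and the next key.
        while pos < n and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= n:
            break
        comma = data.find(b",", pos)
        if comma == -1:
            break
        key = data[pos:comma].decode("ascii").strip()
        index.append((key, pos))
        # The record ends at the next ';'; the following key starts after it.
        semicolon = data.find(b";", comma)
        if semicolon == -1:
            break
        pos = semicolon + 1
    return index
```

Writing each pair as a line of the index file then gives the KEY / ADDRESS layout from the question, and seeking to the stored offset jumps straight to the record.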