Titan on AWS- Prevent dropping of edges and vertex's - amazon-web-services

Is there a way to run a gremlin server so that the drop command is prevented? I never actually drop any edges or vertex's so I'd like the added assurance that it can't be done by mistake

You could have some luck developing your own TraversalStrategy and intercept the behavior of the .drop() step, preventing it to actually delete data. However, people could still be able to bypass the Gremlin/TinkerPop API and directly manipulate the graph instance and remove graph elements (Vertex, Edge and Property).
Depending on your use case, you might just want to disable any mutation to the graph, and not just the removal of elements:
At the Titan level, you can use the storage-read.only option which makes the Titan storage backend read only. See Titan v1.0.0 documentation, Ch. 12 - Configuration reference, 12.3.23. storage.
You can also handle this at the TinkerPop level with the ReadOnlyStrategy Traversal strategy. See TinkerPop v3.0.1 documentation on ReadOnlyStrategy.

Related

Automate sequential integer IDs without using Identity Specification?

Are there any tried/true methods of managing your own sequential integer field w/o using SQL Server's built in Identity Specification? I'm thinking this has to have been done many times over and my google skills are just failing me tonight.
My first thought is to use a separate table to manage the IDs and use a trigger on the target table to manage setting the ID. Concurrency issues are obviously important, but insert performance is not critical in this case.
And here are some gotchas I know I need to look out for:
Need to make sure the same ID isn't doled out more than once when
multiple processes run simultaneously.
Need to make sure any solution to 1) doesn't cause deadlocks
Need to make sure the trigger works properly when multiple records are
inserted in a single statement; not only for one record at a time.
Need to make sure the trigger only sets the ID when it is not already
specified.
The reason for the last bullet point (and the whole reason I want to do this without an Identity Specification field in the first place) is because I want to seed multiple environments at different starting points and I want to be able to copy data between each of them so that the ID for a given record remains the same between environments (and I have to use integers; I cannot use GUIDs).
(Also yes, I could set identity insert on/off to copy data and still use a regular Identity Specification field but then it reseeds it after every insert. I could then use DBCC CHECKIDENT to reseed it back to where it was, but I feel the risk with this solution is too great. It only takes one time for someone to make a mistake and then when we realize it, it would be a real pain to repair the data... probably enough pain that it would have made more sense just to do what I'm doing now in the first place).
SQL Server 2012 introduced the concept of a SEQUENCE database object - something like an "identity" column, but separate from a table.
You can create and use sequence from your code, you can use the values in various place, and more.
See these links for more information:
Sequence numbers (MS Docs)
CREATE SEQUENCE statement (MS Docs)
SQL Server SEQUENCE basics (Red Gate - Joe Celko)

Code changes needed for custom distributed ML Engine Experiment

I completed this tutorial on distributed tensorflow experiments within an ML Engine experiment and I am looking to define my own custom tier instead of the STANDARD_1 tier that they use in their config.yaml file. If using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." so this would suggest the config.yaml file below would be possible
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_m
workerType: complex_model_m
parameterServerType: complex_model_m
workerCount: 10
parameterServerCount: 4
Are any code changes needed to the mnist tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers as the tutorial suggests would be possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package: task.pyeven though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that if using the Estimator API (even within your own defined model) that there should not need to be any additional changes.
Does any of this change if the config.yaml specifies using GPUs? This article suggests for the Estimator API "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml is specifically identifying the machine type for parameter servers and workers, I am expecting that within ML-Engine the ClusterSpec will be configured properly based on the config.yaml file. However, I am not able to find any ml-engine documentation that confirms no changes are needed to take advantage of GPUs.
Last, within ML-Engine I am wondering if there are any ways to identify usage of different configurations? The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." suggests that the use of additional workers would be roughly linear, but I don't have any intuition around how to determine if more parameter servers are needed? What would one be able to check (either within the cloud dashboards or tensorboard) to determine if they have a sufficient number of parameter servers?
are any additional code changes needed to create a custom tier of any size?
No; no changes are needed to the MNIST sample to get it to work with different number or type of worker. To use a tf.estimator.Estimator on CloudML engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads in the TF_CONFIG environment variables and populates a RunConfig object with the relevant information such as the ClusterSpec. It will automatically do the right thing on Parameter Server nodes and it will use the provided Estimator to start training and evaluation.
Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
This all means that you can just change your config.yaml by adding/removing workers and/or changing their types and things should generally just work.
For sample code using a custom model_fn, see the census/customestimator example.
That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size was 50 and you were using 10 workers, that means each worker is processing batches of size 50, for an effective batch size of 10*50=500. Then if you increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you may need to decrease your learning rate accordingly (linear seems to generally work well; ref).
I poked around some of the other ML Engine samples and found that
reddit_tft uses distributed training, but they appear to have defined
their own runconfig.cluster_spec within their trainer package:
task.pyeven though they are also using the Estimator API. So, is there
any additional configuration needed?
No additional configuration needed. The reddit_tft sample does instantiate its own RunConfig, however, the constructor of RunConfig grabs any properties not explicitly set during instantiation by using TF_CONFIG. And it does so only as a convenience to figure out how many Parameter Servers and workers there are.
Does any of this change if the config.yaml specifies using GPUs?
You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.

Inotify-like feature in a distributed file system

As the title goes, I want to trigger a notification when some events happen.
A event above can be user-defined, such as updating specified files in 1-miniute.
If files are stored locally, I can easily make it with the system call inotify, but the case is that files locate on a distributed file system such as mfs..
How to make it? I wonder to know if there are some solutions or open-source project to solve this problem. Thanks.
If you have only black-box access (e.g. NFS protocol) to the remote system(s), you don't have much options unless the protocol supports what you need. So I'll assume you have control over the remote systems.
The "trivial" approach is running a local inotify/fanotify listener on each computer that would forward the notification over the network. FAM can do this over NFS.
A problem with all notification-based system is the risk of lost notifications in various edge cases. This becomes much more acute over a network - e.g. client confirms reciept of notification, then immediately crashes. There are reliable message queues you can build on but IMHO this way lies madness...
A saner approach is stateless hash-based scan.
I like to call the following design "hnotify" but that's not an established term. The ideas are widely used by many version control and backup systems, dating back to Plan 9.
The core idea is if you know cryptographic hashes for files, you can compose a single hash that represents a directory of files - it changes if any of the files changed - and you can build these bottom-up to represent the whole filesystem's state.
(Git stores things this way and is very efficient at it.)
Why are hash trees cool? If you have 2 hash trees — one representing the filesystem state you saw at point in the past, one representing the current state — you can easily find out what changed between them:
You start at the roots. If they are different you read the 2 root directories and compare hashes for subdirectories.
If a subdirectory has same hash in both trees, then nothing under it changed. No point going there.
If a subdirectory's hash changed, compare its contents recursively — call step (1).
If one has a subdirectory the other doesn't, well that's a change. With some global table you can also detect moves/renames.
Note that if few files changed, you only read a small portion of the current state. So the remote system doesn't have to send you the whole tree of hashes, it can be an interactive ping-pong of "give me hashes for this directory; ok now for this...".
(This is akin to how Git's dumb http protocol worked; there is a newer protocol with less round trips.)
This is as robust and bug-proof as polling the whole filesystem for changes — you can't miss anything — but reasonably efficient!
But how does the server track current hashes?
Unfortunately, fully hashing all disk writes is too expensive for most people. You may get if for free if you're lucky to be running a deduplicating filesystem, e.g. ZFS or Btrfs.
Otherwise you're stuck with re-reading all changed files (which is even more expensive than doing it in the filesystem layer) or using fake file hashes: upon any change to a file, invent a new random "hash" to invalidate it (and try to keep the fake hashes on moves). Still compute real hashes up the tree. Now you may have false positives — you "detect a change" when the content is the same — but never false negatives.
Anyway, the point is that whatever stateful hacks you do (e.g. inotify with periodic scans to be sure), you only do them locally on the server. Across the network, you only ever send hashes that represent snapshots of current state (or its subtrees)! This way you can have a distributed system with many servers and clients, intermittent connectivity, and still keep your sanity.
P.S. Btrfs can efficiently find differences from an older snapshot. But this is a snapshot taken on the server (and causing all data to be preserved!), less flexible than a client-side lightweight tree-of-hashes.
P.S. One of your tags is HadoopFS. I'm not really familiar with it, but I suspect a lot of its files are write-once-then-immutable, and it might be able to natively give you some kind of file/chunk ids that can serve as fake hashes?
Existing tools
The first tool that springs to my mind is bup index. bup is a very clever deduplicating backup tool built on git (only scalable to huge data), so it sits on the foundation described above. In theory, indexing data in bup on the server and doing git fetch over the network would even implement the hash-walking comparison of what's new that I described above — unfortunately the git repositories that bup produces are too big for git itself to cope with. Also you probably don't want bup to read and store all your data. But bup index is a separate subsystem that quickly scans a filesystem for potential changes, without yet reading the changed files.
Currently bup doesn't use inotify but it's been discussed in depth.
Oh, and bup uses Bloom Filters which are a nearly optimal way to represent sets with false positives. I'm almost certain Bloom filters have a role to play in optimizion stateless notification protocols ("here is a compressed bitmap of all I have; you should be able to narrow your queries with it" or "here is a compressed bitmap of what I want to be notified about"). Not sure if the way bup uses them is directly useful to you, but this data structure should definitely be in your toolbelt.
Another tool is git annex. It's also based on Git (are you noticing a trend?) but is designed to keep the data itself out of Git repos (so git fetch should just work!) and has a "WORM" option that uses fake hashes for faster performance.
Alternative design: compressed replayable journal
I used to think the above is the only sane stateless approach for clients to check what's changed. But I just read http://arstechnica.com/apple/2007/10/mac-os-x-10-5/7/ about OS X's FSEvents framework, which has a perhaps simpler design:
ALL changes are logged to a file. It's kept forever.
Clients can ask "replay for me everything since event 51348".
The magic trick is the log has coarse granularity ("something in this directory changed, go re-scan it to find out what", repeated changes within 30 seconds are combined) so this journal file is very compact.
At the low level you might resort to similar techniques — e.g. hashes — but the top-level interface is different: instead of snapshots you deal with a timeline of events. It may be an easier fit for some applications.

Creating a Version Node in Titan

I'm new to graph databases and to Titan. I'm embedding Titan in a Clojure app. When the app starts up, it creates a BerkeleyDB backed Titan store.
I want to know/do 3 things:
Is this database new? If so, create a version node with version 0. Run the migration procedures to bring the "schema" to the newest version.
If not, does it have a version node? If not, throw an exception.
If the database was preexisting and has a version node, run migration procedures to bring the "schema" up to date.
How do I do this in Titan? Is there a best practice for this?
EDIT:
OK, on further review, I think using a hard-coded vertexid makes the most sense. There's a TitanTransaction.containsVertex(long vertexid). Are there any drawbacks to this approach? I guess I don't know how vertexids are allocated and what their reserved ranges are, so this smells dangerous. I'm new to graph DBs, but I think in Neo4j creating a reference node from the root node is recommended. But Titan discourages root node usage because it becomes a supernode. IDK...
1- I don't know if there is a way to see if the database is new through Titan. You could check to see if the directory where BerkeleyDB will be stored exists before you start Titan.
2/3- Probably your best bet would be a hardcoded vertex with an indexed property "version". Do a look up within the (nearly empty) index on "version" at the start and base your logic on those results.
An aside, you might be interested in Titanium[0]. We're gearing up for a big release in the next week or so that should make it much more useful[1].
[0] http://titanium.clojurewerkz.org/
[1] http://blog.clojurewerkz.org/blog/2013/04/17/whats-going-on-with-titanium/

Updating a field in all records in elasticsearch

I'm new to ElasticSearch, so this is probably something quite trivial, but I haven't figured out anything better that fetching everything, processing with a script and updating the registers one by one.
I want to make something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEXPRESSION
My intent is to replace the actual bogus data with some data that makes more sense (so the expression is basically randomly choosing from a pool of valid values).
There are a couple of open issues about making possible to update documents by query.
The technical challenge is that lucene (the text search engine library that elasticsearch uses under the hood) segments are read only. You can never modify an existing document. What you need to do is delete the old version of the document (which by the way will only be marked as deleted till a segment merge happens) and index the new one. That's what the existing update api does. Therefore, an update by query might take a long time and lead to issues, that's why it's not released yet. A mechanism that allows to interrupt running queries would be a nice to have too for this case.
But there's the update by query plugin that exposes exactly that feature. Just beware of the potential risks before using it.