Can GraphEngine support RDF?

Does GraphEngine support RDF and SPARQL, as described in the paper "A Distributed Graph Engine for Web Scale RDF Data" (https://www.graphengine.io/downloads/papers/Trinity.RDF.pdf)?
If not, could it be implemented on top of the engine, or is it on the roadmap?

Please take a look at our sample code for hosting Freebase:
https://github.com/Microsoft/GraphEngine/tree/master/samples/freebase-likq
The implementation is in src/LIKQ, and samples/freebase-likq provides an example of integrating an index service and multi-typed entity adapters with the LIKQ module.
The Freebase dataset is imported as a Trinity image via samples/GraphEngine.DataImporter (currently in the experimental branch). It scans the data twice: the first pass decides the data types for the entities and generates the TSL storage layout schema, and the second pass does the actual import.
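As a rough, language-agnostic illustration of that two-pass idea (this is not the actual DataImporter code, which is C# and emits TSL; the helper names and the TSV layout here are assumptions), a sketch in Python might look like:

from collections import defaultdict

def infer_schema(path):
    # Pass 1: record which predicates occur for each entity, so that a storage
    # layout (in GraphEngine's case, a TSL schema) can be generated from it.
    predicates_per_entity = defaultdict(set)
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 3:
                predicates_per_entity[fields[0]].add(fields[1])
    return predicates_per_entity

def import_data(path, schema, store):
    # Pass 2: with the layout fixed, stream the file again and write each fact
    # into the store. `store` is a stand-in for the storage backend; in the
    # real importer the schema determines how each cell is typed and laid out.
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 3:
                store.save(fields[0], fields[1], fields[2])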

Related

Use map data offline with osmdroid

My ultimate goal is to have map data (offline, because I will customize it myself) and display it in an Android app. I could make osmdroid load maps online, and I was trying to figure out how to download and display offline maps. I downloaded MOBAC (Mobile Atlas Creator) and exported the data to SQLite format, and when I had a look at it I realized that the tiles are saved as images (PNG).
What I would like to do is import the data to the phone so I can later use it in algorithms such as a search engine or a routing algorithm. So I need the "nodes" and "ways" (as I get them from the original OSM XML), import them to the phone, and visualize them so that this data is later available for the algorithms I want to develop. Basically, what MAPS.ME does. I think it wouldn't be difficult to convert the XML into SQLite, since a simple script could do it, but then how can I generate the tiles from this custom SQLite database? Or is there a way to download the data in a form more appropriate for what I'm planning to do?
Thanks.
Rendering tiles in the app from raw OpenStreetMap data would be computationally heavy and inefficient. I would suggest using the image tiles you exported for the visual representation.
In addition to the tiles, you should export the data set the application needs for the desired functionality. You will not need all of the OpenStreetMap data, so identify what you need and build a custom export (there are tools and libraries for processing and filtering OpenStreetMap data; I have used pyosmium for some filtering and processing, but there are others). For example, you can build a custom database with the POIs you want to search for, as in the sketch below.
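As a minimal sketch of that kind of filtering with pyosmium (the tag choices, file name, and output here are assumptions; adapt them to your own POI set):

import osmium

class POIHandler(osmium.SimpleHandler):
    """Collect named amenity nodes from an OSM extract."""
    def __init__(self):
        super().__init__()
        self.pois = []

    def node(self, n):
        # Keep only nodes tagged as an amenity that also carry a name.
        if 'amenity' in n.tags and 'name' in n.tags:
            self.pois.append((n.id, n.location.lat, n.location.lon,
                              n.tags['amenity'], n.tags['name']))

handler = POIHandler()
handler.apply_file('my-region.osm.pbf')  # works on .osm and .osm.pbf extracts
print(len(handler.pois), 'POIs extracted')

From there you can write the tuples into your own SQLite table and ship that database with the app alongside the tile database.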
Routing is another chapter. You can implement it yourself, but it is a very complex task. There is a Java library called GraphHopper which can do the data extraction (from OpenStreetMap) and offline routing for you. It has an online API too, but it is possible to make it work completely offline (I did it for one application). Take a look at the source code and you will see how complex a topic routing is. Final note: the data exported by GraphHopper contains information about some POIs along routes, so it may be possible to search for some things via its Java API, but I haven't investigated this yet.

Text document classification with AWS Machine Learning

I've been looking at using AWS Machine Learning to implement a categorizer for my project. I have something on the order of 40,000 documents that have several text-only features. For example: Name (< 200 chars) and Description (potentially hundreds or thousands of words).
In a nutshell, I'm looking to assign categories (0 or more) to each document based on its content.
I've read through the AWS ML tutorial and checked out a few other sources but the available material seems to deal with feature fields that are numeric, boolean, datetime, or otherwise non-textual.
Is AWS Machine Learning capable of performing multi-class categorization on documents based primarily (or possibly only) on text fields? And if so, is there any reference material available for this particular avenue?
Mainly, you don't need "text fields". First you have to create a vector space model (VSM) from your corpus (the texts), then you can weight it with tf-idf, and at that point you are working with numeric fields (see the sketch below).
Are you sure you want to use AWS ML to train on a corpus of only 40,000 documents?
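As a minimal local sketch of that vectorization step (this uses scikit-learn rather than AWS ML, and the sample documents and parameters are just placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

# In practice: one string per document, e.g. Name + " " + Description.
documents = [
    "first document name and description text ...",
    "second document name and description text ...",
]

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)  # sparse matrix: one tf-idf weighted row per document

# Every column of X is now a numeric feature that a generic ML service,
# or a local multi-class / multi-label classifier, can consume.
print(X.shape)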

What's the difference between core data, essential data and sample data in hybris?

In the hybris wiki trails, there is mention of core data vs. essential data vs. sample data. What is the difference between these three types of data?
Ordinarily, I would assume that sample data is illustrative gobbledygook data created to populate the example apparel and electronics storefronts. However, the wiki trails suggest that core data is for non-store specific data and the sample data is for store specific data.
On the same page, the wiki states that core data contains cockpit and catalog definitions, email templates, CMS layout, and site definitions (countries and user groups impex are included in this as well). This seems rather store specific to me. Does anyone have an explanation for this?
Yes, I have an explanation. Actually, a lot of this comes down to arbitrary decisions I made on separating data between the acceleratorcore and acceleratorsampledata extensions as part of the Accelerator in 4.5 (these later had the y- prefix added).
Essential and Project Data are two sets of data that are used within hybris' init/update process. These steps are controlled for each extension via particular annotations on classes and methods.
Core vs. Sample data is more about whether I thought the impex file, or individual lines, were specific to the sample store or were more general. You will notice your CoreSystemSetup has both essentialdata and projectdata steps.
Lots of work has happened on various continents since then, so, like much of hybris now, it's a bit of a mess.
There are a few fun bugs related to hybris making certain things part of essentialdata, but these are in the platform, not something I can fix without complaining to various people.
To confuse matters further, there is the yacceleratorinitialdata extension. This extension was a way I hoped to make projects easier, by giving some impex skeletons for new sites and stores, generated for you during modulegen. It has rotted heavily since release, though, and is now very out of date.
For a better explanation, take a look at this answer from answers.sap.com.
Hybris imports two types of data during the initialization and update processes: the first is essentialdata and the other is projectdata.
Essentialdata is the core data setup, which is mandatory and will be imported whenever you run an initialization or update.
Sampledata is your projectdata; it is not mandatory and will only be imported when you select it during a system update.

Are there tools for generating the client's model from OData metadata?

Back in the good old days of Flex (anyone?), Flash Builder provided a tool for generating the client model based on the server model. Is there something similar for generating, say, an Ember app's model based on the OData metadata?
The datajs documentation does mention the subject. Though the reference for OData.read used in the sample doesn't say explicitly that it interprets the metadata somehow, it seems implied. You'll have to verify that.
It does take an optional metadata object, however, suggesting the library has a formal representation for metadata -- I would imagine generated via OData.read. Documentation seems non-existent, so I don't know what that looks like.
From there, you should be able to further transform the model to something suitable for Ember.
(datajs is a low-level javascript library that implements client-side OData operations.)
I also know that JayStack provides JaySvcUtil, a CLI assembly (.NET program) that extracts the metadata. The output format is JavaScript code, though the model it uses is specific to JayData. Still, you may be able to work from there.
As mentioned by Maya, Microsoft provides the OData Client Code Generator, which generates .NET proxies. Might be more difficult to transform that.
If none of these work for you (which is actually likely), you can always parse the $metadata resource yourself -- I believe it always uses an XML representation in all current versions of OData.
If you need to do it dynamically in the browser, use DOMParser or XMLHttpRequest.
If you can do it statically, then by all means do so -- it's simply best for performance. In this case, you can use whatever language and runtime tools you want to fetch, parse, transform and re-serialize the model.
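For instance, a minimal static sketch in Python (the service URL is hypothetical, and real CSDL documents differ between OData versions, so treat the namespace below, which is the OData v4 one, as an assumption to adjust):

import urllib.request
import xml.etree.ElementTree as ET

EDM_NS = '{http://docs.oasis-open.org/odata/ns/edm}'  # CSDL namespace for OData v4

# Fetch the metadata document from a (hypothetical) service.
with urllib.request.urlopen('https://example.org/odata/$metadata') as resp:
    root = ET.fromstring(resp.read())

# Walk the schema and dump entity types with their properties; from this
# structure you could emit Ember model definitions or any other format.
for schema in root.iter(EDM_NS + 'Schema'):
    for entity in schema.iter(EDM_NS + 'EntityType'):
        props = [(p.get('Name'), p.get('Type')) for p in entity.iter(EDM_NS + 'Property')]
        print(entity.get('Name'), props)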
The format (CSDL) is specified for both OData v4 and v3.
Finally, check out this list, something new may appear that better fits your needs.
There are two suggestions which may help you.
1. OData provides a client code generator that generates client-side proxy classes. You just need to pass the metadata URL and the .NET client code will be generated for you. You can follow this blog post:
http://blogs.msdn.com/b/odatateam/archive/2014/03/11/how-to-use-odata-client-code-generator-to-generate-client-side-proxy-class.aspx
2. If by "model" you mean an EdmModel, you can just deserialize $metadata. The OData reader can deserialize $metadata into an IEdmModel, which can be used on the client side. The following is sample code:
// Request the $metadata document and read it into an IEdmModel.
HttpWebRequestMessage message = new HttpWebRequestMessage(
    new Uri(ServiceBaseUri.AbsoluteUri + "$metadata", UriKind.Absolute));
message.SetHeader("Accept", MimeTypes.ApplicationXml);  // i.e. "application/xml"

using (var messageReader = new ODataMessageReader(message.GetResponse()))
{
    // ReadMetadataDocument deserializes the CSDL payload into an IEdmModel.
    Model = messageReader.ReadMetadataDocument();
}

How to parse the Freebase quad dump using Amazon MapReduce

I'm trying to extract movie information from Freebase. I just need the name of the movie and the names and IDs of the director and the actors.
I found it hard to do this using Freebase's topic dumps, because there is no reference to the director's ID, just the director's name.
What is the right approach for this task? Do I need to somehow parse the whole quad dump using Amazon's cloud? Or is there some easy way?
You do need to use the quad dump, but it is under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing. A decent laptop should be fine. On a couple-year-old laptop, this simple-minded command:
time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545
real 18m56.968s
user 19m30.101s
sys 0m56.804s
extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
You'll need to traverse an intermediary node (a CVT in Freebase-speak) to get the actors, but the rest of your information should be connected directly to the subject film node; a sketch of that traversal follows.
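As a rough sketch of that traversal (this assumes a tab-separated subject / predicate / object layout and the /film/film/starring and /film/performance/actor predicates; the exact predicate spelling varies between dump versions, so treat these strings as assumptions):

import bz2
from collections import defaultdict

starring = defaultdict(list)  # film mid -> list of performance CVT mids
actor_of = {}                 # performance CVT mid -> actor mid

# Single pass over the dump: collect only the edges we care about.
with bz2.open('freebase-datadump-quadruples.tsv.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3:
            continue
        subj, pred, obj = fields[0], fields[1], fields[2]
        if pred == '/film/film/starring':
            starring[subj].append(obj)
        elif pred == '/film/performance/actor':
            actor_of[subj] = obj

# In-memory join: film -> performance CVT -> actor.
for film, performances in starring.items():
    actors = [actor_of[p] for p in performances if p in actor_of]
    print(film, actors)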
Tom
First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command-line tools to take 'interesting' slices of data out of the Freebase data dump.
However, an alternative would be to load Freebase data into a 'graph' storage system locally and use APIs and/or the query language available from that system to interact with the data for further processing.
I use RDF, since the data model is quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
Another reasonable and scalable approach would be to implement what you need in MapReduce, but this makes sense only if the processing touches a large fraction of the Freebase data and is not as trivial as counting lines. This is more expensive than using your own machine: you need a Hadoop cluster, or you need to use Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-)) If you do go that route, see the Hadoop Streaming sketch below.
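As a minimal sketch of what the map side could look like with Hadoop Streaming (the predicate list and the plain tab-separated output are assumptions; a real job would also need a reducer to join films, performance CVTs, and actors):

#!/usr/bin/env python3
# Hadoop Streaming mapper: keep only the film-related predicates so the
# reduce side can perform the join.
import sys

WANTED = {'/film/film/starring', '/film/performance/actor', '/type/object/name'}

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) >= 3 and fields[1] in WANTED:
        print(fields[0] + '\t' + fields[1] + '\t' + fields[2])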
My 2 cents.