I need to know if dashDB supports spatial objects and spatial queries (i.e., can we store points, areas, or polygons in dashDB and query those objects?). I know that for PostgreSQL, for example, this is supported by installing an add-on called PostGIS. But what about dashDB?
Yes it does. dashDB's spatial data and indexing support comes from DB2, so it's actually very mature although dashDB is a relatively new product. See more here: http://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.doc/learn_how/loaddata_gsdata.html
Search the web for dashdb plus geospatial and you'll find plenty of information.
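For a flavor of what this looks like in practice, here is a rough sketch of a spatial query against dashDB from Python via the ibm_db driver. The table, column names, and connection details are placeholders, and the query leans on DB2's spatial functions (db2gse schema); treat it as an illustration, not dashDB-specific documentation:

import ibm_db

# Placeholder credentials; dashDB exposes a standard DB2 connection string.
conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=host;PORT=50000;UID=user;PWD=secret;", "", "")

# Hypothetical table with a point column; find stores within 10 km of a
# point given as (longitude, latitude) in WGS84 (spatial reference 1003).
sql = """
SELECT name
FROM stores
WHERE db2gse.ST_Distance(location,
                         db2gse.ST_Point(-73.98, 40.75, 1003),
                         'KILOMETER') <= 10
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["NAME"])
    row = ibm_db.fetch_assoc(stmt)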
There are some projects on GitHub which use a key/value store like leveldb or rocksdb to build a NoSQL/SQL store. For example, levelgraph is a graph DB written in Node.js that uses leveldb under the hood, and YugaByte DB is a distributed RDBMS on top of rocksdb. These projects, especially levelgraph, motivated me to build a document store on top of rocksdb/leveldb. Since I'm not familiar with the algorithms, data structures, and general theory of databases, I want to know the best approach to building an embeddable document store (I don't want it to be distributed right now).
Questions:
1. Is there any academic paper or reference on this subject? Would you please list the skills I need to acquire to finish the project?
2. Levelgraph is written in Node.js using levelup, a wrapper for abstract-leveldown-compliant stores; leveldown is a pure C++ Node.js binding for leveldb. If I want to program my DB in Node.js using levelup, how much will the language difference impact the DB's performance?
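For context, here is a minimal sketch of the kind of document layer I have in mind, written in Python with the plyvel leveldb binding purely for illustration; the key scheme and index layout are my own assumptions, not an established design:

import json
import plyvel

# Toy document store on top of leveldb: documents are JSON blobs keyed by
# "doc:<id>", with a naive secondary index under "idx:<field>:<value>:<id>".
db = plyvel.DB("/tmp/docstore", create_if_missing=True)

def put_doc(doc_id, doc):
    # primary record: doc:<id> -> JSON blob
    db.put(b"doc:" + doc_id.encode(), json.dumps(doc).encode())
    # index every top-level field so we can scan by field value later
    for field, value in doc.items():
        db.put("idx:{}:{}:{}".format(field, value, doc_id).encode(), b"")

def find_by(field, value):
    prefix = "idx:{}:{}:".format(field, value).encode()
    for key, _ in db.iterator(prefix=prefix):
        doc_id = key.decode().rsplit(":", 1)[1]
        yield json.loads(db.get(b"doc:" + doc_id.encode()))

put_doc("1", {"name": "ada", "lang": "js"})
put_doc("2", {"name": "bob", "lang": "js"})
print(list(find_by("lang", "js")))  # both documents

A real store would also need atomic batches, a smarter key encoding, and index maintenance on updates and deletes, which is where most of the design work lives.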
I'm trying to perform statistical analysis on relatively flat time-series data with AWS Elastic MapReduce. AWS gives you the option of using Hive, Pig, or HBase for EMR jobs. Which one would be best for this type of analysis? I don't think the analysis will be on the terabyte scale; items in my tables are mostly under 1K. I've also never used any of the three, but the learning curve shouldn't be an issue. I'm more concerned with what will be more efficient. I'm also handing this project off soon, so something that is relatively easy to understand for people with NoSQL experience would be nice, but I'm mostly looking to make the sensible choice for the data I have. An example query I might make is something like "Find all accounts between last week and today with an event value over 20 for each day".
IMHO, none of these. You use MR, Hive, Pig, etc. when your data is really big, and you are talking about a dataset that is not even ~1 TB. And you want your system to be efficient as well. In such a scenario, using these tools would be overkill. So the sensible choice for the data you have would be an RDBMS of your choice.
And if it is just for learning purposes, then use HDFS+Hive or Pig (depending on what suits you better).
In response to your comment:
If I had a situation like this, I would use HDFS with Hive to store my flat data. The reason I would go with Hive is that I don't see a lot of transformation going on here. And I don't really see any need for HBase as of now: HBase is normally used when you need random, real-time access to some part of your data, and if your use case really does demand HBase, you need to be careful while designing your schema, since you are dealing with time-series data.
But the decision on whether to use Hive or Pig needs some deeper analysis of the kind of operations you are going to perform on your data. You might find these links helpful:
http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
P.S.: You might want to have a look at the R project.
A short summary answer:
Hive is an easy "first option" for your data analysis, because it uses familiar SQL syntax. Because of this, there are many convenient connectors to front-end analysis tools: Excel, Tableau, Pentaho, Datameer, SAS, etc.
Pig is used more for ETL (transformation) of data incoming to Hadoop. Your data analysis may require some "transformation" of the data before it is stored in Hive. For example you may choose to strip out headers, apply information from other sources, etc. A good example of how this works is provided with the free Hortonworks sandbox tutorials.
HBase is more valuable when you're explicitly looking for a NoSQL store on top of Hadoop.
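To make the Hive suggestion concrete, here is roughly how the asker's example query might look in HiveQL, run from Python via the PyHive client. The table and column names (events, account_id, event_value, event_date) are hypothetical, and the query interprets "over 20 for each day" loosely as a per-account, per-day check:

from pyhive import hive

# Placeholder connection to a HiveServer2 instance.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# Accounts whose peak event value exceeded 20, per day over the last week.
cursor.execute("""
    SELECT account_id, event_date, max(event_value) AS peak
    FROM events
    WHERE event_date BETWEEN date_sub(current_date(), 7) AND current_date()
    GROUP BY account_id, event_date
    HAVING max(event_value) > 20
""")
for account_id, event_date, peak in cursor.fetchall():
    print(account_id, event_date, peak)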
I'm searching for any NoSQL system (preferably open source) that supports analytic functions (AF for short) the way Oracle/SQL Server/Postgres do. I haven't found any with built-in support. I've read a bit about Hive, but it doesn't have actual AF features (windows, first/last values, ntiles, lag, lead, and so on), just histograms and n-grams. Also, some NoSQL systems (Redis, for example) support map/reduce, but I'm not sure whether AF can be replaced with it.
I want to make a performance comparison to choose either Postgres or NoSQL system.
So, in short:
Searching for NoSQL systems with AF
Can I rely on map/reduce to replace AF? Is it fast, reliable, and easy to use?
P.S.: I tried to make my question more constructive.
Once you've really understood how MapReduce works, you can do amazing things with a few lines of code.
Here is a nice video course:
http://code.google.com/intl/fr/edu/submissions/mapreduce-minilecture/listing.html
The real difficulty factor will be between functions that you can implement with a single MapReduce and those that will need chained MapReduces. Moreover, some nice MapReduce implementations (like CouchDB) don't allow you to chain MapReduces (easily).
Some functions require knowledge of all existing data, namely those that involve some kind of aggregation (avg, median, standard deviation) or ordering (first, last).
If you want a distributed NoSQL solution that supports AF out of the box, the system will need to rely on some centralized indexing and metadata to keep information about the data in all nodes, thus having a master node and probably a single point of failure.
You have to ask what you expect to accomplish using NoSQL. Do you want schemaless tables? Distributed data? Better raw performance for very simple queries?
Depending on your needs, I see three main alternatives here:
1 - Use a distributed NoSQL store with no single point of failure (e.g. Cassandra) to hold your data, and use map/reduce to process it and produce the results for the desired function (almost every major NoSQL solution supports Hadoop). The caveat is that map/reduce queries are not realtime (they can take minutes or hours to execute) and require extra setup and learning.
2 - Use a traditional RDBMS that supports multiple servers, like MySQL Cluster.
3 - Use a NoSQL store with a master/slave topology that supports ad-hoc and aggregation queries, like MongoDB.
As for the second question: yes, you can rely on M/R to replace AF. You can do almost anything with M/R.
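As a toy illustration of why aggregations fit the model, here is an average (a typical analytic aggregation) decomposed into map and reduce steps, sketched in plain Python:

from functools import reduce

values = [3, 7, 10, 2, 8]

# map: each value becomes a partial (sum, count) pair
mapped = [(v, 1) for v in values]

# reduce: combine partial pairs associatively, so the work can be distributed
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), mapped)
print(total / count)  # 6.0

The same (sum, count) trick is how distributed frameworks compute averages; medians and ntiles are harder because they need an ordering over all the data, which is where chained MapReduces come in.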
I'm currently building my first Django-based app, which is proceeding fairly well thus far. One of the goals is to allow users to post information (text, pics, video) and for the app to automatically detect the location where they posted it (i.e., pulling location info from the browser). That data would then ideally be filterable later, such as viewing the posts that were made within a specific radius.
I've been reading a bit about GeoDjango, and it sounds intriguing, if perhaps more sophisticated than this project requires. The querying aspects appear promising.
A colleague, though, suggested that everything that can be done with GeoDjango can be done equally efficiently using the Google Maps API with JavaScript or jQuery to obtain the proper coordinates.
Essentially, I'm looking to see what benefits GeoDjango would offer this fairly straightforward project over using just the Google Maps API. If I've already started a project in basic Django, is incorporating GeoDjango problematic? I'm still attempting to master the basics of Django, and venturing into GeoDjango may be too much for a novice developer. Or not.
Any insight appreciated.
To accurately find geolocated posts within a given radius of a location, you need to calculate distances between geographic locations. The calculations are not trivial. To quote the Django docs (with a minor grammatical correction):
Distance calculations with spatial data are tricky because,
unfortunately, the Earth is not flat.
Fortunately, using GeoDjango hides this complexity. Distance queries are as simple as the following:
from django.contrib.gis.geos import Point

pnt = Point(-96.876369, 29.905320)  # hypothetical query point (lon, lat)
qs = SouthTexasCity.objects.filter(point__distance_lte=(pnt, 7000))
Although one could program this logic using JavaScript/jQuery, I don't see a reason to, because you are already using Django. Unless you are:
unable to use a spatial database. GeoDjango distance queries are only available if you use PostGIS, Oracle, or SpatiaLite as your database (MySQL, for example, does not support distance queries).
unable to install the geospatial libraries that GeoDjango depends on.
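As for retrofitting GeoDjango into an existing project: it is mostly a configuration change. A minimal sketch, assuming PostGIS (database name and credentials are placeholders):

# settings.py: enable GeoDjango in an existing project
INSTALLED_APPS = [
    # ... your existing apps ...
    "django.contrib.gis",
]

DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "geodb",        # placeholder database name
        "USER": "user",         # placeholder credentials
        "PASSWORD": "secret",
    }
}

After that, geographic fields (e.g. PointField) and lookups like distance_lte become available on your models.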
I am developing a GIS application using the Google Maps API, currently backed by a PostGIS database.
I am considering switching to MongoDB, and I have the following questions:
Is MongoDB a viable choice for storing GIS data? (Is any other NoSQL engine viable?)
Does django-nonrel have a modified django.contrib.gis module for MongoDB support, and how well does it work?
Thanks in advance :)
As stated by Sergio Tulentsev, MongoDB has spatial indexes, but not all geometry types can be stored. As far as I know, currently only points can be stored. You can, however, query using a polygon.
Since MongoDB is very flexible, you could store geometry as text, but also as a JSON object. For example, you could store a coordinate like { lon: 52.1234, lat: 14.1245 } and interpret it in your own application as you like.
The downside is that there is no native index on such a structure. If you never have to query by location or spatial relation, you are fine. If you do, you would have to build your own index, which can be a hard thing to do...
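To illustrate the point-only support mentioned above, here is a minimal sketch using pymongo and MongoDB's legacy 2d index; the collection and field names are arbitrary:

from pymongo import MongoClient, GEO2D

client = MongoClient()  # assumes a local MongoDB instance
places = client.testdb.places

# Points are stored as [lon, lat] pairs; the 2d index makes them queryable.
places.create_index([("loc", GEO2D)])
places.insert_one({"name": "somewhere", "loc": [52.1234, 14.1245]})

# Proximity query around a point...
print(list(places.find({"loc": {"$near": [52.12, 14.12]}}).limit(5)))

# ...and a point-in-polygon query (only the stored points are indexed,
# not polygons; the polygon exists only in the query).
triangle = [[50, 13], [54, 13], [52, 16]]
print(list(places.find({"loc": {"$geoWithin": {"$polygon": triangle}}})))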
There is also CouchDB, which has a geospatial addition; I don't know how it works or what it supports, though.
A hybrid approach is also common at companies with large data sets: for example, use MongoDB to store/query normal data, and PostGIS to store/query geospatial data.
Django MongoDB Engine developer here.
We haven't got any sort of real support for MongoDB's GIS features. It's possible to do geospatial queries by "working around Django" though (see http://django-mongodb.org/topics/lowerlevel.html - the first example is actually about Geo indexes).
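For illustration, such a workaround might look roughly like the sketch below; the model and field names are hypothetical, and the page linked above has the authoritative example:

# Hypothetical sketch of a raw geospatial query with django-mongodb-engine.
from django.db import models
from djangotoolbox.fields import ListField
from django_mongodb_engine.contrib import MongoDBManager

class Post(models.Model):
    point = ListField()          # stored as a [lon, lat] pair
    objects = MongoDBManager()   # adds raw_query() for MongoDB-native queries

# A "2d" index on `point` must be created separately (e.g. via pymongo).
nearby = Post.objects.raw_query({"point": {"$near": [52.12, 14.12]}})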
It should be possible, however, to implement native Django GIS support into the MongoDB backend. We're really happy to mentor anyone who wants to approach that task!
I've been working with MongoDB, storing points of interest in the DB and then doing point-in-polygon queries along with radius queries, and Mongo has been FAR superior to Postgres. I've had much faster response times with hundreds of thousands of items.
I would say, however, that depending on the complexity of your queries (i.e., if you're doing something beyond point-in-polygon or radius-around-a-point), you might want a heavier GIS DB like Postgres; you're just going to get the extra weight of Postgres as well.
Sorry, I can't speak to anything on the Django side.