Is it possible to use GeoSPARQL with GeoJSON objects on AWS Neptune?

The documentation for AWS Neptune does not indicate that it supports the GeoSPARQL extension. Is there any way to run geospatial queries over data we have stored as GeoJSON objects? Perhaps using Gremlin? We are considering Azure Cosmos as an alternative but would rather stay in the AWS ecosystem. Thank you!
Added in response to the comment asking for GeoJSON specifics:
As a recent workaround, we have been exploring the use of geohashing and bounding boxes to solve this in a NoSQL store, or else the original Azure Cosmos direction. The problem with the former solution is the computational time to do so at scale, and then fetching and combining results from the graph with some type of unique ID mapping. Instead, traversing raw lat/long coordinates via GeoJSON-enabled queries in the graph means that we could do everything in one pass. Ultimately it enables a dramatically simpler architecture and cleaner DevOps.
Regarding specifics: mostly just querying items within a radius or Manhattan distance of a given point. We don't currently use polygons but have been considering them for various use cases.
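To make the workaround concrete, here is a minimal, hedged sketch of the bounding-box-plus-refine idea described above, done client-side in plain Python: prefilter candidates with a lat/long bounding box (the role the geohash plays in the NoSQL approach), then compute exact great-circle distances and sort. The places list and its lat/lon field names are hypothetical; with Neptune you would first pull the coordinates out via Gremlin or SPARQL and apply something like this to the results.

import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters between two lat/long points.
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(places, lat, lon, radius_m):
    # Cheap bounding-box prefilter, then an exact haversine refine, sorted closest-first.
    dlat = radius_m / 111_320.0
    dlon = radius_m / (111_320.0 * max(math.cos(math.radians(lat)), 1e-6))
    candidates = [p for p in places
                  if abs(p["lat"] - lat) <= dlat and abs(p["lon"] - lon) <= dlon]
    scored = [(haversine_m(lat, lon, p["lat"], p["lon"]), p) for p in candidates]
    return sorted((s for s in scored if s[0] <= radius_m), key=lambda s: s[0])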

Can you provide more details on the types of geospatial queries that you'd want to support over the GeoJSON objects?

From the latest AWS Neptune User Guide (PDF) in the section on Blazegraph to Neptune compatibility (dated Dec 2020 at time of writing):
Geospatial search – Blazegraph supports the configuration of namespaces that enable geospatial support. This feature is not yet available in Neptune.
Therefore, the answer appears to be 'no' if using out-of-the-box functionality.

Related

Does Amazon Web Services have a geo query search service? (with sorting results by closest to point)

Does Amazon Web Services have a solution for storing objects in a database with a latitude and longitude, and then running a geo query that returns all objects within a specific radius, sorted by how close they are to the center of that radius?
There are geohashing solutions for DynamoDB, but those don't allow you to sort results by what is closest to the target location.
Use RDS PostgreSQL with PostGIS, its spatial extension. PostGIS gives you geospatial data types, functions, and indexes. With that you can ask all sorts of geospatial questions in a performant way using plain old SQL :)
You can use ST_DWithin to get what you want, for example:
SELECT DISTINCT ON (s.gid) s.gid, s.school_name, s.geom, h.hospital_name
FROM schools s
LEFT JOIN hospitals h ON ST_DWithin(s.geom, h.geom, 3000)
ORDER BY s.gid, ST_Distance(s.geom, h.geom);
This query will find the nearest hospital to each school that is within 3000 units of the school. ST_DWithin leverages spatial indexes to perform the search faster, so make sure you create a spatial index on your geom column. If the units of the spatial reference system are meters, then those 3000 units are meters.
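If the goal is specifically "all rows within a radius, sorted by distance from the center" (as in the question), a minimal sketch of calling PostGIS from Python with psycopg2 might look like the following. The places table, its geom column, and the connection details are assumptions for illustration; casting to geography makes the radius meters, assuming WGS84 coordinates.

import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect(host="my-rds-host", dbname="gis", user="app", password="...")

def places_within_radius(lat, lon, radius_m):
    # Return rows within radius_m meters of (lat, lon), closest first.
    sql = """
        SELECT id, name,
               ST_Distance(geom::geography,
                           ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS dist_m
        FROM places
        WHERE ST_DWithin(geom::geography,
                         ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                         %s)
        ORDER BY dist_m;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (lon, lat, lon, lat, radius_m))
        return cur.fetchall()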
Take a look at this intro to PostGIS workshop to get an overview of PostGIS and its capabilities. Visit Managing spatial data with the PostGIS extension to see how to work with PostGIS on AWS.
Hope this helps.
I think the term you're looking for is spatial search or spatial indexing. DynamoDB does not have such a feature, but the SQL-based products in the RDS family support spatial indexes, including geofencing, just like their "normal" counterparts (MySQL, Postgres, etc.).
If you are looking for a low-touch solution, you can look into the different Aurora options, including Aurora Serverless. For example, Aurora MySQL has support for different spatial types/indices.
There are many offerings from AWS that can support geospatial queries; your specific use case will determine which option is best for you.
NoSQL
DocumentDB
MemoryDB
ElastiCache for Redis (see the Redis GEO sketch after this list)
Relational
Aurora
RDS (multiple engines)
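To make the ElastiCache for Redis option concrete: Redis's built-in GEO commands give you "within a radius, sorted by distance" directly. A minimal sketch with the redis-py client follows; the key name, members, and endpoint are made up, and the raw command form is used for GEOADD because its Python signature differs between redis-py versions.

import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)

# Add members as (longitude, latitude, name).
r.execute_command("GEOADD", "stores", -122.4194, 37.7749, "store:sf")
r.execute_command("GEOADD", "stores", -122.2712, 37.8044, "store:oakland")

# All members within 15 km of a point, closest first, with distances in meters.
# (GEORADIUS still works; newer Redis versions also offer GEOSEARCH.)
nearby = r.georadius("stores", -122.40, 37.78, 15000, unit="m",
                     withdist=True, sort="ASC")
print(nearby)  # e.g. [[b'store:sf', 1754.3], [b'store:oakland', ...]]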

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. From what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're describing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page)
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
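As a rough illustration of the batch shape described above, here is a minimal Apache Beam (Python SDK) sketch that reads the result of a BigQuery query and writes each row to an on-premise PostgreSQL database with psycopg2. The query, table, and connection details are placeholders; in practice you would run this on the Dataflow runner with network connectivity (e.g. VPN or Interconnect) back to the on-premise host, and you might prefer a dedicated JDBC/database connector if one exists for your target.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class WriteToPostgres(beam.DoFn):
    # Writes each BigQuery row (a dict) into an on-premise Postgres table.
    def setup(self):
        import psycopg2  # imported here so it is resolved on the workers
        self.conn = psycopg2.connect(host="onprem-db.example.internal",
                                     dbname="reporting", user="etl", password="...")

    def process(self, row):
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO weekly_extract (id, metric, value) VALUES (%s, %s, %s)",
                (row["id"], row["metric"], row["value"]))
        self.conn.commit()  # per-row commit for simplicity; batch in practice

    def teardown(self):
        self.conn.close()

options = PipelineOptions()  # pass --runner=DataflowRunner, project, region, etc.
with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
           query="SELECT id, metric, value FROM dataset.table", use_standard_sql=True)
     | "WriteOnPrem" >> beam.ParDo(WriteToPostgres()))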

Selecting the right cloud storage option on GCP

I am an entry-level developer at a startup. I am trying to deploy a text classifier on GCP. For storing inputs (training data) and outputs, I am struggling to find the right storage option.
My data isn't huge in terms of columns but is fairly huge in terms of instances. It could even be just key-value pairs. My use case is to retrieve each entity from just one particular column in the DB, apply some classification to it, store the result in the corresponding column, and update the DB. Our platform requires a DB that can handle a lot of small queries at once without much delay. Also, the data is completely non-relational.
I looked into GCP's article on choosing a storage option but couldn't narrow my options down to any specific answer. Would love to get some advice on this.
You should take a look at Google's "Choosing a Storage Option" guide: https://cloud.google.com/storage-options/
Your data is structured, your main goal is not analytics, your data isn't relational, you don't mostly need mobile SDKs, so you should probably use Cloud Datastore. That's a great choice for durable key-value data.
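To show what that looks like in practice, here is a minimal sketch with the google-cloud-datastore Python client; the Document kind and its properties are made-up names for the classifier's inputs and outputs.

from google.cloud import datastore

client = datastore.Client()  # uses the project's default credentials

def save_document(doc_id, text):
    # Store one input row as a key-value style entity.
    key = client.key("Document", doc_id)
    entity = datastore.Entity(key=key)
    entity.update({"text": text, "label": None})
    client.put(entity)

def classify_and_update(doc_id, classifier):
    # Fetch one entity by key, classify its text, and write the label back.
    key = client.key("Document", doc_id)
    entity = client.get(key)
    entity["label"] = classifier(entity["text"])
    client.put(entity)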
In brief, these are the storage options available; the set may grow or shrink in the future. Depending on your requirements, you can select whichever option is best suited.
SOURCE: Linux Academy

How to do incremental development with DynamoDB?

We're currently investigating a proper key-value store to store feed data. As we're hosting on AWS, their DynamoDB key-value solution looks very tempting. But it seems that a DynamoDB table structure is unmodifiable after creation, making it really cumbersome to do rolling updates. Also, even while not live, it seems very painful to have to copy all your indexes manually just to add an index to your table. Maybe I'm missing some kind of automation tool somewhere?
DynamoDB recently announced that online indexing will be coming soon, so you will be able to add indexes when you need them rather than defining them all at table creation time. Here is the relevant section from this source, dated 08 Oct 2014:
Online Indexing (Available Soon) We are planning to give you the ability to add and remove indexes for existing DynamoDB tables. This will give you the flexibility to adjust your indexes to match your evolving query patterns. This feature will be available soon, so stay tuned!
Update 2015/1/27: Online Indexing is now available
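For reference, adding an index to an existing table now looks something like this with boto3; the table, attribute, and index names below are placeholders.

import boto3

dynamodb = boto3.client("dynamodb")

# Online indexing: create a global secondary index on an existing table.
dynamodb.update_table(
    TableName="feeds",
    AttributeDefinitions=[
        {"AttributeName": "author_id", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "author_id-index",
                "KeySchema": [{"AttributeName": "author_id", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)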
DynamoDB tables do not have any structure beyond primary keys and indexes.
You should design your architecture before creating any tables.
Think, then do.
In any case, AWS has a flexible API; you can write your own scripts or use the aws-cli for automation.

Data Warehouse and Django

This is more of an architectural question than a technological one per se.
I am currently building a business website/social network that needs to store large volumes of data and use that data to draw analytics (consumer behavior).
I am using Django and a PostgreSQL database.
Now my question is: I want to expand this architecture to include a data warehouse. The ideal would be: the operational DB would be the current Django PostgreSQL database, and the data warehouse would be something additional, preferably in a multidimensional model.
We are still in a very early phase, we are going to test with 50 users, so something primitive such as a one-column table for starters would be enough.
I would like to know if somebody has experience with this situation and could recommend a framework for creating a data warehouse, all while maintaining the operational DB with the Django models for ease of use (if possible).
Thank you in advance!
Here are some cool Open Source tools I used recently:
Kettle - great ETL tool, you can use this to extract the data from your operational database into your warehouse. Supports any database with a JDBC driver and makes it very easy to build e.g. a star schema.
Saiku - nice Web 2.0 frontend built on Pentaho Mondrian (MDX implementation). This allows your users to easily build complex aggregation queries (think Pivot table in Excel), and the Mondrian layer provides caching etc. to make things go fast. Try the demo here.
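The extract step itself doesn't have to be Kettle; as a rough idea of what "extract from the operational database into a star schema" means, here is a minimal Python sketch. The table and column names are invented, the dim_date table is assumed to have a unique constraint on day, and a real warehouse load would be batched and incremental rather than row-by-row.

import psycopg2

operational = psycopg2.connect(dbname="django_app", user="app", password="...")
warehouse = psycopg2.connect(dbname="warehouse", user="etl", password="...")

# Extract events from the operational (Django) database ...
with operational.cursor() as src:
    src.execute("SELECT user_id, action, created_at FROM app_event")
    rows = src.fetchall()

# ... and load them into a simple star schema: a date dimension plus a fact table.
with warehouse.cursor() as dst:
    for user_id, action, created_at in rows:
        dst.execute(
            "INSERT INTO dim_date (day) VALUES (%s) ON CONFLICT (day) DO NOTHING",
            (created_at.date(),))
        dst.execute(
            "INSERT INTO fact_user_action (user_id, action, day) VALUES (%s, %s, %s)",
            (user_id, action, created_at.date()))
warehouse.commit()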
My answer does not necessarily apply to data warehousing. In your case I see the possibility of implementing a NoSQL database solution alongside the OLTP relational store, which in this case is PostgreSQL.
Why consider NoSQL? In addition to the obvious scalability benefits, NoSQL offers a number of advantages that will probably apply to your scenario. For instance, the flexibility of having records with different sets of fields, and key-based access.
Since you're still in the "trial" stage, you might find it easier to decide on a NoSQL database solution depending on your hosting provider. For instance AWS has SimpleDB, Google App Engine provides its own Datastore, etc. However, there are plenty of other NoSQL solutions you can go for that have nice Python bindings.
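As one concrete example of such a store with Python bindings (DynamoDB via boto3, chosen purely for illustration; the table and attributes are made up), note how items in the same table can carry different sets of fields and are fetched by key:

import boto3

table = boto3.resource("dynamodb").Table("consumer_events")

# Items in the same table may have different sets of fields.
table.put_item(Item={"user_id": "u1", "event": "view", "product": "p42"})
table.put_item(Item={"user_id": "u2", "event": "purchase", "amount": 19, "coupon": "SPRING"})

# Key-based access.
item = table.get_item(Key={"user_id": "u1"})["Item"]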