I don't see any documentation on scaling GraphEngine.
I'm confused as to how this works. I saw that there is a "distributed" mode. I can add the servers and get them to hook up, but I can't get data to be used in both places.
LocalStorage vs CloudStorage is confusing. It doesn't seem like you can query CloudStorage either (for data that might exist in multiple places).
Do you have any examples of this? I'd be very appreciative.
If you mean LINQ interfaces by "query", then currently there's no built-in support for LINQ over cloud storage. However, it should be easy to write a wrapper handler that broadcasts a query to the cloud, lets each server collect data locally using LINQ, and then aggregates the results.
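As a rough illustration of that broadcast-and-aggregate pattern (this is not GraphEngine's actual API), here is a sketch that fans a query out to a hypothetical HTTP /query endpoint on each server and merges the partial results; the endpoint, payload shape, and server list are all placeholder assumptions:

```python
# Minimal scatter-gather sketch. GraphEngine itself is C#/LINQ; the /query
# endpoint and the server list here are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests

SERVERS = ["http://server0:8080", "http://server1:8080"]  # your cluster nodes

def query_one(server, predicate):
    """Ask one server to run the query locally (e.g. via LINQ) and return its partial result."""
    resp = requests.post(f"{server}/query", json={"predicate": predicate}, timeout=30)
    resp.raise_for_status()
    return resp.json()["results"]

def query_cluster(predicate):
    """Broadcast the query to every server, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        partials = pool.map(lambda s: query_one(s, predicate), SERVERS)
    return [item for partial in partials for item in partial]
```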
Could you elaborate on your scenario? It would be interesting to see how a user would like to query over a distributed cluster.
I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.
What I want to do is transform it in various ways to get an overview table covering all of the customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases in a Cloud SQL instance, execute queries, and then write that data to BigQuery.
I was really hoping that Dataflow would be the solution, but as far as I have tried and looked online, I cannot find a way to make it work. Since I have already spent a lot of time investigating Dataflow, I thought it might be best to ask here.
Currently I am looking at Data Fusion, Datastream, Apache Airflow.
Any suggestions?
Why doesn't Dataflow fit your needs? You could run a query to find out the databases/tables, and then iteratively build the pipeline's JdbcIO sources/PCollections based on those results.
Beam has a Flatten transform that can merge multiple PCollections into one.
What you are trying to do is one of the use cases Dataflow Flex Templates were created for (to allow dynamic DAG creation within Dataflow itself), but it can be pulled off without Flex Templates as well; see the sketch below.
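Here's a rough sketch of that "build the sources in a loop, then Flatten" idea using the Python SDK's cross-language JdbcIO wrapper. The database names, JDBC URL, credentials, table and column names, and the BigQuery table/schema are placeholders to adapt to your Cloud SQL setup:

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc  # cross-language wrapper around Java JdbcIO
from apache_beam.options.pipeline_options import PipelineOptions

# In practice you would build this list at pipeline-construction time,
# e.g. by querying the instance for its databases.
CUSTOMER_DBS = ["customer_001", "customer_002", "customer_003"]

def run():
    options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc.
    with beam.Pipeline(options=options) as p:
        per_customer = []
        for db in CUSTOMER_DBS:
            rows = (
                p
                | f"Read {db}" >> ReadFromJdbc(
                    table_name="orders",  # placeholder table
                    driver_class_name="com.mysql.cj.jdbc.Driver",
                    jdbc_url=f"jdbc:mysql://10.0.0.1:3306/{db}",
                    username="readonly",
                    password="secret",
                )
                # Tag each row with the database it came from.
                | f"Tag {db}" >> beam.Map(
                    lambda row, db=db: {"customer_db": db, "id": row.id, "amount": row.amount}
                )
            )
            per_customer.append(rows)

        (
            per_customer
            | "Flatten all customers" >> beam.Flatten()
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.orders_overview",
                schema="customer_db:STRING,id:INTEGER,amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```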
Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
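For illustration, here's a minimal sketch of that loop-generated DAG; the database list, the export_database() helper and any connection details are placeholder assumptions, not a finished export job:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

CUSTOMER_DBS = ["customer_001", "customer_002", "customer_003"]  # hundreds in practice

def export_database(db_name: str) -> None:
    # Placeholder: run your extraction query against `db_name` on the
    # Cloud SQL instance and load the result into BigQuery.
    print(f"Exporting {db_name}")

with DAG(
    dag_id="export_all_customer_dbs",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One near-identical task per customer database.
    for db in CUSTOMER_DBS:
        PythonOperator(
            task_id=f"export_{db}",
            python_callable=export_database,
            op_kwargs={"db_name": db},
        )
```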
However, I'd be remiss not to ask: should you?
There may be a really excellent reason why you've created hundreds of databases in one instance, rather than one database with a customer field on each table. Yet if security is paramount, a row level security policy could add an additional element of safety without putting you in this difficult situation. Adding an index over the customer field would allow you to retrieve the appropriate sub-table swiftly (in return for a small speed cost when inserting new rows) so performance also doesn't seem like a reason to do this.
Given that it would then be pretty straightforward to get your data into BigQuery I would be moving heaven and earth to switch over to this setup, if I were you!
We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product's functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart/quick way to do it, that does not involve iterating over all of the entities in series?
You can probably use Dataflow to help with your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
Its SDK was open-sourced as the Apache Beam project, and Dataflow is fully compatible with that SDK. This allows you to test your pipelines locally before running them on GCP.
It exposes two main concepts: a PCollection, basically the data being handled by the tool, and a pipeline, the series of steps needed to capture the data, the transformations that must be performed, and how and where the results should be written.
It provides support for Java, Python and Go, and a rich feature set and variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow provides support for reading, writing and deleting data. See for instance the relevant documentation for Python.
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
These two other articles could also be interesting: 1 2.
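As a rough sketch of the kind of batch job this could look like, here's a minimal Beam pipeline that reads every entity of one Kind, rewrites a field, and writes the entities back. The Kind, project, and field names are placeholders, and you should check the datastoreio docs for the exact entity/property types in your SDK version:

```python
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import (
    ReadFromDatastore,
    WriteToDatastore,
)
from apache_beam.io.gcp.datastore.v1new.types import Query
from apache_beam.options.pipeline_options import PipelineOptions

PROJECT = "my-project"

def migrate(entity):
    # Assumes entity.properties behaves like a plain dict of the entity's fields.
    entity.properties["new_field"] = entity.properties.pop("old_field", None)
    return entity

def run():
    options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc.
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read entities" >> ReadFromDatastore(Query(kind="MyKind", project=PROJECT))
            | "Update fields" >> beam.Map(migrate)
            | "Write back" >> WriteToDatastore(PROJECT)
        )

if __name__ == "__main__":
    run()
```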
I would presume that you have to loop through each one and update it, as Datastore is a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain; we had to write migrations that would loop through everything and update it.
So the question has more to do with which services I should be using to get efficient performance.
Context and goal:
So what i trying to do exactly is use tag manager custom HTML so after each Universal Analytics tag (event or pageview) send to my own EC2 server a HTTP request with a similar payload to what is send to Google Analytics.
What I have thought, planned and researched so far:
At this moment I have two big options:
Use AWS Kinesis, which seems like a great idea, but the problem is that it only drops the information into one Redshift table, and I would like to have at least 4 or 5 so I can differentiate pageviews from events, etc. My solution to this would be to split each request server-side into a separate stream (see the sketch below).
The other option is to use Spark + Kafka. (Here is a detailed explanation.)
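For option 1, a rough sketch of the server-side split could look like the following. The stream names and the hit-type parameter ('t', as in the GA Measurement Protocol) are assumptions to adapt to your payload:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical stream names, one per hit type.
STREAM_BY_HIT_TYPE = {
    "pageview": "ga-pageviews",
    "event": "ga-events",
    "timing": "ga-timings",
    "transaction": "ga-transactions",
}

def forward_hit(payload: dict) -> None:
    """Route one incoming hit to the Kinesis stream matching its hit type."""
    stream = STREAM_BY_HIT_TYPE.get(payload.get("t"), "ga-other")
    kinesis.put_record(
        StreamName=stream,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=payload.get("cid", "anonymous"),  # GA client id as partition key
    )
```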
I know that at some point this means I'm building a parallel Google Analytics, with everything that implies. I still need to decide what information I should send (I'm referring to which parameters, for example source and medium), how to format it correctly, and how to process it correctly.
Questions and debate points:
Which option is more efficient and easier to set up?
Should I send this information directly from the server of the page/app, or from the user side by making it do requests as I explained before?
Has anyone done something like this in the past? Any personal recommendations?
You'd definitely benefit from the Google Analytics customTask feature instead of custom HTML. More on this from Simo Ahava. Also, Google BigQuery is quite a popular destination for streaming hit data, since it allows many 'on the fly' computations such as sessionization, and there are many ready-to-use cases for BQ.
As I have very little knowledge of how ESBs work in tandem with databases, I'm asking how communication can take place between the two, hoping I'll at least be pointed in the right direction to search in!
SITUATION: We have two systems (one of them is the client's) on different networks, each with its own database. We are required to do a regular, real-time exchange of all data points present in our database with the other system. We are also required to have a provision to be able to import data into our system. This exchange has to follow SOA principles over the customer-provided BizTalk ESB. We are supposed to provide the exchange by the use of ODBC.
Question: My query is whether it is possible to integrate the databases with the ESB as endpoints, without making any use of web services or extra interfaces, and send the data over the ESB as a pull-push transfer mechanism.
I have tried searching the net for this situation but have not come up with many straightforward answers. Could someone please point me in the right direction?
The ESB Toolkit in BizTalk is not an ESB! It is just a small additional tool for some special cases.
Let's stop talking about the ESB; we need to solve the technical problem, right?
As I understand it, you have two SQL databases and want to integrate them.
To do so with BizTalk, the easiest way is to use the WCF-SQL ports/adapters.
You start the wizard for this adapter and choose the tables/stored procedures that should provide or consume data; the wizard will generate all the needed XML schemas for you.
Then you will use the BizTalk Mapper to create the XSLT maps, which will transform one SQL data format into another.
Then you will create a pair of ports. One will consume data from one SQL database, and the second will insert data into the other SQL database. One of these ports will use the XSLT map mentioned above.
If you need more processing, you can create an orchestration to manage additional processing, sophisticated error handling, etc.
I would recommend using MSMQ. There's a fairly detailed description of it here
I have a long list of towns and cities, and I'd like to add latitude and longitude information to each of them.
Does anyone know the easiest way to generate this information once?
See also Geocode multiple addresses
The first part of the third video shows how to get latitude and longitude using Google Refine and geocoding. No need to write a new script. Ideal for doing this kind of change once.
http://code.google.com/p/google-refine/
Or use www.geonames.org - there are language APIs for that. Or OpenStreetMap's Nominatim: http://wiki.openstreetmap.org/wiki/Nominatim - Google has slightly more restrictive terms of service.
You can use the Google Geocoding API. Check the API at this URL: http://code.google.com/apis/maps/documentation/geocoding/
What follows next is writing some code. I am doing something similar in C# and it is quite easy here.
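For illustration, here's a minimal sketch of the same one-off pass in Python (rather than C#), using the Geocoding API's JSON endpoint. You'd need your own API key and to respect the API's usage terms and quotas:

```python
import time

import requests

API_KEY = "YOUR_API_KEY"
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(place: str):
    """Return (lat, lng) for a place name, or None if the API finds nothing."""
    resp = requests.get(GEOCODE_URL, params={"address": place, "key": API_KEY}, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    if data["status"] == "OK":
        loc = data["results"][0]["geometry"]["location"]
        return loc["lat"], loc["lng"]
    return None

towns = ["Bristol, UK", "Ghent, Belgium", "Turku, Finland"]
for town in towns:
    print(town, geocode(town))
    time.sleep(0.1)  # stay well under the request rate limits
```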
Most geocoding services can handle queries with only administrative names which is what you're after, e.g., municipality and region. So I'd choose one you like that also handles batch or bulk requests, e.g., the Bing Spatial Data API (here's an article on batch geocoding with it.)
An alternative approach that might be useful if you're on a budget and have a lot of these to do would be to download the Geonames database and write a bit of code to import it into your database or index it; then query it however and as often as you like, e.g., if you put your places in another table you could SELECT [...] FROM my_places LEFT JOIN geonames [...]. I used to import the Geonames DB into a vanilla PostgreSQL database nightly and probably still have the code in a git repo somewhere if that's a route you want to try (comment and I'll find it and attach it).
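For illustration (not the original Postgres code mentioned above), here's a rough sketch of that import-and-query route, assuming the cities500.txt Geonames dump and SQLite. Column positions follow the Geonames readme (name=1, latitude=4, longitude=5, country code=8); double-check against the dump you download:

```python
import csv
import sqlite3

conn = sqlite3.connect("geonames.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS places (name TEXT, country TEXT, lat REAL, lng REAL)"
)

# The dump is tab-separated with no quoting.
with open("cities500.txt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?, ?)",
        ((row[1], row[8], float(row[4]), float(row[5])) for row in reader),
    )

conn.execute("CREATE INDEX IF NOT EXISTS idx_places_name ON places (name)")
conn.commit()

# Then join or look up your own towns however you like:
for name, country, lat, lng in conn.execute(
    "SELECT * FROM places WHERE name = ?", ("Bristol",)
):
    print(name, country, lat, lng)
conn.close()
```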
For a service that uses Google, which I find most accurate, look at http://www.torchproducts.com/tools/geocode