I am used to Google Cloud SQL, where you can connect to a database outside of GAE. Is something like this possible for the GAE datastore, ideally using the Python NDB interface?
Basically, my use-case is I want to run acceptance tests that pre-populate and clean a datastore.
It looks like the current options are a JSON API or protocol buffers -- in beta. If so, it's kind of a pain that I can't use my NDB models to populate the data; I'd have to reimplement them for the tests and worry that the entities haven't been saved to the datastore in exactly the same way as they would be through the application.
Just checking I'm not missing something....
PS: yes, I know about remote_api_shell, but I don't want a shell. I guess piping commands into it is one way, but ugghh ...
Cloud Datastore can be accessed via client libraries outside of App Engine. They run on the "v1 API", which just went GA (August 16, 2016) after a few years in Beta.
The client libraries are available for Python, Java, Go, Node.js, Ruby, and there is even .NET.
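For example, a test harness could pre-populate and clean up entities from outside App Engine with the Python client along these lines (the project ID, kind, and field names here are made up for illustration):

```python
from google.cloud import datastore

client = datastore.Client(project="my-test-project")  # hypothetical project ID

# Pre-populate a test entity before the acceptance tests run.
key = client.key("Person", "alice")
person = datastore.Entity(key=key)
person.update({"name": "Alice", "city": "London"})
client.put(person)

# ... run the acceptance tests against the application ...

# Clean up everything of that kind afterwards.
cleanup = client.query(kind="Person")
cleanup.keys_only()
client.delete_multi([entity.key for entity in cleanup.fetch()])
```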
As a note, the GQL language variant supported in DB/NDB is a bit different from what the Cloud Datastore service itself supports via the v1 API. The NDB client library does some of its own custom parsing and can split certain queries into multiple ones to send to the service, combining the results client-side.
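For example, an NDB IN filter like the one below is, as far as I know, executed by the client library as one underlying query per value, with the results combined before they are returned (Person is a made-up model):

```python
from google.appengine.ext import ndb

class Person(ndb.Model):
    city = ndb.StringProperty()

# NDB splits this into separate queries for "London" and "Paris"
# and merges the results client-side.
people = Person.query(Person.city.IN(["London", "Paris"])).fetch()
```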
Take a read of our GQL reference docs.
Short answer: they're working on it. Details in
google-cloud-datastore#2 and gcloud-python#40.
I would like to crawl 3 million web pages a day. Because of the variety of content on the web (HTML, PDF, etc.), I need to use tools like Selenium or Playwright. I noticed that to use Selenium on Google Dataflow, one has to build a custom container.
Is it a good choice to use Selenium inside ParDo functions? Can a single Selenium instance be shared across multiple ParDo instances?
Does the same apply to Playwright? Should I build a custom image for it as well?
You can do anything in a Python DoFn that you can do from Python. Yes, I would definitely use custom containers for complex dependencies like this.
You can share an instance of Selenium (or any other object) per DoFn instance by initializing it in your setup method. You can share it for the whole process by using a module-level global or something like the shared utility (noting that it may then be accessed by more than one thread at once).
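As a minimal sketch of the setup-method approach (assuming Chrome and a matching chromedriver are baked into the custom worker container; the DoFn name and options are made up):

```python
import apache_beam as beam
from selenium import webdriver


class FetchPageFn(beam.DoFn):
    """Fetches each URL with a headless browser shared per DoFn instance."""

    def setup(self):
        # Runs once per DoFn instance, so the browser is reused across bundles.
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def process(self, url):
        self.driver.get(url)
        yield url, self.driver.page_source

    def teardown(self):
        self.driver.quit()


# Usage inside a pipeline: urls | beam.ParDo(FetchPageFn())
```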
I have a question about using Google Cloud Storage from a Julia application.
Currently, I am hosting a Julia application (in a Docker container) on GCP and would like the app to use Cloud Storage buckets to read and write data.
I have explored a few packages that promise to do this:
GoogleCloud.jl
The docs for this package show a clear and concise implementation. However, adding it results in incremental compilation warnings, with many packages failing to compile. I have opened an issue on their GitHub page: https://github.com/JuliaCloud/GoogleCloud.jl/issues/41
GCP.jl
The scope is limited; currently the only supported service is BigQuery.
Python package google
This is well documented and works, but calling into Python will take a toll on the code's performance. Do advise if this is the only viable option; a rough sketch of what I mean is below.
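Roughly, the Python route I have in mind (via the google-cloud-storage package, driven from Julia with PyCall) looks like this; the bucket and object names are made up:

```python
from google.cloud import storage

client = storage.Client()              # picks up Application Default Credentials on GCP
bucket = client.bucket("my-app-data")  # hypothetical bucket name

# Write a local file into the bucket.
bucket.blob("results/output.csv").upload_from_filename("output.csv")

# Read an object back as text.
settings = bucket.blob("config/settings.json").download_as_text()
```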
I would like to know: are there other methods that can be used to configure a Julia app to work with Google Cloud Storage?
Thanks, I look forward to your suggestions!
GCP.jl is promising, and you may also be able to do this with gRPC if Julia supports gRPC (see below).
Discovery
Google has two types of SDK (aka client library). API Client Libraries are available for all of Google's APIs|services.
Cloud Client Libraries are newer and more language-idiomatic, but they are only available for Cloud. Google Cloud Storage (GCS) is part of Cloud but, in this case, I think an API Client Library is worth pursuing...
Google's API (!) Client Libraries are auto-generated from a so-called Discovery document. Interestingly, GCP.jl specifically describes using Discovery to generate the BigQuery SDK and mentions that you can use the same mechanism for any other API Client Library (e.g. GCS).
NOTE Explanation of Google Discovery
I'm unfamiliar with Julia but, if you can understand enough of that repo to confirm that it uses the Discovery document to generate APIs, and you can work out how to reconfigure it for GCS, this approach would give you a 100%-fidelity SDK for Cloud Storage (and any other Google API|service).
Someone else tried to use the code to generate an SDK for Sheets and ran into an issue, so it may not be perfect.
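To make the Discovery route concrete, here is roughly what the auto-generated client looks like in Python with google-api-python-client; GCP.jl would presumably generate analogous Julia bindings from the same storage/v1 Discovery document (the bucket name is made up):

```python
import google.auth
from googleapiclient.discovery import build

# Build a GCS client directly from the storage/v1 Discovery document.
credentials, project = google.auth.default()
storage = build("storage", "v1", credentials=credentials)

resp = storage.objects().list(bucket="my-app-data").execute()  # hypothetical bucket
for item in resp.get("items", []):
    print(item["name"])
```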
gRPC
Google publishes protobufs for the subset of its services that support gRPC. If you'd prefer to use gRPC, it ought to be possible to use the protobufs in Google's repo to define a gRPC client for Cloud Storage.
I run my app locally, and it uses Datastore.
The app is written in Java and uses Objectify. The code looks like the following:
```java
ofy().transact(() -> {
    ofy().load().type(PersonEntity.class).list();
});
```
This simple query runs successfully when my app connects to my GCP Project's Datastore.
But when I use the cloud-datastore-emulator, this query is rejected with the error message "Only ancestor queries are allowed inside transactions."
This restriction on non-ancestor queries seems to have been removed in Firestore in Datastore mode, but the cloud-datastore-emulator still appears to enforce it.
My questions are:
Does the cloud-datastore-emulator not support Firestore in Datastore mode?
Is there any way to emulate Firestore in Datastore mode?
gcloud SDK version: 346.0.0
Well, the answer to your question is: it should support it, as the emulator is supposed to support everything that the production environment does. That being said, I went through the documentation after seeing your question and found that here it is stated that:
The Cloud SDK includes a local emulator of the production Datastore mode environment.
But if you follow the link, there are hints that this is an emulator for both the legacy Datastore and Firestore in Datastore mode, which might be why you are seeing this behavior. With that information at hand, it might be a good idea to open a case in Google's Issue Tracker so that their engineering team can clarify whether this is expected behavior and, if not, fix the issue.
I am about to embark on a project using Apache Hadoop/Hive which will involve a collection of hive query scripts to produce data feeds for various down stream applications. These scripts seem like ideal candidates for some unit testing - they represent the fulfillment of an API contract between my data store and client applications, and as such, it's trivial to write what the expected results should be for a given set of starting data. My issue is how to run these tests.
If I was working with SQL queries, I could use something like SQLite or Derby to quickly bring up test databases, load test data and run a collection of query tests against them. Unfortunately, I am unaware of any such tools for Hive. At the moment, my best thought is to have the test framework bring up a local Hadoop instance and run Hive against that, but I've never done that before and I'm not sure it will work, or be the right path.
Also, I'm not interested in a pedantic discussion about if what I am doing is unit testing or integration testing - I just need to be able to prove my code works.
Hive has a special standalone mode, specifically designed for testing purposes. In that mode it can run without Hadoop. I think it is exactly what you need.
Here is a link to the documentation:
http://wiki.apache.org/hadoop/Hive/HiveServer
I'm working as part of a team to support a big data and analytics platform, and we also have this kind of issue.
We've been searching for a while and we found two pretty promising tools: https://github.com/klarna/HiveRunner and https://github.com/bobfreitas/HadoopMiniCluster
HiveRunner is a framework built on top of JUnit to test Hive queries. It starts a standalone HiveServer with an in-memory HSQL database as the metastore. With it you can stub tables and views, mock samples, etc.
There are some limitations on Hive versions, but I definitely recommend it.
Hope it helps you =)
You may also want to consider the following blog post which describes automating unit testing using a custom utility class and ant: http://dev.bizo.com/2011/04/hive-unit-testing.html
I know this is an old thread, but just in case someone comes across it: I have followed up on the whole minicluster & Hive testing approach and found that things have changed with MR2 and YARN, but in a good way. I have put together an article and a GitHub repo to give some help with it:
http://www.lopakalogic.com/articles/hadoop-articles/hive-testing/
Hope it helps!
My company uses a lot of different web services on a daily basis. I find that I repeat the same steps over and over again, every day.
For example, when I start a new project, I perform the following actions:
Create a new client & project in Liquid Planner.
Create a new client in Freshbooks
Create a project in Github or Codebasehq
Add the developers who are going to be working on this project to Codebasehq or Github
Create tasks in the ticketing system on Codebasehq and tasks in Liquid Planner
This is just when starting new projects. When I have to track tasks, it gets even trickier because I have to monitor tasks in 2 different systems.
So my question is, is there a tool that I can use to create a web service that will automate some of these interactions? Ideally, it would be something that would allow me to graphically work with the web service API and produce an executable that I can run on a server.
I don't want to build it from scratch. I know, I can do it with Python or RoR, but I don't want to get that low level.
I would like to add my sources and pass data around from one service to another. What could I use? Any suggestions?
Progress DataXtend Semantic Integrator lets you build web services through an Eclipse-based GUI.
It is a commercial product, and I happen to work for the company that makes it. In some respects I think it might be overkill for you: it's really an enterprise-level data mapping tool for mapping disparate data sources (web services, databases, XML files, COBOL) to a common model, as opposed to a simple web services builder, and it doesn't really support your Github bits any more than normal Eclipse plugins would.
That said, I do believe there are Mantis plugins for github to do task tracking, and I know there's a git plugin for Eclipse that works really well (jgit).
Couldn't you simply use Selenium to execute some of these tasks for you? Basically, as long as you can do something from the browser, Selenium will be able to do it too. Selenium comes with a language called "selenese", so you can even use it to programmatically create an "API" out of your tasks.
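For instance, a rough sketch using the Python WebDriver bindings to drive one of those project-creation pages might look like this (the URL and field names are made up):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://projects.example.com/new")  # hypothetical project-creation page

# Fill in and submit the new-project form.
driver.find_element(By.NAME, "project_name").send_keys("New Client Project")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

driver.quit()
```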
I know this is a different approach to what you're originally looking for, but I've been using selenium for a number of tasks, and found it's even good to execute ANT tasks or unit tests.
Hope this helps you
What about Apache Camel?
Camel lets you create the Enterprise Integration Patterns to implement routing and mediation rules in either a Java based Domain Specific Language (or Fluent API), via Spring based Xml Configuration files or via the Scala DSL. This means you get smart completion of routing rules in your IDE whether in your Java, Scala or XML editor.
Apache Camel uses URIs so that it can easily work directly with any kind of Transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF Bus API together with working with pluggable Data Format options.