Azure Cosmos DB - Gremlin latitude longitude format conversion issues - geocoding

I am trying to convert airport GeoCoordinate data, i.e. [IATA code, latitude, longitude], to Gremlin vertices in an Azure Cosmos DB Graph API project.
Vertex conversion is done mainly through an ASP.NET Core 2.0 console application that uses CSVReader to stream and convert data from an airport.dat (CSV) file.
This process involves converting over 6,000 lines...
So for example, in the original airport.dat source file, Montreal Pierre Elliott Trudeau International Airport would be listed in a format similar to the line below:
1,"Montreal / Pierre Elliott Trudeau International Airport","Montreal","Canada","YUL","CYUL",45.4706001282,-73.7407989502,118,-5,"A","America/Toronto","airport","OurAirports"
Then, if I define a Gremlin vertex creation query in my code as follows:
var gremlinQuery = $"g.addV('airport').property('id', \"{code}\").property('latitude', {lat}).property('longitude', {lng})";
then when the console application is launched, the vertex creation queries are generated successfully, in exactly this fashion:
1 g.addV('airport').property('id', "YUL").property('latitude', 45.4706001282).property('longitude', -73.7407989502)
Note that in the case of Montreal Airport (which is located in North America, not in the Far East...), the longitude is properly formatted with a minus (-) prefix, though the sign seems to get lost along the way, as a query on the Azure Portal shows:
{
  "id": "YUL",
  "label": "airport",
  "type": "vertex",
  "properties": {
    "latitude": [
      {
        "id": "13a30a4f-42cc-4413-b201-11efe7fa4dbb",
        "value": 45.4706001282
      }
    ],
    "longitude": [
      {
        "id": "74554911-07e5-4766-935a-571eedc21ca3",
        "value": 73.7407989502 <---- // Should be displayed as -73.7407989502
      }
    ]
  }
}
This is a bit awkward. If anyone has encountered a similar issue and was able to fix it, I'm fully open to suggestions.
Thanks

According to your description, I just executed the same Gremlin query on my side and I could retrieve the inserted vertex with the negative longitude intact. Then I queried it on the Azure Portal and retrieved the record correctly as well.
Per my understanding, you need to check the execution of your code and verify the response of your query to narrow down this issue.

Thank you for your suggestion, though the problem has now been solved in my case.
What was previously suggested as a working answer scenario [and voted 1...] has long been settled for .NET 4.5.2 [& .NET 4.6.1] used in combination with Microsoft.Azure.Graphs 0.2.4-preview. My question didn't really concern that and may have been a bit more subtle... Perhaps I should have put more emphasis on the fact that the issue was mainly related to Microsoft.Azure.Graphs 0.3.1-preview used in a Core 2.0 + dotnet CLI scenario.
According to the following comments on GitHub (Graph - Multiple issues with parsing of numeric constants in the graph gremlin query #438),
https://github.com/Azure/azure-documentdb-dotnet/issues/438
there are indeed fair reasons to believe that the issue was a bug in Microsoft.Azure.Graphs 0.3.1-preview. I chose to use the Gremlin.Net approach instead and managed to get the proper result I expected.
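For reference, the Gremlin.Net approach boiled down to something like the sketch below (endpoint, database, collection, and key are placeholders; depending on the client version, Cosmos DB may also require the GraphSON2 serializer to be passed to the client):

using System.Threading.Tasks;
using Gremlin.Net.Driver;

class Program
{
    static async Task Main()
    {
        // Placeholder Cosmos DB Gremlin endpoint and credentials.
        var server = new GremlinServer(
            "your-account.gremlin.cosmosdb.azure.com", 443, enableSsl: true,
            username: "/dbs/your-db/colls/your-graph",
            password: "your-auth-key");

        using (var client = new GremlinClient(server))
        {
            // The query travels as a plain string, so the minus sign on the
            // longitude is preserved end to end.
            var query = "g.addV('airport').property('id', \"YUL\")" +
                        ".property('latitude', 45.4706001282)" +
                        ".property('longitude', -73.7407989502)";
            await client.SubmitAsync<dynamic>(query);
        }
    }
}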

Related

Google Document AI training fails due to an error that is already addressed

I am training a model using Google's Document AI. The training fails with the following error (I have included only a part of the JSON file for simplicity but the error is identical for all documents in my dataset):
"trainingDatasetValidation": {
"documentErrors": [
{
"code": 3,
"message": "Invalid document.",
"details": [
{
"#type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "INVALID_DOCUMENT",
"domain": "documentai.googleapis.com",
"metadata": {
"num_fields": "0",
"num_fields_needed": "1",
"document": "5e88c5e4cc05ddb8.json",
"annotation_name": "INCOME_ADJUSTMENTS",
"field_name": "entities.text_anchor.text_segments"
}
}
]
}
What I understand from this error is that the model expects the field INCOME_ADJUSTMENTS to appear (at least) once in the document but instead, it finds zero instances of it.
That would have been understandable except I have already defined the field INCOME_ADJUSTMENTS in my schema as "Optional Once", i.e., this field can appear either zero or one time.
Am I missing something? Why does this error persist despite the fact that it is addressed in the schema?
p.s. I have also tried "Optional multiple" (and "Required once" and "Required multiple") and the error persists.
EDIT: As requested, here's what one of the JSON files looks like. Note that there is no PII here as the details (name, SSN, etc.) are synthetic data.
I have had the same issue as you in the past and am also having it right now.
What I managed to do was take the document name from the error message and then search for the corresponding image in the Storage bucket that holds the dataset.
Then I opened the image and located it in my 1000+ image dataset.
Then I deleted the bounding box for the label with the issue and relabeled it. This seemed to solve 90% of the issues I had.
It's a ton of manual work, and I wish Google had put more thought into the web app for Doc AI, because the ML part is great but the app is really lackluster.
I would also be very happy to hear about any other fixes.
EDIT: another, quicker workaround I have found is deleting the latest revision of the labeled documents from the dataset in Cloud Storage: take the faulty document name from the operation JSON dump, search for it under documents/, and just delete the latest revision (see the sketch below).
It will probably mess up the labeling and make you lose work, but it's a quick fix to at least make some progress if you want.
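If you want to script that search instead of clicking through the console, a rough sketch with the Google.Cloud.Storage.V1 C# client could look like this (the bucket name is a placeholder, and the documents/ prefix assumes the dataset layout described above):

using System;
using Google.Cloud.Storage.V1;

class FindFaultyDocument
{
    static void Main()
    {
        var storage = StorageClient.Create();
        const string bucket = "your-dataset-bucket";      // placeholder
        const string faultyDoc = "5e88c5e4cc05ddb8.json"; // name taken from the error metadata

        // Walk every object stored under documents/ and print the paths
        // (including revisions) that belong to the faulty document.
        foreach (var obj in storage.ListObjects(bucket, "documents/"))
        {
            if (obj.Name.EndsWith(faultyDoc, StringComparison.Ordinal))
                Console.WriteLine(obj.Name);
        }
    }
}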
Removing a few empty boxes and a lot of intersecting boxes fixed it for me.
I had the same problem, so I deleted my whole dataset, imported it again, and re-labeled it.
Then the training worked fine.

Google speech recognition weak date transcription

I am currently using Google ASR/TTS with the German speech model (de-DE) and I am experiencing wrong results in certain date-extraction scenarios.
I am really keen to know whether others see similar results.
Let me give you some examples:
I am saying:
"der 1.10.1905" -> "1. 10195 11095"
"9.5.78" -> 90587
"22.11.98" -> 22.11 89
BUT:
"22. November 98" -> "22 November 98"
When I fully qualify the month as a word, it all works fine.
I also checked the proposed way of optimizing with hints, without any improvement:
"speechContexts": [
{
"phrases": [
"$FULLDATE"
]
}
]
Is this something one has to accept or is there anything that I can try on top?
Cheers Andre
UPDATE:
According to https://issuetracker.google.com/issues/186559805 the problem should be fixed.
But I could not verify it; maybe somebody else can?
I tried it with the following configuration, but it did not improve at all. Maybe I overlooked something?
Here is my configuration. I performed the request with the beta client:
const {SpeechClient} = require('@google-cloud/speech').v1p1beta1;
const googleRequest = {
  config: {
    encoding: "MULAW",
    sampleRateHertz: 8000,
    model: "command_and_search",
    languageCode: "de-DE",
    speechContexts: [
      {
        "phrases": [
          "$OOV_CLASS_FULLDATE"
        ]
      }
    ]
  },
  singleUtterance: true,
  interimResults: true
};
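In case it helps anyone trying to reproduce this from .NET, a roughly equivalent non-streaming request with the Google.Cloud.Speech.V1P1Beta1 client might look like the sketch below (the audio file name is a placeholder; I have only verified the streaming JS variant above):

using System;
using Google.Cloud.Speech.V1P1Beta1;

class DateRecognition
{
    static void Main()
    {
        var speech = SpeechClient.Create();
        var config = new RecognitionConfig
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Mulaw,
            SampleRateHertz = 8000,
            LanguageCode = "de-DE",
            Model = "command_and_search",
            // Same hint as in the JS config above: the prebuilt date class.
            SpeechContexts = { new SpeechContext { Phrases = { "$OOV_CLASS_FULLDATE" } } }
        };

        // Placeholder: an 8 kHz mu-law recording of the spoken date.
        var audio = RecognitionAudio.FromFile("date-utterance.raw");

        var response = speech.Recognize(config, audio);
        foreach (var result in response.Results)
            Console.WriteLine(result.Alternatives[0].Transcript);
    }
}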
I also raised a new ticket to google developers.
https://issuetracker.google.com/issues/243294056
I get the same results as you when using your examples, but it does transcribe correctly if the speaker says "ein tausend neun hundert acht und neunzig" instead.
It seems that, for some reason, the API doesn't recognize when a German speaker uses hundreds to express thousands (i.e., "nineteen hundred ninety-eight" instead of "one thousand nine hundred ninety-eight").
I don't know German and can't confirm whether this way of speaking numbers/dates is standard, but if you are sure it is accepted usage in German, I suggest you create an issue on the issue tracker with some references to confirm it.

How do I find out where the response from dialogflow comes from?

I'm not a developer, so this is a little above my head.
My team has implemented two projects in Dialogflow, one for an old app and one for a new app. I have basic access to the old Dialogflow account and I can see that it has an intent called glossaries, the same intent name as in the new one. In glossaries, there is a training phrase called "What is a red talk?". This phrase only works in one of my apps and I need to know why.
There is no default response or anything under context. If I copy that curl link into a terminal, the payload doesn't return any information.
I found the API for the new app, and red talks is definitely not in the payload when I do a GET /all. There may be an old API somewhere, but no one knows where.
Where can I find this information? I'm very confused, and all the basic training for Dialogflow points to the default response, which we're not using. I have read through the docs. I have searched the three company GitHub repos that have the application in the name, but I have not found anything. I am looking for an app.intent phrase with glossaries in it, or just the word glossaries.
I have found only this JSON and a glossaryTest.php that doesn't seem helpful:
"meta": {
"total": 2,
"page": 1,
"limit": 10,
"sort": "createdAt",
"direction": "desc",
"load-more": false
},
"results": [
{
"term": "This is a term",
"definition": "This is a definition",
"links": [
{
"id": "1",
"url": "http:\/\/example.com\/1",
"title": "KWU Course: Lead Generation 36:12:3",
"ordering": "1"
},
{
"id": "2",
"url": "http:\/\/example.com\/2",
"title": "",
"ordering": "2"
}
]
}
]
}
There is also a JSON file with a lot of data for API calls, but no glossaries there either.
If we're using fulfillment to handle these intents, I don't see a fulfillment header like the Google docs say there should be. I may not have full access, so perhaps I would see more information on the screen if I did; I have no idea. The devs who created this are long gone. The devs who created the new app are also long gone.
Am I missing an API in my environment documentation? Is the intent hard coded? I suspect it was. How do I prove that or move forward?
Yes, your intents are most likely either hard-coded [0] or defined through the UI.
Each intent has a setting to enable fulfillment. If an intent requires some action by your system or a dynamic response, you should enable fulfillment for the intent. If an intent without fulfillment enabled is matched, Dialogflow uses the static response you defined for the intent. [2]
Perhaps you are using a custom integration [1]. So, unless you are using static responses (those you see in the UI), the responses may be produced by your project's own API (not the Dialogflow API), and the content may be modified before the response is eventually returned.
As I understand it, you should contact your colleagues to learn about the integration solution they created. They may have built the integration through the SDK while picking up training data from a source outside the codebase, so you may not see it directly in the code. Nonetheless, once an intent has been created, you should be able to inspect it through the UI or the API, as shown below.
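If you can get credentials for the project, one way to prove where the intent lives is to list the agent's intents, including their training phrases, through the Dialogflow API. A hedged C# sketch with the Google.Cloud.Dialogflow.V2 client (the project ID is a placeholder):

using System;
using System.Linq;
using Google.Cloud.Dialogflow.V2;

class ListGlossariesIntent
{
    static void Main()
    {
        var client = IntentsClient.Create();
        var request = new ListIntentsRequest
        {
            Parent = "projects/your-project-id/agent", // placeholder
            IntentView = IntentView.Full               // include training phrases
        };

        foreach (var intent in client.ListIntents(request))
        {
            Console.WriteLine(intent.DisplayName);
            foreach (var phrase in intent.TrainingPhrases)
                Console.WriteLine("  " + string.Concat(phrase.Parts.Select(p => p.Text)));
        }
    }
}

If "What is a red talk?" shows up under the glossaries intent here, the phrase is defined in the agent; if it does not, the behavior is most likely hard-coded in the application.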
If my answer was not helpful, please do not hesitate to clarify your needs further, perhaps by providing some more information.
[0] https://cloud.google.com/dialogflow/docs/manage-intents#create_intent
[1] https://cloud.google.com/dialogflow/docs/integrations
[2] https://cloud.google.com/dialogflow/docs/fulfillment-overview

Google ML Engine: Submit a training job via REST API

I'm trying to start a training job via a REST API request using the Census example project from Google's GitHub. I'm able to submit a job, but it always fails because I'm unable to state where the training and evaluation (testing) files are kept, and the documentation is really lacking on this - it just states args[]. When I check the logs in Google ML, the following errors appear:
task.py: error: the following arguments are required: --train-files, --eval-files
The replica master 0 exited with a non-zero status of 2.
This is my formulated REST request:
{
  "jobId": "training_12",
  "trainingInput": {
    "scaleTier": "BASIC",
    "packageUris": ["gs://MY_BUCKET/census.tar.gz"],
    "pythonModule": "trainer.task",
    "args": ["--train_files gs://MY_BUCKET/adult.data.csv", "--eval_files gs://MY_BUCKET/adult.test.csv"],
    "region": "europe-west1",
    "jobDir": "gs://MY_BUCKET/",
    "runtimeVersion": "1.4",
    "pythonVersion": "3.5"
  }
}
Under the args I've tried many different ways of stating where the train and eval files are, but I have been unable to get it to work. Just for clarification, I have to use the REST API for this use case - not the CLI.
Thanks
-- Update --
I've tried passing the args as --train-files and --eval-files; this still does not work.
-- Update 2 --
I've been able to solve this problem by formulating the args as:
"args": [
"--train-files",
"gs://MY_BUCKET/adult.data.csv",
"--eval-files",
"gs://MY_BUCKET/adult.test.csv",
"--train-steps",
"100",
"--eval-steps",
"10"],
Now, I'm getting a new error and the logs don't seem to give any more information: "The replica master 0 exited with a non-zero status of 1."
The logs show that some training has actually run, and I suspect this is related to the saving of the job, but I'm unsure.
I see that you already found the solution to your issue with args when submitting a Training Job in Google Cloud ML Engine. However, let me share with you some documentation pages where you will find all the required information regarding this topic.
In this first page about formatting configuration parameters (under the Python tab), you can see that the args field is populated like:
'args': ['--arg1', 'value1', '--arg2', 'value2'],
Therefore, the correct approach is to define each flag and its value as independent strings in the args list.
Additionally, this other page containing general information about Training jobs, explains that the training service accepts arguments as a list of strings with the format:
['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']
That is the reason why the last formatting you shared (below) is the correct one:
"args": [
"--train-files",
"gs://BUCKET/FILE",
"--eval-files",
"gs://BUCKET/FILE_2",
"--train-steps",
"100",
"--eval-steps",
"10"]

REST API design: different granularity for receiving and updating resources

I'm in the process of creating a REST API. Among others, there is a resource-type called company which has quite a lot of attributes/fields.
The two common use cases when dealing with company resources are:
Load a whole company and all its attributes with one single request
Update a (relatively small) set of attributes of a company, but never all attributes at the same time
I came up with two different approaches regarding the design of the API and need to pick one of them (maybe there are even better approaches, so feel free to comment):
1. Using subresources for fine-grained updates
Since the attributes of a company can be grouped into categories (e.g. street, city and state represent an address... phone, mail and fax represent contact information and so on...), one approach could be to use the following routes:
/company/id: can be used to fetch a whole company using GET
/company/id/address: can be used to update address information (street, city...) using PUT
/company/id/contact: can be used to update contact information (phone, mail...) using PUT
And so on.
But: Using GET on subresources like /company/id/address would never happen. Likewise, updating /company/id itself would also never happen (see use cases above). I'm not sure if this approach follows the idea of REST since I'm loading and manipulating the same data using different URLs.
2. Using HTTP PATCH for fine-grained updates
In this approach, there are no extra routes for partial updates. Instead, there is only one endpoint:
/company/id: can be used to fetch a whole company using GET and, at the same time, to update a subset of the resource (address, contact info etc.) using PATCH.
From a technical point of view, I'm quite sure that both approaches would work fine. However, I don't want to use REST in a way that it isn't supposed to be used. Which approach do you prefer?
Do you really need each and every field contained in the GET response all the time? If not, then it's more than fine to create separate resources for addresses and contacts. Maybe you will later find a further use case where you can reuse these resources.
Moreover, you can embed other resources within resources. JSON HAL (hal+json), for example, explicitly provides an _embedded property where you can embed the current state of, e.g., sub-resources. A simplified HAL-like JSON representation of an imaginary company resource with embedded resources could look like this:
{
  "name": "Test Company",
  "businessType": "PLC",
  "foundingYear": 2010,
  "founders": [
    {
      "name": "Tim Test",
      "_links": {
        "self": {
          "href": "http://example.org/persons/1234"
        }
      }
    }
  ],
  ...
  "_embedded": {
    "address": {
      "street": "Main Street 1",
      "city": "Big Town",
      "zipCode": "12345",
      "country": "Neverland",
      "_links": {
        "self": {
          "href": "http://example.org/companies/someCompanyId/address/1"
        },
        "googleMaps": {
          "href": "http://maps.google.com/?ll=39.774769,-74.86084"
        }
      }
    },
    "contacts": {
      "CEO": {
        "name": "Maria Sample",
        ...
        "_links": {
          "self": {
            "href": "http://example.org/persons/1235"
          }
        }
      },
      ...
    }
  }
}
Updating an embedded resource is therefore straightforward: send a PUT request to the enclosed URI of the particular resource. As GET requests may be cached, you might need to provide finer-grained caching settings (e.g. conditional GET requests, a.k.a. the If-Modified-Since or ETag header fields) to retrieve the actual state after an update. These headers should consider the whole resource (including embedded ones) in order to return the updated state.
Concerning PUT vs. PATCH for "partial updates":
While the semantics of PUT are rather clear, PATCH is often confused with a partial update where the client just sends the new state of some properties to the service. This article, however, describes what PATCH should really do.
In short, for a PATCH request the client is responsible for comparing the current state of the resource with the desired state and calculating the steps necessary to transform the former into the latter. The request then contains instructions that the server has to understand and execute to produce the updated version. A PATCH request is furthermore atomic: either all instructions succeed or none do. This adds transaction requirements to the request.
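For illustration, a hypothetical RFC 6902 JSON Patch body against the company resource above (sent with Content-Type: application/json-patch+json) could look like this; the test operation expresses a precondition, which makes the all-or-nothing semantics explicit:

[
  { "op": "test", "path": "/address/city", "value": "Big Town" },
  { "op": "replace", "path": "/address/city", "value": "New Town" },
  { "op": "add", "path": "/contacts/CTO", "value": { "name": "New Hire" } }
]

If the test fails, or the contacts object cannot accept the new member, none of the operations are applied.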
In this particular case I'd use PATCH instead of the subresources approach. First of all, these aren't real subresources; they're just a fake abstraction introduced to work around the problem of updating one big entity (resource), whereas PATCH is a REST-compatible, well-established, and common approach.
And (IMO the decisive argument): imagine that you need to extend company somehow (by adding a magazine, a venue, a CTO, whatever). Will you add a new endpoint so clients can update each newly added part of the resource? How does that end? With multiple endpoints that no one understands. With PATCH, your API is ready for new elements of a company.
And (IMO ultima ratio), imagine that you need to extend company somehow (by adding magazine, venue, CTO, whatever). Will you be adding a new endpoint to enable client to update this newly-added part of a resource? How it finishes? With multiple endpoint that no one understands. With PATCH your API is ready for new elements of a company.