I want to manipulate a file in MarkLogic using Python - python-2.7

declareUpdate();
// get Docs
var myDoc = cts.doc("/heal/scripts/Test.json").toObject();
// add Data
myDoc.prescribedPlayer = [
  {
    "default": "http://www.youtube.com/watch?v=hYB0mn5zh2c"
  }
];
// persist
xdmp.documentInsert("/heal/scripts/Test.json", myDoc, null, "scripts");

You're looking to add a new JSON property. You can do that using a REST Client API request, sending a PATCH command. Use an insert instruction in the patch.
See the note in Specifying Position in JSON, which indicates that
You cannot use last-child to insert a property as an immediate child of the root node of a document. Use before or after instead. For details, see Limitations of JSON Path Expressions.
Instead, your patch will look something like:
{
  "insert": {
    "context": "/topProperty",
    "position": "after",
    "content": [
      {
        "default": "http://www.youtube.com/watch?v=hYB0mn5zh2c"
      }
    ]
  }
}
where topProperty is a JSON property that is part of the root node of the JavaScript object you want to update.
If that approach is problematic (for instance, if there is no topProperty that's reliably available), you could also do a sequence of operations:
1. retrieve the document
2. edit the content in Python
3. update the document in the database
With this approach, there is the possibility that some other process may update the document while you're working on it. You can rely on either optimistic locking or a multi-statement transaction to work around that, depending on the potential consequences of someone else doing a write; a sketch of the flow is below.
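For illustration, here is a rough sketch of that flow using the requests library. The host, credentials, and content-versioning setup are assumptions, not something from the question; the If-Match header only gives you optimistic locking if content versioning is enabled on the REST instance.

    # Sketch of the retrieve / edit / update flow against the MarkLogic REST API.
    # Host and credentials are placeholders.
    import json
    import requests
    from requests.auth import HTTPDigestAuth

    auth = HTTPDigestAuth("admin", "admin")
    base = "http://localhost:8000/v1/documents"
    params = {"uri": "/heal/scripts/Test.json"}

    # 1. retrieve the document
    resp = requests.get(base, params=params, auth=auth)
    doc = resp.json()
    etag = resp.headers.get("ETag")   # present when content versioning is enabled

    # 2. edit the content in Python
    doc["prescribedPlayer"] = [
        {"default": "http://www.youtube.com/watch?v=hYB0mn5zh2c"}
    ]

    # 3. update the document in the database
    headers = {"Content-Type": "application/json"}
    if etag:
        headers["If-Match"] = etag    # reject the write if someone updated it meanwhile
    requests.put(base, params=dict(params, collection="scripts"),
                 data=json.dumps(doc), headers=headers, auth=auth)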

Hey @Ankur, please check the Python method below:
def PartialUpdateData(self, filename, content, context):
    # document URI is built from the collection and file name
    self.querystring = {"uri": "/" + self.collection + "/" + filename}
    url = self.baseUri
    self.header = {'Content-Type': "application/json"}
    # JSON patch: insert the new content before the given context node
    mydata = {
        "patch": [{
            "insert": {
                "context": context,
                "position": "before",
                "content": content
            }
        }]
    }
    resp = requests.patch(url + "/documents", data=json.dumps(mydata),
                          headers=self.header, auth=self.auth,
                          params=self.querystring)
    return resp.content
I hope this can solve your problem.
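For readers without the surrounding class, a self-contained sketch of the same patch call could look like this. The host, credentials, and the /topProperty context are assumptions carried over from the first answer, not details from the question.

    # Standalone sketch: send the "insert after /topProperty" patch from the
    # first answer through the /v1/documents endpoint. Host and credentials
    # are placeholders.
    import json
    import requests
    from requests.auth import HTTPDigestAuth

    patch = {
        "patch": [{
            "insert": {
                "context": "/topProperty",
                "position": "after",
                "content": [
                    {"default": "http://www.youtube.com/watch?v=hYB0mn5zh2c"}
                ]
            }
        }]
    }

    resp = requests.patch(
        "http://localhost:8000/v1/documents",
        params={"uri": "/heal/scripts/Test.json"},
        data=json.dumps(patch),
        headers={"Content-Type": "application/json"},
        auth=HTTPDigestAuth("admin", "admin")
    )
    print(resp.status_code, resp.content)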

Related

AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.
There are a few custom fields and data points I want to bring over to help enrich the data even more. One of these is an additional answer / body that will contain key information for the searching.
This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed on String data fields within Kendra.
I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.
Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html
Current Setup:
I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.
If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit in that field, which is why I am trying to set up the enrichment.
I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction, as well as no additional filtering, so it runs on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be required, so I have added that as well.
For my lambda, I create the following:
exports.handler = async (event) => {
    // Debug
    console.log(JSON.stringify(event))
    // Vars
    const s3Bucket = event.s3Bucket;
    const s3ObjectKey = event.s3ObjectKey;
    const meta = event.metadata;
    // Answer
    const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');
    // Remove HTML Tags
    const removeTags = (str) => {
        if ((str === null) || (str === ''))
            return false;
        else
            str = str.toString();
        return str.replace(/(<([^>]+)>)/ig, '');
    }
    // Truncate
    const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
    let result = truncate(removeTags(answer.value.stringValue));
    // Response
    const response = {
        "version": "v0",
        "s3ObjectKey": s3ObjectKey,
        "metadataUpdates": [
            { "name": "sf_answer_preview", "value": { "stringValue": result } }
        ]
    }
    // Debug
    console.log(response)
    // Response
    return response
};
Based on the contract for the lambda described here, it appears fairly straightforward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce), and strip and truncate the value to 2,000 characters.
For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.
When I log the data in the lambda, the pre-extraction event details are as follows:
{
  "s3Bucket": "kendrasfdev",
  "s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
  "metadata": {
    "attributes": [
      {
        "name": "_document_title",
        "value": {
          "stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
        }
      },
      {
        "name": "sf_answer_preview",
        "value": {
          "stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available online. This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the Exploratory program description. "
        }
      },
      {
        "name": "_data_source_sync_job_execution_id",
        "value": {
          "stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
        }
      }
    ]
  }
}
The Problem:
When this runs, I am still getting the same field limit error that the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I am thinking that the response in the lambda for some reason isn't setting the field value to the new content correctly and is still trying to use the data directly from Salesforce, thus throwing the error.
Has anyone set up lambdas for Kendra before that might know what I am doing wrong? This seems pretty common to be able to do things like strip PII information before it gets indexed, so I must be slightly off on my setup somewhere.
Any thoughts?
Since you are still passing the rich text as a metadata field of the document, the character limit still applies, so the document fails at the validation step of the API call and never reaches the enrichment step. A workaround is to somehow append those rich text fields to the body of the document so that your lambda can access them there. But if those fields are auto-generated for your documents from your data sources, that might not be easy.

How the users can access my Elasticsearch database in my Django SaaS?

Let's say I have a SaaS with a Django backend that processes user data and writes everything to Elasticsearch. Now I would like to give users access to search and request their data stored in ES, using all the search requests available in ES. Obviously a user should only have access to their own data, not to other users' data. I am aware that this can be done in a lot of different ways, but I wonder what the safe and best solution is. At this point I store everything in one index and type in the way shown below, but I could do this in any other way.
"_index": "example_index",
"_type": "example_type",
"_id": "H2s-lGsdshEzmewdKtL",
"_score": 1,
"_source": {
"user_id": 1,
"field1": "example1",
"field2": "example2",
"field3": "example3"
}
I think the best way would be to associate every document with the user_id. The user would send, for example, a GET request with a body and an Authorization header with a Token. I would use the Token to extract the id of the user, for example in this way:
key = request.META.get('HTTP_AUTHORIZATION').split()[1]
user_id = Token.objects.get(key=key).user_id
After this I would redirect their request to ES, and only data that meets the requirements and belongs to this user would be returned. Of course I could do this as shown above, where I also add the field user_id. For example, I could use post_filter in this way:
To every request I would add something like this:
,
"post_filter": {
  "match": {
    "user_id": 1
  }
}
For example, the user sends a GET with body
{
  "query": {
    "regexp": {
      "tag": ".*example.*"
    }
  }
}
and I change this in my backend and redirect the request to ES with body:
{
  "query": {
    "regexp": {
      "tag": ".*example.*"
    }
  },
  "post_filter": {
    "match": {
      "user_id": 1
    }
  }
}
but it doesn't seem to me that including this field in _source is a good idea. I am almost sure it can be solved in a more optimal way than post_filtering. I see a lot of information about authorization in ES, but I can't find how I can associate a document with a user_id and then search only that user's documents without post_filtering. Any ideas?
UPDATE
My current solution looks the way shown below; however, as I mentioned, I believe it is not the optimal way. If anyone has an idea how I can solve this in the way described above, I would be grateful for help.
I send for example
{
  "query": {
    "regexp": {
      "tag": ".*test.*"
    }
  }
}
In the Django backend I just do
key = request.META.get('HTTP_AUTHORIZATION').split()[1]
user_id = Token.objects.get(key=key).user_id
body = json.loads(request.body)
body['post_filter'] = {"match": {"user_id": user_id}}
res = es.search(index="pictures", doc_type="picture", body=body)
output = []
for hit in res['hits']['hits']:
    output.append(hit["_source"])
return Response(
    {'output': output},
    status=status.HTTP_200_OK)
As of Elasticsearch 7.1, basic security is now included in the free version of Elasticsearch. Thanks to that, you can control your users' access per index.
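Separately from the security features mentioned above, one common per-user scoping technique (not from the answer, just an illustration) is a filtered alias: a sketch below, assuming the Python Elasticsearch client and the "pictures" index from the question, creates one alias per user so searches through the alias only ever match that user's documents.

    # Sketch using a filtered alias: one alias per user, filtered on user_id.
    # The alias naming scheme is illustrative.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def create_user_alias(user_id):
        es.indices.put_alias(
            index="pictures",
            name="pictures-user-%s" % user_id,
            body={"filter": {"term": {"user_id": user_id}}}
        )

    # The Django view could then search the alias instead of adding a post_filter:
    # res = es.search(index="pictures-user-%s" % user_id, body=body)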

Google Cloud Vision Api only return "name"

I am trying to use Google Cloud Vision API.
I am using the REST API in this link.
POST https://vision.googleapis.com/v1/files:asyncBatchAnnotate
My request is
{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "gs://redaction-vision/pdf_page1_employment_request.pdf"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "gs://redaction-vision"
        }
      }
    }
  ]
}
But the response always contains only "name", like below:
{
  "name": "operations/a7e4e40d1e1ac4c5"
}
My "gs" location is valid.
When I write the wrong path in "gcsSource", 404 not found error is coming.
Who knows why my response is weird?
This is expected; it will not send you the output as an HTTP response. To see what the API did, you need to go to your destination bucket and check for a file named "xxxxxxxxoutput-1-to-1.json". Also, you need to specify the name of the object in your gcsDestination section, for example: gs://redaction-vision/test.
Since asyncBatchAnnotate is an asynchronous operation, it won't return the result; it instead returns the name of the operation. You can use that unique name to call GetOperation to check the status of the operation.
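For illustration, a minimal polling sketch might look like this; the API key is a placeholder (any valid authentication for the Vision API works), and the operation name is the one from the response above.

    # Poll the long-running operation returned by asyncBatchAnnotate until done.
    import time
    import requests

    API_KEY = "YOUR_API_KEY"                           # placeholder
    operation_name = "operations/a7e4e40d1e1ac4c5"     # from the response above

    url = "https://vision.googleapis.com/v1/%s?key=%s" % (operation_name, API_KEY)

    while True:
        op = requests.get(url).json()
        if op.get("done"):
            break
        time.sleep(5)   # check again in a few seconds

    # The annotations themselves are written to the gcsDestination bucket,
    # not returned in this HTTP response.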
Note that there could be more than 1 output file for your pdf if the pdf has more pages than batchSize and the output json file names change depending on the number of pages. It isn't safe to always append "output-1-to-1.json".
Make sure that the uri prefix you put in the output config is unique because you have to do a wildcard search in gcs on the prefix you provide to get all of the json files that were created.
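As an example of that prefix search, a short sketch with the google-cloud-storage client (the bucket and the "test" prefix match the gs://redaction-vision/test example above; adjust to your own output config) could collect all of the output files:

    # List every output JSON the operation wrote under the destination prefix
    # and load its "responses" array.
    import json
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs("redaction-vision", prefix="test"):
        if blob.name.endswith(".json"):
            result = json.loads(blob.download_as_string())
            # each output file covers a batch of pages from the PDF
            print(blob.name, len(result.get("responses", [])))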

Elasticsearch Update Doc String Replacement

I have some documents in my Elasticsearch. I want to update my document contents using a string regexp.
For example, I would like to replace all http words with https words; is it possible?
Thank You
This should get you off to a start. Check out the "Update by Query" API here. The API allows you to include the update script and search query in the same request body.
Regarding your case, an example might look like this...
POST addresses/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.data.url = ctx._source.data.url.replace('http', 'https')"
  },
  "query": {
    "query_string": {
      "query": "http://*",
      "analyze_wildcard": true
    }
  }
}
Pretty self-explanatory, but script is where we do the update, and query returns the documents to update.
Painless supports regex, so you're in luck; look here for some examples, and update the inline value accordingly.
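If you'd rather run this from code than the Dev Tools console, a rough equivalent with the Python Elasticsearch client would be the sketch below; the "addresses" index and data.url field come from the example above and are assumptions about your mapping.

    # Sketch of the same update-by-query issued from Python.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    body = {
        "script": {
            "lang": "painless",
            "inline": "ctx._source.data.url = ctx._source.data.url.replace('http', 'https')"
        },
        "query": {
            "query_string": {
                "query": "http://*",
                "analyze_wildcard": True
            }
        }
    }

    resp = es.update_by_query(index="addresses", body=body)
    print(resp)   # includes how many documents were updated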

Returning record(s) after store pushPayload call

Is there a better way to return the record(s) after DS.Store#pushPayload is called? This is what I'm doing...
var payload = { id: 1, title: "Example" }
store.pushPayload('post', payload);
return store.getById('post', payload.id);
But, with regular DS.Store#push you get the inserted record returned. The only difference between the two, from what I can tell, is that DS.Store#pushPayload serializes the payload data with the correct serializers.
DS.Store#pushPayload is able to take an array of items, not just one, and may contain side-loaded data. It processes a full payload and expects root keys in the payload:
{
  "posts": [{
    "id": 1,
    "title": "title",
    "comments": [1]
  }],
  "comments": [
    //.. and so on ...
  ]
}
DS.Store#push expects a single record which has been normalized and contains no side loaded data (notice there is no root key):
{
  "id": 1,
  "title": "title",
  "comments": [1]
}
For this reason, it makes sense for push to return the record, but for pushPayload to return nothing.
When you use pushPayload, a second lookup of store.find('post', 1) (or store.getById('post', 1)) is the way to go; I don't believe there is a better way.
As of this PR pushPayload can now return an array of all the records pushed into the store, once the 'ds-pushpayload-return' feature flag has been enabled.
At the moment, this feature isn't available in a standard or beta release-- you'll have to use
"ember-data": "emberjs/data#master",
(i.e. Canary) in your package.json in order to access it. I'm not sure when the feature will be generally available.