I am trying to bulk index my Django model into Elasticsearch 6. My plan is to run this as a cron job once a day to update the index.
import requests
from django.core.serializers import serialize

data = serialize('json', CapitalSheet.objects.all())
data += "\n"
r = requests.post("http://127.0.0.1:9200/capitalsheet/_bulk", json=data)
print(r.content)
I am getting this error:
b'{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"The bulk request must be terminated by a newline [\\n]"}],"type":"illegal_argument_exception","reason":"The bulk request must be terminated by a newline [\\n]"},"status":400}'
If you can suggest something better, I would be glad.
I would recommend looking at the Python library provided by Elasticsearch. It will make it easier to do the bulk inserts. Here is a link to the docs:
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers
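For instance, the cron job could look roughly like the sketch below using the bulk helper. This is only a sketch: the index name, doc type, app path, and use of Django's "python" serializer are assumptions you would adapt to your model.

from django.core import serializers
from elasticsearch import Elasticsearch, helpers
from myapp.models import CapitalSheet  # assumed app path

# Assumes a local ES 6.x node; index/type names are made up for this example.
es = Elasticsearch(["http://127.0.0.1:9200"])

def index_capital_sheets():
    # Serialize the queryset to plain dicts instead of a JSON string.
    rows = serializers.serialize("python", CapitalSheet.objects.all())
    actions = (
        {
            "_index": "capitalsheet",
            "_type": "capitalsheet",  # still required on ES 6
            "_id": row["pk"],
            "_source": row["fields"],
        }
        for row in rows
    )
    # helpers.bulk builds and sends the action/document pairs for you.
    helpers.bulk(es, actions)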
If you want to do it manually though, the ES bulk API actually requires two lines for every record you want to insert: the first line specifies the target index and the type of operation, and the second line is the record to insert. For example, your request body would look something like this:
{ "index" : { "_index" : "test", "_type" : "type1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_type" : "type1" } }
{ "field1" : "value2" }
The ES documentation explains this well here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
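If you do build the body yourself, a rough sketch with requests might look like the following; the index/type names, the app path, and the use of default=str for non-JSON-serializable field values are assumptions.

import json
import requests
from django.core import serializers
from myapp.models import CapitalSheet  # assumed app path

def bulk_index_manually():
    rows = serializers.serialize("python", CapitalSheet.objects.all())
    lines = []
    for row in rows:
        # Action metadata line first, then the document itself.
        lines.append(json.dumps({"index": {"_index": "capitalsheet",
                                           "_type": "capitalsheet",
                                           "_id": row["pk"]}}))
        lines.append(json.dumps(row["fields"], default=str))
    # The body is newline-delimited JSON and must end with a newline,
    # which is exactly what the error message is complaining about.
    body = "\n".join(lines) + "\n"
    r = requests.post(
        "http://127.0.0.1:9200/capitalsheet/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
    )
    r.raise_for_status()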
Let's say I have a SaaS based on a Django backend that processes users' data and writes everything to Elasticsearch. Now I would like to give users the ability to search and request their data stored in ES using all the search requests available in ES. Obviously, a user should only have access to his own data, not other users' data. I am aware that this can be done in a lot of different ways, but I wonder what the safe and best solution is. At this point I store everything in one index and type, in the way shown below, but I can change this.
"_index": "example_index",
"_type": "example_type",
"_id": "H2s-lGsdshEzmewdKtL",
"_score": 1,
"_source": {
"user_id": 1,
"field1": "example1",
"field2": "example2",
"field3": "example3"
}
I think the best way would be to associate every document with the user_id. The user would send, for example, a GET request with a body and an Authorization header containing a token. I would use the token to extract the user's id, for example like this:
key = request.META.get('HTTP_AUTHORIZATION').split()[1]
user_id = Token.objects.get(key=key).user_id
After this I would forward the request to ES, and only data that meets the requirements and belongs to this user would be returned. Of course, I could do this as shown above, where I also add a user_id field. For example, I could use post_filter in this way:
To every request I would add something like this:
,
"post_filter": {
    "match": {
        "user_id": 1
    }
}
For example, the user sends a GET request with this body:
{
    "query": {
        "regexp": {
            "tag": ".*example.*"
        }
    }
}
and in my backend I change it and forward the request to ES with this body:
{
    "query": {
        "regexp": {
            "tag": ".*example.*"
        }
    },
    "post_filter": {
        "match": {
            "user_id": 1
        }
    }
}
but it doesn't seem to me that including this field in _source is a good idea. I am almost sure that it can be solved in a more optimal way than post_filtering. I see a lot of information about authorization in ES, however I can't find how to associate a document with a user_id and then search only that user's documents without post_filtering. Any ideas?
UPDATE
My current solution looks the way shown below; however, as I mentioned, I believe it is not the optimal way. If anyone has an idea how I can solve this in the way described above, I would be grateful for help.
I send, for example:
{
    "query": {
        "regexp": {
            "tag": ".*test.*"
        }
    }
}
In the Django backend I just do:
key = request.META.get('HTTP_AUTHORIZATION').split()[1]
user_id = Token.objects.get(key=key).user_id
body = json.loads(request.body)
body['post_filter'] = {"match": {"user_id": user_id}}
res = es.search(index="pictures", doc_type="picture", body=body)
output = []
for hit in res['hits']['hits']:
    output.append(hit["_source"])
return Response(
    {'output': output},
    status=status.HTTP_200_OK)
In Elasticsearch 7.1, basic security is now included in the free version of Elasticsearch. Thanks to that, you can control your users' access per index.
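As a rough sketch of that idea (the names, credentials, and one-index-per-user layout are assumptions, not something from the question), you could create a role per user that can only read that user's index and attach it to a native-realm user:

from elasticsearch import Elasticsearch

# Assumes ES >= 7.1 with security enabled and an admin user; all names are made up.
es = Elasticsearch(["http://127.0.0.1:9200"], http_auth=("elastic", "changeme"))

# A role that can only read the index holding user 1's documents.
es.security.put_role("user_1_role", body={
    "indices": [
        {"names": ["example_index_user_1"], "privileges": ["read"]}
    ]
})

# A native-realm user bound to that role.
es.security.put_user("user_1", body={
    "password": "a-strong-password",
    "roles": ["user_1_role"],
})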
I have a collection named web_events, and its data is saved in the structure shown below.
{
    "subscriber_id" : "amit",
    "event" : "login"
},
{
    "subscriber_id" : "manish",
    "event" : "login"
},
{
    "subscriber_id" : "manish",
    "event" : "page_view"
},
{
    "subscriber_id" : "manish",
    "event" : "add_to_cart"
}
I want to find out how many users performed the (login) event, and how many users performed both (login, add_to_cart) events.
The output would be something like: (login) event count: 2 users, and (login, add_to_cart) event count: 1 user, in one query.
I am using two queries for this.
For login, the query below:
db.web_events.aggregate([
    { $group: { _id: { subscriber_id: "$subscriber_id" }, myfield: { $push: { $concat: ["$event"] } } } },
    { $project: { "results": { $reduce: { input: "$myfield", initialValue: '', in: { $concat: ["$$value", "$$this"] } } } } },
    { $match: { results: { $regex: /.*login.*/ } } }
]);
and for login and add_to_cart:
db.web_events.aggregate([
    { $group: { _id: { subscriber_id: "$subscriber_id" }, myfield: { $push: { $concat: ["$event"] } } } },
    { $project: { "results": { $reduce: { input: "$myfield", initialValue: '', in: { $concat: ["$$value", "$$this"] } } } } },
    { $match: { results: { $regex: /.*login.*add_to_cart.*/ } } }
]);
I want to run both aggregations in a single aggregation query.
Any help is appreciated.
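One possible way to combine both counts into a single aggregation is a $facet stage over the grouped events. The sketch below uses pymongo; the connection string, database name, and the use of $addToSet (instead of concatenating strings and regex-matching them) are my assumptions, not something from the original queries.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["mydb"]                                # assumed database name

pipeline = [
    # Collect the distinct events performed by each subscriber.
    {"$group": {"_id": "$subscriber_id", "events": {"$addToSet": "$event"}}},
    # Run both counts over the grouped result in a single pass.
    {"$facet": {
        "login": [
            {"$match": {"events": "login"}},
            {"$count": "users"},
        ],
        "login_and_add_to_cart": [
            {"$match": {"events": {"$all": ["login", "add_to_cart"]}}},
            {"$count": "users"},
        ],
    }},
]

result = list(db.web_events.aggregate(pipeline))
print(result)
# e.g. [{'login': [{'users': 2}], 'login_and_add_to_cart': [{'users': 1}]}]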
I am having issues finding good sources for / figuring out how to correctly add server-side validation to my AppSync GraphQL mutations.
In essence, I used the AWS console to define my AppSync schema, so the DynamoDB tables were created for me, plus some basic resolvers were set up for the data.
Now I need to achieve the following:
I have a player who has an inventory and gold.
The player calls a purchaseItem mutation with an item_id.
Once this mutation is called, I need to perform some checks in the resolver, i.e. check that the item_id exists in the 'Items' table of the associated DynamoDB, check that the player has enough gold (again, in the 'Players' table), and if so, write to the Players table, adding the item to their inventory and storing the new, subtracted gold amount.
I believe the most efficient way to achieve this, with the lowest cost and latency, is to use the "Apache Velocity" templating language (VTL) for AppSync?
It would be great to see example of this showing how to Query / Write to DynamoDB, handle errors and resolve the mutation correctly.
For writing to DynamoDB with VTL, use the following tutorial; you can start with the PutItem template. My request mapping template looks like this:
{
"version" : "2017-02-28",
"operation" : "PutItem",
"key" : {
"noteId" : { "S" : "${context.arguments.noteId}" },
"userId" : { "S" : "${context.identity.sub}" }
},
"attributeValues" : {
"title" : { "S" : "${context.arguments.title}" },
"content": { "S" : "${context.arguments.content}" }
}
}
For query:
{ "version" : "2017-02-28",
"operation" : "Query",
"query" : {
## Provide a query expression. **
"expression": "userId = :userId",
"expressionValues" : {
":userId" : {
"S" : "${context.identity.sub}"
}
}
},
## Add 'limit' and 'nextToken' arguments to this field in your schema to implement pagination. **
"limit": #if(${context.arguments.limit}) ${context.arguments.limit} #else 20 #end,
"nextToken": #if(${context.arguments.nextToken}) "${context.arguments.nextToken}" #else null #end
}
This is based on the Paginated Query template.
What you want to look at is Pipeline Resolvers:
https://docs.aws.amazon.com/appsync/latest/devguide/pipeline-resolvers.html
Yes, this requires VTL (Velocity Template Language).
It allows you to perform reads, writes, validation, and anything else you'd like using VTL. What you basically do is chain the output of one template into the input of the next and perform the required processing.
Here's a Medium post showing you how to do it:
https://medium.com/@dabit3/intro-to-aws-appsync-pipeline-functions-3df87ceddac1
In other words, what you can do is:
Have one template that queries the database, pipe the result into another template that validates it, and then insert the item if validation succeeds or fail the request if it doesn't.
This might be a stupid question, but I really cannot find a way to do that.
So, I have DynamoDB tables and a schema in the AppSync API. In one table, each row has a field whose value is a list. How can I append multiple items to this list without replacing the existing items? How should I write the resolver for that mutation?
Here is the screenshot of my table:
And you can see there are multiple programs in the list.
How can I just append two more programs?
Here is a new screenshot of my resolver:
screenshot of resolver
I want to add an existence check to the UpdateItem operation, but my current code does not work. The logic I want is to use the "contains" method to see whether the "toBeAddedProgramId" already exists. The question is how to extract the current program id list from the User table, and how to make the program id list a "list" type (since the contains method only takes a String Set or a String).
I hope this question makes sense. Thanks so much guys.
Best,
Harrison
To append items to a list, you should use the DynamoDB UpdateItem operation.
Here is an example if you're using DynamoDB directly
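For instance, calling DynamoDB directly with boto3, it might look roughly like the sketch below; the table name, key, and attribute names are assumptions for illustration.

import boto3

# Assumes a table named "Users" with a string key "id" and a list attribute "programs".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

table.update_item(
    Key={"id": "user-123"},
    # list_append concatenates the existing list with the new values
    # instead of overwriting the attribute.
    UpdateExpression="SET #progs = list_append(#progs, :vals)",
    ExpressionAttributeNames={"#progs": "programs"},
    ExpressionAttributeValues={":vals": [{"id": "49f2c...."}, {"id": "931db...."}]},
)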
In AWS AppSync, you can use the DynamoDB data source and specify the DynamoDB UpdateItem operation in your request mapping template.
Your UpdateItem request template could look like the following (modify it to serve your needs):
{
"version" : "2017-02-28",
"operation" : "UpdateItem",
"key" : {
"id" : { "S" : "${context.arguments.id}" }
},
"update" : {
"expression" : "SET #progs = list_append(#progs, :vals)",
"expressionNames": {
"#progs" : "programs"
},
"expressionValues": {
":vals" : {
"L": [
{ "M" : { "id": { "S": "49f2c...." }}},
{ "M" : { "id": { "S": "931db...." }}}
]
}
}
}
}
We have a tutorial here that goes into more detail if you are interested in learning more.
What is the default behavior for a regexp query against a non-analyzed field? Also, is that the same answer when dealing with .raw fields?
After everything I've read, I understand the following:
1. RegExp queries will work on analyzed and non-analyzed fields.
2. A regexp query should work across the entire phrase rather than just matching on a single token in non-analyzed fields.
Here's the problem though: I cannot actually get this to work. I've tried it across multiple fields.
The setup I'm working with is a stock ELK install, and I'm dumping pfSense and Snort logs into it with a basic parser. I'm currently on Kibana 4.3 and ES 2.1.
I did a query to look at the mapping for one of the fields and it indicates it is not_analyzed, yet the regex does not work across the entire field.
"description": {
"type": "string",
"norms": {
"enabled": false
},
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
What am I missing here?
If a field is not analyzed, the field is indexed as a single token.
It's the same answer when dealing with .raw fields, at least in my experience.
You can use a Groovy script:
matcher = (doc['fields.raw'].value =~ /${pattern}/)
if (matcher.matches()) {
    matcher.group(matchname)
}
You can pass pattern and matchname in params.
What do you mean by "tried it across multiple fields"? If your situation is more complex, maybe you could write a native Java plugin.
UPDATE
{
"script_fields" : {
"regexp_field" : {
"script" : "matcher = (doc[fieldname].value =~ /${pattern}/ );if(matcher.matches()) {matcher.group(matchname)}",
"params" : {
"pattern" : "your pattern",
"matchname" : "your match",
"fieldname" : "fields.raw"
}
}
}
}
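For completeness, here is a rough sketch of how that script_fields body might be sent from Python with elasticsearch-py; the index name is an assumption, and inline Groovy scripting has to be enabled on the cluster for it to run.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://127.0.0.1:9200"])

body = {
    "query": {"match_all": {}},
    "script_fields": {
        "regexp_field": {
            "script": ("matcher = (doc[fieldname].value =~ /${pattern}/); "
                       "if (matcher.matches()) { matcher.group(matchname) }"),
            "params": {
                "pattern": "your pattern",
                "matchname": "your match",
                "fieldname": "fields.raw",
            },
        }
    },
}

res = es.search(index="your_index", body=body)
for hit in res["hits"]["hits"]:
    # Script fields come back under "fields", not "_source".
    print(hit.get("fields", {}).get("regexp_field"))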