I have an index that is search-heavy; RPM varies from 15-20k. The issue is that for the first few days the response time of a search query is around 15 ms, but it then starts increasing gradually and touches ~70 ms. Some of the requests start queuing (per the search thread pool graph in the AWS console), although there are no rejections. Queuing increases the latency of the search requests.
I learned that queuing happens when there is pressure on a resource, but I think I have sufficient CPU and memory; please look at the config below.
I enabled slow query logs but didn't find any anomaly. Even though the average response time is around 16 ms, I do see a few queries going above 50 ms, yet there is nothing obviously wrong with those queries. There are around 8k searchable documents.
I need your suggestions on how to improve performance here. The document mapping, search query, and ES config are given below. Is there any issue in the mapping or the query?
Mapping:
{
    "data": {
        "mappings": {
            "_doc": {
                "properties": {
                    "a": {
                        "type": "keyword"
                    },
                    "b": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}
Search query:
{
    "size": 5000,
    "query": {
        "bool": {
            "filter": [
                {
                    "terms": {
                        "a": ["all", "abc"],
                        "boost": 1
                    }
                },
                {
                    "terms": {
                        "b": ["all", 123],
                        "boost": 1
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "stored_fields": []
}
I'm using keyword in the mapping and terms in the search query because I want to match exact values. Boost and adjust_pure_negative are added automatically; from what I have read, they should not affect performance.
Index settings:
{
    "data": {
        "settings": {
            "index": {
                "number_of_shards": "1",
                "provided_name": "data",
                "creation_date": "12345678154072",
                "number_of_replicas": "7",
                "uuid": "3asd233Q9KkE-2ndu344",
                "version": {
                    "created": "10499"
                }
            }
        }
    }
}
ES config:
Master node instance type: m5.large.search
Master nodes: 3
Data node instance type: m5.2xlarge.search
Data nodes: 8 (8 vCPU, 32 GB memory each)
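For reference, the slow query logs mentioned above are enabled through per-index settings roughly like the following (the thresholds shown here are illustrative, not necessarily the ones actually used):
PUT /data/_settings
{
    "index.search.slowlog.threshold.query.warn": "50ms",
    "index.search.slowlog.threshold.query.info": "20ms",
    "index.search.slowlog.threshold.fetch.warn": "50ms"
}
On the AWS-managed service, search slow log publishing to CloudWatch also has to be enabled on the domain for these entries to appear anywhere.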
I want to fetch all the transactions from a block using a single RPC call.
I know we can fetch them by chunk ID, but in that case we have to make a call for each chunk.
Unfortunately, it's impossible to do in a single call. However, it is possible in N+1 calls, where N is the number of shards.
Request a block (by height, hash, or finality, depending on what you need; let's assume you want the latest):
https://docs.near.org/docs/api/rpc/block-chunk#block-details
httpie example
$ http post https://rpc.testnet.near.org/ id=dontcare jsonrpc=2.0 method=block params:='{"finality": "final"}'
Collect the chunk hashes from the response. You can find them in the response JSON:
{
    "id": "dontcare",
    "jsonrpc": "2.0",
    "result": {
        "author": "node1",
        "chunks": [
            {
                ...
                "chunk_hash": "6ZJzhK4As3UGkyH2kxHmRFYoV7hiyXareMo1qzyxS624",
Or extract them with jq:
$ http post https://rpc.testnet.near.org/ id=dontcare jsonrpc=2.0 method=block params:='{"finality": "final"}' | jq .result.chunks[].chunk_hash
"GchAtNdcc16bKvnTa7RA3xkYAt2eMg22Qkmc9FfFTrK2"
"8P6u7zwsLvYMH5vbV4hnaCaL7FKuPnfJU4yNJY52WCd2"
"8p1XaC4BzCBVUhfYWyf6nBXF4m9uzJVEJmHCYnBMLuUn"
"7TkVTzCGMyxNnumX6ZsES5v3Wa3UnBZZAavF9zjMzDKC"
You then need to perform a query for every chunk, like:
$ http post https://rpc.testnet.near.org/ id=dontcare jsonrpc=2.0 method=chunk params:='{"chunk_id": "GchAtNdcc16bKvnTa7RA3xkYAt2eMg22Qkmc9FfFTrK2"}' | jq .result.transactions[]
{
    "signer_id": "art.artcoin.testnet",
    "public_key": "ed25519:4o6mz55p1mNmfwg5EeTDXdtYFxQev672eU5wy5RjRCbw",
    "nonce": 570906,
    "receiver_id": "art.artcoin.testnet",
    "actions": [
        {
            "FunctionCall": {
                "method_name": "submit_asset_price",
                "args": "eyJhc3NldCI6ImFCVEMiLCJwcmljZSI6IjM4MzQyOTEyMzgzNTEifQ==",
                "gas": 30000000000000,
                "deposit": "0"
            }
        }
    ],
    "signature": "ed25519:2E6Bs8U1yRtAtYuzNSB1PUXeAywrTbXMpcM8Z8w6iSXNEtLRDd1aXDCGrv3fBTn1QZC7MoistoEsD5FzGSTJschi",
    "hash": "BYbqKJq3c9qW77wspsmQG3KTKAAfZcSeoTLWXhk6KKuz"
}
And this way you can collect all the transactions from the block.
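For illustration, the two calls above can be chained into a small shell loop (same endpoint and httpie/jq usage as before; the loop itself is just a sketch):
# Fetch the latest block, then pull the transactions of every chunk in it.
for chunk_hash in $(http post https://rpc.testnet.near.org/ id=dontcare jsonrpc=2.0 method=block params:='{"finality": "final"}' | jq -r '.result.chunks[].chunk_hash'); do
    # one chunk call per shard; together they cover every transaction in the block
    http post https://rpc.testnet.near.org/ id=dontcare jsonrpc=2.0 method=chunk params:="{\"chunk_id\": \"$chunk_hash\"}" | jq '.result.transactions[]'
done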
Alternatively, as #amgando said, you can query the Indexer for Explorer database using the public credentials:
https://github.com/near/near-indexer-for-explorer#shared-public-access
But please be aware that the number of connections to the database is limited (for resource reasons), and it's often not easy to get connected because a lot of people around the globe are using it.
Since NEAR is sharded, a "chunk" is what we call the piece of a block that was handled by a single shard.
To build up the entire block you can either:
construct the block from its chunk parts, or
use an indexer to capture what you need in real time.
Details:
Apache CouchDB v3.1.1
About 5 GB of Twitter data has been dumped into partitions.
The map/reduce function I have written:
{
    "_id": "_design/Info",
    "_rev": "13-c943aaf3b77b970f4e787be600dd240e",
    "views": {
        "trial-view": {
            "map": "function (doc) {\n emit(doc.account_name, 1);\n}",
            "reduce": "_count"
        }
    },
    "language": "javascript",
    "options": {
        "partitioned": true
    }
}
When I try the following request in Postman:
http://<server_ip>:5984/mydb/_partition/partition1/_design/Info/_view/trial-view?key="BT"&group=true
I get the following error:
{
    "error": "timeout",
    "reason": "The request could not be processed in a reasonable amount of time."
}
How can I apply map/reduce to such a large dataset?
So, I thought I'd answer my own question after realizing my mistake. The answer is simple: it just needed more time, because building the view index takes a long time. You can watch the index being built via the database's task metadata.
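For reference, CouchDB's standard _active_tasks endpoint lists the running indexer jobs together with a progress percentage, so you can watch the view being built (same host placeholder as above; admin credentials may be required):
$ http get http://<server_ip>:5984/_active_tasks
Entries with "type": "indexer" show the design document currently being built and how far along it is.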
I have an AppSync pipeline resolver. The first function queries an Elasticsearch index for the DynamoDB keys. The second function queries DynamoDB using those keys. This was all working well until I ran into AppSync's 1 MB limit. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I do need.
I tried adding AttributesToGet and ProjectionExpression (from here), but both gave errors like:
{
    "data": {
        "getItems": null
    },
    "errors": [
        {
            "path": [
                "getItems"
            ],
            "data": null,
            "errorType": "MappingTemplate",
            "errorInfo": null,
            "locations": [
                {
                    "line": 2,
                    "column": 3,
                    "sourceName": null
                }
            ],
            "message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
        }
    ]
}
My DynamoDB function's request mapping template looks like this (it returns results as long as the data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
#set($map = {})
$util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
$util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
$util.qr($ids.add($map))
#end
{
    "version": "2018-05-29",
    "operation": "BatchGetItem",
    "tables": {
        "dev-table-name": {
            "keys": $util.toJson($ids),
            "consistentRead": false
        }
    }
}
I contacted the AWS people, who confirmed that ProjectionExpression is not currently supported and that it will be a while before they get to it.
Instead, I created a Lambda to pull the data from DynamoDB.
To limit the results from DynamoDB, I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify which attributes to pull from DynamoDB. I needed multiple results in a fixed order, so I used BatchGetItem and then merged the results with the original list of IDs using LINQ (which put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order the way the AppSync version does).
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITted for Linux, which brought the cold start down from ~1.8 seconds to ~1 second (with 1,024 MB of memory for the Lambda).
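For illustration, the Lambda function's request mapping template for that approach could look roughly like this (the payload field names and the reuse of the id/ouId keys are assumptions, not the exact template used):
#set($keys = [])
#foreach($pResult in ${ctx.prev.result})
    #set($map = {})
    $util.qr($map.put("id", $pResult.id))
    $util.qr($map.put("ouId", $pResult.ouId))
    $util.qr($keys.add($map))
#end
{
    "version": "2017-02-28",
    "operation": "Invoke",
    "payload": {
        ## keys from the Elasticsearch function plus the fields the client actually requested
        "keys": $util.toJson($keys),
        "fields": $util.toJson($ctx.info.selectionSetList)
    }
}
The Lambda can then build a ProjectionExpression from the "fields" list before calling BatchGetItem.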
AppSync doesn't support projection expressions, but you can explicitly define which fields to return in the response mapping template instead of returning the entire result set:
{
    "id": "$ctx.result.get('id')",
    "name": "$ctx.result.get('name')",
    ...
}
I have a Lambda function which does a transaction in DynamoDB, similar to this:
try {
    const reservationId = genId();
    await transactionFn();
    return {
        statusCode: 200,
        body: JSON.stringify({id: reservationId})
    };

    async function transactionFn() {
        try {
            await docClient.transactWrite({
                TransactItems: [
                    {
                        Put: {
                            TableName: ReservationTable,
                            Item: {
                                reservationId,
                                userId,
                                retryCount: Number(retryCount),
                            }
                        }
                    },
                    {
                        Update: {
                            TableName: EventDetailsTable,
                            Key: {eventId},
                            ConditionExpression: 'available >= :minValue',
                            UpdateExpression: `set available = available - :val, attendees = attendees + :val, lastUpdatedDate = :updatedAt`,
                            ExpressionAttributeValues: {
                                ":val": 1,
                                ":updatedAt": currentTime,
                                ":minValue": 1
                            }
                        }
                    }
                ]
            }).promise();
            return true;
        } catch (e) {
            const transactionConflictError = e.message.search("TransactionConflict") !== -1;
            // const throttlingException = e.code === 'ThrottlingException';
            console.log("transactionFn:transactionConflictError:", transactionConflictError);
            if (transactionConflictError) {
                // conflict: retry the whole transaction recursively
                retryCount += 1;
                await transactionFn();
                return;
            }
            // if(throttlingException){
            //
            // }
            console.log("transactionFn:e.code:", JSON.stringify(e));
            throw e;
        }
    }
} catch (e) {
    // anything transactionFn could not recover from ends up here
    console.log("handler error:", JSON.stringify(e));
    throw e;
}
It just updates two tables per API call. If it encounters a transaction conflict error, it simply retries the transaction by recursively calling the function.
The EventDetails table was getting too many updates (I checked with AWS Contributor Insights), so I raised its provisioned capacity.
The Reservation table uses on-demand capacity.
When I load test this API with 400 (or more) users using JMeter (master/slave configuration), I get throttling errors for some API calls, and some calls take more than 20 seconds to respond.
When I checked X-Ray for this API, I found that DynamoDB is taking too much time on this transaction for the slower calls.
Even with generous fixed provisioning (I tried on-demand scaling too), I am getting a throttling exception for some API calls:
ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded.
Consider increasing your provisioning level with the UpdateTable API.
UPDATE
And one more thing: when I do the load testing, I always use the same eventId, which means every API request updates the same row. I found this article, which says that a single partition can only take up to 1,000 WCU. Since I am always updating the same row in the EventDetails table during load testing, could that be what is causing this issue?
I had this exact error, and switching from On-Demand to Provisioned under Read/write capacity mode helped me. Try changing that; if it doesn't help, we'll go from there.
From the link you cite in your update (also described in an AWS help article here), it sounds like the issue is that all of your load testers are writing to the same item, which lives in the same partition and is therefore subject to the hard limit of 1,000 WCU.
Have you tried repeating the experiment with the load testers writing to different partitions?
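For example, the usual workaround for a hot item like this is write sharding: split the available count across several copies of the event item and update one of them at random per request. A minimal sketch, assuming hypothetical pre-created rows eventId#0 through eventId#9 that together hold the available seats:
// Write-sharding sketch: each request updates one of N_SHARDS copies of the event row,
// so no single partition has to absorb all of the write traffic.
const N_SHARDS = 10;
const shard = Math.floor(Math.random() * N_SHARDS);

await docClient.update({
    TableName: EventDetailsTable,
    Key: {eventId: `${eventId}#${shard}`},
    ConditionExpression: 'available >= :minValue',
    UpdateExpression: 'set available = available - :val, attendees = attendees + :val, lastUpdatedDate = :updatedAt',
    ExpressionAttributeValues: {
        ':val': 1,
        ':updatedAt': currentTime,
        ':minValue': 1
    }
}).promise();
Reads then have to sum available across the shard rows, and a request whose shard fails the condition can retry on another shard, so this only pays off if the per-item write rate really is the bottleneck.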
I have a large JSON file and I would like to know whether it is better to upload this data directly to DynamoDB using boto3, or to upload it to S3 first and then load it into DynamoDB using Data Pipeline.
Here are a few sample records:
Sample 1:
{
    "updated": {
        "n": "20181226"
    },
    "periodo": {
        "n": "20180823"
    },
    "tipos": {
        "m": {
            "Disponible": {
                "m": {
                    "total": {
                        "n": "200"
                    },
                    "Saldos de Cuentas de Ahorro": {
                        "n": "300"
                    }
                }
            }
        }
    },
    "mediana_disponible": {
        "n": "588"
    },
    "mediana_ingreso": {
        "n": "658"
    },
    "mediana_egreso": {
        "n": "200"
    },
    "documento": {
        "s": "2-2"
    }
}
This is a single record; in total there are about 68 million records and the file size is 70 GB.
Sample 2:
{
    "updated": {
        "n": "20190121"
    },
    "zonas": {
        "s": "123"
    },
    "tipo_doc": {
        "n": "3123"
    },
    "cods_sai": {
        "s": "3,234234,234234"
    },
    "cods_cb": {
        "s": "234234,5435,45"
    },
    "cods_atm": {
        "s": "54,45,345;345,5345,435"
    },
    "num_doc": {
        "n": "345"
    },
    "cods_mf": {
        "s": "NNN"
    },
    "cods_pac": {
        "s": "NNN"
    }
}
This is a single record; in total there are about 7 million records and the file size is 10 GB.
Thanks in advance
For your situation I would use AWS Data Pipeline to import your JSON data files into DynamoDB from S3. There are many examples provided by AWS and by others on the Internet.
Your use case is, for me, right on the border between simply writing a Python import script and deploying Data Pipeline. Since your data is clean, deploying a pipeline will be very easy.
I would definitely copy your data to S3 first and then process it from S3. The primary reason is the unreliability of the public Internet for this much data.
If this task will be repeated over time, then I would definitely use AWS Data Pipeline.
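If the simple Python import script route is taken instead, a minimal sketch using boto3's batch writer could look like this (the table name and file path are placeholders, and it assumes the records have already been converted from the type-annotated form shown above into plain JSON objects, one per line):
import json

import boto3

TABLE_NAME = "my-table"       # placeholder
FILE_PATH = "records.jsonl"   # placeholder: one plain JSON record per line

table = boto3.resource("dynamodb").Table(TABLE_NAME)

# batch_writer() groups puts into BatchWriteItem calls of up to 25 items
# and automatically retries any unprocessed items.
with table.batch_writer() as batch:
    with open(FILE_PATH) as f:
        for line in f:
            batch.put_item(Item=json.loads(line))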