The input being sent from the previous state is in this form:
[
{
"bucketName": "test-heimdall-employee-data",
"executionId": "ca9f1e5e-4d3a-4237-8a10-8860bb9d58be_1586771571368",
"feedType": "lenel_badge",
"chunkFileKeys": "chunkFileLocation/lenel_badge/68ac7180-69a0-401a-b30c-8f809acf3a1c_1586771581154.csv",
"sanityPassFileKeys": "chunkFileLocation/lenel_badge/0098b86b-fe3c-45ca-a067-4d4a826ee2c1_1586771588882.json"
},
{
"bucketName": "test-heimdall-employee-data",
"executionId": "ca9f1e5e-4d3a-4237-8a10-8860bb9d58be_1586771571368",
"feedType": "lenel_badge",
"errorFilePath": "error/lenel_badge/2a899128-339d-4262-bb2f-a70cc60e5d4e/1586771589234_2e06e043-ad63-4217-9b53-66405ac9a0fc_1586771581493.csv",
"chunkFileKeys": "chunkFileLocation/lenel_badge/2e06e043-ad63-4217-9b53-66405ac9a0fc_1586771581493.csv",
"sanityPassFileKeys": "chunkFileLocation/lenel_badge/f6957aa7-6e22-496a-a6b8-4964da92cb73_1586771588793.json"
},
{
"bucketName": "test-heimdall-employee-data",
"executionId": "ca9f1e5e-4d3a-4237-8a10-8860bb9d58be_1586771571368",
"feedType": "lenel_badge",
"errorFilePath": "error/lenel_badge/8050eb12-c5e6-4ae9-8c4b-0ac539f5c189/1586771589293_1bb32e6c-03fc-4679-9c2f-5a4bca46c8aa_1586771581569.csv",
"chunkFileKeys": "chunkFileLocation/lenel_badge/1bb32e6c-03fc-4679-9c2f-5a4bca46c8aa_1586771581569.csv",
"sanityPassFileKeys": "chunkFileLocation/lenel_badge/48960b7c-04e0-4cce-a77a-44d8834289df_1586771588870.json"
}
]
state machine workflow design:
How do I extract the "feedType" value from the above input and transition to the next state, while also passing the entire input along to the next state?
Thanks
You can access the input JSON your state machine was started with using: $$.Execution.Input.todo. Other than that, you can't directly access a previous state's output from one step to the next.
As an example, let's say you have A -> B -> C.
Say you went through A, which added a new field a: 1, and then through B, which returns b: 2; when you get to C you will only have b: 2. But if B also returns a: 1, you would then have {a: 1, b: 2} at C. Passing data along like this is typically how you carry state from a step a couple of steps prior.
There are other things people do, such as storing data in an S3 bucket and accessing that bucket from different stages. You can also query the step function itself, but that can get messy.
Other hacks include adding a Pass step in a Parallel block, but these hacks are not good; the correct way is to pass the data along between your steps, or hopefully have what you need in your execution input.
Looking at your previous state input, it looks like feedType is a constant. Assuming the key to your entire input is "input", so that it's a dictionary like {"input": [{...}, {...}]}, you can access the value of feedType with simply $.input[0].feedType.
A Choice state by default passes its entire input through to the next state. So whatever state it transitions to will receive the same input that was passed to the Choice state.
As a proof of concept, you can build a small Step Function in which a Hello state is a Choice state and the other two states are simple Pass states: inspecting the Choice state's input and output will show they are identical.
Hope it helps.
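As a sketch (state and branch names here are illustrative, not from the question), a Choice state that branches on the feedType of the first element and forwards its whole input unchanged could look like:

```json
"CheckFeedType": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.input[0].feedType",
      "StringEquals": "lenel_badge",
      "Next": "ProcessLenelBadge"
    }
  ],
  "Default": "UnknownFeedType"
}
```

Because Choice states pass their input through unchanged, whichever state is chosen receives the full original array.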
Related
Say I have 3 states, A -> B -> C. Let's assume the input to A includes a field called names, which is a list whose elements each contain two fields, firstName and lastName. State B will process the input to A and return a response called newLastName. If I want to override every element in names such that names[i].lastName = newLastName before passing this input to state C, is there built-in syntax to achieve that? Thanks.
You control the events passed to the next task in a Step Function with three definition attributes: ResultPath and OutputPath on leaving one task, and InputPath on entering the next one.
You have to first understand how the event to the next task is crafted by a State Machine, and each of the 3 above parameters changes it.
Start with ResultPath. This is the key in the event under which the output of your Lambda is placed, so ResultPath="$.my_path" results in a JSON object with a top-level key of my_path whose value is whatever the Lambda returned.
If this is the only attribute, it is tacked onto whatever the input was. So if your Input event was a json object with keys original_key1 and some_other_key your output with just the above result path would be:
{
"original_key_1": some value,
"some_other_key": some other value,
"my_path": the output of your lambda
}
Now if you add OutputPath, this cuts off everything OTHER than the path (AFTER adding the result path!) in the next output.
If you added OutputPath="$.my_path" you would end up with a json of:
{ output of your lambda }
(your output better be a json comparable object, like a python dict!)
InputPath does the same thing, but for the input: it cuts off everything other than the path described, and only that is sent into the Lambda. It does not stop the result from being appended, so InputPath + ResultPath means less is sent into the Lambda, but everything is together again on exit.
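The ResultPath/OutputPath combination above can be sketched in plain Python. These helpers only handle single-key paths like "$.my_path", purely for illustration; real Step Functions path handling is richer:

```python
def apply_result_path(state_input, task_result, result_path):
    """ResultPath="$.key": graft the task result onto the input under key."""
    key = result_path.removeprefix("$.")
    return {**state_input, key: task_result}

def apply_output_path(combined, output_path):
    """OutputPath="$.key": keep only that subtree as the state's output."""
    return combined[output_path.removeprefix("$.")]

event = {"original_key_1": "some value", "some_other_key": "some other value"}
combined = apply_result_path(event, {"answer": 42}, "$.my_path")
# combined keeps original_key_1 and some_other_key, and adds my_path
out = apply_output_path(combined, "$.my_path")
# out == {"answer": 42} -- everything other than the path is cut off
```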
There isn't really a loop logic like the one you describe however - Task and State Machine definitions are static directions, not dynamic logic.
You can simply handle it inside the Lambda; this is generally the preferred method. However, if you do this, you should use a combination of OutputPath and ResultPath to 'cut off' the input, having replaced the various fields of the incoming event with whatever you want before returning it at the end.
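For the names/newLastName question above, a minimal sketch of doing the rewrite inside the Lambda could look like this (field names are taken from the question; the handler name is illustrative):

```python
def handler(event, context=None):
    """Rewrite every names[i].lastName to the newLastName produced by state B."""
    new_last = event["newLastName"]
    for person in event["names"]:
        person["lastName"] = new_last
    return event
```

Returning the whole (modified) event is what lets state C see both the rewritten names and any other fields that were already present.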
I have a state-machine consisting of a first pre-process task that generates an array as output, which is used by a subsequent map state to loop over. The output array of the first task has gotten too big and the state-machine throws the error States.DataLimitExceeded: The state/task 'arn:aws:lambda:XYZ' returned a result with a size exceeding the maximum number of characters service limit.
Here is an example of the state-machine yaml:
stateMachines:
myStateMachine:
name: "myStateMachine"
definition:
StartAt: preProcess
States:
preProcess:
Type: Task
Resource:
Fn::GetAtt: [preProcessLambda, Arn]
Next: mapState
ResultPath: "$.preProcessOutput"
mapState:
Type: Map
ItemsPath: "$.preProcessOutput.data"
MaxConcurrency: 100
Iterator:
StartAt: doMap
States:
doMap:
Type: Task
Resource:
Fn::GetAtt: [doMapLambda, Arn]
End: true
Next: ### next steps, not relevant
A possible solution I came up with would be that state preProcess saves its output in an S3-bucket and state mapState reads directly from it. Is this possible? At the moment the output of preProcess is
ResultPath: "$.preProcessOutput"
and mapState takes the array
ItemsPath: "$.preProcessOutput.data"
as input.
How would I need to adapt the yaml that the map state reads directly from S3?
I am solving a similar problem at work currently too. Because a step function stores its entire state, you can pretty quickly run into problems as your JSON grows while mapping over all the values.
The only real way to solve this is to use hierarchies of step functions. That is, step functions on your step functions. So you have:
parent -> [batch1, batch2, batch...N]
And then each batch have a number of single jobs:
batch1 -> [j1,j2,j3...jBATCHSIZE]
I had a pretty simple step function, and I found ~4k was about the max batch size I could have before I started hitting state limits.
Not a pretty solution, but hey, it works.
I don't think it is possible to read directly from S3 at this time. There are a few things you could try to do to get around this limitation. One is making your own iterator and not using Map State. Another is the following:
Have a lambda read your s3 file and chunk it by index or some id/key. The idea behind this step is to pass the iterator in Map State a WAY smaller payload. Say your data has the below structure.
[ { idx: 1, ...more keys }, {idx: 2, ...more keys }, { idx: 3, ...more keys }, ... 4,997 more objects of data ]
Say you want your iterator to process 1,000 rows at a time. Return the following tuples representing indexes from your lambda instead: [ [ 0, 999 ], [ 1000, 1999 ], [ 2000, 2999 ], [ 3000, 3999 ], [ 4000, 4999 ] ]
Your Map State will receive this new data structure and each iteration will be one of the tuples. Iteration #1: [ 0, 999 ], Iteration #2: [ 1000, 1999 ], etc
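The tuple generation in the pre-process Lambda can be sketched as follows (the function name is illustrative):

```python
def make_index_tuples(total_rows, chunk_size):
    """Split [0, total_rows) into inclusive [start, end] index pairs."""
    return [[start, min(start + chunk_size, total_rows) - 1]
            for start in range(0, total_rows, chunk_size)]

batches = make_index_tuples(5000, 1000)
# batches == [[0, 999], [1000, 1999], [2000, 2999], [3000, 3999], [4000, 4999]]
```

Each inner pair is then one Map iteration's entire payload, no matter how large the underlying file is.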
Inside your iterator, call a lambda which uses the tuple indexes to query into your S3 file. AWS has a query language over S3 buckets called Amazon S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html
Here’s another great resource on how to use S3 select and get the data into a readable state with node: https://thetrevorharmon.com/blog/how-to-use-s3-select-to-query-json-in-node-js
So, for iteration #1, we are querying the first 1,000 objects in our data structure. I can now call whatever function I normally would have inside my iterator.
What's key about this approach is that InputPath never receives a large data structure.
As of September 2020, the Step Functions payload limit has been increased 8-fold:
https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb/
Maybe now it fits within your requirements
Just writing this in case someone else comes across the issue - I recently had to solve this at work as well. I found what I thought to be a relatively simple solution, without the use of a second step function.
I'm using Python for this and will provide a few examples in Python, but the solution should be applicable to any language.
Assuming the pre-process output looks like so:
[
{Output_1},
{Output_2},
.
.
.
{Output_n}
]
And a simplified version of the section of the Step Function is defined as follows:
"PreProcess": {
"Type": "Task",
"Resource": "Your Resource ARN",
"Next": "Map State"
},
"Map State": {
Do a bunch of stuff
}
To handle the scenario where the PreProcess output exceeds the Step Functions payload:
Inside the PreProcess, batch the output into chunks small enough to not exceed the payload.
This is the most complicated step. You will need to do some experimenting to find the largest size of a single batch. Once you have the number (it may be smart to make it dynamic), you can use numpy to split the original PreProcess output into that many batches.
import numpy as np
batches = np.array_split(original_pre_process_output, number_of_batches)
Again inside the PreProcess, upload each batch to Amazon S3, saving the keys in a new list. This list of S3 keys will be the new PreProcess output.
In Python, this looks like so:
import json
import boto3

s3 = boto3.resource('s3')
batch_keys = []
for batch in batches:
    s3_batch_key = 'Your S3 Key here'
    # np.array_split returns numpy arrays; convert to a plain list so json.dumps works
    s3.Bucket(YOUR_BUCKET).put_object(Key=s3_batch_key, Body=json.dumps(batch.tolist()))
    batch_keys.append({'batch_key': s3_batch_key})
In the solution I implemented, I used for batch_id, batch in enumerate(batches) to easily give each S3 key its own ID.
Wrap the 'Inner' Map State in an 'Outer' Map State, and create a Lambda function within the Outer Map to feed the batches to the Inner Map.
Now that we have a small output consisting of S3 keys, we need a way to open one at a time, feeding each batch into the original (now 'Inner') Map state.
To do this, first create a new Lambda function - this will represent the BatchJobs state. Next, wrap the initial Map state inside an Outer map, like so:
"PreProcess": {
"Type": "Task",
"Resource": "Your Resource ARN",
"Next": "Outer Map"
},
"Outer Map": {
"Type": "Map",
"MaxConcurrency": 1,
"Next": "Original 'Next' used in the Inner map",
"Iterator": {
"StartAt": "BatchJobs",
"States": {
"BatchJobs": {
"Type": "Task",
"Resource": "Newly created Lambda Function ARN",
"Next": "Inner Map"
},
"Inner Map": {
Initial Map State, left as is.
}
}
}
}
Note the 'MaxConcurrency' parameter in the Outer Map - This simply ensures the batches are executed sequentially.
With this new Step Function definition, the BatchJobs state will receive {'batch_key': s3_batch_key}, for each batch. The BatchJobs state then simply needs to get the object stored in the key, and pass it to the Inner Map.
In Python, the BatchJobs Lambda function looks like so:
import json
import boto3
s3 = boto3.client('s3')
def batch_jobs_handler(event, context):
return json.loads(s3.get_object(Bucket='YOUR_BUCKET_HERE',
Key=event.get('batch_key'))['Body'].read().decode('utf-8'))
Update your workflow to handle the new structure of the output.
Before implementing this solution, your Map state outputs an array of outputs:
[
{Map_output_1},
{Map_output_2},
.
.
.
{Map_output_n}
]
With this solution, you will now get a list of lists, with each inner list containing the results of each batch:
[
[
{Batch_1_output_1},
{Batch_1_output_2},
.
.
.
{Batch_1_output_n}
],
[
{Batch_2_output_1},
{Batch_2_output_2},
.
.
.
{Batch_2_output_n}
],
.
.
.
[
{Batch_n_output_1},
{Batch_n_output_2},
.
.
.
{Batch_n_output_n}
]
]
Depending on your needs, you may need to adjust some code after the Map in order to handle the new format of the output.
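For example, if downstream code expects the original flat list of outputs, the batched list-of-lists can be flattened with a one-liner (a generic Python sketch):

```python
from itertools import chain

# Shape produced by the Outer Map: one inner list per batch.
batched_output = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
flat_output = list(chain.from_iterable(batched_output))
# flat_output == [{"id": 1}, {"id": 2}, {"id": 3}]
```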
That's it! As long as you set the max batch size correctly, the only way you will hit a payload limit is if your list of S3 keys exceeds the payload limit.
The proposed workarounds work for specific scenarios, but not for the general one, in which processing a normal payload can generate a big list of items that exceeds the payload limit.
In general, I think the problem can repeat in any 1->N scenario, i.e. whenever one step might generate many step executions in the workflow.
One of the clearest ways to break down a complex task is to divide it into many smaller ones, so this is likely to be needed often. From a scalability perspective there is also a clear advantage: the more you break big computations into little ones, the more granularity there is, and the more parallelism and optimization can be achieved.
That is what AWS intends to facilitate by increasing the max payload size; they call it dynamic parallelism.
The problem is that the Map state is the cornerstone of that. Besides the service integrations (database queries, etc.), it is the only state that can dynamically derive many tasks from just one step. But there seems to be no way to tell it that the payload is in a file.
A quick solution would be for AWS to add an optional persistence spec to each step, for example:
stateMachines:
myStateMachine:
name: "myStateMachine"
definition:
StartAt: preProcess
States:
preProcess:
Type: Task
Resource:
Fn::GetAtt: [preProcessLambda, Arn]
Next: mapState
ResultPath: "$.preProcessOutput"
OutputFormat:
S3:
Bucket: myBucket
Compression:
Format: gzip
mapState:
Type: Map
ItemsPath: "$.preProcessOutput.data"
InputFormat:
S3:
Bucket: myBucket
Compression:
Format: gzip
MaxConcurrency: 100
Iterator:
StartAt: doMap
States:
doMap:
Type: Task
Resource:
Fn::GetAtt: [doMapLambda, Arn]
End: true
Next: ### next steps, not relevant
That way the Map could perform its work even over large payloads.
There is now a Map State in Distributed Mode:
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html
Use the Map state in Distributed mode when you need to orchestrate
large-scale parallel workloads that meet any combination of the
following conditions:
The size of your dataset exceeds 256 KB.
The workflow's execution event history exceeds 25,000 entries.
You need a concurrency of more than 40 parallel iterations.
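Applied to the original question, a Distributed-mode Map can read its items directly from an S3 object via an ItemReader, which removes the need to pass the array through the state payload at all. A hedged sketch (bucket, key, and state names are illustrative):

```json
"mapState": {
  "Type": "Map",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:getObject",
    "ReaderConfig": { "InputType": "JSON" },
    "Parameters": {
      "Bucket": "myBucket",
      "Key": "preProcessOutput/data.json"
    }
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
    "StartAt": "doMap",
    "States": {
      "doMap": { "Type": "Task", "Resource": "doMapLambdaArn", "End": true }
    }
  },
  "MaxConcurrency": 100,
  "End": true
}
```

The pre-process step then only needs to return the S3 location, not the data itself.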
I use Step Functions for a big loop. So far no problem, but the day my loop exceeded 8,000 executions I ran into the error "Maximum execution history size", whose limit is 25,000 events.
Is there a solution for not hitting the history event limit?
Otherwise, where can I easily migrate my step functions (3 lambdas)? AWS Batch would require a lot of code rewriting.
Thanks a lot
One approach to avoid the 25k history event limit is to add a Choice state in your loop that takes in a counter or boolean and decides whether to exit the loop.
Outside of the loop you can put a Lambda function that starts another execution (with a different ID). After this, your current execution completes normally and the other execution continues the work.
Please note that the "LoopProcessor" in the example below must return a variable "$.breakOutOfLoop" to break out of the loop, which must also be determined somewhere in your loop and passed through.
Depending on your use case, you may need to restructure the data you pass around. For example, if you are processing a lot of data, you may want to consider using S3 objects and pass the ARN as input/output through the state machine execution. If you are trying to do a simple loop, one easy way would be to add a start offset (think of it as a global counter) that is passed into the execution as input, and each LoopProcessor Task will increment a counter (with the start offset as the initial value). This is similar to pagination solutions.
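The counter-driven LoopProcessor described above can be sketched minimally like this (the field names and the per-execution iteration cap are illustrative, not prescribed by Step Functions):

```python
def loop_processor_handler(event, context=None):
    """Process one item, advance the counter, and set $.breakOutOfLoop
    well before the 25k event-history limit is reached."""
    counter = event.get("counter", 0)
    # ... process item number `counter` here ...
    counter += 1
    event["counter"] = counter
    # ~8k iterations was the practical ceiling reported in the question,
    # so break out with plenty of headroom.
    max_per_execution = event.get("maxIterationsPerExecution", 8000)
    event["breakOutOfLoop"] = counter >= max_per_execution
    return event
```

The StartNewLooperExecution Lambda would then pass the final counter as input to the fresh execution, so work resumes where it left off.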
Here is a basic example of the ASL structure to avoid the 25k history event limit:
{
"Comment": "An example looping while avoiding the 25k event history limit.",
"StartAt": "FirstState",
"States": {
"FirstState": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"Next": "ChoiceState"
},
"ChoiceState": {
"Type" : "Choice",
"Choices": [
{
"Variable": "$.breakOutOfLoop",
"BooleanEquals": true,
"Next": "StartNewExecution"
}
],
"Default": "LoopProcessor"
},
"LoopProcessor": {
"Type" : "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ProcessWork",
"Next": "ChoiceState"
},
"StartNewExecution": {
"Type" : "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:StartNewLooperExecution",
"Next": "FinalState"
},
"FinalState": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"End": true
}
}
}
Hope this helps!
To guarantee the execution of all the steps and their order, Step Functions stores the history of execution after the completion of each state; this storage is the reason behind the limit on execution history size.
Having said that, one way to mitigate this limit is to follow #sunnyD's answer. However, it has the limitations below:
The invoker of a step function (if there is one) will not get the execution output of the complete data. Instead, it gets the output of the first execution in a chain of executions.
The limit on execution history size has a good chance of increasing in future versions, so writing logic around this number would require you to modify the code/configuration every time the limit changes.
Another alternative is to arrange step functions as parent and child step functions. In this arrangement, the parent step function contains a task to loop through the entire data set and create a new execution of the child step function for each record or set of records (a number which will not exceed the history execution limit of a child SF). The second step in the parent step function waits for a period of time before it checks the CloudWatch metrics for completion of all child executions and exits with the output.
Few things to keep in mind about this solution are,
The StartExecution API throttles with a bucket size of 500 and a refill rate of 25 per second.
Make sure the wait time in the parent SF is sufficient for the child SFs to finish their executions; otherwise, implement a loop to check for completion of the child SFs.
In my application I have the concept of a Draw, and that Draw has to always be contained within an Order.
A Draw has a set of attributes: background_color, font_size, ...
Quoting the famous REST thesis:
Any information that can be named can be a resource: a document or
image, a temporal service (e.g. "today's weather in Los Angeles"), a
collection of other resources, a non-virtual object (e.g. a person),
and so on.
So, my collection of other resources here would be an Order. An Order is a set of Draws (usually more than thousands). I want to let the User create an Order with several Draws, and here is my first approach:
{
"order": {
"background_color" : "rgb(255,255,255)", "font_size" : 10,
"draws_attributes": [{
"background_color" : "rgb(0,0,0)", "font_size" : 14
}, {
"other_attribute" : "value",
},
]
}
}
A response to this would look like this:
"order": {
"id" : 30,
"draws": [{
"id" : 4
}, {
"id" : 5
},
]
}
}
So the User would know which resources have been created in the DB. However, when there are many draws in the request, the response takes a while, because all those draws have to be inserted into the DB. Imagine doing 10,000 inserts if an Order has 10,000 draws.
Since I need to give the User the IDs of the draws that were just created (created but not finished, because when the Order is processed we actually build the Draw with some image-manipulation libraries) so they can fetch them later, I fail to see how to deal with this in a RESTful way that avoids a long HTTP request while still giving the User some kind of IDs for the draws.
How do you deal with this kind of situations?
Accept the request wholesale, queue the processing, return a status URL that represents the state of the request. When the request is finished processing, present a url that represents the results of the request. Then, poll.
POST /submitOrder
202 Accepted
Location: http://host.com/orderstatus/1234
GET /orderstatus/1234
200
{ status:"PROCESSING", msg: "Request still processing"}
...
GET /orderstatus/1234
200
{ status:"COMPLETED", msg: "Request completed", rel="http://host.com/orderresults/3456" }
Addenda:
Well, there's a few options.
1) They can wait for the result to process and get the IDs when it's done, just like now. The difference with what I suggested is that the state of the network connection is not tied to the success or failure of the transaction.
2) You can pre-assign the order ids before hitting the database, and return those to the caller. But be aware that those resources do not exist yet (and they won't until the processing is completed).
3) Speed up your system to where the timeout is simply not an issue.
I think your exposed granularity is too fine - does the user need to be able to modify each Draw separately? If not, then present a document that represents an Order, and that contains naturally the Draws.
Will you need to query specific Draws from the database based on specific criteria that are unrelated to the Order? If not, then represent all the Draws as a single blob that is part of a row that represents the Order.
We have a data set that grows while the application is processing the data set. After a long discussion we have come to the decision that we do not want blocking or asynchronous APIs at this time, and we will periodically query our data store.
We thought of two options to design an API for querying our storage:
A query method returns a snapshot of the data and a flag indicating whether we might have more data. When we finish iterating over the last returned snapshot, we query again to get another snapshot of the rest of the data.
A query method returns a "live" iterator over the data, and when this iterator advances it returns one of the following options: Data is available, No more data, Might have more data.
We are using C++ and we borrowed the .NET style enumerator API for reasons which are out of scope for this question. Here is some code to demonstrate the two options. Which option would you prefer?
/* ======== FIRST OPTION ============== */
// similar to the familiar .NET enumerator.
class IFooEnumerator
{
// true --> A data element may be accessed using the Current() method
// false --> End of sequence. Calling Current() is an invalid operation.
virtual bool MoveNext() = 0;
virtual Foo Current() const = 0;
virtual ~IFooEnumerator() {}
};
enum class Availability
{
EndOfData,
MightHaveMoreData,
};
class IDataProvider
{
// Query params allow specifying the ID of the starting element. Here is the intended usage pattern:
// 1. Call GetFoo() without specifying a starting point.
// 2. Process all elements returned by IFooEnumerator until it ends.
// 3. Check the availability.
// 3.1 MightHaveMoreData --> Invoke GetFoo() again after some time, specifying the last processed element as the starting point,
// and repeat steps (2) and (3)
// 3.2 EndOfData --> The data set will not grow any more and we know that we have finished processing.
virtual std::tuple<std::unique_ptr<IFooEnumerator>, Availability> GetFoo(query-params) = 0;
};
/* ====== SECOND OPTION ====== */
enum class Availability
{
HasData,
MightHaveMoreData,
EndOfData,
};
class IGrowingFooEnumerator
{
// HasData:
// We might access the current data element by invoking Current()
// EndOfData:
// The data set has finished growing and no more data elements will arrive later
// MightHaveMoreData:
// The data set will grow and we need to continue calling MoveNext() periodically (preferably after a short delay)
// until we get a "HasData" or "EndOfData" result.
virtual Availability MoveNext() = 0;
virtual Foo Current() const = 0;
virtual ~IGrowingFooEnumerator() {}
};
class IDataProvider
{
virtual std::unique_ptr<IGrowingFooEnumerator> GetFoo(query-params) = 0;
};
Update
Given the current answers, here is some clarification. The debate is mainly over the interface: its expressiveness and intuitiveness in representing queries for a growing data set that will, at some point in time, stop growing. Both interfaces can be implemented without race conditions (at least we believe so) because of the following properties:
The 1st option can be implemented correctly if the pair of the iterator + the flag represent a snapshot of the system at the time of querying. Getting snapshot semantics is a non-issue, as we use database transactions.
The 2nd option can be implemented given a correct implementation of the 1st option. The "MoveNext()" of the 2nd option will, internally, use something like the 1st option and re-issue the query if needed.
The data-set can change from "Might have more data" to "End of data", but not vice versa. So if we, wrongly, return "Might have more data" because of a race condition, we just get a small performance overhead because we need to query again, and the next time we will receive "End of data".
"Invoke GetFoo() again after some time by specifying the last processed element as the starting point"
How are you planning to do that? If it's using the earlier-returned IFooEnumerator, then functionally the two options are equivalent. Otherwise, letting the caller destroy the "enumerator" then however-long afterwards call GetFoo() to continue iteration means you're losing your ability to monitor the client's ongoing interest in the query results. It might be that right now you have no need for that, but I think it's poor design to exclude the ability to track state throughout the overall result processing.
It really depends on many things whether the overall system will at all work (not going into details about your actual implementation):
No matter how you twist it, there will be a race condition between checking for "Is there more data" and more data being added to the system. Which means that it's possibly pointless to try to capture the last few data items?
You probably need to limit the number of repeated runs for "is there more data", or you could end up in an endless loop of "new data came in while processing the last lot".
How easy it is to know if data has been updated - if all the updates are "new items" with new ID's that are sequentially higher, you can simply query "Is there data above X", where X is your last ID. But if you are, for example, counting how many items in the data has property Y set to value A, and data may be updated anywhere in the database at the time (e.g. a database of where taxis are at present, that gets updated via GPS every few seconds and has thousands of cars, it may be hard to determine which cars have had updates since last time you read the database).
As to your implementation, in option 2, I'm not sure what you mean by the MightHaveMoreData state - either it has, or it hasn't, right? Repeated polling for more data is a bad design in this case - given that you will never be able to say 100% certain that there hasn't been "new data" provided in the time it took from fetching the last data until it was processed and acted on (displayed, used to buy shares on the stock market, stopped the train or whatever it is that you want to do once you have processed your new data).
A read-write lock could help: many readers have simultaneous access to the data set, and only one writer.
The idea is simple:
-when you need read-only access, the reader takes a read-lock, which can be shared with other readers but is exclusive with writers;
-when you need write access, the writer takes a write-lock, which is exclusive with both readers and writers.