How can I structure learning for this kind of data set - ml.net

I have about 250 events that can be classified as A or B.
Each of these events is a time series with 2880 data points (1 event / min for 48h).
In each event, it's the progression of all these data points that leads to the classification.
For each event in the set, I have done the classification as A or B.
I want to train a model that, given a partial event (for example, the first 1500 data points), can tell me whether the event is likely to end up as type A or type B.
So essentially my events are like (F#):
type Event =
    {
        Data: float list  // 2880 points from the time series
        EventType: char   // A or B
    }

let events = Dictionary<string, Event>() // id -> event data, 250 entries
and then I have partial data:
partial: float list // may contain 1500 data points for example
and I want to know whether it's going to be an A type or a B type, and with what confidence.
How can this be organized?

Related

How to query big data in DynamoDB: best practice

I have a scenario: query the list of students in a school, by year, and then use that information to do some other task, let's say printing a certificate for each student.
I'm using the serverless framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
  const params = {
    TableName: `schoolTable`,
    KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: { ':school_id': _school_id, ':year': _year },
  };
  try {
    let _students = [];
    let items;
    do {
      items = await dynamoClient.query(params).promise();
      _students = _students.concat(items.Items); // accumulate pages instead of overwriting
      params.ExclusiveStartKey = items.LastEvaluatedKey;
    } while (typeof items.LastEvaluatedKey !== 'undefined');
    return _students;
  } catch (e) {
    console.log('Error: ', e);
  }
};

const mainHandler = async (event, context) => {
  …
  let students = await queryStudent(body.school_id, body.year);
  await printCertificate(students);
  …
};
So far, it's working well with about 5k students (just sample data).
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As far as I know, Lambda has a limited execution time; if the number of students goes up to a million, does the above solution still work?
Any best-practice approach for this scenario would be very much appreciated.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather across the shards when reading (see the sketch after this list).
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) within its 15-minute limit, and if it fails before it's completely through, you lose the information about how far it got. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check whether it exists on new Lambda invocations, and start from there.
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This won't scale if it's a synchronous process and you have a million students. If something fails, you also have to consider retries and build that logic into your code.
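As a rough illustration of the sharding idea from the first point (a sketch only, shown in Java; NUM_SHARDS and the "#" suffix convention are made up for the example):

// Sketch only: read/write sharding by suffixing the partition key.
import java.util.ArrayList;
import java.util.List;

public class ShardedSchoolKey {

    static final int NUM_SHARDS = 10;

    // Write side: spread students of one school across NUM_SHARDS item collections.
    static String partitionKeyFor(String schoolId, String studentId) {
        int shard = Math.floorMod(studentId.hashCode(), NUM_SHARDS);
        return schoolId + "#" + shard;
    }

    // Read side: scatter-gather, i.e. query every shard key and merge the results.
    static List<String> partitionKeysFor(String schoolId) {
        List<String> keys = new ArrayList<>();
        for (int shard = 0; shard < NUM_SHARDS; shard++) {
            keys.add(schoolId + "#" + shard);
        }
        return keys;
    }
}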
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across the sharded partition keys and writes each student into an SQS queue of certificate-printing tasks. If the Lambda function notices that it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)
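To make the checkpointing part concrete, here is a rough sketch of what the query Lambda in that Step Function could look like, written in Java with the AWS SDK v2 purely for illustration (the Node.js version has the same shape). The class name, the 60-second safety margin and enqueueForCertificatePrinting are made up for the example, not an existing API:

// Sketch only: a checkpointing query loop driven by a Step Function.
import java.util.List;
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class QueryStudentsHandler {

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Returns the LastEvaluatedKey to resume from, or null when the query is complete.
    public Map<String, AttributeValue> queryPages(String schoolId, String year,
                                                  Map<String, AttributeValue> checkpoint,
                                                  Context context) {
        Map<String, AttributeValue> exclusiveStartKey = checkpoint;
        do {
            QueryRequest.Builder request = QueryRequest.builder()
                .tableName("schoolTable")
                .keyConditionExpression("partition_key = :school_id AND begins_with(sort_key, :year)")
                .expressionAttributeValues(Map.of(
                    ":school_id", AttributeValue.fromS(schoolId),
                    ":year", AttributeValue.fromS(year)));
            if (exclusiveStartKey != null) {
                request.exclusiveStartKey(exclusiveStartKey);
            }

            QueryResponse response = dynamo.query(request.build());
            enqueueForCertificatePrinting(response.items()); // e.g. send to SQS; omitted here

            exclusiveStartKey = response.hasLastEvaluatedKey() ? response.lastEvaluatedKey() : null;

            // Stop early and hand the checkpoint back to the Step Function
            // before we run into the 15-minute Lambda limit.
            if (context.getRemainingTimeInMillis() < 60_000) {
                return exclusiveStartKey;
            }
        } while (exclusiveStartKey != null);
        return null; // done
    }

    private void enqueueForCertificatePrinting(List<Map<String, AttributeValue>> items) {
        // Push each student onto an SQS queue for the certificate-printing workers (omitted).
    }
}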

GCP Dataflow droppedDueToClosedWindow & Commit request for stage S8 and key 8 is larger than 2GB

We run into problems with our Dataflow on Google Cloud. Our pipeline consists of various input steps, which get data pushed in with GCP PubSub. We then aggregate the data and sort it. These steps [1] are clearly too heavy for Dataflow and the window we configured. We get an exception [2] on the step. We also see these metrics:
droppedDueToClosedWindow 3,838,662 Bids/AggregateExchangeOrders
droppedDueToClosedWindow 21,060,627 Asks/AggregateExchangeOrders
Now I am seeking advice on how to attack this issue. Should I break down the steps, so that, for example, iterations and sorting can be done in parallel steps?
Is there a way to get more information about what exactly happens?
Should we increase the number of workers? (Currently 1.)
We are rather new to Dataflow, so good advice is most welcome.
Edit: I am adding a bit of detail on the steps.
This is how the steps below are 'chained' together:
@Override
public PCollection<KV<KV<String, String>, List<ExchangeOrder>>> expand(PCollection<KV<String, KV<String, String>>> input) {
    return input.apply("PairWithType", new ByPairWithType(type))
        .apply("UnfoldExchangeOrders", new ByAggregatedExchangeOrders())
        .apply("AggregateExchangeOrders", GroupByKey.<KV<String, String>, KV<String, KV<BigDecimal, BigDecimal>>>create())
        .apply("ReorderExchangeOrders", ParDo.of(new ReorderExchangeOrders()));
}
AggregateExchangeOrders:
So here we clearly iterate through a collection of orders and parse each value twice, so that it ends up as a BigDecimal.
Which makes me think we could skip one parse step, as described here:
Convert string to BigDecimal in java
@ProcessElement
public void processElement(ProcessContext c) {
    KV<String, KV<String, String>> key = c.element().getKey();
    List<KV<String, String>> value = c.element().getValue();
    value.forEach(
        exchangeOrder -> {
            try {
                BigDecimal unitPrice = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getKey()));
                BigDecimal quantity = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getValue()));
                if (quantity.compareTo(BigDecimal.ZERO) != 0) {
                    // Exclude exchange orders with no quantity.
                    c.output(KV.of(key.getValue(), KV.of(key.getKey(), KV.of(unitPrice, quantity))));
                }
            } catch (NumberFormatException e) {
                // Exclude exchange orders with an invalid element.
            }
        });
}
...next we group and sort (and optionally reverse it); this step does not seem to be taking a huge load.
ReorderExchangeOrders:
@ProcessElement
public void processElement(ProcessContext c) {
    KV<String, String> pairAndType = c.element().getKey();
    Iterable<KV<String, KV<BigDecimal, BigDecimal>>> exchangeOrderBook = c.element().getValue();
    List<ExchangeOrder> list = new ArrayList<>();
    exchangeOrderBook.forEach(exchangeOrder -> list.add(
        new ExchangeOrder(exchangeOrder.getKey(), exchangeOrder.getValue().getKey(), exchangeOrder.getValue().getValue())));
    // Asks are sorted in ASC order
    Collections.sort(list);
    // Bids are sorted in DESC order
    if (pairAndType.getValue().equals(EXCHANGE_ORDER_TYPE.BIDS.toString())) {
        Collections.reverse(list);
    }
    c.output(KV.of(pairAndType, list));
}
[ 1 ] Dataflow screenshot:
[ 2 ] Exception: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed.
java.lang.IllegalStateException: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed. This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element.
com.google.cloud.dataflow.worker.StreamingDataflowWorker$Commit.getSize(StreamingDataflowWorker.java:327)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.lambda$new$0(StreamingDataflowWorker.java:342)
The error message is kind of straightforward.
The root cause of the problem, as many of the comments point out, is that the structure that contains all the results for one of the DoFn's is larger than 2GB, and your best option would be to partition your data in some way to make your work units smaller.
In the code I see that some of the structures returned by DoFns are nested structures in the form KV<Key1, KV<Key2, Value>>. This arrangement forces Dataflow to send the whole response back in one monolithic bundle, and prevents it from chunking it into smaller pieces.
One possible solution would be to use composite keys instead of nested structures for as long as possible in the pipeline, and only combine them when strictly necessary.
For example,
instead of KV<Key1, KV<Key2, Value>>, the DoFn could return
KV<concat(Key1, Key2), Value>
This would split the work units into much smaller sets that can then be dispatched in parallel to multiple workers.
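To make that concrete, a sketch of what such a DoFn could look like (simplified element types; the class name and the "|" separator are made up for the example):

// Sketch only: emitting a flat composite key instead of a nested KV.
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class ToCompositeKeyFn extends DoFn<KV<String, KV<String, String>>, KV<String, String>> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        String key1 = c.element().getKey();                // e.g. the pair
        String key2 = c.element().getValue().getKey();     // e.g. the type (asks/bids)
        String value = c.element().getValue().getValue();  // the payload

        // One flat key instead of KV<key1, KV<key2, value>>; split it again
        // downstream only where the nested form is strictly necessary.
        c.output(KV.of(key1 + "|" + key2, value));
    }
}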
To answer the other questions: increasing the number of workers will have no effect, as the huge collection generated by the DoFn does not appear to be splittable. Adding logging to see how the collection grows to 2GB might provide useful hints for preventing this.

Creating an Akka stream for parallel processing of collection elements

I am trying to define a graph for an Akka stream that contains a parallel processing flow (I am using Akka.NET, but this shouldn't matter). Imagine a data source of orders; each order consists of an order ID and a list of products (order items). The workflow is as follows:
1. Receive an order
2. Broadcast the order to two flows: flow A will deal with the order items, flow B will deal with the order ID (some bookkeeping work)
3. Flow A: Split the collection of order items into individual elements, each one to be processed separately
4. Flow A: For each order item that results from the split in the previous step, call some external service which looks up extra information (price, availability etc.)
5. Flow B: do some extra bookkeeping for the given order ID
6. Merge flows A and B
7. Send the merged data from the previous step to the sink, which results in enriched order information
Steps 1 (Source.From), 2 (Broadcast), 4-5 (Map), 6 (Merge), and 7 (Sink) look OK. But how is the collection split in step 3 implemented in Akka or Reactive Streams terms? This is not broadcasting or flattening: a collection of N elements needs to be split into N independent substreams that will later be merged back. How is this achieved?
I recommend doing it in one flow. I know two flows look cooler, but trust me, it's not worth it in terms of simplicity of design (I tried). You may write something like this:
import akka.stream.scaladsl.{Flow, Sink, Source, SubFlow}
import scala.collection.immutable
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // needed for Future { ... }

case class Item()
case class Order(items: List[Item])

val flow = Flow[Order]
  .mapAsync(4) { order =>
    Future {
      // Enrich your order here
      order
    }
  }
  .mapConcat { order =>
    // Fan the order out into (order, item) pairs, one element per item
    order.items.map(order -> _)
  }
  .mapAsync(4) { case (order, item) =>
    Future {
      // Enrich your item here
      order -> item
    }
  }
  .groupBy(2, tuple => tuple._1) // group the items back by their order
  .fold[Map[Order, List[Item]]](immutable.Map.empty) {
    case (map, (order, item)) => map.updated(order, map.getOrElse(order, Nil) :+ item)
  }
  .mapConcat { _.map { case (order, newItems) => order.copy(items = newItems) } }
  .mergeSubstreams // collapse the per-order substreams back into a single flow
but even this approach is bad. There are so many things that can go wrong, either with the code above or with your design. What will you do if enrichment of one of the order's items fails? What if enrichment of the order object fails? What should happen to your stream(s)?
If I were you, I'd have a Flow[Order] and process its children inside mapAsync, so that at least it guarantees I don't end up with partially processed orders.
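A rough sketch of that shape, shown in Akka Streams' Java DSL purely for illustration (the asker is on Akka.NET anyway); Order, Item, enrichOrder and enrichItem are made-up placeholders:

// Sketch only: enrich each order, items included, inside a single mapAsync stage,
// so an order is either fully enriched or fails as a whole.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

import akka.NotUsed;
import akka.stream.javadsl.Flow;

public class EnrichOrders {

    record Item(String productId, double price) {}
    record Order(String id, List<Item> items) {}

    static Item enrichItem(Item item) {
        // call the external lookup service here (price, availability, ...)
        return item;
    }

    static Order enrichOrder(Order order) {
        List<Item> enriched = order.items().stream()
            .map(EnrichOrders::enrichItem)
            .collect(Collectors.toList());
        return new Order(order.id(), enriched); // order-ID bookkeeping could happen here too
    }

    // Up to 4 orders are enriched concurrently; each order is processed as one unit.
    static Flow<Order, Order, NotUsed> flow() {
        return Flow.of(Order.class)
            .mapAsync(4, order -> CompletableFuture.supplyAsync(() -> enrichOrder(order)));
    }
}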

How to RESTfully support the creation of a resource which is a collection of other resources, and avoid HTTP timeouts due to DB creation?

In my application I have the concept of a Draw, and that Draw has to always be contained within an Order.
A Draw has a set of attributes: background_color, font_size, ...
Quoting the famous REST thesis:
Any information that can be named can be a resource: a document or
image, a temporal service (e.g. "today's weather in Los Angeles"), a
collection of other resources, a non-virtual object (e.g. a person),
and so on.
So, my collection of other resources here would be an Order. An Order is a set of Draws (usually more than thousands). I want to let the User create an Order with several Draws, and here is my first approach:
{
  "order": {
    "background_color": "rgb(255,255,255)",
    "font_size": 10,
    "draws_attributes": [{
      "background_color": "rgb(0,0,0)",
      "font_size": 14
    }, {
      "other_attribute": "value"
    }]
  }
}
A response to this would look like this:
"order": {
"id" : 30,
"draws": [{
"id" : 4
}, {
"id" : 5
},
]
}
}
So the User would know which resources have been created in the DB. However, when there are many draws in the request, the response takes a while, since all those draws are inserted in the DB. Imagine doing 10,000 inserts if an Order has 10,000 draws.
I need to give the User the IDs of the draws that were just created (created but not finished, because when the Order is processed we actually build the Draws with some image manipulation libraries) so they can fetch them later. I fail to see how to deal with this in a RESTful way: avoiding an HTTP request that takes a long time, while at the same time giving the User some kind of IDs for the Draws so they can fetch them later.
How do you deal with this kind of situation?
Accept the request wholesale, queue the processing, return a status URL that represents the state of the request. When the request is finished processing, present a url that represents the results of the request. Then, poll.
POST /submitOrder
202 Accepted
Location: http://host.com/orderstatus/1234
GET /orderstatus/1234
200
{ status:"PROCESSING", msg: "Request still processing"}
...
GET /orderstatus/1234
200
{ status:"COMPLETED", msg: "Request completed", rel="http://host.com/orderresults/3456" }
Addenda:
Well, there's a few options.
1) They can wait for the result to process and get the IDs when it's done, just like now. The difference with what I suggested is that the state of the network connection is not tied to the success or failure of the transaction.
2) You can pre-assign the order ids before hitting the database, and return those to the caller (see the sketch after this list). But be aware that those resources do not exist yet (and they won't until the processing is completed).
3) Speed up your system to where the timeout is simply not an issue.
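Regarding option 2, a minimal sketch of pre-assigning identifiers before the heavy work happens (UUIDs and enqueueForCreation are just placeholders for whatever ID scheme and background mechanism you use):

// Sketch only: pre-assign IDs, return them immediately, create the rows later.
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

public class OrderIntake {

    record PendingOrder(String orderId, List<String> drawIds) {}

    public PendingOrder acceptOrder(List<String> drawPayloads) {
        String orderId = UUID.randomUUID().toString();
        List<String> drawIds = drawPayloads.stream()
            .map(p -> UUID.randomUUID().toString())
            .collect(Collectors.toList());

        // The caller gets the IDs right away; the resources do not exist yet.
        enqueueForCreation(orderId, drawIds, drawPayloads);
        return new PendingOrder(orderId, drawIds);
    }

    private void enqueueForCreation(String orderId, List<String> drawIds, List<String> drawPayloads) {
        // Hand the actual inserts and image processing to a background worker (omitted).
    }
}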
I think your exposed granularity is too fine - does the user need to be able to modify each Draw separately? If not, then present a document that represents an Order, and that contains naturally the Draws.
Will you need to query specific Draws from the database based on specific criteria that are unrelated to the Order? If not, then represent all the Draws as a single blob that is part of a row that represents the Order.

Querying a growing data-set

We have a data set that grows while the application is processing the data set. After a long discussion we have come to the decision that we do not want blocking or asynchronous APIs at this time, and we will periodically query our data store.
We thought of two options to design an API for querying our storage:
A query method returns a snapshot of the data and a flag indicating whether we might have more data. When we finish iterating over the last returned snapshot, we query again to get another snapshot for the rest of the data.
A query method returns a "live" iterator over the data, and when this iterator advances it returns one of the following options: Data is available, No more data, Might have more data.
We are using C++ and we borrowed the .NET style enumerator API for reasons which are out of scope for this question. Here is some code to demonstrate the two options. Which option would you prefer?
/* ======== FIRST OPTION ============== */
// Similar to the familiar .NET enumerator.
class IFooEnumerator
{
public:
    // true  --> A data element may be accessed using the Current() method.
    // false --> End of sequence. Calling Current() is an invalid operation.
    virtual bool MoveNext() = 0;
    virtual Foo Current() const = 0;
    virtual ~IFooEnumerator() {}
};

enum class Availability
{
    EndOfData,
    MightHaveMoreData,
};

class IDataProvider
{
public:
    // Query params allow specifying the ID of the starting element. Here is the intended usage pattern:
    // 1. Call GetFoo() without specifying a starting point.
    // 2. Process all elements returned by IFooEnumerator until it ends.
    // 3. Check the availability.
    //    3.1 MightHaveMoreData --> Invoke GetFoo() again after some time, specifying the last processed
    //        element as the starting point, and repeat steps (2) and (3).
    //    3.2 EndOfData --> The data set will not grow any more and we know that we have finished processing.
    virtual std::tuple<std::unique_ptr<IFooEnumerator>, Availability> GetFoo(/* query params */) = 0;
};
/* ====== SECOND OPTION ====== */
enum class Availability
{
    HasData,
    MightHaveMoreData,
    EndOfData,
};

class IGrowingFooEnumerator
{
public:
    // HasData:
    //   The current data element may be accessed by invoking Current().
    // EndOfData:
    //   The data set has finished growing and no more data elements will arrive later.
    // MightHaveMoreData:
    //   The data set will grow and we need to continue calling MoveNext() periodically (preferably after
    //   a short delay) until we get a "HasData" or "EndOfData" result.
    virtual Availability MoveNext() = 0;
    virtual Foo Current() const = 0;
    virtual ~IGrowingFooEnumerator() {}
};

class IDataProvider
{
public:
    virtual std::unique_ptr<IGrowingFooEnumerator> GetFoo(/* query params */) = 0;
};
Update
Given the current answers, I have some clarification. The debate is mainly over the interface - its expressiveness and intuitiveness in representing queries for a growing data-set that at some point in time will stop growing. The implementation of both interfaces is possible without race conditions (at-least we believe so) because of the following properties:
The 1st option can be implemented correctly if the pair of the iterator + the flag represent a snapshot of the system at the time of querying. Getting snapshot semantics is a non-issue, as we use database transactions.
The 2nd option can be implemented given a correct implementation of the 1st option. The "MoveNext()" of the 2nd option will, internally, use something like the 1st option and re-issue the query if needed.
The data-set can change from "Might have more data" to "End of data", but not vice versa. So if we, wrongly, return "Might have more data" because of a race condition, we just get a small performance overhead because we need to query again, and the next time we will receive "End of data".
"Invoke GetFoo() again after some time by specifying the last processed element as the starting point"
How are you planning to do that? If it's using the earlier-returned IFooEnumerator, then functionally the two options are equivalent. Otherwise, letting the caller destroy the "enumerator" and then, however long afterwards, call GetFoo() to continue the iteration means you lose the ability to monitor the client's ongoing interest in the query results. It might be that right now you have no need for that, but I think it's poor design to exclude the ability to track state throughout the overall result processing.
Whether the overall system will work at all really depends on many things (not going into details about your actual implementation):
No matter how you twist it, there will be a race condition between checking for "Is there more data" and more data being added to the system. Which means that it's possibly pointless to try to capture the last few data items?
You probably need to limit the number of repeated runs for "is there more data", or you could end up in an endless loop of "new data came in while processing the last lot".
How easy it is to know whether data has been updated: if all the updates are "new items" with new IDs that are sequentially higher, you can simply query "Is there data above X?", where X is your last ID. But if you are, for example, counting how many items in the data have property Y set to value A, and the data may be updated anywhere in the database at any time (e.g. a database of where taxis currently are, which gets updated via GPS every few seconds and has thousands of cars), it may be hard to determine which cars have had updates since the last time you read the database.
As to your implementation: in option 2, I'm not sure what you mean by the MightHaveMoreData state - either there is more data, or there isn't, right? Repeated polling for more data is a bad design in this case, given that you will never be able to say with 100% certainty that no "new data" arrived in the time it took from fetching the last data until it was processed and acted on (displayed, used to buy shares on the stock market, stopped the train, or whatever it is that you want to do once you have processed your new data).
A read-write lock could help: many readers can have simultaneous access to the data set, but only one writer.
The idea is simple:
- when you need read-only access, the reader takes a read-lock, which can be shared with other readers but is exclusive with writers;
- when you need write access, the writer takes a write-lock, which is exclusive with both readers and writers.
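A minimal sketch of that idea, shown in Java with ReentrantReadWriteLock purely for illustration (in C++, std::shared_mutex with std::shared_lock/std::unique_lock plays the same role); the class and method names are made up for the example:

// Sketch only: a growing data set guarded by a read-write lock.
// Many readers may take snapshots concurrently; appends take the exclusive write lock.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GrowingDataSet<T> {

    private final List<T> items = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Writer: exclusive with both readers and other writers.
    public void append(T item) {
        lock.writeLock().lock();
        try {
            items.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader: shared with other readers, exclusive with writers.
    // Returns a snapshot of everything after the caller's last processed index
    // (pass -1 for a full snapshot), matching the "query from the last processed element" pattern above.
    public List<T> snapshotFrom(int lastProcessedIndex) {
        lock.readLock().lock();
        try {
            return new ArrayList<>(items.subList(lastProcessedIndex + 1, items.size()));
        } finally {
            lock.readLock().unlock();
        }
    }
}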