Cam Sagemaker dynamically allocate resources based on the load? For example, how easy would it be to run 5000 models in parallel? How does parameter tuning come into play?
You can certainly tune many models in parallel. You can set "Maximum number of parallel training jobs" for hyper-parameter tuning, and many jobs will run at the same time
Parallel tuning might not be the best idea however. Bayesian learning can be used to speed up tuning, but the next step relies on the previous step, making parallel's usefulness limited for Bayesian tuning. See https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-considerations.html#automatic-model-tuning-parallelism
In terms of training, there is no built-in support for scaling. You can run a single training job on many machines (see "instance count"), or run many training jobs in parallel (by making many API calls), or use something like https://automl.github.io/auto-sklearn/master/, which would run many models in one training job
I'm working to design a middle layer for an application that will receive up to ~5000 requests every few seconds and need to retrieve information from a database. I've been looking at use the Play Framework (I use scala for my REST api design) as they say its fully async and built on Akka. However, the main bottleneck of any solution seems to happen during read/writes to the database. Many Database cannot support simultaneous read/writes from a database of such a scale. How is such high concurrency achieved then for an app like this? I would guess Facebook/Twitter/ (name other big company) may have achieved this for their Applications as millions of people may be using them concurrently.
As Tim's comment was saying caching may or may not be able to help in your case. If not I would also recommend looking into horizontally scalable databases, for example cockroachdb if you want a transactional SQL db. Otherwise there are many no-sql choices a la mongodb etc. And if you really want to stick to traditional SQL systems you'll have to vertically scale your servers (buy the most expensive hardware) and work with read-replicas.
A huge component is your data model and query access pattern. If each query is incrementing a shared counter that has to be synchronized there will be a ton of contention, but if each query is touch completely separate data on the other end the spectrum than there will be a lot less contention.
I think there are a couple of dimensions I would consider:
Data Schema and Access Patterns (discussed above)
Language Choice
This is important becaues if you were in a web server context and were using prefork by default each process may have its own connection to the database. In an environment like python or ruby you may need hundreds of processes to handle your load. Contrast this with akka or another async networking based runtime (node, python gevent/asyncio, go, etc) where a single instance with a small thread pool can handle a large number of requests. Each have their tradeoffs.
Distributed Systems
Depending on your data schema and access patterns 5000 requests per second to a RDBMS is completely achievable. It would probably require relatively beefy hardware but but I'v personally done it a number of times. Getting to larger scales requires more computers in order to distribute the work/load. If your workload is right heavy and you can support potentially stale reads, a read replica is one option. With another machine in the mix reads are distributed over 2 machines but writes are still directed at a single machine (leader). Caching is another option.
At much higher workloads some sort of partitioning needs to occur in order to overcome the constraints of a single machine. https://github.com/vitessio/vitess
Many of the big contenders have solutions to horizontally scaling their databases. This has many drawbacks as well and will require careful planning.
The one thing I'd recommend is that if 5000 requests per second is projected for the near future, start with the minimal amount of hardware necessary (single instance) query patterns and operation get exponentially more complicated with a distributed database.
I'm training a model for recognizing short, one to three sentence strings of text using the MITIE back-end in Rasa. The model trains and works using spaCy, but it isn't quite as accurate as I'd like. Training on spaCy takes no more than five minutes, but training for MITIE ran for several days non-stop on my computer with 16GB of RAM. So I started training it on an Amazon EC2 r4.8xlarge instance with 255GB RAM and 32 threads, but it doesn't seem to be using all the resources available to it.
In the Rasa config file, I have num_threads: 32 and set max_training_processes: 1, which I thought would help use all the memory and computing power available. But now that it has been running for a few hours, CPU usage is sitting at 3% (100% usage but only on one thread), and memory usage stays around 25GB, one tenth of what it could be.
Do any of you have any experience with trying to accelerate MITIE training? My model has 175 intents and a total of 6000 intent examples. Is there something to tweak in the Rasa config files?
So I am going to try to address this from several angles. First specifically from the Rasa NLU angle the docs specifically say:
Training MITIE can be quite slow on datasets with more than a few intents.
and provide two alternatives:
Use the mite_sklearn pipeline which trains using sklearn.
Use the MITIE fork where Tom B from Rasa has modified the code to run faster in most cases.
Given that you're only getting a single cores used I doubt this will have an impact, but it has been suggested by Alan from Rasa that num_threads should be set to 2-3x your number of cores.
If you haven't evaluated both of those possibilities then you probably should.
Not all aspects of MITIE are multi-threaded. See this issue opened by someone else using Rasa on the MITIE GitHub page and quoted here:
Some parts of MITIE aren't threaded. How much you benefit from the threading varies from task to task and dataset to dataset. Sometimes only 100% CPU utilization happens and that's normal.
Specifically on training data related I would recommend that you look at the evaluate tool recently introduced into the Rasa repo. It includes a confusion matrix that would potentially help identify trouble areas.
This may allow you to switch to spaCy and use a portion of your 6000 examples as an evaluation set and adding back in examples to the intents that aren't performing well.
I have more questions on where the 6000 examples came from, if they're balanced, and how different each intent is, have you verified that words from the training examples are in the corpus you are using, etc but I think the above is enough to get started.
It will be no surprise to the Rasa team that MITIE is taking forever to train, it will be more of a surprise that you can't get good accuracy out of another pipeline.
As a last resort I would encourage you to open an issue on the Rasa NLU GitHub page and and engage the team there for further support. Or join the Gitter conversation.
I am new to akka and am trying to see if it answers the problematics i am facing. I have data from databases to extract, transform with algorithms and send by and to actors. This involves a lot of computing.
Can akka handle all this (communication and computing)? Or do i have to call upon another tool to manage the calculus part?
Thank you all.
wip
Well, all I can offer here is my experience. As a matter of fact I am currently working on something similar (i.e an ETL with text files). We're essentially taking a lot of text files and loading their lines up into a PostgreSQL database. This is our setup :
Intel Xeon 8 cores + SSD
Files and app on the same machine
Remote database
We're able to fetch, parse and load 26 millions file lines and creating specific database indices in about 12 minutes, which is about 1.3GB worth of files and 3GB in database. On a much crappier mono-core and HDD setup we can do it in about 40 minutes.
The good thing about Akka is that it will allow you to save up resources and scale more since several actors can share one thread.
Akka can easily handle many millions of message sends per second, oldie but goodie on this topic here in this letitcrash.com post. As long as you factor out blocking operations in separate dispatchers (thread pools) the actor model eases parallel computations a lot, which of course gives you nice wall-clock-time in such data crunching apps.
I am working on a project that deals with analyzing a very large amount of data, so I discovered MapReduce fairly recently, and before i dive any further into it, i would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here, i am thinking a 10-15 second limit. Assuming my data will be loaded into a distributed file system before i perform any analysis on it, what kind of a performance can i expect from it?
Let's say I need to filter a simple 5GB XML file that is well formed, has a fairly flat data structure and 10,000,000 records in it. And let's say the output will result in 100,000 records. Is 10 seconds possible?
If it, what kind of hardware am i looking at?
If not, why not?
I put the example down, but now wish that I didn't. 5GB was just a sample that i was talking about, and in reality I would be dealing with a lot of data. 5GB might be data for one hour of the day, and I might want to identify all the records that meet a certain criteria.
A database is really not an option for me. What i wanted to find out is what is the fastest performance i can expect out of using MapReduce. Is it always in minutes or hours? Is it never seconds?
MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the overhead of startup usually takes a couple of minutes alone. The idea here is to take a processing job that would take days and bring it down to the order of hours, or hours to minutes, etc. But you would not start a new job in response to a web request and expect it to finish in time to respond.
To touch on why this is the case, consider the way MapReduce works (general, high-level overview):
A bunch of nodes receive portions of
the input data (called splits) and do
some processing (the map step)
The intermediate data (output from
the last step) is repartitioned such
that data with like keys ends up
together. This usually requires some
data transfer between nodes.
The reduce nodes (which are not
necessarily distinct from the mapper
nodes - a single machine can do
multiple jobs in succession) perform
the reduce step.
Result data is collected and merged
to produce the final output set.
While Hadoop, et al try to keep data locality as high as possible, there is still a fair amount of shuffling around that occurs during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data such that web queries can be fast BECAUSE they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.
A distributed implementation of MapReduce such as Hadoop is not a good fit for processing a 5GB XML
Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your xml is trivially flat, the splitting of the file will be non deterministic so you'll need a pre processing step to format the file for splitting.
If you had many 5GB files, then you could use hadoop to distribute the splitting. You could also use it to merge results across files and store the results in a format for fast querying for use by your web interface as other answers have mentioned.
MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually, the job control, network, data replication, and fault tolerance features of a MapReduce framework makes it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.
The MapReduce paradigm might be useful to you if your tasks can be split among indepdent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.
There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).
Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.
Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.
I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.
Note that XML is very verbose. One of your main cost will reading/writing to disk. If you can, chose a more compact format.
The number of nodes in your cluster will depend on type and number of disks and CPU. You can assume for a rough calculation that you will be limited by disk speed. If your 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need 1000 nodes.
You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.
It sounds like what you might want is a good old fashioned database. Not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either just import your 5GB file into a SQL database, or you could implement your own indexing scheme yourself, by either storing records in different files, storing everything in memory in a giant hashtable, or whatever is appropriate for your needs.