I'm designing a middle layer for an application that will receive up to ~5000 requests every few seconds and needs to retrieve information from a database. I've been looking at using the Play Framework (I use Scala for my REST API design), as it is said to be fully async and built on Akka. However, the main bottleneck of any solution seems to be reads and writes to the database. Many databases cannot support simultaneous reads and writes at such a scale. How is such high concurrency achieved for an app like this? I would guess Facebook, Twitter, and other big companies have achieved this for their applications, as millions of people may be using them concurrently.
As Tim's comment said, caching may or may not help in your case. If not, I would also recommend looking into horizontally scalable databases, for example CockroachDB if you want a transactional SQL database. Otherwise there are many NoSQL choices such as MongoDB. And if you really want to stick to traditional SQL systems, you'll have to vertically scale your servers (buy the most expensive hardware) and work with read replicas.
A huge component is your data model and query access pattern. If each query increments a shared counter that has to be synchronized, there will be a ton of contention. If, at the other end of the spectrum, each query touches completely separate data, there will be a lot less contention.
I think there are a couple of dimensions I would consider:
Data Schema and Access Patterns (discussed above)
Language Choice
This is important because if you are in a web server context using prefork, by default each process may have its own connection to the database. In an environment like Python or Ruby you may need hundreds of processes to handle your load. Contrast this with Akka or another runtime based on async networking (Node, Python gevent/asyncio, Go, etc.), where a single instance with a small thread pool can handle a large number of requests. Each has its tradeoffs.
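To make the async side of that contrast concrete, here is a minimal Python asyncio sketch in which thousands of in-flight requests share a small, bounded set of database connections. It is only an illustration: `fake_query` is a hypothetical stand-in for a real async driver call, and the pool size of 10 is an arbitrary assumption.

```python
import asyncio

POOL_SIZE = 10  # assumed number of database connections (arbitrary)


async def fake_query(request_id):
    """Hypothetical stand-in for an async database call."""
    await asyncio.sleep(0.001)  # simulated I/O wait
    return request_id * 2


async def handle_request(request_id, pool, stats):
    # The semaphore plays the role of a connection pool: at most
    # POOL_SIZE requests talk to the "database" at the same time.
    async with pool:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        try:
            return await fake_query(request_id)
        finally:
            stats["in_flight"] -= 1


async def serve(n_requests):
    pool = asyncio.Semaphore(POOL_SIZE)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(handle_request(i, pool, stats) for i in range(n_requests))
    )
    return list(results), stats["peak"]


def run_demo(n_requests=5000):
    """Runs all requests on one thread and reports peak DB concurrency."""
    return asyncio.run(serve(n_requests))
```

Calling `run_demo()` handles all requests on a single thread while never using more than `POOL_SIZE` simulated connections, which is the property a prefork setup with one connection per process does not give you.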
Distributed Systems
Depending on your data schema and access patterns, 5000 requests per second to an RDBMS is completely achievable. It would probably require relatively beefy hardware, but I've personally done it a number of times. Getting to larger scales requires more computers in order to distribute the load. If your workload is read heavy and you can tolerate potentially stale reads, a read replica is one option. With another machine in the mix, reads are distributed over two machines, but writes are still directed at a single machine (the leader). Caching is another option.
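The read-replica idea can be sketched as a tiny router: writes always go to the leader, reads rotate over all nodes. This is a simplified illustration, not a production router. The node names are hypothetical, and treating anything that starts with SELECT as a read is a deliberate simplification (for instance, it would misroute SELECT ... FOR UPDATE).

```python
import itertools


class ReplicatedRouter:
    """Routes writes to the leader and spreads reads over all nodes."""

    def __init__(self, leader, replicas):
        self.leader = leader
        # Reads rotate round-robin over the leader and every replica.
        self._read_nodes = itertools.cycle([leader, *replicas])

    def node_for(self, sql):
        # Simplification: only plain SELECT statements count as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_nodes) if is_read else self.leader
```

With `ReplicatedRouter("db-leader", ["db-replica-1"])`, successive SELECTs alternate between the two machines while every INSERT or UPDATE still lands on `db-leader`. Reads served by the replica may be stale, as noted above.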
At much higher workloads, some sort of partitioning needs to occur in order to overcome the constraints of a single machine. See for example Vitess: https://github.com/vitessio/vitess
Many of the big contenders have solutions for horizontally scaling their databases. This has many drawbacks as well and will require careful planning.
The one thing I'd recommend is that if 5000 requests per second is projected for the near future, you start with the minimal amount of hardware necessary (a single instance), because query patterns and operations get much more complicated with a distributed database.
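As a rough illustration of what partitioning means at the application level, here is a minimal hash-based shard picker. This is a generic sketch and not how Vitess itself assigns rows to shards; the key format and the shard count are assumptions.

```python
import hashlib


def shard_for(key, n_shards):
    """Maps a row key to a shard index in [0, n_shards)."""
    # A stable hash is used (not Python's built-in hash(), which is
    # randomized per process) so every app server picks the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

Every query for `"user:42"` then goes to the same machine, so single-key reads and writes stay cheap. Queries that span many keys have to fan out to several shards, and changing `n_shards` moves most keys, which is part of why the planning mentioned here matters.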
I am working on Distributed TensorFlow, in particular a Distributed TensorFlow implementation of the ReInspect model from the following repository: https://github.com/Russell91/TensorBox.
We are using the between-graph, asynchronous setting of Distributed TensorFlow, but the results are very surprising. While benchmarking, we found that distributed training takes more than twice as long as training on a single machine. Any leads about what could be happening and what else could be tried would be really appreciated. Thanks.
Note: There is a correction in the post, we are using between-graph implementation not in-graph implementation. Sorry for the mistake
In general, I wouldn't be surprised if moving from a single-process implementation of a model to a multi-machine implementation would lead to a slowdown. From your question, it's not obvious what might be going on, but here are a few general pointers:
If the model has a large number of parameters relative to the amount of computation (e.g. if it mostly performs large matrix multiplications rather than convolutions), then you may find that the network is the bottleneck. What is the bandwidth of your network connection?
Are there a large number of copies between processes, perhaps due to unfortunate device placement? Try collecting and visualizing a timeline to see what is going on when you run your model.
You mention that you are using "in-graph replication", which is not currently recommended for scalability. In-graph replication can create a bottleneck at the single master, especially when you have a large model graph with many replicas.
Are you using a single input pipeline across the replicas or multiple input pipelines? Using a single input pipeline would create a bottleneck at the process running the input pipeline. (However, with in-graph replication, running multiple input pipelines could also create a bottleneck as there would be one Python process driving the I/O with a large number of threads.)
Or are you using the feed mechanism? Feeding data is much slower when it has to cross process boundaries, as it would in a replicated setting. Using between-graph replication would at least remove the bottleneck at the single client process, but to get better performance you should use an input pipeline. (As Yaroslav observed, feeding and fetching large tensor values is slower in the distributed version because the data is transferred via RPC. In a single process these would use a simple memcpy() instead.)
How many processes are you using? What does the scaling curve look like? Is there an immediate slowdown when you switch to using a parameter server and single worker replica (compared to a single combined process)? Does the performance get better or worse as you add more replicas?
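For the first pointer above (network as the bottleneck), a back-of-the-envelope estimate can help. The sketch below assumes each training step pushes a full set of gradients and pulls a full set of parameters over one link. The 50 million float32 parameters and the 1 Gbit/s link are made-up example figures, not numbers from the question.

```python
def param_sync_seconds(n_params, bytes_per_param=4, bandwidth_gbit=1.0):
    """Lower bound on per-step transfer time between worker and parameter server."""
    # Gradients go out and fresh parameters come back: two full copies.
    bytes_moved = 2 * n_params * bytes_per_param
    bytes_per_second = bandwidth_gbit * 1e9 / 8  # Gbit/s -> bytes/s
    return bytes_moved / bytes_per_second


# Example figures (assumptions): 50M float32 parameters on a 1 Gbit/s link.
example_seconds = param_sync_seconds(50_000_000)
```

Under those assumptions `example_seconds` is 3.2, so if a single-machine step takes well under a second, the network alone could explain a slowdown of this kind. Plugging in your real parameter count and bandwidth is the useful part.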
I was looking at a similar thing recently, and I noticed that moving data from gRPC into the Python runtime is slower than expected. In particular, consider the following pattern:
add_op = params.assign_add(update)
...
sess.run(add_op)
If add_op lies on a different process, then sess.run adds a decoding step that runs at a rate of 50-100 MB/second.
Here's a benchmark and relevant discussion
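To get a feel for what that decoding rate costs per step, here is a small estimate. The 100 MB tensor size is an assumed example, not a figure from the question; only the 50-100 MB/second range comes from the observation above.

```python
def decode_seconds(tensor_bytes, rate_mb_per_s):
    """Time spent decoding a fetched tensor at the given rate."""
    return tensor_bytes / (rate_mb_per_s * 1e6)


# Assumed example: a 100 MB value returned by sess.run from another process.
slow_case = decode_seconds(100e6, 50)   # at 50 MB/second
fast_case = decode_seconds(100e6, 100)  # at 100 MB/second
```

For that assumed size the overhead is between 1 and 2 seconds per call, which is paid on every step that fetches the value back into Python.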
I am new to Akka and am trying to see whether it addresses the problems I am facing. I have data to extract from databases, transform with algorithms, and send between actors. This involves a lot of computing.
Can Akka handle all this (communication and computing)? Or do I have to call upon another tool to manage the computation part?
Thank you all.
Well, all I can offer here is my experience. As a matter of fact, I am currently working on something similar (i.e. an ETL with text files). We're essentially taking a lot of text files and loading their lines into a PostgreSQL database. This is our setup:
Intel Xeon 8 cores + SSD
Files and app on the same machine
Remote database
We're able to fetch, parse and load 26 million file lines and create specific database indices in about 12 minutes. That is about 1.3 GB worth of files and 3 GB in the database. On a much crappier single-core and HDD setup we can do it in about 40 minutes.
The good thing about Akka is that it will allow you to save up resources and scale more since several actors can share one thread.
Akka can easily handle many millions of message sends per second; there is an oldie but goodie on this topic in this letitcrash.com post. As long as you factor out blocking operations into separate dispatchers (thread pools), the actor model eases parallel computation a lot, which of course gives you nice wall-clock time in such data-crunching apps.
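The "keep blocking work on its own thread pool" idea is not specific to Akka. Below is a rough Python analogue, not Akka code: the event loop stays free while blocking calls run on a dedicated executor, which plays the role of a separate dispatcher. `blocking_load` is a hypothetical stand-in for something like a synchronous database driver call, and the pool size of 8 is arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_load(n):
    """Hypothetical blocking call, e.g. a synchronous database driver."""
    time.sleep(0.01)  # simulated blocking I/O
    return n * n


async def crunch(values, blocking_pool):
    loop = asyncio.get_running_loop()
    # Each blocking call is handed to the dedicated pool, so the event
    # loop itself is never stalled by time.sleep / driver I/O.
    jobs = [loop.run_in_executor(blocking_pool, blocking_load, v) for v in values]
    return await asyncio.gather(*jobs)


def run(values, workers=8):
    """Runs all blocking loads on a separate pool and collects the results."""
    with ThreadPoolExecutor(max_workers=workers) as blocking_pool:
        return asyncio.run(crunch(values, blocking_pool))
```

In Akka the equivalent step would be configuring a separate dispatcher for the actors that block, so the default dispatcher keeps processing messages.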
I'm trying to perform statistical analysis on relatively flat time series data with AWS Elastic MapReduce. AWS gives you the option of using Hive, Pig, or HBase for EMR jobs. Which one would be best for this type of analysis? I don't think the data analysis is going to be on the terabyte scale; items in my tables are mostly under 1K. I've also never used any of the three, but the learning curve shouldn't be an issue. I'm more concerned with what is going to be more efficient. I'm also handing this project off soon, so something that is relatively easy to understand for people with NoSQL experience would be nice, but I'm mostly looking to make the sensible choice for the data I have. An example query I might make is something like "Find all accounts between last week and today with an event value over 20 for each day".
IMHO, none of these. You use MR, Hive, Pig, etc. when your data is big, really big, and you are talking about a dataset which is not even on the ~TB scale. And you want your system to be efficient as well. In such a scenario, using these tools would be overkill. So the sensible choice for the data you have would be an RDBMS of your choice.
And if it is just for learning purposes, then use HDFS + Hive or Pig (depending on what suits you better).
In response to your comment:
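To show how small the example query from the question is in a plain RDBMS, here is a sketch using SQLite from Python. The `events(account, day, value)` schema and the sample rows are invented for illustration, and the query reads "over 20 for each day" as "the account has events on every day in the range and all of them are above 20", which is an assumption about what was meant.

```python
import sqlite3


def accounts_over_threshold(conn, start_day, end_day, n_days, threshold=20):
    """Accounts with events on all n_days in the range, every value > threshold."""
    rows = conn.execute(
        """
        SELECT account
        FROM events
        WHERE day BETWEEN ? AND ?
        GROUP BY account
        HAVING MIN(value) > ? AND COUNT(DISTINCT day) = ?
        ORDER BY account
        """,
        (start_day, end_day, threshold, n_days),
    ).fetchall()
    return [r[0] for r in rows]


def make_demo_db():
    """Builds a tiny in-memory table with made-up rows (ISO date strings)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (account TEXT, day TEXT, value REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
        ("a", "2024-01-01", 25), ("a", "2024-01-02", 30), ("a", "2024-01-03", 21),
        ("b", "2024-01-01", 25), ("b", "2024-01-02", 10), ("b", "2024-01-03", 40),
        ("c", "2024-01-01", 50), ("c", "2024-01-03", 50),
    ])
    return conn
```

With the demo rows, only account `a` qualifies for the three-day range: `b` dips to 10 on one day and `c` has no event on the middle day.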
If I had such a situation like this, I would use HDFS, to store my flat data, with Hive. The reason why I would go with Hive is that I don't see a lot of transformation kind of things going on here. So, yes, I would go with Hive. And, I don't really see any HBase need as of now. HBase is normally used when you need random real-time access to some part of your data. And if your use case really demands HBase you need to be careful while designing your schema since you are dealing with timeseries data.
But the decision on whether to use Hive or Pig needs some deeper analysis of the kind of operations you are going to perform on your data. You might find these links helpful:
http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
P.S.: You might want to have a look at the R project.
A short summary answer:
Hive is an easy "first option" for your data analysis, because it will use familiar SQL syntax. Because of this there are many convenient connectors to front end analysis tools: Excel, Tableau, Pentaho, Datameer, SAS, etc.
Pig is used more for ETL (transformation) of data incoming to Hadoop. Your data analysis may require some "transformation" of the data before it is stored in Hive. For example you may choose to strip out headers, apply information from other sources, etc. A good example of how this works is provided with the free Hortonworks sandbox tutorials.
HBase is more valuable when you're explicitly looking for a NoSQL store on top of Hadoop (example).
I have a question relating to databases and at what point it is worth diving into one. I am primarily an embedded engineer, but I am writing an application using Qt to interface with our controller.
We are at an odd point where we have enough data that it would be feasible to implement a database (around 700+ items and growing) to manage everything, but I am not sure it is worth the time right now to deal with. I have no problems implementing the GUI with files generated from Excel and parsed in, but it gets tedious and hard to track even with VBA scripts. I have been playing around with converting our data into something more manageable for the application side with Microsoft Access, and that seems to be working well. If that works out, I am only a step (or several) away from using an SQL database and using the Qt library to access and modify it.
I don't have much experience managing data at this level and am curious what may be the best way to approach this. So what are some of the real benefits of using a database if any in this case? I realize much of this can be very application specific, but some general ideas and suggestions on how to straddle the embedded/application programming line would be helpful.
This is not about putting a database in an embedded project. It is also not a business type application where larger databases are commonly used. I am designing a GUI for a single user on a desktop to interface with a micro-controller for monitoring and configuration purposes.
I decided to go with SQLite. You can do some very interesting things with data that I didn't really consider an option when first starting this project.
A database is worthwhile when:
Your application evolves to some form of data driven execution.
You're spending time designing and developing external data storage structures.
Sharing data between applications or organizations (including individual people)
The data is no longer short and simple.
Data Duplication
Evolution to Data Driven Execution
When the data is changing but the execution is not, this is a sign of a data driven program or parts of the program are data driven. A set of configuration options is a sign of a data driven function, but the whole application may not be data driven. In any case, a database can help manage the data. (The database library or application does not have to be huge like Oracle, but can be lean and mean like SQLite).
Design & Development of External Data Structures
Posting questions to Stack Overflow about serialization or converting trees and lists to use files is a good indication your program has graduated to using a database. Also, if you are spending any amount of time designing algorithms to store data in a file, or designing the layout of data in a file, it is a good time to research the usage of a database.
Sharing Data
Whether your application is sharing data with another application, another organization or another person, a database can assist. By using a database, data consistency is easier to achieve. One of the big issues in problem investigation is that teams are not using the same data. The customer may use one set of data; the validation team another and development using a different set of data. A database makes versioning the data easier and allows entities to use the same data.
Complex Data
Programs start out using small tables of hard coded data. This evolves into using dynamic data with maps, trees and lists. Sometimes the data expands from a simple two columns to 8 or more. Database theory and databases can ease the complexity of organizing data. Let the database worry about managing the data and free up your application and your development time. After all, how the data is managed is not as important as the quality of the data and its accessibility.
Data Duplication
Often times, when data grows, there is an ever growing attraction for duplicate data. Databases and database theory can minimize the duplication of data. Databases can be configured to warn against duplications.
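As a small illustration of a database guarding against duplicates, here is a sketch using SQLite from Python. The `parameters` table and its columns are hypothetical; the point is only that a UNIQUE constraint makes the database reject a second row with the same name.

```python
import sqlite3


def make_db():
    """Creates an in-memory table whose 'name' column must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parameters (name TEXT NOT NULL UNIQUE, value REAL)"
    )
    return conn


def insert_parameter(conn, name, value):
    """Returns True if stored, False if the name was already present."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO parameters (name, value) VALUES (?, ?)",
                (name, value),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by SQLite when the UNIQUE constraint is violated.
        return False
```

Inserting `"gain"` twice stores it once and reports the second attempt as a duplicate, without any duplicate-checking code in the application itself.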
Moving to using a database has many factors to be considered. Some include but are not limited to: data complexity, data duplication (including parts of the data), project deadlines, development costs and licensing issues. If your program can run more efficiently with a database, then do so. A database may also save development time (and money). There are other tasks that you and your application can be performing than managing data. Leave data management up to the experts.
What you are describing doesn't sound like a typical business application, and many of the answers already posted here assume that this is the kind of application you are talking about, so let me offer a different perspective.
Whether or not you use a database for 700 items is going to depend greatly on the nature of the data.
I would say that, about 90% of the time at this scale, you will benefit from a light-weight database like SQLite, provided that:
The data may potentially grow substantially larger than what you are describing,
The data may be shared by more than one user,
You may need to run queries against the data (which I don't think you're doing right now), and
The data can easily be described in table form.
The other 10% of the time, your data will be highly structured, hierarchical, and object-based, and won't fit neatly into the table model of a database or Excel table. If this is the case, consider using XML files.
I know developers instinctively like to throw databases at problems like this, but if you are currently using Excel data to design user interfaces (or display configuration settings), rather than display a customer record, XML may be a better fit. XML is more expressive than either Excel or database tables, and can be easily manipulated with a simple text editor.
XML parsers and data binders for C++ are easy to find.
I recommend introducing a database in your app. Your application will gain flexibility and will be easier to maintain and to extend with new features in the future.
I would start with a lightweight file-based database like SQLite.
With a well designed db you'll have:
Reduced data redundancy
Greater data integrity
Improved data security
Last but not least, using a database will save you from the Excel import/update/export Hell!
Reasons for using a database:
Concurrent writes. It's easy to achieve concurrency in databases
Easy querying. SQL queries tend to be much more concise than procedural code for searching data. UPDATE and INSERT INTO statements can also do a lot with very little code
Integrity. Constraints are very easy to define and are enforced without writing code. If you have a non-null constraint, you can rest assured that the value won't be null, no need to write checks anywhere. If you have a foreign key constraint in place, you won't have "dangling references".
Performance over large datasets. Indexing is very simple to add to an SQL database
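The integrity point in the list above can be shown in a few lines with SQLite from Python. The `device` and `setting` tables are invented for the example. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3


def open_db():
    """In-memory database with a NOT NULL and a foreign key constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript("""
        CREATE TABLE device (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE setting (
            id        INTEGER PRIMARY KEY,
            device_id INTEGER NOT NULL REFERENCES device(id),
            value     REAL NOT NULL
        );
    """)
    return conn


def try_insert(conn, sql, params):
    """Returns True if the row was accepted, False if a constraint refused it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

A device with a NULL name and a setting pointing at a non-existent device are both refused by the database, so no check for either case is needed in application code.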
Reasons for not using a database:
It tends to be an extra dependency (although there exist very lightweight databases- I like H2 for Java, for instance)
Data not well suited to a relational schema. Things that are basically key/value maps. XML (although databases often support XPath, etc.).
Sometimes files are more convenient. They can be diff'ed, merged, edited with a plain text editor, etc. Sometimes spreadsheets can be more practical (you don't have to build an editor- you can use a spreadsheet program)
Your data is already somewhere else
When you have a lot of data and you're not sure how it will be exploited in the future.
For example, you might want to add an SQLite database to an embedded application that needs to record statistics when you're not sure how they will be used. Later you can send the full database to be imported into a bigger one running on a central server, where the data can easily be exploited using queries.
In fact, if your application's purpose is to "gather data", then having a database is a must.
I see quite a few requirements that are well met by databases:
1). Ad hoc queries. Find me all the {X} that meet criteria Y
2). Data with structure that can benefit from normalisation - factoring out common values into separate "tables". You can save space and reduce the possibility of inconsistency this way. Once you've done this then those ad-hoc queries start to be really useful.
3). Large data volumes. Professional databases are very good at making good use of resources, clever query optimisations and paging strategies. Trying to write this stuff yourself is a real challenge.
You clearly don't need that last one, but the other two may well apply to you.
Don't forget that the appropriate database can be quite different depending on your requirements (and don't forget that a text file could be used as a database if your requirements are simple enough - for example, config files are just a specific kind of database). Such parameters might be:
number of records
size of data items
does the database need to be shared with other devices? Concurrently?
how complex are the relationships between the various pieces of data
is the database read only (created at build time and not changed, for example)?
does the database need to be updated by multiple entities concurrently?
do you need to support complex queries?
For a database with 700 entries, an in-memory sorted array loaded from a text file might well be appropriate. But I could also see the need for an embedded SQL database or maybe having the controller request data from the database over a network connection depending on what the various requirements (and resource limitations) are.
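For the "in-memory sorted array loaded from a text file" option, a minimal sketch could look like the following. The `name=value` line format is an assumption made for the example; any iterable of lines (such as an open file) works as input.

```python
import bisect


def load_table(lines):
    """Parses 'name=value' lines into a list of (name, value) sorted by name."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        entries.append((name.strip(), value.strip()))
    entries.sort()
    return entries


def lookup(entries, name):
    """Binary search by name; returns the value or None if absent."""
    # (name,) sorts just before any (name, value) pair with the same name.
    i = bisect.bisect_left(entries, (name,))
    if i < len(entries) and entries[i][0] == name:
        return entries[i][1]
    return None
```

With a few hundred entries this is a handful of comparisons per lookup and no extra dependency. The cost is that anything beyond lookup by name (filtering, joins, concurrent updates) has to be written by hand.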
There isn't a specific point at which a database is worthwhile. Instead I usually ask the following questions:
Is the amount of data the application uses/creates growing?
Is the upper limit of this data growth unknown (or unclear)?
Will the application need to aggregate or filter this data?
Could there be future uses of the data that may not be obvious right now?
Is performance of data retrieval and/or storage important?
Are there (or could there be) multiple users of the application who share data?
If I answer 'Yes' to most of these questions I almost always choose a database (as opposed to other options such as XML/ini/CSV/Excel/text files or the filesystem).
Also, if the application will have many users who could be accessing the data concurrently, I'll lean towards a full database server (MySQL, SQL Server, Oracle, etc.).
But often in a single user (or small concurrency) situation, a local database such as SQLite cannot be beaten for portability and ease of deployment.
To add a negative: a database is not suitable for real-time processing, due to non-deterministic latency. However, it would be quite ample for looking up and setting operating parameters, for instance during startup. I would not put database accesses on time-critical paths.
You don't need a database if you have a few thousand rows in one or two tables to handle in a single user app (for the embedded point).
If it is for multiple users (concurrent access, locking) or you need transactions, you definitely should consider a database.
Handling complex data structures in normalized tables while maintaining integrity, or a huge amount of data, would be another indication you should use a database.
It sounds like your application is running on a desktop computer and simply communicating to the embedded device.
As such using a database is much more feasible. Using one on an embedded platform is a much more complex issue.
On the desktop front, I use a database when there is a need to store new information continuously and to extract that information in a relational way. What I don't use databases for is storing static information, that is, information I read once at load and that's it. The exception is when the application has many users and there is a need to store this information on a per-user basis.
It sounds to me like you're collecting information from your embedded device, storing it somehow, then using it later to display via a GUI.
This is a good case for using a database, especially if you can architect the system such that there is a data collection daemon that manages the continuous communication with the embedded device. This app can then just write the data into the database. When the GUI is launched it can extract the data for display.
Using the database will also ease your GUI development if you need to display different views, such as "show me all the entries between 2 dates". With a database you just ask it for the correct values to display with a proper SQL query, and the GUI displays whatever comes back, allowing you to decouple much of the "business logic" code from the GUI.
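The "show me all the entries between 2 dates" view could be as small as this sketch, written here against SQLite from Python. The `readings(ts, name, value)` table and its rows are made up, and timestamps are stored as ISO 8601 strings so that plain string comparison matches chronological order.

```python
import sqlite3


def entries_between(conn, start, end):
    """Returns (ts, name, value) rows with start <= ts <= end, oldest first."""
    return conn.execute(
        "SELECT ts, name, value FROM readings "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    ).fetchall()


def make_readings_db():
    """Builds a small in-memory table with invented sample readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts TEXT, name TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
        ("2024-03-01T08:00:00", "temp", 21.5),
        ("2024-03-02T08:00:00", "temp", 22.0),
        ("2024-03-05T08:00:00", "temp", 19.0),
    ])
    return conn
```

The GUI code only calls `entries_between` and renders the rows. Changing the view to another date range, or adding a filter on `name`, is a change to the query rather than to the display code.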
We are also facing a similar situation. We have sets of data coming from different test setups, which are currently dumped into Excel sheets and processed using Perl or VBA.
We found that this method had a lot of problems:
i. Managing data using Excel sheets is quite cumbersome. After some time you have a whole lot of Excel sheets and there is no easy way to retrieve the required data from them.
ii. People start sending the Excel sheets to and fro for comments and review through e-mails. E-mail becomes the primary mode of managing the comments related to the data. These comments are lost at a later point in time and there is no way of retrieving them.
iii. Multiple copies of the files get created and changes in one copy are not reflected in the other - there is no versioning.
For these reasons we have decided to move to a database-based solution and are currently working on it. Let me summarise what we are trying to do:
i. The database is on a central server accessible by the PCs in all the test setups.
ii. All the data goes into a temporary location (local hard disk in files) as soon as it is generated. From the files, it is pushed into the database by a process running in the background (so even if there is a network problem, data will be present in the local file system).
iii. We have a web-based application which allows users to log in and access data in the format they want. The portal will allow them to add comments, generate different kinds of reports, share them with other users after review, etc. It will also have the ability to export data into Excel sheets, just in case you need to take it with you.
Let me know if this can be implemented better.
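Step ii above (write locally first, push to the database in the background) can be sketched as follows. This is only an illustration of the idea, not our actual implementation: the JSON-lines spool format, the `results` table and the SQLite target are all assumptions, and a real version would also need to handle records that arrive while a flush is in progress.

```python
import json
import os
import sqlite3
import tempfile


def spool(path, record):
    """Appends one measurement to the local spool file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def flush(path, conn):
    """Pushes spooled records into the database; keeps the file on failure."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    with conn:  # one transaction: either all rows are stored or none
        conn.executemany(
            "INSERT INTO results (setup, value) VALUES (:setup, :value)",
            records,
        )
    os.remove(path)  # only reached if the commit succeeded
    return len(records)


def demo():
    """Spools two made-up records, flushes them, and reports what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "spool.jsonl")
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE results (setup TEXT, value REAL)")
        spool(path, {"setup": "rig-1", "value": 3.2})
        spool(path, {"setup": "rig-2", "value": 4.1})
        pushed = flush(path, conn)
        stored = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
        return pushed, stored, os.path.exists(path)
```

If the insert fails (for example because the network is down), the exception leaves the spool file in place, so the next `flush` call retries with the same records.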
"At what point is it worth using a database?"
If and when you've got data to manage?