Can you eagerly load all data into memory with Datomic?

Can you eagerly load all data into memory with Datomic? - clojure

I've previously asked this question on how to improve performance with Datomic but I have yet to find a good solution. One thing that struck me was that when using the Datomic Console to execute the query I got the impression that the query was MUCH faster. But I also noticed a great increase of startup time and memory consumption when using the Datomic Console compared to when I start my application standalone. This to me implies that Datomic Console pulls all data into memory before I explore the contents.
Am I right that this is the case?
If so, is this something I could do myself programmatically from a peer?
If (2) then how can this be done in Clojure?

As described here in the Datomic Documentation, the Peer Library loads index segments in the (in-process) Object Cache when it fetches them for querying.
Am I right that this is the case?
I doubt that the Datomic Console explicitly chooses to pull all datoms into memory, but it is possible that the Datomic console eagerly traverses a large chunk of your data in order to show its dashboard.
If so, is this something I could do myself programmatically from a peer?
Well, I guess you could always artificially scan through all the segments. One easy way to do this is via the Datoms API.
If (2) then how can this be done in Clojure?
(defn scan-whole-db [db]
(doseq [index [:eavt :aevt :avet :vaet]]
(dorun (seq (d/datoms db index)))))
That all being said, I'm not sure at all you should expect performance improvements from this strategy. Your Object Cache had better be large enough!

Related

Save RocksDB stored values between runs

My C++ app is using RocksDB to store in-memory key-value sets.
At some points, I want my app to be able to keep the DB values until its next run. Meaning, the program will shut down, start again and read the same values from the DB as it had before it shut down.
What would be the quickest and simplest way to achieve this?
I found the following article for backup & restore routine - https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F, but maybe its an overkill?

Adding to what yinqiwen said, RocksDB was not meant to be just an in memory data store. It works really well with a variety of storage types. And it is especially good in terms of performance when it comes to flash storage. You may use a variety of RocksDB Options to experiment with what configuration is best to use for your workload, but for most cases, even with the default settings for the persistent storage types, rocks db should work just fine.

rocksdb already provide some ways to persist in-memory RocksDB database. u can see this link to conigure your rocksdb. http://rocksdb.org/blog/245/how-to-persist-in-memory-rocksdb-database/

How to profile an openedge database?

Is there a Progress profiling tool that allows me to see the queries executing against an OpenEdge database?
We're doing a migration from an OpenEdge database into a SQL database. In order to map the data correctly we'd like to run certain application reports on the OpenEdge database and see what database queries are being executed to retrieve the data.
Is this possible with some kind of Progress profiling tool (a la SQL Server Profiling)? Preferably free...

Progress is record oriented, not set oriented like SQL, so your reports aren't a single query or a set of queries, it is more likely a lot of record lookups combined with what you'd consider query-like operations.
Depending on the version you're running, there is a way to send a signal to the client to see what it is currently doing, however doing so will almost certainly not give you enough information to discern what's going on "under the hood."
Long story short, your options are to get a Dataserver product so you can attach the Progress client to an SQL database - this will enable you to use an SQL database w/out losing the Progress functionality. The second option is to get a copy of the program's source code to find out how the reports are structured.

Tim is quite right -- without the source code, looking at the queries is unlikely to provide you with much insight.
None the less there are some tools and capabilities that will provide information about queries. Probably the most useful for your purpose would be to specify something similar to:
-logentrytypes QryInfo -logginglevel 3 -clientlog "mylog.log"
at session startup.

You can use session triggers to identify almost anything done by any program, without modifying or having access to the source of those programs. Setting this up may be more work than is worth it for your purpose. We have a testing system built around this idea. One big flaw: triggers cannot be fired for CAN-FIND.

SQL Query minimizing/caching in a C++ application

I'm writing a project in C++/Qt and it is able to connect to any type of SQL database supported by the QtSQL (http://doc.qt.nokia.com/latest/qtsql.html). This includes local servers and external ones.
However, when the database in question is external, the speed of the queries starts to become a problem (slow UI, ...). The reason: Every object that is stored in the database is lazy-loaded and as such will issue a query every time an attribute is needed. On average about 20 of these objects are to be displayed on screen, each of them showing about 5 attributes. This means that for every screen that I show about 100 queries get executed. The queries execute quite fast on the database server itself, but the overhead of the actual query running over the network is considerable (measured in seconds for an entire screen).
I've been thinking about a few ways to solve the issue, the most important approaches seem to be (according to me):
Make fewer queries
Make queries faster
Tackling (1)
I could find some sort of way to delay the actual fetching of the attribute (start a transaction), and then when the programmer writes endTransaction() the database tries to fetch everything in one go (with SQL UNION or a loop...). This would probably require quite a bit of modification to the way the lazy objects work but if people comment that it is a decent solution I think it could be worked out elegantly. If this solution speeds up everything enough then an elaborate caching scheme might not even be necessary, saving a lot of headaches
I could try pre-loading attribute data by fetching it all in one query for all the objects that are requested, effectively making them non-lazy. Of course in that case I will have to worry about stale data. How would I detect stale data without at least sending one query to the external db? (Note: sending a query to check for stale data for every attribute check would provide a best-case 0x performance increase and a worst-caste 2x performance decrease when the data is actually found to be stale)
Tackling (2)
Queries could for example be made faster by keeping a local synchronized copy of the database running. However I don't really have a lot of possibilities on the client machines to run for example exactly the same database type as the one on the server. So the local copy would for example be an SQLite database. This would also mean that I couldn't use an db-vendor specific solution. What are my options here? What has worked well for people in these kinds of situations?
Worries
My primary worries are:
Stale data: there are plenty of queries imaginable that change the db in such a way that it prohibits an action that would seem possible to a user with stale data.
Maintainability: How loosely can I couple in this new layer? It would obviously be preferable if it didn't have to know everything about my internal lazy object system and about every object and possible query
Final question
What would be a good way to minimize the cost of making a query? Good meaning some sort of combination of: maintainable, easy to implement, not too aplication specific. If it comes down to pick any 2, then so be it. I'd like to hear people talk about their experiences and what they did to solve it.
As you can see, I've thought of some problems and ways of handling it, but I'm at a loss for what would constitute a sensible approach. Since it will probable involve quite a lot of work and intensive changes to many layers in the program (hopefully as few as possible), I thought about asking all the experts here before making a final decision on the matter. It is also possible I'm just overlooking a very simple solution, in which case a pointer to it would be much appreciated!
Assuming all relevant server-side tuning has been done (for example: MySQL cache, best possible indexes, ...)
*Note: I've checked questions of users with similar problems that didn't entirely satisfy my question: Suggestion on a replication scheme for my use-case? and Best practice for a local database cache? for example)
If any additional information is necessary to provide an answer, please let me know and I will duly update my question. Apologies for any spelling/grammar errors, english is not my native language.
Note about "lazy"
A small example of what my code looks like (simplified of course):
QList<MyObject> myObjects = database->getObjects(20, 40); // fetch and construct object 20 to 40 from the db
// ...some time later
// screen filling time!
foreach (const MyObject& o, myObjects) {
o->getInt("status", 0); // == db request
o->getString("comment", "no comment!"); // == db request
// about 3 more of these
}

At first glance it looks like you have two conflicting goals: Query speed, but always using up-to-date data. Thus you should probably fall back to your needs to help decide here.
1) Your database is nearly static compared to use of the application. In this case use your option 1b and preload all the data. If there's a slim chance that the data may change underneath, just give the user an option to refresh the cache (fully or for a particular subset of data). This way the slow access is in the hands of the user.
2) The database is changing fairly frequently. In this case "perhaps" an SQL database isn't right for your needs. You may need a higher performance dynamic database that pushes updates rather than requiring a pull. That way your application would get notified when underlying data changed and you would be able to respond quickly. If that doesn't work however, you want to concoct your query to minimize the number of DB library and I/O calls. For example if you execute a sequence of select statements your results should have all the appropriate data in the order you requested it. You just have to keep track of what the corresponding select statements were. Alternately if you can use a looser query criteria so that it returns more than one row for your simple query that ought to help performance as well.

Desktop App w/ Database - How to handle data retrieval?

Imagine to have a Desktop application - could be best described as record keeping where the user inserts/views the records - that relies on a DB back-end which will contain large objects' hierarchies and properties. How should data retrieval be handled?
Should all the data be loaded at start-up and stored in corresponding Classes/Structures for later manipulation or should the data be retrieved only at need, stored in mock-up Classes/Structures and then reused later instead of being asked to the DB again?
As far as I can see the former approach would require a bigger memory portion used and possible waiting time at start-up (not so bad if a splash screen is displayed), while the latter could possibly subject the user to delays during processing due to data retrieval and would require to perform some expensive queries on the database, whose results and/or supporting data structures will most probably serve no purpose once used*.
Something tells me that the solution lies on an in-depth analysis which will lead to a mixture of the two approaches listed above based on data most frequently used, but I am very interested in reading your thoughts, tips and real life experiences on the topic.
For discussion's sake, I'm thinking about C++ and SQLite.
Thanks!
*assuming that you can perform on Classes/Objects faster operations rather than have to perform complicated queries on the DB.
EDIT
Some additional details:
No concurrent access to the data, meaning only 1 user works on the data which is stored locally.
Data is sent back depending on changes made humanly - i.e. with low frequency. This is not necessarily true for reading data from the DB, where I can expect to have few peaks of lots of reads which I'd like to be fast.
What I am most afraid of is the user getting the feeling of slowness when displaying a complex record (because this has to be read in from the DB).

Use Lazy Load and Data Mapper (pg.165) patterns.

I think this question depends on too many variables to be able to give a concrete answer. What you should consider first is how much data you need to read from the database in to your application. Further, how often are you sending that data back to the database and requesting new data? Also, will users be working on the data concurrently? If so, loading the data initially is probably not a good idea.
After your edits I would say it's probably better to leave the data at the database. If you are going to be accessing it with relatively low frequency there is no reason to load up or otherwise try to cache it in your application at launch. Of course, only you know your application best and should decide what bits may be loaded up front to increase performance.

You might consider to user intermediate server (WCF) that will contain cached data from the database in memory, this way users don't have to go every time to the database. Also since it is only one access point to for all users if somebody changes/added record you can update cache as well. Static data can be reloaded every x hours (for example every hour). It still might not the best option, since data needs to be marshaled from Server to the Client, but you can use netTcp binding if you can, which is fast and small.

Is MapReduce right for me?

I am working on a project that deals with analyzing a very large amount of data, so I discovered MapReduce fairly recently, and before i dive any further into it, i would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here, i am thinking a 10-15 second limit. Assuming my data will be loaded into a distributed file system before i perform any analysis on it, what kind of a performance can i expect from it?
Let's say I need to filter a simple 5GB XML file that is well formed, has a fairly flat data structure and 10,000,000 records in it. And let's say the output will result in 100,000 records. Is 10 seconds possible?
If it, what kind of hardware am i looking at?
If not, why not?
I put the example down, but now wish that I didn't. 5GB was just a sample that i was talking about, and in reality I would be dealing with a lot of data. 5GB might be data for one hour of the day, and I might want to identify all the records that meet a certain criteria.
A database is really not an option for me. What i wanted to find out is what is the fastest performance i can expect out of using MapReduce. Is it always in minutes or hours? Is it never seconds?

MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the overhead of startup usually takes a couple of minutes alone. The idea here is to take a processing job that would take days and bring it down to the order of hours, or hours to minutes, etc. But you would not start a new job in response to a web request and expect it to finish in time to respond.
To touch on why this is the case, consider the way MapReduce works (general, high-level overview):
A bunch of nodes receive portions of
the input data (called splits) and do
some processing (the map step)
The intermediate data (output from
the last step) is repartitioned such
that data with like keys ends up
together. This usually requires some
data transfer between nodes.
The reduce nodes (which are not
necessarily distinct from the mapper
nodes - a single machine can do
multiple jobs in succession) perform
the reduce step.
Result data is collected and merged
to produce the final output set.
While Hadoop, et al try to keep data locality as high as possible, there is still a fair amount of shuffling around that occurs during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data such that web queries can be fast BECAUSE they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.

A distributed implementation of MapReduce such as Hadoop is not a good fit for processing a 5GB XML
Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your xml is trivially flat, the splitting of the file will be non deterministic so you'll need a pre processing step to format the file for splitting.
If you had many 5GB files, then you could use hadoop to distribute the splitting. You could also use it to merge results across files and store the results in a format for fast querying for use by your web interface as other answers have mentioned.

MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually, the job control, network, data replication, and fault tolerance features of a MapReduce framework makes it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.
The MapReduce paradigm might be useful to you if your tasks can be split among indepdent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.
There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).
Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.
Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.

I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.
Note that XML is very verbose. One of your main cost will reading/writing to disk. If you can, chose a more compact format.
The number of nodes in your cluster will depend on type and number of disks and CPU. You can assume for a rough calculation that you will be limited by disk speed. If your 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need 1000 nodes.
You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.

It sounds like what you might want is a good old fashioned database. Not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either just import your 5GB file into a SQL database, or you could implement your own indexing scheme yourself, by either storing records in different files, storing everything in memory in a giant hashtable, or whatever is appropriate for your needs.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js