Large data processing technology & books [closed] - c++

I am looking for good resources on how to query large volumes of data efficiently.
Each data item is represented by many different attributes, such as quantity, price, and history info. The client will provide different query criteria, but there is no requirement to change the dataset. Simply storing everything in MS SQL is not a good approach because MS SQL does not scale well at this size. We are targeting many terabytes of data and clusters of 200-300 CPUs.
I am interested in good resources or books with which I can at least do some research.

Did you consider a NoSQL solution such as MongoDB?

If query speed is not your number one issue, you should see whether you could build a solution with ROOT, possibly in conjunction with PROOF. In contrast to a NoSQL solution, here you would trade some speed for consistency.
ROOT is used by the CERN experiments to store and retrieve their experimental data (much more than you require), and if you can find a way to handle the I/O it can be made to scale pretty well.
I have heard it is also used by some firms doing quantitative finance.
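To give a flavour (this is only a sketch; the file, tree, and branch names are placeholders I made up, and the loop stands in for your real data feed), storing per-item attributes in a ROOT TTree looks roughly like this:

    #include "TFile.h"
    #include "TTree.h"

    int main() {
        // Placeholder file and tree names; one tree entry per data item.
        TFile file("items.root", "RECREATE");
        TTree tree("items", "inventory items");

        double quantity = 0.0, price = 0.0;
        tree.Branch("quantity", &quantity, "quantity/D");
        tree.Branch("price", &price, "price/D");

        for (int i = 0; i < 1000; ++i) {   // stand-in for the real data feed
            quantity = i;
            price = 0.5 * i;
            tree.Fill();                   // append one entry to the tree
        }

        tree.Write();                      // flush the tree to the file
        return 0;
    }

Queries then run over the branches you care about, and PROOF can spread that work across a cluster.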


test/training effect on classifiers results [closed]

I'm struggling to understand the effect of the training/test split on my correctly classified instances result.
For example, with naive Bayes, if I allocate more data to the test set in a percentage split, does the algorithm become more reliable?
The point of splitting your entire data set into training and test is that the model you want to learn (naive Bayes or otherwise) should reflect the true relationship between cause and effect (features and prediction) and not simply the data. For example, you can always fit a curve perfectly to a number of data points, but doing that will likely make it useless for the prediction you were trying to make.
By using a separate test set, the learned model is tested on unseen data. Ideally, the error (or whatever you're measuring) on training and test set would be about the same, suggesting that your model is reasonably general and not overfit to the training data.
If, in your case, decreasing the size of the training set increases performance on the test set, it suggests that the learned model is too specific and cannot be generalised. Instead of changing the training/test split, however, you should tweak the parameters of your learner. You might also want to consider using cross-validation instead of a simple training/test split, as it will provide more reliable performance estimates.
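To illustrate the cross-validation idea (a minimal sketch; the fold count, seed, and index-only representation are my own simplifications), a k-fold split over example indices could look like this:

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    // Produce k (train, test) index partitions for k-fold cross-validation.
    std::vector<std::pair<std::vector<int>, std::vector<int>>>
    kfold(int n, int k, unsigned seed = 42) {
        std::vector<int> idx(n);
        std::iota(idx.begin(), idx.end(), 0);          // 0, 1, ..., n-1
        std::shuffle(idx.begin(), idx.end(), std::mt19937(seed));

        std::vector<std::pair<std::vector<int>, std::vector<int>>> folds;
        for (int f = 0; f < k; ++f) {
            std::vector<int> train, test;
            for (int i = 0; i < n; ++i) {
                // Every example lands in the test set of exactly one fold.
                (i % k == f ? test : train).push_back(idx[i]);
            }
            folds.emplace_back(std::move(train), std::move(test));
        }
        return folds;
    }

Averaging the test error over the k folds gives a more stable estimate than a single percentage split.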

Generating word library - C or C++ [closed]

I need to create a simple application, but speed is very important here. The application is pretty simple.
It will generate all possible character combinations and save them to a text file. The user will enter the length to be used for generation, so the application will use a recursive function with a loop inside.
Will C be faster than C++ in this matter, or does it not matter?
Speed is very important because the application may need to generate and save 10 million+ words to file.
It doesn't really matter; chances are your application will be I/O bound rather than CPU bound, unless you have enough RAM to hold all of that in memory.
It's much more important that you choose the best algorithm, and the best data structures to back that algorithm up.
Then implement it in the language you're most familiar with. C++ has the advantage of easy-to-use containers in its standard library, but that's about it. You can write slow code in both, and fast code in both.
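For what it's worth, here is a minimal sketch of the recursive generator the question describes, with buffered output since the job is likely I/O bound; the alphabet and word length are placeholders:

    #include <cstdio>
    #include <string>

    // Recursively emit every string of the given length over `alphabet`,
    // one per line, using a buffered FILE* to keep the writes cheap.
    void generate(std::string& word, size_t pos, const std::string& alphabet,
                  std::FILE* out) {
        if (pos == word.size()) {                 // word is complete: write it
            std::fwrite(word.data(), 1, word.size(), out);
            std::fputc('\n', out);
            return;
        }
        for (char c : alphabet) {                 // the loop inside the recursion
            word[pos] = c;
            generate(word, pos + 1, alphabet, out);
        }
    }

    int main() {
        const std::string alphabet = "abc";       // placeholder character set
        const size_t length = 4;                  // placeholder word length
        std::FILE* out = std::fopen("words.txt", "w");
        if (!out) return 1;
        std::string word(length, ' ');
        generate(word, 0, alphabet, out);
        std::fclose(out);
    }

Note that the output grows as alphabet_size^length, so the file size, not the language, will dominate the runtime.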

get instant energy consumption [closed]

I am looking to get instant energy consumption, in shell or C++.
Any ideas?
Thanks
Your question could do with a bit more detail, but if I understand you correctly, a program named Joulemeter does this in the following way:
Joulemeter estimates the energy usage of a VM, computer, or software by measuring the hardware resources (CPU, disk, memory, screen, etc.) being used and converting the resource usage to actual power usage based on automatically learned realistic power models.
That is one way to go. If you're just doing this for your own project, I guess you could throw together some hardware that measured from the wall socket and gave you the data that way. Maybe something like that exists already.
Well, if you have a laptop, you could use the answer presented for this similar question (the command below is macOS-specific):
/usr/sbin/system_profiler SPPowerDataType | grep Wattage
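On Linux, a rough C++ analogue is to read the battery's instantaneous draw from sysfs; this sketch assumes a battery exposed as BAT0 that reports power_now in microwatts, which varies from machine to machine:

    #include <fstream>
    #include <iostream>

    int main() {
        // power_now reports the instantaneous draw in microwatts on many
        // Linux laptops; the BAT0 name and the file itself vary by machine.
        std::ifstream in("/sys/class/power_supply/BAT0/power_now");
        long microwatts = 0;
        if (!(in >> microwatts)) {
            std::cerr << "could not read power_now for this battery\n";
            return 1;
        }
        std::cout << microwatts / 1e6 << " W\n";
        return 0;
    }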

Web service for PDF generation [closed]

Does anybody know where I can find a public RESTful web service for PDF generation? If so, do you have any experience using it (is it reliable/fast etc. if commercial)?
The service needs to be able to take in any number of formats and return a PDF document.
EDIT: Please refrain from commenting or answering unless you know what a RESTful web service is and does. The comment war below was due mostly to my assumption that this was generally obvious to present-day programmers.
Here's one: pdflayer API
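To sketch what calling such a service could look like (the endpoint and query parameters follow pdflayer's documented convert API, and ACCESS_KEY plus the document URL are placeholders), a minimal libcurl client might be:

    #include <cstdio>
    #include <curl/curl.h>

    // Write each chunk of the HTTP response straight into the output file.
    static size_t write_chunk(char* data, size_t size, size_t nmemb, void* file) {
        return std::fwrite(data, size, nmemb, static_cast<std::FILE*>(file));
    }

    int main() {
        std::FILE* out = std::fopen("out.pdf", "wb");
        CURL* curl = curl_easy_init();
        if (!out || !curl) return 1;

        // ACCESS_KEY and the document_url are placeholders.
        curl_easy_setopt(curl, CURLOPT_URL,
            "https://api.pdflayer.com/api/convert"
            "?access_key=ACCESS_KEY&document_url=https://example.com/page.html");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_chunk);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

        CURLcode rc = curl_easy_perform(curl);    // fetch the generated PDF
        curl_easy_cleanup(curl);
        std::fclose(out);
        return rc == CURLE_OK ? 0 : 1;
    }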
I've also created my own in the past, using Sun's Star Office Server (now Oracle Open Office Server, I think). The pricing is ridiculous, of course.
Not REST-specific, but if you want to convert HTML documents or strings to PDF, please note that complex CSS-based formatting rules can be problematic in places.
Maybe some experts can get it right, but that was my experience a few years ago.

Of these four libraries, which are you most likely to use? [closed]

I'm trying to pick out my next hackery project. It'll likely be one of the following:
A sparse radix trie implementation with extremely fast set operations
A really good soft heap implementation
A bloomier filter implementation
A collection of small financial algorithms, such as deriving total returns given a set of dividends and minimal information about them.
But I can't choose. So I thought I'd put my fate in the hands of my peers. Which of those four would you find most useful? Most interesting to work on? Which do you think is the most needed?
I didn't know what a bloomier filter is until reading your question; it turns out to be a real generalisation of the Bloom filter that associates values with keys, not a typo for "Bloom". Sounds cool and useful.
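For context, a plain Bloom filter, the structure bloomier filters generalise, fits in a few lines; this toy sketch uses two illustrative hash functions and a fixed bit array:

    #include <bitset>
    #include <functional>
    #include <string>

    // A toy Bloom filter: set membership with false positives but no
    // false negatives. Sizes and hash choices here are illustrative only.
    class BloomFilter {
        static constexpr size_t kBits = 1 << 16;
        std::bitset<kBits> bits_;

        size_t h1(const std::string& s) const {
            return std::hash<std::string>{}(s) % kBits;
        }
        size_t h2(const std::string& s) const {
            return std::hash<std::string>{}(s + "#salt") % kBits;  // placeholder second hash
        }

    public:
        void add(const std::string& s) { bits_.set(h1(s)); bits_.set(h2(s)); }
        bool maybe_contains(const std::string& s) const {
            return bits_.test(h1(s)) && bits_.test(h2(s));
        }
    };

A bloomier filter keeps the same probabilistic flavour but lets a query return an associated value instead of a plain yes/no.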