suggestion needed for storage system for an crawler

suggestion needed for storage system for an crawler - c++

I am planning to write a web crawler in c++ which crawls N number of pages daily. The main problem is that I am getting confusing with storage system . So I need a distributed db which efficient to store my crawled datas. Can anyone suggest me db which fulfill conditions?

MongoDB is likely a good fit since it supports almost all requirements in a straight forward and high-efficient way (including a nice query API). Distribution is accomplished through "Sharding".
Do not ask for a comparision of the databases (often discussed including stackoverflow ).

unless N is very large, or you plan on storing a lot of versions, you probably don't need a distributed DB. Try starting with MySQL

Related

Sharding existing postgresql database with PostgresXL

We want to shard our PostgreSQL DB, due to high disk load. Firstly, we looked at django-sharding library, but:
Very much rewriting in our backend
Migrating all tables to 64-bit primary keys is hard work on 300-400gb tables
Generating ids with Postgres Specific algorithm makes it impossible to move data from shard to shard. More than that, we have a large database with old ids. Updating all of them is a big problem too.
Generating ids with special tables makes us do a special SELECT query to main database every time we insert data. We have high write load, so it's not good.
Considring all these, we decided too look on Postgres database sharding solutions. We found 2 opportunities - Citus and PostgresXL. Citus makes us change data format too much and rewrite a big bunch of backend at the same time, so we are about to try PostgresXL as more transparent solution. But reading the docs, I can't understand some things and will be greatfull for recomendations:
Are there any other sharding workarounds except for Citus and PostgresXL? It would be good not to change much in our database on migrating.
Some questions about PostgresXL:
Do I understand correctly, that it's not Postgres extension, it's a standalone fork? So I should build all its parts from sources and than move data in some way?
How are Postgres and PostgresXL versions compatible? We have PostgreSQL 9.4. I don't see such a version in PostgresXL (9.2 or 9.5 no middle?). So can I use, for example, streaming replication for migration?
If yes/no, what is the best solution to migrate data? If I have 2Tb database with heavy write, can I migrate it somehow without stopping for a long period of time?
Thanks.

First off to save your self a LOT of headache have you looked at options Like Amazon's Auora, Dynomo, Red Shift, etc services? They are VERY cost effective at scale, as well as optimized and managed for you.
Actually Amazon's straight Postgress databases can handle MASSIVE amounts of reads or writes. We can go into 2,000- 6,000 IOPS on reads and another 2,000 to 6,000 IOPS in writes without issue. I would really look into this as the option. Azure, Oracle, and Google also have competing services.
Also be aware that Postgres-XL beyond all reason has no HA support. If you lose a single node you lose everything. The nodes can not fail over.
it's a standalone fork?
Yes, They are very different apps and developed separate from each other.
How are Postgres and PostgresXL versions compatible?
They arn't compatible. You can not just migration Postgres to Postgresl-XL. They work VERY differently.
Generating ids with Postgres Specific algorithm makes it impossible to >move data from shard to shard
Not following this, but with sharing you are not supposed to move data from one shard to another. The key being used generally needs to be something specific and unique to split/segregate your data on. Like a date, or a "type" field, or some other (hopefully ordered) field(s)/column(s). This breaks things up but has obvious pain in the a$$ limitations.
Are there any other sharding workarounds except for Citus and
PostgresXL? It would be good not to change much in our database on >>migrating.
Tons of options, but right off the bat going from a standard RDS, to a NoSql, or MPP database is going to be a major migration, a lot of effort, and have a LOT of limitations no matter what you do.
Next Postress-XL and Citus are MPP (massive parallel processing) clustering apps, not sharing specifically. That is part of what they can do, but it is not their focus.
Other options for MPP
pgPool -- (not great for heavy writes )
haProxy -- ( have not done it but read about it. Lost of work to setup and maintain. )
MySql Cluster -- (Huge pain to use the OSS version and major $$$ for the commercial version)
Green Plumb
Teradata
Vertica
what is the best solution to migrate data?
Very unlikely to find a simple migration for this kind of switch. You can expect to likely need to export the data your self from the existing RDS and import it to the new DB and will likely have to write something your self to get it the way you want it.

What is the different between AWS Elasticsearch and AWS Redshift

I read the document that both for data analysis and in cluster structure but I don't understand what use case different.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics.Amazon Elasticsearch
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift

Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (afaik) mostly used for BI purpuses and other compute-intensive jobs, the Amazon Elasticsearch is an out-of-the-box ElasticSearch managed cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything in order to manage your servers (this is what you pay for). Using the AWS Console you can add new cluster and you don't need to run any commands on order to install any software - you just need to choose which server to run your cluster on (number of nodes, disk, ram, etc).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch

I agree with #IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear - Whether speaking of AWS ElasticSearch Service, or FOSS / Enterprise ElasticSearch (which have signifficant differences, between, even) - ElasticSearch is NOT a Relational Database (RDBMS), nor is it quite a NoSQL (Document Store) Database, either...
ElasticSearch is a Search Engine / Index. It does some things very well, for very specific use cases, however unlike RDBMS data models most signifficantly, ElasticSearch or NoSQL are not going to provide you with FULL ACID Compliance, or Transactional Statement Processing, so if your use case prioritizes data integrity, constrainability, reliability, audit ability, regulatory compliance, recover ability (to Point in Time, even), and normalization of data model for performance and least repetition of data while providing deep cardinality and enforcing model constraints for optimal integrity, "NoSQL and Elastic are not the Droids you're looking for..." and you should be implementing a RDBMS solution. As already mentioned, the AWS Redshift Service is based on PostgreSQL - which is one of the most popular OpenSource RDBMS flavors out there, just offered by AWS as a fully managed solution / service for their customers.
Elastic falls between RDBMS and NoSQL categories, as it is a Search Engine / Index that works most optimally with "single index" type use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing,but perhaps the most important thing I could stress is that in my experience it typically does not scale very cost effectively (even managed cluster services) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - and for most will likely become cost PROHIBITIVE VERY fast. That said, Elastic Search DOES still have very optimal use cases, so is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly let's call NoSQL what it is, a Document Store that stores collections of documents (most often in JSON format) and while they also do indexing, offer some semblance of an Authentication and Authorization model, provide CRUD operability (or even SQL support nowadays, which makes the career Enterprise Data Engineer in me giggle, that SQL is now the preferred means of querying data from their NoSQL instances! :D )- Still NOT a traditional database, likely won't provide you with much control over your data's integrity - BUT that is precisely what "NoSQL" Document Stores were designed to work best for - UNSTRUCTURED DATA - where you may not always know what your data model is going to look like from the start, or your use case prioritizes data model flexibility over enforcing data integrity in general (non mission critical data). Last - while most modern NoSQL Document Stores may have SOME features that appear on the surface to resemble RDBMS, I am not aware of ANY in that category at current that could claim to offer all that a relational database does, with Oracle MySQL's DocumentStore being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope Developers with similar questions come across this thread, and after reading are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves - everything we do in our profession is about data - either generating it, transporting it, rendering it, transforming it....it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!

This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.

Using Apache Spark to serve real time web services queries

We have a use case, where we are downloading large volumes (order of 100 gigabytes per day) of data from hundreds of data sources, massaging and processing this data and then exposing this data to our customers via RESTful API. Today the base data size is ca. 20TB and expected to grow heavily in the future.
For the massaging/processing part, we believe spark can be a very good choice for us. Now for exposing processed/massaged data through an API, one option is to store processed data to a read only database like ElephantDB and make web services to talk to ElephantDB (at least this is how Nathan has proposed in his Big Data book). I was just wondering what would be the implication of we make web services implementation to use SparkSQL to access processed data from Spark. What could be the architecture/design dangers in this case?
Every body is talking about Spark is fast and what not and using SparkSQL for interactive queries. But is it already in a stage to serve large volume of web services queries via SparkSQL where we have very strict SLA for latency serve hundreds and thousands of web services requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system like ElephantDB or Cassandra or what not.
Would like to hear from the experts on this board.

If the results are stored in files, you have no indexes, and SparkSQL also doesn't create indexes. The only thing that can be somewhat fast is reading columns from Parquet files and caching tables.
But in general it's not a good use case to use SparkSQL to serve web requests simply because Spark wasn't made for that.

So you're batch processing the raw data, yes?
The ideal way would be to store the outcome on a key-value format, as you mention with ElephandDB, and also project Voldemort has been shown to be a good fit as read-only storage.
I recommend you to read this article (combining batch and realtime layers) by Nathan Marz: How to beat the CAP theorem
It has however been questioned by Jay Kreps in his article Questioning the Lambda Architecture. The main concern (with the lambda architecture) is that there is problematic to maintain the "same" system logic in different distributed systems to produce the same result.
But since you are using Spark, you can use the same logic with Spark Streaming. Which was not "in the market" when Nathan Marz and Jay Kreps wrote their articles.
You can still use SparkSQL to query the raw data interactively, but since Spark was first implemented as scheduled batch jobs, this will not be the perfect use case. But as you've probably noticed, is that it takes some time to submit spark jobs, this is an overhead that "kills" the idea of fast queries.
Please look into github.com/spark-jobserver/spark-jobserver, the job-server supports sub-second low-latency jobs via long-running job contexts. And can share Spark RDDs between different jobs, which can be proved to be very optimized for different interactive logic on the same dataset. Combine machine learning result and ad-hoc (SparkSQL) queries via HTTP requests. Read more about spark job-server, there are some talks about it online on different Spark Summits.

Use RDBMS to reduce memory consumption?

In my Visual C++ application, I want to allocate a lot of objects, which will use up all available memory in the system. To solve this problem, I decide to store the objects in database. I just have 3 candidates: MySQL, PostgreSQL, and SQLite. But don’t know which one is more appropriate.
What I need is:
Store objects in the database instead of memory.
Fast to find the objects via a key.
Light-weight so the RDBMS will not require a lot of system resources, including both the memory and disk spaces.
No server required.
Easy to deploy.
Which one should be best for my needs? Of course, if you have any other better alternatives, then just tell me.
SQLite provides a detailed doc how when it should be used. But MySQL and PostgreSQL does not so it is a little difficult to choose as I am not familiar with these two. Thanks.

I'd use SQLite. It doesn't require a service and is cross platform. It is easy to deploy and is light-weight. It supports transaction. It's in the public domain.

Your questions:
Store objects in the database instead of memory.
Any database can do this, that's the definition of a database.
Fast to find the objects via a key.
Also standard functionality, if you can't find your data, what's the point of using a database.
Light-weight so the RDBMS will not require a lot of system
resources, including both the memory and disk spaces.
That's mostly in your hands, bad queries generate a lot of overhead. No matter what brand of database (or software language) you use.
No server required.
Do you mean "hardware" or "client-server model" ? Both MySQL and PostgreSQL are services in a client-server model. SQLite works best for a single client.
Easy to deploy.
All 3 databases are easy to deploy, but SQLite is the easiest one. It's not a server like the others.
It looks like SQLite is the best fit, but also check your other requirements, the ones you didn't mention: performance, reliability, backup, failover, etc. etc. And do you needs an RDBMS for this kind of work? A C++ object in memory is very different from a bunch of records in a couple of databases that can be accessed by using SQL.

Full text search with Amazon Services

I would to move my application to Amazon SimpleDB, since I’m not going to maintain database service on my own. This application lives under heavy load. There are a lot of reads/writes per second. I don’t need consistency and atomicity and I want to keep things as simple as possible, so SimpleDB is good choice.
The problem is, that I need full-text search capacities. And I don’t know how to make it better with Amazon SimpleDb. I had implemented before hand-written full-text search with MongoDB database. I had to split text to words in my application layer, and build my own index. It was not hard, but I don’t want to do it again with SimpleDB.
I found an interesting article
http://codingthriller.blogspot.com/2008/04/simpledb-full-text-search-or-how-to.html
But I would like to not have to implement it myself. I’m looking for a pre-made solution
What are the options?
Is it better to use Amazon RDS + Lucene?
Or probably there are out of the box solutions for SimpleDB?
Requirements are:
Ability to handle a lot of concurrency requests
Full-text search (text size would not be greater then 1MB (SimpleDB restriction))
Preferable not to admin it on my own.

Lucene or similar is usually the way people do it, but not knowing what platform you're working with its hard to suggest anything in particular. Simol is an .NET object-persistence framework for SimpleDB which can use Lucene.NET for indexing. I've also looked at some basic Lucene.NET examples which aren't too bad. If you're looking for a hosted indexing service you could take a look at this question.
For your indexing to do its job well, you're more than likely going to have to tailor it to your application.

Amazon looks like they will announce something to do with search on Jan 18 2012. http://pandodaily.com/2012/01/17/good-news-for-ec2-customers-amazon-may-launch-new-cloud-search-tomorrow/
SimpleDB for full text search is not great. It will not search more than about 300,000 documents on a single field, using the %like% operator, for instance. It will take about 2 or three tries - about 15 seconds to run through only a hundred MB of text looking for a match. I think its too slow, as do others. See the AWS forums...

Amazon CloudSearch has been released but does not have an easy way to move data from your SimpleDB to CloudSearch without you writing code.
The API, however, is fairly simple and it probably could get up in running in a week or two depending on your needs (if you use the existined SDKs). If you're using a programming language without an SDK, then it will take you longer.
http://aws.amazon.com/cloudsearch/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js