I would like to use Scrapy to crawl fairly large websites. In some cases I will already have the links to scrape, and in others I will need to extract (crawl) them. I will also need to access a database twice per run: once to determine whether a URL needs to be scraped (spider middleware) and once to store the extracted information (item pipeline).
Ideally, I would be able to run concurrent or distributed crawls in order to speed things up. What is the recommended way to run concurrent or distributed crawls with Scrapy?
You should check scrapy_redis.
It is very simple to implement. Your scheduler queue and duplicate filter will be stored in Redis, so all the spiders share them and can work concurrently, which should speed up your crawl.
Hope this helps.
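As a rough sketch of what that setup can look like, the settings below hand scheduling and de-duplication over to scrapy_redis. The setting names come from the scrapy_redis README; the Redis URL is an assumption for a local instance, so adjust it to your environment.

```python
# settings.py (sketch): let scrapy_redis own the request queue and the dupe filter
# so that several spider processes can share them.

# Store the request queue in Redis instead of in process memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# De-duplicate requests across all spider processes through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the seen-set between runs so crawls can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumption: a Redis server reachable on localhost with the default port.
REDIS_URL = "redis://localhost:6379"
```

With this in place, starting the same spider on several machines that point at the same Redis instance should let them pull from one shared queue.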
The Scrapy Cluster documentation contains a page listing many existing Scrapy-based solutions for distributed crawls.
I'm working on retrieving data such as products and orders from e-commerce platforms like BigCommerce, Shopify, etc., and saving it in our own databases. To improve the data retrieval speed from their APIs, we're planning to use the Bluebird library.
Previously, the retrieval logic fetched one page at a time. Since we're planning to make concurrent calls, "n" pages will now be retrieved concurrently.
For example, BigCommerce allows us to make up to 3 concurrent calls at a time. So we need to make the concurrent calls in such a way that we never retrieve the same page more than once, and if a request fails, the request for that page is resent.
What's the best way to implement this? One idea that comes to mind:
One possible solution: keep an index of ongoing requests in the database and update it when each API call completes, so we will know which ones were unsuccessful.
Is there a better way of doing this? Any suggestions/ideas on this would be highly appreciated.
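One in-memory alternative to a database index is to bound concurrency with a semaphore and retry each page inside its own task. The sketch below shows the idea in Python with asyncio rather than Bluebird (Bluebird's `Promise.map` with its `concurrency` option plays a similar role). `fetch_page`, `MAX_ATTEMPTS`, and the backoff delay are hypothetical names and values chosen for illustration.

```python
import asyncio

MAX_CONCURRENT = 3   # the concurrency limit mentioned in the question
MAX_ATTEMPTS = 3     # hypothetical retry budget per page


async def fetch_all(fetch_page, page_numbers):
    """Fetch each page once, at most MAX_CONCURRENT at a time, retrying failures.

    `fetch_page` is a caller-supplied coroutine function (an assumption here)
    that takes a page number and returns that page's data or raises on error.
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    results = {}   # page number -> data, filled only on success
    failed = []    # pages that exhausted their retry budget

    async def fetch_with_retry(page):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                async with semaphore:          # hold a slot only while the call is in flight
                    results[page] = await fetch_page(page)
                return
            except Exception:
                if attempt < MAX_ATTEMPTS:
                    await asyncio.sleep(0.01 * attempt)  # brief backoff before retrying
        failed.append(page)

    # set() guarantees that no page is requested by more than one task.
    unique_pages = sorted(set(page_numbers))
    await asyncio.gather(*(fetch_with_retry(p) for p in unique_pages))
    return results, failed
```

Because each page is owned by exactly one task, duplicates cannot occur, and the `failed` list tells you which pages still need attention after the run. A database index would only be needed if progress has to survive a process restart.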
Any recommendations on how to make Superset faster?
The cache seems to serve the full result set. I thought it would load only the old data from the cache and the real-time data from the database. Isn't that how it works?
What about some parallel processing?
This answer is valid as of Superset 0.37.0.
At the moment, dashboard performance is affected by a few different factors. I'll enumerate them below along with methods to improve performance:
Database concurrency limits: These can have an impact on dashboard performance. Dashboards load their information in parallel via concurrent web requests. Make sure that the database user provided allows enough concurrency that queries aren't being queued at the database layer.
Cache performance: Your caching layer should be able to return multiple results, if not in parallel, extremely quickly. We've had success leveraging S3 for our cache.
Cache hit percentage: Superset will hit the cache only for queries that exactly match one that has been run recently. Otherwise the full query will fall through to the underlying analytical DB (Druid in this case). You can reduce the query load on Druid by using a less granular resolution on your dashboard. If it's possible to have it update less frequently, say a couple of times a day rather than in real time, this can hit the cache for all requests other than the first request in the new period under consideration.
Python web process concurrency limits: Make sure that your web application server can handle enough parallel requests. The browser will request multiple charts' data at the same time, and the system will need to be able to handle these requests in parallel.
Chart query performance: As data is frequently requested, especially for real-time data from a database like Druid, optimizing the queries run by the charts can be very useful. I'd take a look at any virtual datasources that are being leveraged to see if they can be materialized or made more efficient.
Web browser concurrent request limits: By default most web browsers limit the number of concurrent requests that can be made to the same FQDN. If you have more than 6 charts on the same dashboard, it can be helpful to balance requests across multiple FQDNs running Superset to get around this browser limitation. There's more information on the approach to that in the issue history on GitHub, but Superset does support this type of configuration.
The community is very interested in improving performance over time, and as such there have been recommendations to move all analytical queries to Celery as well as making other architectural changes to improve performance. I hope this description helps and that something in here will help you track down the issue!
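To illustrate the cache hit percentage point above: a query is only served from cache when it matches an earlier one exactly, so a time range that ends at "now" misses every time, while one rounded to a coarser period matches for the rest of that period. The sketch below is a simplified illustration of that idea, not Superset's actual cache-key code; `bucket_start` and `cache_key` are hypothetical helpers.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(now, granularity):
    """Round `now` down to the start of its granularity-sized period."""
    period = int(granularity.total_seconds())
    elapsed = int((now - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=elapsed - elapsed % period)


def cache_key(query, now, granularity):
    """Build a key from the query text and the rounded end of its time range.

    Illustrative only: requests that fall in the same period produce the
    same key, so only the first one would reach the database.
    """
    payload = json.dumps(
        {"query": query, "until": bucket_start(now, granularity).isoformat()},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a 12-hour granularity, two dashboard loads a few hours apart map to the same key, whereas a 1-second granularity gives almost every load its own key and therefore a cache miss.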
We have a use case where we download large volumes of data (on the order of 100 gigabytes per day) from hundreds of data sources, massage and process this data, and then expose it to our customers via a RESTful API. Today the base data size is ca. 20 TB, and it is expected to grow heavily in the future.
For the massaging/processing part, we believe Spark can be a very good choice for us. For exposing the processed data through an API, one option is to store it in a read-only database like ElephantDB and have the web services talk to ElephantDB (at least this is what Nathan Marz proposes in his Big Data book). I was wondering what the implications would be if the web services used SparkSQL to access the processed data from Spark directly. What are the architecture/design dangers in this case?
Everybody says that Spark is fast and that SparkSQL is suited to interactive queries. But is it already at a stage where it can serve a large volume of web service queries, given that we have a very strict latency SLA and need to serve hundreds or thousands of web service requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system like ElephantDB or Cassandra.
Would like to hear from the experts on this board.
If the results are stored in files, you have no indexes, and SparkSQL also doesn't create indexes. The only thing that can be somewhat fast is reading columns from Parquet files and caching tables.
But in general, serving web requests is not a good use case for SparkSQL, simply because Spark wasn't made for that.
So you're batch processing the raw data, yes?
The ideal way would be to store the outcome in a key-value format, as you mention with ElephantDB. Project Voldemort has also been shown to be a good fit as read-only storage.
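A minimal sketch of that pattern follows, with SQLite from the Python standard library standing in for a read-only key-value store such as ElephantDB or Voldemort (an assumption made purely so the example is self-contained). The batch job rewrites the whole view, and the web service only performs indexed point lookups. The table and function names are hypothetical.

```python
import sqlite3


def build_batch_view(path, rows):
    """Batch side: rewrite the whole view from precomputed (key, value) pairs."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP TABLE IF EXISTS batch_view")
        # The PRIMARY KEY gives the index that plain result files lack.
        conn.execute("CREATE TABLE batch_view (key TEXT PRIMARY KEY, value TEXT)")
        conn.executemany("INSERT INTO batch_view VALUES (?, ?)", rows)
    conn.close()


def lookup(path, key):
    """Serving side: indexed point lookup over a read-only connection."""
    conn = sqlite3.connect("file:%s?mode=ro" % path, uri=True)
    try:
        row = conn.execute(
            "SELECT value FROM batch_view WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()
```

The serving path never scans the data or starts a job; it only reads a precomputed answer by key, which is what keeps request latency predictable.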
I recommend reading this article by Nathan Marz on combining batch and real-time layers: How to beat the CAP theorem
It has, however, been questioned by Jay Kreps in his article Questioning the Lambda Architecture. The main concern with the lambda architecture is that it is difficult to maintain the "same" system logic in different distributed systems so that they produce the same result.
But since you are using Spark, you can use the same logic with Spark Streaming, which was not "on the market" when Nathan Marz and Jay Kreps wrote their articles.
You can still use SparkSQL to query the raw data interactively, but since Spark was first implemented for scheduled batch jobs, this is not the ideal use case. As you've probably noticed, it takes some time to submit Spark jobs, and this overhead "kills" the idea of fast queries.
Please look into github.com/spark-jobserver/spark-jobserver. The job server supports sub-second, low-latency jobs via long-running job contexts, and it can share Spark RDDs between different jobs, which can work well for running different interactive logic on the same dataset. You can combine machine learning results and ad-hoc (SparkSQL) queries via HTTP requests. To read more about the Spark job server, there are talks about it online from various Spark Summits.
I would like to expose a web service in front of Hadoop that forwards data to the Hadoop ecosystem. I have two branches in Hadoop: a slower one that periodically works on the whole data set, and a fast one that does some computation on every input and stores the data for the periodic job. The user does not see the slower branch and has the impression that only the fast job is done, unaware of the slower job that runs on the data aggregated over time.
How should I best organize my architecture? I am new to Hadoop architecture. I have read about Oozie and have a feeling that it can help me to some extent. But I don't know how to connect the service with Hadoop or how to pass the data through the service, since Hadoop works primarily on files and is a distributed system.
Data should get into the system in a streaming fashion. There should be a "real time" branch that works with individual values as they enter the system, and the values should also be accumulated for periodic batch processing.
Any help would be great, thanks.
You might want to look into Hue. It provides a set of web front-ends: there's one for HDFS (the filesystem) where you can upload files, and there are ways to track jobs too.
If you are aiming for more regular, automated loading of files into HDFS, please elaborate on your question: where and what is the data initially (logs? db? bunch of gzipped CSVs?), what should trigger retrieval/
One can also use APIs to work with the filesystem and to track jobs.
As for Oozie, it is more of an orchestration tool; use it to organize related jobs into workflows.
I am planning to write a web crawler in C++ which crawls N pages daily. The main problem is that I am getting confused about the storage system. I need a distributed DB that can store my crawled data efficiently. Can anyone suggest a DB that fulfills these conditions?
MongoDB is likely a good fit, since it supports almost all of the requirements in a straightforward and highly efficient way (including a nice query API). Distribution is accomplished through "sharding".
Do not ask for a comparison of the databases (this is often discussed, including on Stack Overflow).
Unless N is very large, or you plan on storing a lot of versions, you probably don't need a distributed DB. Try starting with MySQL.