Too aggressive bot? - web-services

I'm making a little bot to crawl a few websites.
I'm just testing it out right now, and I've tried 2 settings:
About 10 requests every 3 seconds: the IP got banned, so I said OK, that's too fast.
2 requests every 3 seconds: the IP got banned after 30 minutes and 1000+ links crawled.
Is that still too fast? We're talking about close to 1,000,000 links. Should I take this as the message "we just don't want to be crawled", or is it still too fast?
Thanks.
Edit
Tried again with 2 requests every 5 seconds: 30 minutes and 550 links later I got banned.
I'll go with 1 request every 2 seconds, but I suspect the same will happen. I guess I'll have to contact an admin, if I can find one.

Here are some guidelines for web crawler politeness.
Typically, if a page takes x seconds to download, it is polite to wait at least 10x to 15x seconds before downloading from that site again.
Also make sure you are honoring robots.txt.
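The 10x-15x rule of thumb above can be sketched in a few lines. This is only an illustration: `polite_delay`, `timed_fetch`, the `fetch` callable and the 1-second floor are assumptions made for the example, not part of any crawler library.

```python
import time


def polite_delay(download_seconds, factor=10.0, minimum=1.0):
    """Seconds to wait before the next request to the same site.

    factor=10.0 follows the 10x rule of thumb; the 1-second floor is an
    assumption so that very fast responses do not lead to hammering.
    """
    return max(minimum, factor * download_seconds)


def timed_fetch(fetch, url, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url), then pause in proportion to how long it took."""
    start = clock()
    body = fetch(url)          # fetch is any callable that downloads a URL
    elapsed = clock() - start
    delay = polite_delay(elapsed)
    sleep(delay)
    return body, delay
```

A page that took 0.5 seconds to download would give a 5-second pause before the next request.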

Yes. It is too fast.
Generally, crawlers keep to a rate of 1 request per minute.
Honestly, that is a low crawl rate. But after a few minutes you will have a queue of URLs (a long list :) ), and you can rotate through this list until a particular URL's turn comes around again.
If you have the option of some sort of distributed architecture (multiple nodes with different network connections, even Hyper-V or other VMs), you may consider a higher speed. The different hosts in the grid can fetch the content more effectively.
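One way to picture the "rotate over the list" idea is a small scheduler that keeps one queue per host and only hands out a URL when that host's waiting period has passed. This is a minimal sketch; the `HostScheduler` class and the 60-second default (taken from the 1 request per minute figure above) are assumptions for illustration.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class HostScheduler:
    """Hands out URLs so that each host is visited at most once per delay."""

    def __init__(self, per_host_delay=60.0):
        self.per_host_delay = per_host_delay
        self.queues = {}        # host -> deque of pending URLs
        self.next_allowed = {}  # host -> earliest time of its next request
        self.ready_at = []      # heap of (time, host) for hosts with pending URLs

    def add(self, url, now=0.0):
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            due = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready_at, (due, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host is due, or None if every host must wait."""
        if not self.ready_at or self.ready_at[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_at)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.per_host_delay
        if self.queues[host]:
            heapq.heappush(self.ready_at, (self.next_allowed[host], host))
        else:
            del self.queues[host]
        return url
```

While one host is cooling down, the crawler keeps working through URLs from the other hosts, so the overall throughput can stay high even though each individual site sees a low request rate.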

One of the most important things to take into account is the site owners. As others have mentioned, the robots.txt file is the standard way for sites to state their wishes.
In short, there are 3 directives in robots.txt that are used to limit request speed:
Crawl-delay: # , an integer giving the number of seconds to wait between requests.
Request-rate: # / # , where the numerator is the number of pages and the denominator is the number of seconds, e.g. 1/3 = 1 page every 3 seconds.
Visit-time: ####-#### , two 4-digit numbers separated by a hyphen giving the time window (HHMM, GMT) during which you should crawl the site.
Given these suggestions/requests, you may find that some sites have none of them in their robots.txt, in which case the rate is up to you. I would suggest keeping to a reasonable rate of no more than 1 page per second, while also limiting how many pages you consume per day.
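As a rough illustration, Python's standard `urllib.robotparser` can read Crawl-delay and Request-rate (it does not handle Visit-time). The sample robots.txt text, the `seconds_between_requests` helper and the 1-second default are assumptions made for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/3
Disallow: /private/
"""


def seconds_between_requests(robots_text, user_agent, default=1.0):
    """Pick the most conservative delay the site asks for.

    Falls back to `default` (an assumed 1 second) when the file has
    neither Crawl-delay nor Request-rate for this user agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    delays = [default]
    crawl_delay = parser.crawl_delay(user_agent)
    if crawl_delay is not None:
        delays.append(float(crawl_delay))

    rate = parser.request_rate(user_agent)  # named tuple (requests, seconds)
    if rate is not None and rate.requests:
        delays.append(rate.seconds / rate.requests)

    return max(delays)
```

For a live site you would typically call `set_url()` and `read()` on the parser instead of passing text, and also check `can_fetch()` before requesting each URL.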

Related

Why does the "hatch rate" matter when performance testing?

I'm using Locust for performance testing. It has two parameters: the number of users and the rate at which the users are generated. But why are the users not simply generated all at once? Why does it make a difference?

Looking at the Locust Configuration Options, I think the correct option is spawn-rate.
Coming back to your question: in the performance testing world the more common term is ramp-up.
The idea is to increase the load gradually, as this way you will be able to correlate other performance metrics like response time, throughput, etc. with the increasing load.
If you release 1000 users at once you get a limited view and can only answer the question of whether your system supports 1000 users or not. You won't be able to tell what the maximum number is, where the saturation point is, etc.
When you increase the load gradually you can make statements such as:
Up to 250 users the system behaves normally, i.e. response time is the same, throughput increases as the load increases
After 250 users response time starts growing
After 400 users response time starts exceeding acceptable thresholds
After 600 users errors start occurring
etc.
Also, if you decrease the load gradually you can tell whether the system gets back to normal when the load decreases.
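To make the ramp-up idea concrete, here is a small plain-Python sketch (it does not use the Locust API). `active_users` models a linear ramp from a user count and a spawn rate, and `first_breach` finds the load level at which a threshold is first exceeded. Both function names and the sample numbers in the usage note are made up for illustration.

```python
def active_users(t_seconds, total_users, spawn_rate):
    """Users running t seconds into a linear ramp-up.

    spawn_rate is in users per second, mirroring the users / spawn-rate
    pair of parameters discussed above.
    """
    return min(total_users, int(spawn_rate * t_seconds))


def first_breach(samples, threshold_ms):
    """Lowest user count whose response time exceeds threshold_ms.

    samples is a list of (users, response_time_ms) pairs collected while
    the load was ramping up. Returns None if the threshold is never hit.
    """
    for users, response_ms in sorted(samples):
        if response_ms > threshold_ms:
            return users
    return None
```

With hypothetical measurements such as `[(100, 120), (250, 125), (400, 310), (600, 900)]` and a 300 ms threshold, `first_breach` returns 400. Spawning all users at once would only ever give you the last sample.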

Monitor that lambda executes in NewRelic

I'm trying to monitor if my Lambda has been executed within the last 25 hours within New Relic. I want to alert if it hasn't.
I have the following NRQL which gives me the graph I want to see:
SELECT sum(`provider.invocations.Sum`) FROM ServerlessSample WHERE provider.resource = 'my_lambda_name'
I then just want to say that if it dips below 1 for 1500 minutes (25 hours) then alert, but NR only allows me to set an alarm for 120 minutes. Any tips on how to get around this?

Interesting question. From what I have seen on the New Relic discussion page (Explorers Hub), there might be a solution for your task.
Can you please review this link:
https://discuss.newrelic.com/t/relic-solution-extending-the-functionality-of-nrql-alert-conditions-beyond-a-single-minute/75441
If you think about this for a moment, you might see how NRQL queries using percentile or stddev are a lot less useful than they seem, when used in an alert condition. After all, if you calculate the standard deviation over an hour (or 24 hours), that can be meaningful. But stddev(duration), or percentile(duration,95) calculated over only 60 seconds is less meaningful.
I think the limit is 24 hours, but I haven't tested it yet.
Hope this helps. I will give it a go as well to see whether it works.

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing, which looks like a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. 2 of them have roughly 1 million documents (500-600 MB) each and one has 25k (~20 MB). Indexing is not very simple (it has history tracking), so I've been testing refresh with the true and wait_for values, or calling it separately when needed. The process uses search and bulk queries (I've been trying sizes of 500 and 1000). There should be a 10 MB request limit on the AWS side, so these are safely below that. I've also tested adding 0.5 or 1 second delays between requests, but none of this fiddling has had any noticeable benefit.
The project is currently in development so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even only updating the 25k data index twice in a row tends to fail with the above mentioned error. Any ideas how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to have 429 circuit_breaking_exception-s (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. Used cluster stats API to track memory usage during indexing, but didn't find anything that I could identify as a direct cause for the issue.

Ended up creating a solution based on the information that I could find. After some searching and reading it seemed like just trying again when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not occur frequently so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error the program will keep trying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. Will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
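For reference, a retry loop along these lines might look like the sketch below. Note that it swaps the fixed 15-second retry described above for the randomized exponential backoff recommended in the quoted guide. The `Throttled` exception and the `send` callable are hypothetical placeholders for whatever your client raises and calls, not part of the Elasticsearch client.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical error raised by send() on a 403/429 throttling response."""


def send_with_retry(send, payload, base_delay=1.0, max_delay=15.0,
                    time_limit=60.0, sleep=time.sleep, clock=time.monotonic):
    """Call send(payload), backing off and retrying while it is throttled.

    Gives up (re-raises) once the next wait would pass time_limit seconds.
    """
    deadline = clock() + time_limit
    attempt = 0
    while True:
        try:
            return send(payload)
        except Throttled:
            # Randomized exponential backoff: wait a random time up to
            # base_delay * 2**attempt, capped at max_delay.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            if clock() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1
```

The 15-second cap and 60-second limit match the numbers mentioned above; the 1-second base delay is an assumption.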

CloudSearch performance with frequent updates of small batches

I have a use case where I need to upload small document batches (typically 1 to 10 documents of 1 KB each) to CloudSearch. Every 2 or 3 seconds a new batch is uploaded. The CloudSearch docs for bulk uploads say:
Make sure your batches are as close to the 5 MB limit as possible. Uploading a larger amount of smaller batches slows down the upload and indexing process.
It's OK if there is a 30-second delay before the documents show up in search results. Will my implementation work well as my document count increases, let's say to 500,000 docs?

Indexing time should be well under your 30 second SLA even with 500k docs, regardless of how or whether you batch your submissions.
I say this based on my own testing with an index of 300k docs and 38 index fields on an m1.small instance type, where it takes less than 3 seconds for a document to become searchable. There are a lot of variables that could affect your own situation, such as how many index fields you have, your instance size, etc., but I think my setup reflects unfavorable conditions (m1.small instance with a complex indexing schema) and is still an order of magnitude faster than your SLA. It's anecdotal evidence of course, but you should be fine.

Amount of Test Data needed for load testing of a web service

I am currently working on a project that requires load testing of web services.
One of the services is called 60,000 times in production during the busy day/busy hour.
{PerfTest Env=PROD}
Input Account Number
Output AccountDetails
Do I really need 60,000 unique account numbers (test data) for this LoadRunner script to simulate the production scenario?
If unique data is required, then for an endurance test I will have to prepare a lot of test data for each web service.
If I don't get that much test data, what is the chance of the load test being affected by the application server's caching mechanism?
Can somebody help me?
Thanks
Ram

Are you simulating a day or the highest volume hour in the last year? This can help you to shape the amount of data that you need. Rarely would you start with a 24 hour test. Instead you would be looking at your high water test of an hour with a ramp up and ramp down, so you would need approximately 1.333* your high water hour's worth of data.
So this can drop your 60K to (potentially) 20K(?). I am making an assumption that your worst hour over the last year is somewhere around 1/3 of your traditional day. I have observed this pattern over and over again in different environments over the past two decades. You will want to objectively verify this with log data or query data to support the number in your environment.
Next up, how many of these inquiries are actually unique? You are really going to need a log of the queries across a day (or your high water hour) to determine this. Log processing tools such as Microsoft Logparser or Splunk/Splunk Storm can help you to pull the observed distribution of unique account references within your data, including counts of those which are multiple. Once you know this you can simply use a data file with a fixed block size for each user for unique data and once the data is exhausted the user exits.
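The arithmetic and the log analysis can be sketched in a few lines of Python. The function names and the assumption that you can extract one account number per logged call are illustrative, and the 1.333 factor is the ramp-up/ramp-down allowance mentioned above.

```python
from collections import Counter


def rows_needed(peak_hour_calls, ramp_factor=1.333):
    """Test-data rows for a peak-hour test including ramp-up and ramp-down."""
    return round(peak_hour_calls * ramp_factor)


def account_distribution(account_numbers):
    """Summarize how often accounts repeat in a log extract.

    account_numbers is an iterable with one account number per logged call.
    Returns (number of unique accounts, number of accounts seen more than once).
    """
    counts = Counter(account_numbers)
    repeated = sum(1 for n in counts.values() if n > 1)
    return len(counts), repeated
```

With the assumed 20K-call peak hour, `rows_needed(20000)` comes to 26,660 rows if every call had to be unique. The observed ratio of unique to repeated accounts then tells you how far below that number you can go without distorting cache behaviour.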
