I am currently evaluating whether I can use Amazon CloudSearch for our search needs instead of Elasticsearch.
Right now, I have only about 4K small documents for testing purposes. If I try to perform a search after a good 2-3 hours of idle time, the first search is about 8 to 10 times slower than the subsequent searches: the first search after the idle period takes about 300 ms, whereas the subsequent searches take about 40 ms. I am not using the same search terms in the first and subsequent searches, so I don't think the subsequent searches are faster due to cached results.
Please note that if I change my instance type to search.m3.xlarge or search.m3.2xlarge instead of the default, the response time of the first search is not all that bad. I looked through the documentation to see whether this is expected behavior, but could not find anything. Can someone shed some light on this, please?
I need to get all resources based on a label. I used the following code, which works; however, it takes too much time (~20 s) to get the response, even when I restrict it to only one namespace (vrf). Any idea what I'm doing wrong here?
// flags, res, and selector are defined elsewhere by the caller; error handling omitted here.
obj, err := resource.NewBuilder(flags).
    Unstructured().
    ResourceTypes(res...).
    NamespaceParam("vrf").AllNamespaces(false).
    LabelSelectorParam("a=b").SelectAllParam(selector == "").
    Flatten().
    Latest().
    Do().
    Object()
https://pkg.go.dev/k8s.io/cli-runtime#v0.26.1/pkg/resource#Builder
As I'm already filtering by label and namespace, I'm not sure what else I should do in this case.
I've checked the cluster connection and everything seems fine; regular kubectl commands get very fast responses, it's just this query that takes a long time.
The search may be slow due to the sheer number of resources the query has to scan. Have you looked into this possibility and tried to further reduce the result set by adding one more label or filter on top of the current one?
Also check the performance of your Kubernetes API server while the operation is being performed, and optimize it if needed.
Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.
I'm thinking of comparing hashes. The problem with this is that if the page has changed by even a single byte or character, the hash would be different. So, for example, if the page displays the current date, the hash would be different every single time and tell me that the content has been updated.
So... how would you do this? Would you look at the KB size of the HTML? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has "changed"? Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have been changed?
Regarding Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it's a reliable solution on its own. I think a better way is to combine the hash and content-length approaches: check the hash, and if it has changed, check the string length as well.
There is no universal solution.
Use If-Modified-Since or HEAD when possible (usually ignored by dynamic pages)
Use RSS when possible.
Extract the last-modification stamp in a site-specific way (news sites have publication dates for each article, easily extractable via XPath)
Only hash the interesting elements of the page (build a site-specific model), excluding volatile parts; see the sketch after this list
Hash the whole content (useless for dynamic pages)
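As a rough illustration of the site-specific hashing idea, here is a sketch that strips a few volatile patterns before hashing; the URL and the regexes are placeholder assumptions, and a real site model would need its own rules:

# Sketch: hash only the stable part of a page by stripping volatile fragments first.
import hashlib
import re
import requests

VOLATILE_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",          # ISO dates, e.g. 2024-01-31
    r"\d{1,2}:\d{2}(:\d{2})?",     # times, e.g. 14:05 or 14:05:33
]

def content_fingerprint(html):
    # Drop script/style blocks, which often contain counters and nonces.
    text = re.sub(r"(?is)<(script|style).*?</\1>", "", html)
    # Replace volatile fragments with fixed placeholders before hashing.
    for pattern in VOLATILE_PATTERNS:
        text = re.sub(pattern, "<volatile>", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

old_hash = None                                          # value saved from the previous check
html = requests.get("http://example.com/page").text     # placeholder URL
if content_fingerprint(html) != old_hash:
    print("content changed")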
Safest solution:
Download the content and create a checksum using a SHA-512 hash of the content, keep it in the DB, and compare it each time.
Pros: you are not dependent on any server headers and will detect any modification.
Cons: high bandwidth usage; you have to download all the content every time.
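A minimal sketch of this approach, assuming a placeholder URL and an in-memory dict standing in for the DB:

import hashlib
import requests

stored_checksums = {}   # stand-in for the database table of URL -> checksum

def has_changed(url):
    body = requests.get(url).content                 # download the full content
    checksum = hashlib.sha512(body).hexdigest()      # SHA-512 of the raw bytes
    if stored_checksums.get(url) == checksum:
        return False
    stored_checksums[url] = checksum                 # remember the new checksum
    return True

print(has_changed("http://example.com"))             # placeholder URL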
Using HEAD
Request the page using the HEAD verb and check the header fields:
Last-Modified: the server should provide the last time the page was generated or modified.
ETag: a checksum-like value defined by the server that should change as soon as the content changes.
Pros: much less bandwidth usage and very quick checks.
Cons: not all servers provide or obey these headers; you still need to fetch the real resource with a GET request once you detect that it has changed.
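For example, a quick sketch with the requests library; either header may simply be missing, and the URL is a placeholder:

import requests

previously_seen_etag = None                             # value saved from the previous check

response = requests.head("http://example.com")          # placeholder URL; headers only, no body
last_modified = response.headers.get("Last-Modified")   # may be None if the server omits it
etag = response.headers.get("ETag")                     # may also be None

if etag is not None and etag != previously_seen_etag:
    print("content probably changed")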
Using GET
Request the page using the GET verb with conditional header fields:
* If-Modified-Since: the server checks whether the resource has been modified since the given time and either returns the content or returns 304 Not Modified.
Pros: still uses less bandwidth, and the data arrives in a single round trip.
Cons: again, not all resources support this header.
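A small sketch of such a conditional GET with requests; the URL and the saved timestamp are placeholders:

import requests

# Send the timestamp remembered from the previous successful response.
headers = {"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"}
response = requests.get("http://example.com", headers=headers)   # placeholder URL

if response.status_code == 304:
    print("not modified, body not transferred")
else:
    print("changed, new content received:", len(response.content), "bytes")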
Finally, a mix of the above solutions is probably the optimal way to do this.
If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.
Would you look at the KB size of the HTML? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has "changed"?
That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.
Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have been changed?
You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
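A rough sketch of that byte-ordering idea; how to score the difference between two orderings (here, average rank displacement) is my own assumption:

from collections import Counter

def frequency_order(data):
    # Sort all 256 possible byte values by how often they occur (most frequent first).
    counts = Counter(data)
    return sorted(range(256), key=lambda b: counts.get(b, 0), reverse=True)

def order_distance(order_a, order_b):
    # Average how far each byte value moved between the two orderings (0 = identical).
    rank_b = {byte: rank for rank, byte in enumerate(order_b)}
    return sum(abs(rank - rank_b[byte]) for rank, byte in enumerate(order_a)) / 256.0

old = frequency_order(b"<html>old page contents ...</html>")
new = frequency_order(b"<html>old page contents, slightly edited ...</html>")
print(order_distance(old, new))   # small value -> probably the "same" document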
An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
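For instance, a simplified sketch that splits the document into fixed-size slices rather than by structure; the file names and the 30-slice / 2-change thresholds are placeholders:

import hashlib

def slice_hashes(text, slices=30):
    # Hash the document in roughly equal slices; a site-specific split
    # (header/body/headings/paragraphs) would be more robust than fixed offsets.
    step = max(1, len(text) // slices)
    chunks = [text[i:i + step] for i in range(0, len(text), step)]
    return [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]

def changed_slice_count(old_hashes, new_hashes):
    return sum(1 for a, b in zip(old_hashes, new_hashes) if a != b)

old = slice_hashes(open("first.html").read())     # earlier snapshot
new = slice_hashes(open("second.html").read())    # current snapshot
print("document changed" if changed_slice_count(old, new) > 2 else "treated as unchanged")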
You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
Hope this helps.
Store two versions of the HTML file: one is the HTML fetched an hour ago (first.html); the second is the HTML fetched now (second.html).
Run the command:
$ diff first.html second.html > diffs.txt
If diffs.txt contains any text, the file has changed.
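If you prefer to stay inside Python rather than shelling out, difflib can do an equivalent check (file names as in the answer above):

import difflib

with open("first.html") as f1, open("second.html") as f2:
    diff = list(difflib.unified_diff(f1.readlines(), f2.readlines()))

if diff:
    print("file changed, %d diff lines" % len(diff))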
Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.
You can even tell git to ignore "trivial" changes, such as adding and removing of whitespace characters to further optimize the search.
Practically what this comes down to is parsing the output of git diff -b --numstat HEAD HEAD^; which roughly translates to "find me what has changed in all the files, ignoring any whitespace changes, between the current state, and the previous state"; which will result in output like this:
2 37 en/index.html
meaning 2 insertions and 37 deletions were made to en/index.html.
Next you'll have to do some experimentation to find a "threshold" at which you would consider a change significant enough to process the files further; this will take time, as you will have to train the system (you can also automate this part, but that is another topic altogether).
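A rough sketch of driving and parsing that command from Python; the repository path and the 10-line threshold are assumptions:

import subprocess

# Assumes the downloaded pages are committed into a git repository at repo_dir
# and that there are at least two commits to compare.
repo_dir = "/path/to/page-archive"          # placeholder path
out = subprocess.check_output(
    ["git", "diff", "-b", "--numstat", "HEAD", "HEAD^"],
    cwd=repo_dir, text=True)

for line in out.splitlines():
    added, deleted, path = line.split("\t")                # numstat output is tab-separated
    if added != "-" and int(added) + int(deleted) > 10:    # threshold is an assumption
        print("significant change in", path)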
Unless you have a very good reason to do so, don't use your traditional relational database as a file system. Let the operating system take care of files; it's very good at that (something a relational database is not designed to manage).
You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-Modified" header in the response.
import requests

# HEAD fetches only the response headers, not the body.
response = requests.head(url)
datetime_str = response.headers["last-modified"]
Then keep checking in a while loop whether that field changes, and compare the datetime difference.
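A minimal sketch of such a loop, assuming a placeholder URL and a 60-second polling interval (it compares the raw header strings rather than parsed datetimes):

import time
import requests

url = "http://example.com"                    # placeholder URL
last_seen = None

while True:
    response = requests.head(url)
    stamp = response.headers.get("last-modified")
    if stamp is not None and stamp != last_seen:
        if last_seen is not None:
            print("page was modified, new Last-Modified:", stamp)
        last_seen = stamp
    time.sleep(60)                            # wait a bit and check again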
I wrote a little program in Python to do that:
https://github.com/javierdechile/check_updates_http
I'm working on an academic project (a search engine). The main functions of this search engine are:
1. crawling
2. storing
3. indexing
4. page ranking
All the sites that my search engine will crawl are available locally, which means it's an intranet search engine.
After the crawler stores the files it finds, those files need to be served quickly for caching purposes.
So I wonder: what is the fastest way to store and retrieve these files?
The first idea that came up was to use FTP or SSH, but these are connection-based protocols; the time to connect, find the file, and fetch it is lengthy.
I've already read about Google's anatomy (the search engine architecture paper) and saw that they use a data repository. I'd like to do the same, but I don't know how.
NOTES: I'm using Linux/Debian, and the search engine back end is coded in C/C++. Help!
Storing individual files is quite easy - wget -r http://www.example.com will store a local copy of example.com's entire (crawlable) content.
Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.
Another thing to consider is that maybe you don't really want to store all the pages yourself, but just forward to the site that actually contains the pages; that way, you only need to store a reference to which page contains which words, not the entire page. Since a lot of pages will have much repeated content, you only really need to store the unique words in your database and a list of pages that contain each word (and if you also filter out words that occur on nearly every page, such as "if", "and", "it", "to", "do", etc., you can reduce the amount of data that you need to store). Count the number of occurrences of each word on each page, and then compare different pages to find the ones that are meaningless to search.
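The question's back end is C/C++, but an illustrative Python sketch of that word-to-pages index with a stop-word filter may help; the stop-word list and the URLs are just placeholders:

import re
from collections import defaultdict

STOP_WORDS = {"if", "and", "it", "to", "do", "the", "a", "of"}   # partial list, placeholder

inverted_index = defaultdict(set)   # word -> set of pages containing it

def index_page(url, html):
    for word in re.findall(r"[a-z]+", html.lower()):
        if word not in STOP_WORDS:
            inverted_index[word].add(url)

index_page("http://intranet/page1", "<p>crawling and indexing the intranet</p>")
index_page("http://intranet/page2", "<p>page ranking for the intranet</p>")
print(sorted(inverted_index["intranet"]))   # pages that contain the word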
Well, if the program is to be constantly running during operation, you could just store the pages in RAM - grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
I gather from the question that the user is on a different machine from the search engine, and therefore from the cache. Perhaps I'm overlooking something obvious here, but couldn't you just send them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.
As noted, "Sitecore Quick Search" uses the system index to perform instant search.
How do I point "Sitecore Quick Search" to a custom index when searching?
The instant search eventually starts the core search pipeline, which is also used by other types of searches within the back end. One of the processors of the search pipeline (SearchSystemIndex) explicitly instructs the system to search the system index.
Since this is a core piece of search functionality, I would not recommend changing the logic there. If you still decide to go this way, then inserting a custom processor into the search pipeline seems to be the starting point. I suspect there will be a number of unexpected issues to overcome in order not to break the way search works throughout the system.
Will a long filter string affect search performance in LDAP?
Will it affect the search time?
Thanks.
Of course it will, simply because it's large :)
I would suggest 2 things:
1) Post an example search of what you're trying to do; perhaps your definition of large is different from other people's.
2) Benchmark it yourself to see; if the search takes 1 second and that's acceptable, then you have solved your problem (see the sketch below).
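For the benchmarking part, here is a rough sketch using the third-party ldap3 library; the server, credentials, base DN, and filter are all placeholders:

import time
from ldap3 import Server, Connection

server = Server("ldap.example.com")                       # placeholder host
conn = Connection(server, user="cn=reader,dc=example,dc=com",
                  password="secret", auto_bind=True)      # placeholder credentials

long_filter = "(&(objectClass=person)(|(cn=a*)(cn=b*)(cn=c*)))"   # your long filter here

start = time.perf_counter()
conn.search("dc=example,dc=com", long_filter, attributes=["cn"])
print("entries: %d, took %.3f s" % (len(conn.entries), time.perf_counter() - start))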
On a side note, you can direct LDAP queries to specific domain controllers, so during testing and production, aiming them at one of the "backup" domain controllers might be a good idea.