using cfhttp in multiple files taking too much time - coldfusion

I don't know if it's possible, but I just want to ask whether we can use cfhttp (or anything else) to read a selected amount of data instead of putting the whole file into CFHTTP.FileContent.
I am using cfhttp and want to read only the last two lines from some remote XML files (about 20 of them) and the middle two lines from some text files (about 7 of them). Is there any way I could read just that specific data instead of getting the whole files? It's taking a lot of time right now (about 15-20 seconds), and I just want to reduce the run time of my .cfm page.
Any suggestions?

Hmm, not really any special way to get just parts of the remote files.
Do you have to do it every time? Could you fetch the files in the background, write them locally, and have your actual incoming requests just read those files? Make the reading of the remote files asynchronous to the incoming requests?
If not, and you're using CF8+, you could use CFTHREAD to thread out the various requests to run in parallel: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=Tags_t_04.html
You can use the "join" action at the end to wait for all of the threads to complete.
Edit:
Here's a great tutorial by Ben Nadel on using CFThread to parallelize CFHTTP requests:
http://www.bennadel.com/blog/749-Learning-ColdFusion-8-CFThread-Part-II-Parallel-Threads.htm
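To make the parallel-fetch pattern concrete outside of CFML, here is a minimal Python sketch of the same idea (the URLs and the "last two lines" post-processing are placeholders); in ColdFusion you would wrap each cfhttp call in a cfthread and join them at the end, as described above.

# Minimal sketch of the parallel-fetch pattern (a Python analogue of
# cfthread + cfhttp + join). URLs and post-processing are placeholders.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]  # ~27 in practice

def fetch_tail(url, n=2):
    # Download one file and keep only its last n lines.
    with urllib.request.urlopen(url, timeout=10) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    return url, lines[-n:]

# Issue all requests at the same time instead of one after another;
# leaving the "with" block is the implicit "join".
with ThreadPoolExecutor(max_workers=10) as pool:
    results = dict(pool.map(fetch_tail, urls))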
There's something else, though:
27 or so sequential HTTP requests should not take 15-20 seconds. It really shouldn't even take 1-2 seconds, so you may have some other serious issue going on here.

HTTP does not have the ability to read a file in that manner. This has nothing to do with ColdFusion.
You can use some smart caching to reduce the time somewhat, at the cost of a longer first run, by using CFHTTP's method="HEAD", which does not download the file body.
Do you have a local copy of the page?
No? Use CFHTTP method="GET" to grab and store it.
Yes? Use CFHTTP method="HEAD" to check the timestamp and compare it to the cached version. If the cached copy is still current, use it; otherwise use CFHTTP method="GET" to grab and parse the file you want.
method="HEAD" will only grab the http headers and not the entire file which will speed things up ever so slightly. Either way, you are making almost 30 file requests, so this isn't going to be instantaneous either way you cut it.

How about asking CF to serve only that chunk of the file, using URL params?
Since it is XML, I guess you could use xmlSearch() and return only the result?
As for the text files, you could pass in startline & numOfLines and return only those lines as a string?
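If it helps to see the shape of that server-side chunking, here is a rough sketch of the line-extraction part in Python (in CF it would be a small .cfm page taking the same startline/numOfLines URL parameters; the file path is a placeholder):

# Sketch of a "return only these lines" helper that a small remote endpoint
# could expose via URL parameters (startline, numOfLines, as suggested above).
def read_lines(path, startline, num_of_lines):
    selected = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if i >= startline:
                selected.append(line.rstrip("\n"))
            if len(selected) == num_of_lines:
                break
    return "\n".join(selected)

# e.g. the "middle two lines" of a 100-line text file: read_lines("data.txt", 50, 2)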

Related

How should I go about serving a json file to a website in my current architecture

Sorry for absolutely murdering the title, but I am not sure how to frame this question; please edit it if there is a better way of explaining my problem.
I am reading a bitstream from a program, converting it into JSON data, and writing it to a socket, where another program reads the data and appends it to a log.json file. I am doing all of this in C++.
Now I want to display this data in a better way, so why not display it in an HTML document with some CSS applied?
My first thought was to simply fetch the file with JavaScript, but nowadays this throws an error.
So my second thought was to create a simple node.js server that accepts GET requests and use it to serve the file. But this feels like a bit of overkill.
My third thought is to use my original server (which continuously reads from the socket) and have it also accept HTTP requests. But then I would have to multithread it, which again seems like overkill.
So I'm kind of falling back to needing two different "servers": one that reads from the socket and appends to the log file, and another to serve the file to the website.
Am I thinking wrong here? What would be a good way to solve this?

How to check if content of webpage has been changed?

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.
I'm thinking of comparing hashes; the problem with this is that if the page has changed by even a single byte or character, the hash will be different. So, for example, if the page displays the current date, the hash will be different every single time and tell me that the content has been updated.
So... how would you do this? Would you look at the KB size of the HTML? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has "changed"? Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have changed?
About Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it is a reliable solution. I think a better way is to combine the hash and content-length approaches: check the hash, and if it has changed, also check the string length.
There is no universal solution.
Use If-Modified-Since or HEAD requests when possible (usually ignored by dynamic pages).
Use RSS when possible.
Extract the last-modification stamp in a site-specific way (news sites have publication dates for each article, easily extractable via XPath).
Only hash the interesting elements of the page (build a site-specific model), excluding volatile parts; see the sketch after this list.
Hash the whole content (useless for dynamic pages).
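Here is a minimal sketch of that "hash only the interesting parts" idea, assuming the volatile bits can be stripped with a couple of site-specific regexes (the patterns below are invented examples; a real model would be built per site):

# Sketch: strip site-specific volatile parts, then hash what is left.
import hashlib
import re

VOLATILE = [
    re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?"),   # timestamps
    re.compile(r"<!--.*?-->", re.S),                             # comments / tracking noise
]

def content_fingerprint(html):
    for pattern in VOLATILE:
        html = pattern.sub("", html)
    return hashlib.sha512(html.encode("utf-8")).hexdigest()

# changed = content_fingerprint(new_html) != content_fingerprint(old_html)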
Safest solution:
Download the content, create a checksum using a SHA-512 hash of the content, keep it in the DB, and compare it each time.
Pros: you are not dependent on any server headers and will detect any modification.
Cons: heavy bandwidth usage; you have to download all the content every time.
Using HEAD
Request the page using the HEAD verb and check the response headers:
Last-Modified: the server should report the last time the page was generated or modified.
ETag: a checksum-like value defined by the server that should change as soon as the content changes.
Pros: much less bandwidth usage and a very quick check.
Cons: not all servers provide or obey these headers, and you still need to fetch the real resource with a GET request if you find that the data has changed.
Using GET
Request the page using the GET verb with conditional request headers:
* If-Modified-Since: the server checks whether the resource has been modified since the given time and returns either the content or 304 Not Modified.
Pros: still uses less bandwidth, and a single round trip receives the data.
Cons: again, not all resources support this header.
Finally, a mix of the above solutions is probably the optimal approach.
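For instance, a conditional GET could look roughly like this with the Python requests library (the URL is a placeholder):

# Sketch of a conditional GET: only download the body when the page has changed.
import requests

url = "https://example.com/page"     # placeholder
last_seen = None                      # Last-Modified value from the previous fetch

def poll_once():
    global last_seen
    headers = {"If-Modified-Since": last_seen} if last_seen else {}
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:       # Not Modified: nothing to do
        return None
    last_seen = resp.headers.get("Last-Modified", last_seen)
    return resp.text                  # changed (or first fetch): process the content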
If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.
Would you look at the Kb size of the HTML? Would you look at the string length and check if for example the length has changed more than 5%, the content has been "changed"?
That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.
Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content has been changed?
You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
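For instance, a rough sketch of that frequency-ordering idea, restricted to letters as suggested (the distance measure and any threshold are just examples to tune):

# Sketch of "order letters by frequency, then diff the orderings".
from collections import Counter
import string

def letter_order(text):
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return "".join(sorted(string.ascii_lowercase, key=lambda c: -counts[c]))

def order_distance(old_text, new_text):
    a, b = letter_order(old_text), letter_order(new_text)
    # Crude distance: how far each letter moved in the frequency ranking.
    return sum(abs(a.index(c) - b.index(c)) for c in string.ascii_lowercase)

# Treat the page as "changed" only if order_distance(...) exceeds a tuned threshold.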
An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
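A minimal sketch of that slice-hashing variant, splitting on blank lines as a crude stand-in for real heading/paragraph structure (a real version would slice with an HTML parser):

# Sketch of "hash a set of slices and tolerate a few changed slices".
import hashlib

def slice_hashes(text):
    chunks = [c for c in text.split("\n\n") if c.strip()]   # crude slicing
    return set(hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks)

def changed_slices(old_text, new_text):
    old, new = slice_hashes(old_text), slice_hashes(new_text)
    return len(old - new)      # how many old slices no longer appear

# e.g. consider the document "the same" if changed_slices(old, new) <= 2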
You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
Hope this helps.
Store two versions of the HTML file:
first.html -- the HTML fetched an hour ago
second.html -- the HTML fetched just now
Run the command:
$ diff first.html second.html > diffs.txt
If diffs.txt contains any text, then the file has changed.
Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.
You can even tell git to ignore "trivial" changes, such as adding and removing whitespace characters, to further optimize the search.
Practically what this comes down to is parsing the output of git diff -b --numstat HEAD HEAD^; which roughly translates to "find me what has changed in all the files, ignoring any whitespace changes, between the current state, and the previous state"; which will result in output like this:
2 37 en/index.html
2 insertions were made, 37 deletions were made to en/index.html
Next you'll have to do some experimentation to find a "threshold" at which you would consider a change significant enough to process the files further; this will take time, as you will have to train the system (you can also automate this part, but that is another topic altogether).
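For example, a rough way to read that numstat output from a script (the threshold of 5 is just a placeholder for whatever you settle on):

# Sketch: run the git command from above and pull out insertions/deletions.
import subprocess

out = subprocess.run(
    ["git", "diff", "-b", "--numstat", "HEAD", "HEAD^"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    added, deleted, path = line.split("\t")
    if added == "-":                      # binary file, skip
        continue
    if int(added) + int(deleted) > 5:     # placeholder threshold
        print("significant change in", path)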
Unless you have a very good reason to do so, don't use a traditional relational database as a file system. Let the operating system take care of files, which it's very good at (something a relational database is not designed to manage).
You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-modified" header in the response.
import requests
response = requests.head(url)                      # HEAD request: headers only, no body
datetime_str = response.headers["last-modified"]   # when the page was last modified
Then keep checking in a while loop whether that field changes, and compare the datetime difference.
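The polling loop itself could look roughly like this (the interval is a placeholder; the header lookup is case-insensitive in requests):

# Sketch of the polling loop described above: re-check Last-Modified until it changes.
import time
import requests

def wait_for_change(url, interval_seconds=300):
    baseline = requests.head(url).headers.get("last-modified")
    while True:
        time.sleep(interval_seconds)
        current = requests.head(url).headers.get("last-modified")
        if current != baseline:
            return current        # the page was modified; run your code here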
I did a little program on Python to do that:
https://github.com/javierdechile/check_updates_http

web.config vs. text file for storing a comma-separated value

We have a collection of VB.NET / IIS web services on some of our servers, and they have web.config files in the websites' root directories that they already read configuration from. There is a new configuration value that needs to be added; it will immediately be quite a bit longer than the others, and it will only continue to grow. It's essentially a comma-separated value, and I want to keep it in a configuration file of some sort.
At first I started doing this with a text file, but there was a problem with that. The text file's contents could change while web service threads and processes are running, so they would essentially need to re-read the file every time they accessed its values. I thought about using some sort of caching, but unless the web services are completely restarted each time the file is updated, caching would prevent updates to the file from taking effect immediately. And reading from a text file on every access is slow...
Then came the idea of putting that value in web.config, along with the other configuration the services are already using. When web.config is altered, the changes can be cached in the code and also take effect immediately. However, web.config is, well, web.config; it's not just a trivial text file that the code reads from. IIS treats web.config in a special manner.
I'm tempted to think any negative consequences of putting a comma-separated value in web.config would be outweighed, compared to storing it in its own text file (or a database, which probably can't be used for this anyway), but I figured I'd better ask.
What are the implications of storing a possibly lengthy comma-separated value in web.config instead of in its own little text file? Is either file a particularly good or bad idea? To me it seems like web.config would be easy to work with without having to re-read the file over and over, but there's certainly more to it than the average user is aware of. Thanks!
I recommend using the Application Cache for this:
http://msdn.microsoft.com/en-us/library/vstudio/6hbbsfk6(v=vs.100).aspx

Store a large number of HTML files

I'm working on an academic project (a search engine). The main functions of this search engine are:
1. crawling
2. storing
3. indexing
4. page ranking
All the sites that my search engine will crawl are available locally, which means it's an intranet search engine.
After the crawler stores the files it finds, those files need to be served quickly, for caching purposes.
So I wonder: what is the fastest way to store and retrieve these files?
The first idea that came up was to use FTP or SSH, but these are connection-based protocols; the time to connect, find the file, and fetch it is lengthy.
I've already read Google's "anatomy" paper and saw that they use a data repository; I'd like to do the same, but I don't know how.
NOTES: I'm using Linux/Debian, and the search engine back-end is coded in C/C++. Help!
Storing individual files is quite easy - wget -r http://www.example.com will store a local copy of example.com's entire (crawlable) content.
Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.
Another thing to consider is that maybe you don't really want to store all the pages yourself, but just forward users to the site that actually contains them - that way, you only need to store a reference to which page contains which words, not the entire page. Since a lot of pages will have much repeated content, you only really need to store the unique words in your database and a list of the pages that contain each word. If you also filter out words that occur on nearly every page, such as "if", "and", "it", "to", "do", etc., you can reduce the amount of data that you need to store. Count the occurrences of each word on each page, and then compare different pages, to find the pages that are meaningless to search.
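A toy sketch of that word-to-pages index with stop-word filtering (the stop-word list and tokenization are simplified placeholders; a real index for your intranet would of course live in the C/C++ back-end):

# Toy sketch of "store unique words and which pages contain them".
import re
from collections import defaultdict

STOP_WORDS = {"if", "and", "it", "to", "do", "the", "a", "of"}   # simplified placeholder list

index = defaultdict(set)          # word -> set of page URLs containing it

def add_page(url, text):
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in STOP_WORDS:
            index[word].add(url)

def search(query):
    # Pages that contain every (non-stop) query word.
    sets = [index.get(w, set()) for w in query.lower().split() if w not in STOP_WORDS]
    return set.intersection(*sets) if sets else set()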
Well, if the program is to be constantly running during operation, you could just store the pages in RAM - grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
I gather from the question that the user is on a different machine from the search engine, and therefore from the cache. Perhaps I am overlooking something obvious here, but couldn't you just send them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.

Replace strings in large file

I have a server-client application where clients are able to edit data in a file stored on the server side. The problem is that the file is too large to load into memory (8 GB+). There could be around 50 string replacements per second invoked by the connected clients, so copying the whole file and replacing the specified string with the new one is out of the question.
I was thinking about saving all the changes in a cache on the server side and performing the replacements only after a certain amount of data has accumulated. At that point I would perform the update by copying the file in small chunks and replacing the specified parts.
This is the only idea I have come up with, but I was wondering whether there might be another way, or what problems I could encounter with this method.
When you have more than 8 GB of data that is edited by many users simultaneously, you are far beyond what can be handled with a flat file.
You seriously need to move this data to a database. Regarding your comment that "the file content is no fit for a database": sorry, but I don't believe you. Especially regarding your remark that "many people can edit it" - that's one more reason to use a database. On a filesystem, only one user at a time can have write access to a file. But a database allows concurrent write access for multiple users.
We could help you to come up with a database schema, when you open a new question telling us how your data is structured exactly and what your use-cases are.
You could use some form of indexing on your data (in a separate file) to allow quick access to the relevant parts of this gigantic file - we've been doing this successfully with large files (~200-400 GB). But as Phillipp mentioned, you should move that data to a database, especially for the concurrent read/write access. Some frameworks (like OSG) already come with a database back-end for 3D terrain data, so you can peek at how they do it.
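For what it's worth, a minimal sketch of such a separate index, assuming the big file consists of newline-delimited records and that a replacement never changes a record's length (otherwise you are back to rewriting the tail of the file):

# Sketch of a separate byte-offset index over a huge record-per-line file,
# so a single record can be read or patched in place without scanning 8 GB.
# Assumes fixed-length (padded) records so an in-place overwrite is safe.
def build_index(path):
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        for line in iter(f.readline, b""):
            offsets.append(pos)
            pos = f.tell()
    return offsets

def overwrite_record(path, offsets, record_no, new_bytes):
    with open(path, "r+b") as f:
        f.seek(offsets[record_no])
        f.write(new_bytes)        # must be exactly the old record's length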