Google Bot information? - c++

Does anyone know any more details about Google's web crawler (aka Googlebot)? I was curious about what it was written in (I've made a few crawlers myself and am about to make another) and whether it parses images and such. I'm assuming it does somewhere along the line, because the images in images.google.com are all resized. It also wouldn't surprise me if it was all written in Python and if they used their own libraries for almost everything, including HTML/image/PDF parsing. Maybe they don't, though. Maybe it's all written in C/C++. Thanks in advance.

You can find a bit about how Googlebot works here:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=158587
For example, the "Fetch as Googlebot" tool lets you see a page as Googlebot sees it.

The crawler is very likely written in C or C++; at least, I believe BackRub's crawler was written in one of these.
Be aware that the crawler only takes a snapshot of the page, then stores it in a temporary database for later processing. The indexing and other attached algorithms will extract the data, for example the image references.
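As a rough illustration of that second stage (not Google's actual code, which isn't public), here is a minimal Python sketch that takes a stored HTML snapshot and pulls out the image references. The names ImageRefExtractor and extract_image_refs are made up for this example, and it only looks at plain <img src=...> tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageRefExtractor(HTMLParser):
    """Collects absolute URLs of <img src=...> tags found in a stored page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative references against the page's own URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_refs(snapshot_html, base_url):
    """Hypothetical post-processing step run on a snapshot after crawling."""
    parser = ImageRefExtractor(base_url)
    parser.feed(snapshot_html)
    parser.close()
    return parser.image_urls
```

The point is only that fetching and extraction can be separate passes: the crawler stores the raw page, and something like extract_image_refs runs over the stored copy later.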

Officially allowed languages at Google, I think, are Python/C++/Java.
The bot likely uses all 3 for different tasks.

Related

Are there any browsers that send multipart/form-data sub-parts?

I am writing a webserver in C++. I am looking at the POST documentation on w3:
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
I see that a POST is supposed to support the full multi-parts scheme: parts and sub-parts (and obviously, sub-sub-parts...) just like for email attachments.
Is there any browser and/or tool that does that on a regular basis? In other words, is it really important for a server to support parts and sub-parts?
The obvious problem with that is the fact that it could mean that two files are uploaded under the same name. That's quite a problem if you ask me. Also, from what I can see in PHP it is not supported at all in that realm. Am I correct?
Ah! I guess I should have searched a little more and to tell you the truth I had not thought of looking at HTML5 for the answer.
The following paragraph actually includes the answer:
http://www.w3.org/html/wg/drafts/html/master/forms.html#multipart-form-data
Note: In particular, this means that multiple files submitted as part of a single <input type=file multiple> element will result in each file having its own field; the "sets of files" feature ("multipart/mixed") of RFC 2388 is not used.
So it is clear that sub-parts (multipart/mixed) are not to be supported.
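To make the consequence concrete, here is a small Python sketch (my server is in C++, this is just the quickest way to show the shape of the data) that parses a flat multipart/form-data body with the standard email package. The function name parse_form_data is invented for the example, and it is meant for small text payloads, not as a production parser. Two files submitted from one input simply arrive as two top-level parts with the same field name.

```python
from email import message_from_bytes
from email.policy import HTTP


def parse_form_data(content_type, body):
    """Return a list of (field name, filename, payload bytes), one per part."""
    # The email parser needs the Content-Type header in front of the body
    # so that it knows the boundary string.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw, policy=HTTP)
    fields = []
    for part in message.iter_parts():
        name = part.get_param("name", header="content-disposition")
        # get_filename() returns None for ordinary (non-file) fields.
        fields.append((name, part.get_filename(), part.get_payload(decode=True)))
    return fields
```

So the server has to cope with repeated field names at the top level, but it does not need to recurse into nested multipart/mixed parts.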

Creating QR Codes with ColdFusion

Is there any way with pure ColdFusion/cfscript to produce a QR code, without relying on external APIs or JavaScript?
No. ColdFusion cannot generate bar codes by itself. You need a separate tool or library. It is easy enough to install a java library, like ZXing. Then generate the images from CF. Alternately, you could do a <cfhttp> call to an external server that generates the bar code image for you, or basically do the same thing with javascript. You would not need to install anything for the latter two (2) options. But they still rely on an external resource.
Bottom line you need something more than just ColdFusion. What is the reason you cannot use either an external API or javascript? Because without either of those, you are probably out of luck.
Edit based on comments:
If the only restriction is that the images must be generated locally, then you can use ZXing as described in the link above -OR- any of the other components/libraries mentioned in the other responses, like Joe's suggestion which uses iText (though also based on ZXing).
Some other external APIs
http://cfbarbecue.riaforge.org/
http://zanstra.com/my/Barcode.html?barcode=3PTSP8827A231
If you really wanted to, you could look up (perhaps you need to buy?) the encoding standard for QR codes, which I believe is an ISO standard. Then you could write a program which would output a table with the appropriate number of rows and columns, each with either a black or a white background. I wouldn't recommend this form of "rolling your own" though; it's a lot of work to do essentially what's been done before.
Tim Cunningham wrote a library, hosted on GitHub, that uses iText to do exactly this: https://github.com/boltz/QRToad

Scan for changed files

I'm looking for a good, efficient method for scanning a directory structure for changed files in Windows XP+. Something like what git does is exactly what I'm looking for: running git status displays all modified files, all new (untracked) files and deleted files very quickly.
I have a basic model up and running which performs an initial scan and stores all filenames, size, dates and attributes.
On a subsequent scan it checks if the size, attributes or date have changed and marks as a changed file.
My issue now comes in detecting moved and deleted files. Is there a tried and tested method for this sort of thing? I'm struggling to come up with a good method.
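For the full-scan case, one common heuristic (an assumption on my part, not something git itself does) is to diff two snapshots and then pair up paths that vanished with paths that appeared carrying the same size and timestamp. A minimal Python sketch, with take_snapshot and diff_snapshots as made-up names and only size and modification time recorded:

```python
import os


def take_snapshot(root):
    """Map each relative file path under root to (size, mtime_ns)."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            info = os.stat(full_path)
            key = os.path.relpath(full_path, root)
            snapshot[key] = (info.st_size, info.st_mtime_ns)
    return snapshot


def diff_snapshots(old, new):
    """Classify paths as modified, moved, deleted or created."""
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    deleted = old.keys() - new.keys()
    created = new.keys() - old.keys()
    # Heuristic: a path that vanished and a path that appeared with the same
    # size and timestamp are assumed to be one file under a new name.
    by_signature = {}
    for path in sorted(created):
        by_signature.setdefault(new[path], []).append(path)
    moved = []
    for path in sorted(deleted):
        candidates = by_signature.get(old[path])
        if candidates:
            moved.append((path, candidates.pop(0)))
    moved_from = {src for src, _dst in moved}
    moved_to = {dst for _src, dst in moved}
    return {
        "modified": modified,
        "moved": moved,
        "deleted": sorted(deleted - moved_from),
        "created": sorted(created - moved_to),
    }
```

Anything left unpaired is treated as a real delete or a real create. Two unrelated files with identical size and timestamp would be mis-paired, so a content hash could be added to the signature if that matters.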
I should mention that it will eventually use ReadDirectoryChangesW to monitor files and alert the user when something changes so a full scan is really a last resort after the initial scan.
Thanks,
J
EDIT: I think I may have described the problem badly. The issue I'm facing is not so much detecting the changes (I have ReadDirectoryChangesW() using IOCP on multiple threads to detect when a change happens); the issue is more what to do with the information. For example, a moved file is reported as a delete followed by a create, and a rename comes in two parts: the old name, followed by the new name. So what I'm asking is how to differentiate between a delete that is part of a move and an actual delete. I'm guessing that buffering the changes and processing them in batches would be an option, but it feels messy.
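The buffering idea can be sketched like this in Python (just to show the logic, the real thing would live in the native notification handler). The event tuples, the "removed"/"added" action names and the one-second window are all assumptions for the example; a removal is held back briefly and only reported as a real delete if no matching add for the same file name turns up in time.

```python
from pathlib import PureWindowsPath


def coalesce_moves(events, window=1.0):
    """Turn (timestamp, action, path) events into moved/removed/added records.

    events must be sorted by timestamp; action is "removed" or "added".
    """
    pending = {}  # file name -> (timestamp, path) of an unmatched removal
    result = []
    for timestamp, action, path in events:
        # Removals that waited longer than the window are real deletes.
        expired = [n for n, (t, _p) in pending.items() if timestamp - t > window]
        for name in expired:
            result.append(("removed", pending.pop(name)[1]))
        name = PureWindowsPath(path).name
        if action == "removed":
            if name in pending:
                # An older removal of the same name can no longer be matched.
                result.append(("removed", pending.pop(name)[1]))
            pending[name] = (timestamp, path)
        elif action == "added" and name in pending:
            result.append(("moved", pending.pop(name)[1], path))
        else:
            result.append((action, path))
    # Whatever is still pending at the end never found a partner.
    result.extend(("removed", p) for _t, p in pending.values())
    return result
```

Matching on file name alone is crude; comparing size or a file ID as well would cut down on false pairings.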
In native code FileSystemWatcher is replaced by ReadDirectoryChangesW. Using this properly is not simple, there is a good baseline to build off here.
I have used this code in a previous job and it worked pretty well. The Win32 API itself (and FileSystemWatcher) is prone to problems that are described in the docs and also discussed in various places online, but the impact of those will depend on your use case.
EDIT: the exact change is indicated in the FILE_NOTIFY_INFORMATION structure that you get back - adds, removals, rename data including old and new name.
I voted Liviu M. up. However, another option if you don't want to use the .NET framework for some reason, would be to use the basic Win32 API call FindFirstChangeNotification.
You can use USN journaling if you are up to it, that is pretty low level (NTFS level) stuff.
Here you can find detailed information and source code included. It is written in C# but most of it is PInvoking C/C++ functions.

How would I get a subset of Wikipedia's pages?

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML but it's more like 1 or 2 gigs; I don't need that much.
I want to experiment with implementing a map-reduce algorithm.
Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.
Edit: Any that aren't torrents? I can't get those at work.
The stackoverflow database is available for download.
Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
One option is to download the entire Wikipedia dump, and then use only part of it. You can either decompress the entire thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, and then you can stop the decompression at any stage you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
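A minimal sketch of the decompress-and-stop idea, assuming the dump is a .bz2 file and using only Python's standard bz2 module (take_prefix and its parameters are names made up for this example):

```python
import bz2


def take_prefix(dump_path, out_path, max_bytes, chunk_size=1 << 20):
    """Decompress dump_path on the fly and stop after max_bytes of output."""
    written = 0
    with bz2.open(dump_path, "rb") as source, open(out_path, "wb") as target:
        while written < max_bytes:
            # Never ask for more than what is still missing.
            chunk = source.read(min(chunk_size, max_bytes - written))
            if not chunk:
                break  # the dump was smaller than max_bytes
            target.write(chunk)
            written += len(chunk)
    return written
```

Calling take_prefix with max_bytes=100 * 1024 * 1024 would give roughly the 100MB asked for. Note that the cut lands in the middle of an article, so the output is not well-formed XML; for map-reduce experiments on raw text that may not matter.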
If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
If you wanted to get a copy of the stackoverflow database, you could do that from the creative commons data dump.
Out of curiosity, what are you using all this data for?
You could use a web crawler and scrape 100MB of data?
There are a lot of wikipedia dumps available. Why do you want to choose the biggest (english wiki)? Wikinews archives are much smaller.
One smaller subset of Wikipedia articles comprises the 'meta' wiki articles. This is in the same XML format as the entire article dataset, but smaller (around 400MB as of March 2019), so it can be used for software validation (for example testing GenSim scripts).
https://dumps.wikimedia.org/metawiki/latest/
You want to look for any files with the -articles.xml.bz2 suffix.

How do I write a Perl script to filter out digital pictures that have been doctored?

Last night before going to bed, I browsed through the Scalar Data section of Learning Perl again and came across the following sentence:
the ability to have any character in a string means you can create, scan, and manipulate raw binary data as strings.
An idea immediately hit me that I could actually let Perl scan the pictures that I have stored on my hard disk to check if they contain the string Adobe. It seems by doing so, I can tell which of them have been photoshopped. So I tried to implement the idea and came up with the following code:
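#!perl
use autodie;
use strict;
use warnings;
{
    # Read in paragraph-sized chunks (records end at a blank line).
    local $/ = "\n\n";
    my $dir = 'f:/TestPix/';
    my @pix = glob "$dir/*";
    foreach my $file (@pix) {
        next unless -f $file;    # skip subdirectories
        open my $pic, '<:raw', $file;    # raw mode: the data is binary
        while (<$pic>) {
            if (/Adobe/) {
                print "$file\n";
                last;            # one hit per file is enough
            }
        }
        close $pic;
    }
}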
Excitingly, the code really seems to work, and it does the job of filtering out the pictures that have been photoshopped. But the problem is that many pictures are edited by other utilities. I think I'm kind of stuck there. Do we have some simple but universal method to tell if a digital picture has been edited or not, something like
if (!= /the original format/) {...}
Or do we simply have to add more conditions? like
if (/Adobe/|/ACDSee/|/some other picture editors/)
Any ideas on this? Or am I oversimplifying due to my miserably limited programming knowledge?
Thanks, as always, for any guidance.
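For what it's worth, the "add more conditions" variant is easy to generalize by keeping the editor names in a list. A minimal sketch in Python rather than Perl, where find_editor_markers and the marker list are just assumptions taken from the names in the question:

```python
from pathlib import Path

# Example markers only: the two editor names mentioned in the question.
EDITOR_MARKERS = (b"Adobe", b"ACDSee")


def find_editor_markers(path, markers=EDITOR_MARKERS):
    """Return the names of the known markers that occur in the file's bytes."""
    data = Path(path).read_bytes()
    return [marker.decode("ascii") for marker in markers if marker in data]
```

An empty result only means none of the listed strings was found; it says nothing about whether the picture was edited.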
Your best bet in Perl is probably ExifTool. This gives you access to whatever non-image information is embedded into the image. However, as other people said, it's possible to strip this information out, of course.
I'm not going to say there is absolutely no way to detect alterations in an image, but the problem is extremely difficult.
The only person I know of who claims to have an answer is Dr. Neal Krawetz, who claims that digitally altered parts of an image will have different compression error rates from the original portions. He claims that re-saving a JPEG at different quality levels will highlight these differences.
I have not found this to be the case, in my investigations, but perhaps you might have better results.
No. There is no functional distinction between a perfectly edited image, and one which was the way it is from the start - it's all just a bag of pixels in the end, after all, and any other metadata you can remove or forge all you want.
The name of the graphics program used to edit the image is not part of the image data itself but of something called meta data - which may be stored in the image file but, as others have noted, is neither required (so some programs may not store it, some may allow you an option of not storing it) nor reliable - if you forged an image, you might have forged the meta data as well.
So the answer to your question is "no, there's no way to universally tell if the pic was edited or not", although some image editing software may write its signature into the image file, and it may be left there through the carelessness of the person doing the editing.
If you're inclined to learn more about image processing in Perl, you could take a look at some of the excellent modules CPAN has to offer:
Image::Magick - read, manipulate and write of a large number of image file formats
GD - create colour drawings using a large number of graphics primitives, and emit the drawings in various formats.
GD::Graph - create charts
GD::Graph3d - create 3D Graphs with GD and GD::Graph
However, there are other utilities available for identifying various image formats. It's more of a question for Super User, but for various unix distros you can use file to identify many different types of files, and for MacOSX, Graphic Converter has never let me down. (It was even able to open the bizarre multi-file X-ray of my cat's shattered pelvis that I got on a disc from the vet.)
How would you know what the original format was? I'm pretty sure there's no guaranteed way to tell if an image has been modified.
I can just open the file (with my favourite programming language and filesystem API) and just write whatever I want into that file willy-nilly. As long as I don't screw something up with the file format, you'd never know it happened.
Heck, I could print the image out and then scan it back in; how would you tell it from an original?
As others have stated, there is no way to know if the image was doctored. I'm guessing what you basically want to know is the difference between a realistic photograph and one that has been enhanced or modified.
There's always the option of running some extremely complex image recognition algorithm that would analyze every pixel in your image and do some very complicated stuff to determine if the image was doctored or not. This solution would probably involve AI which would examine millions of photos that are both doctored and those that are not and learn from them. However, this is more of a theoretical solution and isn't very practical... you would probably only see it in movies. It would be extremely complex to develop and probably take years. And even if you did get something like this to work, it probably still wouldn't be 100% correct all the time. I'm guessing AI technology still isn't at that level and could take a while until it is.
A not-commonly-known feature of exiftool allows you to recognize the originating software through an analysis of the JPEG quantization tables (not relying on image metadata). It recognizes tables written by many applications. Note that some cameras may use the same quantization tables as some applications, so this isn't a 100% solution, but it is worth looking into. Here is an example of exiftool run on two images, the first was edited by photoshop.
> exiftool -jpegdigest a.jpg b.jpg
======== a.jpg
JPEG Digest : Adobe Photoshop, Quality 10
======== b.jpg
JPEG Digest : Canon EOS 30D/40D/50D/300D, Normal
2 image files read
This will work even if the metadata has been removed.
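To show what is being compared here, the following Python sketch reads the quantization tables (DQT segments) straight out of a JPEG's bytes. It is not how exiftool is implemented; read_quantization_tables is a made-up name, and the sketch only extracts the tables, it does not match them against any database of known cameras or applications.

```python
import struct


def read_quantization_tables(jpeg_bytes):
    """Return {table id: 64 values in zigzag order} from a JPEG's DQT segments."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    tables = {}
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = jpeg_bytes[pos + 1]
        if marker == 0xFF:  # fill byte in front of a marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan or end of image
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xDB:  # DQT: one or more tables in a single segment
            offset = 0
            while offset < len(payload):
                precision = payload[offset] >> 4  # 0 = 8-bit, 1 = 16-bit
                table_id = payload[offset] & 0x0F
                offset += 1
                if precision == 0:
                    values = list(payload[offset:offset + 64])
                    offset += 64
                else:
                    values = list(struct.unpack(">64H", payload[offset:offset + 128]))
                    offset += 128
                tables[table_id] = values
        pos += 2 + length
    return tables
```

Since these tables are part of the compressed image data rather than the metadata, stripping EXIF leaves them intact, which is why the digest approach survives metadata removal.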
There is existing software out there which uses various techniques (compression artifacting, comparison to signature profiles in a database of cameras, etc.) to analyze the actual image data for evidence of alteration. If you have access to such software and the software available to you provides an API for external access to these analysis functions, then there's a decent chance that a Perl module exists which will interface with that API and, if no such module exists, it could probably be created rather quickly.
In theory, it would also be possible to implement the image analysis code directly in native Perl, but I'm not aware of anyone having done so and I expect that you'd be better off writing something that low-level and processor-intensive in a fully-compiled language (e.g., C/C++) rather than in Perl.
http://www.impulseadventure.com/photo/jpeg-snoop.html
is a tool that does the job fairly well.
If there has been any cloning, there is a variation in the pixel density (or concentration) that sometimes shows up on manual inspection.
A Photoshop-cloned area will have an even pixel density (by which I mean the variation of pixels compared with a scanned image).