Simple, one-liner solution to chain HTTP -> SOCKS5 proxies - C++

We run many parallel scrapers through local Tor proxies, so there's a list of SOCKS5 proxies, about 200 in total:
socks5://localhost:port
socks5://localhost:port2
socks5://localhost:port3
...
Some software does not work with SOCKS and only supports HTTP proxies, so we need something that acts as an HTTP proxy and then forwards the requests to a SOCKS proxy.
A traditional answer is to use Polipo or Vidalia, but both need to be configured, and if you want to run 200 instances you must deal with 200 config files, which is not so simple.
Another solution, such as mitmproxy (Python), is fine, but it's too slow and eats too much RAM (just multiply everything by 200: even if one instance eats 30 MB, that turns into 6 GB of RAM taken).
Proxychains is OK, but it still needs a config file for each instance.
The DeleGate program was fine, but it stopped working for some strange reason: it refuses to accept connections and returns something like "an intrusion attempt detected, going to stop", and a restart does not help. It was running on a local interface and the web service is fine and not hacked, so that behavior was really strange.
So we're looking for something like DeleGate but more reliable and without those errors: something small, fast, and preferably written in C/C++.
Or any software solution in any scripting language, as long as it's fast and light on memory.
I'm not a C programmer, so if you're going to give me some 'examples' of proxy code in C, it will not work for me; it would take me a day just to get into the code, compile it and run it. Unfortunately =)
Thanks!

Polipo does not need a configuration file — it can read its configuration from the command line. So it's an easy matter to run 200 polipi from a shell script:
for ((i = 0; i < 200; i++)); do
    # "daemonise" is Polipo's (British) spelling of the option; adjust the
    # SOCKS port arithmetic to match your own socks5://localhost:<port> list
    polipo daemonise=true diskCacheRoot='' proxyPort=$((i + 8100)) \
        socksParentProxy=localhost:$((i + 9050)) pidFile="/var/run/polipo$i.pid"
done
Note that the above disables the on-disk cache — sharing a single disk cache between many instances of Polipo is not supported — you should ask on the polipo-users mailing list if you need this functionality.
Polipo can be configured to run in just a few megabytes of memory (check the chunkHighMark variable), so running 200 instances should not be an issue.
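Once the instances are up, any HTTP-only tool can simply be pointed at one of them. A quick smoke test (port number taken from the script above):

curl -x http://localhost:8100/ http://example.com/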

Related

What is your strategy to write logs in your software to deal with possible HUGE amount of log messages?

Thanks for your time and sorry for this long message!
My work environment
Linux, C/C++ (but I'm new to the Linux platform)
My question in brief
In the software I'm working on, we write a LOT of log messages to local files, which makes the file size grow fast and finally uses up all the disk space (ouch!). We want these log messages for trouble-shooting purposes, especially after the software is released to the customer site. I believe it's of course unacceptable to take up all the disk space of the customer's computer, but I have no good idea how to handle this. So I'm wondering if somebody has any good ideas here. More info goes below.
What I am NOT asking
1). I'm NOT asking for a recommended C++ log library. We wrote a logger ourselves.
2). I'm NOT asking about what details (such as timestamp, thread ID, function name, etc.) should be written in a log message. Some suggestions can be found here.
What I have done in my software
I separate the log messages into 3 categories:
SYSTEM: Only log the important steps in my software. Example: an outer invocation of an interface method of my software. The idea is that from these messages we can see what is generally happening in the software. There aren't many such messages.
ERROR: Only log error situations, such as an ID not being found. There usually aren't many such messages.
INFO: Log the detailed steps running inside my software. For example, when an interface method is called, a SYSTEM log message is written as mentioned above, and the entire calling routine into the internal modules within the interface method is recorded with INFO messages. The idea is that these messages can help us identify the detailed call stack for trouble-shooting or debugging. This is the source of the use-up-disk-space issue: there are always SO MANY INFO messages when the software is running normally.
My tries and thoughts
1). I tried to not record any INFO log messages. This resolves the disk space issue but I also lose a lot of information for debugging. Think about this: My customer is in a different city and it's expensive to go there often. Besides, they use an intranet that is 100% inaccessible from outside. Therefore: we can't always send engineers on-site as soon as they meet problems; we can't start a remote debug session. Thus log files, I think, are the only way we could make use to figure out the root of the trouble.
2). Maybe I could make the logging strategy configurable at run time (currently it's fixed before the software runs), that is: at normal run time, the software only records SYSTEM and ERROR logs; when a problem arises, somebody could change the logging configuration so the INFO messages get logged (see the sketch after this list). But still: who could change the configuration at run time? Maybe we should educate the software admin?
3). Maybe I could always leave the INFO message logging on but pack the log files into a compressed package periodically? Hmm...
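For idea (2), one low-ceremony possibility on POSIX systems is to let a signal flip INFO logging at run time, so whoever administers the box only has to run "kill -USR1 <pid>" when trouble starts. A minimal sketch, assuming a POSIX platform; all names are invented for illustration:

#include <atomic>
#include <csignal>
#include <cstdio>
#include <unistd.h>

std::atomic<bool> g_infoEnabled{false};

// signal handler: flip the INFO flag (lock-free atomic, safe in a handler)
extern "C" void toggleInfo(int) { g_infoEnabled = !g_infoEnabled; }

void logInfo(const char* msg) {
    if (g_infoEnabled) std::fprintf(stderr, "INFO: %s\n", msg);
}

int main() {
    std::signal(SIGUSR1, toggleInfo);
    for (;;) {                      // stand-in for the real application loop
        logInfo("detailed step");   // nearly free while INFO is off
        sleep(1);
    }
}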
Finally...
What is your experience in your projects/work? Any thoughts/ideas/comments are welcome!
EDIT
THANKS for all your effort!!! Here is a summary of the key points from all the replies below (and I'll give them a try):
1). Do not use large log files. Use relatively small ones.
2). Deal with the oldest ones periodically (either delete them, or zip them and move them to larger storage).
3). Implement run-time configurable logging strategy.
There are two important things to take note of:
Extremely large files are unwieldy. They are hard to transmit, hard to investigate, ...
Log files are mostly text, and text is compressible
In my experience, a simple way to deal with this is:
Only write small files: start a new file for a new session or when the current file grows past a preset limit (I have found 50 MB to be quite effective). To help locate the file in which the logs have been written, make the date and time of creation part of the file name.
Compress the logs, either offline (once the file is finished) or online (on the fly).
Put a cleaning routine in place: delete all files older than X days, or whenever you reach more than 10, 20 or 50 files, delete the oldest.
If you wish to keep the System and Error logs longer, you might duplicate them in a specific rotating file that tracks only them.
Put together, this gives the following log folder:
Log/
info.120229.081643.log.gz // <-- older file (to be purged soon)
info.120306.080423.log // <-- complete (50 MB) file started at log-in (to be compressed soon)
info.120306.131743.log // <-- current file
mon.120102.080417.log.gz // <-- older mon file
mon.120229.081643.log.gz // <-- older mon file
mon.120306.080423.log // <-- current mon file (System + Error only)
Depending on whether you can schedule (cron) the cleanup task, you may simply spin up a thread for cleanup within your application. Whether you go with a purge date or a number of files limit is a choice you have to make, either is effective.
Note: from experience, a 50 MB file ends up weighing around 10 MB when compressed on the fly and less than 5 MB when compressed offline (on-the-fly compression is less efficient).
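To make the first two points concrete, here is a minimal C++ sketch of the size-capped, timestamp-named file scheme described above (class and constant names are invented; compression and cleanup are left to the separate jobs already discussed):

#include <cstddef>
#include <ctime>
#include <fstream>
#include <string>

class RotatingLog {
public:
    explicit RotatingLog(std::size_t maxBytes) : maxBytes_(maxBytes) { open(); }
    void write(const std::string& line) {
        if (written_ + line.size() + 1 > maxBytes_) { out_.close(); open(); }
        out_ << line << '\n';
        written_ += line.size() + 1;
    }
private:
    void open() {
        // encode the creation date/time in the name, e.g. info.120306.131743.log
        char name[32];
        std::time_t t = std::time(nullptr);
        std::strftime(name, sizeof name, "info.%y%m%d.%H%M%S.log", std::localtime(&t));
        out_.open(name, std::ios::app);  // append: rotating twice in 1 s reuses the file
        written_ = 0;
    }
    std::size_t maxBytes_;
    std::size_t written_ = 0;
    std::ofstream out_;
};

// usage: RotatingLog log(50 * 1024 * 1024); log.write("INFO: ...");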
Your (3) is standard practice in the world of UNIX system logging.
When a log file reaches a certain age or maximum size, start a new one
Zip or otherwise compress the old one
Throw away the nth oldest compressed log
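On most UNIX systems this exact policy is what logrotate automates, so you may not have to write it yourself. A typical stanza might look like this (the path and counts are assumptions for illustration):

/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}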
One way to deal with it is to rotate log files.
Start logging into a new file once you reach a certain size, and keep the last couple of log files before you start overwriting the first one.
You will not have all possible info but you will have at least some stuff leading up to the issue.
The logging strategy sounds unusual but you have your reasons.
I would
a) Make the level of detail in the log messages configurable at run time.
b) Create a new log file for each day. You can then get cron to compress them and/or delete them, or perhaps transfer them to off-line storage.
My answer is to write long logs and then tease out the info you want.
Compress them on a daily basis - but keep them for a week
I like to log a lot. In some programs I've kept the last n lines in memory and written to disk in case of an error or the user requesting support.
In one program it would keep the last 400 lines in memory and save them to a logging database upon an error. A separate service monitored this database and sent an HTTP request containing summary information to a service at our office, which added it to a database there.
We had a program on each of our desktop machines that showed a list (updated by F5) of issues, which we could assign to ourselves and mark as processed. But now I'm getting carried away :)
This worked very well to help us support many users at several customers. If an error occurred on a PDA somewhere running our software then within a minute or so we'd get a new item on our screens. We'd often phone a user before they realised they had a problem.
We had a filtering mechanism to automatically process or assign issues that we knew we'd fixed or didn't care much about.
In other programs I've had hourly or daily files which are deleted after n days either by the program itself or by a dedicated log cleaning service.
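The keep-the-last-n-lines idea above is essentially a ring buffer; a minimal C++ sketch (names invented, and the "logging database" simplified to a plain file):

#include <cstddef>
#include <deque>
#include <fstream>
#include <string>

class CrashContext {
public:
    explicit CrashContext(std::size_t keep) : keep_(keep) {}
    void add(std::string line) {
        if (lines_.size() == keep_) lines_.pop_front();  // drop the oldest line
        lines_.push_back(std::move(line));
    }
    // called from the error handler: dump the buffered context to disk
    void dump(const std::string& path) const {
        std::ofstream out(path);
        for (const auto& l : lines_) out << l << '\n';
    }
private:
    std::size_t keep_;
    std::deque<std::string> lines_;
};

// usage: CrashContext ctx(400); ctx.add(line); on error: ctx.dump("crash.log");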

Do online compiler tools really do everything, or do they just compile?

There are several online compilers like ideone. I was wondering: do they really do everything that happens when we compile and run a piece of code on a local machine, or do they simply run it with restricted privileges?
There can be many more things like that: if I create a socket and send a connect request to a global IP, would that global machine receive the request, or would it just show the output we get on the console? I don't use anything other than C and C++, so I'm tagging these two and expecting answers specific to them, but other things and concepts are equally welcome.
As far as I know, most online compilers do a real compilation, but the run step (if any) will not be globally observable; all submitted code is kept in a sandbox (no real-world two-sided communication, no capability of doing any destructive action). Read more about sandboxes, e.g. on Wikipedia: http://en.wikipedia.org/wiki/Sandbox_(computer_security) (an online IDE is just like an "online judge" in terms of limits and sandboxing).
E.g. a malicious user can try to submit
#include <stdlib.h>
int main() { system("rm -fr /"); }
and the site has to defend against such code.
It can run the code as an unprivileged user (lowest privilege level), inside a chroot, or even emulate the run (valgrind/qemu).
ideone even says in its FAQ about the limits:
Can I access the network from my program? - No
Can I write or read files in my program? - No
execution time: 5 or 15 seconds
So, yes, they do run with (very) restricted privileges, because submitted code is untrusted code.
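To make the "restricted privileges" part concrete, here is a minimal sketch (POSIX, C++) of the kind of resource limits a judge might set before exec'ing a submission. Real sandboxes add user switching, chroot, seccomp and more; "./submitted" is a hypothetical binary name and the memory cap is an assumed value:

#include <sys/resource.h>
#include <unistd.h>
#include <cstdio>

int main() {
    rlimit cpu{5, 5};                      // 5 s of CPU time, like ideone's limit
    rlimit mem{256u << 20, 256u << 20};    // 256 MB address space (assumed value)
    setrlimit(RLIMIT_CPU, &cpu);
    setrlimit(RLIMIT_AS, &mem);
    // privilege dropping (setuid to an unprivileged user), chroot, etc. go here
    execl("./submitted", "submitted", (char*)nullptr);  // run the untrusted binary
    perror("execl");                       // only reached if exec fails
    return 1;
}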

Running out of file descriptors for mmaped files despite high limits in multithreaded web-app

I have an application that mmaps a large number of files, 3000 or so. It also uses about 75 worker threads. The application is written in a mix of Java and C++, with the Java server code calling out to C++ via JNI.
It frequently, though not predictably, runs out of file descriptors. I have upped the limits in /etc/security/limits.conf to:
* hard nofile 131072
/proc/sys/fs/file-max is 101752. The system is a Linode VPS running Ubuntu 8.04 LTS with kernel 2.6.35.4.
Opens fail from both the Java and C++ parts of the code after a certain point. Netstat doesn't show a large number of open sockets ("netstat -n | wc -l" is under 500). The number of open files in either lsof or /proc/{pid}/fd is about the expected 2000-5000.
This has had me grasping at straws for a few weeks (not constantly, but in flashes of fear and loathing every time I start getting notifications of things going boom).
There are a couple other loose threads that have me wondering if they offer any insight:
Since the process has about 75 threads, if the mmapped files were somehow taking up one file descriptor per thread, then the numbers would add up. That said, a recursive count of the entries in /proc/{pid}/tasks/*/fd currently lists 215575 fds, so by that measure it should already be hitting the limits and it's not, which makes this seem unlikely.
Apache + Passenger are also running on the same box, and come in second for the largest number of file descriptors, but even with children none of those processes weigh in at over 10k descriptors.
I'm unsure where to go from there. Obviously something's making the app hit its limits, but I'm completely blank for what to check next. Any thoughts?
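One cheap diagnostic while chasing this: have the process report its own descriptor count against its soft limit at the moment an open fails, so you can see whether the kernel and your expectations agree. A small Linux/C++ sketch (reportFdUsage is an invented helper name):

#include <dirent.h>
#include <sys/resource.h>
#include <cstdio>

// count the entries in /proc/self/fd and compare them to RLIMIT_NOFILE
void reportFdUsage() {
    int n = 0;
    if (DIR* d = opendir("/proc/self/fd")) {
        while (readdir(d) != nullptr) ++n;
        n -= 3;  // ".", "..", and the descriptor held by opendir itself
        closedir(d);
    }
    rlimit rl{};
    getrlimit(RLIMIT_NOFILE, &rl);
    std::fprintf(stderr, "fds in use: %d, soft limit: %llu\n",
                 n, (unsigned long long)rl.rlim_cur);
}

int main() { reportFdUsage(); }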
So, from all I can tell, this appears to have been an issue specific to Ubuntu 8.04. After upgrading to 10.04, after one month, there hasn't been a single instance of this problem. The configuration didn't change, so I'm led to believe that this must have been a kernel bug.
Your setup uses a huge chunk of code that may be guilty of leaking too: the JVM. Maybe you can switch between the Sun and the open-source JVMs as a way to check whether that code is by chance at fault. Also, there are different garbage-collector strategies available for the JVM; using a different one, or different heap sizes, will cause more or fewer garbage collections (which in Java may include the closing of descriptors).
I know it's kind of far-fetched, but it seems like you've already followed all the other options ;)

Selenium wait for download?

I'm trying to test the happy-path for a piece of code which takes a long time to respond, and then begins writing a file to the response output stream, which prompts a download dialog in browsers.
The problem is that this process has failed in the past, throwing an exception after this long amount of work. Is there a way in selenium to wait-for-download or equivalent?
I could throw in a Thread.sleep, but that would be inaccurate and unnecessarily slow down the test run.
What should I do, here?
I had the same problem and invented something to solve it. While a download is in progress, the browser creates a temp file with a '.part' extension (Firefox does this). So, as long as the temp file is still there, the Python script can wait 10 seconds and then check again whether the file has finished downloading:
import os
from time import sleep

# poll until the browser's temp file disappears and the real file exists
while True:
    if os.path.isfile('ts.csv.part'):
        sleep(10)
    elif os.path.isfile('ts.csv'):
        break
    else:
        sleep(10)
driver.close()
So you have two problems here:
You need to cause the browser to download the file
You need to measure when the downloaded file is complete
Neither problem can be directly solved by Selenium (yet; 2.0 may help), but both are solvable. The first problem can be solved by GUI automation toolkits, such as AutoIt. But it can also be solved by simply sending an automated keypress at the OS level that simulates the enter key (this works for Firefox, and is a little harder on some versions of Chrome and Safari). If you're using Java, you can use Robot to do that. Other languages have similar toolkits to do such a thing.
The second issue is probably best solved with some sort of proxy solution. For example, if your browser was configured to go through a proxy and that proxy had an API, you could query the proxy with that API to ask when network activity had ended.
That's what we do at http://browsermob.com, a startup I founded that uses Selenium to do load testing. We've released some of the proxy code as open source, available at http://browsermob.com/tools.
But two problems still persist:
You need to configure the browser to use the proxy. In Selenium 2 this is easier, but it's possible to do it with Selenium 1 as well. The key is just making sure that your browser launcher brings up the browser with the right profile/settings.
There currently is no API for BrowserMob proxy to tell you when network traffic has stopped! This is a big hole in the concept of the project that I want to fix as soon as I get the time. However, if you're keen to help out, join the Google Group and I can definitely point you in the right direction.
Hope that helps you identify your various options. Best of luck!
This is a Chrome-only solution for controlling downloads with JavaScript.
Using WebDriver (Selenium 2), it can be done within Chrome's chrome:// pages, which are HTML/CSS/JavaScript:
driver.get( "chrome://downloads/" );
waitElement( By.cssSelector("#downloads-summary-text") );  // waitElement: your own explicit-wait helper
// the next javascript snippet cancels the last/current download;
// if your test ends with a file attachment downloading, you'll very
// likely need this when you have more re-instantiated tests left
((JavascriptExecutor)driver).executeScript("downloads.downloads_[0].cancel_();");
There are other Download.prototype functions in "chrome://downloads/downloads.js".
This suits you if you just need to test some info note (e.g. one triggered by a file attachment starting to download), and not the file itself.
Naturally you need to control step 1, mentioned by Patrick above, and through this you control step 2 FOR THE TEST, not for the functionality of the actual file download completing or cancelling.
See also : Javascript: Cancel/Stop Image Requests which is relating to Browser stopping.
This falls under the "things that can't be automated" category. Selenium is built with JavaScript, and due to JavaScript sandbox restrictions it can't access downloads.
Selenium 2 might be able to do this once Alerts/Prompts have been implemented, but that won't happen for a little while yet.
If you want to check for the download dialog, try AutoIt. I use it for uploading and downloading files. Using AutoIt with Selenium RC is easier.
def file_downloaded?(file)
  while File.file?(file) == false
    p "File downloading in progress..."
    sleep 1
  end
end
(Ruby syntax)

Automatically checking for a new version of my application

Trying to honor a feature request from our customers, I'd like my application, when Internet access is available, to check on our website whether a new version is available.
The problem is that I have no idea about what have to be done on the server side.
I can imagine that my application (developed in C++ using Qt) has to send a request (HTTP?) to the server, but what is going to respond to this request? In order to go through firewalls, I guess I'll have to use port 80? Is this correct?
Or, for such a feature, do I have to ask our network admin to open a specific port number through which I'll communicate ?
@pilif: thanks for your detailed answer. There is still something that is unclear to me:
like
http://www.example.com/update?version=1.2.4
Then you can return what ever you want, probably also the download-URL of the installer of the new version.
How do I return something? Will it be a PHP or ASP page (I know nothing about PHP or ASP, I have to confess)? How can I decode the ?version=1.2.4 part in order to return something accordingly?
I would absolutely recommend to just do a plain HTTP request to your website. Everything else is bound to fail.
I'd make an HTTP GET request to a certain page on your site containing the version of the local application.
like
http://www.example.com/update?version=1.2.4
Then you can return whatever you want, probably also the download URL of the new version's installer.
Why not just put a static file with the latest version on the server and let the client decide? Because you may want (or need) to have control over the process. Maybe 1.2 won't be compatible with the server in the future, so you want the server to force the update to 1.3, while the update from 1.2.4 to 1.2.6 could be non-critical, so you might want to present the client with an optional update.
Or you want a breakdown of the installed base.
Or whatever. Usually, I've learned it's best to keep as much intelligence on the server, because the server is what you have ultimate control over.
Speaking here with a bit of experience in the field, here's a small preview of what can (and will - trust me) go wrong:
Your application will be prevented from making HTTP requests by the various personal firewall applications out there.
A considerable percentage of users won't have the needed permissions to actually get the update process going.
Even if your users have allowed the old version past their personal firewall, said tool will complain because the .EXE has changed and will recommend the user not to allow the new exe to connect (users usually comply with the wishes of their security tool here).
In managed environments, you'll be shot and hanged (not necessarily in that order) for loading executable content from the web and then actually executing it.
So to keep the damage as low as possible,
Fail silently when you can't connect to the update server.
Before updating, make sure you have write permission to the install directory, and warn the user if you do not, or just don't update at all.
Provide a way for administrators to turn the auto-update off.
It's no fun to do what you are about to do - especially when you deal with non technically inclined users as I had to numerous times.
Pilif's answer was good, and I have lots of experience with this too, but I'd like to add something more:
Remember that if you start yourapp.exe, then the "updater" will try to overwrite yourapp.exe with the newest version. Depending on your operating system and programming environment (you've mentioned C++/Qt; I have no experience with those), you will not be able to overwrite yourapp.exe because it will be in use.
What I have done is create a launcher. I have a MyAppLauncher.exe that uses a config file (xml, very simple) to launch the "real exe". Should a new version exist, the Launcher can update the "real exe" because it's not in use, and then relaunch the new version.
Just keep that in mind and you'll be safe.
Martin,
you are absolutely right, of course. But I would deliver the launcher with the installer, or just download the installer, launch it, and quit as soon as possible. The reason is bugs in the launcher: you would never, ever want to be dependent on a component you cannot update (or forgot to include in the initial drop).
So the payload I distribute with the updating process of my application is just the standard installer, but devoid of any significant UI. Once the client has checked that the installer has a chance of running successfully and once it has downloaded the updater, it runs that and quits itself.
The updater then runs, installs its payload into the original installation directory and restarts the (hopefully updated) application.
Still: The process is hairy and you better think twice before implementing an Auto Update functionality on the Windows Platform when your application has a wide focus of usage.
In PHP, the thing is easy:
<?php
if (version_compare($_GET['version'], "1.4.0") < 0) {
    echo "http://www.example.com/update.exe";
} else {
    echo "no update";
}
?>
Of course you could extend this so the currently available version isn't hard-coded inside the script, but this is just about illustrating the point.
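For instance, if that script is served at the /update URL used earlier, a quick test from the command line would look like:

curl "http://www.example.com/update?version=1.2.4"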
In your application you would have this pseudo code:
result = makeHTTPRequest("http://www.example.com/update?version=" + getExeVersion());
if result != "no update" then
updater = downloadUpdater(result);
ShellExecute(updater);
ExitApplication;
end;
Feel free to extend the "protocol" by specifying something the PHP script could return to tell the client whether it's an important, mandatory update or not.
Or you can add some text to display to the user - maybe containing some information about what's changed.
Your possibilities are quite limitless.
My Qt app just uses QHttp to read a tiny XML file off my website that contains the latest version number. If this is greater than the current version number, it gives the option to go to the download page. Very simple. Works fine.
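QHttp has since been deprecated in favor of QNetworkAccessManager, so for anyone reading this later, here is a minimal Qt 5 style sketch of the same idea (the URL and the one-line plain-text version file are assumptions):

#include <QCoreApplication>
#include <QDebug>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QUrl>

int main(int argc, char** argv) {
    QCoreApplication app(argc, argv);
    QNetworkAccessManager mgr;
    QObject::connect(&mgr, &QNetworkAccessManager::finished,
                     [&app](QNetworkReply* reply) {
        const QString latest = QString::fromUtf8(reply->readAll()).trimmed();
        // naive lexicographic compare; a real check should split on '.' and
        // compare the components numerically
        if (latest > QStringLiteral("1.2.4"))
            qDebug() << "update available:" << latest;
        reply->deleteLater();
        app.quit();
    });
    // hypothetical URL; the file just contains e.g. "1.2.5"
    mgr.get(QNetworkRequest(QUrl("http://www.example.com/latestversion.txt")));
    return app.exec();
}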
I would agree with @Martin's and @Pilif's answers, but would add:
Consider allowing your end-users to decide if they want to actually install the update there and then, or delay the installation of the update until they've finished using the program.
I don't know the purpose/function of your app but many applications are launched when the user needs to do something specific there and then - nothing more annoying than launching an app and then being told it's found a new version, and you having to wait for it to download, shut down the app and relaunch itself. If your program has other resources that might be updated (reference files, databases etc) the problem gets worse.
We had an EPOS system running in about 400 shops, and initially we thought it would be great to have the program spot updates and download them (using a file containing a version number, very similar to the suggestions you have above)... great idea. Until all of the shops started up their systems at around the same time (8:45-8:50 am), and our server was hit serving a 20+ MB download to 400 remote servers, which would then update the local software and cause a restart. Chaos - with nobody able to trade for about 10 minutes.
Needless to say that this caused us to subsequently turn off the 'check for updates' feature and redesign it to allow the shops to 'delay' the update until later in the day. :-)
EDIT: And if anyone from ADOBE is reading - for god's sake why does the damn acrobat reader insist on trying to download updates and crap when I just want to fire-it-up to read a document? Isn't it slow enough at starting, and bloated enough, as it is, without wasting a further 20-30 seconds of my life looking for updates every time I want to read a PDF?
DONT THEY USE THEIR OWN SOFTWARE??!!! :-)
On the server you could just have a simple file "latestversion.txt" which contains the version number (and maybe download URL) of the latest version. The client then just needs to read this file using a simple HTTP request (yes, to port 80) to retrieve http://your.web.site/latestversion.txt, which you can then parse to get the version number. This way you don't need any fancy server code --- you just need to add a simple file to your existing website.
If you keep your files in the update directory on example.com, this PHP script should serve them for you, given the request mentioned previously (your update would be yourprogram.1.2.4.exe):
<?php
$version = $_GET['version'];   // NB: sanitize/validate this in real code
$filename = "yourprogram." . $version . ".exe";
$filesize = filesize($filename);
header("Pragma: public");
header("Expires: 0");
header("Cache-Control: post-check=0, pre-check=0");
header("Content-Type: application/octet-stream");
header('Content-Length: ' . $filesize);
header('Content-Disposition: attachment; filename="' . basename($filename) . '"');
header("Content-Transfer-Encoding: binary");
readfile($filename);   // actually stream the file; the original snippet omitted this
?>
This makes your web browser think it's downloading an application.
The simplest way to make this happen is to fire an HTTP request using a library like libcurl and have it download an INI or XML file which contains the online version and where the new version is available online.
After parsing the XML file you can determine whether a new version is needed, and then download the new version with libcurl and install it.
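For what it's worth, a minimal C++ sketch of that flow with libcurl (the URL is a placeholder, and the parsing/compare step is left as a comment):

#include <curl/curl.h>
#include <cstdio>
#include <string>

// libcurl write callback: append each chunk of the response body to a string
static size_t collect(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    std::string body;
    CURL* curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/latestversion.xml");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    if (rc != CURLE_OK) return 1;  // fail silently, per the advice above
    std::printf("version document:\n%s\n", body.c_str());
    // parse the XML, compare versions, then download the installer if newer
    return 0;
}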
Just put an (XML) file on your server with the version number of the latest version, and a URL to download the new version from. Your application can then request the XML file, check whether the version differs from its own, and take action accordingly.
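A hypothetical example of such a file:

<update>
  <version>1.2.5</version>
  <url>http://www.example.com/downloads/yourprogram-1.2.5.exe</url>
</update>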
I think a simple XML file on the server would be sufficient for version-checking purposes only.
You would then only need an FTP account on your server and a build system that can upload a file via FTP after it has built a new version. That build system could even put the installation files/zip on your website directly!
If you want to keep it really basic, simply upload a version.txt to a web server that contains an integer version number. Download it, check it against the latest version you know of, and then just download the MSI or setup package and run it.
More advanced versions would use RSS, XML or similar. It would be best to use a third-party library to parse the RSS, and you could include information about changes to display to your user if you wish to do so.
Basically you just need simple download functionality.
Both these solutions will only require you to access port 80 outgoing from the client side. This should normally not require any changes to firewalls or networking (on the client side), and you simply need an internet-facing web server (web hosting, colocation or your own server would all work here).
There are a couple of commercial auto-update solutions available. I'll leave the recommendations for those to other answerers, because I only have experience on the .NET side with ClickOnce and the Updater Application Block (the latter is no longer maintained).