libtorrent download speed exceeds the configured limit - c++

I'm trying to use the libtorrent library and I have the following problem.
I create torrents with default settings and set a download limit on each (e.g., 100Kb) with
torrent_handle.set_download_limit(limit);
When I then query the speed of the current torrent with
torrent_handle.status().download_payload_rate
I sometimes get a value greater than the limit (e.g., about 200Kb or 300Kb).
What's wrong? Why aren't my torrent limits applied?

Related

When applying WriteFiles to an unbounded PCollection, must specify number of output shards explicitly

So far I've been trying to write one Parquet file per window, but I ended up with so many small files that I couldn't figure out what was going on, until I noticed a method I had forgotten about: withNumShards().
I was using it because all the examples do, and in development I didn't need more than that.
Once I tested it with many more events, the wall time started to increase exponentially, to more than a day!
So, digging into the docs in the code, it basically says that passing 0 instead of any other number lets the number of required shards be determined at run time.
When running mvn compile, the following message pops up.
When applying WriteFiles to an unbounded PCollection, must specify number of output shards explicitly
Isn't there an option which allows you to specify the number of shards when deploying the Dataflow job?
I've tried adding --outputNumShards=20 --errorOutputNumShards=10 to Dexec.args.
It seems that if you use WriteFiles to write files (all FileBasedSink IOs use it under the hood, like FileIO, TextIO, etc.), then you still need to set the number of shards manually with withNumShards(int) for unbounded sources, and it should be greater than 0 (see: https://github.com/apache/beam/blob/release-2.16.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L314)

Code changes needed for custom distributed ML Engine Experiment

I completed this tutorial on distributed tensorflow experiments within an ML Engine experiment and I am looking to define my own custom tier instead of the STANDARD_1 tier that they use in their config.yaml file. If using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." so this would suggest the config.yaml file below would be possible
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_m
workerType: complex_model_m
parameterServerType: complex_model_m
workerCount: 10
parameterServerCount: 4
Are any code changes needed to the mnist tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers as the tutorial suggests would be possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package: task.py even though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that if using the Estimator API (even within your own defined model) that there should not need to be any additional changes.
Does any of this change if the config.yaml specifies using GPUs? This article suggests for the Estimator API "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml is specifically identifying the machine type for parameter servers and workers, I am expecting that within ML-Engine the ClusterSpec will be configured properly based on the config.yaml file. However, I am not able to find any ml-engine documentation that confirms no changes are needed to take advantage of GPUs.
Last, within ML-Engine I am wondering if there are any ways to identify usage of different configurations? The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." suggests that the use of additional workers would be roughly linear, but I don't have any intuition around how to determine if more parameter servers are needed? What would one be able to check (either within the cloud dashboards or tensorboard) to determine if they have a sufficient number of parameter servers?
are any additional code changes needed to create a custom tier of any size?
No; no changes are needed to the MNIST sample to get it to work with different number or type of worker. To use a tf.estimator.Estimator on CloudML engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads in the TF_CONFIG environment variables and populates a RunConfig object with the relevant information such as the ClusterSpec. It will automatically do the right thing on Parameter Server nodes and it will use the provided Estimator to start training and evaluation.
Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
This all means that you can just change your config.yaml by adding/removing workers and/or changing their types and things should generally just work.
For sample code using a custom model_fn, see the census/customestimator example.
That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size is 50 and you are using 10 workers, each worker processes batches of size 50, for an effective batch size of 10*50=500. If you then increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may need to decrease your learning rate accordingly (linear seems to generally work well; ref).
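The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the scaling rule is only one reading of "linear": it assumes the learning rate shrinks in proportion to the growth in worker count.

```python
def effective_batch_size(per_worker_batch_size, worker_count):
    """Each worker processes its own batch, so the sizes add up."""
    return per_worker_batch_size * worker_count


def linearly_decreased_learning_rate(base_lr, base_workers, new_workers):
    """Assumption: scale the learning rate down by the same factor
    by which the number of workers (and the effective batch) grew."""
    return base_lr * base_workers / new_workers


if __name__ == "__main__":
    print(effective_batch_size(50, 10))   # 500
    print(effective_batch_size(50, 20))   # 1000
    print(linearly_decreased_learning_rate(0.1, 10, 20))
```

Treat the result as a starting point for tuning rather than a fixed rule.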
I poked around some of the other ML Engine samples and found that
reddit_tft uses distributed training, but they appear to have defined
their own runconfig.cluster_spec within their trainer package:
task.py even though they are also using the Estimator API. So, is there
any additional configuration needed?
No additional configuration is needed. The reddit_tft sample does instantiate its own RunConfig; however, the constructor of RunConfig uses TF_CONFIG to fill in any properties not explicitly set during instantiation. The sample does so only as a convenience, to figure out how many Parameter Servers and workers there are.
Does any of this change if the config.yaml specifies using GPUs?
You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.

How to check if content of webpage has been changed?

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.
I'm thinking of comparing hashes. The problem with this is that if the page has changed by a single byte or character, the hash will be different. So, for example, if the page displays the current date, the hash would be different every single time and tell me that the content has been updated.
So... How would you do this? Would you look at the Kb size of the HTML? Would you look at the string length and, for example, treat the content as "changed" if the length has changed by more than 5%? Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have changed?
About Last-Modified: unfortunately not all servers return this date correctly, so I don't think it is a reliable solution. I think a better way is to combine the hash and content-length approaches: check the hash, and if it has changed, check the string length.
There is no universal solution.
Use If-Modified-Since or HEAD when possible (usually ignored by dynamic pages)
Use RSS when possible.
Extract last modification stamp in site-specific way (news sites have publication dates for each article, easily extractable via XPATH)
Only hash interesting elements of page (build site-specific model) excluding volatile parts
Hash whole content (useless for dynamic pages)
Safest solution:
Download the content, compute a SHA512 hash of it, keep the hash in the db, and compare it on each check.
Pros: You are not dependent on any server headers and will detect any modification.
Cons: Too much bandwidth usage. You have to download all the content every time.
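A minimal sketch of the hash-and-compare approach, in Python 3 with the standard hashlib module. The function names are hypothetical, and where the previous digest is stored (a database, a file) is left to the caller.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return the SHA-512 hex digest of the downloaded page body."""
    return hashlib.sha512(content).hexdigest()


def check_for_change(previous_digest, content: bytes):
    """Compare the new body against the stored digest.

    Returns (changed, new_digest); the caller persists new_digest.
    A missing previous digest (first run) counts as changed.
    """
    new_digest = content_digest(content)
    return new_digest != previous_digest, new_digest


if __name__ == "__main__":
    changed, digest = check_for_change(None, b"<html>v1</html>")
    print(changed)
    changed, digest = check_for_change(digest, b"<html>v1</html>")
    print(changed)
```

As noted above, any single-byte difference flips the result, so this only suits pages without volatile content.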
Using HEAD
Request the page using the HEAD verb and check the response headers:
Last-Modified: The server should provide the last time the page was generated or modified.
ETag: A checksum-like value which is defined by the server and should change as soon as the content changes.
Pros: Much less bandwidth usage and very quick updates.
Cons: Not all servers provide these headers or follow the guidelines. You still need a GET request to fetch the actual resource once you find that it has changed.
Using GET
Request the page using the GET verb with conditional headers:
* If-Modified-Since: The server checks whether the resource has been modified since the given time and returns either the content or 304 Not Modified
Pros: Still uses less bandwidth, with a single round trip to receive the data.
Cons: Again, not all resources support this header.
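A conditional GET could look roughly like the sketch below, written in Python 3 with only the standard library. The function names are hypothetical. The If-None-Match header is the ETag counterpart of If-Modified-Since and is included here as an optional extra.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request that asks the server to skip unchanged content."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, last_modified=None, etag=None):
    """Return (body, last_modified, etag); body is None on 304 Not Modified."""
    request = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (response.read(),
                    response.headers.get("Last-Modified"),
                    response.headers.get("ETag"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Server confirmed nothing changed; keep the old validators.
            return None, last_modified, etag
        raise
```

Servers that ignore these headers simply return the full page with status 200, so a hash comparison remains useful as a fallback.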
Finally, a mix of the above solutions may be the best way to do this.
If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.
Would you look at the Kb size of the HTML? Would you look at the string length and check if for example the length has changed more than 5%, the content has been "changed"?
That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.
Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content has been changed?
You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
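The byte-frequency idea might be sketched as follows in Python 3. The function names are hypothetical, and the distance measure (sum of rank displacements) is an assumption chosen for simplicity; the answer does not say how the "diff" should be computed.

```python
from collections import Counter


def byte_rank_signature(content: bytes):
    """Order all 256 byte values by descending frequency in the document.

    Ties are broken by byte value so the result is deterministic.
    """
    counts = Counter(content)
    return sorted(range(256), key=lambda b: (-counts[b], b))


def signature_distance(sig_a, sig_b):
    """Sum of how far each byte value moved between the two orderings."""
    position_in_b = {byte: rank for rank, byte in enumerate(sig_b)}
    return sum(abs(rank - position_in_b[byte])
               for rank, byte in enumerate(sig_a))


if __name__ == "__main__":
    old = byte_rank_signature(b"aab")
    new = byte_rank_signature(b"abb")
    print(signature_distance(old, new))
```

A distance of 0 means the frequency ordering is identical; what counts as a "large" distance has to be found by experiment.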
An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
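A rough sketch of per-slice hashing in Python 3. As a simplifying assumption, it slices plain text on blank lines rather than on HTML headings and paragraphs. The function names and the default tolerance are hypothetical.

```python
import hashlib


def slice_hashes(text):
    """Split the text into paragraphs on blank lines and hash each one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [hashlib.sha256(p.encode("utf-8")).hexdigest() for p in paragraphs]


def count_changed_slices(old_hashes, new_hashes):
    """Count slices without a match, taking the larger of added/removed."""
    old_set, new_set = set(old_hashes), set(new_hashes)
    return max(len(new_set - old_set), len(old_set - new_set))


def is_same_document(old_text, new_text, max_changed=2):
    """Treat the page as unchanged if at most max_changed slices differ."""
    changed = count_changed_slices(slice_hashes(old_text),
                                   slice_hashes(new_text))
    return changed <= max_changed
```

Comparing sets rather than positions keeps one inserted paragraph from marking every later slice as changed.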
You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
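For instance, a sketch in Python 3 that masks clock times and ISO-style dates before hashing. The two patterns are illustrative assumptions; a real page would need patterns tailored to its own volatile parts.

```python
import hashlib
import re

# Illustrative patterns only: HH:MM or HH:MM:SS, and YYYY-MM-DD.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def normalized_digest(html):
    """Replace volatile timestamps with placeholders, then hash the rest."""
    text = DATE_PATTERN.sub("<date>", html)
    text = TIME_PATTERN.sub("<time>", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = normalized_digest("<p>Updated 2020-01-01 10:15:30</p>")
    b = normalized_digest("<p>Updated 2021-12-31 23:59:59</p>")
    print(a == b)
```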
You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
Hope this helps.
Store two versions of the HTML file:
first.html is the HTML that was fetched an hour ago.
second.html is the HTML that was fetched just now.
Run the command:
$ diff first.html second.html > diffs.txt
If diffs.txt contains any text, the file has changed.
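The same comparison can be done without shelling out, for example with Python 3's standard difflib module. This is a sketch with a hypothetical function name; it compares two strings already held in memory.

```python
import difflib


def changed_lines(first_html, second_html):
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    diff = difflib.ndiff(first_html.splitlines(), second_html.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    for line in changed_lines("<h1>News</h1>\n<p>old</p>",
                              "<h1>News</h1>\n<p>new</p>"):
        print(line)
```

An empty list means the two versions are identical line for line.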
Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.
You can even tell git to ignore "trivial" changes, such as adding and removing of whitespace characters to further optimize the search.
Practically what this comes down to is parsing the output of git diff -b --numstat HEAD HEAD^; which roughly translates to "find me what has changed in all the files, ignoring any whitespace changes, between the current state, and the previous state"; which will result in output like this:
2 37 en/index.html
2 insertions were made, 37 deletions were made to en/index.html
Next you'll have to do some experimentation to find a "threshold" at which you would consider a change significant enough to process the files further; this will take time, as you will have to train the system (you can also automate this part, but that is another topic altogether).
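Parsing that output and applying a threshold might look like the sketch below, in Python 3. The function names and the threshold of 20 changed lines are hypothetical placeholders for whatever your experiments suggest.

```python
def parse_numstat(output):
    """Parse `git diff --numstat` lines into (added, deleted, path) tuples."""
    stats = []
    for line in output.splitlines():
        parts = line.split(None, 2)  # counts and path are whitespace-separated
        if len(parts) != 3:
            continue
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # git prints "-" for binary files; skip them
        stats.append((int(added), int(deleted), path))
    return stats


def is_significant(added, deleted, threshold=20):
    """Hypothetical rule: total changed lines at or above the threshold."""
    return added + deleted >= threshold


if __name__ == "__main__":
    for added, deleted, path in parse_numstat("2\t37\ten/index.html\n"):
        print(path, is_significant(added, deleted))
```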
Unless you have a very good reason to do so, don't use your traditional, relational database as a file system. Let the operating system take care of files, which it's very good at (and which a relational database is not designed to manage).
You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-modified" header in the response.
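import requests

url = "http://example.com/"  # placeholder: the page you want to watch
response = requests.head(url)
# Not every server sends this header, so .get() avoids a KeyError
datetime_str = response.headers.get("Last-Modified")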
Then keep checking that field in a while loop and compare the datetimes to see whether it has changed.
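Comparing two Last-Modified values as datetimes, rather than as strings, could be sketched like this in Python 3 with the standard library. The function name is hypothetical, and treating a missing header as "changed" is an assumption made so that nothing is silently skipped.

```python
from email.utils import parsedate_to_datetime


def modified_since(previous_header, current_header):
    """Return True if the current Last-Modified is later than the stored one.

    If either value is missing we cannot rule out a change, so this
    sketch reports True and leaves the decision to a fallback check.
    """
    if not previous_header or not current_header:
        return True
    previous = parsedate_to_datetime(previous_header)
    current = parsedate_to_datetime(current_header)
    return current > previous


if __name__ == "__main__":
    print(modified_since("Wed, 21 Oct 2015 07:28:00 GMT",
                         "Thu, 22 Oct 2015 07:28:00 GMT"))
```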
I wrote a little program in Python to do that:
https://github.com/javierdechile/check_updates_http

Disable application after expiry date for trial

I am writing a simple application for a semi-trusted client, and have no say on certain specifics. The client must be given a copy of a binary, myTestApp, which makes use of proprietary code in an external library, libsecrets. It is a Windows application that will run on a few separate Windows 7 laptops. I have been informed that after the application has served its purpose, it will be deleted. I know there is no perfect solution to this, but I would like to implement an expiry date in the program, and hinder efforts to potentially reverse engineer the code, or at least to prevent the contents of libsecrets from being exposed too easily.
So, my first step will be to statically link myTestApp against libsecrets so that everything is contained in one binary, only the needed pieces of libsecrets are included in the final binary, and its interfaces are no longer published.
Second, I want to implement some sort of getTime mechanism that is not naive. Is there anything in Windows that does a "secure" getTime call, so it can't be tricked by changing the time in the system tray or the BIOS?
Thirdly, if there is no "secure" getTime call, I could also modify myTestApp to use NTP to query a trusted time server, and fail if it can't get the time from it or the trial period has elapsed. But this could be fooled by messing with DNS on the gateway, unless there is some sort of certificates mechanism in place to verify the time server. I don't know much about this though, and would need some suggestions on how to implement it.
Next, is there some way to alter the binary so that it is impractical for individuals to attempt to reverse engineer it by viewing the assembly code? Maybe some sort of wrapper that encrypts the binary and requires a third-party authentication tool? Or maybe some sort of certificate I create that is required to run it and expires later?
Finally, is there any software out there (ie: packaging or publishing software) that can do this for me, either by repacking the final .exe or as some sort of plugin for Microsoft Visual Studio?
Thank you all in advance.
Edit: This is NOT meant to be a bullet proof system, and if it fails, that is acceptable. I just want to make it inconvenient for a non-technical person to attempt to crack. The people using it are technical Luddites, and the only way the software would be cracked is if they hired someone to do it. Since the names and company name are watermarked into the application, and only one person could benefit from its use, it's unlikely they would redistribute it.
You can't make things completely secure, but you can make it hard(er).
Packing with UPX adds some level of complexity for the hacker.
You can check at runtime, in several places, whether you're running under a debugger or under a virtual machine.
You can encrypt a DLL you're using and load it manually (complicated).
You can write a loader that checks a hash of your application, and your application can check the hash of the loader.
You can get the system time and compare it to a system time you previously wrote to disk, to check that it is monotonic.
It all depends on the level of protection you want.
If you go to PirateBay or any other torrent site, you'll see that everything gets hacked if hackers are interested.
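The monotonic system-time check from the list above is easy to illustrate. The sketch below is in Python 3 for brevity rather than the application's own language; the function name and the state-file location are hypothetical. It only detects a clock set backwards, and deleting the state file defeats it.

```python
import os
import time


def clock_moved_backwards(state_path, now=None):
    """Compare the current time with the last time written to disk.

    Returns True if the clock appears to have been set back (or the
    stored value is unreadable); otherwise records `now` and returns False.
    """
    now = time.time() if now is None else now
    last_seen = None
    if os.path.exists(state_path):
        with open(state_path) as handle:
            try:
                last_seen = float(handle.read().strip())
            except ValueError:
                return True  # unreadable state is treated as tampering
    rolled_back = last_seen is not None and now < last_seen
    if not rolled_back:
        with open(state_path, "w") as handle:
            handle.write(repr(now))
    return rolled_back
```

The optional `now` argument exists only so the check can be exercised without changing the real clock.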
There is one way to make it really difficult for them to use it after expiry. The main idea of this trick is to make your expiration independent of system time and make it depend on hours passed, irrespective of what the system time may be.
You will have to create a separate thread to perform this task.
Suppose you want the application to expire after they have used it for 70 hours.
Create a binary file called "record" and store in it a number that is hard to guess (I will explain later why this number has to go in a binary file).
When your application starts, it checks whether that number is present. If it is, the application gets the current time and stores it in the file along with hour=1 (replacing the number). The thread you created keeps checking whether the hour in system time has changed; when it changes, it stores the current time in the file along with hour=2, and so on. Eventually hour=70.
Add this code in two places: inside that thread and at the start of your application.
/*the purpose of storing current time is to find out later if hour has changed or not*/
/*read hour from file.*/
if(hour==70)
{
    cout<<"Your trial period has expired"<<endl;
    return EXIT_SUCCESS;
}
Now whenever hour=70 the application will not work.
Earlier I told you to keep a number in your binary file. Whenever the application runs, it reads the binary file; if that number is found, the application replaces it with the current time and hour=1. Now suppose they use your application for 5 hours, close it, and run it again some time later. On startup the application checks the binary file; if the number has been replaced with a previously stored time and hour=5, it stores the current time along with hour = stored hour + 1. This way, even if they change the system time, it will not affect your expiration period, because the expiry check is no longer based on system time; it is based on hours passed, irrespective of what the time may be.
The absence of that number indicates that the file is not being accessed for the first time and that the hour currently stored in the file should be incremented. Use a binary file so that the client can't see that number.
One last thing
Your binary file's format should be like this
current time, hour="any number", another_secret_number
another_secret_number is placed there so that even if they somehow modify your binary file, they will not be able to write another_secret_number, because they don't know it. This means that when reading the binary file you have to make sure that every entry ends with another_secret_number.
For checking purposes both hidden numbers will also be hard-coded in your code, which they can't see, and they can't read the binary file either, so there is no way they can know them.
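The record file could be sketched as follows, in Python 3 for brevity. Note the swap: instead of appending a fixed secret number, this sketch signs each entry with an HMAC computed from an embedded key, which is a different technique serving the same goal of detecting edits. The key, the names, and the JSON layout are all hypothetical.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"hypothetical-embedded-secret"  # assumption: compiled into the app
TRIAL_HOURS = 70


def encode_record(hours_used, last_time):
    """Serialize the counters and prefix them with an HMAC tag."""
    payload = json.dumps({"hours": hours_used, "time": last_time},
                         sort_keys=True).encode("utf-8")
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return tag.encode("ascii") + b"\n" + payload


def decode_record(blob):
    """Verify the tag and return (hours_used, last_time)."""
    tag, _, payload = blob.partition(b"\n")
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected.encode("ascii")):
        raise ValueError("record file was modified")
    data = json.loads(payload.decode("utf-8"))
    return data["hours"], data["time"]


def trial_expired(hours_used):
    return hours_used >= TRIAL_HOURS
```

As with the original scheme, anyone who extracts the key from the binary can forge a record, so this only raises the bar for casual tampering.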
I hope it will help you.
Nothing stops hackers!
Your question is like searching for a needle in a haystack.
Working at the assembly level leaves a lot of room for workarounds.
You can only make things harder; nothing ever stops 'bad' people.
As for UPX: it is well known, so don't use it!

prioritizing torrent download sequences using libtorrent

Suppose I have 2+ clients (developed by me) ALL using libtorrent ( http://www.rasterbar.com/products/libtorrent/manual.html#queuing )
Can I prioritize download of a file from other clients effectively so that they download the file's pieces/chunks (whatever is torrent terminology here) from beginning of the file towards its end and not quite in random order?
(of course I'm allowing some "multiplexing" / "intertwining" pieces for reasons of availability and performance, but the goal here is to download as linearly and quickly from the start of the file towards the end as possible)
The goal I'm thinking about here is obviously previewing the file quickly. How to do this most effectively using libtorrent / possibly other C++ torrent library?
(I'm not quite interested in torrent implementations using non-binary languages, like Java or Python - I need machine code for reasons of performance and security, so, C, C++ or possibly D would all fit the bill)
You can certainly prioritize pieces and files with torrent_handle::prioritize_pieces() and torrent_handle::prioritize_files(). See the documentation.
This won't be enough to download in-order though. To do that, you can enable sequential download with torrent_handle::set_sequential_download(). This will issue new piece requests in-order. Keep in mind that the time a request takes to be satisfied varies a lot depending on which peer you talk to. Making the requests in-order does not necessarily mean receiving the pieces in order.
There is another mechanism to attempt to do that. torrent_handle::set_piece_deadline() is used to set a target completion time for a piece. Such pieces are considered time-critical pieces, and they are ordered by their deadline and the fastest peers are used to request blocks from those pieces, attempting to download them in deadline-order.
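One way to use deadlines for previewing is to give the first pieces of the file increasing deadlines. The sketch below, in Python 3, only computes such a schedule; it does not call libtorrent. The function name and the default timings are hypothetical, and each (piece, deadline) pair would then be passed to torrent_handle::set_piece_deadline(), whose deadline is in milliseconds.

```python
def piece_deadlines(first_piece, piece_count,
                    first_deadline_ms=1000, step_ms=500):
    """Return (piece_index, deadline_ms) pairs with increasing deadlines.

    Earlier pieces get shorter deadlines, so they are requested first
    when the pairs are handed to set_piece_deadline().
    """
    return [(first_piece + offset, first_deadline_ms + offset * step_ms)
            for offset in range(piece_count)]


if __name__ == "__main__":
    for piece, deadline in piece_deadlines(0, 4):
        print(piece, deadline)
```

The timings are a tuning knob: deadlines that are too tight for the available peers cannot be met regardless of ordering.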
Now, I also get the impression that you want two separate clients (presumably running on different machines) to coordinate which pieces they download. Is that right? It's not entirely clear what you're asking about, but there's no simple way of asking libtorrent to do that.
You could write a plugin for libtorrent that implements a new extension message for these clients to chat and coordinate, which could de-select certain pieces the other client is downloading by setting their priority to 0.