How to save Google Safe Browsing v3 list - safe-browsing

I'm trying to download Google's phishing and malware list from their safe browsing API.
I want to use the new V3 API.
I managed to get the redirect URLs that make up the list.
Here is the response I get:
n:1710 i:googpub-phish-shavar u:safebrowsing-cache.google.com/safebrowsing/rd/ChRnb29ncHViLXBoaXNoLXNoYXZhcjgBQAJKDAgBEOeYARjqmAEgAUoMCAEQu5gBGOWYASABSgwIARDplwEYuZgBIAFKDAgBENmXARjnlwEgAUoMCAEQxpcBGNeXASABSgwIARDDlwEYxJcBIAFKDAgBELCXARjBlwEgAUoMCAEQgpcBGK6XASABSgwIARD-lgEYgJcBIAFKDAgBEOGWARj8lgEgAUoMCAEQ2pYBGN-WASABSgwIARDQlgEY2JYBIAFKDAgBEMKWARjOlgEgAUoMCAEQvZYBGMCWASABSgwIARC6lgEYu5YBIAFKDAgBELWWARi4lgEgAUoMCAEQsJYBGLOWASABSgwIARCmlgEYrpYBIAFKDAgBEJ6WARihlgEgAUoMCAEQm5YBGJyWASABSgwIARCWlgEYmZYBIAFKDAgBEJOWARiUlgEgAUoMCAEQjZYBGJGWASABSgwIARD-lQEYi5YBIAFKDAgBEPeVARj7lQEgAUoMCAEQ9JUBGPSVASABSgwIARDolQEY8JUBIAFKDAgBEOSVARjmlQEgAUoMCAEQ4JUBGOKVASABSgwIARDYlQEY3JUBIAFKDAgBENGVARjVlQEgAUoMCAEQzZUBGM-VASABSgwIARDIlQEYyZUBIAFKDAgBEMCVARjGlQEgAUoMCAEQvpUBGL6VASABSgwIARC7lQEYvJUBIAFKDAgBELiVARi4lQEgAUoMCAEQs5UBGLaVASABSgwIARCwlQEYsZUBIAFKDAgBEK6VARiulQEgAUoMCAEQqpUBGKyVASABSgwIARCmlQEYqJUBIAFKDAgBEKKVARiilQEgAUoMCAEQnZUBGJ2VASABSgwIARCWlQEYl5UBIAFKDAgBEJSVARiUlQEgAUoMCAEQj5UBGJCVASABSgwIARCNlQEYjZUBIAFKDAgBEIWVARiIlQEgAUoMCAEQgZUBGIOVASABSgwIARD7lAEY_5QBIAFKDAgBEPWUARj4lAEgAUoMCAEQ8JQBGPCUASAB u:safebrowsing-cache.google.com/safebrowsing/rd/ChRnb29ncHViLXBoaXNoLXNoYXZhcjgBQAJKEAgAEIydExicqRMgASoC0QU u:safebrowsing-cache.google.com/safebrowsing/rd/ChRnb29ncHViLXBoaXNoLXNoYXZhcjgBQAJKEAgAEMCOExiLnRMgASoC0gg u:safebrowsing-cache.google.com/safebrowsing/rd/ChRnb29ncHViLXBoaXNoLXNoYXZhcjgBQAJKEAgAEKP-Ehi_jhMgASoC_A4 u:safebrowsing-cache.google.com/safebrowsing/rd/ChRnb29ncHViLXBoaXNoLXNoYXZhcjgBQAJKFAgAENjsEhii_hIgASoGsALLBeMF u:safebrowsing-cache.google.com/safebrowsing/rd/ChRnb29ncHViLXBoaXNoLXNoYXZhcjgBQAJKIAgAEObcEhjX7BIgASoS0QOuBfQGgwfnCLAJsQmyCd0J
My problems are:
1. How do I save the list into the DB? Is each row in the chunk file just a hash, or do I need to deserialize it using Protocol Buffers?
2. How do I check whether a given URL is bad? Do I need to hash it?
3. How do I know which chunks I already have?

You may want to check the unofficial Python client implementation of the v3 Safe Browsing API:
https://github.com/afilipovich/gglsbl
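On question 2, the lookup side can be sketched like this: the list stores SHA-256 hash prefixes, so checking a URL means canonicalizing it, hashing its candidate expressions, and testing the prefixes against the local list; a prefix hit still has to be confirmed via the full-hash API. A minimal illustration (the canonicalization step is omitted here, and the hostnames are made up):

```python
import hashlib

def hash_prefix(expression: str, length: int = 4) -> bytes:
    """SHA-256 the canonicalized URL expression; the list stores only a prefix."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:length]

# hypothetical local prefix set, as it would be loaded from the chunk data
local_prefixes = {hash_prefix("malware.example.com/")}

def possibly_bad(expression: str) -> bool:
    # a prefix hit is only a candidate: the full hash must still be
    # confirmed against Google's full-length hash API before flagging
    return hash_prefix(expression) in local_prefixes

print(possibly_bad("malware.example.com/"))  # -> True
print(possibly_bad("safe.example.com/"))     # -> False
```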


Using great expectations with streamed data

I am using Great Expectations to test streaming data (I collect a sample into a batch and test the batch). The issue is that I cannot use the data docs feature, because it would result in hundreds of thousands of HTML pages being generated. What I would like to do is use my API to generate the requested page from the JSON result when the specific test results are clicked on (via the index page). Is Great Expectations able to generate only one HTML page, which can be disposed of when it is closed?
If you are using a ValidationOperator / Checkpoint, then the UpdateDataDocsAction action supports building only the resources that were validated in that run; this is the recommended approach.
If you are interacting directly with the DataContext API, then the build_data_docs method on DataContext supports a resource identifier option that you can use to request only a single asset is built. I think to get the behavior you're looking for (a truly ephemeral build of just that page), you'd want to pair that with a site configuration for a site in a temporary location, e.g. /tmp.
The docs on the build_data_docs method are here:
https://docs.greatexpectations.io/en/latest/autoapi/great_expectations/data_context/data_context/index.html#great_expectations.data_context.data_context.BaseDataContext.build_data_docs
Note that the resource_identifiers parameter requires, e.g. a ValidationResultIdentifier object, such as:
from great_expectations.data_context.types.resource_identifiers import (
    ExpectationSuiteIdentifier,
    ValidationResultIdentifier,
)

context.build_data_docs(
    "local_site",
    resource_identifiers=[
        ValidationResultIdentifier(
            run_id="20201203T182816.362147Z",
            expectation_suite_identifier=ExpectationSuiteIdentifier("foo"),
            batch_identifier="b739515cf1c461d67b4e56d27f3bfd02",
        )
    ],
)

RESTservice, resource with two different outputs - how would you do it?

I'm currently working on a more or less RESTful webservice, a type of content API for my company's articles. We currently have a resource for getting all the content of a specific article:
http://api.com/content/articles/{id}
This will return the full set of article data for the given article id.
Currently we control a lot of the article's business logic because we only serve a native app from the webservice. This means we convert tags, links, images and so on in the body text of the article into a protocol the native app can understand. The same goes for a lot of different attributes and data on the article: we transform and modify its original (web) state into a state the native app will understand.
E.g. img tags will be converted from a normal <img src="http://source.com"/> into an <img src="inline-image//{imageId}"/> tag; the same goes for anchor tags, etc.
Now I have to implement a resource that can return the article's data in a new representation.
I'm puzzled over how best to do this.
I could just implement a completely new resource at a different URL, like content/articles/web/{id}, and move the old one to content/article/app/{id}.
I could also specify in the documentation of the resource that a client should always send a specific request header (maybe the Accept header) for the webservice to determine which representation of the article to return.
I could also just use the original URL with a URL parameter, like .../{id}/?version=app or .../{id}/?version=web.
What would you reckon is the best option? My personal preference leans towards option 1, simply because I think it's easier for clients of the webservice to understand.
Regards, Martin.
EDIT:
I have chosen to go with option 1. Thanks for helping out and giving pros and cons. :)
I would choose #1. If you need to preserve the existing URLs, you could add a new one: content/articles/{id}/native or content/native-articles/{id}/. Both are REST enough.
Working with paths makes content more easily cacheable than either the header or the param option. Negotiating on media type overcomplicates the service, especially when both representations return JSON.
Use the HTTP concept of Content Negotiation. Use the Accept header with vendor types.
Get the articles in the native representation:
GET /api.com/content/articles/1234
Accept: application/vnd.com.example.article.native+json
Get the articles in the original representation:
GET /api.com/content/articles/1234
Accept: application/vnd.com.example.article.orig+json
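A server-side sketch of dispatching on those Accept values (the media type strings, article fields, and transformation here are all illustrative):

```python
# illustrative dispatch on the Accept header for two article representations
NATIVE = "application/vnd.com.example.article.native+json"
ORIG = "application/vnd.com.example.article.orig+json"

def render_article(article: dict, accept: str) -> dict:
    """Return the representation the client asked for via the Accept header."""
    if NATIVE in accept:
        # hypothetical transformation into the app protocol
        body = article["body"].replace(
            'src="http://source.com"',
            'src="inline-image//{}"'.format(article["image_id"]),
        )
        return {"body": body}
    # default: the original web representation
    return {"body": article["body"]}

article = {"body": '<img src="http://source.com"/>', "image_id": 42}
print(render_article(article, NATIVE)["body"])  # -> <img src="inline-image//42"/>
```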
Option 1 and Option 3
Both are perfectly good solutions. I like the way Option 1 looks better, but that is just aesthetics. It doesn't really matter. If you choose one of these options, you should have requests to the old URL redirect to the new location using a 301.
Option 2
This could work as well, but only if the two responses have a different Content-Type. From the description, I couldn't really tell if this was the case. I would not define a custom Content-Type in this case just so you could use Content Negotiation. If the media type is not different, I would not use this option.
Perhaps option 2 - with the header being a Content-Type?
That seems to be the way resources are served in differing formats; e.g. XML, JSON, some custom format

How does Django create request.session and interface with WSGI?

I'm working with SWFUpload and Django, and I've noticed that authentication tends to break.
There is one part that is holding me up, and I'm looking for direction more than a solution, as I think the solution is not yet available. (So I'm making it.)
I need to know how Django creates the WSGI request object and how it's handled.
After looking at Django's source, it seems that CSRF is done via the WSGI object, which has the appropriate cookies appended to it. Naturally, Flash posts do not support this unless specified. SWFUpload offers the ability to send cookie data in the POST params via a plugin; however, I'd like to send them via headers on the URLRequest object (so that the auth middleware and CSRF middleware can see them).
My goal is to upgrade SWFUpload to send headers containing the values of whatever objects I pass it. The hard part for me is figuring out how those headers will be interpreted.
How does Django create the request.META object? Where is the request.session object created?
I'm reading up on the WSGI interface now, but I'd like to accelerate this research. Thanks!
I believe what you're looking for is django.core.handlers.wsgi.
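To sketch what that module does with headers (a simplified illustration, not Django's actual code): the WSGI server passes request headers in the environ dict, which Django exposes as request.META, with each header name uppercased, dashes turned into underscores, and prefixed with HTTP_. So a custom header sent from SWFUpload's URLRequest would surface to middleware like this:

```python
def header_to_meta_key(header_name: str) -> str:
    """Map an HTTP header name to its WSGI environ / request.META key."""
    return "HTTP_" + header_name.upper().replace("-", "_")

# roughly what the WSGI server hands Django for a request carrying X-CSRFToken
environ = {
    header_to_meta_key("X-CSRFToken"): "abc123",
    "REQUEST_METHOD": "POST",
}

# middleware reads it back under the transformed key:
token = environ.get("HTTP_X_CSRFTOKEN")
print(token)  # -> abc123
```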

Webservice that returns an image plus a number

If I have a webservice that returns an image, is there a standard way to also have it return some structured data (in the simplest case, an additional string or number)?
See this question for a PHP example of a webservice that returns an image.
(But my question now is not specific to PHP.)
Possible solutions (mostly not very satisfying):
Put the extra data in the metadata of the image.
Put the extra data in the HTTP headers.
Have the webservice return a nonce URL from which the image can be fetched for a limited amount of time.
Base64-encode the image and make it a huge string field in a JSON or XML data structure. (Thanks to Carles and Tiago.)
How would you do it?
(I don't really want to just split the webservice into two separate calls because the webservice takes a ton of parameters and it seems wasteful to send them all twice.)
HTTP headers are a viable choice, as they can be parsed easily server-side and client-side. Another thing you can do is set up a 302 that redirects to a URL carrying both the image and the data, e.g.:
hit http://mysite.com/bestimageever.png
get 302 to http://mysite.com/realbestimage.png?supercoolinfo=42
That'd be transparent to the user, and most API clients will work (since they handle redirects).
Return the image's binary data encoded in base64, and the additional field:
{ image: "mIIerhdkwje...", data: "data" }
Base64 encoding adds an overhead of 33% (you need 4 bytes for every 3 original bytes). Depending on the size of your data, it may be better to replace this with the URL of the binary data and issue a second request.
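A minimal sketch of the base64-in-JSON approach (the image bytes here are just a stand-in):

```python
import base64
import json

# stand-in for real image data
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 30

# server side: pair the encoded image with the extra number in one body
body = json.dumps({
    "image": base64.b64encode(image_bytes).decode("ascii"),
    "count": 42,
})

# client side: decode it back
decoded = json.loads(body)
assert base64.b64decode(decoded["image"]) == image_bytes
print(decoded["count"])  # -> 42
```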
Protocol buffers are another choice. Protocol buffers aren't self-describing like XML or JSON, but they are much more compact on the wire. The Google library (http://code.google.com/p/protobuf) provides C++, Java, and Python libraries, and contributors have provided libraries for a lot of other languages (http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns), including Javascript and PHP, so client writers should have an easy time working with the format.
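A hypothetical proto2 schema for this case might look like the following (message and field names are made up):

```proto
// hypothetical schema pairing raw image bytes with extra fields;
// bytes fields carry binary data directly, with no base64 overhead
message ImageResponse {
  required bytes image = 1;
  optional string caption = 2;
  optional int32 count = 3;
}
```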
Isn't it possible to include the binary data that forms the image inside the returned JSON/XML? That way you could add as many fields as necessary and process this information in the client.

Online JSONP converter/wrapper

I would like to fetch the source of a file and wrap it within JSONP.
For example, I want to retrieve pets.txt as text from a host I don't own. I want to do that by using nothing but client-side JavaScript.
I'm looking for an online service which can convert anything to JSONP.
YQL
Yahoo Query Language is one of them.
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D"http://elv1s.ru/x/pets.txt"&format=json&callback=grab
This works if the URL is not blocked by robots.txt. YQL respects robots.txt, so I can't fetch http://userscripts.org/scripts/source/62706.user.js because it is blocked:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D"http://userscripts.org/scripts/source/62706.user.js"&format=json&callback=grab
"forbidden":"robots.txt for the domain disallows crawling for url: http://userscripts.org/scripts/source/62706.user.js"
So I'm looking for other solutions.
I built jsonpwrapper.com.
It's unstable and slower than YQL, but it doesn't care about robots.txt.
Here's another one, much faster, built on DigitalOcean & CloudFlare, utilizing caching et al: http://json2jsonp.com
Nononono. No. Just please; no. That is not JSONP, it is JavaScript that executes a function with an object as its parameter that contains more JavaScript. Aaah!
This is JSON because it's just one object:
{
  "one": 1,
  "two": 2,
  "three": 3
}
This is JSONP because it's just one object passed to a function; if you go to http://somesite/get_some_object?jsonp=grab, the server will return:
grab({
  "one": 1,
  "two": 2,
  "three": 3
});
This is not JSON at all. It's just JavaScript:
alert("hello");
And this? JavaScript code stored inside a string (ouch!) inside an object passed to a function that is expected to evaluate the string (but might or might not):
grab({"body": "alert(\"Hello!\");\n"});
Look at all those semicolons and backslashes! I get nightmares from this kind of stuff. It's like a badly written Lisp macro, because it's much more complicated than it needs (and should!) to be. Instead, define a function called grab in your code:
function grab(message) {
  alert(message.body);
}
and then use JSONP to have the server return:
grab({body: "Hello!"});
Don't let the server decide how to run your web page. Instead, let your web page decide how to run the web page, and just have the server fill in the blanks.
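The server's half of that contract is tiny; a sketch of the wrapping step (in Python here, with an illustrative callback name and payload):

```python
import json

def jsonp_wrap(callback: str, payload) -> str:
    """Serialize the payload as JSON and wrap it in a callback invocation."""
    return "{}({});".format(callback, json.dumps(payload))

print(jsonp_wrap("grab", {"body": "Hello!"}))  # -> grab({"body": "Hello!"});
```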
As for an online service that does this? I don't know of any, sorry.
I'm not sure what you're trying to do here, but nobody will use something like this. Nobody is going to trust your service to always execute as it should and output the expected JavaScript code. You see Yahoo doing it because people trust Yahoo, but they will not trust you.