How to geocode a large number of addresses? - geocoding

I need to geocode (i.e. translate a street address into latitude and longitude) about 8,000 street addresses. I am using both the Yahoo and Google geocoding engines at http://www.gpsvisualizer.com/geocoder/, and found that for a large number of addresses one or both engines either could not geocode the address (i.e. returned latitude=0, longitude=0) or returned the wrong coordinates (including cases where Yahoo and Google give different results).
What is the best way to handle this problem? Which engine is usually more accurate? I would appreciate any thoughts, suggestions, or ideas from people with previous experience of this kind of task.

When making a large number of requests to the Google geocoding service, you need to throttle them, because responses start failing once you send too many. To give you a better idea, here is a snippet from a Drupal (PHP) module that I wrote.
function gmap_api_geocode($address) {
  $delay = 0;
  $api_key = keys_get_key('gmap_api');
  // Adapted from http://code.google.com/apis/maps/articles/phpsqlgeocode.html
  while (TRUE) {
    $response = drupal_http_request('http://maps.google.com/maps/geo?q='. drupal_urlencode($address) .'&output=csv&sensor=false&oe=utf8&key='. $api_key);
    switch ($response->code) {
      case 200: // OK
        // CSV response: status, accuracy, latitude, longitude
        $data = explode(',', $response->data);
        return array(
          'latitude' => $data[2],
          'longitude' => $data[3],
        );
      case 620: // Too many requests: wait a little longer before each retry
        $delay += 100000;
        break;
      default:
        return FALSE;
    }
    usleep($delay);
  }
}

You can use https://github.com/darkphnx/fetegeo/ for offline batch geocoding.


Golang match domain names wild card

I have hostnameWhitelist map
var hostnameWhitelist = map[string] bool { "test.mydomain.com":true, "test12.mydomaindev.com":true}
And the way I check if incoming request's hostname is allowed or not is -
url, errURL := url.Parse("https://test.mydomain.com") // without a scheme, Hostname() returns ""
if errURL != nil {
fmt.Println("error during parsing URL")
return false
}
fmt.Println("HOSTNAME = " + url.Hostname())
if ok := hostnameWhitelist[url.Hostname()]; ok {
fmt.Println("valid domain, allow access")
} else {
fmt.Println("NOT valid domain")
return false
}
While this works great, how do I do a wild card match like -
*.mydomain.com
*.mydomaindev.com
Both of these should pass.
While,
*.test.com
*.hello.com
should fail
Doing a wildcard match with the wildcard at the start is highly expensive. A regex could perform poorly, depending on the size of your dataset and how quickly you need to evaluate against it. You could try using a suffix tree, but I suspect performance could become a problem (I haven't tested it on our data).
One approach we use is building a radix trie (compact prefix trie) keyed on the signature domain name's labels in reverse order. Your signature domain *.foo.example.com becomes com.example.foo.*, which puts the wildcard at the end. Your custom-built radix tree then only needs to stop matching when it reaches a wildcard node. The trie can support both exact string matching and wildcard matching. If you wish to allow the wildcard to sit in the middle of the domain name, performance could become a problem.
One of the biggest challenges we've had using tries to evaluate domain names is not the search time but the memory consumption, and consequently how long it takes to start the program when you have a lot of signatures.
We've evaluated a few implementations (at first mainly without wildcard support), testing load time, allocations, number of internal nodes, memory consumption, GC time, and search/insert/remove time.
Implementations we've tested:
golang maps
https://github.com/armon/go-radix
https://github.com/tchap/go-patricia
https://github.com/fanyang01/radix
our own implementation
Obviously, a Go map gives the best performance, but when one needs to retrieve (whence the word trie) e.g. prefixed information from the dataset, Go maps don't give us the features we need.
We keep approximately 700,000 domain name signatures in our trie. Build time is 2 seconds, with 300 MB of memory, 5 million allocations, 2 seconds of GC, and searches costing 150 ns/op.
If we use a Go map for the same signatures (without wildcards), we get a load time of 0.5 seconds, 50 MB of memory, negligible allocations, 1.6 seconds of GC, and searches costing 25 ns/op.
In our initial implementation, build time was 6 seconds, with 1 GB of memory, 60 million allocations, 5 seconds of GC, and searches costing ~200 ns/op.
As you can see from these results, we managed to lower the memory consumption and load time, while the search cost remained approximately the same.
If you're going to do CIDR matching, I would recommend checking out https://github.com/kentik/patricia. To lower GC time, it is implemented to avoid pointers.
Good luck with your work!
Regex is the go-to solution for your problem; map[string]bool may not work as expected, as you are trying to match a pattern against a single value.
package main
import (
"fmt"
"regexp"
)
func main() {
if matched, _ := regexp.MatchString(".*\\.?mydomain.*", "mydomaindev.com"); matched {
fmt.Println("Valid domain")
}
}
This would match every domain containing mydomain, so www.mydomain.com and www.mydomaindev.com would match, but test.com and hello.com would fail. Note that the pattern is unanchored, so it also matches hosts such as mydomain.example.net.
Other handy string ops are:
//This would match a.mydomain.com, b.mydomain.com etc.,
if strings.HasSuffix(url1.Hostname(), ".mydomain.com") {
fmt.Println("Valid domain allow access")
}
//To match anything containing mydomain, like mydomain.com or mydomaindev.com
if strings.Contains(url2.Hostname(), "mydomain") {
fmt.Println("Valid domain allow access")
}
You can store the keys of the map in the format *.domain.com.
Then convert all the hostnames you get into that format using strings.SplitAfterN and strings.Join.
split := strings.SplitAfterN(url.Hostname(),".",2)
split[0] = "*"
hostName := strings.Join(split,".")
...
hostnameWhitelist[hostName]
...
Unrelated improvement
If you are using the map purely as a whitelist you can use map[string]struct{} instead of map[string]bool. But as Peter mentioned in his comment, it might be relevant only if you have a very large whitelist.
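A minimal sketch of both points together, assuming a whitelist keyed by `*.domain` entries and stored as a `map[string]struct{}`. The names `whitelist` and `allowed` are illustrative, and the first label is cut off with `strings.IndexByte` instead of `strings.SplitAfterN`, which is a swapped-in but equivalent way to build the key.

```go
package main

import (
	"fmt"
	"strings"
)

// whitelist uses empty struct values, so each entry stores only its key.
var whitelist = map[string]struct{}{
	"*.mydomain.com":    {},
	"*.mydomaindev.com": {},
}

// allowed replaces the first label of host with "*" and looks the result up.
func allowed(host string) bool {
	i := strings.IndexByte(host, '.')
	if i < 0 {
		return false // a single label cannot match "*.something"
	}
	_, ok := whitelist["*"+host[i:]]
	return ok
}

func main() {
	for _, host := range []string{"test.mydomain.com", "test12.mydomaindev.com", "www.test.com"} {
		fmt.Println(host, allowed(host))
	}
}
```

Because only the first label is replaced, this handles exactly one level of wildcard: `a.b.mydomain.com` becomes `*.b.mydomain.com` and is not found.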
If you want to be able to have wildcards at several depths in your domains, e.g.:
*.foo.example.org
*.example.com
Then I would add a second container for the wildcards:
var wdomains = []string { ".foo.example.org", ".example.com"}
Then just check if your domain to test ends with one of those entries:
func inWdomain(wdomains []string, domain string) bool {
for _, suffix := range wdomains {
if strings.HasSuffix(domain, suffix) {
return true
}
}
return false
}
Note: if you have more than hundreds of domains, you could use a radix tree.
https://play.golang.org/p/-4n8mlGmpH
You can use fstest.MapFS like a Set data structure, with the added benefit of Glob matching:
package main
import "testing/fstest"
var tests = []struct {
pat string
res int
} {
{"*.hello.com", 0},
{"*.mydomain.com", 1},
{"*.mydomaindev.com", 1},
{"*.test.com", 0},
}
func main() {
m := fstest.MapFS{"test.mydomain.com": nil, "test12.mydomaindev.com": nil}
for _, test := range tests {
a, e := m.Glob(test.pat)
if e != nil {
panic(e)
}
if len(a) != test.res {
panic(len(a))
}
}
}
https://golang.org/pkg/testing/fstest

VSTO web interaction

In PHP I can use this code:
$url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins=$location1&destinations=$location2&mode=bicycling&language=en-EN&sensor=false";
$data = @file_get_contents($url);
$obj = json_decode($data);
$arr = (array)$obj;
to get an array of values that return the distance between 2 locations.
Is this kind of web interaction possible with VSTO? I have Googled high and low and nothing I search is giving me any results to work with.
Ergo, I think I am missing something.

CakePHP reading Cookie with multiple dots

I am using CakePHP to develop a website and currently struggling with cookie.
The problem is that when I write cookies with multiple dots, like
$this->Cookie->write("Figure.1.id",$figureId);
$this->Cookie->write("Figure.1.name",$figureName);
and then read them, CakePHP doesn't return a nested array; it returns
array(
'1.id' => '82',
'1.name' => '1'
)
I expected something like
array(
(int) 1 => array(
'id'=>'82',
'name'=>'1'
)
)
Actually, I didn't see this result the first time I read the cookie after writing it, but from the second read onward the result looked like that. Do you know what is going on?
I'm afraid it doesn't look as if multiple dots are supported. If you look at the read() method of the CookieComponent (http://api.cakephp.org/2.4/source-class-CookieComponent.html#256-289), you see this:
if (strpos($key, '.') !== false) {
    $names = explode('.', $key, 2);
    $key = $names[0];
}
and that explode() call is told to split the name of your cookie into a maximum of two parts around the first dot.
You might be best off serializing the data you want to store before saving it, and then deserializing it after reading, as shown here: http://abakalidis.blogspot.co.uk/2011/11/cakephp-storing-multi-dimentional.html
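Both points can be illustrated with a short Go sketch (Go is used here only for illustration; it is not CakePHP code). `splitKey` mimics a split limited to two parts, which is why the remainder of the key stays as `1.id`, and the encode/decode pair shows nested data making a round trip through a single string that could be stored under one cookie key. All function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitKey mimics a split with a limit of two parts:
// everything after the first dot stays together.
func splitKey(key string) []string {
	return strings.SplitN(key, ".", 2)
}

// encodeFigures serializes nested data into one string value.
func encodeFigures(figs map[string]map[string]string) (string, error) {
	b, err := json.Marshal(figs)
	return string(b), err
}

// decodeFigures restores the nested structure from that string.
func decodeFigures(s string) (map[string]map[string]string, error) {
	var figs map[string]map[string]string
	err := json.Unmarshal([]byte(s), &figs)
	return figs, err
}

func main() {
	fmt.Printf("%q\n", splitKey("Figure.1.id"))

	stored, err := encodeFigures(map[string]map[string]string{
		"1": {"id": "82", "name": "1"},
	})
	if err != nil {
		panic(err)
	}
	figs, err := decodeFigures(stored)
	if err != nil {
		panic(err)
	}
	fmt.Println(figs["1"]["id"], figs["1"]["name"])
}
```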

Query Facebook Opengraph next page parameters

I am unable to implement pagination with Facebook OpenGraph. I have exhausted every option I have found.
My hope is to query for 500 listens repeatedly until there are none left. However, I am only able to receive a response from my first query. Below is my current code, but I have also tried setting the parameters to different amounts rather than having the fields from [paging][next] dictate them.
$q_param['limit'] = 500;
$next_exists = true;
while ($next_exists) {
    $music = $facebook->api('/me/music.listens', 'GET', $q_param);
    $music_data = array_merge($music_data, $music['data']);
    if ($music["paging"]["next"] == null || $music["paging"]["next"] == "") {
        $next_exists = false;
    } else {
        // Copy the query parameters of the "next" URL into the next request
        $url = $music["paging"]["next"];
        parse_str(parse_url($url, PHP_URL_QUERY), $array);
        foreach ($array as $key => $value) {
            $q_param[$key] = $value;
        }
    }
}
a - Can you please share what you get after the first call?
b - Also, could you share the whole file if possible?
I think your script is timing out. Try adding the following at the top of your file:
set_time_limit(0);
Can you check the Apache log files?
sudo tail -f /var/log/apache2/error.log

Finding the phone company of a cell phone number?

I have an application where people can give a phone number and it will send SMS texts to the phone number through EMail-SMS gateways. For this to work however, I need the phone company of the given number so that I send the email to the proper SMS gateway. I've seen some services that allow you to look up this information, but none of them in the form of a web service or database.
For instance, http://tnid.us provides such a service. Example output from my phone number:
Where do they get the "Current Telephone Company" information for each number? Is this freely available information? Is there a database or some sort of web service I can use to get that information for a given cell phone number?
What you need is called an HLR (Home Location Register) number lookup.
In their basic form, such APIs expect a phone number in international format (for example, +15121234567) and return the subscriber's IMSI, which includes the MCC (which gives you the country) and the MNC (which gives you the phone's carrier). They may even include the phone's current carrier (e.g. to tell whether the phone is roaming). The lookup may not work if the phone is currently out of range or turned off. In those cases, depending on the API provider, you may get a cached result.
The site you mentioned seems to provide such functionality. A web search for "HLR lookup API" will give you plenty more results. I have personal experience with CLX's service and would recommend it.
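As a rough illustration of how the MCC and MNC sit inside an IMSI, here is a small Go sketch. The MCC is always the first three digits, while the MNC is two or three digits depending on the network, so this sketch assumes the caller already knows the MNC length. `splitIMSI` is a hypothetical helper and the IMSI in `main` is made up; mapping the codes to country and carrier names would need a separate lookup table that is not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// splitIMSI splits an IMSI into MCC (3 digits), MNC (mncLen digits)
// and MSIN (the remaining subscriber digits).
func splitIMSI(imsi string, mncLen int) (mcc, mnc, msin string, err error) {
	if mncLen != 2 && mncLen != 3 {
		return "", "", "", errors.New("MNC length must be 2 or 3")
	}
	if len(imsi) <= 3+mncLen || len(imsi) > 15 {
		return "", "", "", errors.New("unexpected IMSI length")
	}
	for _, r := range imsi {
		if r < '0' || r > '9' {
			return "", "", "", errors.New("IMSI must contain digits only")
		}
	}
	return imsi[:3], imsi[3 : 3+mncLen], imsi[3+mncLen:], nil
}

func main() {
	// Made-up IMSI, split with an assumed three-digit MNC.
	mcc, mnc, msin, err := splitIMSI("310150123456789", 3)
	fmt.Println(mcc, mnc, msin, err)
}
```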
This would be pretty code intensive, but something you could do right now, on your own, without APIs as long as the tnid.us site is around:
Why not have IE open in a hidden browser window with the URL of the phone number? It looks like the URL would take the format of http://tnid.us/search.php?q=########## where # represents a number. So you need a textbox, a label, and a button. I call the textbox "txtPhoneNumber", the label "lblCarrier", and the button would call the function I have below "OnClick".
The button function creates the IE instance using MSHtml.dll and SHDocVW.dll and does a page scrape of the HTML that is in your browser "object". You then parse it down. You have to first install the Interoperability Assemblies that came with Visual Studio 2005 (C:\Program Files\Common Files\Merge Modules\vs_piaredist.exe). Then:
1> Create a new web project in Visual Studio.NET.
2> Add a reference to SHDocVw.dll and Microsoft.mshtml.
3> In default.aspx.cs, add these lines at the top:
using mshtml;
using SHDocVw;
using System.Threading;
4> Add the following function :
protected void executeMSIE(Object sender, EventArgs e)
{
SHDocVw.InternetExplorer ie = new SHDocVw.InternetExplorerClass();
object o = System.Reflection.Missing.Value;
TextBox txtPhoneNumber = (TextBox)this.Page.FindControl("txtPhoneNumber");
object url = "http://tnid.us/search.php?q=" + txtPhoneNumber.Text;
StringBuilder sb = new StringBuilder();
if (ie != null) {
ie.Navigate2(ref url,ref o,ref o,ref o,ref o);
ie.Visible = false;
while(ie.Busy){Thread.Sleep(2);}
IHTMLDocument2 d = (IHTMLDocument2) ie.Document;
if (d != null) {
IHTMLElementCollection all = d.all;
string ourText = String.Empty;
foreach (IHTMLElement el in all)
{
//find the text by checking each element's HTML
if (el.outerHTML != null && el.outerHTML.Contains("Current Phone Company"))
ourText = el.outerHTML;
}
// or maybe do something like this instead of the loop above...
// HTMLInputElement searchText = (HTMLInputElement)d.all.item("p", 0);
int idx = 0;
// and do a foreach on searchText to find the right "<p>"...
foreach (string s in searchText) {
if (s.Contains("Current Phone Company") || s.Contains("Original Phone Company")) {
idx = s.IndexOf("<strong>") + 8;
ourText = s.Substring(idx);
idx = ourText.IndexOf('<');
ourText = ourText.Substring(0, idx);
}
}
// ... then decode "ourText"
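string[] ourArray = ourText.Split(';');
foreach (string s in ourArray) {
// each piece looks like "&#65"; skip anything without a numeric code
string[] parts = s.Split('#');
if (parts.Length < 2) continue;
sb.Append((char)int.Parse(parts[1]));
}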
// sb.ToString() is now your phone company carrier....
}
}
if (sb != null)
lblCarrier.Text = sb.ToString();
else
lblCarrier.Text = "No MSIE?";
}
For some reason I don't get the "Current Phone Company" when I just use the tnid.us site directly, though, only the Original. So you might want to have the code test what it's getting back, i.e.
bool currentCompanyFound = false;
if (s.Contains("Current Telephone Company")) { currentCompanyFound = true; }
I have it checking for either one, above, so you get something back. What the code should do is to find the area of HTML between
<p class="lt">Current Telephone Company:<br /><strong>
and
</strong></p>
I have it looking for the index of
<strong>
and adding the length of that tag to get to the starting position. (String.IndexOf accepts either a string or a single character.) You get the point, and you or someone else can probably get it working from there.
That text you get back is encoded with char codes, so you'd have to convert those. I gave you some code above that should assist in that... it's untested and completely from my head, but it should work or get you where you're going.
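The extract-then-decode step can be sketched independently of the browser automation. The Go example below is an illustration only: `extractCarrier` is a hypothetical helper, the sample HTML is made up to follow the markup quoted above, and the decoding relies on the standard library's `html.UnescapeString` rather than the manual split on `;` and `#`.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// extractCarrier returns the decoded text between the first <strong> that
// follows label and the next </strong>. The bool reports whether it was found.
func extractCarrier(page, label string) (string, bool) {
	i := strings.Index(page, label)
	if i < 0 {
		return "", false
	}
	rest := page[i:]
	start := strings.Index(rest, "<strong>")
	if start < 0 {
		return "", false
	}
	rest = rest[start+len("<strong>"):]
	end := strings.Index(rest, "</strong>")
	if end < 0 {
		return "", false
	}
	// Turns numeric character references such as &#65; back into characters.
	return html.UnescapeString(rest[:end]), true
}

func main() {
	// Made-up sample in the shape of the markup quoted above.
	page := `<p class="lt">Current Telephone Company:<br /><strong>&#65;&#67;&#77;&#69;</strong></p>`
	name, ok := extractCarrier(page, "Current Telephone Company")
	fmt.Println(name, ok)
}
```

Searching for the label first and only then for the tag avoids picking up an unrelated `<strong>` earlier on the page.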
Did you look just slightly farther down on the tnid.us result page?
Need API access? Contact sales@tnID.us.
[Disclosure: I work for Twilio]
You can retrieve phone number information with Twilio Lookup.
If you are currently evaluating services and functionality for phone number lookup, I'd suggest giving Lookup a try via the quickstart.