Cleaning up missed geocoding (or general advice on data cleaning) - geocoding

I've got a rather large database of location addresses (500k+) from around the world, though lots of the addresses are duplicates or near duplicates.
Whenever a new address is entered, I check whether it is already in the database, and if so, I take the existing lat/long and apply it to the new entry.
The reason I don't link to a separate table is that the addresses are not used as a group to search on, and there are often enough differences in the addresses that I want to keep them distinct.
If I have a complete match on the address, I apply that lat/long. If not, I go to city level and apply that. If I can't get a match there, I have a separate process to run.
Now that you have the extensive background, the problem: occasionally I end up with a lat/long that is far outside the normal acceptable range of error. Strangely, it is normally just one or two of these lat/longs that fall outside the range, while the rest of the data exists in the database with the correct city name.
How would you recommend cleaning up the data? I've got the GeoNames database, so theoretically I have the correct data. What I'm struggling with is what routine you would run to get this done.
If someone could point me toward some (low-level) data scrubbing guidance, that would be great.
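To make the kind of routine being asked about concrete, here is a minimal sketch in Python. It groups rows by city, takes the median lat/long of each group as a reference point, and flags rows that sit too far from it. The row layout (row_id, city_key, lat, lon) and the 50 km threshold are assumptions for illustration only. The reference point could just as well come from the GeoNames coordinates for the city.

```python
import math
from collections import defaultdict
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def find_outliers(rows, max_km=50.0):
    """rows: iterable of (row_id, city_key, lat, lon). Returns ids of suspect rows.

    The tuple layout and the max_km threshold are assumptions for this sketch.
    """
    by_city = defaultdict(list)
    for row_id, city, lat, lon in rows:
        by_city[city].append((row_id, lat, lon))

    suspects = []
    for points in by_city.values():
        # The median is not pulled around by one or two bad points per city.
        center_lat = median(p[1] for p in points)
        center_lon = median(p[2] for p in points)
        for row_id, lat, lon in points:
            if haversine_km(lat, lon, center_lat, center_lon) > max_km:
                suspects.append(row_id)
    return suspects
```

The flagged rows can then be re-geocoded or reset to the city-level coordinates. A plain median of longitudes misbehaves for cities that straddle the 180° meridian, so those would need special handling.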

This is an old question, but true principles never die, right?
I work in the address verification industry for a company called SmartyStreets. When you have a large list of addresses that needs to be "cleaned up", polished to official standards, and then relied on for any aspect of your operations, you'd best look into CASS-Certified software (US only; countries vary widely, and many don't offer such a service officially).
The USPS licenses CASS-Certified vendors to "scrub" or "clean up" (meaning: standardize and verify) address data. I would suggest that you look into a service such as SmartyStreets' LiveAddress to verify addresses or process a list all at once. There are other options, but I think this is the most flexible and affordable for you. You can scrub your initial list then use the API to validate new addresses as you receive them.
Update: I see you're using JSON for various things (I love JSON, by the way, it's so easy to use). There aren't many providers of the services you need which offer it, but SmartyStreets does. Further, you'll be able to educate yourself on the topic of address validation by reading some of the resources/articles on that site.


Can the _ga cookie value be used to exclude self traffic in Google analytics universal?

Google Analytics stores a unique user ID in a cookie named _ga. Some self traffic was already counted, and I was wondering if there's a way to filter it out by providing the _ga cookie value to some exclusion filter.
Any ideas?
Firstly, I'm gonna put it out there that there is no solution for excluding or removing historical data, except to make a filter or segment for your reports, which doesn't remove or prevent that data from showing up; it simply hides it. So if you're looking for something that gets rid of the data that is already there, sorry, not happening. Now on to making sure more data doesn't show up.
GA does not offer a way to exclude traffic by its visitor cookie (or any cookie in general). In order to do this, you will need to read the cookie yourself and expose it to something that GA can exclude by. For example, you can pop a custom variable or override/append the page name.
But this isn't really that convenient for lots of reasons, such as having to burn a custom variable slot, or having to write some server-side or client-side code to read the cookie and act on a value, etc.
And even if you do decide to do this, you're going to have to consider at least 2 scenarios in which this won't work or will break:
1) This won't work if you go to your site from a different browser, since browsers can't read each other's cookies.
2) It will break as soon as you clear your cookies.
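For what it's worth, the "read the cookie yourself" part could look something like the sketch below, here in Python on the server side. It assumes the commonly seen _ga layout of GA1.<depth>.<random>.<timestamp>, where the last two fields form the client ID. The INTERNAL_CLIENT_IDS set and both function names are made up for illustration. The result would still have to be passed to GA as a custom variable or similar so a filter can act on it.

```python
def client_id_from_ga_cookie(cookie_value):
    """Return the client ID part of a _ga cookie value, or None if malformed.

    Assumes the common 'GA1.<depth>.<random>.<timestamp>' layout.
    """
    parts = cookie_value.strip().split(".")
    if len(parts) < 4 or not parts[0].startswith("GA"):
        return None
    return ".".join(parts[-2:])

# Hypothetical list of client IDs that belong to your own browsers.
INTERNAL_CLIENT_IDS = {"1234567890.1609459200"}

def is_internal_visitor(cookie_value):
    """True when the cookie's client ID is one of the known internal IDs."""
    return client_id_from_ga_cookie(cookie_value) in INTERNAL_CLIENT_IDS
```

Both caveats above still apply: a different browser or a cleared cookie produces a new client ID that is not in the list.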
An alternative you should consider is to make an exclusion filter on your IP address. This has these benefits:
- it works regardless of which browser you are on
- you don't have to write any code for it that burns or overwrites any variables
- you don't have to care about the cookie
General Notes
I don't presume to know your situation or knowledge, so take this for what it's worth: simply throwing out general advice because nothing goes without saying.
If you were on your own site to QA something and want to remove data from some kind of development or QA effort, a better solution in general is to have a separate environment for developing and QAing stuff, for example a separate subdomain that mirrors live. Then you can easily make an exclusion on that subdomain (or have a separate view or property, or any number of other ways to keep that dev/QA traffic out).
Another thing is... how much traffic are we talking here anyway? You really shouldn't be worrying about a few hits that you've personally made on your live site. Again, I don't know your full situation or what your normal numbers look like, but in the grand scheme of things, a few extra hits are a drop of water in a bucket, and in general it's more useful to look at trends in the data over time, not exact numbers at a given point in time.

Geocoding for many addresses

Why doesn't geocoding allow me to create markers for more than 11 addresses? I have hundreds of addresses in a database, but no lat/long information. I need to mark all these addresses on a map. Somehow it displays only the first 11 markers.
I know this question has been asked before, and the solution is to set an interval between markers. I was able to display all of them by using a time interval between the markers. This solution is obviously too slow. Is there a better solution now?
Your question isn't very clear to me, but I understand that you are trying to show address locations on a map without knowing their coordinates. Using Google Maps, for example, you don't actually need latitude/longitude. But do you know the addresses are correct? Or, if you aren't using Google Maps but have a different use case entirely, then perhaps you do need the coordinates.
I work for SmartyStreets where we perform both of these services (verifying addresses and geocoding them, meaning supplying lat/lon information).
Getting lat/lon can be tricky, especially considering that addresses are often so different and anything but "normalized" or standardized. Google and similar services approximate addresses but do not verify them, and their lat/lon is sometimes likewise a best guess.
Judging from your question, it seems like something like the LiveAddress API would suit you well -- and it's free for low volume use. It's quite accurate, but in cases where it's "really" far off (meaning, a city block usually; still not too bad), it does return the level of resolution for each query.
If you have further questions or clarifications, feel free to respond (I'll help you out).
Geocoding has limits on how many addresses it will convert into lat/long. Going past them is what causes the OVER_QUERY_LIMIT status.
Client-side geocoding is limited to around 20 queries per minute (or per second). Server-side geocoding also has a limit, but it only kicks in after 2,500 queries.
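One common way to live with a limit like this is to wait and retry only when the limit is actually hit, instead of putting a fixed interval between every request. Below is a minimal Python sketch of that idea. The geocode_once callable and the OverQueryLimit exception are hypothetical stand-ins for whatever geocoding client you use, and the delays are arbitrary.

```python
import time

class OverQueryLimit(Exception):
    """Raised by the (hypothetical) geocoder when the provider reports OVER_QUERY_LIMIT."""

def geocode_all(addresses, geocode_once, base_delay=0.2, max_retries=5, sleep=time.sleep):
    """Geocode each address, backing off only when the query limit is hit.

    geocode_once(address) is a hypothetical callable that returns (lat, lng)
    or raises OverQueryLimit.
    """
    results = {}
    for address in addresses:
        delay = base_delay
        for _ in range(max_retries):
            try:
                results[address] = geocode_once(address)
                break
            except OverQueryLimit:
                sleep(delay)  # wait, then retry the same address
                delay *= 2    # exponential backoff
        else:
            results[address] = None  # gave up after max_retries attempts
    return results
```

If the coordinates are written back to the database after the first successful lookup, each address only has to be geocoded once, and the map can be drawn from stored lat/long values without hitting the limit at all.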
I have worked on this issue and I used tips based on this solution via PHP/JavaScript and AJAX:

GeoCoding providers for non-map use

I'm looking for a GeoCoding provider for two purposes:
Address parsing (convert a long String into address components)
Address validation (make sure the address really exists)
I need to support North American addresses first, but keep the door open for international addresses as well.
I won't be displaying this information on a map or in a webapp, which puts me in a bit of a bind because services like Google Maps and Yahoo Maps require you to display any information you look up on their services.
Wikipedia contains a nice list of available geocoding providers here. My question is:
Is there a reliable/easy way to parse an address into components? I'd prefer embedding this logic into my application instead of having to depend on a 3rd-party provider.
Eventually I'll need to add address validation (with a map but not in a webapp). At that point, what do you recommend I do?
"Is there a reliable/easy way to parse an address into components? I'd prefer embedding this logic into my application instead of having to depend on a 3rd-party provider."
No. You can always try to do it, but it will eventually fail. There is no universal planetary standard for addresses, and not every country uses English addresses, which adds to the complexity of the task. There are 311 million people in the USA and nearly 7 billion people in the world; now think of how many different addresses that can represent.
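To give an idea of how quickly a home-grown parser runs out of road, here is a deliberately naive Python sketch. The pattern is my own illustration and not any official standard. It only understands the single layout "<number> <street>, <city>, <ST> <zip>".

```python
import re

# Naive pattern for "<number> <street>, <city>, <ST> <zip>" only.
US_ADDRESS = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)

def parse_us_address(text):
    """Return a dict of components, or None when the text does not fit the pattern."""
    match = US_ADDRESS.match(text)
    return match.groupdict() if match else None
```

It handles "1600 Pennsylvania Ave NW, Washington, DC 20500", but returns None for "10 Downing Street, London SW1A 2AA" and for "Apt 4, 12 Main St, Springfield, IL 62701". Every new layout means another special case.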
"Eventually I'll need to add address validation (with a map but not in a webapp). At that point, what do you recommend I do?"
I would use Google Maps API V3, but since it's against the rules in your case, I would try one of the paid services available out there for address parsing/validation (there are even free ones, but they are less reliable). I think it's the best you can do.
In your case the only way to be 100% sure if the address exists and is valid would be to check it manually and then go there physically ;)
Gili, good for you for heeding license restrictions and other important "fine print".
I know you would rather embed the logic/functionality into your application without using an external service, but if you can figure out how to do that without jumping through a bunch of USPS hoopla to do it, kudos.
I work for SmartyStreets where we do both of those things. There's a really easy API called LiveAddress which does what you need... and it performs such that it doesn't seem like you're using a third-party service. I might add also, that usually it is smart business practice to dissociate non-core operations from your internal system, leaving the "black box" aspect of other stuff up to experts in those fields.
Here's some more information about converting a string into address components using LiveAddress.

Free geocoding service with non-restrictive license

I am looking for a geocoding service where I can make a request with an address or intersection, not necessarily separated into separate fields (street, city, state, etc.) and get the latitude and longitude, along with suggestions and corrections for misspelled or ambiguous queries.
I really like the Google Geocoding API, but the terms of use say that I am not allowed to store the responses or use the service for any purpose other than showing the result on one of their maps. I am planning to use it for a lightweight, mobile-friendly website that may have the option of displaying results with text only, so this would not work, assuming I am interpreting their terms correctly.
The Yahoo PlaceFinder API looks nice but it comes with similar restrictions.
I am trying to decide what would be a good choice. The Bing API looks good. I don't see any sort of restriction in their terms but am I missing something?
Does anyone know what would be a good choice? I have very limited funding, so I would prefer something that is free or cheap, at least for the near future.
You could try Nominatim; it's a tool to search OpenStreetMap data by name and address.
MapQuest provides a free API as long as you give the appropriate credit.
I'm not sure how well it handles misspellings or ambiguous queries though!
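For reference, a free-form Nominatim lookup can be sketched in Python as below. This is a minimal example against the public nominatim.openstreetmap.org search endpoint. The User-Agent string and the function names are placeholders, and the public instance has its own usage policy (identify your application, keep the request rate low), so check that before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_search_url(query, limit=1):
    """Build a Nominatim free-form search URL that returns JSON."""
    return NOMINATIM_SEARCH + "?" + urlencode({"q": query, "format": "json", "limit": limit})

def geocode(query, user_agent="example-app/0.1 (contact@example.com)"):
    """Return (lat, lon) as floats for the best match, or None if nothing matched.

    The user_agent default is a placeholder; use one that identifies your app.
    """
    request = Request(build_search_url(query), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        matches = json.load(response)
    if not matches:
        return None
    # Nominatim returns coordinates as strings in the JSON response.
    return float(matches[0]["lat"]), float(matches[0]["lon"])
```

The query is a single string, so addresses and intersections do not have to be split into separate fields first.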

Retrieve a list of the most popular GET param variations for a given URL?

I'm working on building intelligence around link propagation, and because I need to deal with many short URL services where a reverse-lookup from an exact URL address is required, I need to be able to resolve multiple approximate versions of the same URL.
An example would be a URL like
Of course, changing GET params in certain circumstances can refer to a completely different page, especially if the GET params in question refer to a profile or content ID.
But a quick parse of the page would quickly determine how similar the pages were to each other. Using a bit of machine learning, it could quickly become clear which GET params don't affect the content of the pages returned for a given site.
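Once such a set of "harmless" params is known for a site, collapsing approximate URLs into one canonical form is straightforward. Here is a minimal Python sketch. The ignorable_params set is assumed to come out of the learning step, and the example.com URLs and param names in the note below are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url, ignorable_params):
    """Drop query parameters believed not to affect content, and sort the rest.

    ignorable_params is an assumed, per-site set of parameter names.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignorable_params
    ]
    kept.sort()  # so parameter order does not create distinct URLs
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))
```

With ignorable_params = {"utm_source", "ref"}, both http://Example.com/page?utm_source=tw&id=42&ref=abc#top and http://example.com/page?id=42&utm_source=x reduce to http://example.com/page?id=42, while a different id stays a different URL.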
I'm assuming a service to send a URL and get a list of very similar URLs could only be offered by the likes of Google or Yahoo (or Twitter), but they don't seem to offer this feature, and I haven't found any other services that do.
If you know of any services that do cluster together groups of almost identical URLs in the aforementioned way, please let me know.
My bounty is a hug.
Every URL is akin to an "address" for a location of data on the internet. The "host" part of the URL (in your example, "") is a web-server, or a set of web-servers somewhere in the world. If we think of a URL as an "address", then the host could be a "country".
The country itself might keep track of every piece of mail that enters it. Some do, some don't. I'm talking about web-servers! Of course real countries don't make note of every piece of mail you get! :-)
But even if that "country" keeps track of every piece of mail - I really doubt they have any mechanism in place to send that list to you.
As for organizations that might do that harvesting themselves, I think the best bet would be Google, but even there the situation is rather grim. You see, because Google isn't the owner of every web-server ("country") in the world, they cannot know of every URL that accesses that web-server.
But they can do the reverse. Since they can index every page they encounter, they can get a pretty good idea of every URL that appears in public HTML pages on the web. Of course, this won't include URLs people send to each other in chats, SMSs, or e-mails. But still, they can get a pretty good idea of what URLs exist.
I guess what I'm trying to say is that what you're looking for doesn't exist, really. The only way you can get all the URLs used to access a single website is to be the owner of that website.
Sorry, mate.
It sounds like you need to create some sort of discrete similarity rank between pages. This could be done by finding the number of similar words between two pages, normalizing the value to a bounded range, and then mapping certain portions of the range to different similarity ranks.
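As a rough illustration of such a rank, the Python sketch below uses Jaccard similarity on the sets of words of two pages, which is already bounded to [0, 1], and then buckets it. Jaccard is just one possible choice of measure, and the 0.3 and 0.7 thresholds are arbitrary assumptions.

```python
import re

def word_set(text):
    """Lower-cased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(text_a, text_b):
    """Shared words divided by all words, a value in [0, 1]."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 1.0  # two empty pages count as identical
    return len(a & b) / len(a | b)

def similarity_rank(score, thresholds=(0.3, 0.7)):
    """Map a score in [0, 1] to a rank: 0 = different, 1 = similar, 2 = near-identical."""
    return sum(score >= t for t in thresholds)
```

For example, "the quick brown fox" and "the quick red fox" share 3 of 5 distinct words, so they score 0.6 and land in rank 1.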
You would also need to know, for each pair that you compare, what GET parameters they had in common or how close they were. This information would become the attributes that define each of your instances (stored alongside the rank mentioned above). After you have amassed a few hundred pairs of comparisons, you could perhaps do some feature subset selection to identify the GET parameters that best indicate how similar two pages are.
Of course, this could end up not finding anything useful at all as this dataset is likely to contain a great deal of noise.
If you are interested in this approach, you should look into information gain (Infogain) and feature subset selection in general. This is a link to my professor's lecture notes, which may come in handy.
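For a feel of what information gain measures here, below is a minimal Python sketch. Each instance is assumed to be a (features, rank) pair, where features is a dict such as {"same_id": True} recording whether the two compared URLs shared a given GET parameter value. The feature names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(instances, feature):
    """instances: list of (features_dict, rank). Gain from splitting on one feature."""
    labels = [rank for _, rank in instances]
    groups = {}
    for features, rank in instances:
        groups.setdefault(features[feature], []).append(rank)
    # Weighted entropy that remains after splitting on the feature.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A parameter whose agreement predicts the similarity rank well gets a high gain, while a parameter that makes no difference to the rank scores close to zero, which is the signal you would use to decide which GET params can be ignored.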