I'm looking for a GeoCoding provider for two purposes:
Address parsing (convert a long String into address components)
Address validation (make sure the address really exists)
I need to support North American addresses first, but keep the door open for international addresses as well.
I won't be displaying this information on a map or in a webapp, which puts me in a bit of a bind because services like Google Maps and Yahoo Maps require you to display any information you look up on their services.
Wikipedia contains a nice list of available geocoding providers here. My question is:
Is there a reliable/easy way to parse an address into components? I'd prefer embedding this logic into my application instead of having to depend on a 3rd-party provider.
Eventually I'll need to add address validation (with a map but not in a webapp). At that point, what do you recommend I do?
Is there a reliable/easy way to parse an address into components? I'd prefer embedding this logic into my application instead of having to depend on a 3rd-party provider.
No. You can always try, but it will eventually fail. There is no universal planetary standard for addresses, and not every country uses English-style addresses, which adds to the complexity of the task. There are 311 million people in the USA and nearly 7 billion people in the world; now think of how many different addresses that can represent.
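To illustrate why, here is a minimal sketch (PHP) of a naive North-America-only parser; the pattern and component names are my own illustration, not any standard:

function parseNaAddress($address) {
    // Naive pattern: "number street, city, ST zip" - illustrative only
    $pattern = '/^(?P<number>\d+)\s+(?P<street>.+?),\s*' .
               '(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+' .
               '(?P<zip>\d{5}(?:-\d{4})?)$/';
    if (preg_match($pattern, trim($address), $m)) {
        return array(
            'number' => $m['number'],
            'street' => $m['street'],
            'city'   => $m['city'],
            'state'  => $m['state'],
            'zip'    => $m['zip'],
        );
    }
    return null; // no match
}

// Works: parseNaAddress('123 Main St, Springfield, IL 62701')
// Fails: 'PO Box 500, Ottawa, ON K1A 0A9' (Canadian postal code)
// Fails: 'Calle 50 y 12, Panama City'     (no state or ZIP at all)

Every such pattern you add will eventually meet an address that breaks it.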
Eventually I'll need to add address validation (with a map but not in a webapp). At that point, what do you recommend I do?
I would use the Google Maps API V3, but since it's against the rules in your case, I would try one of the paid services available out there for address parsing/validation (there are even free ones, but they are less reliable). I think that's the best you can do.
In your case the only way to be 100% sure that the address exists and is valid would be to check it manually and then go there physically ;)
Gili, good for you for heeding license restrictions and other important "fine print".
I know you would rather embed the logic/functionality into your application without using an external service, but if you can figure out how to do that without jumping through a bunch of USPS hoopla to do it, kudos.
I work for SmartyStreets, where we do both of those things. There's a really easy API called LiveAddress which does what you need... and it performs such that it doesn't feel like you're using a third-party service. I might also add that it is usually smart business practice to dissociate non-core operations from your internal system, leaving the "black box" aspects to experts in those fields.
Here's some more information about converting a string into address components using LiveAddress.
I am running an anonymous voting contest. We are using cookies as the sole deterrent against multiple voting, but we are also tracking IP addresses and looking for suspiciously high numbers of votes from the same IP. Is there any way to prove that someone is cheating by IP rotation?
Only statistically, which is not 100% proof.
But you can easily put the statistical limits into your contest terms - for example (just an example; I don't know your traffic): no more than 1 vote per hour from the same class B network for the same candidate.
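A minimal sketch of that rule in PHP; the $votes lookup (last vote time per network and candidate) is hypothetical, so adapt it to your own storage:

// Mask an IPv4 address down to its /16 ("class B") network.
// Assumes IPv4 and 64-bit PHP (ip2long can go negative on 32-bit builds).
function classBKey($ip) {
    return long2ip(ip2long($ip) & 0xFFFF0000);
}

// $votes maps "network/candidateId" to the unix time of the last vote.
function mayVote($ip, $candidateId, $votes) {
    $key = classBKey($ip) . '/' . $candidateId;
    $lastVote = isset($votes[$key]) ? $votes[$key] : 0;
    return (time() - $lastVote) >= 3600; // at most one vote per hour
}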
A good way to filter based on cookies is to require the cookie before the contest starts, i.e. only allow previous visitors of the site to vote: place the cookie on their computers before they know about the contest. And of course you can require registration for votes, but that's a little more involved.
There is no way to identify the human sitting at the keyboard. So there's no 100% reliable way to prevent or detect multiple votes.
But, you could use some other means to identify the browser. Some useful links:
Browser info: http://panopticlick.eff.org/
Flash cookies: http://www.google.com/search?q=flash+cookies
List of various "offline storage" APIs: https://labs.isecpartners.com/breadcrumbs/breadcrumbs.html
Also, you can check the User-Agent header. Tools like Wget and curl are not normal browsers; votes submitted with those user agents are almost certainly ballot-stuffers.
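A sketch of such a check; the pattern list is illustrative and trivially spoofed, so treat a hit as a signal for review, not proof:

function suspiciousUserAgent($ua) {
    return $ua === '' || preg_match('/\b(wget|curl|libwww|python)/i', $ua);
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (suspiciousUserAgent($ua)) {
    // queue the vote for manual review instead of counting it directly
}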
Short of watching over their shoulder as they do it, you're not going to prove it. There are a few things you could do, though, to try to catch this.
The most obvious seems to be requiring email confirmation of votes (e.g. give us your email and click the link we send). You can enforce uniqueness on the emails sensibly, and I suspect "disposable" addresses would be reasonably easy to spot. This could be taken a step further to "only registered users can vote", or even, like Stack Overflow, "only users with rep > X can vote".
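By "sensibly" I mean normalizing before checking uniqueness; here's a sketch (the disposable-domain list is a tiny illustrative sample, not exhaustive):

function normalizeEmail($email) {
    $email = strtolower(trim($email));
    $at = strrpos($email, '@');
    if ($at === false) {
        return $email; // not an address; reject in separate validation
    }
    $local  = preg_replace('/\+.*$/', '', substr($email, 0, $at)); // drop "+tag"
    $domain = substr($email, $at + 1);
    if ($domain === 'googlemail.com') {
        $domain = 'gmail.com';
    }
    if ($domain === 'gmail.com') {
        $local = str_replace('.', '', $local); // Gmail ignores dots
    }
    return $local . '@' . $domain;
}

function isDisposable($email) {
    $domain = substr(strrchr($email, '@'), 1);
    $sample = array('mailinator.com', 'guerrillamail.com'); // illustrative
    return in_array($domain, $sample, true);
}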
See also this question
Since:
"In October 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of country code top-level domains (ccTLDs) in the Internet that use the IDNA standard for native language scripts."
I'm pretty sure that the standard regexes most sites currently use won't mark these as valid, or am I wrong? Has anyone actually thought about how this would play out or has anyone done anything about it?
Hope I'm not jumping the gun here.
When a user types an internationalized domain into a browser, it's translated to an ASCII form; e-mail, surely, must work the same way (however, I've never received mail from an IDNA domain and I have reason to believe browsers are the only implementors of it).
Mailing agents would have to know that when they see Unicode in an address, it must be translated to IDNA form before the MX records are looked up. I don't think in all of my system administration I've ever accounted for this. Accepting something the browser will translate to IDNA in a form element is not something I know how to do. If it is indeed translated to IDNA and a regex then attempts to validate it, it should work.
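For what it's worth, doing that translation server-side is straightforward where a library exists; for example, PHP's intl extension ships idn_to_ascii(). A sketch of normalizing the domain part before running an ordinary ASCII regex:

// Requires PHP's intl extension for idn_to_ascii().
function normalizeEmailDomain($email) {
    $at = strrpos($email, '@');
    if ($at === false) {
        return $email;
    }
    $ascii = idn_to_ascii(substr($email, $at + 1));
    return $ascii === false ? $email : substr($email, 0, $at) . '@' . $ascii;
}

// normalizeEmailDomain('user@bücher.example')
//   => 'user@xn--bcher-kva.example', which plain ASCII regexes accept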
I wouldn't be surprised if an international domain fails most e-mail regular expressions, and I think the relevance of such a failure is less than 1%. IDNA is really an "address bar" system, and an awful hack; I would really be surprised if e-mail worked on top of it.
Everyone is freaking out like something is changing. It isn't. IDNA is just moving from the domain to the TLD, and business will go on as usual, like it did before. Don't overthink it, OP.
Old regexes will mark IDNA names valid, provided they are correctly translated into ASCII DNS names.
So yes, we have a problem here: one cannot expect a user to simply input Unicode into a textarea and have the server receive an ASCII version of the domain name.
IDNA encoding is neither nice nor easy: Unicode characters are removed from the label they appear in and placed after it, with a position marker.
Reimplementing it (e.g. in JavaScript) is slow, sad, and boring. A URL-encode-like approach would have made porting it to every language easier.
Also, people on systems that don't support IDNA have a hard time figuring out by hand what a given domain looks like in ASCII.
I feel IDNA came out pretty ugly, and that will hinder its adoption.
I've got a rather large database of location addresses (500k+) from around the world, though many of the addresses are duplicates or near-duplicates.
Whenever a new address is entered, I check whether it is already in the database, and if so, I take the existing lat/long and apply it to the new entry.
The reason I don't link to a separate table is that the addresses are not used as a group to search on, and there are often enough differences between the addresses that I want to keep them distinct.
If I have a complete match on the address, I apply that lat/long. If not, I go to city level and apply that; if I can't get a match there, I have a separate process to run.
Now that you have the extensive background, here's the problem: occasionally I end up with a lat/long that is far outside the normal acceptable range of error. Strangely, it is usually just one or two of these lat/longs that fall outside the range, while the rest of the data exists in the database with the correct city name.
How would you recommend cleaning up the data? I've got the geonames database, so theoretically I have the correct data. What I'm struggling with is the routine you would run to get this done.
If someone could point me in the direction of some (low-level) data-scrubbing approach, that would be great.
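To make it concrete, here is the kind of pass I have in mind, as a sketch; getCityCentroid() is a hypothetical lookup against my geonames data, and the 50 km threshold is a guess:

function haversineKm($lat1, $lon1, $lat2, $lon2) {
    $r = 6371.0; // mean Earth radius in km
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a = pow(sin($dLat / 2), 2) +
         cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin($dLon / 2), 2);
    return 2 * $r * asin(sqrt($a));
}

function flagOutliers($rows, $thresholdKm = 50.0) {
    $flagged = array();
    foreach ($rows as $row) {
        $centroid = getCityCentroid($row['country'], $row['city']);
        if ($centroid === null) {
            continue; // falls through to my separate process
        }
        $d = haversineKm($row['lat'], $row['lng'],
                         $centroid['lat'], $centroid['lng']);
        if ($d > $thresholdKm) {
            $flagged[] = $row['id']; // re-geocode or review manually
        }
    }
    return $flagged;
}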
This is an old question, but true principles never die, right?
I work in the address verification industry for a company called SmartyStreets. When you have a large list of addresses that needs to be "cleaned up", polished to official standards, and then relied on for any aspect of your operations, you'd best look into CASS-Certified software (US only; other countries vary widely, and many don't officially offer such a service).
The USPS licenses CASS-Certified vendors to "scrub" or "clean up" (meaning: standardize and verify) address data. I would suggest that you look into a service such as SmartyStreets' LiveAddress to verify addresses or process a list all at once. There are other options, but I think this is the most flexible and affordable for you. You can scrub your initial list then use the API to validate new addresses as you receive them.
Update: I see you're using JSON for various things (I love JSON, by the way, it's so easy to use). There aren't many providers of the services you need which offer it, but SmartyStreets does. Further, you'll be able to educate yourself on the topic of address validation by reading some of the resources/articles on that site.
Problem
At work we have a department wiki (running MediaWiki). Unfortunately several people edit without logging in, and that makes it very difficult to track down editors to ask questions about the content.
There are two strategies to improve this:
encourage logged-in editing
discourage anonymous editing.
Encouraging
For this part, any tips are welcome. But of course there are always risks involved in rewarding behaviour.
Discouraging
I know that this must be kept low-key, or else it will discourage editing altogether. But something just slightly annoying would be nice to have.
[update]
I know it is possible to just disallow anonymous editing, but that will put a high barrier to any first time contribution (especially for people outside our department!), so I do not think that is an option.
[/update]
[update2]
Using LDAP or Active Directory does not solve the problem since the wiki is also accessible and used by external contractors.
[/update2]
[update3]
I am no longer working for this company. That does not mean I have completely lost interest in this question, but from my current point of interest the most valuable part is the "Did you forget to log in?" part below, and I will accept answers based on that part of the question.
[/update3]
Confirmation
One thought was to have an additional confirmation step for anonymous users - "Are you really sure you want to submit this anonymously?" - although with such a question there is a risk that people will give up or resist editing. However, if that question is re-phrased in a more diplomatic way as "Did you forget to log in?", I think it will appear much more acceptable. Besides, that will also capture those situations where the author did in fact forget to log in but would actually want to have his/her contributions credited to his/her user. This last point is by itself a good enough reason for wanting it.
Is this possible?
Delay
Another thought for something slightly annoying is to add an extra forced delay after "save page", displaying something like "If you had logged in you would not have to wait x seconds". Selecting the right x is difficult, because if it is too high it will be a barrier, and if it is too low it might not make any difference. But then I started thinking: what about starting at zero and adding one second of delay for each anonymous edit by a given IP address in a given time frame? That way there is no barrier to starting to use the wiki, and by the time the delay is getting significant the user has already contributed a lot, so I think the outcome is much more likely to be that the editor eventually creates a user rather than giving up. This assumes IP addresses are rather static, but that is typically the case on a business network. I sketch the idea below.
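To sketch what I mean (the $editLog storage is hypothetical; wiring it into MediaWiki's save path is the part I'm asking about):

// One extra second of delay per anonymous edit from the same IP within
// the window. $editLog is a list of entries like ['ip' => ..., 'time' => ...].
function anonSaveDelaySeconds($ip, $editLog, $windowSeconds = 86400) {
    $cutoff = time() - $windowSeconds;
    $count = 0;
    foreach ($editLog as $edit) {
        if ($edit['ip'] === $ip && $edit['time'] >= $cutoff) {
            $count++;
        }
    }
    return $count; // 0 for a first-time editor, growing by 1s per edit
}

// On save: sleep(anonSaveDelaySeconds($ip, $editLog)); then display
// "If you had logged in you would not have to wait x seconds".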
Is this possible?
You can turn off anonymous editing in MediaWiki like so:
Edit LocalSettings.php and add the following setting:
$wgDisableAnonEdit = true;
Edit includes/SkinTemplate.php, find $fname-edit and change the code to look like this (i.e., basically wrap the following code between the wfProfileIn() and wfProfileOut() functions):
wfProfileIn( "$fname-edit" );
global $wgDisableAnonEdit;
if ( $wgUser->mId || !$wgDisableAnonEdit) {
// Leave this as is
}
wfProfileOut( "$fname-edit" );
Next, you may want to disable the [Edit] links on sections. To do this, open includes/Skin.php and search for editsection. You will see something like:
if (!$wgUser->getOption( 'editsection' ) ) {
Change that to:
global $wgDisableAnonEdit;
if ( !$wgUser->getOption( 'editsection' ) || ( $wgDisableAnonEdit && !$wgUser->mId ) ) { // hide for anons when disabled
Section editing is now blocked for anonymous users.
Forbid anonymous editing and let people log in using their domain logins (LDAP). Often the threshold is having to register a new user and make up a username and password and such.
I think you should discourage anonymous edits by forbidding them - it's an internal wiki, after all.
The flip side is that you must make the login process as easy as possible. Hopefully you can configure the login cookie to have a decent lifetime (like 1 month) so they only need to log in once per month.
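If I remember right, MediaWiki exposes this as $wgCookieExpiration in LocalSettings.php (the value is in seconds):

// Keep login sessions alive for 30 days.
$wgCookieExpiration = 30 * 86400;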
Play to people's egos, and add a rep system kind of like here. Just make a widget for the home page that shows the number of edits made by the top 5 users or something. Give the top 1 or 2 users an MVP reward at regular (monthly?) intervals.
Well, I doubt that this solution will be valuable for hlovdal, given that this question is now two months old, but maybe somebody else will find it useful:
The optimum solution to this problem is to enable automatic logins. This requires two steps. First, you need to add automatic authentication to your web service. Right now, we're using Apache with the Debian libapache2-authenntlm-perl package on our internal application server*. (Our network is Active Directory and, obviously, the server runs on Debian Linux.) Second, you need a MediaWiki extension that makes MediaWiki aware of the web service's authentication. I've used the Automatic REMOTE_USER Authentication module successfully on an Apache web server that was tied into our network via an NTLM authentication module, but I do recall that it required a bit of massaging of the code to make it work:
I had to follow the "horrid hacks" given on the extension's page, changing the setPassword() and addUser() functions to always return true instead of always returning false.
Since Active Directory is case-insensitive and MediaWiki isn't, I replaced both instances of the statement $username = $_SERVER['REMOTE_USER'] with $username = getCanonicalName($_SERVER['REMOTE_USER']).
Since I wanted to only allow certain people within the company to use our wiki, I set autoCreate() to always return false. It doesn't sound as if you need to worry about this, so you should leave autoCreate() at always returning true, which means that anybody on your company network will be able to access the wiki.
The nifty thing about this solution is that nobody ever has to log in to the wiki; they simply go to a wiki page and they are logged in under their network ID.
* We just switched to this from a Red Hat server that was using mod_ntlm. Unfortunately, mod_ntlm hasn't been updated in a while and it's been starting to sporadically fail. I mention this because I've started to stumble on a performance issue with our current MediaWiki configuration that may require further code massaging....
Make sure users don't get logged out if they look away from the screen or sneeze or scratch their head. You want long, persistent, sessions. Once logged in, stay logged in.
That's the problem with the MediaWiki our company is using internally - you log in, do stuff, then come back later and it has logged you out, but the notification that you are no longer logged in is so insignificant on the screen that the user never notices.
If this runs within an internal network, you could pull Active Directory information so that no one has to log in, ever. That's how I do it at work. That is, if they are logged into their windows machine, then my webapps can pick up their username and associate that (or their userid) with their edits.
I don't know if this would be easy to add to MediaWiki, though.
I'd recommend checking out wikipatterns.org - a great site about the social aspects of wikis.
Explicitly using some form of directory service (LDAP) would probably be a good idea, so that your users are always fully identified. On the other hand, wikis are subject to their own dynamics; in fact, some wikis are so successful precisely because they can be anonymously edited, so that's another thing to keep in mind.
Apart from that, personally I'd try to create some sort of incentive for users to contribute openly and identifiably: this could be based on a point/score system, with stats shown for all users who have contributed to the wiki each day, which could possibly even create some friendly competition.
Likewise, the wiki could by default not show any anonymously contributed content without it being reviewed first, which would be another incentive for users to contribute openly.
SO has an extremely low barrier for posting. You could allow people to specify their name when making an edit. When they are ready, they can finally log in to avoid having to type their name all the time.
You said this is in a departmental situation. Can't you add a feature to the wiki where it makes an educated guess as to who is editing based on the IP address, and annotates the edit accordingly?
I agree absolutely with everyone who recommends carefully researching the effects of anonymity in your application before you start "forbidding" it. In a great many cases people prefer anonymous editing because they DO NOT WANT TO BE ASKED ABOUT IT, IDENTIFIED WITH IT, OR SUFFER SOME PROBLEM FOR POINTING IT OUT. You need to be VERY sure these factors are not driving users to prefer anonymous edits, and frankly you should continue to allow anonymized edits with a generic credential login like "anonymous_employee" or "anonymous_contractor", in case someone wants to point out an issue without becoming identified with it.
Re the "thought... to have an additional confirmation step for anonymous users- "Are you really sure you want to submit this anonymously?", it's a good idea, but do not "re-phrase" in a way that suggests it is wrong to not be logged in as yourself, i.e. don't say "Did you forget to log in?" I'd instead note it this way:
"Your edit will appear as an IP number - it may be attributed to 'anonymous_employee' or 'anonymous_contractor' or 'anonymous_contributor' for your privacy protection. You will not be notified of any answer or response to it. If you prefer to have this contribution credited, then [log in right now]."
That leaves it absolutely clear what will happen, doesn't pressure anyone to do it either way, and does not bias what is being contributed with some "rewards".
You can also, alternately, force a login via LDAP / cookies, and then ask them if they prefer this edit to be anonymous. That is the approach taken on some blog platforms. In an intranet the abuse potential for this is basically zero, so you would presumably only have situations where someone didn't want 'how they knew' or 'why they raised this' to be the question rather than the data itself... IBM has shown in some careful research that anonymized feedback is very much more useful than attributed in correcting groupthink & management blind sides.