How to do large-scale, batch reverse geo-coding? - geocoding

I have a very large list of lat/lon coordinate pairs (>50 million). I want to attach address information to each one. Most geo/revgeo services have strict call limits. Assuming computing power isn't the issue, how can I accomplish this? Also note that time/speed are not the primary concern.
One place to start might be the

You can get one of the dedicated AWS geocoders for unlimited volumme processing: https://aws.amazon.com/marketplace/search/results?x=0&y=0&searchTerms=geocoder

Intro
I have experience working with SmartyStreets's batch processing tool. They don't have call limits (paid version). But, they also don't have a Reverse Geocode API (yet!). Their batch processing is strictly for flexibility and ease-of-use in addition to normal calls. But, I am aware of a couple services that do Reverse Geocoding, and they mention batch processing on their website.
How they work
Batch processing services generally allow you to upload your data, even arbitrarily large files. You probably want to put your data in a CSV file (type of spreadsheet) as latitude and longitude pairs. Then, their servers will process the data and alert you when you can download. It's common practice to charge money for this download, but maybe TAMU's is free?
Suggestions on who to use
Texas A&M Geoservices
MapLarge
Both of these services have demos and developer portals to guide you along if there is something you want to research before using them.
(Full disclosure: I have worked for SmartyStreets.)

Related

Microservice architecture for ETL

I am redesigning a small monolith ETL software written in Python. I find a microservice architecture suitable as it will give us the flexibility to use different technologies if needed (Python is not the nicest language for enterprise software in my opinion). So if we had three microservices (call them Extract, Transform, Load), we could use Java for Transform microservice in the future.
The problem is, it is not feasible here to pass the result of a service call in an API response (say HTTP). The output from Extract is going to be gigabytes of data.
One idea is to call Extract and have it store the results in a database (which is really what that module is doing in the monolith, so easy to implement). In this case, the service will return only a yes/no response (was the process successful or not).
I was wondering if there were a better way to approach this. What would be a better architecture? Is what I'm proposing reasonable?
If your ETL process works on individual records (some parallelize-able units of computation), then there are a lot of options you could go with, here are a few:
Messaging System-based
You could base your processing around a messaging system, like Apache Kafka. It requires a careful setup and configuration (depending on durability, availability and scalability requirements of your specific use-cases), but may give you a better fit than a relational db.
In this case, the ETL steps would work completely independently, and just consume some topics, produce into some other topics. Those other topics are then picked up by the next step, etc. There would be no direct communication (calls) between the E/T/L steps.
It's a clean and easy to understand solution, with independent components.
Off-the-shelf processing solutions
There are a couple of OTS solutions for data processing/computation and transformation: Apache Flink, Apache Storm, Apache Spark.
Although these solutions would obviously confine you to one particular technology, they may be better than building a similar system from scratch.
Non-persistent
If the actual data is streaming/record-based, and it is not required to persist the results between steps, you could just get away with long-polling the HTTP output of the previous step.
You say it is just too much data, but that data doesn't have to go to the database (if it's not required), and could just go to the next step instead. If the data is produced continuously (not everything in one batch), on the same local network, I don't think this would be a problem.
This would be technically very easy to do, very simple to validate and monitor.
I would suggest you to have a look into the Apache flink, It is very similar to what big sized enterprise apps like informatica, talend and data stage mappings but it process in a smaller scale but repetitively. It actually helps you to compute and transform the stuff on the fly/as they arrive and then store/load into a file/db.
The current infra we have with flink process close 28.5GB per every 4 hours and it just works. In the initial days, we had to run our daily batch and the flink stream to ensure both of them are producing consistent results and eventually most of the streams were left active and the daily batches were retired gradually.
Hope it helps someone.
There's none preventing you to have an SFTP server containing CSV or database storing the results. You can do whatever make senses. Using messaging to pass gigabytes of data, or streaming through HTTP may or may not make senses for your case.
This is an interesting problem. The best solution for this could be Reactive Spring Boot. You can have your Extract service to be as a Reactive Spring Boot app and instead of sending GBs of data, stream the data to the required service.
Now you might be wondering that while streaming, it might hold on the working thread. The answer is NO. IT works at the OS level. It doesn't hold up any request thread to stream the results. That's the beauty of the Reactive Spring Boot.
Go through this and explore
https://spring.io/blog/2016/07/28/reactive-programming-with-spring-5-0-m1

Design a System to monitor web services

I want to know how to design a system that would monitor my web services status like CPU usage, whether the service is up or not. I searched it on internet but it shows me different tools. I want to design my own system. A very basic guidance will help me a lot.
It depends how deep you want to go. In principle you need to store timeseries data, for example rrdtool can be used, then you need to collect a data based on some time interval and last but not least you should be able to present data, graphs for example.
Does it make sense to reinvent the wheel it is up to you, but for such problem there are open source systems based on rrdtool, like Cacti or Nagios.
Also influxdb with Graphana is powerfull in this respect, just to name a few.

Full text search with Amazon Services

I would to move my application to Amazon SimpleDB, since I’m not going to maintain database service on my own. This application lives under heavy load. There are a lot of reads/writes per second. I don’t need consistency and atomicity and I want to keep things as simple as possible, so SimpleDB is good choice.
The problem is, that I need full-text search capacities. And I don’t know how to make it better with Amazon SimpleDb. I had implemented before hand-written full-text search with MongoDB database. I had to split text to words in my application layer, and build my own index. It was not hard, but I don’t want to do it again with SimpleDB.
I found an interesting article
http://codingthriller.blogspot.com/2008/04/simpledb-full-text-search-or-how-to.html
But I would like to not have to implement it myself. I’m looking for a pre-made solution
What are the options?
Is it better to use Amazon RDS + Lucene?
Or probably there are out of the box solutions for SimpleDB?
Requirements are:
Ability to handle a lot of concurrency requests
Full-text search (text size would not be greater then 1MB (SimpleDB restriction))
Preferable not to admin it on my own.
Lucene or similar is usually the way people do it, but not knowing what platform you're working with its hard to suggest anything in particular. Simol is an .NET object-persistence framework for SimpleDB which can use Lucene.NET for indexing. I've also looked at some basic Lucene.NET examples which aren't too bad. If you're looking for a hosted indexing service you could take a look at this question.
For your indexing to do its job well, you're more than likely going to have to tailor it to your application.
Amazon looks like they will announce something to do with search on Jan 18 2012. http://pandodaily.com/2012/01/17/good-news-for-ec2-customers-amazon-may-launch-new-cloud-search-tomorrow/
SimpleDB for full text search is not great. It will not search more than about 300,000 documents on a single field, using the %like% operator, for instance. It will take about 2 or three tries - about 15 seconds to run through only a hundred MB of text looking for a match. I think its too slow, as do others. See the AWS forums...
Amazon CloudSearch has been released but does not have an easy way to move data from your SimpleDB to CloudSearch without you writing code.
The API, however, is fairly simple and it probably could get up in running in a week or two depending on your needs (if you use the existined SDKs). If you're using a programming language without an SDK, then it will take you longer.
http://aws.amazon.com/cloudsearch/

What's the easiest way to do a one-time mass geocode? (580,000 addresses)

I am working on a civics related project and I need to be able to display all the properties in the City of Philadelphia on a map, so I'll need to get the latitude & longitude for all 580,000 properties. (Only once)
Most APIs like Google/Yahoo have limits of 5,000 per day, and even BatchGeo has a similar limit.
Is there a way I can do a one-time geocoding of all these addresses?
You can find a list of free and paid geocoding services at USC site.
Also check Microsoft's Geocode Dataflow API, it allows up to 200,000 entries / 300 Mb and takes up to 14 days.
Another possibility to combine several services at once: use 4 services that allow 5,000 entries a day and you'll finish your task in a month.
You can use Map Quest of Cloud Made.
I have created a small utility to help compare these API's.
The utility is hosted at below url:
http://ankit-zalani.appspot.com/GeoCode/index.jsp
Tobias, I work for an address verification (and recently, geocoding) company called SmartyStreets.
Many services have usage restrictions based on volume and license agreements which prevent users from storing the results of geocoding queries. There are some vendors, however, which don't have limits or restrictions like that.
I would recommend something like LiveAddress which will not only geocode the addresses but also perform CASS-Certified verification to make sure your addresses are correct before giving you potentially faulty coordinates. You can run 580,000 or even millions at a time in a few minutes, and we allow you to store your results.
Hope this helps. If you have any more questions about addresses, I'll personally assist.
This thread is pretty old by now, but there have been some developments in recent years making bulk geocoding very cheap. My favorite option is to just obtain a geocoding server on AWS ( google: geocoding on aws), many options there, some free some with low hourly rates (total cost depends on the server you choose, of course.)

Amazon MapReduce with cronjob + APIs

I have a website set up on an EC2 instance which lets users view info from 4 of their social networks.
Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day.
Initially we had a cron-job which went through each user and did the necessary calls to the APIs and then stored the data on the DB (amazon rds instance).
This operation should take between 2 to 30 seconds per person, which means doing it 1 by 1 would take days to update.
I was looking at MapReduce and would like to know if it would be a suitable option for what im trying to do, but at the moment I can't tell for sure.
Would I be able to give an .sql file to MapReduce, with all the records I want to update + a script that tells MapReduce what to do with each record and have it process them all simultaneously?
If not, what would be the best way to go about it?
Thanks for your help in advance.
I am assuming each user's data is independent of the other users' data, which seems logaical to me. If that-s not the case, please ignore this answer.
Since you have mutually independent data (that is, each user's data is independent from other users') there is no need to use MapReduce. MR is just a paradigm in programming that simplifies data manipulation when the data is not independent (map prepares the data, then there is sorting phase, then reduce pulls the results from the sorted records).
In your case, if you want to use more computers, just split the load between them - each computer should process ~10000 users per hour (very rough estimate). Then users can be distributed among computers beforehand or they can be requested in chunks of 1000 or so users, so the machines that end sooner can process more users.
BUT there is an added bonus in using MR framework (such as Hadoop), even if you only use one phase (map only). It does the error handling for you (nodes failing, jobs failing,...) and it takes care of distributing the input among the nodes.
I'm not sure if MR is worth all the trouble to set it up, depends on your previous experience - YMMV.
If my understanding is correct. should this application to be implement as MapReduce, all the processings are done in the Map phase and reduce might simple output the Map phase result.
So if I were to implement this, I would just divide the job into multiple EC2 instances with each instance process a given range of record in your sql data. This has made the assumption that you have an good idea of how to divide the data to different instances.
The advantage is that you needn't pay for the price of Elastic MapReduce and avoid any possible MapReduce overhead.