Clojure - HTML Decoder - clojure

Are there any good HTML decoders in Clojure?
I have tried a few, such as the ones from clojure.tools.html-utils and the codec from ring, but they don't decode the HTML fully, i.e. some encoded symbols remain.
If I paste the text to be decoded into a website such as https://opinionatedgeek.com/Codecs/HtmlDecoder , for example, the HTML decodes properly into text.
The type of text I am getting is Broad Institute of MIT and Harvard - Cambridge/Boston, MA - ONSITE<p>Do you want to help cure cancer? Do you care about the mission behind your software engineering?<p>We are a motivated team of software engineers building scalable tools to analyze massive amounts of genomic data using cloud compute software to process 24TB of biological data daily... and that's just the beginning! We are co-developing products to advance science with the biggest partners in the industry -- working directly with and alongside their engineers.<p>We are seeking strong software engineers to join our team. We have a flat organizational structure with self-directed, agile teams.<p>We use Scala, Spark, Akka, React & Clojurescript. Experience in the tech stack or sciences not req'd.<p>Here is some recent information on our mission: http://www.wbur.org/commonhealth/2016/07/07/precision-medici...<p>Interested? Please email Amy Massey - massey#broadinstitute.org/n/n/nID is https://news.ycombinator.com/item?
When you put this through the website I linked above you get Broad Institute of MIT and Harvard - Cambridge/Boston, MA - ONSITEDo you want to help cure cancer? Do you care about the mission behind your software engineering?We are a motivated team of software engineers building scalable tools to analyze massive amounts of genomic data using cloud compute software to process 24TB of biological data daily... and that's just the beginning! We are co-developing products to advance science with the biggest partners in the industry -- working directly with and alongside their engineers.We are seeking strong software engineers to join our team. We have a flat organizational structure with self-directed, agile teams.We use Scala, Spark, Akka, React & Clojurescript. Experience in the tech stack or sciences not req'd.Here is some recent information on our mission: http://www.wbur.org/commonhealth/2016/07/07/precision-medici...Interested? Please email Amy Massey - massey#broadinstitute.org/n/n/nID is https://news.ycombinator.com/item?
This is how I want it to look

(import '[org.jsoup Jsoup])
;; s is the HTML string; parse it and return its text with tags stripped and entities decoded
(.text (Jsoup/parse s))
Taken from crouton
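A slightly fuller sketch of the same idea, assuming the jsoup library is on the classpath. The function names html->text and decode-entities and the sample string are made up for illustration; Parser/unescapeEntities is jsoup's entity decoder, for the case where you want to keep the tags and only decode the entities.

```clojure
(import '[org.jsoup Jsoup]
        '[org.jsoup.parser Parser])

(defn html->text
  "Parse an HTML fragment and return its visible text (tags stripped, entities decoded)."
  [^String html]
  (.text (Jsoup/parse html)))

(defn decode-entities
  "Decode HTML entities only, leaving any tags in place."
  [^String s]
  (Parser/unescapeEntities s false))

;; Hypothetical sample input in the style of the text in the question.
(def sample "We use Scala, Spark &amp; Akka.<p>That&#x27;s just the beginning!")

(html->text sample)
(decode-entities "React &amp; Clojurescript")
```

Note that jsoup separates block elements such as <p> with a space in the extracted text, so the result may differ slightly from the website's output, which joins paragraphs with no separator.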

Related

Open source concept mining tools?

Are there any open source concept mining tools available today? I have only come across tools like Leximancer, which seems to fit the role but is not open source and is quite expensive for an undergraduate student. My searches have been unsuccessful so far, since the word 'concept' on both Google and Google Scholar does not seem to match what I want.
It seems to me you need a text mining tool for clustering. RapidMiner has an open-source, Java based Community Edition which has several extensions (Text Mining, R, etc.). In addition you can develop and integrate your own algorithms too.
Moreover, Rexer Analytics conducts a comprehensive data mining survey annually; you can request the reports for free.

Batch Geocoding with Garmin Mapsource

I lost track of this effort years ago but now need to geocode thousands of addresses nightly. I must use the very accurate database sitting on the machine, installed when the Nuvi map update installed Mapsource.
When I contacted Garmin years ago, they expressed an interest in providing an API for this, but then I heard nothing and did not follow up. I believe their database is provided by Navteq. Does anyone have experience with that format?
I posted on the Garmin Developer forum a while ago, but it's a little lethargic over there :)
Has anyone done this?
Does anyone know how it might be done without an API; meaning database structure and calls?
I'll take a solution in any language.
Added:
Garmin has expressed an interest in making this available to me. They just have not done it.
I do not know the database format.
I am NOT looking for an online solution or any other "alternative". This question is very specific.
Talk to Navteq directly. They will sell or license their database to you directly. The database tables are clearly documented, so you can write your own geocoder on top. It took me about a week 4 years ago, and I was only marginally proficient in SQL at the time.
You can geocode up to 10,000/day by city with NN4D after you get their free application key.
You can geocode for $18 per 1,000 with CoreLogic (aka Proxix)
Yahoo looked most promising because it has the Hadoop feature, which is also currently being utilized at Navteq. I've contacted a guy at Navteq who uses Hadoop, and I'm awaiting his feedback. According to Ben Lorica's article on Datameer at O'Reilly.com, entitled "Big Data Tool for Business Analysts", Datameer can upload from spreadsheets to Hadoop. Hadoop is a pipeline to Navteq.
Starting point - a list of the tools at the GIS Dept at USC
(I can only have one link because I'm new, but I'll add the rest when I get my points up.)
Navteq uses the Oracle format.
BUT HOLD ON ONE SECOND:
doing 1,000 lookups per night is easy,
doing 10,000 lookups per night requires a good server,
doing 1,000,000 lookups per night requires a cluster.
Letting them do the searches requires less hardware (and more traffic); using XML-RPC or a similar RPC mechanism would be best for everyone.
Buy an Oracle DB and start working.
You can use almost any language, BUT keeping the volume in mind, you should use a compiled language like C++.
gpsbabel.org has lots of material on converting between GPS formats, and a downloadable tool. My limited experience, mostly with Google Maps, Street View, etc., is that geocoding is not very accurate.
cM
The free IBM DB2 Express-C DBMS comes with Spatial Extender, which can be used to geocode US addresses. See a webinar on this. I don't know if this is an exact fit, but it can't hurt to take a look.
Also take a quick look at the DB2 documentation: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.spatial.topics.doc/doc/csbp3008.html

What is data mining from a developer's perspective?

I can find the technical explanation of what data mining is in a book or on Wikipedia, but I'm wondering what sort of development it actually involves. Is it more about using tools or more about writing tools? Is it really much different from other domains when it comes to R&D?
Data Mining is the process of discovering interesting patterns in large amounts of data. It is not querying data, which is just what user Treb describes (sorry Treb).
To understand DM from a developer's perspective, you should read the book Programming Collective Intelligence by Toby Segaran.
In my experience (I'm a former data miner :-)), it's a mixture of using tools and writing tools. A lot of the time, the tools you need to analyse the particular data set don't exist, so you have to write them yourself first. It can be very interesting but you often need quite a different approach to the sort of programming I do now (embedded wireless), for example.
You really ought to change the accepted answer on this question so it doesn't mislead those who come across it.
Saying that querying a database IS data mining because "[h]ow would you discover any pattern in your data without querying first?" is like saying opening your car door is driving because "how else would you be able to drive somewhere without opening the car door first."
You can read your data out of a text file if you want. My first data mining assignment used data sets from the UCI repository and those are almost all text files.
If you want to learn about data mining start by looking up clustering and classification. Learn about decision trees and rule based classification. Then look at k-nearest-neighbor and k-means. After that if you really want to see what data mining is all about look at Chameleon, DBScan, and Support Vector Machines. Don't necessarily learn the minutiae of the last three (they're pretty complex and math heavy) but understanding the abstract idea of what happens will tell you all you need to know in order to use the many tools and libraries that are available for each strategy.
These are only the algorithms that popped into my head just now. There are so many others that I don't recall or don't even know yet.
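As a rough illustration of one of the simpler algorithms named above, here is a minimal k-nearest-neighbor classifier sketch. The function names, the [feature-vector label] data layout, and the toy data are all assumptions made for the example; a real project would more likely use an existing library.

```clojure
(defn squared-distance
  "Squared Euclidean distance between two equal-length numeric vectors."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn knn-classify
  "Predict a label for point by majority vote among the k closest training examples.
  training is a seq of [feature-vector label] pairs."
  [training k point]
  (->> training
       (sort-by (fn [[features _]] (squared-distance features point)))
       (take k)
       (map second)
       frequencies
       (apply max-key val)
       key))

;; Hypothetical toy data: two loose groups of 2-D points.
(def training
  [[[1.0 1.0] :a] [[1.5 2.0] :a] [[2.0 1.0] :a]
   [[8.0 8.0] :b] [[9.0 8.5] :b] [[8.5 9.0] :b]])

;; The three nearest examples to this point are all labeled :a.
(knn-classify training 3 [1.2 1.4])
```

There is no training step: the classifier just keeps the labeled examples and compares a new point against them, which is why k-nearest-neighbor is often a gentle first algorithm to study.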
Data mining is about searching large quantities of data for hidden patterns. Web 2.0 example: News Corp uses its site myspace.com as a large data mine to determine what movies and products to promote. They write software to identify trends in the data that its users post to the site. News Corp does this to gather information useful for advertising campaigns and market predictions. It's different from other domains of R&D in that, from a data giver's perspective, it's passive. Rather than going out on the street and asking people in person what movies they are likely to see this summer and other such questions, the data mining tools sort out these things by analyzing data given by users voluntarily.
Wikipedia actually does have a pretty good article on it:
- http://en.wikipedia.org/wiki/Data_mining
Data mining, as I would put it, is finding patterns or trends in given data. From a developer's perspective, it might show up in applications like anti-money-laundering, where you search the data for a given pattern. Another use is in projection software, where you project a future result or outcome against a heuristic by recognizing the current trend in the data.
I think it's more about using off-the-shelf tools than developing your own. An academic example of that kind of tool might be WEKA. Of course, you still have to know which algorithms to use, how to preprocess data (this part is very important), etc.
About R&D I don't know much, but it should be like almost everything: maths, statistics, more maths...
On the development level, data mining is just another database application, but with a huge amount of data.
The mining itself is done by running specific queries on the database. It's in the creation of the queries where the important work is done. They of course depend on the data model, and on the hypotheses, what sort of trends the customer expects to find.
Therefore, the fine tuning of the queries usually can't be done in development, but only once the system is live and you have live data. Then the user can test his hypotheses and adapt the queries to show him the trends he is looking for.
So from a dev point of view, data mining is about:
- Managing large sets of data in your client (one query may return 100,000 rows of data)
- Providing the user (who may know nothing about SQL or relational databases in general) with an effective way to modify his queries and view the results.

How to understand Microsoft Dynamics products?

What is the difference between them? They are all business management solutions. Do they do the same thing? Are they different editions of some sort? Do they use the same platform?
Dynamics NAV
Microsoft Dynamics NAV 2009 is a comprehensive business management solution that helps people work faster and smarter, and gives your business the flexibility to adapt to new opportunities and growth.
Dynamics AX
Microsoft Dynamics AX 2009 is a comprehensive business management solution for mid-sized and larger organizations that works like and with familiar Microsoft software to help your people improve productivity.
Dynamics GP
Microsoft Dynamics GP is a richly featured business management solution that allows you to use familiar, powerful software to operate and grow your business.
Dynamics SL
Microsoft Dynamics SL is a business management solution specialized to help project-driven midsize organizations obtain reports and business analysis, while helping increase efficiency, accuracy, and customer satisfaction.
Generally speaking, each of these products was acquired separately, and Microsoft is trying to bring them together under one general business line, but it has not actually integrated them into a common ERP platform (yet, anyway). For example, NAV was formerly Navision, and GP was formerly Great Plains. I think AX was also part of the Navision purchase, but was a different product that Navision had themselves purchased.
Each has a separate accounting implementation that it came with, so there is a lot of overlap in the non-differentiators like accounting.
Basically they are targeted at different types of businesses. SL is for a service oriented business (like a software consulting firm). NAV would be more targeted at an inventory based operation.
I didn't investigate all of their options in depth to know all of the similarities and differences, but in a former job I had to look into NAV, AX and GP and that is what I recall it being all about.
I agree with Dave Markle: the marketing is engineered to create the maximum possible confusion. The executive suite buys these things and then marketing has to break its head to figure out how to sell and differentiate each one. As you can see, they didn't do a great job.
The marketing-ese is in full effect with the Dynamics products. All these packages were acquisitions by Microsoft, and they are making an effort to bring them to market under one brand name: Dynamics. They are aiming at the SMB market. It's not positioned to compete with SAP. Both are client-server apps.
I've worked with Dynamics SL (previously named Solomon). It's an accounting suite, with modules for Accounts Payable + Receivable, Inventory, General Ledger, Purchasing, Reporting, Cost Accounting, Purchase Orders, etc.
It's all VBA goodness. The underlying database would make your blood curdle. It's denormalized like you wouldn't believe. I guess saying 'denormalized' would indicate that it was previously normalized. I got the feeling it was never normalized. Full of technical debt.
Foreign keys are an unknown entity in SL. DBAs would have trouble taking the architecture seriously (e.g. columns actually named like User1 and User2 to indicate a custom field on the User Interface).
Dynamics GP is more oriented towards payroll. I cannot comment on its inner architecture.
They all run on the same platform. Client executables connecting to SQL Server. The forms design is like the Win95 and sometimes Win3.1 paradigm. Don't let the Outlook-like main screen fool you in the screenshot; it's the only one getting the upgrade treatment.
The licensing model is a killer, and so my previous encounter with Solomon had everyone running the same EXE from the network share. It was notoriously slow, and users rarely complimented its responsiveness.
Entire consultancy businesses are built around these products. Supply and demand allow those consultants to charge a substantial amount, relative to the web-app and other line-of-business consultants.
On their "How To Buy" form, there is a "Contact Us", and I'm completely certain that if you contacted one of the sales reps, they would go into great detail and at great length about the strengths and weaknesses of each product.
Keep in mind, they'll be highlighting more strengths than weaknesses, and they'll be highlighting the weaknesses of the lower-priced products. But the sales reps are guaranteed to know the products inside out.
Also, Wikipedia has little write-ups on each of them.
They are mostly similar (sometimes identical) to the blurb on the MS website, but there's also some extra information there.

Data mining and Business Intelligence Technologies

I've noticed an increasing number of jobs that are asking for experience with data mining and business intelligence technologies. This sounds like an incredibly broad topic but where would one go if they wanted to develop at least a partial understanding of this stuff if it were to come up in an interview?
A very good book with practical examples is Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran.
Go read Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
Then use Weka on a pet project. Despite the name, this is a good book, and the Weka package has several levels of entry into the data-m... er machine-learning world.
Consider reading Ralph Kimball's books for an introduction to Business Intelligence.
Also, try not to stick to one technology vendor. Every company has its own biased vision of BI, so you'll need a 360-degree overview.
Maybe you can also try to work with a real BI product, although it is almost impossible to get access to a running, data-filled SAS, MS, or Oracle installation. I work in a team that integrates the BellaDati BI product for enterprises. For try-out and personal purposes it is free, with some datastore limitations ( http://www.trgiman.eu/en/belladati/product/personal ).
BellaDati is also used as a learning tool at technical universities focused on the practical application of data mining and analysis. Examples of the final manager-level BellaDati dashboards can be seen at http://mercato.belladati.com/bi/mercato/show/worldexchanges
There you can work and play with SQL data sources, flat files, and web services. In my experience, showing real samples of market analysis practice (such as a case study) goes over well in an interview.
I wish you luck,
Peter