There are several ways to generate data for tests (not only unit tests): the Object Mother pattern, builders, and so on. Another useful approach is to write test data as plain text:
product: Main; prices: 145, 255; Expire: 10-Apr-2011; qty: 2; includes: Sub
product: Sub; prices: 145, 255; Expire: 10-Apr-2011; qty: 2
and then parse it into C# objects. This is easy to use in unit tests (because deep inner collections can be written on a single line), and it is even more convenient in a FitNesse-like system (because this DSL naturally fits into a wiki), and so on.
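For example, the first step of such a parser might look like this (a minimal sketch; mapping the key/value pairs onto domain objects still has to be written per format):

using System.Collections.Generic;
using System.Linq;

static class TestData
{
    // "product: Main; prices: 145, 255; qty: 2"
    //   -> { product = "Main", prices = "145, 255", qty = "2" }
    public static Dictionary<string, string> ParseLine(string line)
    {
        return line.Split(';')
                   .Select(part => part.Split(new[] { ':' }, 2))
                   .ToDictionary(kv => kv[0].Trim(), kv => kv[1].Trim());
    }
}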
So I use this approach and write a parser, but it's tedious to write one each time. I'm not a big expert in DSLs/language parsers, but I think they could help here. What would be the right one to use? I have only heard about:
DSL (I mean, any DSL)
Boo (which I think can do DSLs)
ANTLR
but I don't even know which one to pick and where to start.
So the question: is it reasonable to use some kind of DSL to generate test data? What would you suggest for doing so? Are there any existing examples?
Update: It seems I was not clear enough. It's not about raw string-to-object conversion. Look at the first line above and relate it to:
var main = Product.New("Main")
    .AddPrice(Price.New(145).WithType(PriceType.Main).AndQty(2))
    .AddPrice(Price.New(255).WithType(PriceType.Maintenance).AndQty(2))
    .Expiration(new DateTime(2011, 4, 10));
var sub = Product
    .New("Sub").Parent(main)
    .AddPrice(...);
main.AddSubProduct(sub);
products.Add(main);
products.Add(sub);
And note that I first create the sub product and then add it to main, even though it is listed in reverse order. Prices are handled in a special way. I want to specify the name of the Sub product and get a reference to the created object. I want to list all product properties, FLAT and NON-REPETITIVE, on a single line. I want to use defaults for properties. And so on.
Update: I'm still not convinced to avoid a DSL, because all the alternative examples are too verbose and not user-friendly. And no one has said anything concrete about DSLs yet.
For a data DSL, YAML is an excellent candidate. Here is a sample from Wikipedia:
---
receipt: Oz-Ware Purchase Invoice
date: 2007-08-06
customer:
    given: Dorothy
    family: Gale
items:
    - part_no: A4786
      descrip: Water Bucket (Filled)
      price: 1.47
      quantity: 4
    - part_no: E1628
      descrip: High Heeled "Ruby" Slippers
      price: 100.27
      quantity: 1
bill-to: &id001
    street: |
        123 Tornado Alley
        Suite 16
    city: East Westville
    state: KS
ship-to: *id001
specialDelivery: >
    Follow the Yellow Brick
    Road to the Emerald City.
    Pay no attention to the
    man behind the curtain.
I have used YAML in several projects and am happy with it.
However, if we are talking about unit tests, it is usually simpler and more readable to construct the necessary objects "by hand" with constructors and in-place property assignments. This is because unit tests are by their nature highly focused on some piece of code (a unit), and it shouldn't be hard to create a data setup that is just enough for the test. It is OK to operate on half-complete entities in unit tests; don't bother constructing data that is not related to the concrete test.
For functional tests YAML is great.
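For C# specifically, deserializing such a document takes only a few lines with a library like YamlDotNet. Here is a sketch; the Invoice and Item classes are hypothetical, shaped to match the sample above:

using System;
using System.Collections.Generic;
using System.IO;
using YamlDotNet.Serialization;

public class Item
{
    public string part_no { get; set; }
    public string descrip { get; set; }
    public decimal price { get; set; }
    public int quantity { get; set; }
}

public class Invoice
{
    public string receipt { get; set; }
    public DateTime date { get; set; }
    public List<Item> items { get; set; }
}

class Demo
{
    static void Main()
    {
        // Property names match the YAML keys exactly, so the default
        // naming convention is enough here.
        var deserializer = new DeserializerBuilder().Build();
        var invoice = deserializer.Deserialize<Invoice>(File.ReadAllText("invoice.yaml"));
        Console.WriteLine($"{invoice.receipt}: {invoice.items.Count} item(s)");
    }
}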
I would first start by seeing if my language of choice was rich enough to build my DSL. C# ought to handle your case quite easily:
Product[] products = new Product[] {
    new TestProduct { product = "Main", prices = new[] { 145, 255 }, Expire = "10-Apr-2011", qty = 2, includes = "Sub" },
    new TestProduct { product = "Sub", prices = new[] { 145, 255 }, Expire = "10-Apr-2011", qty = 2 }
};
Not quite as pretty, but certainly tolerable enough that I would struggle to justify the extra effort of a custom DSL.
Also note that Expire is initialised with a string, but it is obviously a date. This is perfectly reasonable for a DSL idiom, since TestProduct.Expire's setter can do the translation.
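For illustration, a minimal sketch of what that setter could look like (TestProduct is the hypothetical test-only wrapper from the snippet above, shown standalone):

using System;
using System.Globalization;

public class TestProduct
{
    private DateTime expire;

    // Accepts "10-Apr-2011"-style strings and converts on assignment.
    public string Expire
    {
        set { expire = DateTime.ParseExact(value, "dd-MMM-yyyy", CultureInfo.InvariantCulture); }
    }

    // ... remaining properties (product, prices, qty, includes) elided
}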
For creating an external DSL, I would recommend Eclipse TMF Xtext, which is really good (based on ANTLR but simpler). It is built on top of Eclipse and Java; however, you can generate code for any target language.
When it comes to creating test data, I was inspired by the way the Ruby on Rails guys do it, which is YAML fixtures as mentioned in another answer, but I also saw an approach using factories, which can help you get rid of some duplication and inflexibility. Look at Railscasts episode 158, Factories not Fixtures; it might give you some ideas for designing the DSL. A sketch of the idea follows.
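To sketch the factory idea in C# terms, reusing the fluent API from the question (the defaults chosen here are made up):

using System;

public static class ProductFactory
{
    // Sensible defaults; each test overrides only what it cares about.
    public static Product CreateProduct(string name = "Main",
                                        int qty = 2,
                                        DateTime? expire = null)
    {
        return Product.New(name)
            .AddPrice(Price.New(145).WithType(PriceType.Main).AndQty(qty))
            .Expiration(expire ?? new DateTime(2011, 4, 10));
    }
}

// Usage in a test:
// var main = ProductFactory.CreateProduct();
// var sub  = ProductFactory.CreateProduct(name: "Sub", qty: 1);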
Related
I want to read the bold words as the column names of the dataframe, and the string following each bold word as the value for that particular row.
<posts>
<**row Id**="5" PostTypeId="1" **CreationDate**="2014-05-13T23:58:30.457" **Score**="7" ViewCount="315" **Body**="<p>I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p><p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.</p><p>Obviously, randomly generating code would be impractical, so how could I do this?</p>" **OwnerUserId**="5" LastActivityDate="2014-05-14T00:36:31.077" Title="How can I do simple machine learning without hard-coding behavior?" Tags="<machine-learning>" AnswerCount="1" CommentCount="1" FavoriteCount="1" ClosedDate="2014-05-14T14:40:25.950"/>
<**row Id**="7" **PostTypeId**="1" **AcceptedAnswerId**="10" CreationDate="2014-05-14T00:11:06.457" Score="2" ViewCount="297" Body="<p>As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.</p>" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="<education><open-source>" AnswerCount="3" CommentCount="4" FavoriteCount="1" **ClosedDate**="2014-05-14T08:40:54.950"/>
</posts>
I am trying to build a system which identifies various commands and inputs based on a written human-entered text. I'll start with an example, to make things cleaner. Suppose the user inputs the following text:
My name is John Doe, my age is 28 years old, my address is Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.
Based on a set of predefined markers (e.g. "name is", "age is", "address is", "I like"), I would like to detect their corresponding value (e.g. "John Doe", "28", "Barkley Street... Havana", "chocolate cake ... vanilla").
My current attempt was to tackle this via some regex patterns: for each marker I built a regex saying something along the lines of "if you find marker X, take all the text between it and any of the X, Y, Z markers you could find". That worked for extracting text between markers, but building everything on regexes is going to be very cumbersome, especially once I start taking inflections and small variations into account.
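For reference, this is the kind of pattern I mean, sketched in C# (the marker list and the punctuation handling are simplified):

using System.Linq;
using System.Text.RegularExpressions;

static class MarkerExtractor
{
    static readonly string[] Markers = { "name is", "age is", "address is", "I like" };

    // Captures everything after `marker` up to the next marker (or the end of the text).
    public static string Extract(string text, string marker)
    {
        string stop = string.Join("|", Markers.Select(Regex.Escape));
        var m = Regex.Match(text,
            Regex.Escape(marker) + @"\s+(.+?)(?=[,.]\s*(?:my\s+)?(?:" + stop + @")|$)",
            RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.TrimEnd(',', '.', ' ') : null;
    }
}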
I don't have much experience with NLP, so I'm not really sure where I should start for a proper solution. What are some appropriate approaches/solutions/libraries for tackling this problem?
What you are actually trying to do is "information extraction", particularly named entity recognition (NER) to detect the mentions of interest. For an overview, see:
https://en.wikipedia.org/wiki/Information_extraction
To actually start solving your problem with something approaching the state of the art, I would suggest looking into the Stanford NLP Toolkit (http://nlp.stanford.edu/software/) for your basic NLP tasks (tokenization, POS tagging), but their NER toolkit won't take you very far with your specific requirements. You could try their SPIED system, but I haven't used it and can't vouch for it. Ultimately, if you are serious about this task (which on the face of it sounds quite hard), you will have to write your own NER system for all the entities you want to extract. You may want to incorporate some of your regular expressions as machine-learning features to help with the task (start with a simple ML library like LibSVM or Mallet), but regardless, it will be a lot of work.
Good luck!
If the requirement is to identify named entities such as person, place, or organisation, then one could use the StanfordNER library in Python. Additionally, there is a way to train one's own custom entity-recognition model using the CRF algorithm in Python. Here is an article explaining the same.
I have a corpus of a language that has not been POS annotated before, that is, it has no existing tagset.
Apart from manually tagging it with a text editor like Notepad, is there any automatic approach to start tagging a new untagged set like my corpus?
Thanks.
It depends on how detailed the tagset should be: 10-12 basic POS tags (noun, adjective, ..., foreign, punctuation), or something more detailed (distinguishing verb forms, types of pronouns, gender, number, tense, ...).
The former is pretty much universal (see the categories of the Multext-East tagset or Google's universal tagset).
The latter is much more complicated, we have a paper about it. In short, we have a template for tagsets, then we modify it (dropping/adding categories and values) to suit a particular language.
Regarding annotation: again, it depends. If you have a small tagset, you can just manually assign a tag to each word, say in Notepad or some simple GUI (we use this one, but there are probably better ones). If you have a tagset with hundreds or thousands of tags, then you probably want better tool support. The best approach is to use a (possibly overgenerating) morphological analyzer and a GUI that lets you choose from the options the analyzer suggests.
Brat has a very nice GUI for manual annotation.
Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
I guess you could build a geocoding API on top of OpenStreetMap (they publish their data in dumps on a regular basis), but its data was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users who were early adopters of GPS technology and were more than willing to enter street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have a data point for each and every curve, where you hold the geocode location of the curve and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too, to get accurate road measurements.)
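The crow-flies part is standard: the haversine formula gives the great-circle distance between two lat/lon points. A C# sketch, treating the Earth as a sphere:

using System;

static class Geo
{
    const double EarthRadiusKm = 6371.0;

    public static double HaversineKm(double lat1, double lon1, double lat2, double lon2)
    {
        double dLat = ToRad(lat2 - lat1);
        double dLon = ToRad(lon2 - lon1);
        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(ToRad(lat1)) * Math.Cos(ToRad(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
        return EarthRadiusKm * 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
    }

    static double ToRad(double degrees) => degrees * Math.PI / 180.0;
}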
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
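A sketch of that first prototype, assuming ASP.NET Core's minimal-API hosting; the in-memory dictionary stands in for the denormalized database, and the sample row is hypothetical:

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using System.Collections.Generic;

var zipTable = new Dictionary<string, (double Lat, double Lon)>
{
    ["10001"] = (40.7506, -73.9972),   // sample row, hypothetical data
};

var app = WebApplication.CreateBuilder(args).Build();

// GET /geocode/10001 -> "40.7506,-73.9972" as raw text
app.MapGet("/geocode/{zip}", (string zip) =>
    zipTable.TryGetValue(zip, out var p)
        ? Results.Text($"{p.Lat},{p.Lon}")
        : Results.NotFound());

app.Run();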
I will first describe the problem and then the libraries I am currently looking at.
In my application, we have a set of variables that are always available, for example: TOTAL_ITEMS, PRICE, CONTRACTS, etc. (we have around 15 of them). Clients of the application would like to have certain calculations performed and displayed using those variables. Up until now, I have been constantly adding those calculations to the app. It's a pain in the butt, and I would like to make it more generic by creating a template where the user can specify a set of formulas that the application will parse and calculate.
Here is one case:
total_cost = CONTRACTS*PRICE*TOTAL_ITEMS
So, I want the user to define something like this in the template file:
total_cost = CONTRACTS*PRICE*TOTAL_ITEMS, plus some meta-data, like the screen to display it on. Hence they will be specifying the formula together with a screen, and the file will contain many formulas of this nature.
Right now, I am looking at two libraries: Spirit and matheval.
Would anyone recommend which is better for this task, and point me to references, examples, or links?
Please let me know if the question is unclear, and I will try to clarify it further.
Thanks,
Sasha
If you have a fixed number of variables it may be a bit overkill to invoke a parser. Though Spirit is cool and I've been wanting to use it in a project.
I would probably just tokenize the string, make a map of your variables keyed by name (assuming all your variables are ints):
#include <map>
#include <string>
std::map<std::string, int*> vars;  // std::string key: compares contents, not pointer values
vars["CONTRACTS"] = &contracts;
...
Then use a simple postfix calculator function to do the actual math.
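To make that concrete, here is a compact sketch of the tokenize-and-evaluate idea, written in C# for brevity (the same shunting-yard structure carries over directly to C++; it assumes a well-formed expression with + - * / and parentheses):

using System;
using System.Collections.Generic;

static class FormulaEval
{
    static int Prec(char op) => (op == '+' || op == '-') ? 1 : 2;

    public static double Evaluate(string expr, IDictionary<string, double> vars)
    {
        var values = new Stack<double>();
        var ops = new Stack<char>();
        int i = 0;
        while (i < expr.Length)
        {
            char c = expr[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }
            if (char.IsLetterOrDigit(c))
            {
                // Read a whole variable name or numeric literal.
                int start = i;
                while (i < expr.Length && (char.IsLetterOrDigit(expr[i]) || expr[i] == '_' || expr[i] == '.')) i++;
                string tok = expr.Substring(start, i - start);
                values.Push(char.IsDigit(tok[0]) ? double.Parse(tok) : vars[tok]);
                continue;
            }
            if (c == '(') { ops.Push(c); i++; continue; }
            if (c == ')')
            {
                while (ops.Peek() != '(') Apply(values, ops.Pop());
                ops.Pop(); i++; continue;
            }
            // c is an operator: apply anything of equal or higher precedence first.
            while (ops.Count > 0 && ops.Peek() != '(' && Prec(ops.Peek()) >= Prec(c))
                Apply(values, ops.Pop());
            ops.Push(c); i++;
        }
        while (ops.Count > 0) Apply(values, ops.Pop());
        return values.Pop();
    }

    static void Apply(Stack<double> values, char op)
    {
        double b = values.Pop(), a = values.Pop();
        values.Push(op switch { '+' => a + b, '-' => a - b, '*' => a * b, _ => a / b });
    }
}

// FormulaEval.Evaluate("CONTRACTS*PRICE*TOTAL_ITEMS",
//     new Dictionary<string, double> { ["CONTRACTS"] = 2, ["PRICE"] = 9.5, ["TOTAL_ITEMS"] = 100 })
// -> 1900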
Edit:
Looking at MathEval, it seems to do exactly what you want; set variables and evaluate mathematical functions using those variables. I'm not sure why you would want to create a solution at the level of a syntax parser. Do you have any requirements that MathEval does not fulfill?
Looks like it shouldn't be too hard to generate a simple parser using yacc and bison and integrate it into your code.
I don't know about matheval, but boost::spirit can do that for you pretty efficiently: see there.
If you're into template metaprogramming, you may want to have a look at Boost.Proto, but it will take some time to get started with it.