Blocks of similar text for test data - unit-testing

For testing purposes I need to create sets of text files that have similar but not identical text. Each set needs to be different from the other sets but also share some commonality.
For example, I may need to create 10 sets of 20 documents each for a total of 200 documents. Each document needs about 250 words in it.
For example, if one set of documents is about dogs, then it would be appropriate for the other sets to be about other animals, so that there is a weak link between the sets (in this case, animals) and a strong link between the documents within a set (such as dogs in one set and cats in another).
The words in the documents do not need to be in any particular order, nor do they need to be in sentences or make sense.
Does anybody know how I can generate or obtain this type of data for my unit tests?

How about grabbing some text from Project Gutenberg?
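If you end up generating the documents programmatically instead, here is a minimal sketch of one way to do it in Python (the word lists below are placeholders; you could pull real vocabulary from Gutenberg texts): draw each document's words from a per-set vocabulary plus a smaller shared vocabulary, so documents within a set are strongly linked and the sets are only weakly linked.
import random

shared_words = ["animal", "fur", "tail", "pet", "food"]        # weak link across sets
set_words = {
    "dogs": ["dog", "puppy", "bark", "leash", "kennel"],
    "cats": ["cat", "kitten", "purr", "whisker", "litter"],
}

random.seed(0)
documents = {}                                                 # (set name, doc index) -> text
for set_name, words in set_words.items():
    for i in range(20):                                        # 20 documents per set
        pool = words * 3 + shared_words                        # in-set words dominate
        documents[(set_name, i)] = " ".join(random.choices(pool, k=250))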

I needed a test data set for text indexing to benchmark Solr indexing speed.
I downloaded source code from GitHub as a zip file; for example, this one is huge:
https://github.com/spring-projects/spring-framework
(use the "download as zip" button).

Related

How can I see if a document contains any of a set of words using ML.net?

I am learning ML.net and want to submit the text of a document to see if it contains one or more specific words. In my example I want to categorize my document with the day of the week. For example, if it contains the word "Monday" or "first", then I want it to categorize it as "Monday".
You will need to create a training dataset. This should consist of documents that contain the words you are looking for AND documents that don't. You will need to label these with a column that contains a Yes or No as the ground truth.
Then use the Model Builder interface and select a classification task.
Best to start with a small training set just to check you have the data in the right format.
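For reference, a minimal sketch of what such a labeled training file could look like, generated here with plain Python (the file name and column names are just examples, not anything ML.net requires):
import csv

KEYWORDS = {"monday", "first"}                      # words that should produce a "Yes" label

documents = [
    "The meeting is on Monday morning.",
    "This is the first day of the sprint.",
    "Nothing is scheduled for the weekend.",
]

with open("training.csv", "w", newline="") as f:    # hypothetical file name
    writer = csv.writer(f)
    writer.writerow(["Text", "Label"])              # ground-truth column holds Yes/No
    for doc in documents:
        label = "Yes" if any(w in doc.lower() for w in KEYWORDS) else "No"
        writer.writerow([doc, label])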

Searching a large data file for a credit card type and its relevant info

So I am running Python 3.7.1 and I am trying to make a program that pulls out only the customers that use an American Express card and displays only their name and email.
I have part of the code working: it pulls all the customer data for that card type, but it returns duplicates of the same name and email along with all the other information. I just can't figure out how to eliminate the duplicates and display only name and email. Below I will show a picture of my code and a screenshot of the output for reference.
My code so far
Output (notice the duplicates of Mary and Hunter)
Assuming your file isn't extremely long, consider using Python's set data structure to filter out duplicates. You can check for membership within the set via the in operator (e.g. x in s) and you can add new elements to the set via the add() method (e.g. s.add(x)). At a high level, you want to amend your code to check whether the element is already in your set (in which case you don't need to print it again), and if it is not in the set, add it to ensure you don't print it again.
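A minimal sketch of that approach, assuming the data is in a CSV file; the file and column names below are hypothetical, since the question's actual data layout isn't shown:
import csv

seen = set()                                            # (name, email) pairs already printed

with open("customers.csv", newline="") as f:            # hypothetical file name
    for row in csv.DictReader(f):
        if row["CardType"] != "American Express":       # hypothetical column name
            continue
        key = (row["Name"], row["Email"])               # hypothetical column names
        if key in seen:
            continue                                    # duplicate, already printed
        seen.add(key)
        print(row["Name"], row["Email"])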

I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?

I could split the file contents up into separate search documents but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are 2 files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits they said the following:
You do raise two very valid concerns for your more challenging use case. With regard to relevance, you already face a very significant problem in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If the documents you have are much like reports or whitepapers, a potential workaround may be to index the first X number of characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate what the document subject matter may be, without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field that holds a unique "Parent" id shared by every chunk of the whole document. The post-processing can check whether this "Parent" id has already been returned (the first result should be seen as the most relevant), and if it has, filter out the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that filters on all matches within that particular Parent id.
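A minimal sketch of that post-processing step, assuming each indexed chunk carries a hypothetical parent_id field and the hits arrive sorted by relevance:
def dedupe_by_parent(hits):
    # Keep only the most relevant chunk per parent document.
    seen_parents = set()
    unique = []
    for hit in hits:                        # hits assumed sorted by relevance score
        parent = hit["parent_id"]           # hypothetical field added to each chunk at index time
        if parent in seen_parents:
            continue                        # a more relevant chunk of this file was already kept
        seen_parents.add(parent)
        unique.append(hit)
    return unique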

How are documents retrieved after reduce produces the output?

So, after reduce completes its job, we have the output data stored in files.
But what happens when the user types something? How is search performed when the data is stored just in files?
MapReduce is for processing. So once you have processed the data and generated your aggregate information, which is on HDFS, you will either have to read the files in some program to display them to the user, or use one of several alternative options for reading the data from HDFS:
You could use Hive: create a table on top of this data and read it using SQL-like queries. A simple web application can connect to this via the Thrift server, which provides a JDBC interface to Hive.
Other options include loading the data into HBase, Shark, etc. It all depends on your use case in terms of the size of the aggregated data and the performance requirements.
What you have constructed after MapReduce is an inverted index, a nice little data structure. Now you have to use it.
For example, in the case of Google, this inverted index is sharded across many servers, and each server stores the entire list for its words. So, for example, server 500 has the list for "be", and another has the list for "to". These are implementation details; you could theoretically store it on one box in a large hash if you could hold the index in memory.
When the customer types words into the engine, it will retrieve the entire list for each word. If there are multiple words, it will intersect those lists to show you the documents that contain both words.
Here is the full paper on how they did it: http://infolab.stanford.edu/~backrub/google.html
See "Figure 4. Google Query Evaluation"

What is a good design pattern to implement a dynamic data importer tool?

We are planning to build a dynamic data import tool. Basically, it takes information on one end in a specified format (Access, Excel, CSV) and uploads it into a web service.
The situation is that we do not know the export field names, so the application will need to be able to see the WSDL definition and map to the valid entries on the other end.
In the import section we can define most of the fields, but usually there are a few that are custom, which I see no problem with.
I just wonder if there is a design pattern that will fit this type of application or help with the development of it.
I am not sure where the complexity is in your application, so I will just give an example of how I have used patterns for importing data of different formats. I created a factory which takes the file format as an argument and returns a parser for that format. Then I use the Builder pattern: the parser is provided with a builder, which the parser calls as it parses the file to construct the desired data objects in the application.
// In this example the file format describes a house (a complex data object)
AbstractReader reader = factory.createReader("name of file format");
AbstractBuilder builder = new HouseBuilder(list_of_houses);
reader.import(text_stream, builder);
// now the list_of_houses should contain an extra house
// as defined in the text_stream
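A rough Python equivalent of the same factory-plus-builder collaboration, with hypothetical names (CsvHouseReader, HouseBuilder) and a trivial one-line-per-house format just for illustration:
class HouseBuilder:
    # Builder that the parser calls back into while reading the file
    def __init__(self, houses):
        self.houses = houses
    def add_house(self, name, rooms):
        self.houses.append({"name": name, "rooms": int(rooms)})

class CsvHouseReader:
    # Parser for one concrete file format (lines of "name,rooms")
    def read(self, lines, builder):
        for line in lines:
            name, rooms = line.strip().split(",")
            builder.add_house(name, rooms)

def create_reader(file_format):
    # Factory: return the parser registered for the requested format
    readers = {"csv": CsvHouseReader}
    return readers[file_format]()

houses = []
create_reader("csv").read(["villa,5", "cabin,2"], HouseBuilder(houses))
print(houses)   # [{'name': 'villa', 'rooms': 5}, {'name': 'cabin', 'rooms': 2}]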
I would say the Adapter pattern, as you are "adapting" the data from a file to an object, like the SqlDataAdapter does from a SQL table to a DataTable.
Have a different adapter for each file type/format. For example, SqlDataAdapter and MySqlDataAdapter handle the same commands against different data sources to achieve the same output DataTable.
Adapter pattern
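A minimal sketch of that idea in Python, with hypothetical CsvAdapter and JsonAdapter classes that turn different source formats into the same list-of-rows structure:
import csv, io, json

class CsvAdapter:
    def to_rows(self, text):
        return list(csv.DictReader(io.StringIO(text)))   # each CSV row becomes a dict

class JsonAdapter:
    def to_rows(self, text):
        return json.loads(text)                          # expects a JSON array of objects

# Both adapters expose the same to_rows() interface, so the rest of the
# importer doesn't care which source format the data came from.
for adapter, payload in [
    (CsvAdapter(), "name,qty\nbolt,4"),
    (JsonAdapter(), '[{"name": "nut", "qty": 9}]'),
]:
    print(adapter.to_rows(payload))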
HTH
Bones
Probably Bridge could fit, since you have to deal with different file formats.
And Façade to simplify the usage. Handle my reply with care, I'm just learning design patterns :)
You will probably also need Abstract Factory and Command patterns.
If the data doesn't match the input format you will probably need to transform it somehow.
That's where the Command pattern comes in. Because the formats are dynamic, you will need to base the commands you generate on the input. That's where Abstract Factory is useful.
Our situation is that we need to import parametric shapes from competitors' files. The layout of their screens and data fields is similar but different enough that a conversion process is needed. In addition we have over a half dozen competitors, and maintenance would be a nightmare if done through code only. Since most of them use tables to store the parameters for their shapes, we wrote a general-purpose collection of objects to convert X into Y.
In my CAD/CAM application the file import is a Command. However the conversion magic is done by a Ruleset via the following steps.
Import the data into a table. The field names are pulled in as well depending on the format.
We pass the table to a RuleSet. I will explain the structure of the RuleSet in a minute.
The RuleSet transforms the data into a new set of objects (or tables), which we retrieve.
We pass the result to the rest of the software.
A RuleSet is comprised of a set of Rules. A Rule can contain another Rule. A Rule has a CONDITION that it tests, and a MAP TABLE.
The MAP TABLE maps an incoming field to a field (or property) in the result. There can be one mapping or many. The mapping doesn't have to involve just poking the input value into an output field; we have a syntax for calculation and string concatenation as well.
This syntax is also used in the CONDITION and can incorporate multiple fields, like ([INFIELD1] & "-" & [INFIELD2])="A-B" or [DIM1] + [DIM2] > 10. Anything between the brackets is substituted with an incoming field.
Rules can contain other Rules. The way this works is that for a sub-Rule's mapping to apply, both its condition and those of its parent (or parents) have to be true. If a sub-Rule has a mapping that conflicts with a parent's mapping, then the sub-Rule's mapping applies.
If two Rules on the same level have conditions that are true and have conflicting mappings, then the rule with the higher index (or lower on the list if you are looking at a tree view) will have its mapping applied.
Nested Rules are equivalent to ANDs, while Rules on the same level are equivalent to ORs.
The result is a mapping table that is applied to the incoming data to transform it to the needed output.
It is amenable to being displayed in a UI, namely a tree view showing the rule hierarchy and a side panel showing the mapping table and conditions of the rule. Just as importantly, you can create wizards that automate common rule structures.
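A minimal sketch of the Rule/RuleSet idea described above, with the condition and mapping syntax simplified to Python callables (the real system uses a bracketed field-substitution syntax):
class Rule:
    def __init__(self, condition, mapping, children=None):
        self.condition = condition       # callable: input row -> bool
        self.mapping = mapping           # dict: output field -> callable(input row)
        self.children = children or []   # sub-Rules apply only if this Rule matched

    def apply(self, row, result):
        if not self.condition(row):
            return
        for out_field, fn in self.mapping.items():
            result[out_field] = fn(row)  # later/child writes override earlier ones
        for child in self.children:
            child.apply(row, result)

class RuleSet:
    def __init__(self, rules):
        self.rules = rules
    def transform(self, row):
        result = {}
        for rule in self.rules:          # rules later in the list win on conflicts
            rule.apply(row, result)
        return result

# Example: map a competitor's fields onto ours, with a nested size rule.
rules = RuleSet([
    Rule(lambda r: True,
         {"WIDTH": lambda r: r["DIM1"], "HEIGHT": lambda r: r["DIM2"]},
         children=[Rule(lambda r: r["DIM1"] + r["DIM2"] > 10,
                        {"CLASS": lambda r: "LARGE"})]),
])
print(rules.transform({"DIM1": 7, "DIM2": 8}))
# {'WIDTH': 7, 'HEIGHT': 8, 'CLASS': 'LARGE'}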