Convert text to elements in XML using XSLT

I'm currently migrating XML from one CMS to another and need to convert some text to elements. Because of how the system works, some editors can only enter escaped text. The challenge is to replace some of these escaped elements and convert them into valid XML elements.
Source file:
<p>Press the <button-name>Select key </button-name>to show more information.</p>
<p>Press the <button-name>Back key</button-name> to save the
values.</p>
<p>When the storage is completed, the <product-name/> machine
displays:</p>
<p><attention>
<display-text translate="no">STORAGE COMPLETED
Press BACK to exit</display-text>
</attention></p>
What I want to do
Replace <button-name> with <gui>
Replace <product-name/> with <kt.in name="custom-name"/>
Keep other escaped elements.
XML I want
<p>Press the <gui>Select key</gui>to
show more information.</p>
<p>Press the <gui>Back key</gui>
to save the values.</p>
<p>When the storage is completed, the <kt.in name="custom-name"/> machine
displays:</p>
<p><attention> <display-text translate="no">STORAGE COMPLETED
Press BACK to exit</display-text>
</attention></p>
I tried using a string-based search-and-replace, but as I want a proper XML element as output, this wouldn't do it.

This is probably only going to work with string-based search-and-replace, depending on how many text "tags" you want to switch to XML. The bigger problem I see is actually keeping the result a proper XML element.
I don't think you can do that without writing a small tool that reads the strings between the escaped text elements, e.g.
<button-name>
and copies them into the right fields of an object, which you then serialize back as a well-formed XML element.
It doesn't really depend on the language you prefer, since there should be plenty of object-XML parsers available.
For just changing the tags you could also switch the encoding of the text, so that
&lt; would turn into <
and then filter any content between the < and > to exchange the ones you want, e.g. button-name to gui.
Hope I could give you an idea.
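For illustration, here is a minimal sketch of such a small tool in C++, following the answer's suggestion. It assumes the escaped elements appear in the source literally as &lt;...&gt; entities and only covers the two renamings from the question; the file handling and names are placeholders, not part of the original answer.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of 'from' with 'to' in 'text'.
static void replaceAll(std::string &text, const std::string &from, const std::string &to) {
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();
    }
}

int main(int argc, char *argv[]) {
    if (argc < 3) {
        std::cerr << "usage: retag <input.xml> <output.xml>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::stringstream buffer;
    buffer << in.rdbuf();
    std::string content = buffer.str();

    // Only the tags listed here are converted; every other escaped
    // element passes through untouched.
    const std::vector<std::pair<std::string, std::string>> renames = {
        {"&lt;button-name&gt;",   "<gui>"},
        {"&lt;/button-name&gt;",  "</gui>"},
        {"&lt;product-name/&gt;", "<kt.in name=\"custom-name\"/>"},
    };
    for (const auto &r : renames)
        replaceAll(content, r.first, r.second);

    std::ofstream(argv[2]) << content;
    return 0;
}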

Related

MsWord APIs - How to shift content of table cell from one to another

I tried using Cell->Copy() and Cell->Paste() methods, but this also modifies the clipboard. Is there any other method to shift the whole content of one cell to another?
I have tried setting the Cell->Range->Text but that excludes the other objects like images.
Here's some C# code that demonstrates how to use the object model to transfer formatted content between table cells within a Word application instance (same document or different documents), not using the Clipboard.
This uses Range objects for both source and target cell. When a Cell.Range is assigned to an object variable it will contain the entire cell - this includes the end-of-cell character(s) ANSI 13 + ANSI 7 that store the cell structures. In order to copy only the content, without the cell structures, these characters need to be trimmed from the Range. Depending on how the object model is used, one or two characters need to be trimmed.
Trimming is done for a Range by reducing the scope of the Range (think of it like holding the Shift key while pressing right-arrow for a selection) - here for the source Range using the method MoveEnd. In my tests for this code sample trimming one character worked, but you'll want to test it.
For the target Range the code sample uses the Collapse method to put the focus at the start of the target cell (think of it like pressing the Left-arrow key to collapse a selection to a blinking insertion point at the beginning of the selection).
Then it's simply a matter of setting the FormattedText property of the target Range to the FormattedText property of the source Range.
// 'doc' is an open Word.Document obtained from the interop Application object
Word.Table tbl = doc.Tables[1];
Word.Range rngCellSource = tbl.Cell(1, 1).Range;
Word.Range rngCellTarget = tbl.Cell(2, 1).Range;
// Trim the end-of-cell marker so only the cell content is transferred
rngCellSource.MoveEnd(Word.WdUnits.wdCharacter, -1);
//System.Diagnostics.Debug.Print(rngCellSource.Text);
// Collapse the target range to an insertion point at the start of the cell
rngCellTarget.Collapse(Word.WdCollapseDirection.wdCollapseStart);
// Transfer the formatted content without touching the Clipboard
rngCellTarget.FormattedText = rngCellSource.FormattedText;
I can't see one in the documentation.
The whole notion of "shifting" data necessarily means cutting it from where it is and pasting it someplace else.
I'd question why you are afraid of touching the clipboard. This is its purpose.

Best way to use spreadsheet RegEx to extract text and numbers and replace with the formatting?

I'm currently working on a non-profit project where I need to reformat the way the data in the rows displays.
At the moment, this is how the row data looks:
Save The Children (Donation)|10.00{0}{2}
And I need it to output like this instead:
donation_id:save_children|quantity:1|total:10.00
The first problem is sometimes there's multiple items within the row:
Save The Children (Donation)|10.00{0}{2} / Save The Forrest|15.50{0}{2}
In which case it would need to be separated by a semicolon:
donation_id:save_children|quantity:1|total:10.00;donation_id:save_forrest|quantity:1|total:15.50
The second problem is, we have 9 donation variables/causes, each needing to convert the output to a different "donation_id".
So every time it finds:
Save the Children, it needs to convert to: donation_id:save_children
Save the Forrest, to, donation_id:save_forrest
Save the Animals, to, donation_id:save_animals
And so forth.
And the third problem is that the donation amounts are variable (as people donate whatever they wish), so the "total:" dollar value that we output will often be different.
How would I go about doing this with the regex?
Thank you
You can use the regex below:
(Save) The (Children|Forrest|Animals).*?\|([0-9]+\.[0-9]+)\{0\}\{2\}([\s\/]+)?
substitution/replace with
donation_id:$1_$2|quantity:1|total:$3;
When I test for
Save The Children (Donation)|10.00{0}{2} / Save The Forrest|15.50{0}{2}
Output is
donation_id:Save_Children|quantity:1|total:10.00;donation_id:Save_Forrest|quantity:1|total:15.50;
Test it online!
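If you ever need to run this outside a spreadsheet, here is a small C++ sketch that applies the same pattern and substitution with std::regex; the pattern and replacement are taken verbatim from the answer above, everything else is illustrative.

#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string input =
        "Save The Children (Donation)|10.00{0}{2} / Save The Forrest|15.50{0}{2}";

    // Pattern and substitution from the answer above (ECMAScript grammar).
    const std::regex pattern(
        R"((Save) The (Children|Forrest|Animals).*?\|([0-9]+\.[0-9]+)\{0\}\{2\}([\s\/]+)?)");
    const std::string replacement = "donation_id:$1_$2|quantity:1|total:$3;";

    std::cout << std::regex_replace(input, pattern, replacement) << "\n";
    // donation_id:Save_Children|quantity:1|total:10.00;donation_id:Save_Forrest|quantity:1|total:15.50;
    return 0;
}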

How to store graph in files

I want to store the following information in a file.
My program consists of a set of strings that are connected, forming a graph.
I call each single string a "Tag".
Let's say we have 3 main tags: $Mohammed, $car, $color.
Each of the main tags contains sub-tags, and each sub-tag has a value, another sub-tag, or a set of sub-tags.
$Mohammad:
    $Age: "18"
    $color: $red
    $kind_of: $human
    $car:
        $type: $toyota
        $color: $blue
        $doors:
            $number: "3"
$car:
    $made_of: $metal
    $used_for: $transporting
    $types: {$mercedes,$toyota,$nissan}
    $best_color: $red
$color:
    $usedto: $coloring_things
    $example: {$red,$green,$blue,...}
But this is not the only thing: there is a connection between the tags of the same name, so that $Mohammed->$car->$color must be connected with the main tag $color, and $Mohammed->$color:$red, $car->$best_color:$red, $color->$best_color:$red and the main tag $red must all be connected to each other.
By "connected" I mean the tags should be stored in a way that lets me recall the connected tags at once, just like computer memory: when it fetches something from memory, it also brings in the information stored before and after the requested information.
When I first looked at my situation, I thought that XML would solve it, but then I realized that XML can't represent a graph.
I don't want to use a database for this. I want to keep databases as my last weapon.
Any idea or suggestion about how I can store, connect and recall the information from my program?
Thanks in advance.
You actually could use XML, but I would recommend JSON or YAML.
Your example format is already very close to YAML.
Take a look at boost's property_tree.
It contains a nice C++ way to represent your graph, and lets you very easily decide what kind of file representation you want, be that XML, JSON, or INFO.
Also, I don't see why your graph can't be represented by XML, as it supports named nodes.
Note that although property_tree also supports the INI format, that format can't represent your more-than-two-level-deep tree.
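As a rough illustration, here is a minimal boost::property_tree sketch that stores a few of the tags from the question and writes them out as JSON; the dotted paths and file name are just placeholders, and cross-links between tags of the same name would still have to be resolved by your own code when loading.

#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <iostream>

int main() {
    boost::property_tree::ptree tags;

    // Nested sub-tags are addressed with dotted paths.
    tags.put("Mohammad.Age", "18");
    tags.put("Mohammad.color", "$red");
    tags.put("Mohammad.car.type", "$toyota");
    tags.put("Mohammad.car.doors.number", "3");
    tags.put("car.made_of", "$metal");
    tags.put("car.best_color", "$red");
    tags.put("color.usedto", "$coloring_things");

    // The same tree can be written as JSON, XML or INFO.
    boost::property_tree::write_json("tags.json", tags);
    boost::property_tree::write_json(std::cout, tags);
    return 0;
}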

Comparing two documents

I have two very large lists. They were both originally in Excel; the larger one is a list of about 160,000 emails with other information like name and address, and the smaller one is a list of just 18,000 emails.
My question is: what would be the easiest way to get rid of the 18,000 rows in the first document that contain the email addresses from the second?
I was thinking regex, or maybe there is another application I can use? I have tried searching online, but there doesn't seem to be much specific to this. I also tried Notepad++, but it freezes when I try to compare these large files.
-Thank You in Advance!!
Good question. One way I would tackle this is to write a C++ program [you could extrapolate the idea to the language of your choice; you never mentioned which languages you're proficient in] that reads each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which comma-separates the values so you can work with them more easily. For the larger list, "Save As" a copy with just the email addresses, deleting the other columns for now.
Then you could open the larger list and use a nested loop to check whether each entry should go to an output file. Something like:
bool foundMatch = false;
for (std::size_t y = 0; y < LargeListVector.size(); y++) {
    for (std::size_t x = 0; x < SmallListVector.size(); x++) {
        if (SmallListVector[x] == LargeListVector[y]) foundMatch = true;
    }
    if (!foundMatch) OutputVector.push_back(LargeListVector[y]);
    foundMatch = false;
}
That might be partially pseudo-code, but do you get the idea?
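For completeness, here is a self-contained sketch of the same idea; it swaps the nested loop for a std::unordered_set purely for speed, and the file names are placeholders.

#include <fstream>
#include <string>
#include <unordered_set>

int main() {
    // emails_small.csv: one address per line (the 18,000 to remove).
    std::unordered_set<std::string> toRemove;
    std::ifstream small("emails_small.csv");
    for (std::string line; std::getline(small, line); )
        toRemove.insert(line);

    // emails_large.csv: one address per line (the 160,000-address list).
    // Lines whose address is not in the small list are kept.
    std::ifstream large("emails_large.csv");
    std::ofstream out("emails_filtered.csv");
    for (std::string line; std::getline(large, line); )
        if (toRemove.count(line) == 0)
            out << line << "\n";
    return 0;
}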
So I read a forum post, linked here, suggesting:
=MATCH(B1,$A$1:$A$3,0)>0
Column B was the large list with the 160,000 entries, and column A was my list of 18,000 addresses to delete.
I pasted this formula into a separate column to match everything. It would print out either an error or TRUE; if the data was in both columns, it printed TRUE.
Then, because I suck at Excel, I threw this text into Notepad++ and searched for all lines that contained TRUE (match case, because in my case some of the data had the word "true" in it without caps). I marked those lines, then under Search > Bookmarks I removed all lines with bookmarks. I pasted that back into Excel and voila.
I would like to thank you guys for helping and pointing me in the right direction :)

Create and use HTML full text search index (C++)

I need to create a search index for a collection of HTML pages.
I have no experience implementing a search index at all, so any general information on how to build one, what information to store, how to implement advanced searches such as "entire phrase", ranking of results, etc. is welcome.
I'm not afraid to build it myself, though I'd be happy to reuse an existing component (or use one to get started with a prototype). I am looking for a solution accessible from C++, preferably without requiring additional installations at runtime. The content is static (so it makes sense to aggregate search information), but a search might have to accumulate results from multiple such repositories.
I can make a few educated guesses, though: create a map word ==> pages for all (relevant) words; a rank can be assigned to the mapping by prominence (h1 > h2 > ... > <p>) and proximity to the top. Advanced searches could be built on top of that: searching for the phrase "homo sapiens" could list all pages that contain "homo" and "sapiens", then scan all pages returned for locations where they occur together. However, there are a lot of problematic scenarios and unanswered questions, so I am looking for references to what should be a huge amount of existing work that somehow escapes my google-fu.
[edit for bounty]
The best resource I found until now is this and the links from there.
I do have an implementation roadmap for an experimental system; however, I am still looking for:
Reference material regarding index creation and individual steps
available implementations of individual steps
reusable implementations (with above environment restrictions)
This process is generally known as information retrieval. You'll probably find this online book helpful.
Existing libraries
Here are two existing solutions that can be fully integrated into an application without requiring a separate process (I believe both will compile with VC++).
Xapian is mature and may do much of what you need, from indexing to ranked retrieval. Separate HTML parsing would be required because, AFAIK, it does not parse HTML (it has a companion program, Omega, which is a front end for indexing web sites).
Lucene is an indexing/searching Apache library in Java, with an official pre-release C version, Lucy, and an unofficial C++ version, CLucene.
Implementing information retrieval
If the above options are not viable for some reason, here's some info on the individual steps of building and using an index. Custom solutions can go from simple to very sophisticated, depending on what you need for your application. I've broken the process into 5 steps:
HTML processing
Text processing
Indexing
Retrieval
Ranking
HTML Processing
There are two approaches here
Stripping The page you referred to discusses a technique generally known as stripping, which involves removing all the HTML elements that won't be displayed and translating others to their display form. Personally, I'd preprocess using Perl and index the resulting text files. But for an integrated solution, particularly one where you want to record significance tags (e.g. <h1>, <h2>), you probably want to roll your own. Here is a partial implementation of a C++ stripping routine (appears in Thinking in C++, final version of the book here) that you could build from; a naive sketch is also shown after the parsing notes below.
Parsing A level up in complexity from stripping is html parsing, which would help in your case for recording significance tags. However, a good C++ HTML parser is hard to find. Some options might be htmlcxx (never used it, but active and looks promising) or hubbub (C library, part of NetSurf, but claims to be portable).
If you are dealing with XHTML or are willing to use an HTML-to-XML converter, you can use one of the many available XML parsers. But again, HTML-to-XML converters are hard to find, the only one I know of is HTML Tidy. In addition to conversion to XHTML, its primary purpose is to fix missing/broken tags, and it has an API that could possibly be used to integrate it into an application. Given XHTML documents, there are many good XML parsers, e.g. Xerces-C++ and tinyXML.
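As a sketch of the stripping approach, here is a deliberately naive tag stripper in standard C++; real pages need more care (scripts, comments, entities), and none of this comes from the sources linked above.

#include <iostream>
#include <string>

// Drop everything between '<' and '>' and replace each tag with a space,
// turning simple markup into indexable text.
std::string stripTags(const std::string &html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<')      inTag = true;
        else if (c == '>') { inTag = false; text += ' '; }
        else if (!inTag)   text += c;
    }
    return text;
}

int main() {
    std::cout << stripTags("<h1>Homo sapiens</h1><p>An example page.</p>") << "\n";
    return 0;
}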
Text Processing
For English at least, processing text into words is pretty straightforward. There are a couple of complications when search is involved, though.
Stop words are words known a priori not to provide a useful distinction between documents in the set, such as articles and prepositions. Often these words are not indexed and are filtered from query streams. There are many stop word lists available on the web, such as this one.
Stemming involves preprocessing documents and queries to identify the root of each word to better generalize a search. E.g. searching for "foobarred" should yield "foobarred", "foobarring", and "foobar". The index can be built and searched on roots alone. The two general approaches to stemming are dictionary based (lookups from word ==> root) and algorithm based. The Porter algorithm is very common and several implementations are available, e.g. C++ here or C here. Stemming in the Snowball C library supports several languages.
Soundex encoding One method to make search more robust to spelling errors is to encode words with a phonetic encoding. Then when queries have phonetic errors, they will still map directly to indexed words. There are a lot of implementations around, here's one.
Indexing
The map word ==> page data structure is known as an inverted index. It's inverted because it's often generated from a forward index of page ==> words. Inverted indexes generally come in two flavors: an inverted file index, which maps words to each document they occur in, and a full inverted index, which maps words to each position in each document they occur in.
The important decision is what backend to use for the index, some possibilities are, in order of ease of implementation:
SQLite or Berkeley DB - both of these are database engines with C++ APIs that integrate into a project without requiring a separate server process. Persistent databases are essentially files, so multiple index sets can be searched by just changing the associated file. Using a DBMS as a backend simplifies index creation, updating and searching.
In-memory data structure - if you're using an inverted file index that is not prohibitively large (memory consumption and time to load), this could be implemented as a std::map<std::string,word_data_class>, using boost::serialization for persistence.
On-disk data structure - I've heard of blazingly fast results using memory-mapped files for this sort of thing, YMMV. An inverted file index would involve having two index files, one representing words with something like struct {char word[n]; unsigned int offset; unsigned int count; };, and the second representing (word, document) tuples with just unsigned ints (words implicit in the file offset). The offset is the file offset of the first document id for the word in the second file; count is the number of document ids associated with that word (the number of ids to read from the second file). Searching would then reduce to a binary search through the first file with a pointer into a memory-mapped file. The downside is the need to pad/truncate words to get a constant record size.
The procedure for indexing depends on which backend you use. The classic algorithm for generating an inverted file index (detailed here) begins with reading through each document and extending a list of (page id, word) tuples, ignoring duplicate words within each document. After all documents are processed, sort the list by word, then collapse it into (word, (page id1, page id2, ...)).
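Here is a minimal in-memory sketch of that algorithm, assuming the documents have already been reduced to plain-text word streams; std::map keeps the words sorted and std::set keeps each posting list sorted and duplicate-free.

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Build an inverted file index: word ==> sorted set of page ids.
std::map<std::string, std::set<int>>
buildIndex(const std::vector<std::string> &documents) {
    std::map<std::string, std::set<int>> index;
    for (int pageId = 0; pageId < (int)documents.size(); ++pageId) {
        std::istringstream words(documents[pageId]);
        std::string word;
        while (words >> word)
            index[word].insert(pageId);  // duplicates within a page collapse here
    }
    return index;
}

int main() {
    std::vector<std::string> docs = {
        "homo sapiens and other primates",
        "the genus homo",
        "sapiens appears here too"
    };
    for (const auto &entry : buildIndex(docs)) {
        std::cout << entry.first << ":";
        for (int id : entry.second) std::cout << " " << id;
        std::cout << "\n";
    }
    return 0;
}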
The mifluz GNU library implements inverted indexes with storage, but without document or query parsing. It is GPL, so it may not be a viable option, but it will give you an idea of the complexities involved for an inverted index that supports a large number of documents.
Retrieval
A very common method is boolean retrieval, which is simply the union/intersection of documents indexed for each of the query words that are joined with or/and, respectively. These operations are efficient if the document ids are stored in sorted order for each term, so that algorithms like std::set_union or std::set_intersection can be applied directly.
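For example, an AND query over two sorted posting lists reduces to a single std::set_intersection call (the lists below are made up for illustration):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

// AND query: intersect two sorted posting lists of page ids.
std::vector<int> andQuery(const std::vector<int> &a, const std::vector<int> &b) {
    std::vector<int> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));
    return result;
}

int main() {
    std::vector<int> homo    = {0, 1};   // pages containing "homo"
    std::vector<int> sapiens = {0, 2};   // pages containing "sapiens"
    for (int page : andQuery(homo, sapiens))
        std::cout << "match on page " << page << "\n";   // page 0 only
    return 0;
}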
There are variations on retrieval; Wikipedia has an overview, but standard boolean is good for many or most applications.
Ranking
There are many methods for ranking the documents returned by boolean retrieval. Common methods are based on the bag of words model, which just means that the relative position of words is ignored. The general approach is to score each retrieved document relative to the query, and rank documents based on their calculated score. There are many scoring methods, but a good starting place is the term frequency-inverse document frequency formula.
The idea behind this formula is that if a query word occurs frequently in a document, that document should score higher, but a word that occurs in many documents is less informative so this word should be down weighted. The formula is, over query terms i=1..N and document j
score[j] = sum_over_i(word_freq[i,j] * inv_doc_freq[i])
where the word_freq[i,j] is the number of occurrences of word i in document j, and
inv_doc_freq[i] = log(M/doc_freq[i])
where M is the number of documents and doc_freq[i] is the number of documents containing word i. Notice that words that occur in all documents will not contribute to the score. A more complex scoring model that is widely used is BM25, which is included in both Lucene and Xapian.
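A small sketch of that scoring formula, assuming per-document term counts and document frequencies have already been computed during indexing (all names are illustrative):

#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// tf-idf score of one document j for a query, given:
//   wordFreqJ[w] - occurrences of word w in document j
//   docFreq[w]   - number of documents containing w
//   M            - total number of documents
double score(const std::vector<std::string> &query,
             const std::map<std::string, int> &wordFreqJ,
             const std::map<std::string, int> &docFreq,
             int M) {
    double s = 0.0;
    for (const std::string &w : query) {
        auto tf = wordFreqJ.find(w);
        auto df = docFreq.find(w);
        if (tf == wordFreqJ.end() || df == docFreq.end()) continue;
        s += tf->second * std::log(double(M) / df->second);
    }
    return s;
}

int main() {
    std::map<std::string, int> doc = {{"homo", 3}, {"sapiens", 1}};
    std::map<std::string, int> df  = {{"homo", 2}, {"sapiens", 1}};
    // Query "homo sapiens" against one document out of M = 3 documents.
    std::cout << score({"homo", "sapiens"}, doc, df, 3) << "\n";
    return 0;
}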
Often, effective ranking for a particular domain is obtained by trial-and-error adjustment. A starting place for adjusting rankings by heading/paragraph context could be to inflate word_freq for a word based on where it appears, e.g. 1 for a paragraph, 10 for a top-level heading. For some other ideas, you might find this paper interesting, where the authors adjusted BM25 ranking for positional scoring (the idea being that words closer to the beginning of the document are more relevant than words toward the end).
Objective quantification of ranking performance is obtained by precision-recall curves or mean average precision, detailed here. Evaluation requires an ideal set of queries paired with all the relevant documents in the set.
Depending on the size and number of the static pages, you might want to look at an existing search solution.
"How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles."
I would choose the Sphinx engine for full-text searching. The licence is GPL, but they also have a commercial version available. It is meant to be run stand-alone [2], but it can also be embedded into applications by extracting the needed functionality (be it indexing [1], searching [3], stemming, etc).
The data should be obtained by parsing the input HTML files and transforming them to plain-text by using a parser like libxml2's HTMLparser (I haven't used it, but they say it can parse even malformed HTML). If you aren't bound to C/C++ you could take a look at Beautiful Soup.
After obtaining the plain-texts, you could store them in a database like MySQL or PostgreSQL. If you want to keep everything embedded you should go with sqlite.
Note that Sphinx doesn't work out-of-the-box with sqlite, but there is an attempt to add support (sphinx-sqlite3).
I would attack this with a little sqlite database. You could have tables for 'page', 'term' and 'page term'. 'Page' would have columns like id, text, title and url. 'Term' would have a column containing a word, as well as the primary ID. 'Page term' would have foreign keys to a page ID and a term ID, and could also store the weight, calculated from the distance from the top and the number of occurrences (or whatever you want).
Perhaps a more efficient way would be to only have two tables - 'page' as before, and 'page term' which would have the page ID, the weight, and a hash of the term word.
An example query - you want to search for "foo". You hash "foo", then query all page term rows that have that term hash. Sort by descending weight and show the top ten results.
I think this should query reasonably quickly, though it obviously depends on the number and size of the pages in question. Sqlite isn't difficult to bundle and shouldn't need an additional installation.
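Here is a rough sketch of that two-table variant using the SQLite C API; the table and column names are illustrative, the term hash must of course be computed with the same function at index time, and error handling is omitted.

#include <sqlite3.h>
#include <functional>
#include <iostream>
#include <string>

int main() {
    sqlite3 *db = nullptr;
    sqlite3_open("search_index.db", &db);

    // Two-table layout: 'page' plus 'page_term' keyed by a hash of the term.
    const char *schema =
        "CREATE TABLE IF NOT EXISTS page (id INTEGER PRIMARY KEY, url TEXT, title TEXT, text TEXT);"
        "CREATE TABLE IF NOT EXISTS page_term (page_id INTEGER, term_hash INTEGER, weight REAL);";
    sqlite3_exec(db, schema, nullptr, nullptr, nullptr);

    // Search for "foo": look up rows by term hash, sort by weight, take the top ten.
    const char *query =
        "SELECT p.url FROM page_term t JOIN page p ON p.id = t.page_id "
        "WHERE t.term_hash = ? ORDER BY t.weight DESC LIMIT 10;";
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, query, -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, (sqlite3_int64)std::hash<std::string>{}("foo"));
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::cout << reinterpret_cast<const char *>(sqlite3_column_text(stmt, 0)) << "\n";

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}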
Ranking pages is the really tricky bit here. With a large sample of pages you can use links quite a lot in working out ranks. Otherwise you need to check how words seem to be placed, and also make sure your engine doesn't get fooled by 'dictionary' pages.
Good luck!