Catch a special character with regex

I have an XML file, and I have to match the characters < and > inside the tags and replace them, but I am having some difficulty catching them...
The XML is something like this:
<tag>text</tag>
<tag2>3 is > than 2</tag2>
<tag3>But 1 in < than 4</tag3>
I found a solution using this regex
(\s>\s|\s<\s)
including a whitespace, the character and another whitespace... but what if there aren't whitespaces?
Edit: In fact I need to replace these symbols with &lt; and &gt;...
The XML fields are obtained from a third-party software that produces an output XML file like the one I've written above.
I know that the best approach would be for that software to encode < and > as &lt; and &gt; when it reads the data, but I hoped there was a way to do it afterwards.

So basically you are receiving incorrectly formed XML and you want to find the stray < and > in the content and replace them with &lt; and &gt;.
Bad news. It is not possible to do it with regex in a generic XML way. Try building a parser.
Good news. If you introduce some limitations (i.e. if the data you are receiving comply with some requirements), there may be some good solutions.
You need a way to distinguish which symbols are part of the tags, and which symbols are part of the content.
For example, if you consider that tags have only letters and numbers, but no spaces(or other symbols) in between, something like
(?<lt><)(?:(?!\/?[[:alnum:]]*>))|(?:\s[[:alnum:]]*)(?<gt>>)
could probably work. You can play with it in https://regex101.com/r/uF0iR2/2
It is the alternation (|) of two queries. The first one matches a < that is not followed by the rest of a tag. The second one matches a > that is preceded by something containing a space. We could avoid the negative lookahead (?!), but then we could end up colliding with the other "query". We cannot use a negative look-behind here because look-behinds cannot contain quantifiers.
Finally, unrelated, another possibility for (\s>\s|\s<\s) is (\s[<>]\s)
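For completeness, here is a minimal sketch of how that replacement could look in C++ with Boost.Regex. The two regexes are simplified variants of the one above (not the exact same expression) and assume, as stated, that tags contain only letters and digits and have no attributes or internal whitespace:

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main() {
    std::string xml =
        "<tag>text</tag>\n"
        "<tag2>3 is > than 2</tag2>\n"
        "<tag3>But 1 is < than 4</tag3>\n";

    // A '<' that does NOT start a simple tag (optional '/', letters/digits, then '>')
    // is assumed to be content and becomes &lt;
    xml = boost::regex_replace(xml, boost::regex("<(?!/?[[:alnum:]]+>)"), "&lt;");

    // A '>' preceded by whitespace plus an optional run of letters/digits is assumed
    // to be content (real tags here contain no whitespace) and becomes &gt;
    xml = boost::regex_replace(xml, boost::regex("(\\s[[:alnum:]]*)>"), "$1&gt;");

    std::cout << xml;
}

Running it on the sample above leaves the tags alone and turns the two stray symbols into &gt; and &lt;.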


Write CDATA XML-node with boost::property_tree

I'm trying to write an XML-file containing CDATA-nodes using boost::property_tree. However since characters such as <, >, &, etc. are escaped automatically when writing the XML-file, something like
xml.put("node", "<![CDATA[message]]>")
will appear as
<node>&lt;![CDATA[message]]&gt;</node>
in the XML-file. Is there any way to properly write CDATA-nodes using property_tree or is this simply a limitation of the library?
Boost documentation clearly says that it is not able to distinguish between CDATA and non-CDATA values:
The XML storage encoding does not round-trip perfectly. A read-write cycle loses trimmed whitespace, low-level formatting information, and the distinction between normal data and CDATA nodes. Comments are only preserved when enabled. A write-read cycle loses trimmed whitespace; that is, if the origin tree has string data that starts or ends with whitespace, that whitespace is lost.
The few times I've faced the same problem have been for very specific cases where I knew no other escaped data would be needed, so a simple post-processing of the generated file replacing the escaped characters was enough.
As a general example:
std::ostringstream ss;
pt::write_xml(ss, xml, pt::xml_writer_make_settings<std::string>('\t', 1));
auto cleaned_xml = boost::replace_all_copy(ss.str(), "&gt;", ">");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&lt;", "<");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&amp;", "&"); // last one
std::ofstream fo(path);
fo << cleaned_xml;
A more elaborate solution would involve finding the opening &lt;![CDATA[ and the closing ]]&gt;, and replacing only within those limits to avoid touching correctly escaped symbols.
Another solution is presented in this answer but I've never used it.
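As a rough sketch of that more limited idea (not from the original answer; it assumes the whole document fits in memory and that the marker strings are exactly what write_xml emits for the escaped CDATA delimiters):

#include <boost/algorithm/string/replace.hpp>
#include <string>

std::string restore_cdata(std::string xml) {
    const std::string open  = "&lt;![CDATA[";  // what write_xml emits for "<![CDATA["
    const std::string close = "]]&gt;";        // what write_xml emits for "]]>"
    std::string::size_type pos = 0;
    while ((pos = xml.find(open, pos)) != std::string::npos) {
        const std::string::size_type end = xml.find(close, pos);
        if (end == std::string::npos) break;
        // Un-escape only this CDATA section, including its markers.
        std::string section = xml.substr(pos, end + close.size() - pos);
        boost::replace_all(section, "&gt;", ">");
        boost::replace_all(section, "&lt;", "<");
        boost::replace_all(section, "&amp;", "&");  // last one, as in the snippet above
        xml.replace(pos, end + close.size() - pos, section);
        pos += section.size();
    }
    return xml;
}

Everything outside the CDATA sections keeps its correct escaping, which is the point of limiting the replacement.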

Possible combinations (variations) of words in a string variable in Stata

I have a string variable containing school names and I need to find all the possible combinations of each word in this string variable in Stata.
For example, variations of the word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending on whether your data is already in Excel sheets or in a file, you can either use regex to try to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and use each typo as a regex match when comparing against your actual input.
Making a regexp that would actually find all possible cases is next to impossible, especially if there are cases where very similar (but correct) school names exist. In any case direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using (greedy) search/regex to find the typoed versions. You can then save the typos to use them as a filter/match/pattern.
To get some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and finally get a list of all names that did not match the correct form or any of your regexp filters, so you can manually insert/correct them.
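As a rough sketch of that filtering workflow (the school names and typo patterns below are invented purely for illustration; the same idea can be expressed in Stata or whatever tool processes the raw data, C++ and std::regex are used here only as an example):

#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

int main() {
    // correct form -> regexes covering known typos of that form (patterns invented here)
    std::map<std::string, std::vector<std::regex>> filters = {
        {"Academy", {std::regex("acad?e?my", std::regex::icase),
                     std::regex("a[ac]d[ae]me?y", std::regex::icase)}}};

    std::vector<std::string> raw = {"Academy", "acdamey", "Acadmy High", "Xyz School"};
    std::vector<std::string> unmatched;  // names needing manual review

    for (const auto& name : raw) {
        bool matched = false;
        for (const auto& [correct, patterns] : filters) {
            if (name == correct) { matched = true; break; }
            for (const auto& re : patterns)
                if (std::regex_search(name, re)) { matched = true; break; }
            if (matched) break;
        }
        if (!matched) unmatched.push_back(name);
    }

    for (const auto& name : unmatched)
        std::cout << "needs manual review: " << name << "\n";  // prints "Xyz School"
}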

Represent "less-than" symbol in Sublime Text regex in syntax definition

I'm working with an ancient pre-XML markup that uses codes of the form "$=x", where x may be an alphabetic character or a symbol on the keyboard, such as ; (semicolon), ? (question mark), or < (left angle bracket, aka less-than). [Note after editing: the confusion manifested in the question as originally phrased goes to the heart of the problem. See my comment on the accepted answer. RS]
So I've modified a copy of XML.tmLanguage syntax definition file in my User folder to identify the eleven different categories that these codes represent, so I can easily see them in the large text files (which also contain XML markup) I'm working with.
For all the symbols except < I'm able to escape the symbol by preceding it with a backslash. But in the Boost regex engine that ST2 uses, \< is how you indicate that you want to match only at the start of a word. Consequently I've been unable to get this code to be properly recognized and highlighted.
I've looked everywhere for how to escape the < symbol in this circumstance. I've tried preceding it with 0, 1, 2, 3 and 4 back-slashes; and I also tried using the hexadecimal escape code \x{3009}. [Note: this is the code for greater-than instead of less-than.]
All in vain. (A few alternatives didn't generate an error message but also didn't highlight the code.)
Because the codes I'm working with need to be colored differently, I can't use a generic symbol in lieu of <, and I can't specify it directly either. How do I do this?
The tmLanguage file is written in XML, so Sublime Text feeds it through an XML parser first, before giving pieces to its regex engine's parser.
XML uses < to open tags such as <string>, so you can't use it directly as a character. Instead, there are these standard character references:
&amp; for & (ampersand)
&lt; for < (less than)
&gt; for > (greater than; not required)
&quot; for " (quote mark; only required in attribute values quoted with ")
&apos; for ' (apostrophe; only required in attribute values quoted with ')
So use <string>\$=&lt;</string> in the syntax file. When Sublime Text reads the file, its own XML parser will turn this into \$=< for the regex parser.
Backslash sequences don't help because the XML parser passes them through unchanged to the regex parser, which then sees \< or \\, neither of which are what you want.
\x{3008} is passed by the XML parser to the regex parser, where it's decoded to 〈, a character which looks somewhat similar to < but doesn't match it. \x3C would work though.
By the way, tmLanguage files use plist (property list) XML, so you can convert it to a format that's easier to edit, or use a plist editor such as http://tustin2121.github.io/jsPlistor/ (from Is there any online .plist editor?).
Try to use &lt; for a syntax file.

Hunspell/Aspell data conversion to human-readable inflection list

Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a method that the command-line tool uses, but it doesn't output quite the format you're looking for. You could also do this manually, though, with some simple regex scripting.
The format for each set of affixes is
TYPE TAG REMOVE REPLACE MATCH
Where TAG matches what follows the / in a given word in the .dic file, you can do the following (presuming you've already stripped the word of the /...):
if ($word =~ /$match$/) { $word =~ s/$remove$/$replace/; }
Notice the $ there matching the end-of-line/word. Adjust with ^ if it's a prefix.
There are three caveats:
The $match taken directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations: if the match is something like [abc-gh], you'd be better off changing it to (a|b|c|-|g|h) or [abcgh-] (hunspell doesn't use the hyphen as a metacharacter), otherwise it'll be interpreted as [abcdefgh] (standard regex). For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-]) or to use negative look-behinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to reprocess it again because it has a double affix.
Be careful. By my rough calculations, the dictionary I'm working on could have more than 50 million words being formed (and I wouldn't be surprised if it hits beyond 100 million).
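To make the substitution step concrete, here is a small C++ sketch of applying a single affix entry to a word. This is not the real Hunspell API; the rules in main() are made-up English-style suffix entries for illustration, and REMOVE/REPLACE values of 0 are treated as empty strings, per caveat 2 above:

#include <iostream>
#include <regex>
#include <string>

// Apply one SFX entry "TYPE TAG REMOVE REPLACE MATCH" to a word.
// Returns the inflected form, or the word unchanged if MATCH doesn't apply.
std::string apply_suffix(std::string word,
                         const std::string& remove,   // REMOVE field ("0" means nothing)
                         const std::string& replace,  // REPLACE field ("0" means nothing)
                         const std::string& match) {  // MATCH field (condition on the word's end)
    if (!std::regex_search(word, std::regex(match + "$")))  // anchor the condition at the end
        return word;
    if (remove != "0" && word.size() >= remove.size())
        word.erase(word.size() - remove.size());             // strip the REMOVE suffix
    return word + (replace == "0" ? "" : replace);           // append REPLACE
}

int main() {
    // e.g. a rule in the spirit of "SFX S y ies [^aeiou]y"
    std::cout << apply_suffix("pony", "y", "ies", "[^aeiou]y") << "\n";  // ponies
    std::cout << apply_suffix("book", "0", "s", "[^sxzh]") << "\n";      // books
}

A full generator would loop over every word in the .dic file and every entry whose TAG appears after the word's /, re-running the step when the result itself carries a /... (caveat 3).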

Tokenize the text depending on some specific rules. Algorithm in C++

I am writing a program which will tokenize the input text depending upon some specific rules. I am using C++ for this.
Rules
Letter 'a' should be converted to token 'V-A'
Letter 'p' should be converted to token 'C-PA'
Letter 'pp' should be converted to token 'C-PPA'
Letter 'u' should be converted to token 'V-U'
This is just a sample; in my real case I have around 500+ rules like this. If I provide the input 'appu', it should be tokenized as 'V-A + C-PPA + V-U'. I have implemented an algorithm for doing this and wanted to make sure that I am doing the right thing.
Algorithm
All rules will be kept in a XML file with the corresponding mapping to the token. Something like
<rules>
<rule pattern="a" token="V-A" />
<rule pattern="p" token="C-PA" />
<rule pattern="pp" token="C-PPA" />
<rule pattern="u" token="V-U" />
</rules>
1 - When the application starts, read this XML file and keep the values in a 'std::map'. This will be available until the end of the application (singleton pattern implementation).
2 - Iterate over the characters of the input text. For each character, look for a match. If found, become more greedy and look for a longer match by taking the next characters from the input text. Do this until we get no match. So for the input text 'appu', first look for a match for 'a'. If found, try to get a longer match by taking the next character from the input text. So it will try to match 'ap', find no match, and just return.
3 - Replace the letter 'a' in the input text, as we got a token for it.
4 - Repeat steps 2 and 3 with the remaining characters in the input text.
Here is a simpler explanation of the steps
input-text = 'appu'
tokens-generated=''
// First iteration
character-to-match = 'a'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ap'
pattern-found = false
tokens-generated = 'V-A'
// since no match found for 'ap', taking the first success and replacing it from input text
input-text = 'ppu'
// second iteration
character-to-match = 'p'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'pp'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ppu'
pattern-found = false
tokens-generated = 'V-A + C-PPA'
// since no match found for 'ppu', taking the first success and replacing it from input text
input-text = 'u'
// third iteration
character-to-match = 'u'
pattern-found = true
tokens-generated = 'V-A + C-PPA + V-U' // we're done!
Questions
1 - Does this algorithm look fine for this problem, or is there a better way to address it?
2 - If this is the right method, is std::map a good choice here? Or do I need to create my own key/value container?
3 - Is there a library available which can tokenize strings like the above?
Any help would be appreciated
:)
So you're going through all of the tokens in your map looking for matches? You might as well use a list or array, there; it's going to be an inefficient search regardless.
A much more efficient way of finding just the tokens suitable for starting or continuing a match would be to store them as a trie. A lookup of a letter there would give you a sub-trie which contains only the tokens which have that letter as the first letter, and then you just continue searching downward as far as you can go.
Edit: let me explain this a little further.
First, I should explain that I'm not familiar with the C++ std::map beyond the name, which makes this a perfect example of why one learns the theory of this stuff as well as the details of particular libraries in particular programming languages: unless that library is badly misusing the name "map" (which is rather unlikely), the name itself tells me a lot about the characteristics of the data structure. I know, for example, that there's going to be a function that, given a single key and the map, will very efficiently search for and return the value associated with that key, and that there's also likely a function that will give you a list/array/whatever of all of the keys, which you could search yourself using your own code.
My interpretation of your data structure is that you have a map where the keys are what you call a pattern, those being a list (or array, or something of that nature) of characters, and the values are tokens. Thus, you can, given a full pattern, quickly find the token associated with it.
Unfortunately, while such a map is a good match for converting your XML input format to an internal data structure, it's not a good match for the searches you need to do. Note that you're not looking up entire patterns, but the first character of a pattern, producing a set of possible tokens, followed by a lookup of the second character of a pattern from within the set of patterns produced by that first lookup, and so on.
So what you really need is not a single map, but maps of maps of maps, each keyed by a single character. A lookup of "p" on the top level should give you a new map, with two keys: p, producing the C-PPA token, and "anything else", producing the C-PA token. This is effectively a trie data structure.
Does this make sense?
It may help if you start out by writing the parsing code first, in this manner: imagine someone else will write the functions to do the lookups you need, and he's a really good programmer and can do pretty much any magic that you want. Writing the parsing code, concentrate on making that as simple and clean as possible, creating whatever interface using these arbitrary functions you need (while not getting trivial and replacing the whole thing with one function!). Now you can look at the lookup functions you ended up with, and that tells you how you need to access your data structure, which will lead you to the type of data structure you need. Once you've figured that out, you can then work out how to load it up.
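To make the nested-map / trie idea above concrete, here is a minimal C++ sketch. The rule set is hard-coded for illustration; in the real program it would be loaded from the XML file:

#include <cstddef>
#include <iostream>
#include <map>
#include <string>

struct TrieNode {
    std::map<char, TrieNode> children;
    std::string token;          // non-empty if a rule ends at this node
};

void insert_rule(TrieNode& root, const std::string& pattern, const std::string& token) {
    TrieNode* node = &root;
    for (char c : pattern) node = &node->children[c];
    node->token = token;
}

// Greedy longest-match tokenization: walk the trie as far as possible,
// remember the last node that carried a token, emit it, then continue
// from the character just after that match.
std::string tokenize(const TrieNode& root, const std::string& text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const TrieNode* node = &root;
        std::size_t best_len = 0;
        std::string best_token;
        for (std::size_t j = i; j < text.size(); ++j) {
            auto it = node->children.find(text[j]);
            if (it == node->children.end()) break;
            node = &it->second;
            if (!node->token.empty()) { best_len = j - i + 1; best_token = node->token; }
        }
        if (best_len == 0) return "error: no rule matches at position " + std::to_string(i);
        if (!out.empty()) out += " + ";
        out += best_token;
        i += best_len;
    }
    return out;
}

int main() {
    TrieNode root;
    insert_rule(root, "a",  "V-A");
    insert_rule(root, "p",  "C-PA");
    insert_rule(root, "pp", "C-PPA");
    insert_rule(root, "u",  "V-U");
    std::cout << tokenize(root, "appu") << "\n";   // V-A + C-PPA + V-U
}

Each character is examined at most once per starting position, so the work per token is bounded by the length of the longest rule rather than by the number of rules.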
This method will work - I'm not sure that it is efficient, but it should work.
I would use the standard std::map rather than your own system.
There are tools like lex (or flex) that can be used for this. The issue would be whether you can regenerate the lexical analyzer that it would construct when the XML specification changes. If the XML specification does not change often, you may be able to use tools such as lex to do the scanning and mapping more easily. If the XML specification can change at the whim of those using the program, then lex is probably less appropriate.
There are some caveats - notably that both lex and flex generate C code, rather than C++.
I would also consider looking at pattern matching technology - the sort of stuff that egrep in particular uses. This has the merit of being something that can be handled at runtime (because egrep does it all the time). Or you could go for a scripting language - Perl, Python, ... Or you could consider something like PCRE (Perl Compatible Regular Expressions) library.
Better yet, if you're going to use the boost library, there's always the Boost tokenizer library -> http://www.boost.org/doc/libs/1_39_0/libs/tokenizer/index.html
You could use a regex (perhaps the boost::regex library). If all of the patterns are just strings of letters, a regex like "(pp|p|a|u)", with the longer alternatives listed first so the alternation prefers the longest match, would find the match you want. So:
1. Run a regex_search using the above pattern to locate the next match.
2. Plug the match-text into your std::map to get the replace-text.
3. Print the non-matched consumed input and the replace-text to your output, then repeat step 1 on the remaining input.
And done.
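A rough sketch of that loop with Boost.Regex, using the four sample rules (the handling of non-matching input text from step 3 is omitted to keep it short):

#include <boost/regex.hpp>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::string> rules = {
        {"a", "V-A"}, {"p", "C-PA"}, {"pp", "C-PPA"}, {"u", "V-U"}};

    // Longer alternatives first, so "pp" is preferred over "p".
    boost::regex pattern("pp|p|a|u");

    std::string input = "appu";
    std::string output;
    auto start = input.cbegin();
    boost::smatch m;
    while (boost::regex_search(start, input.cend(), m, pattern)) {
        if (!output.empty()) output += " + ";
        output += rules[m.str()];  // map the matched pattern to its token
        start = m[0].second;       // continue right after this match
    }
    std::cout << output << "\n";   // V-A + C-PPA + V-U
}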
It may seem a bit complicated, but the most efficient way to do that is to use a graph to represent a state chart. At first, I thought boost.statechart would help, but I figured it wasn't really appropriate. This method can be more efficient than using a simple std::map IF there are many rules, the number of possible characters is limited and the length of the text to read is quite high.
So anyway, using a simple graph:
0) create a graph with a "start" vertex
1) read the XML configuration file and create vertices when needed (transition from one "set of characters" (e.g. "pp") to a longer one (e.g. "ppa")). Inside each vertex, store a transition table to the next vertices. If the "key text" is complete, mark the vertex as final and store the resulting text
2) now read the text and interpret it using the graph. Start at the "start" vertex. (*) Use the table to interpret one character and to jump to a new vertex. If no new vertex can be selected, an error can be issued. Otherwise, if the new vertex is final, print the resulting text and jump back to the start vertex. Go back to (*) until there is no more text to interpret.
You could use boost.graph to represent the graph, but I think it is overly complex for what you need. Make your own custom representation.