Tokenize parse option - stata

Consider a slightly different toy example from my previous question:
. local string my first name is Pearly,, and my surname is Spencer
. tokenize "`string'", parse(",,")
. display "`1'"
my first name is Pearly
. display "`2'"
,
. display "`3'"
,
. display "`4'"
and my surname is Spencer
I have two questions:
Does tokenize work as expected in this case? I thought local macro 2 should be ,, instead of , while local macro 3 would contain the rest of the string (and local macro 4 would be empty).
Is there a way to force tokenize to respect the double comma as a parsing character?

tokenize -- and gettoken too -- won't, from what I can see, accept repeated characters such as ,, as a composite parsing character. ,, is not illegal as a specification of parsing characters, but is just understood as meaning that , and , are acceptable parsing characters. The repetition in practice is ignored, just as adding "My name is Pearly" after "My name is Pearly" doesn't add information in a conversation.
To back up: know that without other instructions (such as might be given by a syntax command) Stata will parse a string according to spaces, except that double quotes (or compound double quotes) bind harder than spaces separate.
tokenize -- and gettoken too -- will accept multiple parse characters pchars and the help for tokenize gives an example with space and + sign. (It's much more common, in my experience, to want to use space and comma , when the syntax for a command is not quite what syntax parses completely.)
A difference between space and the other parsing characters is that spaces are discarded but other parsing characters are not discarded. The rationale here is that those characters often have meaning you might want to take forward. Thus in setting up syntax for a command option, you might want to allow something like myoption( varname [, suboptions])
and so whether a comma is present and other stuff follows is important for later code.
With composite characters, so that you are looking for say ,, as separators I think you'd need to loop around using substr() or an equivalent. In practice an easier work-around might be first to replace your composite characters with some neutral single character and then apply tokenize. That could need to rely on knowing that that neutral character should not occur otherwise. Thus I often use # as a character placeholder because I know that it will not occur as part of variable or scalar names and it's not part of function names or an operator.
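The placeholder workaround is not specific to Stata. As a rough illustration only, here is the same idea sketched in Python rather than Stata. The helper split_keep_delims is hypothetical: it treats every character of its second argument as a separate one-character delimiter and keeps each delimiter as a token, which loosely mimics the behaviour described above. It is not a reimplementation of tokenize.

```python
import re

def split_keep_delims(text, pchars):
    """Split text on any single character in pchars, keeping each delimiter as a token."""
    pattern = "([" + re.escape(pchars) + "])"
    pieces = re.split(pattern, text)
    # Drop empty pieces (they appear between adjacent delimiters) and trim spaces.
    return [p.strip() for p in pieces if p.strip() != ""]

s = "my first name is Pearly,, and my surname is Spencer"

# ",," is read as "comma or comma", so the two commas come back as separate tokens.
print(split_keep_delims(s, ",,"))

# Workaround: swap the composite separator for a placeholder character.
# Assumption: "#" never occurs in the data, which is checked here.
assert "#" not in s
print(split_keep_delims(s.replace(",,", "#"), "#"))
```

The second call returns three tokens, with the placeholder standing in for the double comma.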
For what it's worth, I note that in first writing split I allowed composite characters as separators. As I recall, a trigger to that was a question on Statalist which was about data for legal cases with multiple variations on VS (versus) to indicate which party was which. This example survives into the help for the official command.
On what is a "serious" bug, much depends on judgment. I think a programmer would just discover on trying it out that composite characters don't work as desired with tokenize in cases like yours.


A Regex to ignore a set of words

Is there a way to set regex to ignore a set of words separated by space?
I have different products names like:
"Matrix 10X, 10 ml + DISPENSER"
"Matrix 10X,10ml + DISPENSER" where the quantity varies
What I'm trying to do is to replace using regex all words except for:
"10 ml" | "10 ML" | "10ml" ---> these are to be ignored
I have found a code to replace all characters except words separated by space (like "10 ml")
https://regex101.com/r/bG8vB4/5
and to replace them when they are together (like "10ml")
https://regex101.com/r/bG8vB4/4
but can't find a way to mix them together to keep just "10 ml" OR "10 ML" OR "10ml" and remove other characters up to the end of the string
Regular expressions are a mathematical model for efficient recognition of strings. Just as it is easy to write a regular expression that matches a string containing any of some words, theory shows that a regular expression matching only strings that contain none of those words also exists. Constructing such a regexp, however, is far more complex.
In regular expression theory, a regular language is one that can be recognized by a finite automaton built from a regular expression. The automaton that accepts exactly the strings the original rejects is obtained by switching all accepting states to non-accepting ones and vice versa. Once that is done, the hardest part is building a regular expression that matches that automaton (it is possible, but the resulting expression is, in general, far more complex than the original). A simple example shows how complex this gets (of course, some regexp libraries provide an operator for this, but you don't say whether the one you are using does). One such example is recognizing a C language comment. A comment is a string delimited by the sequences /* and */, and the inner part cannot contain the sequence */.
The first approach could be to use the following regexp:
\/\*.*\*\/
but that fails, because the inner .* also matches */, so /* bla bla bla */ bla bla bla */ will be recognized as a single comment (it should end at the first */). So we need a regexp that recognizes anything except something that includes */
Such subexpression is:
([^*]|\*[^/])*
which means an indefinite concatenation of characters other than *, or of two-character sequences that start with * and are not followed by /. If you follow that concatenation, you'll see that it's impossible to form the sequence */, leading to our final regexp:
\/\*([^*]|\*[^/])*\*\/
(now you see how things get complicated)
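The difference between the two comment patterns can be checked directly. The following is a minimal sketch using Python's re module (an assumption, since the question does not name an engine); the variable names are illustrative only.

```python
import re

text = "/* bla bla bla */ bla bla bla */"

# First approach: .* is greedy and happily runs across an inner */.
greedy = re.compile(r"/\*.*\*/")

# Second approach: the inner part can never consume the sequence */.
excluding = re.compile(r"/\*([^*]|\*[^/])*\*/")

print(greedy.match(text).group(0))     # runs on to the last */
print(excluding.match(text).group(0))  # stops at the first */

# Caveat: \*[^/] also consumes the character after the star, so a comment
# whose closing */ is directly preceded by another * is not matched.
print(excluding.search("/***/"))
```

If that edge case matters, a pattern along the lines of /\*[^*]*\*+([^/*][^*]*\*+)*/ is a common way around it, at the cost of being harder still to read.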
To extend this to a single word (a word here being more than two letters), you have to consider that you can allow:
([^w]|w[^o]|wo[^r]|wor[^d])*
in the set, and if you have two words (like foo and bar) you have to write:
([^fb]|f[^o]|fo[^o]|b[^a]|ba[^r])*
meaning that each word contributes its own set of alternatives, which makes the final regexp rather complicated. Also, words can interact if one is a prefix of another or if several share the same leading characters. A further problem is that many libraries that compile regexps into finite automata treat the | operator as non-commutative and resolve the alternatives in order, which can lead to erroneous results.
You also have not explained what you mean by ignoring. Matching the words and skipping over them is different from ignoring the whole line on which they appear. The regexps (and the definition of the problem you need to solve) are then quite different. My explanation was in the sense of rejecting a full sentence if it contains any of the words, which is probably not what you mean. So please explain (in your question) what you mean by:
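Where the engine supports lookahead, the operator mentioned earlier is usually a negative lookahead, which avoids writing out the alternatives by hand. A minimal sketch in Python, assuming the goal is to accept a line only if it contains neither foo nor bar (the words and the helper name has_none are purely illustrative):

```python
import re

# At every position, assert that neither word starts here, then consume one character.
NONE_OF = re.compile(r"^(?:(?!foo|bar).)*$")

def has_none(line):
    """Return True if the line contains neither 'foo' nor 'bar'."""
    return NONE_OF.match(line) is not None

print(has_none("hello world"))
print(has_none("a foo b"))
```

This is no longer a regular expression in the strict theoretical sense, but most practical engines accept it.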
accepting you have matched a sentence containing a word.
rejecting such a sentence.
what are you rejecting (or ignoring) at all.
Rejecting just a word simply means selecting a sentence that contains that word and marking the word so that you can pass over it. But that's a different problem, and it requires selecting sentences that do contain the word.

Replace odd length substrings of character

I am struggling with a little problem concerning regular expressions.
I want to replace all odd length substrings of a specific character with another substring of the same length but with a different character.
All even sequences of the specified character should remain the same.
Simplified example: A string contains the letters a,b and y and all the odd length sequences of y's should be replaced by z's:
abyyyab -> abzzzab
Another possible example might be:
ycyayybybcyyyyycyybyyyyyyy
becomes
zczayybzbczzzzzcyybzzzzzzz
I have no problem matching all the sequences of odd length using a regular expression.
Unfortunately I have no idea how to incorporate the length information from these matches into the replacement string.
I know I have to use backreferences/capture groups somehow, but even after reading lots of documentation and Stack Overflow articles I still don't know how to pursue the issue correctly.
Concerning possible regex engines, I am working mainly with Emacs or Vim.
In case I have overlooked an easier general solution without a complicated regular expression (e.g. a small and fixed series of simple search and replace commands), this would help too.
Here's how I'd do it in vim:
:s/\vy@<!y(yy)*y@!/\=repeat('z', len(submatch(0)))/g
Explanation:
The regex we're using is \vy@<!y(yy)*y@!. The \v at the beginning turns on the very magic option, so we don't have to escape as much. Without it, we would have y\@<!y\(yy\)*y\@!.
The basic idea for this search is that we're looking for a single 'y', y, followed by a run of pairs of 'y's, (yy)*. Then we add y@<! to guarantee there isn't a 'y' before our match, and add y@! to guarantee there isn't a 'y' after our match.
Then we replace this using the eval register, i.e. \=. From :h sub-replace-\=:
*sub-replace-\=* *s/\=*
When the substitute string starts with "\=" the remainder is interpreted as an
expression.
The special meaning for characters as mentioned at |sub-replace-special| does
not apply except for "<CR>". A <NL> character is used as a line break, you
can get one with a double-quote string: "\n". Prepend a backslash to get a
real <NL> character (which will be a NUL in the file).
The "\=" notation can also be used inside the third argument {sub} of
|substitute()| function. In this case, the special meaning for characters as
mentioned at |sub-replace-special| does not apply at all. Especially, <CR> and
<NL> are interpreted not as a line break but as a carriage-return and a
new-line respectively.
When the result is a |List| then the items are joined with separating line
breaks. Thus each item becomes a line, except that they can contain line
breaks themselves.
The whole matched text can be accessed with "submatch(0)". The text matched
with the first pair of () with "submatch(1)". Likewise for further
sub-matches in ().
TL;DR, :s/foo/\=blah replaces foo with blah evaluated as vimscript code. So the code we're evaluating is repeat('z', len(submatch(0))), which simply makes one 'z' for each 'y' we've matched.
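The same technique, a match bounded by negative lookbehind and lookahead plus a replacement computed from the length of the match, carries over to other engines that accept a function as the replacement. A sketch in Python, offered only as an illustration outside Vim (replace_odd_runs is a hypothetical name):

```python
import re

# One 'y', then any number of 'yy' pairs, with no 'y' immediately before or after.
ODD_RUN = re.compile(r"(?<!y)y(?:yy)*(?!y)")

def replace_odd_runs(text):
    """Replace each odd-length run of 'y' with the same number of 'z'."""
    return ODD_RUN.sub(lambda m: "z" * len(m.group(0)), text)

print(replace_odd_runs("abyyyab"))
print(replace_odd_runs("ycyayybybcyyyyycyybyyyyyyy"))
```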

VB.NET Use Regex to split string on single commas, but not on double commas

There's a file I'm trying to use for a list of strings that has the following rules:
1. Cannot begin or end with an unescaped comma.
2. A comma is escaped by a preceding comma.
3. Strings are separated by unescaped commas.
4. Everything else is absolutely face-value.
I've been fiddling around with some VB.NET code to parse a file like this and split it up into either a String() or a List(Of String), but it's gotten to be a little annoying. It's not that I can't figure this out; it's that I don't want to write crap code. If it's unnecessarily confusing, unnecessarily slow, or anything else like that, it's not good enough.
Now, I know this almost starts to sound a little like a Code Review question, but I'm really starting to think that maybe a good regex would work better than trying to do this programmatically. Unfortunately regexes are not easy to work with, and while using one to tell it to escape on a comma may be a trivial matter, getting it to also ignore double commas and such is a bit more of an issue, at least for somebody who's not used to regexes.
How do you do this (properly) in VB.NET? In particular, I'm having a little bit of trouble putting together a wild card that'll match anything at all but a comma. It's also taking me a little bit to find out whether #1 has to be verified programmatically, or whether it can be done in the regex itself at the same time as the split operation.
EDIT
I just "woke up" and realized that this syntax is ambiguous, since in an odd-numbered series of three or more commas, you don't know what's escaped and what isn't. I'm just going to accept the current answer and move on.
Haven't used VB.NET in a long time ... but I wouldn't go the RegEx way.
What about splitting the string by "," ...
Dim parts As String() = s.Split(New Char() {","c})
You will get a list of items, now you only need to take care of the empty items (escaped commas) and join them with the correct preceding item.
PS: not sure if split gives you empty items in case of ",,"
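For comparison, here is a regex-based variant of the idea. Note that it is not the split-and-rejoin approach described above. It is sketched in Python rather than VB.NET, with hypothetical names; the lookaround syntax it uses is also accepted by .NET's Regex class. It assumes commas only ever appear singly or in pairs, and does not try to resolve the ambiguous runs of three or more commas mentioned in the question's edit.

```python
import re

# A separator is a comma with no comma directly before or after it.
SINGLE_COMMA = re.compile(r"(?<!,),(?!,)")

def parse_list(line):
    """Split on unescaped commas, then turn each escaped ',,' into a literal ','."""
    return [item.replace(",,", ",") for item in SINGLE_COMMA.split(line)]

print(parse_list("one,two,,three,four"))
```

Rule 1 (no leading or trailing unescaped comma) is not enforced by the pattern; a violation would show up as an empty first or last item.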

Regular Expression extract first three characters from a string

Using a regular expression how can I extract the first 3 characters from a string (Regardless of characters)? Also using a separate expression I want to extract the last 3 characters from a string, how would I do this? I can't find any examples on the web that work so thanks if you do know.
Thanks
Steven
Any programming language should have a better solution than using a regular expression (namely some kind of substring function or a slice function for strings). However, this can of course be done with regular expressions (in case you want to use it with a tool like a text editor). You can use anchors to indicate the beginning or end of the string.
^.{0,3}
.{0,3}$
This matches up to 3 characters of a string (as many as possible). I added the "0 to 3" semantics instead of "exactly 3", so that this would work on shorter strings, too.
Note that . generally matches any character except linebreaks. There is usually an s or singleline option that changes this behavior, but an alternative without option-setting is this, (which really matches any 3 characters):
^[\s\S]{0,3}
[\s\S]{0,3}$
But as I said, I strongly recommend against this approach if you want to use this in some code that provides other string manipulation functions. Plus, you should really dig into a tutorial.
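To make the comparison concrete, here is a small sketch in Python (chosen only for illustration; the question does not name a language) showing the two patterns next to the plain slicing that a programming language would normally offer. The function names are hypothetical.

```python
import re

def first3(s):
    """First up to three characters, via the anchored pattern."""
    return re.match(r"^[\s\S]{0,3}", s).group(0)

def last3(s):
    """Last up to three characters, via the anchored pattern.

    In Python, $ can also match just before a trailing newline;
    \\Z anchors strictly at the end of the string.
    """
    return re.search(r"[\s\S]{0,3}$", s).group(0)

word = "hello"
print(first3(word), last3(word))
print(word[:3], word[-3:])  # the slicing equivalent, no regex needed
```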

differentiating and testing regex variants

Several implementations of regular expressions differ from each other in subtle ways which is the source of much confusion when I try to use them.
Most of these differences include the semantics related to whether a character is escaped or not. This is most often an issue with parentheses, but can apply to curly brackets and others. This is probably a consequence of the syntax of the language or environment in which the implementation is found. For instance, if the $ symbol indicates a variable name in some language, one can expect regular expressions represented in that language would require escaping the "end of line" anchor to \$ or some such. But what gets confusing at this point is how you would represent an actual dollar sign. I believe Perl gets around this by wrapping a regex inside forward slashes /.
Similarly there are escapes for specific characters themselves, for instance non-printing characters such as \n and \t. Then there are the similar-looking generic character groups such as \d for digits, \s for whitespace, and \w, which I just learned covers underscores as well as digits. I found myself on several occasions trying to use \a for an "alphabetical" group, but this only ended up matching the bell character 0x07.
It's pretty clear that there is no simple and one-shot solution to knowing all of the differences in features and syntax offered by the myriad of implementations of regular expressions out there, short of somebody doing all the hard work and putting results in a well organized table. Here is one example of exactly this, but of course it doesn't cover several of the programs that I use extensively myself, which include vim, sed, Notepad++, Eclipse, and believe it or not MS Word (at least version 2010, I suspect 2007 also has this, they call it "wildcards") has a simple regex implementation too.
I guess what I want is to be as lazy as possible (in a certain sense) by trying to come up with a way to determine for any given regex implementation what its "escape settings" are beyond any doubt by applying one (or a few) queries.
I'm thinking I can make a file which contains test cases, along with a huge regex query, and somehow engineer it so that running it once will show me exactly what syntax I need to use subsequently without doubting myself any further. (as opposed to having to edit files and use multiple queries to figure out the same thing which gets terribly old after a while).
If nobody else has attempted to construct such a monstrosity, I may undertake this task myself. If it's even possible. Is this possible?
I tried to come up with an example (it was just to figure out if EOL anchor is $ or \$) but in every case I had to use a multitude of different search/replace queries in order to determine how the program will respond to the input.
Edit: I came up with something using capturing and backtracking. I gotta work on it a little more.
Update: Well, Notepad++ does not implement the OR operator commonly denoted by the pipe |. Word's "wildcards" is a poor substitute also, it doesn't have | or *. I'm fairly certain that missing any of the regular expression operators (union, concat, star) means it cannot generate a regular grammar, so those two are ruled out.
I can create an input file like this:
$
*
]
EOL
and query
(\$)|(\*)|(\[)|($)
replacing with
escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:
yields a result of (assuming unescaped parens is group and unescaped pipe is or)
escDollar:$:escStar::escSQBrL::Dollar::
escDollar::escStar:*:escSQBrL::Dollar::
]escDollar::escStar::escSQBrL::Dollar::
EOLescDollar::escStar::escSQBrL::Dollar::
I ran this in vim. This output would demonstrate the single characters that are matched by each item specified next to it, i.e. the escaped dollar sign item is seen to match the actual dollar sign character rather than the non escaped dollar sign item at the end.
It's difficult to see what's going on with the $ anchor since it matches zero characters, but it shouldn't be hard to find a solution for it. Besides it's not a commonly mistaken one. The ones I'm particularly worried about are pipe and parens and the different brackets. When you've got 4 different types in there there are 2^4 combinations of escaped and non-escaped versions of them you can use. Trial-and-error with that is horrific.
This output isn't too hard to parse at a glance, and is also seriously easy to process as part of a script. The one glaring problem that remains is figuring out whether parens and pipe need to be escaped. Because the functionality of the whole thing depends on them.
It would seem like that will require multiple queries. It may be possible with a cleverly engineered jumble of backslashes, parens, and pipes to figure out the combination (only 4 possibilities after all) with an initial query, then choose the subsequent matrix generator query based on it.
Something like this shows it can work:
(e)
(f)
querying
\((f\))|\|\((e\))
replace with
\1:\2
would produce:
:(e if escaped parens is group and escaped pipe is or
:e) if parens is group and escaped pipe is or
(f: if escaped parens is group and pipe is or
f): if parens is group and pipe is or
I still don't really like this though because it requires a second query on a second set of input. Too much setting up. I may just make 4 copies of the "matrix" thing.
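For an engine that can be driven from a script, the probing idea can be automated so that each candidate syntax is tried against a tiny input and the outcome is read off directly. The following is only a sketch against Python's own re module; probe and classify are hypothetical names, and for another tool the call to re.search would have to be replaced by however that tool is invoked.

```python
import re

def probe(pattern, text):
    """Return the text matched by pattern, or None if it fails to match or compile."""
    try:
        m = re.search(pattern, text)
    except re.error:
        return None
    return m.group(0) if m else None

def classify():
    """Report how this engine treats escaped and unescaped parens and pipe."""
    return {
        "unescaped parens group": probe(r"(e)", "(e)") == "e",
        "escaped parens are literal": probe(r"\(e\)", "(e)") == "(e)",
        "unescaped pipe is alternation": probe(r"x|f", "f") == "f",
        "escaped pipe is literal": probe(r"e\|f", "e|f") == "e|f",
    }

for name, result in classify().items():
    print(name, result)
```

Each probe is independent, so no second query on a second input is needed once the engine can be called programmatically.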
The table on this page summarizes quite nicely which features are available in which regex implementations:
http://www.regular-expressions.info/refflavors.html