Regular expression matching for removing certain uses of the period character - regex

I have some Fortran 77 source files that I'm trying to convert from a non-standard STRUCTURE and RECORD syntax to the standardized Fortran 90 TYPE syntax. One tricky aspect of this is the different way that structure members are addressed.
Non-standard:
s.member = 1
Standard:
s%member = 1
So, I need to trap all uses of periods in these sorts of scenarios and replace them with % characters. Not too bad, except when you think about all of the ways that periods can be used (decimal points in numbers, filenames in include statements, punctuation in comments, Fortran 77 relational operators, maybe others). I've done some preprocessing to fix the relational operators to use the Fortran 90 symbols, and I don't really care about mangling the grammar of comments, but I haven't come up with a good approach to translate the . to % for the cases above. It seems like I should be able to do this with sed, but I'm not sure how to match the instances I need to fix. Here are the rules that I've thought of:
On a line-by-line basis:
If the line begins with <whitespace>include, then we shouldn't do anything to that line; pass it through to the output, so we don't mess up the filename inside the include statement.
The following strings are operators that don't have symbolic equivalents, so they must be left alone: .not. .and. .or. .eqv. .neqv.
Otherwise, if we find a period that is surrounded by 2 non-numeric characters (so it's not a decimal point), then it should be the operator that I'm looking to replace. Change that period to a %.
I'm not a native Fortran speaker myself, so here are some examples:
include 'file.inc' ! We don't want to do anything here. The line can
! begin with some amount of whitespace
if x == 1 .or. y > 2.0 ! In this case, we don't want to touch the periods that
! are part of the logical operator ".or.". We also don't
! want to touch the period that is the decimal point
! in "2.0".
if a.member < 4.0 .and. b.othermember == 1.0 ! We don't want to touch the periods
! inside the numbers, but we need to
! change the "a." and "b." to "a%"
! and "b%".
Any good way of tackling this problem?
Edit: I actually found some additional operators that contain a dot in them that don't have symbolic equivalents. I've updated the rule list above.

You can't do this with a regexp, and it's not that easy.
If I had to do what you have to, I would probably do it by hand, unless the codebase is huge. If it isn't, first replace every occurrence of [a-zA-Z0-9]\.[a-zA-Z] with something weird that is guaranteed never to compile, something like "#WHATEVER#", then search for all these entries and replace them by hand after manual inspection.
If the amount of code is huge, then you need to write a parser. I would suggest using Python to tokenize basic Fortran constructs, but remember that Fortran is not an easy language to parse. Work per routine, and try to find all the variable names used, using them as a filter. If you encounter something like a.whatever, and you know that a is in the list of local or global variables, apply the change.
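A minimal sketch of that filtering idea in Python (the variable names are hard-coded here purely for illustration; a real pass would collect them per routine first):

```python
import re

# Hypothetical set of structure variables collected for one routine.
known_vars = {'a', 'b', 's'}

def fix_line(line):
    # leave include lines untouched, per the question's first rule
    if re.match(r'\s*include\b', line, re.IGNORECASE):
        return line
    # var.member -> var%member, only for known structure variables;
    # requiring a letter after the dot skips decimal points and .and./.or.
    pat = re.compile(r'\b(' + '|'.join(sorted(known_vars)) + r')\.([A-Za-z_]\w*)')
    return pat.sub(r'\1%\2', line)
```

Because the substitution only fires on known variable names, decimal points and the dotted logical operators are never touched.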

Unless the codebase is really HUUGE (and do think very hard about whether this is indeed the case), I'd just take an editor like Vim (vertical select & block select are your friends) and set aside an afternoon to do this by hand. In one afternoon, my guess is you'll be done with most of it, if not all. An afternoon is a lot of time; just imagine how many cases you could cover in two hours alone.
Just trying to write a parser for something like this will take you much longer than that.
Of course, the question arises... if the code is F77, which all compilers still support, and the code works... why are you so keen on changing it?

I'm not that versed in regexps, so I guess I'd try tackling this one from the other side. If you grep for the STRUCTURE keyword, you get a list of all the STRUCTUREs used in the code. Once you have it, then for each STRUCTURE S you can just replace all instances of S. with S%.
This way you don't have to worry about things like .true., .and., .ne. and their relatives. The main worry then would be parsing the STRUCTURE declarations themselves.
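That approach could be sketched like this, under the assumption that the non-standard code declares its structure variables as RECORD /TYPENAME/ name (an assumed form, for illustration):

```python
import re

def convert(lines):
    # pass 1: collect variables declared with the non-standard
    # RECORD /type/ name syntax (assumed form, for illustration)
    record_vars = set()
    for ln in lines:
        m = re.match(r'\s*record\s*/\w+/\s*(\w+)', ln, re.IGNORECASE)
        if m:
            record_vars.add(m.group(1))
    if not record_vars:
        return lines
    # pass 2: replace var. with var% only for those variables,
    # and only when a letter follows the dot
    pat = re.compile(r'\b(' + '|'.join(sorted(record_vars)) + r')\.(?=[A-Za-z])')
    return [pat.sub(r'\1%', ln) for ln in lines]
```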

Although the regex below:
(?<!')\b([^.\s]+)(?<!\.(?:not|and|or|eqv|neqv))(?<=\D)\.(?=\D)(?!(?:not|and|or|eqv|neqv)\.)([^.\s]+)\b(?!')
with the replacement $1%$2
works perfectly for your examples, I would not recommend using it for your current task. It will definitely not cover all your cases. If you only care for 80% coverage or so, you could use it, but you should back up your sources first. Even with the limited set of input cases I had, I am sure there will be cases where the regex replaces something it shouldn't.
Good luck :)

This sed one-liner might be a start:
sed -r '/^\s*include/b;/^\s*! /b;G;:a;s/^(\.(not|and|or|eqv|neqv)\.)(.*\n.*)/\3\1/;ta;s/^\.([^0-9]{2,})(.*\n.*)/\2%\1/;ta;s/^(.)(.*\n.*)/\2\1/;ta;s/\n//'

Based on your examples, I am guessing it would be enough to protect quoted strings, then replace periods with alphabetics on both sides.
perl -pe '1 while s%(\x27[^\x27]+)\.([^\x27]+\x27)%$1##::##$2%;
s/([a-z])\.([a-z])/$1%$2/g;
s/##::##/./g' file.f
I offer this Perl solution not because sed is not a good enough tool for this, but because it avoids the issue of minor but pesky differences between sed dialects. The ability to use a hex code for the single quotes is a nice bonus.

Related

Regex that matches a list of comma separated items in any order

I have three "Clue texts" that say:
SomeClue=someText
AnotherClue=somethingElse
YetAnotherClue=moreText
I need to parse a string and see if it contains exactly these 3 texts, separated by a comma. No Clue Text contains any comma.
The problem is, they can be in any order and they must be the only clues in the string.
Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText
SomeClue=someText,YetAnotherClue=moreText,AnotherClue=somethingElse
AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse
Non-Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText,
SomeClue=someText,YetAnotherClue=moreText,,AnotherClue=somethingElse
,AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,UselessText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,AClueThatIDontWant=wrongwrongwrong
Putting together what I found in other posts, I have:
(?=.*SomeClue=someText($|,))(?=.*AnotherClue=somethingElse($|,))(?=.*YetAnotherClue=moreText($|,))
This works as far as Clues and their order are concerned.
Unfortunately, I can't find a way to avoid adding a comma and then some stupid text at the end.
My real case has somewhat more complicated Clue Texts, because each of them is a small regex, but I am pretty sure once I know how to handle commas, the rest will be easy.
I think you'd be better off with a stronger tool than regexes (and I genuinely love regular expressions). Regexes aren't good at tasks that require supplementary memory, which is what you have here: you need exactly these 3, but they can come in any order.
In principle, you could write a regex for each of the 6 permutations. But that would never scale. You ought to use something with parsing power.
I suggest writing a verification function in your favorite scripting language, made up of underlying string functions.
In basic Python, you could do (for instance)
ref = set(['SomeClue=someText', 'AnotherClue=somethingElse', 'YetAnotherClue=moreText'])
def ismatch(myline):
    splt = myline.split(',')
    return ref == set(splt)
You can tweak that as necessary, of course. Note that this nearly-complete solution is not really longer, and much more readable, than any regex would be.
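For reference, the same idea checked against the examples from the question, with an extra length guard so duplicated clues are also rejected (illustrative only):

```python
ref = {'SomeClue=someText', 'AnotherClue=somethingElse', 'YetAnotherClue=moreText'}

def ismatch(line):
    parts = line.split(',')
    # set equality rejects wrong clues, extras, and empty pieces;
    # the length check additionally rejects duplicated clues
    return len(parts) == len(ref) and set(parts) == ref
```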

Encode/decode certain text sequences in Qt

I have a QTextEdit where the user can insert arbitrary text. In this text, there may be some special sequences of characters which I wish to translate automatically. And from the translated version, I wish I could go back to the sequences.
Take for instance this:
QMessageBox::information(0, "Foo", MAGIC_TRANSLATE(myTextEdit->text()));
If the user wrote, inside myTextEdit's text, the sequence \n, I would like that MAGIC_TRANSLATE converted the string \n to an actual new line character.
In the same way, if I give a text with a new line inside it, a MAGIC_UNTRANSLATE will convert the newline with a \n string.
Now, of course I can implement these two functions myself, but what I am asking is whether there is something ready-made and easy to use in Qt which allows me to specify a dictionary and does the rest for me.
Note that sequences with common prefix can create some conflicts, for example converting:
\foo -> FOO
\foobar -> FOOBAR
can give rise to issues when translating the text asd \foobar lol, because if \foo is searched and replaced before \foobar, then the resulting text will be asd FOObar lol instead of the (more natural) asd FOOBAR lol.
I hope I have made my needs clear. I believe this may be a common task, so I hope there is a Qt solution which takes this kind of issue into account when sequences have conflicting prefixes.
I am sorry if this is a trivial topic (as I think it may be), but I am not familiar at all with encoding techniques and issues, and my knowledge of Qt encoding cover only very simple Unicode-related issues.
EDIT:
By the way, in my case a data-oriented approach, based on resources or external files or anything that does not require recompilation, would be great.
It sounds like your question is, "I want to run a sequence of regular expression or simple string replacements to map between two encodings of some text".
First you need to work out your mapping, exactly. As you say, if your escape sequences like \foo and \foobar are fiddly, you might find that you don't have a bidirectional, lossless mapping. No library in the world can help you if your design or encoding is flawed.
When you end up with a precise design (which we can't help you on given the complete lack of information provided on the purpose of this function), you'll probably find that a sequence of string replacements is fine. If it really is more complicated, then some QRegExps should be enough.
It is always a bit ugly to answer your own question, but... maybe this solution is useful to someone.
As suggested by Nicholas in his answer, a good strategy is to use replacement. It is simple and effective in most cases, for example in the plain C/C++ escaping:
\n \r \t etc
This works because they are all different. Replacement will always work if the sequences are all distinct and, in particular, if no sequence is a prefix of another.
For example, if your sequences are the ones above plus some Greek letters, you will run into trouble with the \nu sequence, which should be translated to ν: if the replacing function tests for \n before \nu, the result is wrong.
Assuming both sequences translate to completely different entities, there are two solutions: add a closing character to each sequence (for example \nu;), or simply replace longest sequences first. The latter ensures that a sequence which is a prefix of another is never replaced before the longer one.
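Longest-first replacement can be sketched as follows (Python for brevity; the same idea maps directly to QString replacements; the mapping itself is illustrative):

```python
import re

# illustrative mapping; note that '\\n' is a prefix of '\\nu'
mapping = {'\\n': '\n', '\\t': '\t', '\\nu': 'ν'}

def translate(text):
    # sort keys longest first, so '\\nu' wins over its prefix '\\n'
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Because the alternation tries the longest key first, the prefix conflict described above never arises.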
For various reasons, I tried another way: using a trie, which is a tree of all the prefixes of a dictionary of words. Long story short: it works fairly well and probably works faster than (most) regexes and replacements.
Regexes are state machines, and it is not rare for them to re-process the input; with a trie you never re-match a character twice, so it is pretty fast.
Code for tries is pretty easy to find on the internet, and the modifications to do efficient matching are trivial, so I will not write the code here.

Regex to add missing space after comma before argument in functions (to meet coding guidelines)

There is an open source PHP project I would like to contribute to by reducing some coding guideline violations. As there are about 5000 violations for a specific sniff, I guess it would be appropriate to use some regex.
The coding guideline rule is called "FunctionCallArgumentSpacingNoSpaceAfterComma"; it means that all arguments should be separated by a comma followed by a space.
These example snippets violate this rule:
$this->message('About', 'Warning, very important!', $this->securityRisk().$this->alterPasswordForm(),2);
$sFiles = t3lib_div::getFilesInDir(PATH_typo3conf,'sql',1,1);
if (!strstr(implode(',',$sFiles).',', '/database.sql,')) {
Can anybody help in creating a useful regex to fix these coding guideline violations? I tried for some hours but unfortunately I am not able to solve this on my own.
The first thing I would try would be something like this:
sed 's/,\([^ ]\)/, \1/g' someFile.php
Replace commas followed by non-spaces with a comma, a space, and the character that WAS after the comma.
After thinking about it for a split-second, that might kinda work.
But there must be multiple things I'm not thinking of...
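One thing the plain sed version misses is commas inside string literals. A sketch of the same idea that at least protects single- and double-quoted strings (still naive: it ignores escaped quotes, comments, and heredocs):

```python
import re

def fix_commas(line):
    literals = []
    def stash(m):
        literals.append(m.group(0))
        return '\x00%d\x00' % (len(literals) - 1)
    # hide string literals behind placeholders
    tmp = re.sub(r"'[^']*'|\"[^\"]*\"", stash, line)
    # the actual fix: comma followed by a non-space gets a space
    tmp = re.sub(r',(\S)', r', \1', tmp)
    # restore the literals
    return re.sub(r'\x00(\d+)\x00', lambda m: literals[int(m.group(1))], tmp)
```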
You can't safely/reliably do this with a regex. You can't even reliably skip past string literals in some languages (contemplate Perl's q syntax; especially how it works with what it considers to be "bracketing" characters). And even ignoring those, you get into trouble if you have multiple languages in the codebase (there being subtle differences between string/character literals in, say, PHP, Java, and Perl).

Is it feasible to write a regex that can validate simple math?

I’m using a commercial application that has an option to use RegEx to validate field formatting. Normally this works quite well. However, today I’m faced with validating the following strings: quoted alphanumeric codes with simple arithmetic operators (+-/*). Apparently the issue is sometimes users add additional spaces (e.g. " FLR01" instead of "FLR01") or have other typos such as mismatched parentheses that cause issues with downstream processing.
The first examples all had 5 codes being added:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
So I started going down the road of matching five alphanumeric characters wrapped in quotes:
"[0-9a-zA-Z]{5}"[+-*/]
However, the formulas quickly got harder and I don’t know how to get around the following complications:
I need to test for one of the four simple math operators (+-*/) between each code, but not after the last one.
There can be any number of codes being added together, not just five as in the example above.
Enclosed parentheses are okay: ("X"+"Y")/"2"
Mismatched parenthesis are not okay.
No formula (e.g. a blank) is okay.
Valid:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
"0XT"+"1SEAL"+"1XT"+"23LSL"+"23NBL"
("LS400"+"LT400")*"LC430"/("EL414"+"EL414R"+"LC407"+"LC407R"+"LC410"+"LC410R"+"LC420"+"LC420R")
Invalid:
" FLR01" +"FLR02"
"FLR01"J"FLR02"
("FLR01"+"FLR02"
Is this not something you can easily do with RegExp? Based on Jeff’s answer to 230517, I suspect I’m failing at least the ‘matched pairing’ issue. Even a partial solution to the problem (e.g. flagging extra spaces, invalid operators) would likely be better than nothing, even if I can't solve the parenthesis issue. Suggestions welcomed!
Thanks,
Stephen
As you are aware you can't check for matching parentheses with regular expressions. You need something more powerful since regexes have no way of remembering state and counting the nested parentheses.
This is a simple enough syntax that you could hand code a simple parser which counts the parentheses, incrementing and decrementing a counter as it goes. You'd simply have to make sure the counter never goes negative.
As for the rest, how about this?
("[0-9a-zA-Z]+"([+\-*/]"[0-9a-zA-Z]+")*)?
You could also use this regular expression to check the parentheses. It wouldn't verify that they're nested properly but it would verify that the open and close parentheses show up in the right places. Add in the counter described above and you'd have a proper validator.
(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?
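Combining the shape regex with the counter described above gives a small validator; a sketch in Python (the token pattern is the one given above):

```python
import re

# quoted codes joined by operators, with optional parentheses around codes
TOKENS = re.compile(r'(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?')

def is_valid(formula):
    # shape check: rejects stray spaces, bad operators, unquoted text
    if not TOKENS.fullmatch(formula):
        return False
    # balance check: counter must never go negative and must end at zero
    depth = 0
    for ch in formula:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```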
You can easily use regex's to match your tokens (numbers, operators, etc), but you cannot match balanced parenthesis. This isn't too big of a problem though, as you just need to create a state machine that operates on the tokens you match. If you're not familiar with these, think of it as a flow chart within your program where you keep track of where you are, and where you can go. You can also have a look at the Wikipedia page.

Efficiently querying one string against multiple regexes

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to query the string against each regex, one by one. Is there a faster, more efficient way to do it?
EDIT:
I have tried substituting it with DFA's (lex)
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.
This is the way lexers work.
The regular expressions are converted into a single nondeterministic automaton (NFA) and possibly transformed into a deterministic automaton (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here; they are called "lexer generators" and there are implementations for most languages.
You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course the traditional (f)lex is always an option.
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string it matches. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithms. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann has done quite a bit of work in this field.
He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.
The language used is Haskell, so it will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them; instead it is based on symbolic manipulation of the regexes themselves.
Also, the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1), so he may be uninterested or unable to supply any help or input.
(1) This is a joke - I like professors; the less the smart ones try to work, the more chance I have of getting paid!
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string and your test string is "goodbye;" you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this, you won't have to do them.
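A toy version of that prefiltering idea (here the required literal substrings are extracted by hand; tools like esmre automate this with Aho-Corasick):

```python
import re

# each regex paired with a plain substring that any match must contain
# (hand-extracted for this illustration)
rules = [
    (re.compile(r'\w*ello\b'), 'ello'),
    (re.compile(r'goodbye, \w+'), 'goodbye'),
]

def matches(s):
    # cheap substring test first; run the full regex only on survivors
    return [r.pattern for r, lit in rules if lit in s and r.search(s)]
```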
If you're thinking in terms of "10,000 regexes" you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something's gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main caveat: the patterns to match were all literal string patterns, not regex patterns, e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick

A = ahocorasick.Automaton()
patterns = [
    [['cat', 'dog'], 'mammals'],
    [['bass', 'tuna', 'trout'], 'fish'],
    [['toad', 'crocodile'], 'amphibians'],
]
for row in patterns:
    vals = row[0]
    for val in vals:
        A.add_word(val, (row[1], val))
A.make_automaton()

_string = 'tom loves lions tigers cats and bass'

def test():
    vals = []
    for item in A.iter(_string):
        vals.append(item)
    return vals
Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, got me 2.09 ms vs 631 ms for sequential re.search() - 315x faster!
You'd need to have some way of determining whether a given regex was "additive" compared to another one, creating a regex "hierarchy" of sorts that allows you to determine that all regexes of a certain branch did not match.
You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
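For example, in Python, with illustrative patterns (each lookahead wraps one optional capture group, per the scheme above):

```python
import re

regexes = ['[Hh]ello', '.{0,20}ello', 'goodbye']
# (?=(r)?) -- a lookahead containing an optional capture group,
# so every lookahead succeeds but only matching patterns capture
combined = re.compile(''.join('(?=(%s)?)' % r for r in regexes))

m = combined.match('hello world')
hits = [i for i, g in enumerate(m.groups()) if g is not None]
```

Unlike a plain alternation, this reports every pattern that matched at the position, not just the first one.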
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
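A quick sketch of that union in Python, using m.lastindex to see which alternative fired (note that it reports only one alternative per match position, which is exactly the limitation mentioned in the question's edit; the patterns are illustrative):

```python
import re

rs = ['[Hh]ello', 'goodbye', 'cat']
union = re.compile('|'.join('(%s)' % r for r in rs))

m = union.search('say Hello to the cat')
which = m.lastindex  # 1-based index of the alternative that matched first
# scanning the whole string still finds each non-overlapping hit
all_hits = [mm.lastindex for mm in union.finditer('say Hello to the cat')]
```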
I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match; it is built for this purpose. If the actions you need to take are more sophisticated, you can also use Ragel, although it produces a single DFA, which can result in many states and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
You could compile the regexes into a hybrid DFA/Büchi automaton where each time the automaton enters an accept state you flag which regex rule "hit".
Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Ragel's regular expressions are quite limited, and its machine language is preferred instead; the repetition braces from your example only work in the more general language.
Try combining them into one big regex?
I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.
The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}
Oh, you meant the fastest code? I don't know then...