Regex for names with capital letters in Xcode

I am trying to find all names that look like this: thisForExample and change them to this_for_example in Xcode, using a regex. Does anyone know how to do that?
I have tried this: ([a-z][A-Z])*[a-z]?
but it does not find anything.

You just need to find [a-z][A-Z][a-z].
Automating the replacement process will be tricky though - how do you plan on changing an arbitrary upper case letter to its lower case equivalent?

Perl would be a good tool for this (if you insist on using Regex), as it supports case modification in substitution patterns:
\l => change first char (of following match variable) to lower case
\L => change all chars (of following match variable) to lower case
\u => change single char (of following match variable) to upper case
\U => change all chars (of following match variable) to upper case
If all you care about is converting (simple/trivial!) variable and method names à la thisForExample into this_for_example, a single regex like this would be sufficient:
echo 'thisForExample = orThisForExample();' \
| perl -pe 's/(?<=[^A-Z])([A-Z]+)(?=[^A-Z])/_\L\1/g;'
//output: "this_for_example = or_this_for_example();"
As soon however as you're expecting to come across (quite common) names like…
fooURL = URLString + URL + someURLFunction();
…you're in trouble. Deep trouble.
Let's see what our expression does with it:
echo 'fooURL = URLString + URL + someURLFunction();' \
| perl -pe 's/(?<=[^A-Z])([A-Z]+)(?=[^A-Z])/_\L\1/g;'
//output: "foo_url = _urlstring + _url + some_urlfunction();"
Pretty bad, huh?
And to make things even worse:
It is syntactically impossible to distinguish between a (quite common) variable name "URLString" and a class name "NSString".
Conclusion: Regex alone is pretty hack-ish and error prone for this kind of task. And simply insufficient. And you don't want a single shell call to potentially mess up your entire code base, do you?
There is a reason why Xcode has a refactor tool that utilizes clang's grammar tree to differentiate between syntactically identical (pattern-wise at least) variable and class names.
This is a problem for context-free languages, not regular languages, hence regular expressions cannot deal with it. You'd need a context-free grammar to generate a language tree, etc. (and at that point you've just started building a compiler).
Also: why use under_scores anyway? If you're using Xcode then you're probably coding in ObjC(++) or similar, where it's common practice to use camelCase. I and probably pretty much everybody else would hate you for making us one day deal with your underscored ObjC/C/… code.
Alternative Answer:
In a comment reply to Paul R you said you were basically merging two projects, one with under_scored naming, one with camelCased naming.
I'd advise you then to switch your under_scored code base to camelCase. For two reasons:
Turning under_scored names into camelCase is way less error prone than vice versa. (That is: in a camelCase-dominated environment only, of course! It would be just as error prone if you mainly dealt with under_scored code in Xcode. Think of it as "there's simply less code to potentially break" ;) )
Quoting my own answer:
[…] I and probably pretty much everybody else would hate you for making us one day deal with your underscored ObjC/C/… code. […]
Here is a simple regex for converting under_score to camelCase:
echo 'this_for_example = _leading_under_score + or_this_for_example();' \
| perl -pe 's/(?<=[\w])_([\w])/\u\1/g;'
//output: "thisForExample = _leadingUnderScore + orThisForExample();"

Something like ([a-zA-Z][a-z]+)+?
The process could go like this:
You get all the names into a file, automatically forge the replacement for each, and (automatically) build a sed script to change one into the other.
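A rough sketch of that process in Python (the name list here is hypothetical; in practice you would collect the identifiers from your sources first):
import re

names = ["thisForExample", "orThisForExample"]

def to_snake(name):
    # Insert an underscore before each inner upper-case letter, then lower-case.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

# Longest names first, so a name that contains a shorter one is rewritten first.
for name in sorted(set(names), key=len, reverse=True):
    print(f"s/\\b{name}\\b/{to_snake(name)}/g")
Feed the output to sed -f (GNU sed understands \b word boundaries). As the answers above point out, names containing acronyms like URL will still trip this up.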

Related

Regular Expression for whole word

First of all, I use C# 4.0 to parse the code of a VB6 application.
I have some old VB6 code and about 500+ copies of it. And I use a regular expression to grab all kinds of global variables from the code. The code is described as "Yuck" and some poor victim still has to support this. So I'm hoping to help this poor sucker a bit by generating overviews of specific constants. (And yes, it should be rewritten but it ain't broke, so...)
This is a sample of a code line I need to match, in this case all boolean constants:
Public Const gDemo = False 'Is this a demo version
And this is the regular expression I use at this moment:
Public\s+Const\s+g(?'Name'[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?'Value'[0-9]*)
And I think it too is yucky, because of the * at the end of the boolean group. But if I don't use it, it will only return 'T' or 'F', and I want the whole word.
Is this the proper RegEx to use as solution or is there an even nicer-looking option?
FYI, I use similar regexes to find all string constants and all numeric constants. Those work just fine. And basically the same .BAS file is used for all 50 copies, but with different values for all these variables. By parsing all files, we have a good overview of how every version is configured.
And again, yes, we need to rebuild the whole project from scratch since it becomes harder to maintain these days. But it works and we need the manpower for other tasks. It just needs the occasional tweaks...
You can use: Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True)
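If you want a quick sanity check of the pattern outside C#, here is the same idea in Python (a sketch; Python spells named groups (?P<name>...) instead of (?'name'...)):
import re

const = re.compile(
    r"Public\s+Const\s+g(?P<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?P<Value>False|True)")

m = const.search("Public Const gDemo = False 'Is this a demo version")
print(m.group("Name"), m.group("Value"))  # Demo False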

Regular expression matching for removing certain uses of the period character

I have some Fortran 77 source files that I'm trying to convert from a non-standard STRUCTURE and RECORD syntax to the standardized Fortran 90 TYPE syntax. One tricky aspect of this is the different way that structure members are addressed.
Non-standard:
s.member = 1
Standard:
s%member = 1
So, I need to trap all uses of periods in these sorts of scenarios and replace them with % characters. Not too bad, except when you think about all of the ways that periods can be used (decimal points in numbers, filenames in include statements, punctuation in comments, Fortran 77 relational operators, maybe others). I've done some preprocessing to fix the relational operators to use the Fortran 90 symbols, and I don't really care about mangling the grammar of comments, but I haven't come up with a good approach to translate the . to % for the cases above. It seems like I should be able to do this with sed, but I'm not sure how to match the instances I need to fix. Here are the rules that I've thought of:
On a line-by-line basis:
If the line begins with <whitespace>include, then we shouldn't do anything to that line; pass it through to the output, so we don't mess up the filename inside the include statement.
The following strings are operators that don't have symbolic equivalents, so they must be left alone: .not. .and. .or. .eqv. .neqv.
Otherwise, if we find a period that is surrounded by 2 non-numeric characters (so it's not a decimal point), then it should be the operator that I'm looking to replace. Change that period to a %.
I'm not a native Fortran speaker myself, so here are some examples:
include 'file.inc' ! We don't want to do anything here. The line can
! begin with some amount of whitespace
if x == 1 .or. y > 2.0 ! In this case, we don't want to touch the periods that
! are part of the logical operator ".or.". We also don't
! want to touch the period that is the decimal point
! in "2.0".
if a.member < 4.0 .and. b.othermember == 1.0 ! We don't want to touch the periods
! inside the numbers, but we need to
! change the "a." and "b." to "a%"
! and "b%".
Any good way of tackling this problem?
Edit: I actually found some additional operators that contain a dot in them that don't have symbolic equivalents. I've updated the rule list above.
You can't do this with a regexp, and it's not that easy.
If I had to do what you have to, I would probably do it by hand, unless the codebase is huge. If you go the manual route, first replace everything matching [a-zA-Z0-9]\.[a-zA-Z] with something very weird that is guaranteed never to compile, something like "#WHATEVER#", then search for all these entries and replace them by hand after manual inspection.
If the amount of code is huge, then you need to write a parser. I would suggest using Python to tokenize basic Fortran constructs, but remember that Fortran is not an easy language to parse. Work "per routine", and try to find all the variable names used, using them as a filter. If you encounter something like a.whatever, and you know that a is in the list of local or global vars, apply the change.
Unless the codebase is really HUUGE (and do think very hard whether this is indeed the case), I'd just take an editor like Vim (vertical select & block select are your friends) and set aside an afternoon to do this by hand. In one afternoon, my guess is you'll be done with most of it, if not all. An afternoon is a lot of time. Just imagine how many cases you could've covered in two hours alone.
Just trying to write a parser for something like this will take you much longer than that.
Of course, the question begs itself ... if the code is F77, which all compilers still support, and the code works ... why are you so keen on changing it?
I'm not that versed in regexes, so I guess I'd try tackling this one from the other side. If you grep for the STRUCTURE keyword, you get the list of all the STRUCTUREs used in the code. Once you have it, for each STRUCTURE S you can just replace all instances of S. with S%.
This way you don't have to worry about things like .true., .and., .ne. and their relatives. The main worry then would be being able to parse the STRUCTURE declarations.
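A sketch of that idea in Python; the RECORD declaration form assumed here ("RECORD /TYPE/ VAR") is a guess and may not match your dialect exactly:
import re

src = open("legacy.f").read()  # hypothetical file name

# Collect the variable names declared as records.
record_vars = set(re.findall(r"(?im)^\s*RECORD\s*/\w+/\s*(\w+)", src))

# Rewrite "var." to "var%" only for those known names, so operators like
# .and. and decimal points in numbers are never touched.
for var in record_vars:
    src = re.sub(rf"\b{re.escape(var)}\.(?=\w)", f"{var}%", src)

print(src)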
Although the regex below (replace with $1%$2):
(?<!')\b([^.\s]+)(?<!\.(?:not|and|or|eqv|neqv))(?<=\D)\.(?=\D)(?!(?:not|and|or|eqv|neqv)\.)([^.\s]+)\b(?!')
works perfectly for your examples, I would not recommend using it for your current task. It will definitely not cover all your cases. Now, if you care for 80% coverage or something, you could use it, but you should probably back up your sources first. With the limited set of input cases I had, I am sure that there will be cases where the regex would replace something it shouldn't.
Good luck :)
This sed oneliner might be a start
sed -r '/^\s*include/b;/^\s*! /b;G;:a;s/^(\.(not|and|or|eqv|neqv)\.)(.*\n.*)/\3\1/;ta;s/^\.([^0-9]{2,})(.*\n.*)/\2%\1/;ta;s/^(.)(.*\n.*)/\2\1/;ta;s/\n//'
Based on your examples, I am guessing it would be enough to protect quoted strings, then replace periods that have alphabetics on both sides.
perl -pe '1 while s%(\x27[^\x27]+)\.([^\x27]+\x27)%$1##::##$2%;
s/([a-z])\.([a-z])/$1%$2/g;
s/##::##/./g' file.f
I offer this Perl solution not because sed is not a good enough tool for this, but because it avoids the issue of minor but pesky differences between sed dialects. The ability to use a hex code for the single quotes is a nice bonus.

How to find hard erroneous interface casts in Delphi (Win32)

I am trying to find some mysterious bugs in an application, and believe the cause may be some hard casts on interfaces. Such casts are unsafe in Delphi, for example
ISomeInterface(CurrentObj)
which should be
CurrentObj as ISomeInterface
In light of the lack of compiler warnings, which in my opinion should be emitted for hard casts, my question is: how can I easily find these casts in a codebase? Some sort of regex grep search, perhaps? The codebase is large and it would take forever to search for them manually.
You don't say which flavor of regular expressions you're using. I'm going to assume PCRE (Perl-compatible regular expressions), which means these examples won't work with the goofball "regular expressions" option in the IDE's Find dialog. However, they'll work with any self-respecting grep tool, as well as with the built-in regexes in Perl, Ruby, .NET, and many other languages.
You could start with something like this:
\w+\s*\(
which would search for one or more word characters, followed by zero or more spaces, followed by an open parenthesis. This would match:
TObject (Foo)
but depending on your regex library, which options you use, and how you pass the input into it, might or might not match if there's a line break before the open paren:
TObject
(Foo)
and definitely wouldn't work if there's a comment in between, like this pathological case:
X := ISomeInterface // come back and look at this cast, it's dangerous
(CurrentObj);
But for most well-formatted code, it will be good enough.
Now your problem is that it's giving you way more than just the typecasts -- it's also giving you just about every method call in your code. So some refinement is needed.
If your code follows the typical Delphi coding style, then this would probably work much better:
\b[TIE][A-Z]\w+\s*\(
and make sure you do a case-sensitive match. This will match anyplace where you have a word boundary, followed by a capital T (the traditional prefix for most classes and types) or capital I (the prefix for interfaces) or capital E (the prefix for Exception descendants), followed by another capital letter, then some number of upper- or lowercase letters or digits or underscores, followed by optional spaces and an open paren. There's a good chance this is what you need.
However, if you have any type names that don't follow the usual naming conventions (e.g. TfcTreeView that has a lowercase letter after the T), or if you ever rely on Delphi's case-insensitivity (e.g. if there's any chance you might ever have code like tobject(Foo) or even Tobject(Foo)), then it gets harder. If that's the case, post some details and I may be able to make suggestions.
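If you want to run the search over a whole codebase outside the IDE, here is a rough sketch in Python (the *.pas glob and the encoding handling are assumptions):
import glob
import re

cast = re.compile(r"\b[TIE][A-Z]\w+\s*\(")  # case-sensitive by default

for path in glob.glob("**/*.pas", recursive=True):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if cast.search(line):
                print(f"{path}:{lineno}: {line.strip()}")
Expect some false positives (any routine whose name happens to fit the T/I/E prefix pattern), but it narrows the haystack considerably.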
If you know the name of the interface you could use the following regular expression in the Find in Files dialog.
ITest\([^)]+\)
Where ITest is the name of your interface

Most efficient method to parse small, specific arguments

I have a command line application that needs to support arguments of the following form:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't recommend a full lex/yacc parser just for this. What you described can fit a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that supports captures, it's easy to extract the single pieces of information you need (most probably in captures 1, 3 and 4).
If I understood what you mean, you will probably want to check that capture 1 and capture 4 are not both non-empty at the same time.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
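A quick sketch of that in Python, with the group numbers from the answer (1 = count prefix, 3 = search term, 4 = the "#Y" suffix):
import re

arg_re = re.compile(
    r"^((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?$")

for arg in ["2*foo", "bar#8", "all*'foo bar'"]:
    m = arg_re.match(arg)
    count, term, nth = m.group(1), m.group(3), m.group(4)
    if count and nth:
        raise ValueError("X*search#Y is not in the grammar")
    print(arg, "->", count, term, nth)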
Even without regexes, I would hand-write a function. It would be simpler than dealing with lex/yacc, and I guess you could put together something that is even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how many libraries you want to depend on - if your application can depend on other libraries, you can use any of the many regular expression libraries, e.g. POSIX regex, which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
In this case the strtok approach would be much better, since the number of commands to be parsed is small.
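A sketch of that split-based approach (Python instead of C's strtok, but the shape is the same):
def parse(arg):
    count = nth = None
    if "*" in arg:
        count, arg = arg.split("*", 1)
    if "#" in arg:
        arg, nth = arg.split("#", 1)
    return count, arg.strip("'"), nth

print(parse("2*foo"))          # ('2', 'foo', None)
print(parse("bar#8"))          # (None, 'bar', '8')
print(parse("all*'foo bar'"))  # ('all', 'foo bar', None)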

Efficiently querying one string against multiple regexes

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches.
The trivial way would be to query the string against each regex, one by one. Is there a faster, more efficient way to do it?
EDIT:
I have tried substituting them with DFAs (lex).
The problem here is that a DFA will only give you a single pattern. If I have the string "hello" and the patterns "[H|h]ello" and ".{0,20}ello", the DFA will only match one of them, but I want both of them to hit.
This is the way lexers work.
The regular expressions are converted into a single nondeterministic automaton (NFA) and possibly transformed into a deterministic automaton (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.
You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course the traditional (f)lex is always an option.
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann has done quite a bit of work in this field.
He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.
The language used is Haskell, and thus it will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.
Also, the code is very much experimental and Martin is no longer a professor but is in 'gainful employment' (1), so he may be uninterested/unable to supply any help or input.
(1) This is a joke: I like professors; the less the smart ones try to work, the more chance I have of getting paid!
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where it could be useful: if you had 1000 tests that only return true if "ello" exists somewhere in the string, and your test string is "goodbye", you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't match, so you wouldn't have to run them.
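Here is a sketch of that filtering idea in Python (the buckets are made up for illustration): bucket each regex under a literal substring it requires, and rule out whole buckets with one cheap substring test.
import re

buckets = {
    "ello": [re.compile(r"[Hh]ello"), re.compile(r".{0,20}ello")],
    "bye":  [re.compile(r"goodbye[!?]*")],
}

def matching_patterns(s):
    hits = []
    for literal, regexes in buckets.items():
        if literal not in s:  # one cheap test skips every regex in the bucket
            continue
        hits.extend(r.pattern for r in regexes if r.search(s))
    return hits

print(matching_patterns("goodbye"))  # only the "bye" bucket is ever run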
If you're thinking in terms of "10,000 regexes", you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something's gone off the rails much earlier in the process than the choice of machine, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main caveat: the patterns to match were all literal language patterns, not regex patterns, e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick

A = ahocorasick.Automaton()

patterns = [
    [['cat', 'dog'], 'mammals'],
    [['bass', 'tuna', 'trout'], 'fish'],
    [['toad', 'crocodile'], 'amphibians'],
]

for row in patterns:
    vals = row[0]
    for val in vals:
        A.add_word(val, (row[1], val))

A.make_automaton()

_string = 'tom loves lions tigers cats and bass'

def test():
    vals = []
    for item in A.iter(_string):
        vals.append(item)
    return vals
Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, got me 2.09 ms vs 631 ms doing sequential re.search(): 315x faster!
You'd need to have some way of determining whether a given regex was "additive" compared to another one. Creating a regex "hierarchy" of sorts would allow you to determine that all regexes of a certain branch did not match.
You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
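A small demo of this trick in Python: each pattern sits in an optional capture inside a zero-width lookahead, so every group is tried at the same starting position.
import re

combined = re.compile(r"(?=([Hh]ello)?)(?=(.{0,20}ello)?)")

for s in ["hello", "jello"]:
    m = combined.match(s)
    print(s, "->", m.group(1), m.group(2))
# hello -> hello hello   (both patterns hit)
# jello -> None jello    (only the second pattern hit)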
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
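For example, in Python (a sketch): with one capture per alternative, lastindex reports which branch matched. Note that alternation is ordered, so this tells you that something matched and which branch claimed it, not every pattern that would have matched.
import re

union = re.compile(r"([Hh]ello)|(.{0,20}ello)|(goodbye)")

m = union.search("hello")
print(m.lastindex)  # 1 -> the first alternative matched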
I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use Ragel, although it produces a single DFA and can result in many states, and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
You could compile the regexes into a hybrid DFA/Büchi automaton where, each time the automaton enters an accept state, you flag which regex rule "hit".
Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Ragel's regular expressions are quite limited and the machine language is preferred instead; the braces from your example only work with the more general machine language.
Try combining them into one big regex?
I think that the short answer is that yes, there is a way to do this, that it is well known to computer science, and that I can't remember what it is.
The practical answer is that you might find your regex interpreter already deals with all of these efficiently when they are |'d together, or you might find one that does. If not, it's time to google string-matching and searching algorithms.
The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}
Oh, you meant the fastest code? I don't know then...