Simple regex question (vim search/replace) - regex

How do I specify that there are several options for a string in a search?
For example, I want to find any combination that starts with either jspPar, btn or jspAtt and ends with the letter K.
Also, I need to replace it with a string that depends on the original prefix.
For example, if the prefix was jspPar I need to replace it with the letter P (and, let's say, B and A for btn and jspAtt accordingly).

Is
\(jspPar\|btn\|jspAtt\)[^ \t]*K
what you are looking for?
The \(jspPar\|btn\|jspAtt\) says “at this point, match any of these alternatives”, then [^ \t]* says “at this point, match any number (incl. zero) of characters that are neither a space nor a tab”, and K of course means “at this point, match a K”.
For your added question, you could do something like this:
%s/\(jspPar\|btn\|jspAtt\)[^ \t]*\zsK/\=submatch(1) == 'jspPar' ? 'P' : submatch(1) == 'btn' ? 'B' : 'A' /g
(The \zs says “consider the match to have started at this point” so only the “K” will be replaced.)
But I would only do that if I had to do the substitution in a single pass. Otherwise I’d just run three s///s:
%s/jspAtt[^ \t]*\zsK/A/g
%s/jspPar[^ \t]*\zsK/P/g
%s/btn[^ \t]*\zsK/B/g
Given command history, that’s much less typing, and it is also very unlikely to require debugging, whereas debugging is always a possibility once you write an expression that computes the replacement.
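For readers outside Vim, here is a minimal sketch of the same single-pass idea in Python. Python's re has no \zs, so this version captures the prefix and the middle part and writes them back unchanged. The names REPLACEMENTS, PATTERN and replace_trailing_k are made up for the illustration.

```python
import re

# Prefix -> replacement letter, as described in the question.
REPLACEMENTS = {"jspPar": "P", "btn": "B", "jspAtt": "A"}

# Group 1 is the prefix, group 2 the run of non-space/non-tab characters
# before the final K (the part Vim skips over with \zs).
PATTERN = re.compile(r"(jspPar|btn|jspAtt)([^ \t]*)K")


def replace_trailing_k(text: str) -> str:
    """Replace the closing K with the letter that belongs to the prefix."""
    def swap(match: re.Match) -> str:
        prefix, middle = match.group(1), match.group(2)
        return prefix + middle + REPLACEMENTS[prefix]

    return PATTERN.sub(swap, text)
```

A dictionary lookup takes the place of the nested ?: chain, so adding a fourth prefix means one more alternative in the pattern and one more dictionary entry.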

Related

How to remove/replace specials characters from a 'dynamic' regex/string on ruby?

I have had this code working for a few months. Say I have a table called Categories with a string column called name. I receive a string and want to know whether any category was mentioned (a mention occurs when the string contains the substring #name_of_a_category). My approach was something like this:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today, when it suddenly started raising an "unmatched close parenthesis" exception. I realized that category names can contain special characters, so I decided to ignore special characters and spaces when matching (I don't want to add restrictions for the user, and I don't want to handle those cases either, so the policy is just to ignore them).
So the question is: is there a clean way of removing these special characters (keeping the #) and matching the string? I don't want to modify the data, just ignore those characters while looking for mentions.
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p categories.select { |c|
prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip()}\b/i)
}
See the Ruby demo
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') line creates a copy of content_received with all special characters and _ removed. Doing this once, up front, reduces overhead if there are a lot of categories.
Then you iterate over the categories list and each time check whether prep_content_received matches \b (word boundary) + the category with all special characters, _ and leading/trailing whitespace stripped from it + \b, in a case-insensitive way (see the /i flag; no need to call .downcase).
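The two steps above can be sketched in Python roughly as follows. This is a translation for illustration, not the Ruby answer itself; mentioned_categories and SPECIAL are hypothetical names. As in the Ruby version, the # is stripped along with the other special characters, so the match is on the bare category name.

```python
import re

# Anything that is not a word character or whitespace, plus the underscore.
SPECIAL = re.compile(r"[^\w\s]|_")


def mentioned_categories(content: str, categories: list[str]) -> list[str]:
    """Return the categories whose cleaned name appears as a whole word."""
    cleaned = SPECIAL.sub("", content)  # strip once, reuse for every category
    found = []
    for name in categories:
        key = SPECIAL.sub("", name).strip()
        # re.escape is a safety net; after cleaning, key holds only
        # word characters and whitespace.
        if key and re.search(r"\b" + re.escape(key) + r"\b", cleaned, re.IGNORECASE):
            found.append(name)
    return found
```

Skipping empty keys avoids a category made only of special characters matching every input.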
After looking around I found some answers on the platform, but nothing that met my specific requirements (maybe I missed something; if so, please let me know). This is how I fixed it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i| temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#').match?(/##{category_i.downcase.gsub(/[^\sa-zA-Z0-9]/, '')}/) }
For the sake of the example, I reduced the categories to a simple array of strings. The first gsub removes every character that is not a letter, a digit or whitespace, except for #, which the hash argument maps back to itself. The second gsub does the same for the category name, without keeping the #.
You can test the snippet above here

Regular expression for non-consecutive characters

I'm trying to create a regular expression that validates the following requirements:
Simultaneous use of Cyrillic and numbers is possible (without spaces and special characters)
Simultaneous use of Latin and numbers is possible (without spaces and special characters)
Simultaneous use of Cyrillic and Latin characters is not possible
The first letter must be capitalized, cannot be a number
Sequence length: from 2 to 16 characters inclusive
It is impossible to use 3 or more identical symbols in a row
I am using the following solution:
(?:([A-Z][A-Za-z0-9]{1,15}|[А-Я][А-ЯЁа-яё0-9]{1,15}))$
How do I change the regex to match the last requirement?
I use Google Sheets, in which it is impossible to use negative lookahead.
Sorry for my English.
I don't think you can do this with a single regex without lookarounds.
But there are workarounds for the "don't repeat the same character 3 times" requirement.
The workarounds would be simpler if RE2 supported backreferences, but it does not, so the resulting rule is longer.
You may define a column ValidNoThreeRepeats like this:
=
NOT(
OR(
AND(MID(A1;1 ;1)=MID(A1;2 ;1);MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)<>"");
AND(MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)<>"");
AND(MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)<>"");
AND(MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)<>"");
AND(MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)<>"");
AND(MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)<>"");
AND(MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)<>"");
AND(MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)<>"");
AND(MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)<>"");
AND(MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)<>"");
AND(MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)<>"");
AND(MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)<>"");
AND(MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)=MID(A1;15;1);MID(A1;15;1)<>"");
AND(MID(A1;14;1)=MID(A1;15;1);MID(A1;15;1)=MID(A1;16;1);MID(A1;16;1)<>"")
)
)
Or in a compacted way like this:
=NOT(OR(AND(MID(A1;1 ;1)=MID(A1;2 ;1);MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)<>"");AND(MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)<>"");AND(MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)<>"");AND(MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)<>"");AND(MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)<>"");AND(MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)<>"");AND(MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)<>"");AND(MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)<>"");AND(MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)<>"");AND(MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)<>"");AND(MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)<>"");AND(MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)<>"");AND(MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)=MID(A1;15;1);MID(A1;15;1)<>"");AND(MID(A1;14;1)=MID(A1;15;1);MID(A1;15;1)=MID(A1;16;1);MID(A1;16;1)<>"")))
The idea is to have a rule that compares the 1st, 2nd and 3rd characters, then another rule that compares the 2nd, 3rd and 4th, then another for the 3rd, 4th and 5th, and so on up to the 14th, 15th and 16th (the maximum length is 16). The extra <>"" test in each rule is needed because MID returns an empty string past the end of the text, and two empty strings would otherwise compare as equal and flag every short value. You join these rules with an OR, since if any of them matches, a repetition exists somewhere. Finally, you negate the whole expression with a NOT.
Then you can check that both your regex and that column are valid.
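Outside of Sheets, the same sliding-window comparison is much shorter. Below is a minimal Python sketch of the idea, combined with the regex from the question for the other requirements. The function names are made up, and unlike the spreadsheet = comparison this check is case-sensitive.

```python
import re

# The regex from the question, applied to the whole string via fullmatch().
NAME_RE = re.compile(r"(?:[A-Z][A-Za-z0-9]{1,15}|[А-Я][А-ЯЁа-яё0-9]{1,15})")


def has_triple_repeat(s: str) -> bool:
    """True if any character occurs three or more times in a row."""
    # Compare each window of three neighbouring characters.
    return any(s[i] == s[i + 1] == s[i + 2] for i in range(len(s) - 2))


def is_valid(s: str) -> bool:
    """Both checks must pass: the regex and the no-triple-repeat rule."""
    return NAME_RE.fullmatch(s) is not None and not has_triple_repeat(s)
```

Strings shorter than three characters produce an empty range, so they never count as a repeat.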
I don't know which scripting language you're using.
If it's PHP, I'd use `filter_var($param1, FILTER_VALIDATE..., FILTER_FLAG..)` if I were in your shoes.
It covers both **validating** and **sanitizing** your input.
**PEACE**.

Possible to limit to scope/range of a lookahead

We can check to see if a digit is in a password, for example, by doing something like:
(?=.*\d)
Or if there's a digit and lowercase with:
(?=.*\d)(?=.*[a-z])
This will basically go on "until the end" to check whether there's a letter in the string.
However, I was wondering if it's possible in some sort of generic way to limit the scope of a lookahead. Here's a basic example which I'm hoping will demonstrate the point:
start_of_string;
middle_of_string;
end_of_string;
I want to use a single regular expression to match against start_of_string + middle_of_string + end_of_string.
Is it possible to use a lookahead/lookbehind in the middle_of_string section WITHOUT KNOWING WHAT COMES BEFORE OR AFTER IT? That is, not knowing the size or contents of the preceding/succeeding string component. And limit the scope of the lookahead to only what is contained in that portion of the string?
Let's take one example:
start_of_string = 'start'
middle_of_string = '123'
end_of_string = 'ABC'
Would it be possible to check the contents of each part but limit its scope like this?
string = 'start123ABC'
# Check to make sure the first part has a letter, the second part has a number and the third part has a capital
((?=.*[a-z]).*) # limit scope to the first part only!!
((?=.*[0-9]).*) # limit scope to only the second part.
((?=.*[A-Z]).*) # limit scope to only the last part.
In other words, can lookaheads/lookbehinds be "chained" with other components of a regex without it screwing up the entire regex?
UPDATE:
Here would be an example, hopefully this is more helpful to the question:
START_OF_STRING = 'abc'
Does 'x' exist in it? (?=.*x) ==> False
END_OF_STRING = 'cdxoy'
Does 'y' exist in it? (?=.*y) ==> True
FULL_STRING = START_OF_STRING + END_OF_STRING
'abcdxoy'
Is it possible to chain the two regexes together in any sort of way so that each only works on its 'substring' component?
For example, now (?=.*x) in the first part of the string would return True, but it should not.
`((?=.*x)(?=.*y)).*`
I think the short answer to this is "No, it's not possible.", but am looking to hear from someone who understands this to tell why it is or isn't.
In .NET and JavaScript you could use a positive lookahead at the start of your string component and a positive lookbehind at the end of it to "constrain" the match. Example:
.*(?=.*arrow)(?<middle>.*)(?<=.*arrow).*
helloarrowxyz
{'middle': 'arrow'}
In PCRE, Python and other flavors, you would need a fixed-width lookahead to constrain it from going too far forward, as Wiktor Stribiżew says above:
.*(?=.{0,5}arrow)(?<middle>.{0,5}).*
Otherwise, it wouldn't be possible to do without either a fixed-width lookahead or a variable width look-behind.
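Here is a small Python sketch of the bounded-lookahead variant, under the assumption that the component of interest is at most 5 characters long. Python's re spells named groups (?P<name>...), so the pattern differs slightly from the one above. middle_of and supports_variable_lookbehind are illustrative names.

```python
import re

# Bounded lookahead: "arrow" must start within the next 5 characters,
# so the check cannot wander arbitrarily far into the rest of the string.
BOUNDED = re.compile(r".*(?=.{0,5}arrow)(?P<middle>.{0,5}).*")


def middle_of(text: str):
    """Return the captured middle component, or None if nothing matches."""
    match = BOUNDED.fullmatch(text)
    return match.group("middle") if match else None


def supports_variable_lookbehind() -> bool:
    """Check whether this regex engine accepts a variable-width lookbehind."""
    try:
        re.compile(r"(?<=.*arrow)")
    except re.error:
        return False
    return True
```

With the standard library's re module the second function returns False, which is why the .NET/JavaScript pattern cannot be used there as written.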

Grep for Pattern in File in R

In a document, I'm trying to look for occurrences of a 12-character string which contains letters and numerals. A sample string is: "PXB111X2206"
I'm trying to get the line numbers that contain this string in R using the below:
FileInput = readLines("File.txt")
prot_pattern="([A-Z0-9]{12})";
prot_string<-grep(prot_pattern,FileInput)
prot_string
This worked fine until it hit a document containing all upper-case titles and returned a line containing the word "CONCENTRATIO"
The string I am trying to look for is: "PXB111X2206". I am expecting the grep to return the line numbers containing the string : "PXB111X2206". It however is returning the line number containing the word: "CONCENTRATIO"
What is wrong with my expression above? Any idea what I am doing wrong here?
Here is some sample input:
Each design objective described herein is significantly important, yet it is just one aspect of what it takes to achieve a successful project.
A successful project is one where project goals are identified early on and where the >interdependencies of all building systems are coordinated concurrently from the planning and programming phase.
CONCENTRATION:
The areas of concentration for design objectives: accessible, aesthetics, cost effective, >functional/operational, historic preservation, productive, secure/safe, and sustainable and >their interrelationships must be understood, evaluated, and appropriately applied.
Each of these design objectives is presented in the design objectives document number. >PXB111X2206.
>
Thanks & Regards,
Simak
You are using a very powerful tool for a very simple task, the expression
[A-Z0-9]{12}
will match any alphanumeric 12 sized uppercased string, for example the word "CONCENTRATIO", however, your "PXB111X2206" is not even 12 symbols long, so it is not possible that is being matched. If you only want to match "PXB111X2206" you only have to use it as a regular expression itself, for example, if you file contents are:
foo
CONCENTRATIO.
bazz
foo bar bazz PXB111X2206 foo bar bazz
foo
bar
bazz
and you use:
grep('PXB111X2206',readLines("File.txt"))
then R will only match line 4 as you would wish.
EDIT
If you are looking for that specific pattern try:
grep('[A-Z]{3}[0-9]{3}[A-Z]{1}[0-9]{4}',readLines("File.txt"))
That expression will match strings like 'AAADDDADDDD', where A is a capital letter and D a digit. The regular expression consists of character classes (the symbols inside square brackets), each followed by a quantifier (the number inside the curly braces) that tells how many of the preceding class the expression accepts; if no quantifier is present, it is assumed to be 1.
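For comparison, here is a rough Python sketch of the same line-number lookup. The lookarounds on both sides are an addition, not part of the R answer: they keep the pattern from matching inside a longer uppercase/digit run. CODE_RE and matching_line_numbers are hypothetical names.

```python
import re

# Three letters, three digits, one letter, four digits, e.g. "PXB111X2206".
# The lookbehind/lookahead (an added assumption) reject matches that are
# embedded in a longer run of uppercase letters or digits.
CODE_RE = re.compile(r"(?<![A-Z0-9])[A-Z]{3}[0-9]{3}[A-Z][0-9]{4}(?![A-Z0-9])")


def matching_line_numbers(lines):
    """Return 1-based line numbers, like R's grep() does."""
    return [i for i, line in enumerate(lines, start=1) if CODE_RE.search(line)]
```

Run on the seven sample lines above, only the fourth line is reported; "CONCENTRATIO." no longer matches because the pattern requires digits in fixed positions.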
Let's take a look at what your regular expression means. [A-Z0-9] means any capital letter or digit, and {12} means the previous expression must occur exactly 12 times. The string CONCENTRATIO is 12 capital letters, so it's no surprise that grep picks it up. If you want to take out the matches that consist of just letters or just numbers, you could try something like
allletters <- grep("[A-Z]{12}",strings)
allnumbers <-grep("[0-9]{12}",strings)
both <- grep("[A-Z0-9]{12}",strings)
the matches you wanted would then be something like
both <- both[!both %in% union(allletters,allnumbers)]
Someone with better regexfu might have a more elegant solution, but this will work too.

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
It's odd, because if my pattern-to-match-but-not-into-output ("failures = ") came after the thing I want to extract ([0-9]+), there would be a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead. What you are looking for is a regex positive lookbehind, which goes like this: (?<=expression)pattern. This feature is not supported by all regex flavors, though, so it depends on which language you are using.
More information is available at regular-expressions.info; for a comparison of lookaround support, scroll about two thirds of the way down that page.
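As a quick illustration in one flavor that does support it, here is a minimal Python sketch of the lookbehind form. Python requires the lookbehind to be fixed-width, which the literal "failures = " is. failure_count is a made-up helper name.

```python
import re

# (?<=...) asserts that "failures = " precedes the digits without
# including it in the match.
FAILURES_RE = re.compile(r"(?<=failures = )[0-9]+")


def failure_count(output: str):
    """Return the number after 'failures = ', or None if it is absent."""
    match = FAILURES_RE.search(output)
    return int(match.group()) if match else None
```

In flavors without lookbehind, a capture group such as failures = ([0-9]+) gives the same number without any manual trimming.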
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"