Regex finding all commas between two words - regex

I am trying to clean up a large .csv file that contains many comma-separated words, parts of which I need to consolidate. In one subsection I want to change all the commas to slashes. Let's say my file contains this text:
Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool
I want to select all commas between the unique words bar and blah. The idea is to then replace the commas with slashes (using find and replace), such that I get this result:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
As per @EganWolf's input:
How do I include words in the search but exclude them from the selection (for the unique words) and how do I then match only the commas between the words?
Thus far I have only managed to select all the text between the unique words including them:
bar,.*,blah, bar:*, *,blah, (bar:.+?,blah)*,*\2
I experimented with negative lookahead but can't get any search results from my statements.

Using Notepad++, you can do:
Ctrl+H
Find what: (?:\bbar,|\G(?!^))\K([^,]*),(?=.+\bblah\b)
Replace with: $1/
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # start non capture group
\bbar, # word boundary then bar then a comma
| # OR
\G # restart from last match position
(?!^) # negative lookahead, make sure we are not at the beginning of a line
) # end group
\K # forget all we've seen until this position
([^,]*) # group 1, 0 or more non comma
, # a comma
(?= # positive lookahead
.+ # 1 or more any character but newline
\bblah\b # word boundary, blah, word boundary
) # end lookahead
Result for given example:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool

The following regex will capture the minimally required text to access the commas you want:
(?<=bar,)(.*?(,))*(?=.*?,blah)
If you want to replace the commas, you will need to replace everything in capture group 2. Capture group 0 has your entire match.
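For readers doing this in a script rather than an editor, here is a minimal Python sketch of the same idea. It assumes the marker words bar and blah each occur once per line; the function name slash_between and its parameters are hypothetical. Instead of replacing capture group 2 directly, it passes the matched middle section to a replacement callback.

```python
import re

def slash_between(text, start="bar", end="blah"):
    """Replace commas with slashes between the words `start` and `end`."""
    # Group 1: the start word and its comma, group 2: the middle section,
    # group 3: the comma and the end word. Only group 2 is modified.
    pattern = re.compile(
        r"(\b" + re.escape(start) + r",)(.*?)(," + re.escape(end) + r"\b)"
    )
    return pattern.sub(
        lambda m: m.group(1) + m.group(2).replace(",", "/") + m.group(3),
        text,
        count=1,
    )

line = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
print(slash_between(line))
```

Using a callback keeps the pattern simple: the regex only has to locate the section, and plain string replacement handles the commas.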
An alternative approach would be to split your string by comma to create an array of words. Then join words between bar and blah using / and append the other words joined by ,.
Here is a PowerShell example of split and join:
$a = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
$split = $a -split ","
$slashBegin = $split.indexof("bar")+1
$commaEnd = $split.indexof("blah")-1
$str1 = $split[0..($slashbegin-1)] -join ","
$str2 = $split[($slashbegin)..$commaend] -join "/"
$str3 = $split[($commaend+1)..($split.count-1)] -join ","
@($str1,$str2,$str3) -join ","
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
This could easily be made into a function with your entire line and keywords as inputs.
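As a rough illustration of that function, here is a Python sketch of the same split-and-join approach. The name join_between and its default keywords are assumptions for the example, and it assumes both keywords appear as whole fields in the line.

```python
def join_between(line, start="bar", end="blah"):
    """Join the fields strictly between `start` and `end` with slashes."""
    words = line.split(",")
    i = words.index(start)   # raises ValueError if the keyword is missing
    j = words.index(end)
    if j - i <= 1:
        return line          # nothing between the keywords to join
    middle = "/".join(words[i + 1:j])
    # Fields up to and including `start`, the joined middle, then `end` onward.
    return ",".join(words[:i + 1] + [middle] + words[j:])

line = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
print(join_between(line))
```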

Related

Notepad++ and regex - how to title case string between two particular strings?

I have hundreds of bib references in a file, and they have the following syntax:
@article{tabata1999precise,
title={Precise synthesis of monosubstituted polyacetylenes using Rh complex catalysts.
Control of solid structure and $\pi$-conjugation length},
author={Tabata, Masayoshi and Sone, Takeyuchi and Sadahiro, Yoshikazu},
journal={Macromolecular chemistry and physics},
volume={200},
number={2},
pages={265--282},
year={1999},
publisher={Wiley Online Library}
}
I would like to title case (aka Proper Case) the journal name in Notepad++ using regular expression. For example, from Macromolecular chemistry and physics to Macromolecular Chemistry and Physics.
I am able to find all instances using:
(?<=journal\=\{).*?(?=\})
but I am unable to change the case via Edit > Convert Case to. Apparently it doesn't work on find all and I have to go one by one.
Next, I tried recording and running a macro but Notepad++ just hangs indefinitely when I try to run it (option to run until the end of the file).
So my question is: does anyone know the replace regex syntax I could use to change the case? Ideally, I would also like to use "|" exclusions for particular words such as " of ", " an ", " the ", etc. I tried to play with some of the examples provided here, but I was not able to integrate it into my look-aheads.
Thank you in advance, I'd appreciate any help.
This works for any number of words:
Ctrl+H
Find what: (?:journal={|\G)\K(?:(\w{4,})|(\w+))(\h*)
Replace with: \u$1\E$2$3
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
(?: # non capture group
journal={ # literally
| # OR
\G # restart from last match position
) # end group
\K # forget all we have seen until this position
(?: # non capture group
(\w{4,}) # group 1, a word with 4 or more characters
| # OR
(\w+) # group 2, a word of any length
) # end group
(\h*) # group 3, 0 or more horizontal spaces
Replacement:
\u # uppercase the first letter of what follows
$1 # content of group 1
\E # end the case conversion
$2 # content of group 2
$3 # content of group 3
If the format is always of the form:
journal={Macromolecular chemistry and physics},
i.e. journal followed by four words, then use the following:
Find: journal={(\w+)\s*(\w+)\s*(\w+)\s*(\w+)
Replace with: journal={\u\1 \u\2 \l\3 \u\4
You can modify that if you have more words to replace by adding more \u\x, where x is the position of the word.
Hope it helps to give you an idea to move forward for a better solution.
\u translates the next letter to uppercase (used for all other words)
\l translates the next letter to lowercase (used for the word "and")
\1 replaces the 1st captured () search group
\2 replaces the 2nd captured () search group
\3 replaces the 3rd captured () search group
\4 replaces the 4th captured () search group
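If the file can be processed outside Notepad++, the exclusion list asked for in the question is easier to express in a script. The following Python sketch is only an illustration: the SMALL_WORDS set and the function names are hypothetical, and it assumes each journal={...} value sits on one line without nested braces.

```python
import re

# Hypothetical list of words to keep in lower case (except as the first word).
SMALL_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for"}

def title_case(text):
    """Capitalize each word except the small words after the first position."""
    result = []
    for i, word in enumerate(text.split(" ")):
        if i > 0 and word.lower() in SMALL_WORDS:
            result.append(word.lower())
        else:
            result.append(word[:1].upper() + word[1:])
    return " ".join(result)

def fix_journal_fields(bibtex):
    """Title-case the value of every journal={...} field."""
    pattern = re.compile(r"(journal=\{)([^}]*)(\})")
    return pattern.sub(
        lambda m: m.group(1) + title_case(m.group(2)) + m.group(3), bibtex
    )

entry = "journal={Macromolecular chemistry and physics},"
print(fix_journal_fields(entry))
```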

Perl Regular expression | how to exclude words from a file

I am looking for Perl regular expression syntax to cover some requirements I have in a project.
First, I want to reject strings that contain a word from a text file (a dictionary).
For example, if my file contains these strings:
path.../Document.txt |
tree
car
ship
Using a regular expression, I expect:
a1testtre -- match
orangesh1 -- match
apleship3 -- no match [contains a word from the file]
I also have one more requirement that I couldn't solve. I have to create a regex that does not allow a string to have a repeated char (two identical chars in a row) over 3 times.
For example:
adminnisstrator21 -- match (has 2 repetitions of chars)
kkeeykloakk -- no match (3 repetitions: kk, ee, kk)
stack22ooverflow -- match (has 2 repetitions of chars)
For this I have tried
\b(?:([a-z])(?!\1))+\b
but it works only for the first repeated char.
Any idea how to solve these two?
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
^(?!.*(?:tree|car|ship)).*$
^ Assert start of string
(?! negative lookahead, assert what is on the right is not
.*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
) Close negative lookahead
.* Match 0+ times any char except a newline
$ Assert end of string
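As a quick check of this pattern, here is a small Python sketch that builds the alternation from a word list. The helper name build_exclusion_pattern is hypothetical, and the word list is hard-coded here rather than read from a file.

```python
import re

def build_exclusion_pattern(words):
    """Build ^(?!.*(?:w1|w2|...)).*$ from a list of literal words."""
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"^(?!.*(?:" + alternation + r")).*$")

pattern = build_exclusion_pattern(["tree", "car", "ship"])

for candidate in ["a1testtre", "orangesh1", "apleship3"]:
    # A match means the string contains none of the excluded words.
    print(candidate, bool(pattern.match(candidate)))
```

re.escape plays the same role as escaping by hand: it keeps any regex metacharacters in the dictionary words from being interpreted as pattern syntax.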
To not allow a string to have over 3 times a char repeat you could use:
\b(?!(?:\w*(\w)\1){3})\w+\b
\b Word boundary
(?! Negative lookahead, assert what is on the right is not
(?: Non-capturing group
\w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
){3} Close non capturing group and repeat 3 times
) close negative lookahead
\w+ Match 1+ word characters
\b word boundary
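The pattern can be tried against the sample strings from the question with a short Python sketch. The names PAIR_LIMIT and has_fewer_than_three_pairs are hypothetical, and fullmatch is used here on the assumption that each candidate is a single word.

```python
import re

# Same pattern as above: reject a word as soon as three doubled
# characters (e.g. "kk", "ee", "kk") can be found in it.
PAIR_LIMIT = re.compile(r"\b(?!(?:\w*(\w)\1){3})\w+\b")

def has_fewer_than_three_pairs(word):
    return PAIR_LIMIT.fullmatch(word) is not None

for word in ["adminnisstrator21", "kkeeykloakk", "stack22ooverflow"]:
    print(word, has_fewer_than_three_pairs(word))
```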
Update
According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
                                               ^
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n"; #/
my @words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } @words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
if ($string !~ /$exclude/) {
say "OK: $string";
}
}
I use Path::Tiny to read the file into a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any happen in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This may be possible to tweak and improve, depending on your use cases, for one in regards to the order of patterns with common parts in alternation.†
The check that successive duplicate characters do not occur more than three times:
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
my @chars_that_repeat = $string =~ /(.)\1+/g;
if (@chars_that_repeat < 3) {
say "OK: $string";
}
}
A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's (aaaa) will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
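The counting approach carries over to other languages directly. Here is a minimal Python sketch of it, with hypothetical function names, showing how the + quantifier changes what is counted.

```python
import re

def count_repeat_runs(s):
    """Count runs of a repeated character; 'aaaa' is one run."""
    return len(re.findall(r"(.)\1+", s))

def count_pairs(s):
    """Count non-overlapping doubled characters; 'aaaa' is two pairs."""
    return len(re.findall(r"(.)\1", s))

for s in ["adminnisstrator21", "kkeeykloakk", "stack22ooverflow"]:
    status = "OK" if count_repeat_runs(s) < 3 else "rejected"
    print(s, count_repeat_runs(s), status)
```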
†  Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==> so|sole|solely
for a quicker match (so matches all three). This, by all means, appears to be the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round,
sort { length $b <=> length $a }
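The effect of the ordering is easy to see in a small experiment. This Python sketch (variable names are illustrative) searches the string solely with both orderings and shows which alternative is reported.

```python
import re

words = ["sole", "solely", "so"]

# Shortest first: the first alternative that matches wins, so "so" is reported.
shortest_first = "|".join(re.escape(w) for w in sorted(words, key=len))
# Longest first: the full word is tried before its prefixes.
longest_first = "|".join(
    re.escape(w) for w in sorted(words, key=len, reverse=True)
)

text = "solely"
print(shortest_first, "->", re.search(shortest_first, text).group())
print(longest_first, "->", re.search(longest_first, text).group())
```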
I hope someone else will come with a better solution, but this seems to do what you want:
\b Match word boundary
(?: Start non-capturing group
(?:([a-z0-9])(?!\1))* Match all characters until it encounters a double
(?:([a-z0-9])\2)+ Match all repeated characters until a different one is reached
){0,2} Match the group 0 to 2 times
(?:([a-z0-9])(?!\3))+ Match all characters until it encounters a double
\b Match end of word
I changed the [a-z] to also match numbers, since the examples you gave seem to also include numbers. Perl regex also has the \w shorthand, which is equivalent to [A-Za-z0-9_], which could be handy if you want to match any character in a word.
My problem is that I have 2 regexes that work:
Do not allow over 3 pairs of chars:
(?=^(?!(?:\w*(.)\1){3}).+$)
Do not allow a char to repeat over 4 times:
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
Now I want to combine them into one line like:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
but only the first regex works, not both of them.
As mentioned in a comment to @zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you. If the words in the file are not very carefully ordered to start, I use a subroutine like this when building the match string:
# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST
# TRUSTEES|TRUSTEE|TRUST
sub tight_match_order {
return @_ unless @_ > 1;
my (@alts, @ordered_alts, %alts_seen);
@alts = map { $alts_seen{$_}++ ? () : $_ } @_;
TEST: {
my $alt = shift @alts;
if (grep m#$alt#, @alts) {
push @alts => $alt;
} else {
push @ordered_alts => $alt;
}
redo TEST if @alts;
}
@ordered_alts
}
So following @zdim's answer:
...
my @words = split ' ', path($file)->slurp;
@words = tight_match_order(@words); # add this line
my $exclude = join '|', map { quotemeta } @words;
...
HTH
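For comparison, here is a Python sketch of the same ordering idea. It is not a line-for-line port: it assumes the words are plain literals and uses a substring test where the Perl version uses a regex match, and it keeps the name tight_match_order only for readability.

```python
from collections import deque

def tight_match_order(words):
    """Order words so that any word containing another comes first."""
    pending = deque(dict.fromkeys(words))  # drop duplicates, keep order
    ordered = []
    while pending:
        word = pending.popleft()
        if any(word in other for other in pending):
            # Some remaining word contains this one: retry it later.
            pending.append(word)
        else:
            ordered.append(word)
    return ordered

print("|".join(tight_match_order(["TRUST", "TRUSTEE", "TRUSTEES"])))
```

Removing duplicates first matters: two identical words would each be a substring of the other and the loop would never finish.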

Regex to find strings not containing a specified value

I'm using notepad++'s regular expression search function to find all strings in a .txt document that do not contain a specific value (HIJ in the below example), where all strings begin with the same value (ABC in the below example).
How would I go about doing this?
Example
Every String starts with ABC
ABC is never used in a string other than at the beginning,
ABCABC123 would be two strings --"ABC" and "ABC123"
HIJ may appear multiple times in a string
I need to find the strings that do not contain HIJ
Input is one long file with no line breaks, but does contain special characters (*, ^, #, ~, :) and spaces
Example Input:
ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J
Example Input would be viewed as the following strings
ABC1234HIJ56
ABC7#HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J
The Third and Fifth strings both lack "HIJ" and therefore are included in the output, all others are not included in the output.
Example desired output:
ABC89
ABC12~34HI456J
I am 99% new to RegEx and will be looking more into it in the future, as my job description suddenly changed earlier this week when someone else in the company left suddenly, and therefore I have been doing this manually by searching (ABC|HIJ) and going through the search function's results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.
Any help would be appreciated!
This question is a repeat of a prior question I asked, but I was very very bad at formatting a question and it seems to have sunk beyond noticeable levels.
You can find the items you want with:
ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)
Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ)
pattern details:
ABC
(?: # non-capturing group
[^HA]+ # all that is not a H or an A
| # OR
H(?!IJ) # an H not followed by IJ
|
A(?!BC) # an A not followed by BC
)*+ # repeat the group
(?=ABC|$) # followed by "ABC" or the end of the string
Note: if you want to remove everything except the items you want, you can use this search and replace:
search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n
You could use this pattern:
(ABC(?:(?!HIJ).)*?)(?=ABC|\R)
( # Capturing Group (1)
ABC # "ABC"
(?: # Non Capturing Group
(?! # Negative Look-Ahead
HIJ # "HIJ"
) # End of Negative Look-Ahead
. # Any character except line break
) # End of Non Capturing Group
*? # (zero or more)(lazy)
) # End of Capturing Group (1)
(?= # Look-Ahead
ABC # "ABC"
| # OR
\R # <line break>
) # End of Look-Ahead
You can use the following expression to match your criterion:
(^ABC(?:(?!HIJ).)*$)
This starts with ABC and looks ahead (negative) for HIJ pattern. The pattern works for the separated strings.
For a single line pattern (as provided in your question), a slight modification of this works (as follows):
(ABC(?:(?!HIJ).)*?)(?=ABC|$)
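To check the single-line pattern against the example input from the question, a short Python sketch follows. The capture group is dropped so that findall returns whole matches, and the function name strings_without_hij is hypothetical.

```python
import re

# ABC, then characters that never begin "HIJ", up to the next ABC or the end.
PATTERN = re.compile(r"ABC(?:(?!HIJ).)*?(?=ABC|$)")

def strings_without_hij(text):
    return PATTERN.findall(text)

sample = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
print(strings_without_hij(sample))
```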

regular expressions: find every word that appears exactly one time in my document

Trying to learn regular expressions. As a practice, I'm trying to find every word that appears exactly one time in my document -- in linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon)
So I thought the following expression would give me the desired result:
\w{1}
But this doesn't work. The \w returns a character, not a whole word. Also it does not appear to be giving me characters that appear only once (it actually returns 25873 matches -- which I assume are all alphanumeric characters). Can someone give me an example of how to find a "hapax legomenon" with a regular expression?
If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
Console.WriteLine(m.Value);
Output:
bird
horse
It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
@"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
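The count-then-filter approach described here needs very little code in practice. The Python sketch below is one way to do it, assuming words are runs of ASCII letters and that matching is case-sensitive; the function name hapax_legomena is hypothetical.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words that occur exactly once, in order of first appearance."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(words)
    return [word for word in counts if counts[word] == 1]

print(hapax_legomena("goat goat leopard bird leopard horse"))
```

An in-memory counter is enough for a single document; a database would only become relevant for text that does not fit in memory.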
(\w+){1} will match each word.
After that you could always perform the count on the matches.
Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
Low level but does what you want:
Split your text into an array using preg_split (split was removed in PHP 7):
$array = preg_split('/\s+/', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Check each word if it is a word:
if (preg_match('/[^a-zA-Z]/', $word)) continue;
Add the word to a temporary array as key:
if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop. Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}

regex to match RTRIM(LTRIM(xx)) = xx

I am trying to write a regex to find where I am using LTRIM/RTRIM in the WHERE clause of stored procedures.
The regex should match things like:
RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))='P'))
RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))= P
RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))
etc...
I am trying something like...
.TRIM.*\)\s+
[RL]TRIM\s*\( Will look for R or L followed by TRIM, any number of whitespace, and then a (
Is this what you want?
[LR]TRIM\([RL]TRIM\([^)]+\)\)\s*=\s*[^)]+\)*
What that's doing is saying:
[LR] # Match single char, either "L" or "R"
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[RL] # Match single char, either "R" or "L" (same as [LR], but easier to see intent)
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)\) # Match two closing parentheses
\s* # Zero or more whitespace characters
= # Match "="
\s* # Again, optional whitespace (not req unless next bit is captured)
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)* # Match zero or more closing parentheses.
If this is automated and you want to know which variables are in it, you can wrap parentheses around the relevant parts:
[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*
Which will give you the first and second variables in groups 1 and 2 (either \1 and \2 or $1 and $2 depending on regex used).
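To see the two groups in action, here is a small Python sketch that runs the capturing pattern over the sample lines from the question. The names TRIM_COMPARE and find_trim_comparisons are hypothetical, and the pattern is used exactly as given, so it is case-sensitive and does not allow whitespace inside the TRIM calls.

```python
import re

TRIM_COMPARE = re.compile(r"[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*")

def find_trim_comparisons(sql):
    """Return (column, compared value) pairs for each nested TRIM comparison."""
    return [(m.group(1), m.group(2)) for m in TRIM_COMPARE.finditer(sql)]

samples = [
    "RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))",
    "RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))",
    "RTRIM(LTRIM(PGM_TYPE_CD))= P",
    "RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))",
]
for sample in samples:
    print(find_trim_comparisons(sample))
```

Because group 2 is [^)]+, it runs until the next closing parenthesis or the end of the input, so on real stored procedures it may capture more than the compared value.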
How about something like this:
.*[RL]TRIM\s*\(\s*[RL]TRIM\s*\(([^\)]*)\)\s*\)\s*=\s*(.*)
This will capture the inside of the trim and the right side of the = in groups 1 and 2, and should handle all whitespace in all relevant areas.