Regular expression that finds and replaces a long string of words - regex

I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.

I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild it as you go, based on your input.
string retVal = "";
foreach (char c in myInput) {
    if (char.IsUpper(c)) {
        retVal += " " + c;  // start a new word before each capital
    } else {
        retVal += c;        // copy everything else unchanged
    }
    // insert the rest of your conditions
}

Try this regular expression: \d+|[A-Z][a-z]*
It will collect all the matches, which you must then join with spaces.
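As a rough illustration of the collect-and-join idea above, here is a minimal Python sketch. Python is my own choice (the question does not name a language), and `split_title` is a hypothetical helper name.

```python
import re

def split_title(filename):
    # Collect runs of digits and capitalized words. The lowercase ".jpg"
    # never starts a match, so the extension is dropped as a side effect.
    tokens = re.findall(r"\d+|[A-Z][a-z]*", filename)
    return " ".join(tokens)

print(split_title("013TheBlueCowJumpedOverTheFence1984.jpg"))
# 013 The Blue Cow Jumped Over The Fence 1984
```

Note that this only drops the extension because it is all lowercase; a name ending in ".JPG" would need the extension removed separately.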

This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove the leading space added by the first step, along with the extension.
The first expression can be taken in parts: (?<![\d])\d finds a digit not preceded by another digit; the second, ((?<![A-Z])[A-Z](?![A-Z])), finds an uppercase letter not preceded or followed by an uppercase letter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
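To check the two-step approach, here is a small sketch using Python's `re` module, which I am assuming is close enough to the answer's regex flavor for these patterns. The names `BOUNDARY`, `TRIM` and `humanize` are made up for the example.

```python
import re

# Final pattern from the answer: a digit not preceded by a digit, a lone
# capital, or a run of capitals that is not followed by a lowercase letter.
BOUNDARY = re.compile(
    r"(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))"
)
# Second step: skip leading whitespace, keep everything up to the last dot.
TRIM = re.compile(r"\s*(.*)\s*\..*$")

def humanize(filename):
    spaced = BOUNDARY.sub(r" \1", filename)  # step 1: insert the spaces
    return TRIM.sub(r"\1", spaced)           # step 2: drop space and extension

print(humanize("013TheBlueCowJumpedOverTheFence1984.jpg"))
# 013 The Blue Cow Jumped Over The Fence 1984
print(humanize("BackInTheUSSR.jpg"))
# Back In The USSR
```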

match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.

Related

Splitting strings separated by \r\n into array of strings [C/C++]

I have string containing e.g. "FirstWord\r\nSecondWord\r\nThird Word\n\r" and so on...
I want to split it to string array using vector <string> so I would get:
FileName[0] == "FirstWord";
FileName[1] == "SecondWord";
FileName[2] == "Third Word";
Also, note the space in the third string.
This is what I've got so far:
string text = Files; // Files var contains the huge string of lines separated by \r\n
vector<string> FileName; // (optionally) Here I want to store the result without \r\n
regex rx("[^\\s]+\r\n");
sregex_iterator FormatedFileList(text.begin(), text.end(), rx), rxend;
while(FormatedFileList != rxend)
{
FileName.push_back(FormatedFileList->str().c_str());
++FormatedFileList;
}
It works, but when it comes to the third string which is "Third Word\r\n", it only gives me "Word\r\n".
Can anyone explain to me how regular expressions work? I'm a bit confused.
\s matches all spaces, including regular space, tab and a few others. You only want to exclude \r and \n, so your regex should be
regex rx("[^\r\n]+\r\n");
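The difference between the two character classes is easy to see in a quick experiment. This is a sketch in Python rather than C++, on the assumption that `\s`, `\r` and `\n` mean the same thing in both engines; the variable names are mine.

```python
import re

text = "FirstWord\r\nSecondWord\r\nThird Word\r\n"

# [^\s] excludes the space too, so the match can only start after "Third ".
broken = re.findall(r"[^\s]+\r\n", text)
# [^\r\n] excludes only the line-break characters, so the space survives.
fixed = re.findall(r"[^\r\n]+\r\n", text)
# Stripping the terminator gives the clean names the question asks for.
names = [m.rstrip("\r\n") for m in fixed]

print(broken)  # ['FirstWord\r\n', 'SecondWord\r\n', 'Word\r\n']
print(names)   # ['FirstWord', 'SecondWord', 'Third Word']
```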
EDIT: This will not fit in a comment, and it will not be exhaustive -- regexes are a fairly complex topic, but I'll do my best to give a cursory explanation. All of this does make more sense if you grok formal languages, so I encourage you to read up on it, and there are countless regex tutorials on the net that go into more detail and that you should also read. Okay.
Your code uses sregex_iterator to walk through all places in the string text where the regular expression rx matches, then turns them into strings and saves them. So, what are regular expressions?
Regular expressions are a way of applying pattern matching to strings. This can range from simple substring searches to...well, to complex substring searches, really. Instead of just looking for an instance of "oba" in the string "foobar", for example, you might search for "oo" followed by any character followed by "a" and find it in "foobar" as well as in "foonarf".
In order to enable this kind of pattern search, you must have a way to specify what pattern you are looking for, and one such way are regular expressions. The details vary across implementations, but in general it works by defining special characters that match special things or modify the behaviour of other parts of the pattern. This sounds confusing, so let's consider a few examples:
The period . matches any single character
Something followed by the Kleene star * matches zero or more instances of that something
Something followed by a + will match one or more instances of that something
brackets [, ] enclose a set of characters; the whole thing then matches any one of those characters.
The caret ^ inverts the selection of a bracket expression
Still confusing. So let's put it together:
oo.a
is a regular expression using the period. This will match "oo.a", "ooba", "oona", "oo|a" and anything else that is two o's followed by one character followed by an a. It will not match "ooa", "oba" or "nonsense".
a*
will match "", "a", "aa", "aaa", and any other sequence consisting only of a's but nothing else.
[fgh]oobar
will match any of "foobar", "goobar", and "hoobar", nothing else.
[^fgh]oobar
will match "aoobar", "boobar", "coobar" and so forth but not "foobar", "goobar" and "hoobar".
[^fgh]+oobar
will match "aoobar", "aboobar", "abcoobar", but not "oobar", "foobar", "agoobar", and "abhoobar".
In your case,
[^\r\n]+\r\n
will match any instance of one or more characters that are neither \r nor \n followed by \r\n. You then iterate through all those matches and save the matched portions of text.
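If you want to try the examples above yourself, a short script helps. The sketch below uses Python's `re` module (my choice; the basic syntax shown here is the same in most engines), and `matches` is a hypothetical helper that keeps the candidates a pattern matches in full.

```python
import re

def matches(pattern, candidates):
    # Keep only the candidates that the pattern matches from start to end.
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(matches(r"oo.a", ["oo.a", "ooba", "oo|a", "ooa", "oba"]))
# ['oo.a', 'ooba', 'oo|a']
print(matches(r"a*", ["", "a", "aaa", "ab"]))
# ['', 'a', 'aaa']
print(matches(r"[fgh]oobar", ["foobar", "goobar", "aoobar"]))
# ['foobar', 'goobar']
print(matches(r"[^fgh]+oobar", ["aoobar", "abcoobar", "oobar", "agoobar"]))
# ['aoobar', 'abcoobar']

# Searching instead of full matching finds the pattern inside longer text.
print(bool(re.search(r"oo.a", "foonarf")))
# True
```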
That is about as deep as I believe I can reasonably go here. This rabbit hole is very deep, which means that you can do freaky cool stuff with regexes but that you should not expect to master them in a day or two. Most of it goes along the lines of what I just outlined, but in true programmer's fashion, most regex implementations go beyond the mathematical scope of regular languages and expressions and introduce useful but mindbendy stuff. Dragons be ahead, but the journey is worth it.
One simple alternative will be to use split_regex from Boost. Eg. split_regex(out, input, boost::regex("(\r\n)+")) where out is a vector of string and input is the input string. A complete example is pasted below:
#include <vector>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
#include <boost/regex.hpp>
using std::endl;
using std::cout;
using std::string;
using std::vector;
using boost::algorithm::split_regex;
int main()
{
vector<string> out;
string input = "aabcdabc\r\n\r\ndhhh\r\ndabcpqrshhsshabc";
split_regex(out, input, boost::regex("(\r\n)+"));
for (auto &x : out) {
std::cout << "Split: " << x << std::endl;
}
return 0;
}
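For comparison, the same split-on-runs-of-CRLF idea can be sketched without Boost. This is Python, not C++, so treat it only as an illustration of the technique; the input is the one from the example above.

```python
import re

text = "aabcdabc\r\n\r\ndhhh\r\ndabcpqrshhsshabc"

# A non-capturing group keeps the separators out of the result; with a
# capturing group, re.split would also return the captured separator text.
parts = re.split(r"(?:\r\n)+", text)

for part in parts:
    print("Split:", part)
```

Because the pattern consumes a whole run of line breaks at once, the empty line in the input does not produce an empty entry.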
This is also one way to go:
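// strtok (from <cstring>) modifies its input, so pass the string's own
// writable buffer instead of casting away const from c_str(). Files is altered.
char * pch = strtok(&Files[0], "\r\n");
while(pch != NULL)
{
FileName.push_back(pch);
pch = strtok(NULL, "\r\n");
}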
regex rx("[^\\s]+\r\n"); looks like you are trying to match the strings rather than split the text. The negated character class [^\\s] matches any character that is not whitespace (horizontal spaces or line breaks). The third line contains a horizontal space, so your regex matches only the text that follows that space. By default, . matches any character except line breaks, so you could use regex rx(".+\r\n"); instead of regex rx("[^\\s]+\r\n");

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
If I specify "number n1" in the script's regexp, it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Right now it always prints the final line (numbers 56,4,5,5:) as output. How can I resolve this issue?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
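The greedy versus non-greedy behaviour described above can be reproduced in a few lines. The sketch below is Python, not Tcl, and the engines differ in one relevant way: Python lets each quantifier keep its own greediness, so the capture can simply stop at the next newline with [^\n]* and the trailing \n trick is not needed there. The variable names are mine.

```python
import re

text = (
    "\na for apple\nb for ball\nc for cat\n"
    "number n1\nnumbers 2,3,4,5,6\n"
    "d for doctor\ne for egg\n"
    "number n2\nnumbers 56,4,5,5\n"
)

# Greedy .* runs to the end and backtracks to the LAST "numbers ".
greedy = re.search(r"number n1.*(numbers [^\n]*)", text, re.DOTALL)
# Lazy .*? grows one character at a time and stops at the FIRST "numbers ".
lazy = re.search(r"number n1.*?(numbers [^\n]*)", text, re.DOTALL)

print(greedy.group(1))  # numbers 56,4,5,5
print(lazy.group(1))    # numbers 2,3,4,5,6
```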
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
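To make the "variable as a literal" point concrete, here is a sketch in Python, where `re.escape` plays the role that the regsub clean-up plays in Tcl. The function name `numbers_for` is hypothetical, and `data` is the colon-joined text from the question.

```python
import re

def numbers_for(label, data):
    # re.escape neutralizes any regex metacharacters inside the label,
    # so it is always treated as literal text.
    pattern = "number " + re.escape(label) + r"[^:]*:(numbers [^:]*)"
    match = re.search(pattern, data)
    return match.group(1) if match else None

data = (
    ":a for apple:b for ball:c for cat:number n1:numbers 2,3,4,5,6"
    ":d for doctor:e for egg:number n2:numbers 56,4,5,5:"
)

print(numbers_for("n1", data))  # numbers 2,3,4,5,6
print(numbers_for("n2", data))  # numbers 56,4,5,5
print(numbers_for("n.", data))  # None, because the dot is taken literally
```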

split text into words and exclude hyphens

I want to split a text into its single words using regular expressions. The obvious solution would be to use the regex \\b; unfortunately, it also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anyone who needs this for another purpose (e.g. inserting something), the following is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
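A quick way to see what the [^\w\-]+ separator does is to run it over the sentence from the question. The sketch below is Python rather than Java, and `split_words` is a made-up helper; unlike the \b approach, it discards the punctuation instead of returning it as separate items.

```python
import re

def split_words(text):
    # Split on runs of characters that are neither word characters nor
    # hyphens, then drop the empty string left by the trailing period.
    return [w for w in re.split(r"[^\w\-]+", text) if w]

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")
words = split_words(s)
print(words[9:12])  # ['like', 'user-generated', 'and']
```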
You can use this:
(?<!-)\\b(?!-)

Perl regex | Match second from the right

I'm trying to parse an OID and extract the #18 but I am unsure on how to write it to count Right to Left using a dot as a delimiter:
1.3.6.1.2.1.31.1.1.1.18.10035
This regex will grab the last value
my $ifindex = ($_=~ /^.*[.]([^.]*)$/);
I haven't found a way to tweak it to get the value I need yet.
How about:
my $str = "1.3.6.1.2.1.31.1.1.1.18.10035";
say ((split(/\./, $str))[-2]);
output:
18
If the format is always the same (ie. always second from right) then you can either use:-
m/(\d+)\.\d+$/;
..and the answer will end up in: $1
Or a different approach would be to split the string into an array on the dots and examine the penultimate value in the array.
What you need is simpler:
my $ifindex;
if (/(\d+)\.\d+$/)
{
$ifindex = $1;
}
A couple of comments:
You don't need to match the entire string, only the part you care about. Thus, no need to anchor to the beginning with ^ and use .*. Anchor to the end only.
[.] is a character class, intended for matching groups of characters. e.g., [abc] will match either a, b, or c. It should be avoided when matching a single character; just match that character instead. In this case you do need to escape it, since it is a special character: \..
I have assumed based on your example that all of the terms have to be numbers. Hence, I used \d+ for the terms.
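Both suggestions (anchor the regex at the end, or split on the dots) are short enough to compare side by side. This sketch is in Python rather than Perl, and the variable names are mine.

```python
import re

oid = "1.3.6.1.2.1.31.1.1.1.18.10035"

# Regex approach: capture the digits just before the final ".digits".
match = re.search(r"(\d+)\.\d+$", oid)
by_regex = match.group(1) if match else None

# Split approach: the second element from the right.
by_split = oid.split(".")[-2]

print(by_regex, by_split)  # 18 18
```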
my ($ifindex) = ($_ =~ /^.*[.]([^.]*)[.][^.]*$/);

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
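The same pattern can be tried outside the browser. The sketch below is Python, with one addition of my own that is not in the answer: a capturing group around the body, so the contents can be read without the surrounding quotes. `QUOTED` and `source` are hypothetical names.

```python
import re

# Either a character that is neither a quote nor a backslash, or a
# backslash followed by any single character (an escape sequence).
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

# A raw string, so the backslashes really are part of the text.
source = r'''function(){ return " It\'s big \"problem "; }'''

match = QUOTED.search(source)
print(match.group(0))  # the whole literal, quotes included
print(match.group(1))  # the contents only
```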
This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
Most of the solutions provided here use alternative repetition paths, i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
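To convince yourself that the alternation-based pattern and the "unrolled" pattern quoted in the earlier answers accept the same strings, you can run both over a long input. The sketch below uses Python and plain greedy quantifiers, since I am not assuming possessive quantifiers are available in every engine; `ALTERNATION`, `UNROLLED` and the test text are made up for the example.

```python
import re

# (A|B)* form: one loop iteration per character or escape sequence.
ALTERNATION = re.compile(r'"(?:[^"\\]|\\.)*"')
# Unrolled form: whole runs of ordinary characters per iteration.
UNROLLED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# A long quoted body containing escaped quotes and escaped backslashes.
body = r'plain text \" and \\ escapes ' * 500
text = 'x = "' + body + '" + tail'

a = ALTERNATION.search(text).group(0)
u = UNROLLED.search(text).group(0)
print(a == u, len(a))
```

Whether the unrolled form actually avoids a stack overflow depends on the engine, so this only checks that the two patterns agree on the match.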
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
1. Every quoted string starts with the character ".
2. It may contain any number of any characters: .*? {lazy match}, ending with a non-escape character [^\\].
3. Statement (2) is lazy(!) optional because the string can be empty (""). So: (.*?[^\\])??
4. Finally, every quoted string ends with the character ", but it can be preceded by a number of escape-character pairs (\\\\)+; this part is greedy(!) optional: ((\\\\)+)?+ {greedy matching}, because the string can be empty or have no trailing pairs.
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or ').
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
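The two steps translate almost line for line into other languages. Here is a sketch in Python rather than Java; `mask_quoted` is a hypothetical name and the sample line is invented for the example.

```python
import re

def mask_quoted(line):
    # Step 1: turn escaped double quotes into plain single quotes.
    line = line.replace('\\"', "'")
    # Step 2: every remaining "..." pair is now a complete string literal.
    return re.sub(r'"[^"]*"', '"x"', line)

print(mask_quoted(r'log("He said \"hi\"") + "other"'))
# log("x") + "x"
```

One limitation to keep in mind: a literal that ends in an escaped backslash (\\ right before the closing quote) would be misread by step 1, so this only suits input where that cannot occur.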
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you paste it between the double quotes it will automatically be escaped into a form Java accepts.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"