Splitting strings separated by \r\n into array of strings [C/C++] - c++

I have string containing e.g. "FirstWord\r\nSecondWord\r\nThird Word\n\r" and so on...
I want to split it to string array using vector <string> so I would get:
FileName[0] == "FirstWord";
FileName[1] == "SecondWord";
FileName[2] == "Third Word";
Also, note the space in the third string.
This is what I've got so far:
string text = Files; // Files var contains the huge string of lines separated by \r\n
vector<string> FileName; // (optionally) here I want to store the result without \r\n
regex rx("[^\\s]+\r\n");
sregex_iterator FormatedFileList(text.begin(), text.end(), rx), rxend;
while(FormatedFileList != rxend)
{
FileName.push_back(FormatedFileList->str().c_str());
++FormatedFileList;
}
It works, but when it comes to the third string which is "Third Word\r\n", it only gives me "Word\r\n".
Can anyone explain to me how do the regular expressions work? I'm a bit confused.

\s matches all spaces, including regular space, tab and a few others. You only want to exclude \r and \n, so your regex should be
regex rx("[^\r\n]+\r\n");
EDIT: This will not fit in a comment, and it will not be exhaustive -- regexes are a fairly complex topic, but I'll do my best to give a cursory explanation. All of this does make more sense if you grok formal languages, so I encourage you to read up on it, and there are countless regex tutorials on the net that go into more detail and that you should also read. Okay.
Your code uses sregex_iterator to walk through all places in the string text where the regular expression rx matches, then turns them into strings and saves them. So, what are regular expressions?
Regular expressions are a way of applying pattern matching to strings. This can range from simple substring searches to...well, to complex substring searches, really. Instead of just looking for an instance of "oba" in the string "foobar", for example, you might search for "oo" followed by any character followed by "a" and find it in "foobar" as well as in "foonarf".
To enable this kind of pattern search, you need a way to specify the pattern you are looking for, and regular expressions are one such way. The details vary across implementations, but in general it works by defining special characters that match special things or modify the behaviour of other parts of the pattern. This sounds confusing, so let's consider a few examples:
The period . matches any single character
Something followed by the Kleene star * matches zero or more instances of that something
Something followed by a + will match one or more instances of that something
brackets [, ] enclose a set of characters; the whole thing then matches any one of those characters.
The caret ^ inverts the selection of a bracket expression
Still confusing. So let's put it together:
oo.a
is a regular expression using the period. This will match "oo.a", "ooba", "oona", "oo|a" and anything else that is two o's followed by one character followed by an a. It will not match "ooa", "oba" or "nonsense".
a*
will match "", "a", "aa", "aaa", and any other sequence consisting only of a's but nothing else.
[fgh]oobar
will match any of "foobar", "goobar", and "hoobar", nothing else.
[^fgh]oobar
will match "aoobar", "boobar", "coobar" and so forth but not "foobar", "goobar" and "hoobar".
[^fgh]+oobar
will match "aoobar", "aboobar", "abcoobar", but not "oobar", "foobar", "agoobar", and "abhoobar".
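The examples above can be checked with a few lines of C++. This is only a sketch: full_match is a hypothetical helper name, and it uses std::regex_match, which requires the whole string to match the pattern (that is the sense in which "match" is used in the examples).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the entire string s matches the pattern.
bool full_match(const std::string& pattern, const std::string& s) {
    return std::regex_match(s, std::regex(pattern));
}
```

For instance, full_match("oo.a", "ooba") is true while full_match("oo.a", "ooa") is false, and full_match("[^fgh]+oobar", "agoobar") is false because the g directly before "oobar" is excluded by the bracket expression.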
In your case,
[^\r\n]+\r\n
will match any instance of one or more characters that are neither \r nor \n followed by \r\n. You then iterate through all those matches and save the matched portions of text.
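Note that each match of [^\r\n]+\r\n still includes the trailing \r\n. If you want the lines without it, one option is to match only the runs of non-line-break characters. A minimal sketch, assuming that is the desired result (split_lines is a made-up name, not part of the question's code):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every run of characters that are
// neither '\r' nor '\n', so the line breaks never end up in the result.
std::vector<std::string> split_lines(const std::string& text) {
    static const std::regex rx("[^\r\n]+");
    std::vector<std::string> lines;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it) {
        lines.push_back(it->str());
    }
    return lines;
}
```

Because the pattern no longer requires a trailing \r\n, this variant also picks up a last line that has no line break after it.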
That is about as deep as I believe I can reasonably go here. This rabbit hole is very deep, which means that you can do freaky cool stuff with regexes but that you should not expect to master them in a day or two. Most of it goes along the lines of what I just outlined, but in true programmer's fashion, most regex implementations go beyond the mathematical scope of regular languages and expressions and introduce useful but mindbendy stuff. Dragons be ahead, but the journey is worth it.

One simple alternative will be to use split_regex from Boost. Eg. split_regex(out, input, boost::regex("(\r\n)+")) where out is a vector of string and input is the input string. A complete example is pasted below:
#include <vector>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
#include <boost/regex.hpp>
using std::endl;
using std::cout;
using std::string;
using std::vector;
using boost::algorithm::split_regex;
int main()
{
vector<string> out;
string input = "aabcdabc\r\n\r\ndhhh\r\ndabcpqrshhsshabc";
split_regex(out, input, boost::regex("(\r\n)+"));
for (auto &x : out) {
std::cout << "Split: " << x << std::endl;
}
return 0;
}
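If Boost is not available, roughly the same split can be sketched with the standard library's std::sregex_token_iterator, where the submatch index -1 selects the text between delimiter matches. This is a different technique from split_regex, and split_on_crlf is a hypothetical name:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits input on runs of "\r\n".
// Index -1 tells the token iterator to return the pieces between matches.
std::vector<std::string> split_on_crlf(const std::string& input) {
    static const std::regex delimiter("(?:\r\n)+");
    std::vector<std::string> out;
    std::sregex_token_iterator it(input.begin(), input.end(), delimiter, -1), end;
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Called on the input string from the example above, this should give the three pieces "aabcdabc", "dhhh" and "dabcpqrshhsshabc".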

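This is also one way to go (needs <cstring>). Note that strtok treats every character of its delimiter string as a separate delimiter and writes into the buffer it scans, so give it a writable buffer rather than the result of c_str():
char * pch = strtok(&Files[0], "\r\n"); // strtok modifies Files in place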
while(pch != NULL)
{
FileName.push_back(pch);
pch = strtok(NULL, "\r\n");
}
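A non-destructive variant of the same idea can be sketched with std::string's own search functions. Like strtok, it treats each character in delims as a delimiter and skips empty tokens; tokenize and its parameters are hypothetical names, not anything from the question:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: strtok-like splitting that leaves the input untouched.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::string& delims = "\r\n") {
    std::vector<std::string> tokens;
    std::size_t start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // stop is npos for the last token; substr then clamps to the end.
        const std::size_t stop = text.find_first_of(delims, start);
        tokens.push_back(text.substr(start, stop - start));
        start = text.find_first_not_of(delims, stop);
    }
    return tokens;
}
```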

With regex rx("[^\\s]+\r\n"); it looks like you're trying to match the strings rather than split on the delimiter. The negated character class [^\\s] matches any character that is not whitespace (horizontal spaces or line breaks). The third line contains a horizontal space, so your regex matches only the text after that space. In the default ECMAScript grammar, . matches any character except line breaks, so you could use regex rx(".+\r\n"); instead of regex rx("[^\\s]+\r\n");

Related

C++: How to extract words from string with regex

I want to extract words from a string. There are two methods I can think of that would accomplish this:
Extraction by a delimiter.
Extraction by word pattern searching.
Before I get into the specifics of my problem, I want to clarify that while I do ask about the methods of extraction and their implementations, the main focus of my problem is the regexes; not the implementations.
The words that I want to match can contain apostrophes (e.g. "Don't"), can be inside double or single quotes (apostrophes) (e.g. "Hello" and 'world') and a combination of the two (e.g. "Didn't" and 'Won't'). They can also contain numbers (e.g. "2017" and "U2") and underscores and hyphens (e.g. "hello_world" and "time-turner"). In-word apostrophes, underscores, and hyphens must be surrounded by other word characters. A final requirement is that strings containing random non-word characters (e.g. "Good mor¨+%g.") should still recognize all word-characters as words.
Example strings to extract words from and what I want the result to look like:
"Hello, world!" should result in "Hello" and "world"
"Aren't you clever?" should result in "Aren't", "you" and "clever"
"'Later', she said." should result in "Later", "she" and "said"
"'Maybe 5 o'clock?'" should result in "Maybe", "5" and "o'clock"
"In the year 2017 ..." should result in "In", "the", "year" and "2017"
"G2g, cya l8r" should result in "G2g", "cya" and "l8r"
"hello_world.h" should result in "hello_world" and "h"
"Hermione's time-turner." should result in "Hermione's" and "time-turner"
"Good mor~+%g." should result in "Good", "mor" and "g"
"Hi' Testing_ Bye-" should result in "Hi", "Testing" and "Bye"
Because – as far as I can tell – the two methods I proposed require quite different solutions I'll divide my question into two parts – one for each method.
1. Extraction by delimiter
This is the method I have dedicated the most of my time to develop, and I have found a partially working solution – however, I suspect the regex I am using is not very efficient. My solution is this (using Boost.Regex because its Perl syntax supports look behinds):
#include <string>
#include <vector>
#include <iostream>
#include <boost/regex.hpp>
std::vector<std::string> phrases({ "Hello, world!", "Aren't you clever?",
"'Later', she said.", "'Maybe 5 o'clock?'",
"In the year 2017 ...", "G2g, cya l8r",
"hello_world.h", "Hermione's time-turner.",
"Good mor~+%g.", "Hi' Testing_ Bye-"});
std::vector<std::string> words;
boost::regex delimiterPattern("^'|[\\W]*(?<=\\W)'+\\W*|(?!\\w+(?<!')'(?!')\\w+)[^\\w']+|'$");
boost::sregex_token_iterator end;
for (std::string phrase : phrases) {
boost::sregex_token_iterator phraseIter(phrase.begin(), phrase.end(), delimiterPattern, -1);
for ( ; phraseIter != end; phraseIter++) {
words.push_back(*phraseIter);
std::cout << words[words.size()-1] << std::endl;
}
}
My largest problem with this solution is my regex, which I think looks too complex and could probably be done much better. It also doesn't correctly match apostrophes at the end of words – like in example 3. Here's a link to regex101.com with the regex and the example strings: Delimiter regex.
2. Extraction by word pattern searching
I haven't dedicated too much time to pursue this path myself and mainly included it as an alternative because my partial solution isn't necessarily the best one. My suggestion as to how to accomplish this would be to do something in the vein of repeatedly searching a string for a pattern, removing each match from the string as you go until there are no more matches. I have a working regex for this method, but would still like input on it: "[A-Za-z0-9]+(['_-]?[A-Za-z0-9]+)?". Here's a link to regex101.com with the regex and the example strings: Word pattern regex.
I want to emphasize again that I first and foremost want input on my regexes, but also appreciate help with implementing the methods.
Edit: Thanks @Galik for pointing out that possessive plurals can end in apostrophes. The apostrophes associated with these may be matched in a delimiter and do not have to be matched in a word pattern (i.e. "The kids' toys" should result in "The", "kids" and "toys").
You may use
[^\W_]+(?:['_-][^\W_]+)*
See the regex demo.
Pattern details:
[^\W_]+ - one or more chars other than non-word chars and _ (matches alphanumeric chars)
(?: - start of a non-capturing group that only groups subpatterns and matches:
['_-] - a ', _ or -
[^\W_]+ - 1+ alphanumeric chars
)* - repeats the group zero or more times.
C++ demo:
std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
std::string s = "Hello, world! Aren't you clever? 'Later', she said. Maybe 5 o'clock?' In the year 2017 ... G2g, cya l8r hello_world.h Hermione's time-turner. Good mor~+%g. Hi' Testing_ Bye- The kids' toys";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << m.str() << '\n';
}

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b. Unfortunately, this one also splits words on the hyphen.
So I am searching for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit. It gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anybody who needs this for another purpose (e.g. inserting something), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
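For comparison, here is how the delimiter pattern [^\w\-]+ from the first answer could be applied in C++ rather than Java. This is only a sketch under the assumption that you want the words themselves and no punctuation tokens; split_words is a hypothetical name:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits on runs of characters that are neither
// word characters nor hyphens, so "user-generated" stays in one piece.
std::vector<std::string> split_words(const std::string& text) {
    static const std::regex delimiter(R"([^\w\-]+)");
    std::vector<std::string> words;
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1), end;
    for (; it != end; ++it) {
        if (it->length() > 0) {  // skip the empty piece before a leading delimiter
            words.push_back(it->str());
        }
    }
    return words;
}
```

Unlike the \b-based split, this drops punctuation such as "!" and "." instead of returning it as separate tokens.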

Regular expression to find position of the last alpha character that is followed by a space?

I am using ColdFusion 10. I rarely need to use regular expression and really need some help.
I have some lengthy content (up to 8,000 characters) and want to create a teaser. After a certain length (which I will define elsewhere), I want to find the last alpha character that is followed by a space. I will remove everything after that character. I will then add the ellipsis (...)
MyString = "The lazy brown fox is not a dog."
In this case, I would delete everything after the "a" that precedes "dog".
MyString = "There are 123 boxes on up the hill, says that 612 guy."
In this case, I would delete everything after the "that" that precedes "612 ".
MyString = "I fell down the stairs on June 30th, 1962."
In this case, I would delete everything after the "June" that precedes "30th".
What regular expression would I use to find the position of the last alpha [a-Z] character that is followed by a space?
MyReg = "";
LastPosition = reFindNoCase(MyReg, MyString);
I'm not sure about REFindNoCase, but I think you can try with REReplaceNoCase. I hope that CF can take back references like most regex engines do:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "$1", ALL);
EDIT: for the backreference, it appears that you use the backslash instead of the dollar sign:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "\1", ALL);
And if it goes well, you should have something like this.
.* matches anything besides a newline character, \b matches word boundaries, [a-zA-Z]+ are for alphabet characters and \s is for the space just after it.
The greediness of the first .*'s is being exploited here to capture as much as possible until you get the last word followed by a space.
And I guess you can add the ellipsis after the backreference like so:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "\1 (...)", ALL)
If you only want to use REFind(), you could maybe use this:
REFindNoCase("[A-Za-z](?:\s\d+|\w+,)*\s[^\s]+\.$", MyString);
Note that I haven't tested this against every possible scenario. I did try a few inputs that fail with the expression above but work with this one:
REFindNoCase("[A-Za-z](?:\s\d+|\s?\w+[,.-]+)*\s[^\s]+[.\s]*$", MyString);
And those are the few test subjects: link.
REFind will give you the position of the last alpha character. You can add 1 to get the position of the space in the original string.
If you're dealing with long strings, a regex would need to scan the whole string to get to the end, and it's likely more efficient to instead start at the end and work backwards.
Like this:
LastPos = len(String);
while( LastPos > 1 )
{
LastPos = String.lastIndexOf(' ',LastPos-1);
if ( mid(String,LastPos,1).matches('[a-zA-Z]') )
break;
}
NewString = left(String,LastPos);
The idea is to keep stepping backwards finding spaces, and break the loop when the previous character is a letter (or the start of the string is reached).
If you really want a regex solution, just do:
NewString = rematch('.*[a-zA-Z] ',MyString)[1];
To get the position, you do len(NewString).
(If newlines are involved, you'd need to put (?s) at the start of the expression so that the dot matches them.)
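The backwards scan is not specific to ColdFusion. A rough C++ sketch of the same idea follows; cut_at_last_alpha_space is a hypothetical name, and returning the string unchanged when no letter-plus-space pair exists is an assumption rather than something the question specifies:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keeps everything up to and including the last
// letter that is directly followed by a space, scanning from the end.
std::string cut_at_last_alpha_space(const std::string& s) {
    std::size_t pos = s.rfind(' ');
    while (pos != std::string::npos && pos > 0) {
        if (std::isalpha(static_cast<unsigned char>(s[pos - 1]))) {
            return s.substr(0, pos);  // drop the space and everything after it
        }
        pos = s.rfind(' ', pos - 1);  // step back to the previous space
    }
    return s;  // assumption: leave the string as-is if nothing qualifies
}
```

With the question's first example, "The lazy brown fox is not a dog." becomes "The lazy brown fox is not a"; the length of the result is the position of that last letter.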

Regular expression that finds and replaces a long string of words

I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.
I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild the string as you go based on your input.
string retVal = "";
foreach(char s in myInput){
if(IsCapital(s)){ // IsCapital: your own check for an uppercase letter
retVal += " " + s;
}
//insert the rest of your conditions
}
Try this regular expression: \d+|[A-Z][a-z]*
It will collect all the matches, which you must then join with spaces.
This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove trailing spaces and the extension.
The first expression can be broken into parts: (?<![\d])\d finds a digit not preceded by another digit, and ((?<![A-Z])[A-Z](?![A-Z])) finds an uppercase letter neither preceded nor followed by an uppercase letter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.
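The match-and-join idea from the answers above can be sketched in C++ as follows. Assumptions: the extension is everything from the last '.' onward, and space_out is a hypothetical name. The pattern is the [0-9]+|[A-Z][a-z]* suggested above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: strips the extension, then joins every number
// and every capitalized word with single spaces.
std::string space_out(std::string name) {
    // Drop the extension, if any (everything from the last '.').
    const std::size_t dot = name.rfind('.');
    if (dot != std::string::npos) {
        name.erase(dot);
    }

    static const std::regex piece("[0-9]+|[A-Z][a-z]*");
    std::string result;
    for (std::sregex_iterator it(name.begin(), name.end(), piece), end; it != end; ++it) {
        if (!result.empty()) {
            result += ' ';
        }
        result += it->str();
    }
    return result;
}
```

This turns "013TheBlueCowJumpedOverTheFence1984.jpg" into "013 The Blue Cow Jumped Over The Fence 1984". It shares the limitation mentioned above: "BackInTheUSSR.jpg" comes out as "Back In The U S S R", and lowercase text that does not follow a capital letter is dropped.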

How do you find all text up to the first character x on a line?

Sorry, this is probably really easy. But if you have a delimiter character on each line and you want to find all of the text before the delimiter on each line, what regular expression would do that? I don't know if the delimiter matters but the delimiter I have is the % character.
Your text will be in group 1.
/^(.*?)%/
Note: This will capture everything up to the percent sign. If you want to limit what you capture, replace the . with the escape sequence of your choice.
In python, you can use:
def GetStuffBeforeDelimeter(str, delim):
    return str[:str.find(delim)]
In Java:
public String getStuffBeforeDelimiter(String str, String delim) {
return str.substring(0, str.indexOf(delim));
}
In C++ (untested):
using namespace std;
string GetStuffBeforeDelimiter(const string& str, const string& delim) {
return str.substr(0, str.find(delim));
}
In all the above examples you will want to handle corner cases, such as your string not containing the delimiter.
Basically I would use substringing for something this simple because you can avoid scanning the entire string. Regex is overkill, and "exploding" or splitting on the delimiter is also unnecessary because it looks at the whole string.
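Since the question asks about every line, here is a sketch that applies the substring approach line by line in C++. The function name and the '%' default are assumptions. One design choice to be aware of: when a line has no delimiter, std::string::find returns npos and substr(0, npos) keeps the whole line, which may or may not be what you want.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: for each line of text, returns the part before
// the first occurrence of delim (or the whole line if delim is absent).
std::vector<std::string> text_before_delimiter(const std::string& text,
                                               char delim = '%') {
    std::vector<std::string> prefixes;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        // find yields npos when delim is missing; substr then keeps the full line.
        prefixes.push_back(line.substr(0, line.find(delim)));
    }
    return prefixes;
}
```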
You don't say what flavor of regex, so I'll use Perl notation.
/^[^%]*/m
The first ^ is a start anchor: normally it matches only the beginning of the whole string, but this regex is in multiline mode thanks to the 'm' modifier at the end, so it also matches at the start of each line. [^%] is an inverted character class: it matches any one character except a '%'. The * is a quantifier that means to match the previous thing ([^%] in this case) zero or more times.
You don't have to use a regex if you don't want to. Depending on the language you are using, there will be some sort of string function such as split().
$str = "sometext%some_other_text";
$s = explode("%",$str,2);
print $s[0];
This is PHP: it splits on % and then takes the first element of the returned array. Other languages with a splitting method work similarly.