RegEx select all between two characters - regex

Example:
I want to extract everything between "Item:" and the closing " * "
Item: *Sofa (1 SET), 2 × Mattress, 3 × Baby Mattress, 5
Seaters Car (Fabric)*
Total price: 100.00
Subtotal: 989.00
But I only managed to extract "Item: *" and " Seaters Car (Fabric)* " by using (.*?)\*

After matching Item:, match anything but a colon with [^:]+, and then lookahead for a newline, ensuring that the match ends at the end of a line just before another label (like Total price:) starts:
Item: ([^:]+)(?=\n)
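As a quick check of the pattern above, here is a minimal sketch in Python. The choice of Python's `re` module is an assumption, since the question does not say which regex engine is in use, and the variable names are only illustrative.

```python
import re

# Sample text from the question; the item list wraps onto a second line.
text = (
    "Item: *Sofa (1 SET), 2 × Mattress, 3 × Baby Mattress, 5\n"
    "Seaters Car (Fabric)*\n"
    "Total price: 100.00\n"
    "Subtotal: 989.00"
)

# [^:]+ also matches newlines, so it first runs up to the next colon;
# the lookahead then makes it back off to the end of a line.
match = re.search(r"Item: ([^:]+)(?=\n)", text)
item = match.group(1) if match else None
print(item)
```

Group 1 spans both lines of the item list because the negated character class crosses the line break.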

Related

C++11 Regex search - Exclude empty submatches

From the following text I want to extract the number and the unit of measurement.
I have 2 possible cases:
This is some text 14.56 kg and some other text
or
This is some text kg 14.56 and some other text
I used | to match both cases.
My problem is that it produces empty submatches, which gives me an incorrect number of matches.
This is my code:
std::smatch m;
std::string myString = "This is some text kg 14.56 and some other text";
const std::regex myRegex(
R"(([\d]{0,4}[\.,]*[\d]{1,6})\s+(kilograms?|kg|kilos?)|\s+(kilograms?|kg|kilos?)(\s+[\d]{0,4}[\.,]*[\d]{1,6}))",
std::regex_constants::icase
);
if( std::regex_search(myString, m, myRegex) ){
std::cout << "Size: " << m.size() << std::endl;
for(std::size_t i=0; i<m.size(); i++)
std::cout << m[i].str() << std::endl;
}
else
std::cout << "Not found!\n";
OUTPUT (the blank lines printed for the two empty submatches are not shown):
Size: 5
kg 14.56
kg
14.56
I want an easy way to extract those 2 values, so my guess is that I want the following output:
WANTED OUTPUT:
Size: 3
kg 14.56
kg
14.56
This way I can always directly extract the 2nd and 3rd submatches, but in this case I would also need to check which one is the number. I know how to do it with two separate searches, but I want to do it the right way: with a single search, and without using C++ to check whether a submatch is an empty string.
Using this regex, you just need the contents of Group 1 and Group 2
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
Explanation:
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
(?:kilograms?|kilos?|kg) - matches kilograms or kilogram or kilos or kilo or kg
| - OR
(?:\d{0,4}(?:\.\d{1,6})) - matches 0 to 4 digits followed by a dot and 1 to 6 decimal digits (the decimal part is required)
\s* - matches 0+ whitespaces
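To see what Group 1 and Group 2 contain for both orderings, here is a minimal sketch using Python's `re` module (an assumption; the question itself uses C++). The helper name `extract_pair` and the `UNIT`/`NUMBER` constants are hypothetical; they only split the pattern above into readable parts.

```python
import re

# The two building blocks of the pattern, copied from the answer.
UNIT = r"(?:kilograms?|kilos?|kg)"
NUMBER = r"(?:\d{0,4}(?:\.\d{1,6}))"

# Same structure as the answer: (unit or number), optional whitespace,
# (unit or number). Concatenation avoids f-string brace escaping.
pattern = re.compile(
    "(" + UNIT + "|" + NUMBER + ")" + r"\s*" + "(" + UNIT + "|" + NUMBER + ")",
    re.IGNORECASE,
)

def extract_pair(text):
    """Return (group 1, group 2) of the first match, or None."""
    m = pattern.search(text)
    return (m.group(1), m.group(2)) if m else None

number_first = extract_pair("This is some text 14.56 kg and some other text")
unit_first = extract_pair("This is some text kg 14.56 and some other text")
print(number_first)
print(unit_first)
```

Both groups are always filled, but their order follows the text, so the caller still has to decide which of the two is the number.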
You can try this out:
((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))
As shown here: https://regex101.com/r/9O99Fz/3
USAGE -
As I've shown in the 'substitution' section, to reference the numeral part of the quantity, you have to write $2$5, and for the unit, write: $3$4
Explanation -
There are two alternatives, each wrapped in a non-capturing group: the first one, (?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg))), matches the number followed by the unit (groups 2 and 3),
and the other, (?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)), matches the unit followed by the number (groups 4 and 5)
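The `$2$5` / `$3$4` idea can be sketched outside a substitution as "take whichever of the two groups participated in the match". The example below uses Python's `re` module, which is an assumption: the pattern relies on the lookbehind `(?<!\d)`, and as far as I know the default ECMAScript grammar of C++ `std::regex` does not support lookbehind, so this exact pattern may need a different engine in C++. The function name `extract_quantity` is hypothetical.

```python
import re

# Pattern copied from the answer, split over two literals for readability.
pattern = re.compile(
    r"((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))"
    r"|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))",
    re.IGNORECASE,
)

def extract_quantity(text):
    """Return (number, unit) regardless of their order in the text."""
    m = pattern.search(text)
    if m is None:
        return None
    # Only one alternative matches, so one group of each pair is None.
    number = m.group(2) or m.group(5)
    unit = m.group(3) or m.group(4)
    return number, unit

a = extract_quantity("This is some text 14.56 kg and some other text")
b = extract_quantity("This is some text kg 14.56 and some other text")
print(a, b)
```

With this layout the caller always receives the number first and the unit second, which is what the question asks for.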

R Regex number followed by punctuation followed by space

Suppose I had a string like so:
x <- "i2: 32390. 2093.32: "
How would I return a vector that would give me the positions of where a number is followed by a : or a . followed by a space?
So for this string it would be
"2: ","0. ","2: "
The regex you need is just '\\d[\\.:]\\s'. Using stringr's str_extract_all to quickly extract matches:
library(stringr)
str_extract_all("i2: 32390. 2093.32: ", '\\d[\\.:]\\s')
produces
[[1]]
[1] "2: " "0. " "2: "
You can use it with R's built-in functions, and it should work fine, as well.
What it matches:
\\d matches a single digit (0-9)
[ ... ] defines a character class: it matches any one of the characters listed inside
\\. matches a literal period
: matches a colon
\\s matches a whitespace character (space, tab, newline).
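The question also asks for positions, so here is a minimal sketch that returns both the matched text and the start offsets. It uses Python's `re` module rather than R (an assumption made only for illustration); note that Python offsets are 0-based, whereas R counts from 1.

```python
import re

x = "i2: 32390. 2093.32: "

# A digit, then a period or colon, then one whitespace character.
# Inside a character class the period needs no escaping.
pattern = re.compile(r"\d[.:]\s")

matches = pattern.findall(x)
starts = [m.start() for m in pattern.finditer(x)]

print(matches)
print(starts)
```

The "3.3" inside 2093.32 is skipped because the period there is followed by a digit, not by whitespace.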

Clean character vector and strsplit into dataframe

I have a character vector I want to transform into a data frame.
It's mostly clean but I can't figure out how to finish the cleaning. Notice that the real data are a Date column as yyyy-mm-dd and a Variable column as a number (in this case four digits but not always) separated by a comma.
class(myvec)
[1] "character"
myvec
[1] " \"2016-01-01,8631n\" " " \"2016-01-02,8577n\" "
[3] " \"2016-01-03,8476n\" " " \"2016-01-04,8365n\" "
[5] " \"2016-01-05,8331n\" " " \"2016-01-06,8801n\" "
[7] " \"2016-01-07,5020n\""
The leading space and escaped quote (' \"') should be removed, along with the trailing n\".
The expected output should be a data frame like this
Date Variable
[1,] "2016-01-01" "8631"
[2,] "2016-01-02" "8577"
[3,] "2016-01-03" "8476"
[4,] "2016-01-04" "8365"
[5,] "2016-01-05" "8331"
[6,] "2016-01-06" "8801"
[7,] "2016-01-07" "5020"
Once the vector is clean, I think this does the job
do.call(rbind,strsplit(clean_vector,","))
I think I can convert to date with lubridate and the var to numeric with as.numeric on my own, the question is about getting the character vector clean and in the correct format.
You can remove the offending characters by enumerating them:
# example
x = " \"2016-01-01,8631n\" "
gsub("[n \"]","",x)
# "2016-01-01,8631"
This works because [xyz] identifies any single character from the list xyz.
Or you can take a substring, since the formatting is fixed-width, with bad chars at the start and end:
substr(x,3,17)
# "2016-01-01,8631"
If the var part of the string varies in length, nchar(x)-3 should work in place of 17.
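The character-class approach can be sketched end to end as follows. This is a Python translation (an assumption; the thread itself is about R) of "delete every n, space and double quote, then split on the comma". The list `myvec` copies the question's data, including the last element, which has no trailing space.

```python
import re

myvec = [
    ' "2016-01-01,8631n" ',
    ' "2016-01-02,8577n" ',
    ' "2016-01-03,8476n" ',
    ' "2016-01-04,8365n" ',
    ' "2016-01-05,8331n" ',
    ' "2016-01-06,8801n" ',
    ' "2016-01-07,5020n"',
]

# Remove any single character from the set: n, space, double quote.
cleaned = [re.sub(r'[n "]', "", s) for s in myvec]

# Split each cleaned string into a (date, value) pair.
rows = [tuple(c.split(",")) for c in cleaned]

for date, value in rows:
    print(date, value)
```

Because it deletes characters wherever they occur, this variant does not depend on every element having the same padding, unlike a fixed-offset substring.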

Regex on numbers and spaces

I'm trying to match numbers surrounded by spaces, like this string:
" 1 2 3 "
I'm puzzled why the regex \s[0-9]\s matches 1 and 3 but not 2. Why does this happen?
Because the space has already been consumed:
\s[0-9]\s
This matches "space, digit, space", so let's go through the process:
" 1 2 3 "
Matches at the first space; the engine consumes " 1 " (both spaces included)
"2 3 "
The remaining input starts at 2, which no longer has a space in front of it: no match
" 3 "
One character further on, the engine matches and consumes " 3 "
You want a lookaround:
(?<=\s)\d(?=\s)
This is very different, as it looks for \d and then asserts that it is preceded by, and followed by, a space. This assertion is "zero width", which means that the spaces aren't consumed by the engine.
More precisely, the regex \s[0-9]\s does not match 2 only when you go through all matches in the string " 1 2 3 " one by one. If you were to try to start matching at positions 1 or 2, " 2 " would be matched.
The reason for this is that \s is consuming part of the input, namely the spaces around the digit. When you match " 1 ", the space between 1 and 2 is already taken; the regex engine is looking at the tail of the string, which is "2 3 ". At this point, there is no space in front of 2 that the engine could consume, so it goes straight to finding " 3 ".
To fix this, put spaces into zero-length look-arounds, like this:
(?<=\s)[0-9](?=\s)
Now the engine ensures that there are spaces in front and behind the digit without consuming these spaces as part of the match. This lets the engine treat the space between 1 and 2 as a space behind 1 and also as a space in front of 2, thus returning both matches.
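The difference between consuming and asserting the spaces can be checked with a short sketch. It uses Python's `re` module, which is an assumption, as the question does not name an engine.

```python
import re

s = " 1 2 3 "

# Consuming version: the space after 1 is used up, so 2 is skipped.
consuming = re.findall(r"\s[0-9]\s", s)

# Lookaround version: the spaces are asserted but not consumed.
lookaround = re.findall(r"(?<=\s)[0-9](?=\s)", s)

# Starting the search at index 2 (the space before 2) does find " 2 ".
from_pos_2 = re.compile(r"\s[0-9]\s").search(s, 2).group()

print(consuming)
print(lookaround)
print(repr(from_pos_2))
```

The third call illustrates the point above: \s[0-9]\s can match " 2 " in isolation, it only misses it when matches are collected one after another.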
The space is consumed by the first match, so the next match cannot use it. You can use a lookahead to fix this:
\s+\d+(?=\s+)
The expression \s[0-9]\s matches " 1 " and " 3 ". As the space after the 1 is matched, it can't also be used to match " 2 ".
You can use a positive lookbehind and a positive lookahead to match digits that are surrounded by spaces:
(?<= )(\d+)(?= )
Demo: https://regex101.com/r/hT1dT6/1

Parse text file to DGV, need to replace more than 1 consecutive periods

I am parsing a text file and importing it into a Data Grid View. The file is set up like so:
number 1 .......... 845.6
number 2 ....... 0.0001
col 1 col 2 col 3
1.233 4.55 1000
I need to get these values into a DGV so that I can then insert them into an Excel template. I have everything working except for one line: the "number 1" line ends up all in one cell. I'll explain why.
I use the following code to process each line and sort of create a csv out of the data first.
TextLine = Regex.Replace(TextLine, " {2,}", " ")
TextLine = Replace(TextLine, " ", ",")
Since the data in the file is separated by more than one consecutive space on every line (except "number 1"), I can just replace each occurrence with a comma and I get a nice result. What I was trying to do was replace consecutive periods with a space so that "number 1" is separated from the value.
I have tried a few things:
TextLine = Regex.Replace(TextLine, "(.)\1{2,}", " ")
TextLine = Regex.Replace(TextLine, ".{2,}", " ")
The top one works, but it obviously also removes any other character that is repeated consecutively. I also can't just remove all periods, as some numbers have a decimal point. I was thinking the solution might be using "Chr(46)" in that function, but I can't seem to make it work.
You need to escape the dot with a backslash so that it doesn't have a special meaning.
It looks like this is what you want:
Dim textLine As String = "number 1 .......... 845.6"
textLine = Regex.Replace(textLine, "\.{2,}", ",")
Console.WriteLine(textLine) ' outputs "number 1 , 845.6"
Or maybe Regex.Replace(textLine, "\.{2,}", " ") to replace two-or-more consecutive dots with a space.
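The effect of escaping the dot can be compared side by side. The sketch below uses Python's `re.sub` instead of .NET's `Regex.Replace` (an assumption made for illustration; the pattern syntax for this case is the same in both).

```python
import re

line1 = "number 1 .......... 845.6"
line2 = "number 2 ....... 0.0001"

# Escaped dot: only runs of two or more literal periods are replaced,
# so the single decimal point in each value survives.
escaped1 = re.sub(r"\.{2,}", ",", line1)
escaped2 = re.sub(r"\.{2,}", ",", line2)

# Unescaped dot: "." means "any character", so the whole line is one run.
unescaped = re.sub(r".{2,}", ",", line1)

print(escaped1)
print(escaped2)
print(unescaped)
```

The unescaped variant collapses the entire line into a single comma, which is why the second attempt in the question appeared to destroy the data.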