Extracting multiple lines of text between delimiters from a single cell - regex

In Google Sheets or Excel, I would like to extract multiple lines of text between the delimiters x/ and / using a single formula.
INPUT:
x/Apple Juice/,Banana,Grape,x/Pear Juice/,Cherry,Orange,Blueberry
OUTPUT expected:
Apple Juice, Pear Juice
The input line of text may be longer or shorter and the position and instances of "x/text/" can vary.

=ARRAYFORMULA(TEXTJOIN(", ", 1, IFERROR(REGEXEXTRACT(SPLIT(A1, ","), "x/(.*)/"))))
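For anyone checking the logic outside a spreadsheet, here is a minimal Python sketch of the same extraction. It assumes the delimiters never nest, and the function name extract_delimited is made up for this illustration.

```python
import re


def extract_delimited(cell: str) -> str:
    """Join every run of text found between 'x/' and the next '/'."""
    # The non-greedy (.*?) stops at the first closing slash, so each
    # x/.../ pair is captured separately even without splitting on commas.
    return ", ".join(re.findall(r"x/(.*?)/", cell))


print(extract_delimited(
    "x/Apple Juice/,Banana,Grape,x/Pear Juice/,Cherry,Orange,Blueberry"
))  # prints: Apple Juice, Pear Juice
```

The Sheets formula can get away with the greedy (.*) because SPLIT has already cut the cell at the commas, so each piece holds at most one x/.../ pair.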

Regex to get text between 2 large spaces

I want to try and regex this text to only get "Second Baptist School" as the output by using Customer: as the set beginning for it to recognize. How would I get it so that it recognizes the beginning and gets all of the text in between the large sections of blanks?
Customer: Second Baptist School Date of Sale: 9/26/2022
Right now I'm using Customer:\s*([^ -.]+) but it only gets "Second" as the output.
You can look for 2 or more whitespace characters with:
Customer:\s*(.*?)\s{2,}
This should align with your examples above. The {2,} quantifier means 2 or more.
https://regex101.com/r/1HapOO/1
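A quick way to try the pattern is a few lines of Python. This is only a sketch: the sample line below assumes the original text has runs of several spaces around the name (they are collapsed in the question as shown), and customer_name is a hypothetical helper.

```python
import re
from typing import Optional

# Lazy capture after "Customer:", ended by a run of 2+ whitespace characters.
CUSTOMER = re.compile(r"Customer:\s*(.*?)\s{2,}")


def customer_name(line: str) -> Optional[str]:
    """Return the text after 'Customer:' up to the next wide gap, or None."""
    match = CUSTOMER.search(line)
    return match.group(1) if match else None


# Assumed spacing: the wide gaps are what the pattern relies on.
sample = "Customer:     Second Baptist School          Date of Sale: 9/26/2022"
print(customer_name(sample))  # prints: Second Baptist School
```

Single spaces inside the name are fine because they never satisfy \s{2,}; the capture only ends at a gap of two or more.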

Google Sheets: Find match, check for text in adjacent cell to matched cell

Example data here.
I'd like to use conditional formatting to highlight cells where, if the cell's number is found in another sheet, and there is text in an adjacent cell, it is highlighted.
So, given Sheet2:
When a B-column cell in Sheet1 matches an A-column cell in Sheet2, it checks if there is text in the adjacent E-column cell (in Sheet2), and highlights if there is text.
try:
=REGEXMATCH(B2&"", "^"&TEXTJOIN("$|^", 1,
FILTER(INDIRECT("Sheet2!A2:A"), INDIRECT("Sheet2!E2:E")<>""))&"$")
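To see what the formula is doing, here is a rough Python sketch of the same idea: keep only the Sheet2 keys whose E cell is non-empty, turn them into one anchored alternation, and test the Sheet1 value against it. The lists stand in for Sheet2!A2:A and Sheet2!E2:E, the sample values are invented, and should_highlight is a hypothetical name. The re.escape call is an addition that the Sheets formula does not have.

```python
import re


def should_highlight(value, keys, notes) -> bool:
    """True if value equals a key whose adjacent note cell is non-empty."""
    # Equivalent of FILTER(A2:A, E2:E <> ""): keep keys that have a note.
    wanted = [re.escape(str(k)) for k, n in zip(keys, notes) if str(n) != ""]
    if not wanted:
        return False
    # Equivalent of "^" & TEXTJOIN("$|^", 1, ...) & "$": whole-value matches only.
    pattern = "^" + "$|^".join(wanted) + "$"
    return re.search(pattern, str(value)) is not None


# Invented stand-ins for the two Sheet2 columns.
sheet2_a = [101, 102, 103]
sheet2_e = ["shipped", "", "late"]

print(should_highlight(101, sheet2_a, sheet2_e))  # prints: True
print(should_highlight(102, sheet2_a, sheet2_e))  # prints: False
```

The ^ and $ anchors on every alternative are what stop a value like 10 from matching 101.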

Joining two lines based on specific characters in Notepad++

I'm trying to join lines of data in Notepad++. Currently, the data looks like this:
It has the above format for about 100,000 rows. I want to combine row 1 with row 2, but sometimes row 2 and row 3 combine and look something like this:
I want the output to look like this (all on one line):
I tried using this formula:
SEARCH: (.+)\R(.+)
REPLACE: \1 \2
If you want to match specific characters in a regex, you can simply type that character. For example, apple will only match apple. If you want to match a digit, you can use \d. This will match 8, but not d.
If you want to match only things that end in 4 numbers separated by a dot, try this one: \n(.*?\d\d\.\d\d)\n
An explanation for each part can be found here.

c# split text file by changing the line number

I'm trying to split text file by line numbers,
for example, if I have text file like:
1 ljhgk uygk uygghl \r\n
1 ljhg kjhg kjhg kjh gkj \r\n
1 kjhl kjhl kjhlkjhkjhlkjhlkjhl \r\n
2 ljkih lkjhl kjhlkjhlkjhlkjhl \r\n
2 lkjh lkjh lkjhljkhl \r\n
3 asdfghjkl \r\n
3 qweryuiop \r\n
I want to split it to 3 parts (1,2,3),
How can I do this? the size of the text is very large (~20,000,000 characters) and I need an efficient way (like regex).
Another idea: you can use LINQ to get the groups you're after by grouping the lines on their first word. Note that this takes the first word of every line, so make sure you only have numbers there. This is using the split/join antipattern, but it seems to work nicely here.
var lines = from line in s.Split("\r\n".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries)
let lineNumber = line.Split(" ".ToCharArray(), 2).FirstOrDefault()
group line by lineNumber
into g
select String.Join("\n", g);
Notes:
GroupBy is guaranteed to yield the groups in the order their keys first appear, with the lines inside each group in their original order.
If a block appears more than once (e.g. "1 1 2 2 3 3 1"), all blocks with the same number will be merged.
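The two notes above can be seen in a small Python sketch of the same grouping idea. It is not a translation of the C# code, just an illustration. It relies on Python dicts keeping insertion order, and group_by_first_token is a made-up name.

```python
def group_by_first_token(text: str) -> list:
    """Group lines by their first space-separated token, in first-seen order."""
    groups = {}  # dicts preserve insertion order (Python 3.7+)
    for line in text.splitlines():  # splitlines() also handles \r\n endings
        if not line.strip():
            continue  # skip empty lines, like RemoveEmptyEntries
        key = line.split(" ", 1)[0]
        groups.setdefault(key, []).append(line)
    return ["\n".join(lines) for lines in groups.values()]


parts = group_by_first_token("1 a\r\n1 b\r\n2 c\r\n3 d\r\n1 e\r\n")
print(parts)  # prints: ['1 a\n1 b\n1 e', '2 c', '3 d']
```

The trailing "1 e" line is merged into the first block, which is the merging behaviour described in the second note.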
You can use a regex, but Split will not work too well. You can Match for the following pattern:
^(\d).*$ # Match first line, capture number
([\r\n]+^\1.*$)* # Match additional lines that begin with the same number
Example: here
I did try to split by $(?<=^(\d+).*)[\r\n]+^(?!\1), but it adds the line numbers as additional elements in the array, because Regex.Split includes captured groups in its result.
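For comparison, the Match approach above can be tried in Python with re.finditer. This is a sketch with one deliberate variation: it uses (\d+)\b instead of (\d) so that a line starting with 10 is not treated as a continuation of block 1. The sample text uses plain \n line endings, and split_blocks is a hypothetical name.

```python
import re

# First line of a block, then any following lines starting with the same number.
BLOCK = re.compile(r"^(\d+)\b.*$(?:\n^\1\b.*$)*", re.MULTILINE)


def split_blocks(text: str) -> list:
    """Return consecutive runs of lines that share the same leading number."""
    return [m.group(0) for m in BLOCK.finditer(text)]


sample = "1 ljhgk\n1 ljhg kjhg\n2 ljkih\n2 lkjh\n3 asdfghjkl\n"
print(split_blocks(sample))
# prints: ['1 ljhgk\n1 ljhg kjhg', '2 ljkih\n2 lkjh', '3 asdfghjkl']
```

Unlike the grouping approach, this only merges adjacent lines, so a number that comes back later in the file starts a new block.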

Regular Expression help needed to convert lst file to csv

I have a file (ratings.lst) downloaded from IMDB Interfaces. The content appears to be in the following format:
Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
My aim is to convert this file into a CSV file (comma separated) with the following desired result (example for 1 line) :
Distribution Votes Rating Title
0000001222, 301527, 8.4, The Green Mile (1999)
I am using TextPad, which supports regex-based search and replace. I'm not sure what kind of regex is needed to achieve the desired result above. Can somebody please help me with this? Thanks in advance.
The other regular expressions are somewhat overcomplicated. Because whitespace is guaranteed not to appear in the first three columns, you don't have to do a fancy match - "three columns of anything separated by whitespace" will do.
Try replacing ^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$ with \1,\2,\3,"\4" giving the following output (using Notepad++)
Distribution,Votes,Rating,"Title"
0000001222,297339,8.4,"Reservoir Dogs (1992)"
0000001223,64504,8.4,"The Third Man (1949)"
0000000115,48173,8.4,"Jodaeiye Nader az Simin (2011)"
0000001232,324564,8.4,"The Prestige (2006)"
0000001222,301527,8.4,"The Green Mile (1999)"
Note the use of a non-greedy quantifier, .+?, to prevent accidentally matching more than we should. Also note that I've enclosed the fourth column with quote marks "" in case a comma appears in the movie title - otherwise the software you use to read the file would interpret Avatar, the Last Airbender as two columns.
The nice tabular alignment is gone - but if you open the file in Excel it will look fine.
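If a script is an option, the same "three whitespace-free columns, then the rest" idea can be sketched in Python. This is an illustration, not the editor-based answer itself: it hands the quoting to the standard csv module, which only quotes a title when it actually contains a comma or a quote. The function name lst_to_csv and the Avatar row are made up for the example.

```python
import csv
import io
import re

# Three whitespace-free columns, then the title as the remainder of the line.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")


def lst_to_csv(text: str) -> str:
    """Convert whitespace-aligned ratings rows to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # lines without four columns are skipped
            writer.writerow(match.groups())
    return out.getvalue()


sample = (
    "Distribution  Votes  Rating  Title\n"
    "0000001222  297339   8.4  Reservoir Dogs (1992)\n"
    "0000001222  100000   7.0  Avatar, the Last Airbender\n"  # invented row
)
print(lst_to_csv(sample), end="")
```

With the invented row, the title comes out as "Avatar, the Last Airbender" in quotes, while the other titles stay unquoted.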
Alternately, just do the entire thing in Excel.
First replace all " with "" then do this:
Find: ^\([0-9]+\)[ \t]+\([0-9]+\)[ \t]+\([^ \t]+\)[ \t]+\(.*\)
Replace with: \1,\2,\3,"\4"
Press F8 to open Replace dialog
Make sure Regular Expression is selected
In Find what: put: ^([[:digit:]]{10})[[:space:]]+([[:digit:]]+)[[:space:]]+([[:digit:]]{1,2}\.[[:digit:]])[[:space:]]+(.*)$
In Replace with: put \1,\2,\3,"\4"
Click Replace All
Note: This uses 1 or more spaces between fields from ratings.lst - you might be better off specifying the exact number of spaces if you know it.
Also Note: I didn't put spaces between the comma-separated items, as generally you don't, but feel free to add those in.
Final Note: I put the movie title in quotes, so that if it contains a comma it doesn't break the CSV format. You may want to handle this differently.
My bad, this is a C# program. I will leave it up as an alternate solution.
The IgnorePatternWhitespace option is only there so the pattern can be commented.
This will create data which can be placed into a CSV file. Note that CSV files do not have optional whitespace in them, unlike your example.
string data = @"Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
";
string pattern = @"
^ # Always start at the Beginning of line
( # Grouping
(?<Value>[^\s]+) # Place all text into Value named capture
(?:\s+) # Match but don't capture 1 to many spaces
){3} # 3 groups of data
(?<Value>[^\n\r]+) # Append final to value named capture group of the match
";
var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace)
.OfType<Match>()
.Select (mt => string.Join(",", mt.Groups["Value"].Captures
.OfType<Capture>()
.Select (c => c.Value))
);
// result is a sequence of lines, so join it before printing
Console.WriteLine (string.Join(Environment.NewLine, result));
/* output
Distribution,Votes,Rating,Title
0000001222,297339,8.4,Reservoir Dogs (1992)
0000001223,64504,8.4,The Third Man (1949)
0000000115,48173,8.4,Jodaeiye Nader az Simin (2011)
0000001232,324564,8.4,The Prestige (2006)
0000001222,301527,8.4,The Green Mile (1999)
*/