Regular Expression: Extract the lines - regex

I try to extract the name1 (first-row), name2 (second-row), name3 (third-row) and the street-name (last-row) with regex:
Company Inc.
JohnDoe
Foobar
Industrieterrein 13
The very last row is the street name and this part is already working (the text is stored in the variable "S2").
REGEXREPLACE(S2, "(.*\n)+(?!(.*\n))", "")
This expression will return me the very last line. I am also able the extract the first row:
REGEXREPLACE(S2, "(\n.*)", "")
My problem is, that I do not know how to extract the second and third row....
Also how do I test if the text contains one, two, three or more rows?
Update:
The regex is used in the context of Scribe (a ETL tool). The problem is I can not execute sourcecode, I only have the following functions:
REGEXMATCH(input, pattern)
REGEXREPLACE(input, pattern, replacement)

If the regex language provides support for lookaheads you may count rows backwards and thus get (assuming . does not match newline)
(.*)$ # matching the last line
(.*)(?=(\n.*){1}$) # matching the second last line (excl. newline)
(.*)(?=(\n.*){2}$) # matching the third last line (excl. newline)

just use this regex:
(.+)+
explain:
.
Wildcard: Matches any single character except \n.
+
Matches the previous element one or more times.

As for a regular expression that will match each of four rows, how about this:
(.*?)\n(.*?)\n(.*?)\n(.*)
The parentheses will match, and the \n will match a new line. Note: you may have to use \r\n instead of just \n depending; try both.

You can try the following:
((.*?)\n){3}

Related

Match text between first and last word

I want to create a regular expression that matches across multiple lines. This regular expression should start by matching a line beginning with Email and end by matching a line beginning with =, and it should include all lines in the match. Example:
Input:
scrambled text asdasdad
qwert asd
Email: johndoe#john.com
John is an emplyer
john's number is +146546****
============================
gabrish
Match:
Email: johndoe#john.com
John is an emplyer
john's number is +146546****
============================
There are several ways to design an expression for this task, my guess is that, maybe this expression might be of our interest here,
Email:[\s\S]*?=.+\s*
with an m (multiline) flag:
Demo 1
if likely spaces after the last = might be unnecessary, we can just remove those and simplify it to: Email:[\s\S]*?=.+.
Another way would be to use an s (single line) flag and we would be starting with an expression similar to:
Email.*?=+
Demo 2
Try this:
content.replaceAll("^(?!(EMAIL|=).+).+$", "");

How can I search and replace guids in Sublime 3

I have a textfile where I would like to replace all GUIDs with space.
I want:
92094, "970d6c9e-c199-40e3-80ea-14daf1141904"
91995, "970d6c9e-c199-40e3-80ea-14daf1141904"
87445, "f17e66ef-b1df-4270-8285-b3c15da366f7"
87298, "f17e66ef-b1df-4270-8285-b3c15da366f7"
96713, "3c28e493-015b-4b48-957f-fe3e7acc8412"
96759, "3c28e493-015b-4b48-957f-fe3e7acc8412"
94665, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"
94405, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"
To become:
92094,
91995,
87445,
87298,
96713,
96759,
94665,
94405,
How can i accomplish this in Sublime 3?
Ctrl+H
Find: "[\da-f-]{36}"
Replace: LEAVE EMPTY
Enable regex mode
Replace all
Explanation:
" : double quote
[ : start class character
\d : any digit
a-f : or letter from a to f
- : or a dash
]{36} : end class, 36 characters must be present
" : double quote
Result for given example:
92094,
91995,
87445,
87298,
96713,
96759,
94665,
94405,
Try doing a search for this pattern in regex search mode:
"[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}"
And then just replace with empty string. This should strip off the GUID, leaving you with the output you want.
Demo
Another regex solution involving a slightly different search-replace strategy where we don't care about the GUI format and simply get the first column:
Search for ([^,]*,).* (again don't forget to activate the regex mode .*).
Replace with $1.
Details about the regular expression
The idea here is to capture all first columns. A column here is defined by a sequence of
"some non-comma character": [^,]*
followed by a comma: [^,]*,
The first column can then be followed by anything .* (the GUI format doesn't matter): [^,]*,.*
Finally we need to capture the 1st column using group capturing: ([^,]*,).*
In the replace field we use a backreference $x which refers the the x-th capturing group.

Regex Lookahead/Lookbehind if more than one occurance

I have string formulas like this:
?{a,b,c,d}
It can be can be embedded like this:
?{a,b,c,?{x,y,z}}
or this is the same:
?{a,b,c,
?{x,y,z}
}
So I have to find those commas, what are in the second and greather "level" brackets.
In the example below I marked the "levels" where I have to find all commas:
?{a,b,c,
?{x,y, <--Those
?{1,2,3} <--Those
}
}
I've tried with lookahead and lookbehind, but I'm totally confused now :/
Here is my latest working try, but it is not good at all:
OnlineRegex
Update:
To avoid misunderstanding, I don't want to count the commas.
I'd like to get groups of commas to replace them.
The condition is find the commas where more than one "open tags" before it like this: ?{
.. without closing tag like this: }
Examlpe.:
In this case I have not replace any commas:
?{1,2,3} ?{a,b,c}
But in this case I have to replace commas between a b c
?{1,2,3,?{a,b,c}}
For the examples which you have provided, the following regex works(gives the desired output as mentioned by you):
(?<!^\?{[^{}]*),(?=[\s\S]*(?:\s*}){2,})
For String ?{a,b,c,d}, see Demo1 No Match
For String, ?{a,b,c,?{x,y,z}}, see Demo2 Match successful
For String,
?{a,b,c,
?{x,y,z}
}
see Demo3 Match Successful
For String,
?{a,b,c,
?{x,y,
?{1,2,3}
}
}
see Demo4 Match Successful
For String ?{1,2,3} ?{a,b,c} ?{1,2,3} ?{a,b,c}, see Demo5 No Match
Explanation:
(?<!^\?{[^{}]*), - negative lookbehind to discard the 1st level commas. The logic applied here is it should not match the comma which is preceded by start of the string followed by ?{ followed by 0+ occurrences of any character except { or }
(?=[\s\S]*(?:\s*}){2,}) - The comma matched above must be followed by atleast 2 occurrences of }(consecutive or having only whitespaces between them)
Your question is rather unclear #norbre, but I presume you'd like to extract (i.e. "count") the number of commas.
You can't do this with a regex. Regexps can't count number of occurences. However, you can use this to extract the "internal part" and then use a spreadsheet formula to count number of commas:
^(?:\?{[a-zA-Z0-9,]+?,\n??\s*?\?{)([a-zA-Z0-9,?{}\n\s]+?(?:\n*?\s*?|})+)(?:[a-zA-Z0-9,\n\s]*})$
Try: https://regex101.com/r/Rr0eFo/5
Examples
1.
Input:
?{a,b,c,?{e,f},1,2,3}
Output:
e,f}
2.
Input:
?{a,b,c,
?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
,d,e,f}
Output:
x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
3.
Input:
?{a,b,c,?{e},1,2,3}
Output:
e}
(note that there are no commas here!)
One caveat however. As I have said, regexps can't count number of occurences.
Hence, the following sample (don't know if it's valid or not for your case) would return wrong match:
?{a,b,c,?{e,f}
,1,2,3,?{a,b}
}
Output:
e,f}
,1,2,3,?{a,b}
OK replacing commas is another story so I'll add another answer.
Your regexp engine would need to support recursion.
Still I don't see a way to do it with one regex - one match would either contain the first comma or contain everything between the braces!
What I suggest is to use one regexp to get "what is inside the inner braces", run a replace (, => "") and assemble the whole line again using submatches from the regexp.
Here it is: (\?{[^?{}]*)((?>[^?{}]|(?R))+?)([^?{}]*?\})
Try: https://regex101.com/r/IzTeY0/3
Example 1:
Input:
?{a,b,c,
?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
,d,e,f}
Submatches:
1. ?{a,b,c,
2. ?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
3.
,d,e,f}
Replace all commas in submatch 2 with anything you want, then reassamble the whole string using submatches 1 and 3.
Again, this would break the regexp:
?{a,b,c,?{e,f}
,1,2,3,?{a,b}
}
Submatch 2 would look like this:
?{e,f}
,1,2,3,?{a,b}

Regular expression to get only the first word from each line

I have a text file
#sp_id int,
#sp_name varchar(120),
#sp_gender varchar(10),
#sp_date_of_birth varchar(10),
#sp_address varchar(120),
#sp_is_active int,
#sp_role int
Here, I want to get only the first word from each line. How can I do this? The spaces between the words may be space or tab etc.
Here is what I suggest:
Find what: ^([^ \t]+).*
Replace with: $1
Explanation: ^ matches the start of line, ([^ \t]+) matches 1 or more (due to +) characters other than space and tab (due to [^ \t]), and then any number of characters up to the end of the line with .*.
See settings:
In case you might have leading whitespace, you might want to use
^\s*([^ \t]+).*
I did something similar with this:
with open('handles.txt', 'r') as handles:
handlelist = [line.rstrip('\n') for line in handles]
newlist = [str(re.findall("\w+", line)[0]) for line in handlelist]
This gets a list containing all the lines in the document,
then it changes each line to a string and uses regex to extract the first word (ignoring white spaces)
My file (handles.txt) contained info like this:
JoIyke - personal twitter link;
newMan - another twitter handle;
yourlink - yet another one.
The code will return this list:
[JoIyke, newMan, yourlink]
Find What: ^(\S+).*$
Replace by : \1
You can simply use this to get the first word.Here we are capturing the first word in a group and replace the while line by the captured group.
Find the first word of each line with /^\w+/gm.

Regular expression extract filename from line content

I'm very new to regular expression. I want to extract the following string
"109_Admin_RegistrationResponse_20130103.txt"
from this file content, the contents is selected per line:
01-10-13 10:44AM 47 107_Admin_RegistrationDetail_20130111.txt
01-10-13 10:40AM 11 107_Admin_RegistrationResponse_20130111.txt
The regular expression should not pick the second line, only the first line should return a true.
Your Regex has a lot of different mistakes...
Your line does not start with your required filename but you put an ^ there
missing + in your character group [a-zA-Z], hence only able to match a single character
does not include _ in your character group, hence it won't match Admin_RegistrationResponse
missing \ and d{2} would match dd only.
As per M42's answer (which I left out), you also need to escape your dot . too, or it would match 123_abc_12345678atxt too (notice the a before txt)
Your regex should be
\d+_[a-zA-Z_]+_\d{4}\d{2}\d{2}\.txt$
which can be simplified as
\d+_[a-zA-Z_]+_\d{8}\.txt$
as \d{2}\d{2} really look redundant -- unless you want to do with capturing groups, then you would do:
\d+_[a-zA-Z_]+_(\d{4})(\d{2})(\d{2})\.txt$
Remove the anchors and escape the dot:
\d+[a-zA-Z_]+\d{8}\.txt
I'm a newbie in php but i think you can use explode() function in php or any equivalent in your language.
$string = "01-09-13 10:17AM 11 109_Admin_RegistrationResponse_20130103.txt";
$pieces = explode("_", $string);
$stringout = "";
foreach($i = 0;$i<count($pieces);i++){
$stringout = $stringout.$pieces[$i];
}