I have a big text file with addresses, and I want to split the data into 3 variables. Example:
NM_LOGRADO
Street BLA BLA BLA 340
Av BLE BLI 318
Road BLI 48 Block 4
I want to transform it into:
NM_LOGRADO
Street(TAB)BLA BLA BLA(TAB)340
Av(TAB)BLE BLI(TAB)318
Road(TAB)BLI(TAB)48 Block 4
Basically, replace the first space and the space just before the first number with a tab.
I'm using Notepad++, and for the second replacement I tried replacing ' (?=[0-9])(?<=)' with '(TAB)', but it replaced all spaces before numbers (in the third line I got Road(TAB)BLI(TAB)48 Block(TAB)4). For the first replacement I have no idea :(
Go to Search > Replace menu (shortcut CTRL+H) and do the following:
Find what:
(?:^.+?\K | (?=[0-9]+.+))
Replace:
\t
Select radio button "Regular Expression"
Then press Replace All
You can test it with your example at regex101.
Update1:
Based on your updated sample, try this:
Find:
^([^ ]+) ([^0-9]+) (.+)
Replace:
$1\t$2\t$3
Test it at regex101.
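For checking the Update1 pattern outside Notepad++, here is a minimal Python sketch. It assumes Python's re module behaves like Notepad++ for this pattern; the helper name split_address is made up for illustration, and groups are written \1 instead of $1. Each line is processed on its own because [^ ] and [^0-9] would also match a line break.

```python
import re

# Same pattern as in Update1; only the replacement syntax differs (\1 vs $1).
PATTERN = re.compile(r"^([^ ]+) ([^0-9]+) (.+)")

def split_address(line: str) -> str:
    """Insert a tab after the first word and before the first number."""
    return PATTERN.sub(r"\1\t\2\t\3", line, count=1)

sample = [
    "NM_LOGRADO",
    "Street BLA BLA BLA 340",
    "Av BLE BLI 318",
    "Road BLI 48 Block 4",
]
converted = [split_address(line) for line in sample]
for line in converted:
    print(line)
```

The header line has no space, so the pattern does not match and the line is left unchanged.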
Update2:
Based on your updated sample, try this:
Find:
(?:^[^ ]+\K |(?<!Block|Ap) (?=[0-9]))
Replace:
\t
Test it at regex101.
I'm assuming that (TAB) refers to a tab character rather than a literal string.
Find what: ^(\w*) ((([A-Z]{3})( )?)+) (\d.*)$
Replace with: \1\t\2\t\6
(If my assumption was incorrect, replace \t with \(TAB\))
The key is the optional space: ( )?. Because it is optional, the regex can backtrack and leave the last space of the middle part outside group 2. The spaces before and after group 2 are therefore matched but not captured, and they are replaced by the tab characters.
Explanation of regular expressions:
^ Beginning of line
(\w*) Any number of word characters (letters, digits, underscore), i.e. "Street", "Av", "Road"
((([A-Z]{3})( )?)+) 3 uppercase letters, followed by an optional space, once or more, i.e. "BLA BLA BLA", "BLE BLI", "BLI"
(\d.*) A digit, followed by any number of any characters, i.e. "340", "318", "48 Block 4"
$ End of line
\1 First capture group, "(\w*)"
\t Tab character
\2 Second capture group, "((([A-Z]{3})( )?)+)"
\t Tab character
\6 Sixth capture group, "(\d.*)"
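To see which text ends up in which numbered group, here is a small Python sketch of the same pattern. It assumes Python's re numbers the groups the same way (by opening parenthesis), and the helper name show_groups is hypothetical.

```python
import re

# Pattern from the answer above; groups are numbered by their opening parenthesis.
ANSWER_PATTERN = re.compile(r"^(\w*) ((([A-Z]{3})( )?)+) (\d.*)$")

def show_groups(line: str):
    """Return groups 1, 2 and 6, the three pieces kept by the replacement."""
    m = ANSWER_PATTERN.match(line)
    return None if m is None else (m.group(1), m.group(2), m.group(6))

parts = show_groups("Road BLI 48 Block 4")
rewritten = ANSWER_PATTERN.sub(r"\1\t\2\t\6", "Street BLA BLA BLA 340")
print(parts)
print(rewritten)
```

Groups 3 to 5 only hold the last repetition inside group 2, which is why the replacement skips from \2 to \6.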
As you're using Notepad++, the easiest way is not to bother with regex but rather to use a macro. Simply record one and play it until the end of the file. You'll want to:
put your cursor at the first character of the file
Macros > Start Recording
find a space and convert it to tab (this will replace the first space of the row)
press END to go to the end of the line
use "find previous" command to find the last space of the line
replace that space with tab
go to the next line
Macros > Stop Recording
Run your macro till the end of the file
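What the recorded steps do to a single line can be sketched in Python as below. This is only an illustration of the macro's effect, not something Notepad++ runs; the function name macro_step is made up. Note that the macro targets the last space of the line, so for a line such as "Road BLI 48 Block 4" that is the space before the final 4.

```python
def macro_step(line: str) -> str:
    """Mimic one macro pass: first and last space of the line become tabs."""
    line = line.replace(" ", "\t", 1)        # first space -> tab
    head, sep, tail = line.rpartition(" ")   # last remaining space, if any
    return f"{head}\t{tail}" if sep else line

sample = [
    "NM_LOGRADO",
    "Street BLA BLA BLA 340",
    "Av BLE BLI 318",
    "Road BLI 48 Block 4",
]
result = [macro_step(line) for line in sample]
for line in result:
    print(line)
```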
I have three lines of tab-separated values:
SELL 2022-06-28 12:42:27 39.42 0.29 11.43180000 0.00003582
BUY 2022-06-28 12:27:22 39.30 0.10 3.93000000 0.00001233
_____2022-06-28 12:27:22 39.30 0.19 7.46700000 0.00002342
The first two have 'SELL' or 'BUY' as the first value but the third one does not; it starts with a tab, which I have marked with _____.
I would like to capture the following using Regex:
My expression ^(BUY|SELL).+?\r\n\t does not work as it gets me this:
I do know why it outputs this; adding a lazy marker '?' obviously won't help. I can't get lookarounds to work either, if they are the right tool at all. I need something like 'Match \r\n\t only or \r\n(?:^\t) at the end of each line'.
The final goal is to make the three lines look like this at the end, so I will need to replace the match using capturing groups:
Can anyone point me to the right direction?
Ctrl+H
Find what: ^(BUY|SELL).+\R\K\t
Replace with: $1\t
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(BUY|SELL) # group 1, BUY or SELL
.+ # 1 or more any character but newline
\R # any kind of linebreak
\K # forget all we have seen until this position
\t # a tabulation
Replacement:
$1 # content of group 1
\t # a tabulation
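The same idea can be tried in Python, with two substitutions that are my assumptions: Python's re has no \K, so the rest of the line is captured in a second group and written back, and \r?\n stands in for \R. The sample rows are shortened versions of the ones in the question.

```python
import re

# (.+\r?\n) is captured and re-emitted because Python's re lacks \K.
CONTINUATION = re.compile(r"^(BUY|SELL)(.+\r?\n)\t", re.MULTILINE)

rows = (
    "SELL\t2022-06-28 12:42:27\t39.42\t0.29\n"
    "BUY\t2022-06-28 12:27:22\t39.30\t0.10\n"
    "\t2022-06-28 12:27:22\t39.30\t0.19\n"
)

# \1 = BUY/SELL, \2 = rest of that line incl. line break, then \1 again plus the tab.
filled = CONTINUATION.sub(r"\1\2\1\t", rows)
print(filled)
```

Only a line that starts with a tab directly after a BUY or SELL line is changed; the first two rows stay as they are.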
You can use the following regex ((BUY|SELL)[^\n]+\n)\s+ and replace with \1\2\t.
Regex Match Explanation:
((BUY|SELL)[^\n]+\n): Group 1
(BUY|SELL): Group 2
BUY: the literal characters "BUY"
|: or
SELL: the literal characters "SELL"
[^\n]+: one or more characters other than newline
\n: newline character
\s+: one or more whitespace characters (here, the leading tab of the next line)
Regex Replace Explanation:
\1: Reference to Group 1
\2: Reference to Group 2
\t: a tab, restoring the separator that \s+ consumed
Check the demo here. Tested on Notepad++ in a private environment too.
Note: Make sure to check the "Regular expression" checkbox.
I have a list of thousands of records within a .txt document.
Some of them look like these records:
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs, I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion", etc.; there are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) sign; however, even for records without a . (dot) it would not work. I'm not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
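A rough Python equivalent, for testing outside Notepad++: Python's re supports neither \K nor \h, so this sketch captures the part to keep in group 1 and uses [ \t] for horizontal whitespace. Both substitutions are assumptions on my side, not part of the Notepad++ recipe.

```python
import re

# Group 1 replaces \K (keep everything before), [ \t]+ replaces \h+.
LAST_QUOTED = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)

records = (
    '201910031044 "00059" "11.31AG" "Senior Champion"\n'
    '201910031044 "00999" "10.12G" "ProAM"'
)
trimmed = LAST_QUOTED.sub(r"\1", records)
print(trimmed)
```

Because .+ is greedy, the match gives back as little as possible, so only the last quoted field of each line is dropped.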
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
One more issue, not related to character counts: the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it will only match literal dot characters. It's usually used as a wildcard but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of " +" (a space followed by +). As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
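The diagnosis can be double-checked with a short Python sketch, assuming Python's re treats these two patterns the same way Notepad++ does. The variable names are only for illustration.

```python
import re

# The pattern from the question: wrong digit counts and a one-character last field.
ORIGINAL = re.compile(r'^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$')
# The shortened, corrected pattern from this answer.
CORRECTED = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$')

record = '201910031044 "00362" "113.1LI" "Abcd"'

original_matches = ORIGINAL.search(record) is not None
trimmed_record = CORRECTED.sub(r"\1", record)

print(original_matches)
print(trimmed_record)
```

The original pattern never matches, so nothing is replaced; the corrected one keeps the first three fields and drops the rest of the line.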
If you want to get rid of the last words/numbers/etc in the last quotation marks, you could capture everything before them in a group, then match the last pair of quotation marks and everything between them using a negated character class, so that only the captured part is kept.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2 or more spaces or tabs
"[^"\r\n]+" Match 1+ characters other than " or a line break, between "
In the replacement, use group 1: $1
Let me show an example. My file looks like this:
AaaAab
AacAaa
AacAap
AaaBbb
I would like to delete all the lines in which either the first three or the second three characters are the same character. That means only AacAap would remain from the example above.
You can use something like:
^(?:(.)\1\1.*|.{3}(.)\2\2.*)$
Put that in the "Find what" field, and put an empty string in the "Replace with" field.
Here's a demo.
Ctrl+H
Find what: ^(?:(.)\1\1|...(.)\2\2).*\R
Replace with: LEAVE EMPTY
UNcheck Match case
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all
Explanation:
^ : beginning of line
(?: : start non capture group
(.) : group 1, any character but newline
\1\1 : same as group 1, twice
| : OR
... : 3 any character
(.) : group 2, any character but newline
\2\2 : same as group 2, twice
) : end group
.* : 0 or more any character
\R : any kind of linebreak
Result for given example:
AacAap
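As a quick cross-check outside Notepad++, the same test can be written as a line filter in Python. This is a sketch under the assumption that re.IGNORECASE corresponds to leaving "Match case" unchecked; the .*\R tail is not needed because whole lines are dropped from a list.

```python
import re

# Same alternation as above; IGNORECASE mirrors the unchecked "Match case" box.
TRIPLE = re.compile(r"^(?:(.)\1\1|...(.)\2\2)", re.IGNORECASE)

lines = ["AaaAab", "AacAaa", "AacAap", "AaaBbb"]

# Keep only the lines where neither block of three repeats one character.
kept = [line for line in lines if not TRIPLE.match(line)]
print(kept)
```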
You can use this pattern:
^(?:...)?(.)\1\1.*\r?\n?
The part (.)\1\1 matches three consecutive same characters with a capture and two backreferences. (?:...)? makes the three first characters optional, this way the consecutive characters can be at the beginning of the line or at the 4th position.
.*\r?\n? is only here to match all remaining characters of the line including the line break (you can preserve line breaks if you want, you only have to remove \r?\n?).
Try the following regex: (?im)^(?:...)?(.)\1\1.*(?:\R|\z).
To try the regex online and get an explanation, please click here.
Problem
I have a long unstructured text from which I need to extract groups of text.
Each group has a well-defined start and end.
This is an example of the unstructured text truncated:
more useless gibberish at the begininng...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
bunch of text with lots of newlines in between...
Closing 11.11 1,111.11 111,111.11
more useless gibberish between the groups...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
The word START appears in the middle sometimes multiple times, but it's fine
bunch of text with lots of newlines in between...
Closing 22.22 2,222.22 222,222.22
more useless gibberish at the end...
separated by new lines...
What I have tried
In the example above, I want to extract out 2 groups of text that lie between START and Closing
I have successfully done so using regex
/(?<=START)(?s)(.*?)(?=Closing)/g
This is the result https://regex101.com/r/vo7CLx/1/
What's wrong?
Unfortunately, I also need to extract the rest of the line containing the Closing string.
If you look at the regex101 link, there's a Closing 11.11 1,111.11 111,111.11 at the end of the first group and a Closing 22.22 2,222.22 222,222.22 at the end of the second group, which the regex does not match.
Is there a way to do this in a single regex, so that even the ending tag with the numbers is included?
Try this Regex:
(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)
Click for Demo
Explanation:
(?s) - single line modifier which means a . in the regex will match a newline
(?<=START) - Positive lookbehind to find the position immediately preceded by a START
(.*?Closing(?:\s*[\d.,])+) - matches 0+ occurrences of any character lazily until the next occurrence of the word Closing which is followed by a sequence (?:\s*[\d.,])+
(?:\s*[\d.,])+ - matches 0+ occurrences of a whitespace followed by a digit or a . or a ,. The + at the end means we have to match this sub-pattern 1 or more times
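The pattern can be tried in Python as well, since re supports the inline (?s) flag at the start of a pattern and fixed-width lookbehind. The sample text below is a shortened stand-in for the one in the question, with the line breaks being my assumption.

```python
import re

# Pattern from the answer, unchanged.
BLOCK = re.compile(r"(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)")

text = (
    "gibberish at the beginning\n"
    "START Fund Class Fund Number Fund Currency\n"
    "XYZ XYZ XYZ USD\n"
    "bunch of text\n"
    "Closing 11.11 1,111.11 111,111.11\n"
    "gibberish between the groups\n"
    "START Fund Class Fund Number Fund Currency\n"
    "The word START appears in the middle\n"
    "Closing 22.22 2,222.22 222,222.22\n"
    "gibberish at the end\n"
)

# findall returns the single capture group for each match.
blocks = BLOCK.findall(text)
for block in blocks:
    print(repr(block))
```

Each block starts right after the first START of its group and ends with the last number on the Closing line.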
(START)(?s)(.*?)(Closing)(\s+((,?\d{1,3})+\.\d+))+ should match everything you want, see here!
You can try this regex,
START(.*)Closing(.*)(((.?\d{1,3})+.\d+)+.\d+.\d+.\d)\d
I've a collection of strings like that (each "space" is a tabulation):
29 301 3 31 0 TREZILIDE Trézilidé
2A 001 1 73 1 (LE) AFA (Le) Afa
What I want is to transform it into this:
29301 Trézilidé
2A001 (Le) Afa
Suppression of the first tabulation
suppression of the tabulations, numbers and the first uppercase occurrence (and replacement of the whole stuff by a space)
replacement of the last tabulation by a space
My biggest problems are:
How to select the first tabulation without selecting the "prefix" and the "suffix"? (like ^(..)\t[0-9] but without selecting ^(..) nor [0-9])
How to select from after the 3 digits to after the tabulation of the uppercase word?
I do that in a text file with the search and replace toolbox of Notepad++
Thanks in advance for your help!
How to select the first tabulation without selecting the "prefix" and the "suffix"?
Optimally this is done using lookahead and lookbehind assertions, but Notepad++ doesn't support those before version 6.0. The next best solution is to just capture them, then backreference them in the replacement string.
Here's how I did it (in answer to your full question):
Check Match case to do a case-sensitive find
Find by regex:
^(..)\t(\d\d\d)[\tA-Z0-9()]+\t(.+)$
Replace with:
\1\2 \3
I end up with this, where <tab> represents an actual tabulation:
29301 Trézilidé
2A001 (Le)<tab>Afa
To get rid of that I do an extended find:
\t
And replace it with the space character, to obtain the final result:
29301 Trézilidé
2A001 (Le) Afa
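The two steps can be reproduced in Python as a sanity check. This is a sketch that assumes the fields really are separated by single tabs; the function name clean is hypothetical. Python's re is case-sensitive by default, which corresponds to checking "Match case".

```python
import re

# Step 1: keep the 2-character prefix, the 3 digits and the trailing name.
STEP1 = re.compile(r"^(..)\t(\d\d\d)[\tA-Z0-9()]+\t(.+)$")

def clean(line: str) -> str:
    merged = STEP1.sub(r"\1\2 \3", line)
    # Step 2: any tab left inside the name becomes a space.
    return merged.replace("\t", " ")

rows = [
    "29\t301\t3\t31\t0\tTREZILIDE\tTrézilidé",
    "2A\t001\t1\t73\t1\t(LE)\tAFA\t(Le)\tAfa",
]
cleaned = [clean(row) for row in rows]
for row in cleaned:
    print(row)
```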
Try
^(..)\t
Replace with
\1
Then
\(*[A-Z][A-Z]+\)*
Replace with empty string, removes (LE) and AFA too.
''
Then
^(.....).*(\t[A-Za-z]+)+$
Replacement:
\1 \2
And finally:
\t
Replace with a space. Every occurrence.
HTW