I have a file containing many lines of the following
line 123456 89 2018-02-12 14:47:07 +0000 here
line 234567 90 2019-02-13 09:02:01 +0000 there
So I would like to split them into the last four parts from each line
Here is the regular expression that I have so far:
"\t\d{6}\t\d{2}\t\w+"
It gives out
123456\t89\t2018
234567\t90\t2019
How do I update the regular expression to get
123456\t89\t2018-02-12 14:47:07\there
234567\t90\t2019-02-13 09:02:01\tthere
instead?
Thanks!
The final \w+ in your regex "\t\d{6}\t\d{2}\t\w+" stops at the first non-word character, which happens to be the dash after the year. To capture the rest of that field, I'd recommend a negated character class, [^\t]+, which matches everything except \t. That is:
"\t\d{6}\t\d{2}\t[^\t]+\t\w+"
Usually, this is easier than positively stating all possible characters that might occur.
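To make the difference concrete, here is a small Python sketch comparing the two patterns. It assumes the fields are separated by tabs and that the date, time and offset together form one tab-delimited field; the sample line is rebuilt from the question with explicit tabs.

```python
import re

# Assumption: fields are tab-separated, and the timestamp
# (date, time, offset) is a single tab-delimited field.
line = "\t".join(["line", "123456", "89", "2018-02-12 14:47:07 +0000", "here"])

# Pattern from the question: \w+ stops at the first dash.
original = re.compile(r"\t\d{6}\t\d{2}\t\w+")
# Pattern from the answer: [^\t]+ runs to the next tab.
negated = re.compile(r"\t\d{6}\t\d{2}\t[^\t]+\t\w+")

short_match = original.search(line).group()
full_match = negated.search(line).group()

print(repr(short_match))
print(repr(full_match))
```

Under this assumption the second match keeps the +0000 offset, because it is part of the same field. Dropping it would need capture groups around only the parts to keep.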
I have a csv file provided by a client with a filepath in the first column, then a blank column, then a file size, then two timestamps, then an owner, and a final column which is usually, though not exclusively, blank. It can contain the text of the first 500 characters of the file.
Some of the filepaths contain single occurrences of the double-quote character.
My problem is finding the regex I can use in Notepad++ to find these occurrences in only the first column, and replace them with pairs of double-quotes, so they are properly escaped for a csv file.
Here are three example lines:
"/TCH-EXP/mnt/office/dept/ped/Bill New Structure/_Personal Folders/TFR/PowerPoint/Privat/Emilie Føs"da.ppt","",143872,Mon Mar 5 10:02:22 2007,Mon Mar 5 10:02:22 2007,"TFR012",""
"/TCH-EXP/mnt/office/dept/ped/Bill New Structure/_Personal Folders/TFR/Tfr/Siemens Data/Halfdan "B" data (2).msg","",2092544,Mon Feb 9 09:22:32 2004,Mon Feb 9 09:22:32 2004,"TFR012",""
"/TCH-EXP/mnt/office/dept/ped/Bill New Structure/_Personal Folders/TFR/Tfr/Siemens Data/Halfdan "B" data "20-nov-2003".msg","",1060864,Mon Feb 9 09:22:32 2004,Mon Feb 9 09:22:32 2004,"TFR012",""
In the first line, I need Føs"da.ppt to become Føs""da.ppt
In the second line I just need "B" to be ""B""
In the third line I need "B" to be ""B"" and "20-nov-2003" to be ""20-nov-2003""
Is there one regex search & replace I could use to address all three scenarios?
Thanks very much!
I've tried a simple search using capture groups to spot occurrences of " in the first column, but only by counting the appropriate number of commas.
Search: ^("/TCH-.*)"(.*","",.*,"")
Replace: $1""$2
This seems to work on the first example where there is only one " in the path.
What you might do if you use notepad++ is make use of \G and use a negative lookahead to make sure that the " you select is not followed by ," or the end of the string.
Then replace with the full match $0 followed by an extra double quote.
Find what
(?:\G(?!^)|"/TCH-EXP)[^"]+\K"+(?!,"|$)
Replace with
$0"
Explanation
(?:\G(?!^)|"/TCH-EXP) Either continue from the end of the previous match (but not at the start of a line), or match "/TCH-EXP
[^"]+ Match 1+ times not double quote
\K"+ Forget what was matched and match 1+ times "
(?!,"|$) Negative lookahead to assert what is on the right is not ," or the end of the string
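Outside Notepad++, the same result can be reached differently. Python's built-in re module does not support \G or \K, so the sketch below swaps in another technique: it isolates the first column and doubles the quotes inside it. It assumes, as in the question, that the second column is always blank, so the first column ends at the quote followed by ,"", on each line. The function name and the shortened sample rows are made up for illustration.

```python
import re

# Assumption: the first field ends at the first quote that is
# followed by ,"", (the blank second column from the question).
FIRST_FIELD = re.compile(r'^"(.*?)"(?=,"",)')

def escape_first_column(line: str) -> str:
    """Double every quote that appears inside the first field."""
    def fix(match: re.Match) -> str:
        inner = match.group(1).replace('"', '""')
        return '"' + inner + '"'
    return FIRST_FIELD.sub(fix, line, count=1)

# Shortened, hypothetical rows in the same shape as the examples.
rows = [
    '"/x/Emilie Føs"da.ppt","",143872,"TFR012",""',
    '"/x/Halfdan "B" data "20-nov-2003".msg","",1060864,"TFR012",""',
]
for row in rows:
    print(escape_first_column(row))
```

Lines whose first column contains no stray quotes pass through unchanged, since the replacement only doubles quotes found inside the captured field.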
Problem
I have a long unstructured text from which I need to extract groups of text.
I have an ideal start and end.
This is an example of the unstructured text truncated:
more useless gibberish at the begininng...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
bunch of text with lots of newlines in between... Closing 11.11 1,111.11 111,111.11
more useless gibberish between the groups...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
The word START appears in the middle sometimes multiple times, but it's fine bunch of text with lots of newlines in between... Closing 22.22 2,222.22 222,222.22
more useless gibberish at the end...
separated by new lines...
What I have tried
In the example above, I want to extract out 2 groups of text that lie between START and Closing
I have successfully done so using regex
/(?<=START)(?s)(.*?)(?=Closing)/g
This is the result https://regex101.com/r/vo7CLx/1/
What's wrong?
Unfortunately, I also need to extract the rest of the line containing the Closing string.
If you look at the regex101 link, there's a Closing 11.11 1,111.11 111,111.11 right after the first match and a Closing 22.22 2,222.22 222,222.22 right after the second match, neither of which the regex matches.
Is there a way to do this in a single regex, so that even the ending tag with the numbers is included?
Try this Regex:
(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)
Explanation:
(?s) - single line modifier which means a . in the regex will match a newline
(?<=START) - Positive lookbehind to find the position immediately preceded by a START
(.*?Closing(?:\s*[\d.,])+) - matches 0+ occurrences of any character lazily until the next occurrence of the word Closing which is followed by a sequence (?:\s*[\d.,])+
(?:\s*[\d.,])+ - matches 0+ occurrences of a whitespace followed by a digit or a . or a ,. The + at the end means we have to match this sub-pattern 1 or more times
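As a rough check of the pattern above, here is a Python sketch run against a shortened version of the sample text. The text is abbreviated for illustration, and the variable names are not from the question.

```python
import re

# Abbreviated sample in the same shape as the question's text.
text = "\n".join([
    "gibberish at the beginning",
    "START Fund Class Fund Number Fund Currency",
    "XYZ XYZ XYZ USD",
    "lots of text",
    "Closing 11.11 1,111.11 111,111.11",
    "gibberish between the groups",
    "START Fund Class Fund Number Fund Currency",
    "The word START appears in the middle",
    "Closing 22.22 2,222.22 222,222.22",
    "gibberish at the end",
])

# (?s) lets . cross newlines; the trailing group consumes the
# numbers that follow the word Closing.
pattern = re.compile(r"(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)")

matches = pattern.findall(text)
for block in matches:
    print(repr(block))
```

One thing to watch: because \s* also matches newlines, a following line that begins with a digit, dot or comma would be pulled into the match as well.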
(START)(?s)(.*?)(Closing)(\s+((,?\d{1,3})+\.\d+))+ should match everything you want.
You can try this regex,
START(.*)Closing(.*)(((.?\d{1,3})+.\d+)+.\d+.\d+.\d)\d
I have the following set of strings:
tel:+1 855 345 3455
tel:+185564354
tel:+85523456
tel:1855345445
tel:6047222733
tel:+54434553
tel:+1833453335
I am trying to write a regex that will omit any string value that contains an 855 number that may or may not be preceded by a 1, +, space or a combination of all three.
I tried a few but none seem to give me a 100% accurate match.
The one that seems to work for most strings is: **^tel:[+]?[1]?[ ]?[^8][^5][^5].*$** but it also matches these two string values:
tel:+1 855 345 3455
tel:+185564354
And I am not sure why.
Can any regex whiz help?
Try with following regex.
Regex: ^(?=.*855).*$
Explanation:
(?=.*855) is a positive lookahead for 855 anywhere in the string. The whole string matches only if 855 is present.
An alternative approach is
^[1\s+]*(?:855)[\s0-9]*$
1. ^ is the beginning of the string.
2. [1\s+]* matches any of those characters (1, whitespace or +) 0 or more times.
3. (?:855) is a non-capturing group that means that 855 must be in the string.
4. [\s0-9]* matches any character in the class (whitespace or digit) 0 or more times.
5. $ is the end of the string.
If you don't want spaces following 855, change the character class in 4. to what you want.
Here's one approach:
tel:(?!\+?1?\s?855).*
tel: <- literal match
(?!\+?1?\s?855) <- negative lookahead for any number starting with +1\s855, where the +, the 1 and the \s are each optional but must appear in that order
.* <- match for the rest of the string for strings that aren't caught by the negative lookahead.
https://regex101.com/r/pIcTHA/1
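The following Python sketch runs the negative-lookahead pattern over the strings from the question, and also shows why the question's own attempt lets two 855 numbers through. The variable names are made up for illustration.

```python
import re

numbers = [
    "tel:+1 855 345 3455",
    "tel:+185564354",
    "tel:+85523456",
    "tel:1855345445",
    "tel:6047222733",
    "tel:+54434553",
    "tel:+1833453335",
]

# Pattern from the answer above.
allowed = re.compile(r"tel:(?!\+?1?\s?855).*")
# Pattern from the question.
attempt = re.compile(r"^tel:[+]?[1]?[ ]?[^8][^5][^5].*$")

kept = [n for n in numbers if allowed.fullmatch(n)]
# 855 numbers that the question's attempt still matches.
leaked = [n for n in numbers if attempt.match(n) and not allowed.fullmatch(n)]

print(kept)
print(leaked)
```

The leak happens because every optional piece in the attempt may match nothing. The engine backtracks, skips [+]?, [1]? or [ ]?, and lets [^8][^5][^5] consume characters such as "+", "1" or the space instead of the 855 digits. A lookahead avoids this because it tests the whole prefix at one fixed position.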
Very, very simple question, I think. I am attempting to get a match for the following:
17 SEP 2014 2
Currently the following expressions get a match by ignoring the white spaces between the date and the number (note: the number can be more than a single digit):
^(([0-9])|([0-2][0-9])|([3][0-1]))\ (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s\d{4}\s* ([0-9]+)$
Probably not the most elegant, but as a total beginner, it's a start and does get me a match.
What I really need to be able to check, though, is that there are exactly 25 white spaces between the date and the number. Can anyone tell me how I can get a match only if there are exactly 25 white spaces?
Cheers in advance!
If you want to match exactly 25 whitespace characters, you can use:
\s{25}
What you used (\s*) will match any number of whitespace characters, including zero.
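As a sketch of how that fits into the full expression, the Python snippet below replaces the \s* and the literal space from the question's pattern with \s{25}. The helper name and test strings are made up for illustration.

```python
import re

MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"

# The question's pattern, with the gap between the year and the
# number replaced by exactly 25 whitespace characters.
pattern = re.compile(
    r"^(([0-9])|([0-2][0-9])|([3][0-1])) (" + MONTHS + r")\s\d{4}\s{25}([0-9]+)$"
)

def has_exact_gap(line: str) -> bool:
    """True only when exactly 25 whitespace characters precede the number."""
    return pattern.match(line) is not None

print(has_exact_gap("17 SEP 2014" + " " * 25 + "2"))
print(has_exact_gap("17 SEP 2014" + " " * 24 + "2"))
```

Note that \s also matches tabs and other whitespace. If only the space character should count, a literal space followed by {25} could be used instead.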
It's getting towards the end of the day and this is annoying me - one day I'll find the time to learn regex properly as I know it can save a lot of time when extracting info from text.
I need to match strings that match the following signature:
6 spaces followed by up to 31 alphanumerics (or spaces) and then no more alphanumeric text on that line.
E.g.
' sampleheading ' - is fine
' sampleheading 10^21/1 ' - should not match
' sampleheading sample ' - should not match
I've got ^(\s{6}[\w\s]{1,31}) matching the first bit correctly I think but I can't seem to get it to only select lines that don't have any text following the initial match.
Any help appreciated!
Edit:
I've updated the text as a number of you noted my hastily entered original samples would actually all have tested fine.
Use $ to match end of line:
^(\s{6}[\w\s]{1,31})$
Or, if you may still have spaces afterwards that you want to ignore:
^(\s{6}[\w\s]{1,31})\s*$
You can use a $ to indicate the end of a line, using \s* to allow optional whitespace at the end.
^\s{6}[\w\s]{1,31}\s*$
Your samples don't match what you say you want, however. They only start with four spaces, rather than six, and, in the last sample, "sampleheading sample" is within the 31 character limit, so it matches, too. (The middle sample is within the length, too, but has non-word characters in it, so it doesn't match.) Is that what you want?
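The behaviour described above can be checked with a short Python sketch. The test strings are built with an explicit number of leading spaces, since the spacing in the question's samples is unclear; they are illustrative only.

```python
import re

# Six whitespace characters, 1 to 31 word or whitespace characters,
# optional trailing whitespace, then end of line.
pattern = re.compile(r"^\s{6}[\w\s]{1,31}\s*$")

def is_heading(line: str) -> bool:
    return pattern.match(line) is not None

indent = " " * 6
print(is_heading(indent + "sampleheading   "))        # heading only
print(is_heading(indent + "sampleheading 10^21/1 "))  # contains ^ and /
print(is_heading(indent + "sampleheading sample"))    # fits in 31 characters
print(is_heading(" " * 4 + "sampleheading"))          # only four leading spaces
```

Because [\w\s] accepts spaces, a second word that fits within the 31 characters still matches, which is the point raised about the last sample.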
add a $ to match the end of the line, e.g.
^(\s{6}[\w\s]{1,31})$
Aren't you simply saying 'match 6 spaces followed by 31 alphanumerics'? There's no concept there of 'and no more alphanumerics'.
I think what you have is good so far (!), but you need to follow it with (say) [^\w] - i.e. 'not an alphanumeric'.
Try this one out:
^\s{6}[\w\s]{1,31}\W.*$