I'm trying to match a file which is delimited by multiple spaces. The problem I have is that the first field can contain a single space. How can I match this with a regex?
Eg:
Name Other Data Other Data 2
Bob Smith XX1 0101010101
John Doe XX2 0101010101
Bob Doe XX3 0101010101
John Smith XX4 0101010101
Can I split these lines into three fields with a regex, splitting by a space but allowing for the single space in the first field?
Hi the following regex should work
(\w*\s\w*)\s+\w{2}\d\s+\d*
This would work:
Pattern:
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Replacement:
+$1+ -$2- *$3*
$1 contains the first column, $2 the second and $3 the third one.
Example:
http://regexr.com?32tbt
You could split at two or more spaces:
[ ]{2,}
But you are probably better off, determining the lengths of the captures of this regular expression:
(Name[ ]+)(Other Data[ ]+)
And then to use a simple substring method that slices your lines into portions of the same length.
So in your case the first capture would be 15 characters long, the second 14 and the column would have 13 (but the last one doesn't really matter, which is why it isn't actually captured). Then you take the first 15, the next 14 and the remaining characters of every line and trim each one (remove trailing whitespace).
I think the simplest is to use a regex that matches two or more spaces.
/ +/
Which breaks down as... delimiter (/) followed by a space () followed by another space one or more times (+) followed by the end delimiter (/ in my example, but is language specific).
So simply put, use regex to match space, then one or more spaces as a means to split your string.
Usually, with this kind of files, the best approach is to get a substring based on where your required information is and then trim it. I see your file contains 16 chars before the second field, you can get a substring of length 16 from the beginning which will contain your desired text. You should trim it to get only the text you need without the spaces.
If the spacing pattern you posted is consistent (if it won't change among different files of this kind) you have also another problem: what happens to longer names?
Name Other Data
Johnny AppleseeXX1
TutankamonfirstXX2
if you really want to use a regex, be sure to avoid those corner cases.
Related
I need to make this exercise about regexes and text manipulation in vim.
So I have this file about the most scoring soccer players in history, with 50 entries looking like this:
1 Cristiano Ronaldo Portugal 88 121 0.73 03 Manchester United Real Madrid
The whitespaces between the fields are tabs (\t)
The fields each respond to a differen category: etc...
This last field contains one or more clubs the player has played in. (so not a fixed number of clubs)
The question: replace all tabs with a ';', except for the last field, where the clubs need to be seperated by a ','.
So I thought: I just replace all of them with a comma, and then I replace the first 7 commas with a semicolon. But how do you do that? Everything - from regex to vim commands - is allowed.
The first part is easy: :2,$s/\t/,/g
But the second part, I can't seem to figure out.
Any help would be greatly appreciated.
Thanks, Zeno
This answer is similar to #Amadan's, but it makes use of the ability to provide an expression as the replace string to actually do the difficult bit of changing the first set of tabs to semicolons:
%s/\v(.{-}\t){7}/\=substitute(submatch('0'), '\t', ';', 'g')/|%s/\t/,/g
Broken down this is a set of three substitute commands. The first two are cobbled together with a sub-replace-expression:
%s/\v(.{-}\t){7}/\=substitute(submatch('0'), '\t', ';', 'g')/
What this does is find exactly seven occurrances ({7}) of any character followed by a tab, in a non-greedy way. ((.{-}\t)). Then we replace this entire match (submatch(0)) with the result of the substitute expression (\=substitute(...)). The substitute expression is simple by comparison as it just converts all tabs to semicolons.
The last substitute just changes any other tabs on the line to commas.
See :help sub-replace-expression
Here's one way you could do it:
:let #q=":s/\t/;\<cr>"
:2,$norm 7#q
:2,$s/\t/,/g
Explanation:
First, we define a macro 'q' that will replace one tab with a semicolon. Now, on any line we can simply run this macro n times to replace the first n tabs. To automatically do this to every line, we use the norm command:
:2,$norm 7#q
This is essentially the same thing as literally typing 7#q (e.g. "run macro 'q' seven times") on every line in the specified range. From there, we can simply replace every tab with a comma.
:2,$s/\t/,/g
:2,$s/\t\(.*\t\)\#=/;/g
:2,$s/\t/,
Change any tabs where there is a tab later to ;
Change any remaining tabs to ,
EDIT: Misunderstood. Here is a fixed version:
:2,$s/\(\(\t.*\)\{7}\)\#<=\t/,/g
:2,$s/\t/;/g
Change any tabs where there's seven tabs before it to ,
Change any remaining tabs to ;
My PatternsOnText plugin has (among others) a :SubstituteSelected command that allows to specify the match positions. With this, you can easily replace the first 8 tabs with semicolons, and then use a regular substitute to change the remaining tabs into commas:
:2,$SubstituteSelected/\t/;/g 1-8
:2,$s/\t/,/g
We solved the issue by just capturing the first 8 groups manually ([^\t]*\t)(...)(...) and then separate them with a semicolon (\1;\2;...;) then replacing the remaining tabs with comma's | 2,$s/\t/,/g
Thanks to everyone trying to help!
I need a regex which will match a phrase (with specific length and structure) even if there is additional white space in the middle (anywhere).
Let's say we have some description:
Serial numbers: ABC1234567890 XYZ0987654321
Then we want to find all phrases matching to regex [A-Z]{3}[0-9]{10}, but that description is malformed because of processing by external service. That service splits description to chunks, 12 digits each. So it will be:
Serial numbe
rs: ABC12345
67890 XYZ098
7654321
Important: "Serial numbers:" isn't fixed, it can be everything, so required phrases can be split anywhere (ABC1 234567890, ABC1234567 890 etc.). New line and space have the same meaning from the phrase matching perspective, but in special cases there can be more white chars between parts of phrase (for example, space as last char of chunk + new line, multiple spaces in source description). It just simply should treat whole "white space" between two strings as 1 space (ABC1 234567890 = ABC1234 567890, also with new line break). Those serials can be anywhere in malformed description (as I wrote: "Serial numbers:" part is optional, can be anything), also there can be more serial numbers within description. [A-Z]{3}[0-9]{10} also is only an example, I want to know how to achieve matching with optional white space in the middle, but base regex can be different.
EXPECTED RESULT: collection of matched phrases (serial numbers from the example).
ABC1234567890
XYZ0987654321
Info: result can contain white chars within phrase (from above example it would be: ABC12345 67890 and XYZ098 7654321). Most important thing is to match the base phrase (serial number).
Is it possible to make regex which will match it? I think it would be rather simple algorithm to match it without regex, but maybe it can be done with regular expression and make it "oneliner".
this will handle multiple spaces multiple times
(([A-Z]\s*){3}([0-9]\s*){10})
will match AB C A A A A AD E12 34567890
since AD E12 34567890 fits the pattern
https://regex101.com/r/bK3sF8/1
Edit:
Just considering one(you can adjust for multiples) \n (break lines) in and outside the word here:([\w\n?]*)
You should try grouping the result
in this case:
/(([\w\n?]*)\s([\w\n?]*):\s([\w\n?]*)\n?\n?\s([\w\n]*))/ig
you can get the serial number by groups $3 and $4
http://regexr.com/3d67n
This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 Letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is look for any of those rogue logger01 entries (including the space after it) optionally: That trailing ? after the group means that it can match 0 or 1 time: if it does match, it will. If it's not there, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any strange trailing spaces, the ? (a space character followed by ?) will optionally match the trailing space.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
Paste your regex here https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue as it may catch the first digit of the timestamp when the logger misses out a comma. Append an '\n' to the end of the tmp string should work provided the regex catches all cases.
I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Lets call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
#maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is to adapt your pattern ([\w]+\s+){<NumWordsOut>} to, including some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. 2. 3. 4. 5.
Match any amount of delimiters before the first word
Match a word (= at least one non-delimiter)
The word has to be followed by at least one delimiter
Or it can be at the end of the string (in case no delimiter follows at the end)
Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
Thanks to #maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using #maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx
So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which is translated to
4 alpha character followed by
2 alpha characters followed by
2 characters followed by
3 optional character
this is a standard format for SWIFT BIC code HSBCGB2LXXX
my regex to pull this out of string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this is targeting a specific tag (32) and works, however, I'm not sure if it's the cleanest, plus if there are any characters before H then it fails.
the string being matched against is:
:32B:HsBfGB4LXXXHELLO
the following returns HSBCGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity. I have a string which contains multiple lines all starting with :2xnumber:optional letter (eg, :58A:) i want to specify a line to start matching in and return a BIC from anywhere in the line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
Ok, so I need to pass in as a variable the tag 53 or 59 and return the BIC HSBCGB2LXXX only!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Lost the look behind - just make it part of the match
Inserting .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and less characters)
Group 1 of the match contains your target
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, probably further improvements are also possible.