Regex Partial String CSV Matching - regex

Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:
Two words,Word,Word,Word,"Number, number"
What I need to do is format it like this:
"Two words","Word","Word","Word","Number, number"
I have had a RegEx pattern of
s/,/","/g
working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma, only the delimited number list.
I managed to write up
s/,[A-Za-z0-9]/","/g
which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that
s/(,)[A-Za-z0-9]\b
should work, but it doesn't.
Anyone have an idea?

My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in perl, I use Text::CSV_XS or DBD::CSV (allowing me SQL to access a CSV file as if it were a table, which, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.
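For comparison, here is a minimal sketch of the same "use a real parser" idea in Python, since no language was tagged. It assumes the input is ordinary comma-separated text with double-quote quoting; the function name quote_all_fields is made up for this example.

```python
import csv
import io


def quote_all_fields(text: str) -> str:
    """Parse CSV text and write it back with every field double-quoted."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    # QUOTE_ALL wraps every field in quotes, whether it needs them or not.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


sample = 'Two words,Word,Word,Word,"Number, number"\n'
print(quote_all_fields(sample), end="")
```

The parser treats the already quoted "Number, number" as a single field, so its comma never needs special handling.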

s/,([^ ])/","$1/ will match a "," followed by a "not-a-space", capturing the not-a-space, then replace the whole thing with "," [quote comma quote] followed by the captured part. Add the g flag to replace every occurrence on the line.
Depending on which regex engine you're using, you might be writing \1 or other things instead of $1.
If you're using Perl or otherwise have access to a regex engine with negative lookahead, s/,(?! )/","/ (a "," not followed by a space) works.
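To see what the two patterns above actually do, here is a small Python sketch (Python's re module writes the backreference as \1 and supports negative lookahead). The sample line is the one from the question; the variable names are just for illustration.

```python
import re

line = 'Two words,Word,Word,Word,"Number, number"'

# Capture-group version: the non-space character is consumed, so it has to be
# put back with \1 in the replacement.
with_group = re.sub(r',([^ ])', r'","\1', line)

# Lookahead version: the character after the comma is only inspected, never
# consumed, so the replacement is just the quote-comma-quote.
with_lookahead = re.sub(r',(?! )', '","', line)

print(with_group)
print(with_lookahead)
```

Both calls give the same result. Note that neither adds the opening quote at the start of the line, and the comma before the already quoted field produces a doubled quote, so some cleanup would still be needed afterwards.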
Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There are lots of other odd corner cases to worry about.

This question is similar to: Replace patterns that are inside delimiters using a regular expression call.
This could work:
s/"([^"]*)"|([^",]+)/"$1$2"/g
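A quick way to try this pattern out is Python's re.sub, sketched below. It relies on Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string; quote_fields is a name invented for the example.

```python
import re

# Either an already quoted field (group 1) or a bare run of characters that
# contains neither a quote nor a comma (group 2).
FIELD = re.compile(r'"([^"]*)"|([^",]+)')


def quote_fields(line: str) -> str:
    """Wrap each field in double quotes, leaving the separating commas alone."""
    # Only one of the two groups matches each time; the other expands to "".
    return FIELD.sub(r'"\1\2"', line)


print(quote_fields('Two words,Word,Word,Word,"Number, number"'))
```

Empty fields and quotes escaped inside a field are not handled by this pattern, so it only suits input as regular as the sample.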

Looks like you're using Sed.
While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotations around it. Otherwise, you're looking at areas of computational complexity regular expressions are not meant to handle.
Through sed, your command would be:
sed 's/[ \"]*,[ \"]*/\", \"/g'
Note that you'll still have to put doublequotes at the beginning and end of the string.

Related

Regex: Trying to extract all values (separated by new lines) within an XML tag

I have a project that demands extracting data from XML files (values inside the <Number>... </Number> tag), however, in my regular expression, I haven't been able to extract lines that had multiple data separated by a newline, see the below example:
As you can see above, I couldn't replicate the multiple lines detection by my regular expression.
If you are using a script somewhere, your first plan should be to use an XML parser. Almost every language has one, and it should be far more accurate than a regex. However, if you just want to use a regex to search for strings inside Notepad++, then you can use \s to match the newlines between the values:
<Number>(\d+\s)+<\/Number>
https://regex101.com/r/MwvBxz/1
I'm not sure I fully understand what you are trying to do so if this doesn't do it then let me know what you are going for.
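As a rough illustration of the parser route, here is a sketch using Python's built-in xml.etree.ElementTree. The question's sample file is not shown, so the XML below (including the Record wrapper element) is a hypothetical stand-in, and numbers_in is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real file layout is not shown in the question.
xml_text = """<Record>
<Type>1</Type>
<Number>111
222
333</Number>
</Record>"""


def numbers_in(document: str) -> list[str]:
    """Collect every whitespace-separated value inside any <Number> element."""
    root = ET.fromstring(document)
    values = []
    for node in root.iter("Number"):
        # split() with no arguments handles newlines, spaces and tabs alike.
        values.extend((node.text or "").split())
    return values


print(numbers_in(xml_text))
```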
You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag:
Find:
.*?<Number>(.*?)<\/Number>.*
Replace:
$1
Finally, I was able to find the right regular expression. I'll leave it below in case anyone needs it:
<Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>)
Explanation:
\d: Shorthand for a digit, same as [0-9]
\n: Newline.
+: Find the previous element 1 to many times.
Have a good day everybody,
After giving it some more thought I decided to write a second answer.
You can make use of look arounds:
(?<=<Number>)[\d\s]+(?=<\/Number>)
https://regex101.com/r/FiaTKD/1
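The same lookaround pattern can be tried outside the editor. The sketch below uses Python's re module, which accepts this lookbehind because <Number> has a fixed width; the sample text and the function name are assumptions made for the example.

```python
import re

# Lookbehind and lookahead assert the tags without making them part of the match.
BETWEEN_TAGS = re.compile(r'(?<=<Number>)[\d\s]+(?=</Number>)')


def number_values(text: str) -> list[str]:
    """Return the values found between <Number> and </Number>, one per item."""
    values = []
    for match in BETWEEN_TAGS.finditer(text):
        values.extend(match.group(0).split())
    return values


sample = "<Type>1</Type>\n<Number>111\n222\n333</Number>"
print(number_values(sample))
```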

EditPad: How to replace multiple search criteria with multiple values?

I did some searching and found tons of questions about multiple replacements with Regex, but I'm working in EditPadPro and so need a solution that works with the regex syntax of that environment. Hoping someone has some pointers as I haven't been able to work out the solution on my own.
Additional disclaimer: I suck with regex. I mean really... it's bad. Like I barely know wtf I'm doing. So that being said, here is what I need to do and how I'm currently approaching it...
I need to replace two possible values, with their corresponding replacements. My two searches are:
(.*)-sm
(.*)-rad
Currently I run these separately and replace each with simple strings:
sm
rad
Basically I need to lop off anything that comes prior to "sm" so I just detect everything up to and including sm, and then replace it all with that string (and likewise for "rad").
But it seems like there should be a way to do this in a single search/replace operation. I can do the search part fine with:
(.*)-sm|(.*)-rad
But then how to replace each with its matching value? That's where I'm stuck. I tried:
sm|rad
but alas, that just becomes the literal complete string that is used for replacement.
Jonathan, first off let me congratulate you for using EPP Pro for regex in your text. It's my main text editor, and the main reason I chose it, as a regex lover, is that its support of regex syntax is vastly superior to competing editors. For instance Notepad++ is known for its shoddy support of regular expressions. The reason of course is that EPP's author Jan Goyvaerts is the author of the legendary RegexBuddy.
A picture is worth a thousand words... So here is how I would do your replacement. Just hit the "replace all button". The expression in the regex box assumes that anything before the dash that is not a whitespace character can be stripped, so if this is not what you want, we need to tune it.
Search for:
(.*)-(sm|rad)
Now, when you put something in parentheses in Regex, those matches are stored in temporary variables. So whatever matched (.*) is stored in \1 and whatever matched (sm|rad) is stored in \2. Therefore, you want to replace with:
\2
Note that the replacement variable may be different depending on what programming language you are using. In Perl, for example, I would have to use $2 instead.
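For anyone who wants to check the idea outside EditPad, here is a small Python sketch of the same search and replace. The input strings are invented for illustration, and strip_prefix is a hypothetical helper name.

```python
import re


def strip_prefix(text: str) -> str:
    """Replace everything up to and including '-sm' or '-rad' with the suffix alone."""
    # Group 1 is the discarded prefix, group 2 is whichever suffix matched.
    return re.sub(r'(.*)-(sm|rad)', r'\2', text)


print(strip_prefix("margin-sm"))
print(strip_prefix("border-rad"))
```

Because the dot does not match a newline by default, each line of a multi-line input is handled on its own.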

Notepad++ masschange using regular expressions

I'm having trouble performing a mass change in a huge log file.
Apart from the file size, which causes problems for Notepad++, I can't use more than 10 parameters in the replacement; up to 9 it works fine.
I need to change numerical values in the file. The values sit inside quotation marks, with a leading and a trailing comma: ,"123,456,789,012.999",
I used an expression to find them and change the format to:
,123456789012.999, (so that there are no quotation marks and no commas within the numeric value)
The expression used to find is:
([,])(["])([0-9]+)([,])([0-9]+)([,])([0-9]+)([,])([0-9]+)([\.])([0-9]+)(["])([,])
and the exp to replace is:
\1\3\5\7\9\10\11\13
The problem is that parameters \11 and \13 do not work (the characters, e.g. .999 in the example, do not appear in the changed values).
So the question is: is there a limit on the number of parameters?
It seems to me that it does not work above 10. For shorter numeric values, where I only need up to 9 parameters, the search and replacement strings work fine. For the example above the search works but the replacement does not: the end of the changed value gets corrupted.
Also, it occurred to me that instead of using Notepad++ I could change the log file directly on the Unix server, but I had trouble building the correct Perl syntax. Could anyone help with that?
After having a little play myself, it looks like back-references \11-\99 are invalid in Notepad++ (which is not that surprising, since this is commonly omitted from regex languages). However, there are several things you can do to improve that regular expression, in order to make this work.
Firstly, you should consider using fewer groups, or alternatively non-capturing groups. Did you really need to store 13 variables in that regex, in order to do the replacement? Clearly not, since you're not even using half of them!
To put it simply, you could just remove some brackets from the regex:
[,]["]([0-9]+)[,]([0-9]+)[,]([0-9]+)[,]([0-9]+)[.]([0-9]+)["][,]
And replace with:
,\1\2\3\4.\5,
...But that's not all! Why are you using square brackets to say "match anything inside", if there's only one thing inside?? We can get rid of these, too:
,"([0-9]+),([0-9]+),([0-9]+),([0-9]+)\.([0-9]+)",
(Note I added a "\" before the ".", so that it matches a literal "." rather than "anything".)
Also, although this isn't a big deal, you can use "\d" instead of "[0-9]".
This makes your final, optimised regex:
,"(\d+),(\d+),(\d+),(\d+)\.(\d+)",
And replace with:
,\1\2\3\4.\5,
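Since the question also asked about doing this on the server instead of in Notepad++, here is a rough Python sketch of the same substitution. The first function is the five-group regex above as-is. The second is my own variation, not part of the answer: it accepts any number of comma-separated digit groups by using a replacement function. All names and sample lines are made up for illustration.

```python
import re

# The optimised pattern from above: exactly four digit groups plus decimals.
FIXED = re.compile(r',"(\d+),(\d+),(\d+),(\d+)\.(\d+)",')

# Variation (an assumption, not from the answer): any number of digit groups.
VARIABLE = re.compile(r',"(\d+(?:,\d+)*\.\d+)",')


def unquote_fixed(line: str) -> str:
    """Drop the quotes and inner commas using five numbered groups."""
    return FIXED.sub(r',\1\2\3\4.\5,', line)


def unquote_variable(line: str) -> str:
    """Drop the quotes and inner commas regardless of how many groups there are."""
    return VARIABLE.sub(lambda m: "," + m.group(1).replace(",", "") + ",", line)


print(unquote_fixed('x,"123,456,789,012.999",y'))
print(unquote_variable('x,"1,234.5",y'))
```

Both patterns consume the surrounding commas, so two quoted values directly next to each other would need a second pass or lookarounds.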
Not sure if the regex groups have limitations, but you could use lookarounds to save 2 groups, and you could also merge some groups in your example. But first, let's get rid of some useless character classes:
(,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
We could merge those groups:
(,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
                                       ^^^^^^^^^^^^^^^^^^^^
We get:
(,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(,)
Let's add lookarounds:
(?<=,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(?=,)
The replacement would be \2\4\6\8.
If you have a fixed length of digits at all times, it's fairly simple to do what you have done. Even though your expression is poorly written, it does the job. If this is the case, look at Tom Lord's answer.
I played around with it a little bit myself, and I would probably use two expressions - makes it much easier. If you have to do it in one, this would work, but be pretty unsafe:
(?:"|(\d+),)|(\.\d+)"(?=,) replace by \1\2
Live demo: http://regex101.com/r/zL3fY5

How to search (using regex) for a regex literal in text?

I just stumbled on a case where I had to remove quotes surrounding a specific regex pattern in a file, and the immediate conclusion I came to was to use vim's search and replace util and just escape each special character in the original and replacement patterns.
This worked (after a little tinkering), but it left me wondering if there is a better way to do these sorts of things.
The original regex (quoted): '/^\//' to be replaced with /^\//
And the search/replace pattern I used:
s/'\/\^\\\/\/'/\/\^\\\/\//g
Thanks!
You can use almost any character as the regex delimiter. This will save you from having to escape forward slashes. You can also use groups to extract the regex and avoid re-typing it. For example, try this:
:s#'\(\\^\\//\)'#\1#
I do not know if this will work for your case, because the example you listed and the regex you gave do not match up. (The regex you listed will match '/^\//', not '\^\//'. Mine will match the latter. Adjust as necessary.)
Could you avoid using regex entirely by using a nice simple string search and replace?
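To illustrate the plain string search idea outside vim, here is a small Python sketch. It assumes the quoted text to find is exactly '/^\//' as given in the question. It also shows re.escape, which builds a pattern that matches a string literally, for cases where a regex engine has to be used anyway. The names are invented for the example.

```python
import re

QUOTED = "'/^\\//'"    # the text '/^\//' including its single quotes
UNQUOTED = "/^\\//"    # the same text without the quotes


def strip_quotes_plain(text: str) -> str:
    """Plain substring replacement: no regex metacharacters to worry about."""
    return text.replace(QUOTED, UNQUOTED)


def strip_quotes_regex(text: str) -> str:
    """Same result through the regex engine, letting re.escape do the escaping."""
    pattern = re.compile(re.escape(QUOTED))
    # A function replacement avoids backslash processing in the replacement text.
    return pattern.sub(lambda m: UNQUOTED, text)


line = "match: '/^\\//',"
print(strip_quotes_plain(line))
print(strip_quotes_regex(line))
```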
Please check whether this works for you. Either put a line number (or range) in front of the substitute command, or place the cursor on the line first:
:s:'\(.*\)':\1:
I used vim 7.1 for this. Of course, you can also visually select the area the command should run on beforehand (use "v" or "V" and move the cursor accordingly).

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and "${template.end}"
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because of the .* matching everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group containing everything between the first and the last character of the text, which are "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry out of the text and, from each entry, the information that is enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
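The difference between the greedy, non-greedy and negated-class versions is easy to see in a short Python sketch. This uses Python's re module rather than QRegExp; as far as I know, Qt4's QRegExp does not accept the *? syntax and offers setMinimal(true) for the whole pattern instead, so treat that part as something to verify against the Qt documentation.

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

# Greedy: .* runs to the last "a", so everything lands in one capture.
greedy = re.findall(r'a(.*)a', text)

# Non-greedy: .*? stops at the first "a" that lets the match succeed.
lazy = re.findall(r'a(.*?)a', text)

# Negated class: the capture can never contain the delimiter at all.
negated = re.findall(r'a([^a]*)a', text)

print(greedy)
print(lazy)
print(negated)
```

The negated class only works when the delimiter is a single character; for longer delimiters such as the real template markers, the non-greedy form is the one that carries over.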
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.