How to remove brackets and quotes using SAS

I have a list of artists that is formatted like so:
['Justin Bieber']
['Brockhampton']
etc
and I want to make it so these variables no longer have quotes or brackets and instead look like:
Justin Bieber
Brockhampton
How would I do this?

Use the compress function.
artist = compress(artist, "[']");
The second argument adds both square brackets and the quotation mark to the list of characters to remove.
I'm doing this entirely from memory and it's years since I used SAS, so it might struggle with the quotation mark inside the quotation marks. You could also try
artist = compress(artist, '[]', 'p');
where the third argument adds all punctuation marks to the list of characters to remove.
Anyway, the compress function is what you want. Experiment with it if the exact arguments above don't quite work!

Related

Are there any characters that are not allowed/used in regex

I have the somewhat odd requirement that several regexes be passed as one single string to a Jenkins plugin.
They are entered in a single text field, and I have to split this string into a list of regexes later on.
The issue is that I can't think of a way to delimit the regexes so that I can split the string later, since a character like , could also be part of a regex itself.
E.g. if I used a , to join the two regexes "(\d+,?\s+\d{1})\.xls" and "\w+\.exe":
"(\d+,?\s+\d{1})\.xls,\w+\.exe"
would be split into 3 regexes: "(\d+", "?\s+\d{1})\.xls" and "\w+\.exe"
where the first 2 are obviously invalid.
So my actual question is, are there any characters, that can never appear in a regex which I could use to delimit my regexes?
No, any and all characters can appear in a regex. Use any serialisation format to serialise your list of strings into a clearly expressed list format, e.g. JSON:
["(\\d+,?\\s+\\d{1})\\.xls", "\\w+\\.exe"]
Alternatively CSV or anything else that can express a list of things and properly escapes characters used to denote item separators.
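As a rough illustration of the serialisation idea, here is a minimal Python sketch (the plugin itself would be Java; Python is used only to show the round trip). The variable names are made up for this example, and the two patterns are the ones from the question.

```python
import json
import re

# The two example regexes from the question, written as raw strings.
patterns = [r"(\d+,?\s+\d{1})\.xls", r"\w+\.exe"]

# Serialise: JSON escapes each backslash, and the comma inside the first
# pattern is safe because it sits inside a quoted JSON string.
serialized = json.dumps(patterns)

# Later (e.g. after reading the single text field), parse and compile.
restored = json.loads(serialized)
compiled = [re.compile(p) for p in restored]
```

The serialized value is exactly what a user would type into the single text field; no delimiter character needs to be reserved.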

Adjust existing regex to ignore semicolon inside quotes

I am using a regex to read CSV files and split their columns. The content of the files changes frequently and is unpredictable, although the format stays the same. I currently use the following regex to split the columns:
;(?=(?:[^\"]*\"*[^\"]*\")*[^\"]*$)
It worked until I encountered input like this:
'02'.'018'.'7975';PRODUCT 1;UN;02
'02'.'018'.'7976';PRODUCT 2;UN;02
'02'.'018'.'7977';PRODUCT 3;UN;02
'02'.'018'.'7978';"PRODUCT 4 ; ADDITIONAL INFO";UN;02 // Problem
'02'.'018'.'7979';"PRODUCT 5 ; ADDITIONAL INFO";UN;02 // Problem
I would like to understand how I can adjust my regex and adapt it to ignore semicolon inside quotes.
I am using Java with the method split from String class.
Bear in mind that you should probably use a parser for this, but if you must use regex, here's one that should work:
;(?=[^"]*(?:(?:"[^"]*){2})*$)
Explanation
; matches the semicolon.
(?=...) is a positive lookahead. It checks that the pattern contained in it will match, without actually matching it.
[^"]*(?:(?:"[^"]*){2})*$ ensures that there are an even number of quotes in the rest of the string.
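To see the lookahead in action, here is a small Python sketch (the question uses Java's String.split, where the same pattern would need its backslashes and quotes escaped in the string literal). The sample rows come from the question with the trailing "// Problem" notes dropped; the csv-based alternative at the end is only one possible way to follow the "use a parser" advice.

```python
import csv
import re

# Split on a semicolon only if an even number of double quotes follows it.
SPLIT_RE = re.compile(r';(?=[^"]*(?:(?:"[^"]*){2})*$)')

plain = """'02'.'018'.'7975';PRODUCT 1;UN;02"""
quoted = """'02'.'018'.'7978';"PRODUCT 4 ; ADDITIONAL INFO";UN;02"""

fields_plain = SPLIT_RE.split(plain)
fields_quoted = SPLIT_RE.split(quoted)  # quoted field keeps its quotes

# Parser-based alternative: the csv module handles the quoting itself
# and strips the surrounding double quotes from the field.
parsed = next(csv.reader([quoted], delimiter=";", quotechar='"'))
```

The regex split keeps the surrounding quotes on the quoted field, while the parser removes them, which is usually what is wanted downstream.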

Regex for text before a comma within double brackets

I am trying to extract text that is not only inside of double brackets but also before a comma. I have only been able to solve the two issues separately (I think) but can't figure out how to bring them together.
Double brackets: """\[\[(.+?)\]\]*"""
Before comma: """([^,]+)"""
I think you definitely need to escape the []:
\[\[(.+?)\]\]*
If it needs to be before a comma as well, can't you use:
(\[\[(.+?)]]*,)
I might be missing something here. Sorry.
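If the goal is instead the text between the opening [[ and the first comma inside the brackets, one possible reading of the question, a sketch along these lines might work. It is written in Python for illustration; the sample text and the name BEFORE_COMMA are invented here, and the pattern assumes the bracketed text contains no nested brackets.

```python
import re

# Capture everything after "[[" up to the first comma, and require that
# the brackets are closed afterwards. Bracket pairs without a comma
# are skipped entirely.
BEFORE_COMMA = re.compile(r"\[\[([^,\]]+),[^\]]*\]\]")

# Hypothetical sample input, not taken from the question.
text = "born in [[Paris, France]] and raised in [[London]]"

found = BEFORE_COMMA.findall(text)
```

Here found contains only "Paris": the second pair of brackets has no comma, so it does not match.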

Regular expression to delete row from csv

I have a line from a CSV file:
first decimal;;;first text;;second text with newlines, special symbols, including semicolons;second decimal, always present;first dot separated float, may not present;second dot separated float, may not present;third text that present only if present previous float
I need to delete the second text (with newlines and special symbols).
For now I have an expression like:
(?<=;;)(.*?)(?=;\d+)
The first part of it does not work: I don't know how to make it select only text preceded by exactly two semicolons. For now it selects text preceded by two or more semicolons, and, if I turn on dotall, also the first decimal preceded by semicolons plus a newline. Besides, I do not know how to make (.*?) include newline characters.
If you have a CSV file that contains semicolons and newlines as part of quoted fields, then regex is not the right tool for this. Imagine what would happen if you had a field like "This is one field;;don't split this;42"...
If you're sure that you'll never have two semicolons before or within a quoted field, then you may give regex a try. But a dedicated CSV parser would definitely be a safer bet.
That said, let's see why your regex fails:
Imagine the line 1;;;a;3. Your regex will match ;a because it fulfills all the requirements: there are two semicolons before it, and a semicolon plus digit after it. It's also the shortest possible match at this position in the string.
What can you do? You could add a negative lookbehind to make sure that it's not possible to match three semicolons before the current position, plus a negative lookahead to make sure that the match doesn't itself start with a semicolon:
(?<=;;)(?<!;;;)(?!;)(.*?)(?=;\d+)
Give it a try - but look into CSV libraries too, because they will solve your problem better.
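Here is a hedged Python sketch of the lookaround approach. The sample line is invented to follow the layout described in the question, and re.DOTALL is used so that . also matches newlines. The stricter pattern is (?<=;;)(?<!;;;)(?!;)(.*?)(?=;\d+), where the (?!;) lookahead keeps a match from starting on a third semicolon. It still assumes the second text never contains a semicolon directly followed by a digit.

```python
import re

# Hypothetical line: the second text contains a newline and a semicolon.
line = "1;;;first text;;second text,\nwith; semicolons;42;1.5;2.5;third text"

# Pattern from the question: it can start right after the first two
# semicolons of the ";;;" run, so it grabs far too much.
NAIVE = re.compile(r"(?<=;;)(.*?)(?=;\d+)", re.DOTALL)
naive_match = NAIVE.search(line).group(0)

# Stricter pattern: exactly two semicolons before the match, and the
# match may not begin with a semicolon.
STRICT = re.compile(r"(?<=;;)(?<!;;;)(?!;)(.*?)(?=;\d+)", re.DOTALL)
strict_match = STRICT.search(line).group(0)

# Delete the second text from the line.
cleaned = STRICT.sub("", line, count=1)
```

With this input, the naive pattern captures everything from the third semicolon onwards, while the stricter one isolates only the second text.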

Split a string based on each time a Deterministic Finite Automata reaches a final state?

I have a problem that can be solved by iteration, but I'm wondering if there's a more elegant solution using regular expressions and split().
I have a string (which Excel is putting on the clipboard), which is, in essence, comma delimited. The caveat is that when the cell values contain a comma, the whole cell is surrounded with quotation marks (presumably to escape the commas within that string). An example string is as follows:
123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"
Now, I want to elegantly split this string into individual cells, but the catch is I cannot use a normal split expression with comma as a delimiter, because it will divide cells that contain a comma in their value. Another way of looking at this problem, is that I can ONLY split on a comma if there is an EVEN number of quotation marks preceding the comma.
This is easy to solve with a loop, but I'm wondering if there's a regular expression.split function capable of capturing this logic. In an attempt to solve this problem, I constructed the Deterministic Finite Automata (DFA) for the logic.
The question now is reduced to the following: is there a way to split this string such that a new array element (corresponding to /s) is produced each time the final state (state 4 here) is reached in a DFA?
Using regex (unescaped): (?:(?:"[^"]*")|(?:[^,]*))
Use that and call Regex.Matches() which is .NET, or its analog in other platforms.
You could further expand the above to this: ^(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*))(?:,(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*)))*$
This will parse the whole string in one pass, but you need named groups and multiple captures per group for this to work (.NET supports both).
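The "match the cells instead of splitting" idea can be sketched in Python as well, with re.findall playing the role of Regex.Matches(). This is an illustration under two assumptions: the pattern uses [^,]+ rather than [^,]* to avoid empty matches, so empty cells would be dropped, and quoted cells are assumed not to contain escaped quotes. As a cross-check, the same row is also split on commas that are followed by an even number of quotes.

```python
import re

row = '123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"'

# A cell is either a quoted run (group 1, quotes excluded) or a run of
# non-comma characters (group 2).
CELL = re.compile(r'"([^"]*)"|([^,]+)')
cells_by_match = [quoted or bare for quoted, bare in CELL.findall(row)]

# Cross-check: split on commas followed by an even number of quotes,
# then strip the surrounding quotes from each piece.
EVEN_QUOTES_AHEAD = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')
cells_by_split = [c.strip('"') for c in EVEN_QUOTES_AHEAD.split(row)]
```

Both lists hold the same ten cell values for this row, with the thousands separators preserved inside the formerly quoted cells.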
Eligible commas are also followed by an even number of quotes, and VBScript does support lookaheads. Try splitting on this:
",(?=(?:[^""]*""[^""]*"")*[^""]*$)"