Replace dots using `gsub` - regex

I am trying to replace all the "." in a specific column of my data frame with "/". There are other characters in each cell and I want to make sure I only change the "."'s.
When I use gsub, I get an output that appears to make the changes, but then when I go to View(), the changes are not actually made...I thought gsub was supposed to actually change the value in the data frame. Am I using it incorrectly? I have my code below.
gsub(".", "/", spy$Identifier, ignore.case = FALSE, perl = FALSE,
fixed = TRUE, useBytes = FALSE)
I also tried sub, but the code I have below changed every entry itself to "/" and I am not sure how to change it.
spy$Identifier <- sub("^(.).*", "/", spy$Identifier)
Thanks!

My recommendation would be to escape the "." character:
spy$Identifier <- gsub("\\.", "/", spy$Identifier)
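In regular expressions, a period is a special character that matches any character. "Escaping" it tells the search to look for an actual period. In R's gsub this is accomplished with two backslashes (i.e. "\\"), because the backslash itself has to be escaped inside an R string. In other languages, it's often just one backslash. Also note that gsub returns a modified copy and does not change the data frame in place, so the result has to be assigned back to the column as shown above.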
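The same distinction shows up outside R. As a rough illustration in Python (the sample value is made up for the example), an unescaped dot replaces every character, an escaped dot replaces only literal periods, and a plain string replacement avoids regex altogether:

```python
import re

identifier = "BRK.B 2020.01"  # hypothetical sample value, not from the question

# Unescaped: "." matches any character, so every character is replaced.
every_char = re.sub(".", "/", identifier)

# Escaped: only literal periods are replaced.
only_dots = re.sub(r"\.", "/", identifier)

# No regex at all: a literal string replacement.
literal = identifier.replace(".", "/")

print(every_char)
print(only_dots)
print(literal)
```

As in R, none of these calls modifies `identifier` itself; each returns a new string.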

Related

re.sub() ellipsis in Python 3

I need a simple solution, but it's evading me. I am passing a list of strings to a for loop for some cleaning up, and need to remove any instance of an ellipsis. Here's an example of what I've tried:
import re

text_list = ["string1", "string2", "string3...", "string.4"]
for i in range(len(text_list)):
    text_list[i] = re.sub(r"\.", "", text_list[i])
    text_list[i] = re.sub(r"\.{3}", "", text_list[i])
    text_list[i] = re.sub(r"\.\.\.", "", text_list[i])
Naturally, none of these removes an ellipsis. The period is removed, though. So my output would be:
for text in text_list:
    print(text)
>>>string1
string2
string3... <- THIS ONE DIDN'T CHANGE
string4 <- BUT THIS ONE DID
I've exhausted my regex documentation and Google searches. How do I match an ellipsis with a regex?
@swalladge had the right notion here: use unicode. Here is his answer.
"If you want to remove an actual ellipsis, as in the unicode HORIZONTAL ELLIPSIS character (…), then you need to use that in the code, since 3 periods won't match it." – @swalladge
@rickdenhaan also had an easier way to accomplish the task. Thanks!
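A minimal sketch of that idea, assuming the input may contain either the single HORIZONTAL ELLIPSIS character (U+2026) or three ASCII periods. The list contents are adapted from the question, with the Unicode character written as an escape so it is visible:

```python
import re

text_list = ["string1", "string2", "string3\u2026", "string4..."]

# Match either U+2026 HORIZONTAL ELLIPSIS or three ASCII periods in a row.
ellipsis = re.compile(r"\u2026|\.{3}")

cleaned = [ellipsis.sub("", text) for text in text_list]
print(cleaned)
```

A single period is left alone by this pattern, so "string.4" would keep its dot.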

Regex to select text outside of underscores

I am looking for a regex to select the text which falls outside of underscore characters.
Sample text:
PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant
Basically I need to be able to select the first keyword, which is always before the first underscore, and the last keyword, which is always after the last underscore. As an additional complexity, there can also be texts which have no underscore at all; these need to be selected completely as well.
The best I got yet was this expression:
^((?! *\_[^)]*\_ *).)*
which is only yielding me the first part, not the second and it has no support for the non-underscore yet at all.
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
Thanks!
Use the JavaScript string function split(). See the example below.
var t = "PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant";
var arr = t.split('_');
console.log(arr);
//Access the required parts like this
console.log(arr[0] + ' ' + arr[arr.length - 1]);
Perhaps something like this:
/(^[^_]+)|([^_]+$)/g
That is, match either:
^[^_]+ the beginning of the string followed by non-underscores, or
[^_]+$ non-underscores followed by the end of the string.
var regex = /(^[^_]+)|([^_]+$)/g
console.log("A_b_c_D".match(regex)) // ["A", "D"]
console.log("A_b_D".match(regex)) // ["A", "D"]
console.log("A_D".match(regex)) // ["A", "D"]
console.log("AD".match(regex)) // ["AD"]
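For comparison, here is a sketch of the same alternation in Python. The capture groups are dropped on purpose so that `re.findall` returns the matched text directly; the name `outer_parts` is only illustrative:

```python
import re

# Either non-underscores anchored at the start, or non-underscores anchored
# at the end. No capture groups, so findall returns plain strings.
OUTSIDE = re.compile(r"^[^_]+|[^_]+$")

def outer_parts(text):
    """Return the text before the first and after the last underscore."""
    return OUTSIDE.findall(text)

print(outer_parts("PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant"))
print(outer_parts("AD"))
```

A string without underscores is consumed entirely by the first alternative, which is why it comes back as a single element.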
I'm not sure if you should use a regex here. I think splitting the string at underscore, and using the first and last element of the resulting array might be faster, and less complicated.
Trivial with .replace:
str.replace(/_.*_/, '')
// "PartIWantPartIwant"
With matching, you'd need to be selecting and concatenating groups:
parts = str.match(/^([^_]*).*?([^_]*)$/)
parts[1] + parts[2]
// "PartIWantPartIwant"
EDIT
"This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic."
This is not possible: a single regular expression match cannot cover a discontinuous span.
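Since one match is always one contiguous span, the two pieces can only come back as separate capture groups that the caller joins afterwards. A small Python sketch of the group-based approach, assuming single-line input (the helper name is hypothetical):

```python
import re

def first_and_last(text):
    """Return (leading, trailing) runs of non-underscore characters."""
    # Group 1: leading non-underscores. Group 2: trailing non-underscores.
    # The lazy .*? in between absorbs everything that should be ignored.
    match = re.match(r"^([^_]*).*?([^_]*)$", text)
    return match.group(1), match.group(2)

head, tail = first_and_last(
    "PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant"
)
print(head + tail)
```

With no underscore at all, the first group takes the whole string and the second group is empty, so joining them still gives the full text.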

Swift 3: iosMath label removing all spaces

I'm trying to display text which may at times contain a math expression so I am using MTMathUILabel from iosMath. I generate the labels dynamically and add them to a stack as I pull the strings from the db. The problem is that all text which is not math appears with no spaces. i.e:
In db: Solve the following equation: (math here)
In label: Solvethefollowingequation: (math here)
Here is what I have tried so far:
for question in all_questions {
    let finalString = question.question?.replacingOccurrences(of: " ", with: "\\space", options: .literal, range: nil)
    let label = MTMathUILabel()
    label.textColor = UIColor.black
    label.latex = finalString
    stack.addArrangedSubview(label)
}
But the problem is that it literally places two backslashes, and Xcode doesn't let me write just one \ because it is not escaped. However, if I just write
print("\\space")
Then it will print just one.
How can I fix this so I add only one \? If this cannot be done, how can I achieve what I want? Is there a better library out there?
After a quick look at MTMathUILabel's doc and LaTeX conventions, I believe you should replace your spaces with a tilde character "~". This will make them non-breaking spaces and avoid the backslash issue (which is probably due to \space not being understood by MTMathUILabel).
Systematic replacement of all spaces may yield undesirable results if the formula itself has legitimate spaces in it.
For example, a quadratic equation would be expressed as:
x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}
You will end up replacing spaces inside curly braces, and that may or may not be what you want:
x~=~\frac{-b~\pm~\sqrt{b^2-4ac}}{2a}
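If spaces inside braces should be left alone, one option is to track brace depth while replacing. The following is only a language-neutral sketch written in Python, not Swift and not part of iosMath; the function name is made up, and escaped braces such as \{ are not handled:

```python
def tie_spaces_outside_braces(latex):
    """Replace spaces with "~" only where the brace depth is zero."""
    depth = 0
    out = []
    for ch in latex:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)  # tolerate unbalanced input
        out.append("~" if ch == " " and depth == 0 else ch)
    return "".join(out)

print(tie_spaces_outside_braces(r"x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}"))
```

On the quadratic formula above, this keeps the spaces around \pm and only ties the ones around the equals sign.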

Select until next dot followed by \s?

I could use some help writing a regex. I have the following text:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"
ZUNACT.sec COLUMN-LABEL " " FORMAT "X(30)"
INFDON.sep COLUMN-LABEL "" FORMAT "99/99/9999"
IF INFDON.top THEN "S" ELSE (IF INFDON.REPORT THEN "R" ELSE (IF INFDON.prime <> "" THEN INFDON.prime ELSE "")) COLUMN-LABEL "R" FORMAT "X(1)"
/* _UIB-CODE-BLOCK-END */
&ANALYZE-RESUME
WITH SEPARATORS SIZE 83.57 BY 5.08
BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.
I have to find this whole block in a text file, so far I have this regex:
(?:DEFINE|DEF)\s([\w\s]*)BROWSE\s+([\w-]+)\s+([^.]*)\.
My problem is that it selects only this:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.
whereas I want to select up to the final period. Basically, the rule I want to apply is "until next dot followed by \s".
But I can't figure out how to write this regex.
Allow "non-dot" [^.] OR "dots not followed by space" \.(?!\s):
DEF(INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+(([^.]|\.(?!\s))*)\.
Note also the simplification of the leading term.
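A quick way to check this pattern is to run it in Python against a shortened version of the block. In this sketch the leading group is made non-capturing so the group numbers stay the same as in the question, and the trailing DEFINE VARIABLE line is invented purely to show where the match stops:

```python
import re

# Shortened sample; the final DEFINE VARIABLE statement is hypothetical.
SOURCE = (
    "DEFINE BROWSE BW_SC20SDAN\n"
    "QUERY BW_SC20SDAN NO-LOCK DISPLAY\n"
    'ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"\n'
    "WITH SEPARATORS SIZE 83.57 BY 5.08\n"
    "BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.\n"
    "DEFINE VARIABLE x AS INTEGER.\n"
)

# Body: any non-dot, or a dot that is NOT followed by whitespace.
BROWSE = re.compile(
    r"DEF(?:INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+((?:[^.]|\.(?!\s))*)\."
)

match = BROWSE.search(SOURCE)
block = match.group(0)
print(match.group(2))
print(block)
```

Dots inside ZTYACC.prime and 83.57 are followed by a non-space character, so they are consumed by the body; the period after FIT-LAST-COLUMN is followed by a newline, so it terminates the match.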
Probably the most readable way to do that is
(?:DEFINE|DEF)\s([\w\s]*)BROWSE[\S\s]+?\.\s
The ? turns the + operator lazy, meaning it matches as little as possible and stops at the first period followed by whitespace.
If you have the option to use an ungreedy regex library, the simplest yet closest to what you specified would be
DEFINE\s+BROWSE.*?\.\s
Note, however, that the trailing whitespace may not be there at the end of your input text, leaving the last statement unmatched.
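One possible way around the missing trailing whitespace, sketched in Python, is to swap the final `\s` for a lookahead that also accepts the end of the input. This is a variation on the pattern above, not the original answer's regex; `re.DOTALL` is needed so that `.` can cross line breaks, and the sample text is invented:

```python
import re

# (?=\s|\Z): the terminating period must be followed by whitespace
# or by the end of the input, but neither is consumed.
STATEMENT = re.compile(r"DEFINE\s+BROWSE.*?\.(?=\s|\Z)", re.DOTALL)

# Hypothetical input with no whitespace after the last period.
text = "DEFINE BROWSE BW_X\nDISPLAY T.f WITH SIZE 83.57 BY 5.08."

found = STATEMENT.search(text)
strict = re.search(r"DEFINE\s+BROWSE.*?\.\s", text, re.DOTALL)

print(found.group(0))
print(strict)
```

The stricter `\.\s` version finds nothing on this input, which is the case the caveat above describes.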
You may find it useful to have a lexer (scanner) like flex or ANTLR tokenize your string. This approach has the advantage that the lexer takes care of the white space and lets you specify the form of the block of interest in more detail.

Removing parentheses as unwanted text in R using gsub

I'm trying to clean up a column in my data frame where the rows look like this:
1234, text ()
and I need to keep just the number in all the rows. I used:
df$column = gsub(", text ()", "", df$column)
and got this:
1234()
I repeated the operation with only the parentheses, but they won't go away. I wasn't able to find an example that deals specifically with parentheses being eliminated as unwanted text. sub doesn't work either.
Anyone knows why this isn't working?
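Parentheses are metacharacters in regex: the "()" in your pattern is read as an empty group rather than as literal parentheses, which is why they were left behind. You should escape them using \\ or [], or add fixed = TRUE. But in your case you just want to keep the number, so just remove everything else using \\D: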
gsub("\\D", "", "1234, text ()")
## [1] "1234"
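The same behaviour can be reproduced in Python, which may make the cause easier to see. This is a sketch for illustration only, using the sample value from the question:

```python
import re

value = "1234, text ()"

# Unescaped: "()" is an empty group, so only ", text " is removed.
unescaped = re.sub(", text ()", "", value)

# Escaped by hand, or via re.escape, the parentheses match literally.
escaped = re.sub(r", text \(\)", "", value)
auto_escaped = re.sub(re.escape(", text ()"), "", value)

# Keeping only the digits, as with the \D pattern above.
digits_only = re.sub(r"\D", "", value)

print(unescaped)
print(escaped, auto_escaped, digits_only)
```

The first call leaves the parentheses behind, just as in the question; the other three all reduce the value to the number.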
If your column always follows the format described above:
1234, text ()
Something like the following (in C#) should work:
string extractedNumber = Regex.Match(INPUT_COLUMN, @"^\d{4,}").Value;
This reads as: from the start of the string, find four or more digits.