Parsing a string of data but leaving out quotes - regex

I need to use RegEx to run through a string of text but only return the parts that I need. Let's say, for example, the string is as follows:
1234,Weapon Types,100,Handgun,"This is the text, "and", that is all."""
\d*,Weapon Types,(\d*),(\w+), gets me most of the way, however it is the last part that I am having an issue with. Is there a way for me to capture the rest of the string i.e.
"This is the text, "and", that is all."""
without picking up the quotes? I've tried negating them, however it just stops the string at the quote.
Please keep in mind that the text for this string is unknown so doing literal matches will not work.

You've given us something very difficult to solve. The nested commas inside your string are not a problem: once we come across a double quote, we can ignore everything until the closing quote, which gobbles up the commas.
But how will your parser know that the next double quote isn't the one ending the string? How does it know that it is a nested double quote?
If I could slightly modify your input string to make it clear what is a nested quote, then parsing is easy...
var txt = "1234,Weapon Types,100,Handgun,\"This is the text, 'and', that is all.\",other stuff";
var m = Regex.Match(txt, @"^\d*,Weapon Types,(\d*),(\w+),""([^""]+)""");
MessageBox.Show(m.Groups[3].Value);
But if your input string must have nested quotes like that, then we must come up with some other rule for detecting the real end of the string. How about treating the last double quote that is followed by a comma as the end?
var txt = "1234,Weapon Types,100,Handgun,\"This is the text, \"and\", that is all.\",other stuff";
var m = Regex.Match(txt, @"^\d*,Weapon Types,(\d*),(\w+),""(.+)"",");
MessageBox.Show(m.Groups[3].Value);
The result is...
This is the text, "and", that is all.
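For readers not on .NET, here is a minimal TypeScript sketch of the same greedy rule. It is not the answerer's code. The variable names are made up for illustration, and the second pattern (for the case where the quoted text is the last field, as in the question's sample) is an assumption about what the asker needs rather than something stated in the answer.

```typescript
// Assumption: the description is followed by at least one more comma-separated field.
const withTrailingField =
  '1234,Weapon Types,100,Handgun,"This is the text, "and", that is all.",other stuff';
// Greedy (.+) runs to the LAST `",` in the line, so the inner quotes are kept.
const greedy = /^\d*,Weapon Types,(\d*),(\w+),"(.+)",/.exec(withTrailingField);
const description = greedy ? greedy[3] : null;

// Assumption: the description is the final field, as in the question's sample.
const asLastField =
  '1234,Weapon Types,100,Handgun,"This is the text, "and", that is all."""';
// Lazy (.*?) followed by `"+$` drops the run of quotes at the end of the line.
const lazy = /^\d*,Weapon Types,(\d*),(\w+),"(.*?)"+$/.exec(asLastField);
const lastDescription = lazy ? lazy[3] : null;

console.log(description);
console.log(lastDescription);
```

Note that the greedy form overruns if a later field also contains a quote followed by a comma, so it only suits input where the quoted text is the last quoted field.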

Related

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space wherever any of the characters
-:\*_/;, is present. For example, JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should become JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some word exceptions such as a/c and w/d. The \D conditions are there to keep date/time values from being separated, but this created an issue: numbers followed by the above-mentioned characters never get split.
My requirement is
1. Insert a space after the characters -:\*_/;,
2. Dates and times, which may contain / or :, should not get split
3. Exceptions are needed for words like a/c and w/d
The following is the full code
Private Function formatColon(oldString As String) As String
    Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
    Dim newString As String: newString = reg.Replace(oldString, "$1 ")
    formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)
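The three-replacement idea above can be sketched in TypeScript as follows. This is only an illustration under stated assumptions: the placeholder tokens, the function name formatSeparators, and the heuristic "a / or : between two digits belongs to a date or time" are all hypothetical choices, not part of the original answer, and the final whitespace collapse merely stands in for the asker's XtraspaceKill.

```typescript
// Hypothetical placeholder tokens; any text that never occurs in the input would do.
const SLASH = "\u0000SLASH\u0000";
const COLON = "\u0000COLON\u0000";

function formatSeparators(input: string): string {
  // Step 1: protect separators that sit between two digits (assumed to be
  // dates or times) and the slash inside the known abbreviations.
  let s = input
    .replace(/(\d)\/(?=\d)/g, `$1${SLASH}`)
    .replace(/(\d):(?=\d)/g, `$1${COLON}`)
    .replace(/\b(?:a\/c|w\/d|m\/s|s\/w|m\/o)\b/g, (word) => word.replace("/", SLASH));

  // Step 2: insert a space after every separator that is still unprotected.
  s = s.replace(/[-:*_\\\/;,]/g, "$& ");

  // Step 3: restore the protected characters.
  s = s.split(SLASH).join("/").split(COLON).join(":");

  // Stand-in for XtraspaceKill: collapse runs of spaces.
  return s.replace(/ {2,}/g, " ").trim();
}

console.log(formatSeparators("JET*AIRWAYS\\INDIA/858701/IDBI 05/05/05;05:05:05 a/c"));
```

The digit heuristic would also protect something like 12/34 that is not a date, so a real implementation may need a stricter date pattern in step 1.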

How to split a comma-separated string into an array in TypeScript when any item containing a comma is in double quotes

I've been struggling to get all items of the string below into an array in TypeScript.
abc,"de,f",hi,"hello","te,st&"
If any string has a comma (,) or ampersand (&) in it, it will be placed in double quotes.
I tried the split function, but it fails because my strings can contain commas as well.
Any help in this regard is highly appreciated.
Thank you.
If you are looking to use a regular expression matching, can you try a different regEx that would match strings inside quotes first, then strings outside quotes, something like (\".+?\")|(^[^\"]+,)|(,[^\"]+,)
I don't know how relevant it would be in case of TypeScript, but I am guessing you'd be able to work something out that takes this Pattern and gives you the matches one by one
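As a rough illustration of "match quoted strings first, then unquoted ones", here is a small TypeScript sketch. The function name splitQuoted and the simplified two-alternative pattern are assumptions made for this example, not the exact expression suggested above.

```typescript
// Hypothetical helper: splits one line in which fields containing commas
// are wrapped in double quotes.
function splitQuoted(line: string): string[] {
  const fields: string[] = [];
  // Alternative 1: a quoted field, captured without its quotes.
  // Alternative 2: an unquoted run of non-comma characters.
  const pattern = /"([^"]*)"|([^,]+)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(line)) !== null) {
    fields.push(m[1] ?? m[2] ?? "");
  }
  return fields;
}

console.log(splitQuoted('abc,"de,f",hi,"hello","te,st&"'));
```

This sketch skips empty unquoted fields and does not handle escaped quotes inside a quoted field, so it only fits input shaped like the sample.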
First of all, I think that you are making things more complicated than they need to be by implementing the following logic:
If any string has a comma (,) or ampersand (&) in it, it will be placed in double quotes.
Instead of doing it that way, you should systematically put every element inside double quotes:
abc,"de,f",hi,"hello","te,st&"
→
"abc","de,f","hi","hello","te,st&"
You will then have the second string to parse.
A regex like this one will do the job:
(?<=,")([^"]*)(?=",)|(?<=")([^"]*)(?=",)|(?<=")([^"]*)(?="$)
Using the capture groups $1, $2 and $3, you can extract your elements.
The regex /(?:^|,)(\"(?:[^\"]+)\"|[^,]+)/ helped me get the required values.
var test = '"abc,123",test,123,456,"def:get"';
test.split(/(\"(?:[^\"]+)\"|[^,]+)/);
It returns the array below.
["", "\"abc,123\"", ",", "test", ",", "123", ",", "456", ",", "\"def:get\"", ""]
For the values inside double quotes, I just trimmed the quotes to get the actual values, and I ignored the empty items of the array.
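The clean-up described above (drop the empty items, trim the quotes) might look like this in TypeScript. It is a sketch that assumes the split pattern carries + quantifiers, and it also drops the "," separator entries, which the capturing group leaves in the array. The name splitAndClean is hypothetical.

```typescript
// Hypothetical helper combining the split with the clean-up steps.
function splitAndClean(line: string): string[] {
  return line
    // The capturing group makes split() keep the matched fields in the result.
    .split(/("[^"]+"|[^,]+)/)
    // Drop the empty strings and the bare comma separators between fields.
    .filter((part) => part !== "" && part !== ",")
    // Strip one leading and one trailing double quote, if present.
    .map((part) => part.replace(/^"|"$/g, ""));
}

console.log(splitAndClean('"abc,123",test,123,456,"def:get"'));
```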
Use split on the string (this example is Swift):
let fullName = "First,Last"
let fullNameArr = fullName.characters.split{$0 == ","}.map(String.init)
fullNameArr[0] // First
fullNameArr[1] // Last

Is it possible to pass a reference in a regex quantifier in Java?

I had a programming exercise where we had to write a format method that would take as parameters a String, as the text input we want to format, and an integer, as the length of the lines we want.
We also had specific rules, and one of them was: if a word is longer than the desired length, it should go on a line by itself.
So an easy way to do this was to use a regex to search for a run of non-whitespace characters that is at least length characters long, and to replace the match with itself with a \n on each side...
perhaps the code is clearer: let's imagine our desired line length is 10
String s = line.replaceAll("([ \t])(\\S{10,})([ \t])", "\n$2\n");
System.out.println(s);
This actually works fine.
My problem is that the line length should be passed as a parameter to the format method; so the only way I can refer to this parameter is by using its reference. For example:
public static String format(String s, int length) { ...
//missing stuff
String s = line.replaceAll("([ \t])(\\S{length,})([ \t])", "\n$2\n");
System.out.println(s);
However, in that case it throws an error: PatternSyntaxException
My question is: is there any way to use a reference into a quantifier (instead of a number), as argument for the number of occurrences we want to match? As it seems it would be really useful, I was surprised that it didn't seem to work. Also it doesn't look like a lot of people encountered this issue.
Try this. The pattern is just a string, so concatenate the value of length into the quantifier:
String s = line.replaceAll("([ \t])(\\S{" + length + ",})([ \t])", "\n$2\n");
System.out.println(s);
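The same idea carries over to other regex engines: the quantifier bound is ordinary text in the pattern source, so it can be spliced in before the pattern is compiled. Below is a minimal TypeScript sketch of that technique. The function name isolateLongWords and the integer check are assumptions added for illustration, not a translation of the exercise's full format method.

```typescript
// Hypothetical helper: puts every word of at least `length` characters on its own line.
function isolateLongWords(line: string, length: number): string {
  // Guard against values that would produce an invalid quantifier such as {-1,} or {2.5,}.
  if (!Number.isInteger(length) || length < 0) {
    throw new RangeError("length must be a non-negative integer");
  }
  // The bound is inserted into the pattern source as plain text, e.g. \S{10,}.
  const pattern = new RegExp(`[ \\t](\\S{${length},})[ \\t]`, "g");
  return line.replace(pattern, "\n$1\n");
}

console.log(isolateLongWords("a extraordinarily b", 10));
```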

How to stop Ember.Handlebars.Utils.escapeExpression escaping apostrophes

I'm fairly new to Ember, but I'm on v1.12 and struggling with the following problem.
I'm making a template helper.
The helper takes the bodies of tweets and adds HTML anchors around the hashtags and usernames.
The paradigm I'm following is:
1. use Ember.Handlebars.Utils.escapeExpression(value); to escape the input text
2. do logic
3. use Ember.Handlebars.SafeString(value);
However, 1. seems to escape apostrophes. Which means that any sentences I pass to it get escaped characters. How can I avoid this whilst making sure that I'm not introducing potential vulnerabilities?
Edit: Example code
export default Ember.Handlebars.makeBoundHelper(function(value){
// Make sure we're safe kids.
value = Ember.Handlebars.Utils.escapeExpression(value);
value = addUrls(value);
return new Ember.Handlebars.SafeString(value);
});
Where addUrls is a function that uses a RegEx to find hashtags or usernames and wrap them in HTML anchors. For example, if it were given #emberjs foo, it would return #emberjs foo with #emberjs wrapped in an anchor tag.
The result of the above helper function would be displayed in an Ember (HTMLBars) template.
escapeExpression is designed to convert a string into the representation which, when inserted in the DOM, with escape sequences translated by the browser, will result in the original string. So
"1 < 2"
is converted into
"1 &lt; 2"
which when inserted into the DOM is displayed as
1 < 2
If "1 < 2" were inserted directly into the DOM (eg with innerHTML), it would cause quite a bit of trouble, because the browser would interpret < as the beginning of a tag.
So escapeExpression converts ampersands, less-than signs, greater-than signs, straight single quotes, straight double quotes, and backticks. The conversion of quotes is not necessary for text nodes, but could be for attribute values, since they may be enclosed in either single or double quotes while also containing such quotes.
Here's the list used:
var escape = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#x27;",
  "`": "&#x60;"
};
I don't understand why the escaping of the quotes should be causing you a problem. Presumably you're doing the escapeExpression because you want characters such as < to be displayed properly when output into a template using normal double-stashes {{}}. Precisely the same thing applies to the quotes. They may be escaped, but when the string is displayed, it should display fine.
Perhaps you can provide some more information about input and desired output, and how you are "printing" the strings and in what contexts you are seeing the escaped quote marks when you don't want to.
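To make the escape-then-linkify flow concrete, here is a standalone TypeScript sketch. It does not use Ember: escapeHtml and addUrls are hypothetical stand-ins written for this example, and the example.com URL is a placeholder. One thing the sketch shows, offered only as a guess at what might be going wrong, is that an escaped apostrophe (&#x27;) contains a # character, so a hashtag regex that does not check what precedes the # could match inside the entity.

```typescript
// Same character map as the list above.
const ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#x27;",
  "`": "&#x60;",
};

// Hypothetical stand-in for the escaping step.
function escapeHtml(text: string): string {
  return text.replace(/[&<>"'`]/g, (ch) => ESCAPES[ch] ?? ch);
}

// Hypothetical stand-in for addUrls: only treats # as a hashtag when it
// starts the string or follows whitespace, so "&#x27;" is left alone.
function addUrls(escaped: string): string {
  return escaped.replace(
    /(^|\s)#(\w+)/g,
    '$1<a href="https://example.com/tags/$2">#$2</a>' // placeholder URL
  );
}

console.log(addUrls(escapeHtml("it's #emberjs")));
```

When the resulting markup is rendered as HTML, the browser decodes &#x27; back to an apostrophe, so the visible text is unchanged.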

Removing parentheses as unwanted text in R using gsub

I'm trying to clean up a column in my data frame where the rows look like this:
1234, text ()
and I need to keep just the number in all the rows. I used:
df$column = gsub(", text ()", "", df$column)
and got this:
1234()
I repeated the operation with only the parentheses, but they won't go away. I wasn't able to find an example that deals specifically with parentheses being eliminated as unwanted text. sub doesn't work either.
Does anyone know why this isn't working?
Parentheses are special metacharacters in regex. You should escape them either using \\ or [] or adding fixed = TRUE. But in your case you just want to keep the number, so just remove everything else using \\D
gsub("\\D", "", "1234, text ()")
## [1] "1234"
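The behaviour is not specific to R: in most regex flavours an unescaped () is an empty group that matches nothing, so the literal parentheses are left behind. A short TypeScript sketch of the three variants follows (variable names are made up for illustration):

```typescript
const row = "1234, text ()";

// Unescaped: "()" is an empty capturing group, so only ", text " is removed.
const unescaped = row.replace(/, text ()/g, "");

// Escaped: "\(\)" matches the literal parentheses.
const escaped = row.replace(/, text \(\)/g, "");

// Keep digits only, the equivalent of gsub("\\D", "", x).
const digitsOnly = row.replace(/\D/g, "");

console.log(unescaped, escaped, digitsOnly);
```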
If your column always follows the format described above:
1234, text ()
Something like the following should work:
string extractedNumber = Regex.Match(INPUT_COLUMN, @"^\d{4,}").Value;
Reads like: From the start of the string find four or more digits.