Parse string where "separator" can be part of data? - c++

I have std strings like these:
UserName: Message
At first look it seems like an easy problem, but the issue is that the name's last character could be a ':' and the first character of the message part of the string could be a ':' too. The user could also have spaces in their name.
So a user might be named 'some name: '
and might type a message ' : Hello'
Which would look like:
'some name: : : Hello'
I do have the list (vector) of usernames though.
Given this, is there a way I could extract the username from this sort of string? (Ideally without having to iterate through the list of users)
Thanks

Try a regex like (\w+?):\ \w+.

If you can't guarantee that the username won't contain a ":" character, and you want to avoid iterating over the entire list each time to check, you could try a shortcut.
Keep a vector of just the usernames that contain special characters (I'm imagining that this is a small subset of all usernames). Check those first; if you find a match, take the string after [username]: . Otherwise, you can simply do a naive split on the colon.
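A minimal sketch of that shortcut, assuming the separator is always ": " and that the names with colons are kept in their own vector. The function name `extract_username` and the parameter `special_users` are hypothetical, chosen only for illustration:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the username part of "UserName: Message".
// special_users is assumed to hold only the names that contain ':'.
std::string extract_username(const std::string& line,
                             const std::vector<std::string>& special_users) {
    const std::string sep = ": ";
    std::string best;  // longest special name that prefixes the line
    for (const std::string& name : special_users) {
        // The name must start the line and be followed by the separator.
        if (name.size() > best.size() &&
            line.compare(0, name.size(), name) == 0 &&
            line.compare(name.size(), sep.size(), sep) == 0) {
            best = name;
        }
    }
    if (!best.empty()) return best;

    // No special name matched: naive split on the first colon.
    const std::string::size_type pos = line.find(':');
    return pos == std::string::npos ? std::string() : line.substr(0, pos);
}
```

Preferring the longest matching name is a design choice of this sketch, not something the answer specifies. It still cannot resolve every ambiguity, for example when an ordinary name is a prefix of a special one and the message itself starts with a colon.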

I would use string tokens
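#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main() {
    string text = "UserName: Message";
    char_separator<char> sep(":"); // ':' is dropped; empty tokens are skipped
    tokenizer< char_separator<char> > tokens(text, sep);
    BOOST_FOREACH(string t, tokens)
        cout << t << "." << endl;
}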

The way I would approach this is to simply find the first colon. Split the string there, and then trim the two remaining strings.
It's not entirely clear to me why there are additional colons and if they are part of the value. If they need to be removed, then you'll also need to strip them out.
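The split-and-trim approach could look roughly like this in standard C++. The helper names `trim` and `split_first_colon` are hypothetical, and only spaces and tabs are treated as whitespace:

```cpp
#include <string>
#include <utility>

// Hypothetical helper: strips leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
    const char* ws = " \t";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Splits at the first ':' and trims both halves.
// If there is no colon, the whole line is treated as the name.
std::pair<std::string, std::string> split_first_colon(const std::string& line) {
    const std::string::size_type pos = line.find(':');
    if (pos == std::string::npos) return {trim(line), ""};
    return {trim(line.substr(0, pos)), trim(line.substr(pos + 1))};
}
```

Note that for a name that itself contains a colon, such as 'some name: ', this returns 'some name' as the username and leaves the extra colons at the start of the message, which is exactly the case the question is worried about.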

Related

Extracting key-value pairs from a string using ruby & regex

I want to accomplish the following with ruby and if possible a regex:
Input: "something {\"key\":\"value\",\"key2\":3}"
Output: [["\"key\"", "\"value\""], ["\"key2\"", "3"]]
My attempt so far:
s = "something {key:\"value\",key2:3}"
s.scan(/.* {(?:([^:]+):([^,}]+),?)+}$/)
# Output: [["\"key2\"", "3"]]
For some reason the regex above only matches the last key value pair. Does someone know how to retrieve all the pairs?
Just to be clear, "something" can be any kind of string. For this reason, solutions such as (1) splitting the text directly on the equal or (2) a regex as used in s.scan(/(?:([^:]+):([^,}]+),?)/) don't work for me.
I know there are similar questions on SO. Still, from what I saw, they mostly tend towards the solutions 1 & 2 or focus on a single key value pair.
Your string looks like a JSON data structure encoded as a string. You can use JSON.parse for this, as long as you remove the leading word "something " from the string:
require 'json'
string = "something {\"key\":\"value\",\"key2\":3}"
# the following line removes the word something
string = string[string.index("{")..-1]
x = JSON.parse(string)
puts x["key"]
puts x["key2"]
You can then convert that to an array if required.
Alternatively, if you want to use regular expressions, try:
string.scan(/(?:"(\w+)":"?(\w+)"?)/)

How to differentiate between space and tab while performing file operations in ocaml

I have a dumped(.rem) file with 3 entries per line, separated by tabs - "\t" as shown below.
Hello World Ocaml
I like Ocaml
To read from this file, the type is passed in a cast(attrbs) along with the file like this:
type attrbs = list (string * string * string);
let chi = (open_in file : attrbs) in
let v = input_value chi in close_in chi
Now, I get a list in "v", which I use further. In fact, it also works if the entries are separated by space.
This works fine if all the 3 entries in a row do not contain any spaces within themselves. I would like to use another file which has the first entry as a string with spaces, second entry as a string without spaces, and third entry as any string as shown below:
This is with spaces Thisiswithoutspaces Thisissomestring
Another one with spaces Anotheronewithoutspaces AnotherString
If I use the code mentioned, since it does not differentiate between space and tab, it takes only the first three words - "This", "is", and "with". I want it to include the spaces and consider "This is with spaces" as an entire string.
I tried searching the web, but couldn't find any solution for it.
Update:
The issue was with the way I read them. If I use specific formats like "%s %s %s", they will work only if we add the @ character, like "%s@\t%s@\t%s". This is described under the title "Scanning indications in format strings" in https://caml.inria.fr/pub/docs/manual-ocaml/libref/Scanf.html. The issue is solved.
Glad you managed to do this yourself.
However, I wouldn't recommend using Scanf for that. You can do this:
match String.split_on_char '\t' (input_line chi) with
| [a;b;c] -> ...
| exception End_of_file -> ...
| l_wrong_size -> ...
This way, you are not only sure to not rely on the quirky behavior of Scanf, but you can also easily specify what to do on malformed input.

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ @ # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract a text string that includes lots of special characters and is located between quotes ' ', you can do it in general this way:
\'[\S\s]*?\'
In your case, if you want to extract only the middle quoted text from this: caption': u'text I want to extract', u'likes': , try this regex:
(?<=u\')[\V]*?(?=\'\,)
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using drop down menu:
Edit Column
Split into several columns
by separator Separator '
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

Capitalize first letter of words in a string

I'm having trouble figuring out how to transform a string into camel case in groovy. Say I start out with a string that looks like "1-800 FOO.BAR". Ultimately, I want this to turn into "1800FooDotBar". I've been able to get 1800FOODotBar by doing the following:
String str = "1-800 FOO.BAR"
String tempStr = str.replaceAll(/(?i)\.com/, "DotCom")
String newStr = tempStr.replaceAll(/\\W/, "")
I'm just not sure how to get rid of those capital letters in the middle. I've come across some information about a capitalize() method that should be able to help, but I'm just not familiar enough with Groovy to know how to use it. I think I need to split the string into individual strings for each word and then capitalize the first letter of each of those strings, but then how do I build the end result back up? I know that similar questions have been asked, but I'm just not seeing how to take that information and make complete Groovy code from it. Thanks in advance!
Very roughly:
String str = "1-800 FOO.BAR"
println str.replaceAll(/\./, " Dot ").split(/[^\w]/).collect { it.toLowerCase().capitalize() }.join("")
=> 1800FooDotBar

Parsing a string of data but leaving out quotes

I need to use RegEx to run through a string of text but only return the parts that I need. Let's say for example the string is as follows:
1234,Weapon Types,100,Handgun,"This is the text, "and", that is all."""
\d*,Weapon Types,(\d*),(\w+), gets me most of the way, however it is the last part that I am having an issue with. Is there a way for me to capture the rest of the string i.e.
"This is the text, "and", that is all."""
without picking up the quotes? I've tried negating them, however it just stops the string at the quote.
Please keep in mind that the text for this string is unknown so doing literal matches will not work.
You've given us something very difficult to solve. It's okay that you have nested commas inside your string. Once we come across a double quote, we can ignore everything until the end quote. This would gobble up the commas.
But how will your parser know that the next double quote isn't ending the string? How does it know that it is a nested double quote?
If I could slightly modify your input string to make it clear what is a nested quote, then parsing is easy...
// Nested quotes are written as single quotes here so the closing double quote is unambiguous.
var txt = "1234,Weapon Types,100,Handgun,\"This is the text, 'and', that is all.\",other stuff";
var m = Regex.Match(txt, @"^\d*,Weapon Types,(\d*),(\w+),""([^""]+)""");
MessageBox.Show(m.Groups[3].Value);
But if your input string must have nested quotes like that, then we must come up with some other rule for detecting what is the real end of the string. How about this?
var txt = "1234,Weapon Types,100,Handgun,\"This is the text, \"and\", that is all.\",other stuff";
var m = Regex.Match(txt, @"^\d*,Weapon Types,(\d*),(\w+),""(.+)"",");
MessageBox.Show(m.Groups[3].Value);
The result is...
This is the text, "and", that is all.