awk split by sequence of whitespaces - regex

Can split(string, array, separator) in awk use sequence of whitespaces as the separator (or more generally any regexp as the separator)?
Obviously, one could use the internal autosplit (that runs on each line of the input with value of FS variable as the separator) and with simple for and $0 magic do the trick. However, I was just wondering if there's a more straightforward way using the splititself.

The GNU Awk User's Guide states:
split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the
pieces in array. The first piece is stored in array[1], the second
piece in array[2], and so forth. The string value of the third
argument, fieldsep, is a regexp describing where to split string (much
as FS can be a regexp describing where to split input records). If
the fieldsep is omitted, the value of FS is used. split returns the
number of elements created. The split function, then, splits strings
into pieces in a manner similar to the way input lines are split into
fields
Here is a short (somewhat silly) example that uses a simple regular expression ".s " that will match any single character followed by a lower-case s and a space. The result of the split is put into array a. Note that the parts that match are not placed into the array.
BEGIN {
s = "this isn't a string yes isodore?"
count = split(s, a, ".s ")
printf("number of splits: %d\n", count)
print "Contents of array:"
for (i = 1; i <= count; i++)
printf "a[%d]: %s\n", i, a[i]
}
The output:
$ awk -f so.awk
number of splits: 3
Contents of array:
a[1]: th
a[2]: isn't a string y
a[3]: isodore?
The article Advanced Awk for Sysadmins show an example of parsing a line using split(). This page contains an example of using a regular expression to split data into
an array.

From the GNU awk(1) manual page:
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and returns the number of fields. If r is omitted, FS is used instead.
The point here is that you can use any regular expression to perform field splitting--at least you can with gawk. If you're using something else, you'll need to check your documentation.

Related

How to delimit this text file? strtok

so there's a text file where I have 1. languages, a 2. text of a number written in the said language, 3. the base of the number and 4. the number written in digits. Here's a sample:
francais deux mille quatre cents 10 2400
How I went about it:
struct Nomen{
char langue[21], nomNombre [31], baseC[3], nombreC[21];
int base, nombre;
};
and in the main:
if(myfile.is_open()){
{
while(getline(myfile, line))
{
strcpy(Linguo[i].langue, strtok((char *)line.c_str(), " "));
strcpy(Linguo[i].nomNombre, strtok(NULL, " "));
strcpy(Linguo[i].baseC, strtok(NULL, " "));
strcpy(Linguo[i].nombreC, strtok(NULL, "\n"));
i++;
}
Difficulty: I'm trying to put two whitespaces as a delimiter, but it seems that strtok() counts it as if there were only one whitespace. The fact there are spaces in the text number, etc. is messing up the tokenization. How should I go about it?
strtok treats any single character in the provided string as a delimiter. It does not treat the string itself as a single delimiter. So " " (two spaces) is the same as " " (one space).
strtok will also treat multiple delimiters together as a single delimiter. So the input "t1 t2" will be tokenized as two tokens, "t1" and "t2".
As mentioned in comments, strtok is also writes the NUL character into the input to create the token strings. So, it is an error to pass the result of string::c_str() as input to the function. The fact that you need to cast the constant string should have been enough to dissuade you from this approach.
If you want to treat a double space as a delimiter, you will have to scan the string and search for them yourself. Given you are using C APIs, you can consider strstr. However, in C++, you can use string::find.
Here's an algorithm to parse your string manually:
Given an input string input:
language is the substring from the start of input to the first SPC character.
From where language ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
text is the substring from the start of input to the first double SPC sequence.
From where text ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
Parse base, and parse number.

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB. I am able to do it for a fixed length string, but for variable length, I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
Not a regexp but if you stick to Matlab you can make use of the cellfun function to avoid loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( #(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %'// Horchler simplification of the previous solution below
%// this does the same than above but more convoluted
%// strout = cellfun( #(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in comment, although the cellfun instruction allows to compact your code, it is just a disguised loop. Moreover, since it runs on cell, it is notoriously slow. You won't see the difference on such simple inputs, but keep this use when super performances are not a major concern.
Now if you like regex you must like black magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse of the cellfun capabilities to obscure your code do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can have all your substring with:
strout = cellfun( #(s)cellfun(#(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(#(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array. One cell for each group of substring. If you want to concatenate them all then simply: strall = cat(2,strout{:});
I haven't had much experience w/ Matlab; but your problem can be solved by a simple string split function.
[parts, m] = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true )
Now, store the last part of parts; and iterate over parts until that:
len = length( parts )
for i = 1:len-1
print( strcat(parts(i), ' = ', parts(len)) )
end
I do not know what exactly is the print function in matlab. You can update that accordingly.
There isn't a single Regex that you can write that will cover all the cases. As posted on this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
You can get all the values in the line with regexp, pick the last value, then use a for loop iterating throughout the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
If you want to use regexprep at any means, you should write a pattern generator and a replace expression generator, based on number of '=' in the input string, and pass these as parameters of your regexprep func.
You can forget about Regex and Split the input to generate the output looping throughout the values (similarly to alternative #1) .

regexp with varying integer lengths

I want split these strings from
CH1Avg
Ch2Avg
Ch3
Ch4Avg
Ch5
Ch6Avg
Chan7
Channel9
Ch010
Ch011Avg
Chann12Average
...up to...
Ch100AVG
I need to split them into their consituent parts
"Ch", ##, "Avg"
1st and 3rd components are of variable length and form. I want to split using the 2nd component which is an integer of vary length from 0 to 100. The integer may or may not be zero padded.
Any thoughts? I am trying to use () without much success.
To split the string into the constituent parts, I suggest using named tokens for convenience:
strCell = {'CH1Avg'
'Ch2Avg'
'Ch3'
'Ch4Avg'
'Ch5'
'Ch6Avg'
'Chan7'
'Channel9'
'Ch010'
'Ch011Avg'
'Chann12Average'}
out = regexp(strCell,'(?<channelName>\D+)(?<channelNum>\d+)(?<channelType>\w*)','names')
out = [out{:}];
out(end)
ans =
channelName: 'Chann'
channelNum: '12'
channelType: 'Average'
Split on (\d+). The parentheses ensure that the number you're splitting on will also become part of the array.

How to separate a line of input into multiple variables?

I have a file that contains rows and columns of information like:
104857 Big Screen TV 567.95
573823 Blender 45.25
I need to parse this information into three separate items, a string containing the identification number on the left, a string containing the item name, and a double variable containing the price. The information is always found in the same columns, i.e. in the same order.
I am having trouble accomplishing this. Even when not reading from the file and just using a sample string, my attempt just outputs a jumbled mess:
string input = "104857 Big Screen TV 567.95";
string tempone = "";
string temptwo = input.substr(0,1);
tempone += temptwo;
for(int i=1 ; temptwo != " " && i < input.length() ; i++)
{
temptwo = input.substr(j,j);
tempone += temp2;
}
cout << tempone;
I've tried tweaking the above code for quite some time, but no luck, and I can't think of any other way to do it at the moment.
You can find the first space and the last space using std::find_first_of and std::find_last_of . You can use this to better split the string into 3 - first space comes after the first variable and the last space comes before the third variable, everything in between is the second variable.
How about following pseudocode:
string input = "104857 Big Screen TV 567.95";
string[] parsed_output = input.split(" "); // split input string with 'space' as delimiter
// parsed_output[0] = 104857
// parsed_output[1] = Big
// parsed_output[2] = Screen
// parsed_output[3] = TV
// parsed_output[4] = 567.95
int id = stringToInt(parsed_output[0]);
string product = concat(parsed_output[1], parsed_output[2], ... ,parsed_output[length-2]);
double price = stringToDouble(parsed_output[length-1]);
I hope, that's clear.
Well try breaking down the files components:
you know a number always comes first, and we also know a number has no white spaces.
The string following the number CAN have whitespaces, but won't contain any numbers(i would assume)
After this title, you're going to have more numbers(with no whitespaces)
from these components, you can deduce:
grabbing the first number is as simple as reading in using the filestream <<.
getting the string requires you to check until you reach a number, grabbing one character at a time and inserting that into a string. the last number is just like the first, using the filestream <<
This seems like homework so i'll let you put the rest together.
I would try a regular expression, something along these lines:
^([0-9]+)\s+(.+)\s+([0-9]+\.[0-9]+)$
I am not very good at regex syntax, but ([0-9]+) corresponds to a sequence of digits (this is the id), ([0-9]+\.[0-9]+) is the floating point number (price) and (.+) is the string that is separated from the two number by sequences of "space" characters: \s+.
The next step would be to check if you need this to work with prices like ".50" or "10".

Split a string on whitespace in Go?

Given an input string such as " word1 word2 word3 word4 ", what would be the best approach to split this as an array of strings in Go? Note that there can be any number of spaces or unicode-spacing characters between each word.
In Java I would just use someString.trim().split("\\s+").
(Note: possible duplicate Split string using regular expression in Go doesn't give any good quality answer. Please provide an actual example, not just a link to the regexp or strings packages reference.)
The strings package has a Fields method.
someString := "one two three four "
words := strings.Fields(someString)
fmt.Println(words, len(words)) // [one two three four] 4
DEMO: http://play.golang.org/p/et97S90cIH
From the docs:
Fields splits the string s around each instance of one or more consecutive white space characters, as defined by unicode.IsSpace, returning a slice of substrings of s or an empty slice if s contains only white space.
If you're using tip: regexp.Split
func (re *Regexp) Split(s string, n int) []string
Split slices s into substrings separated by the expression and returns
a slice of the substrings between those expression matches.
The slice returned by this method consists of all the substrings
of s not contained in the slice returned by FindAllString. When called
on an expression that contains no metacharacters, it is equivalent to strings.SplitN.
Example:
s := regexp.MustCompile("a*").Split("abaabaccadaaae", 5)
// s: ["", "b", "b", "c", "cadaaae"]
The count determines the number of substrings to return:
n > 0: at most n substrings; the last substring will be the unsplit remainder.
n == 0: the result is nil (zero substrings)
n < 0: all substrings
I came up with the following, but that seems a bit too verbose:
import "regexp"
r := regexp.MustCompile("[^\\s]+")
r.FindAllString(" word1 word2 word3 word4 ", -1)
which will evaluate to:
[]string{"word1", "word2", "word3", "word4"}
Is there a more compact or more idiomatic expression?
You can use package strings function split
strings.Split(someString, " ")
strings.Split