I have the following dataset:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(MA_234_AAF_US AL_87665_ACH_USA TX_3_GH_US LA_689_KLO_US KY_3435_Z_USA)
9.96567 10.559998 12.935112 13.142867 9.35608
9.758375 9.856 10.002945 8.090142 10.313352
11.594983 9.274136 12.486753 6.661111 10.529528
10.354564 9.893115 10.625778 13.265523 7.405652
12.7978 10.76272 11.527348 10.112844 11.64973
10.63846 11.040354 8.569465 8.781206 11.448466
9.254233 13.808356 10.817062 9.545164 8.759109
11.8417 10.15155 12.72436 11.102546 11.506034
9.864883 9.864952 14.45111 10.12562 9.753519
9.965327 11.517155 9.910269 8.988406 11.359774
end
I would like to change the order of the text in the variable names like this:
US_MA_AAF_234 USA_AL_ACH_87665 US_TX_GH_3 US_LA_KLO_689 USA_KY_Z_3435
I have tried the code provided in the answers in this question:
Remove middle character from variable names
However, I could not make it work.
Here is an alternative approach.
It is inferior to a one-line rename, which serves the purpose well, but a close reading shows how the two approaches correspond. It relies on each name consisting of elements separated by underscores; the underscores are removed and then reinserted.
clear
input float(MA_234_AAF_US AL_87665_ACH_USA TX_3_GH_US LA_689_KLO_US KY_3435_Z_USA)
9.96567 10.559998 12.935112 13.142867 9.35608
end
foreach name of var * {
local new = subinstr("`name'", "_", " ", .)
tokenize `new'
rename `name' `4'_`1'_`3'_`2'
}
describe, fullnames
Contains data
obs: 1
vars: 5
size: 20
-------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------------
US_MA_AAF_234 float %9.0g
USA_AL_ACH_87665
float %9.0g
US_TX_GH_3 float %9.0g
US_LA_KLO_689 float %9.0g
USA_KY_Z_3435 float %9.0g
-------------------------------------------------------------------------------------------
EDIT:
As @PearlySpencer points out, the statements within the loop
local new = subinstr("`name'", "_", " ", .)
tokenize `new'
rename `name' `4'_`1'_`3'_`2'
could be replaced by
tokenize `name', parse(_)
rename `name' `7'_`1'_`5'_`3'
The difference is that the underscores will get placed in local macros 2, 4, 6.
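For readers who do not use Stata, the same token-reordering idea can be sketched in Python. This is only an illustration of the logic (split on underscores, pick the parts in a new order, rejoin); the function name reorder_name and its order argument are made up for this example.

```python
def reorder_name(name, order=(4, 1, 3, 2)):
    """Rearrange the underscore-separated parts of a name.

    `order` lists 1-based positions, mirroring `4'_`1'_`3'_`2' in the loop above.
    """
    parts = name.split("_")
    return "_".join(parts[i - 1] for i in order)


old_names = ["MA_234_AAF_US", "AL_87665_ACH_USA", "TX_3_GH_US",
             "LA_689_KLO_US", "KY_3435_Z_USA"]
new_names = [reorder_name(n) for n in old_names]
print(new_names)
```

Like the Stata loop, this assumes every name has exactly four underscore-separated parts.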
In SAS DI when I connect a user written transformation to an output table, the variable _OUTPUT_connect is assigned. In my case it looks something like this:
%let _OUTPUT_connect = DEFER=YES READBUFF=25000 DBCLIENT_MAX_BYTES=1 DB_LENGTH_SEMANTICS_BYTE=NO PATH=MY_PATH AUTHDOMAIN="MY_AUTH_DOMAIN"
Now I'm trying to extract the PATH and AUTHDOMAIN variables from _OUTPUT_connect. My solution for now is the following:
%let _authdomain = %sysfunc(scan(&_OUTPUT_connect,7," "));
%let _path = %sysfunc(scan(%sysfunc(scan(&_OUTPUT_connect,5," ")),2,"="));
This works but it breaks if the order of the _OUTPUT_connect variables changes.
I thought I'd use regex to match the parameter values: PATH=[match_this] and AUTHDOMAIN="[match_this]", but I have problems parsing the variable _OUTPUT_connect because it contains double quotes. When I manually assign _OUTPUT_connect without the double quotes I can do the following
data _null_;
re = prxparse('/PATH=(\w)*/');
string = "&_OUTPUT_connect";
position = prxmatch(re, string);
put position=;
matched_pattern=prxposn(re, 0, string);
put matched_pattern=;
run;
Output:
position=75
matched_pattern=PATH=A1091211_SAS_SRV
The problem however is that _OUTPUT_connect contains double quotes, and the regex function fails when the input string contains double quotes. Since _OUTPUT_connect is assigned automatically, I cannot change the format.
I've tried to remove the double quotes from _OUTPUT_connect using %let unquoted =%sysfunc(translate(%quote(&test),' ','"'));. This works, but it puts a space in place of each double quote.
Is there an easy way to retrieve the values of PATH and AUTHDOMAIN from _OUTPUT_connect?
You can extract the name value pairs of the connection string by using SCAN with modifiers.
Example:
data nvps(label='name value pairs' keep=name value);
s = 'name1=value1 name2="value2" name3="value 3"';
do index = 1 to countw(s,' ','q');
nvp = scan(s,index,' ','q');
name = scan(nvp,1,'=','q');
value = scan(nvp,2,'=','q');
output;
end;
run;
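The same quote-aware splitting can be sketched outside SAS. Below is a rough Python equivalent, assuming the connection string looks like the one in the question; parse_pairs is a hypothetical helper name. shlex.split keeps quoted substrings together and strips the quotes, which plays roughly the role of the 'q' modifier in the SCAN calls above.

```python
import shlex


def parse_pairs(connect):
    """Return a dict of name=value pairs, honouring double-quoted values."""
    pairs = {}
    for token in shlex.split(connect):  # split on spaces outside quotes
        name, sep, value = token.partition("=")
        if sep:  # skip tokens that contain no '='
            pairs[name] = value
    return pairs


connect = ('DEFER=YES READBUFF=25000 DBCLIENT_MAX_BYTES=1 '
           'DB_LENGTH_SEMANTICS_BYTE=NO PATH=MY_PATH AUTHDOMAIN="MY_AUTH_DOMAIN"')
options = parse_pairs(connect)
print(options["PATH"], options["AUTHDOMAIN"])
```

Because the values are looked up by name, the result does not depend on the order of the options in the string.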
I have a string variable named talk. Let's say I want to find all instances of the word "please" in talk and, within each row, add a suffix to each "please" that contains an incrementing count of the word.
For example, if talk looks like this:
"will you please come here please do it as soon as you can if you please"
I want it to look like this instead:
"will you please1 come here please2 do it as soon as you can if you please3"
In other words, "please1" indicates that it's the first "please" to occur, "please2" is the second, etc.
I have written some code (below) using regex and several loops, but it doesn't work perfectly and, even if I could work out the kinks, it seems overly complicated. Is there a simpler way to do this?
// I first extract the portion of 'talk' beginning from the 1st please to the last
gen talk_pl = strtrim(stritrim(regexs(0))) if regexm(talk, "please.+please")
// I count the number of times "please" occurs in 'talk_pl'
egen count = noccur(talk_pl), string("please")
// in the loop below, i = 2nd-to-last occurrence; x = 3rd-to-last occurrence
qui levelsof count
foreach n in `r(levels)' {
local i = `n' -1
local x = `i' -1
replace talk_pl = regexrf(talk_pl, "please$", "please`n'") if count == `n'
replace talk_pl = regexrf(talk_pl, "please (?=.+?please`n')", "please`i' ") if count == `n'
replace talk_pl = regexrf(talk_pl, "please (?=.+?please`i')", "please`x' ") if count == `n'
}
* Example generated by -dataex-. To install: ssc install dataex
clear
input str71 talk
"will you please come here please do it as soon as you can if you please"
end
// Install egenmore if not installed already
* ssc install egenmore
clonevar wanted = talk
// count occurrences of "please"
egen countplease = noccur(talk), string(please)
// Loop over 1 to max number of occurrences
sum countplease, meanonly
forval i = 1/`r(max)' {
replace wanted = ustrregexrf(wanted, "\bplease\b", "please`i'")
}
list
+---------------------------------------------------------------------------------------+
1. | talk |
| will you please come here please do it as soon as you can if you please |
|---------------------------------------------------------------------------------------|
| wanted | countp~e |
| will you please1 come here please2 do it as soon as you can if you please3 | 3 |
+---------------------------------------------------------------------------------------+
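For comparison, here is a minimal Python sketch of the same idea, where a counter is advanced once per match instead of looping up to the maximum count. The function name number_occurrences is hypothetical and not part of the answer above.

```python
import itertools
import re


def number_occurrences(text, word="please"):
    """Append 1, 2, 3, ... to successive whole-word occurrences of `word`."""
    counter = itertools.count(1)  # restarts at 1 for every call, i.e. every row
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    # The callback runs once per match, left to right.
    return pattern.sub(lambda m: m.group(0) + str(next(counter)), text)


talk = "will you please come here please do it as soon as you can if you please"
wanted = number_occurrences(talk)
print(wanted)
```

The word boundaries matter for the same reason as in the Stata loop: an already numbered "please1" should not be matched again, and words that merely contain "please" are left alone.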
I'm trying to do something fairly simple (I think) but I can't get my head round it. I'm trying to write a loop that checks if a character variable in a data frame contains any of a certain list of substrings, and to assign a corresponding value to a dummy variable.
So, imagine a data.frame, n=2000, with a variable data.frame$text. Furthermore, I have a character vector containing all the substrings I want to test data.frame$text for. Let's call it hillary_exists :
hillary_exists <- c("Hilary Clinton", "hilary clinton","hilaryclinton", "hillaryclinton", "HilaryClinton",
"HillaryClinton","Hillary Clinton", "Hillary Rodham Clinton", "Hillary", "Hilary", "#Hillary2016", "#ImWithHer",
"Hillary2016", "hillary", "hilary", "Clinton 2016", "Clinton", "Secretary of State Clinton",
"Senator Clinton", "Hilary Rodham", "Hilary Rodham Clinton", "Hilary Rodham-Clinton", "Hillary Rodham-Clinton")
Now, I want my loop to test every row of data.frame$text for the existence of every element of hillary_exists, and if any of them is TRUE, to generate a new value of 1 for the variable data.frame$hillary_mention . This is what I tried:
for(i in hillary_exists){
if(grepl(hillary_exists[i], data.frame$text)){
data.frame$hillary_mention <- 1
} else {
data.frame$hillary_mention <- 0 }
}
But obviously I'm missing the i component for the data.frame$text element, but I don't know how to address it.
Any help would be greatly appreciated! Thanks
One way to get this to work is to turn hillary_exists into a single regex: hillary_regex <- paste(hillary_exists, collapse = "|"). Essentially, this takes all of your terms and turns them into one big OR statement, which takes care of one of the loops automatically. Next, we just loop over the text column, data.frame$text, using sapply.
data.frame$hillary_mention <- sapply(data.frame$text, function(s) grepl(hillary_regex, s, ignore.case = TRUE))
It's good to use ignore.case = TRUE here because there may be mentions in the text that aren't accounted for in hillary_exists, such as "hIllary cLinTon".
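The alternation trick is not specific to R. As a rough Python sketch (the shortened terms list and the sample texts are invented for illustration), the terms are escaped, joined with "|", and matched case-insensitively to produce a 0/1 dummy per row:

```python
import re

# Abbreviated, illustrative subset of the search terms from the question.
terms = ["Hillary Clinton", "Hilary", "#ImWithHer", "Clinton 2016"]

# Escape each term so characters such as '#' or '.' are matched literally,
# then join everything into one big OR pattern.
pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


def mentions(texts):
    """Return 1 for each text that contains any of the terms, else 0."""
    return [1 if pattern.search(t) else 0 for t in texts]


texts = ["I saw hIllary cLinTon speak", "Nothing relevant here", "#imwithher"]
flags = mentions(texts)
print(flags)
```

Escaping is an extra precaution in this sketch; the R one-liner above pastes the terms in unescaped, which works for the terms listed in the question.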
Assume a data frame has many columns that all say “bonus”. The goal is to rename each bonus column uniquely with an appended number. Example data:
string <- c("bonus", "bonus", "bonus", "bonus")
string
[1] "bonus" "bonus" "bonus" "bonus"
Desired column name output:
[1] "bonus1" "bonus2" "bonus3" "bonus4"
Assume you don’t know how many bonus columns there are, so you cannot simply paste the numbers from 1 to that count onto the bonus column names.
The following approach works but seems inelegant and seems too hard-coded:
bonus.count <- nrow(count(grep(pattern = "bonus", x = string)))
string.numbered <- paste0(string, seq(from = 1, to = bonus.count, 1))
How can the gsub function (or another regex-based function) substitute an incremented number? Along the lines of
string.gsub.numbered <- gsub(pattern = "bonus", replacement = "bonusincremented by one until no more bonuses", x = string)
As far as I know, gsub can't run any sort of function over each result, but using regexpr and regmatches makes this pretty easy
string <- c("bonus", "bonus", "bonus", "bonus")
m <- regexpr("bonus",string)
regmatches(string,m) <- paste0(regmatches(string,m), 1:length(m))
string
# [1] "bonus1" "bonus2" "bonus3" "bonus4"
The nice part is that regmatches allows for assignment so it's easy to swap out the matched values.
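If the bonus columns are mixed in with other column names, the counter should advance only on the names that match. A small Python sketch of that variant follows; the helper name number_matches and the sample column names are assumptions for illustration, not part of the answer above.

```python
import re


def number_matches(names, pattern="bonus"):
    """Append a running count to the first match in each matching name."""
    numbered, n = [], 0
    for name in names:
        if re.search(pattern, name):
            n += 1  # advance only when this name actually matches
            name = re.sub(pattern, lambda m: m.group(0) + str(n), name, count=1)
        numbered.append(name)
    return numbered


columns = ["id", "bonus", "salary", "bonus", "bonus"]
renamed = number_matches(columns)
print(renamed)
```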
1) Using string defined in the question we can write:
paste0(string, seq_along(string))
2) If what you really have is something like this:
string2 <- "As a bonus we got a bonus coupon."
and you want to change that to "As a bonus1 we got a bonus2 coupon." then gsubfn in the gsubfn package can do that. Below, the fun method of the p proto object will be applied to each occurrence of "bonus" with count automatically incremented. The proto object p automatically saves the state of count between matches to allow this:
library(gsubfn)
string2 <- "As a bonus we got a bonus coupon." # test data
p <- proto(fun = function(this, x) paste0(x, count))
gsubfn("bonus", p, string2)
giving:
[1] "As a bonus1 we got a bonus2 coupon."
There are additional examples in the proto vignette.
I am trying to read a file that looks as follows:
Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
Channel 2: F7-T7
Channel 3: T7-P7
Channel 4: P7-O1
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
So far I have this code:
fid1= fopen('chb01-summary.txt')
data=struct('id',{},'stime',{},'etime',{},'seizenum',{},'sseize',{},'eseize',{});
if fid1 ==-1
error('File cannot be opened ')
end
tline= fgetl(fid1);
while ischar(tline)
i=1;
disp(tline);
end
I want to use regexp to find the expressions and so I did:
line1 = '(.*\d{2} (\.edf)'
data{1} = regexp(tline, line1);
tline=fgetl(fid1);
time = '^Time: .*\d{2]}: \d{2} :\d{2}' ;
data{2}= regexp(tline,time);
tline=getl(fid1);
seizure = '^File: .*\d';
data{4}= regexp(tline,seizure);
if data{4}>0
stime = '^Time: .*\d{5}';
tline=getl(fid1);
data{5}= regexp(tline,seizure);
tline= getl(fid1);
data{6}= regexp(tline,seizure);
end
I tried using a loop to find the line at which file name starts with:
for (firstline<1) || (firstline>1 )
firstline= strfind(tline, 'File Name')
tline=fgetl(fid1);
end
and now I'm stumped.
Suppose that I am at the line at which the information is there, how do I store the information with regexp? I got an empty array for data after running the code once...
Thanks in advance.
I find it the easiest to read the lines into a cell array first using textscan:
%// Read lines as strings
fid = fopen('input.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
and then apply regexp on it to do the rest of the manipulations:
%// Parse field names and values
C = regexp(C{:}, '^\s*([^:]+)\s*:\s*(.+)\s*', 'tokens');
C = [C{:}]; %// Flatten the cell array
C = reshape([C{:}], 2, []); %// Reshape into name-value pairs
Now you have a cell array C of field names and their corresponding (string) values, and all you have to do is plug it into struct in the correct syntax (using a comma-separated list in this case). Note that the field names have spaces in them, so this needs to be taken care of before they can be used (e.g. replace them with underscores):
C(1, :) = strrep(C(1, :), ' ', '_'); %// Replace spaces with underscores
data = struct(C{:});
Here's what I get for your input file:
data =
Data_Sampling_Rate: '256 Hz'
Channel_1: 'FP1-F7'
Channel_2: 'F7-T7'
Channel_3: 'T7-P7'
Channel_4: 'P7-O1'
File_Name: 'chb01_03.edf'
File_Start_Time: '13:43:04'
File_End_Time: '14:43:04'
Number_of_Seizures_in_File: '1'
Seizure_Start_Time: '2996 seconds'
Seizure_End_Time: '3036 seconds'
Of course, it is possible to prettify it even more by converting all relevant numbers to numerical values, grouping the 'channel' fields together and such, but I'll leave this to you. Good luck!
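One possible way to do that grouping, sketched in Python rather than MATLAB and only as an illustration: reuse the same "name: value" pattern, but start a new record each time a File Name line appears, so every file keeps its own times and seizure information. The function name parse_summary and the header/files split are assumptions made for this sketch.

```python
import re

# Same idea as the regexp above: "name: value", ignoring surrounding spaces.
LINE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_summary(lines):
    """Split the summary into header fields and one dict per file."""
    header, files = {}, []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # separator rows and headings without a value
        key, value = m.group(1).replace(" ", "_"), m.group(2)
        if key == "File_Name":
            files.append({})  # a new file record starts here
        target = files[-1] if files else header
        target[key] = value
    return header, files


sample = """Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds""".splitlines()

header, files = parse_summary(sample)
print(header)
print(files)
```

The values stay as strings here; converting counts and times to numbers would be a further step.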