I have data from a chat that I want to read in one entry at a time. Every time a person hits "send" should be one observation. The problem is line breaks (Enter) inside a message: I can't get SAS to keep reading the continuation line as part of the same observation. Here is some dummy data:
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)
How are you then?
08:26 - Greg: All good.
I want this to be 5 observations, but I can only get SAS to read it as 7 obs. The desired dataset should look like:
Obs VAR1
1 08:23 - Greg: Hi!
2 08:24 - Sue: Hello
3 08:24 - Greg: How are you?
4 08:25 - Sue: Just fine :) How are you then?
5 08:26 - Greg: All good.
I have been playing around with this code:
data testing;
infile datalines ;
input var1 $60. ;
datalines;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)
How are you then?
08:26 - Greg: All good.
;
But the actual file is a .txt file and has more irregularities than the dummy example above. I have tried to use the trailing @ but can't get it to work the way I want. Maybe the trailing @ is not what I am after. Any suggestions on how to proceed?
Try this.
Keep a running variable that accumulates the current message. If the current line has a time stamp (a colon) in its first 4 characters, output the running variable and reset it to "". Then append the current line to the running variable. Finally, output the last message, no matter what.
data testing(keep=line);
set testing end=last;
format line $2000.;
retain line;
if _n_ > 1 then do;
if index(substr(var1,1,4),":") then do;
output;
line = "";
end;
end;
put line= var1=;
line = catx(" ",line , var1);
put line=;
if last then do;
output;
put "AT LAST";
end;
run;
I tried unsuccessfully to find a solution at the raw data input stage. Anyway, I hope this approach, which post-processes the strings, will be useful for you:
data testing;
infile datalines ;
input var1 $60.;
datalines;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)
How are you then?
08:26 - Greg: All good.
;
data testing01;
set testing;
retain row 0;
/* ?? suppresses invalid-data notes for lines that do not start with digits */
if 0 le input(substr(var1,1,2),?? 8.) le 23
and substr(var1,3,1)=':'
and 0 le input(substr(var1,4,2),?? 8.) le 59 then row = row+1;
run;
proc transpose data=testing01 out=testing02;
var var1;
by row;
run;
data testing03;
length final $2000;
set testing02;
array str[*] col:;
do i=1 to dim(str);
if str[i] ne '' then final=catx(' ',final,str[i]);
end;
drop col: row i _name_;
run;
filename FT15F001 temp;
data testing ;
infile FT15F001 end=eof ;
length string $6323;
retain string;
input @;
if _n_=1 then string=_infile_;
else if not missing(_infile_) and anydigit(_infile_)^=1 then string=catx(' ',string,_infile_);
else if not missing(_infile_) and anydigit(_infile_)=1 then do;
output;
call missing(string);
string=_infile_;
end;
if eof then output;
PARMCARDS;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)
How are you then?
08:26 - Greg: All good.
;
There are a lot of ways to do this, depending on your particular use case.
Here's a regular expression one. This won't work if you have > 32767 total characters, unless you have some way to split it into chunks, but for smaller files works well; and the general approach can be used even if you read in a line at a time.
data test;
infile "c:\temp\chat.txt" recfm=f lrecl=32767;
input @;
rx_find = prxparse('~(\d\d:\d\d -.*?)(?=(?:\b\d\d:\d\d)|$)~ios');
rc_find = prxmatch(rx_find,_infile_);
pos=1;
pos2=0;
start=1;
call prxposn(rx_find,1,pos,len);
do until (pos2=0);
call prxposn(rx_find,1,pos,len);
found=substr(_infile_,pos,len);
output;
start=pos+len;
call prxnext(rx_find,start,-1,_infile_,pos2,len2);
end;
stop;
run;
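The line-at-a-time variant mentioned above could look roughly like the following. This is only a sketch: it assumes every new message starts with hh:mm followed by " - ", and the fileref chat, the dataset chat_msgs and the variable message are made-up names for illustration. The dummy data is written to a temporary file first so that END= can be used on the INFILE statement.

```sas
/* Write the dummy chat to a temporary file (stand-in for the real .txt) */
filename chat temp;
data _null_;
  file chat;
  input;
  put _infile_;
datalines;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)
How are you then?
08:26 - Greg: All good.
;

/* Read one line at a time and glue continuation lines onto the message */
data chat_msgs(keep=message);
  infile chat truncover end=eof;
  length message $2000;
  retain message;
  input;
  /* Assumption: a new message always begins with hh:mm followed by " - " */
  if prxmatch('/^\d\d:\d\d - /', _infile_) then do;
    if not missing(message) then output;  /* flush the previous message */
    message = _infile_;
  end;
  else message = catx(' ', message, _infile_);  /* continuation line */
  if eof then output;  /* flush the last message */
run;

proc print data=chat_msgs;
run;
```

Anchoring the pattern at the start of the line is stricter than checking for a leading digit, so a continuation line that happens to begin with a number is less likely to be mistaken for a new message.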
This is a follow-up of my previous question:
How to import a txt file with single quote mark in a variable and another in another variable.
The solution there works perfectly as long as there is no variable whose values can be null.
In this latter case, I get:
filename sample 'c:\temp\sample.txt';
data _null_;
file sample;
input;
put _infile_;
datalines;
001|This variable could be null|PROVA|MILANO|1000
002||'80S WERE GREAT|FORLI'|1100
003||'80S WERE GREAT|ROMA|1110
;
data want;
data prova;
infile sample dlm='|' lrecl=50 truncover;
format
codice $3.
could_be_null $20.
nome $20.
luogo $20.
importo 4.
;
input
codice
could_be_null
nome
luogo
importo
;
putlog _infile_;
run;
proc print;
run;
Is it possible to correctly load a file like the one in the example directly in SAS, without manually modifying the original .txt?
You will need to pre-process the file to fix the issue.
If you add quotes around the values then you will not have the problem.
002||"'80S WERE GREAT"|"FORLI'"|1100
If you know that none of the values contain the delimiter, then adding a space before every delimiter
002 | |'80S WERE GREAT |FORLI' |1100
will let you read it without the DSD option.
If lines are shorter than 32K bytes then it can be done in the same step that reads the data.
data test2 ;
infile sample dlm='|' truncover ;
input @;
_infile_ = tranwrd(_infile_,'|',' |');
input (var1-var5) (:$40.);
run;
proc print;
run;
Results:
Obs var1 var2 var3 var4 var5
1 001 This variable could be null PROVA MILANO 1000
2 002 '80S WERE GREAT FORLI' 1100
3 003 '80S WERE GREAT ROMA 1110
One way to test if you have the issue is to make sure each line has the right number of fields.
filename sample temp;
options parmcards=sample;
parmcards;
001|This variable could be null|PROVA|MILANO|1000
002||'80S WERE GREAT|FORLI'|1100
003||'80S WERE GREAT|ROMA|1110
;
data _null_;
infile sample dsd end=eof;
if eof then do;
call symputx('nfound',nfound);
putlog / 'Found ' nfound :comma11.
'problem lines out of ' _n_ :comma11. 'lines.'
;
end;
input;
retain expect nfound;
words=countw(_infile_,'|','qm');
if _n_=1 then expect=words;
else if expect ne words then do;
nfound+1;
if nfound <= 10 then do;
putlog (_n_ expect words) (=) ;
list;
end;
end;
run;
Example Results:
_N_=2 expect=5 words=4
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
2 002||'80S WERE GREAT|FORLI'|1100 32
_N_=3 expect=5 words=3
3 003||'80S WERE GREAT|ROMA|1110 30
Found 2 problem lines out of 4 lines.
PS Go tell SAS to enhance their delimited file processing: https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-file-variations/idi-p/435977
You need to add the DSD option to your INFILE statement.
https://support.sas.com/techsup/technote/ts673.pdf
DSD (delimiter-sensitive data) option—Specifies that SAS should treat
delimiters within a data value as character data when the delimiters
and the data value are enclosed in quotation marks. As a result, SAS
does not split the string into multiple variables and the quotation
marks are removed before the variable is stored. When the DSD option
is specified and SAS encounters consecutive delimiters, the software
treats those delimiters as missing values. You can change the default
delimiter for the DSD option with the DELIMITER= option.
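A small sketch of the behaviour the quoted passage describes, using the double-quoted form of line 002 suggested in the other answer. The dataset name dsd_demo and the variable lengths are assumptions made for illustration. Note that this only shows DSD on input whose quotes are balanced; the unquoted lines with a lone single quote from the question are a different matter.

```sas
data dsd_demo;
  /* DSD: consecutive delimiters mean a missing value,
     and enclosing quotation marks are removed from the stored value */
  infile datalines dsd dlm='|' truncover;
  length codice $3 could_be_null $30 nome $20 luogo $20;
  input codice could_be_null nome luogo importo;
datalines;
001|This variable could be null|PROVA|MILANO|1000
002||"'80S WERE GREAT"|"FORLI'"|1100
;

proc print data=dsd_demo;
run;
```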
I figured out the solution to my problem already, but I'd like to know what is happening exactly, and why, or maybe if there is a workaround to the following:
Suppose you have:
data test;
length group $20.;
subject=1; hours=0; group= 'hour 1'; output;
subject=1; hours=1; group= 'hour 15'; output;
subject=1; hours=2; group= 'hour 15'; output;
subject=2; hours=0; group= 'hour 1'; output;
subject=2; hours=1; group= 'hour 15'; output;
subject=2; hours=2; group= 'hour 15'; output;
run;
And you are sorting on the hours first, then group because it is character and wouldn't properly sort otherwise.
proc sort data=test;
by subject hours group;
run;
Now when you run this code to retrieve only the first record of each group:
data test2;
set test;
by subject hours group;
if first.group;
run;
It will keep every record.
I recently learned that 'When you use more than one variable in the BY statement; If the first/last variable linked to a primary BY-variable changes to 1, the first/last variable linked to the second BY-variable will also be changed to one.'.
So of course, because the hours variable changes, the first/last from group is also reset.
So 'why' is this code running fine?
data test2;
set test;
by subject group;
if first.group;
run;
It seems a bit weird to have to leave out variables you sorted on, and it doesn't seem very flexible: you can't use a macro variable list as input to both the sort and the BY statement in a data step, for example. If this is just the way it is, is there another preferred way of doing this kind of operation? I can see myself making this error often, just by copy-pasting the list of sorting variables.
If you want to use a BY statement to generate FIRST. and LAST. variables for a grouped variable that is not actually sorted then use the NOTSORTED keyword on the BY statement.
For example you might want to order the data by HOUR and then group it by the STATUS so that you can find out at what hour they transitioned to that STATUS.
data have;
input subject hour status $;
cards;
1 0 C
1 1 B
1 2 B
1 3 D
2 0 A
2 1 D
2 2 D
;
data want ;
set have ;
by subject status notsorted;
if first.status;
run;
Result:
Obs subject hour status
1 1 0 C
2 1 1 B
3 1 3 D
4 2 0 A
5 2 1 D
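To see why the step with BY subject hours group kept every record, it can help to write the FIRST. flags to the log. A minimal sketch, rebuilding the test data from the question (it is already in sorted order, so no PROC SORT is needed here):

```sas
data test;
  length group $20;
  subject=1; hours=0; group='hour 1';  output;
  subject=1; hours=1; group='hour 15'; output;
  subject=1; hours=2; group='hour 15'; output;
  subject=2; hours=0; group='hour 1';  output;
  subject=2; hours=1; group='hour 15'; output;
  subject=2; hours=2; group='hour 15'; output;
run;

/* Three BY variables: hours changes on every row, so first.group
   is set on every row as well */
data _null_;
  set test;
  by subject hours group;
  put subject= hours= group= first.hours= first.group=;
run;

/* Two BY variables: first.group is set only when subject or group changes */
data _null_;
  set test;
  by subject group;
  put subject= hours= group= first.group=;
run;
```

Comparing the two log listings shows that FIRST.group depends on every BY variable listed to its left, not only on group itself.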
I have created the following SAS table:
DATA test;
INPUT name$ Group_Number;
CARDS;
Joseph 1
Stephanie 2
Linda 3
James 1
Jane 2
;
run;
I would like to change group number from a character type into a numeric type.
Here is my attempt:
data test2;
set test;
Group_Number1 = input(Group_Number, best5.);
run;
The problem is that when I execute:
proc contents data = test2;
run;
The output table shows that group number is still of a character type. I think the problem may be that I have "best5." in my INPUT function call, but I am not 100% sure what is wrong.
How can I fix this?
If you have a character variable your code will work. But you don't, you have a numeric variable in your sample data. So either your fake data is incorrect, or you don't have the problem you think you do.
Here's an example that you can run to see this.
*read group_number as numeric;
DATA test_num;
INPUT name$ Group_Number;
CARDS;
Joseph 1
Stephanie 2
Linda 3
James 1
Jane 2
;
run;
Title 'Group_Number is Numeric!';
proc contents data=test_num;
run;
*read group_number as character;
DATA test_char;
INPUT name$ Group_Number $;
CARDS;
Joseph 1
Stephanie 2
Linda 3
James 1
Jane 2
;
run;
data test_converted;
set test_char;
group_number_num = input(group_number, 8.);
run;
Title 'Group_Number is Character, Group_Number_Num is Numeric';
proc contents data=test_converted;
run;
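If you only want to check the type before converting, the VTYPE function is a lighter alternative to PROC CONTENTS. A minimal sketch, rebuilding test_char from above; the variable name type is made up for illustration:

```sas
data test_char;
  input name $ Group_Number $;
  cards;
Joseph 1
Stephanie 2
Linda 3
James 1
Jane 2
;
run;

data _null_;
  set test_char(obs=1);
  /* VTYPE returns C for a character variable and N for a numeric one */
  type = vtype(Group_Number);
  put type=;
run;
```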
Try this:
data test2;
set test;
Group_Number1 = input(put(Group_Number,best5.),best5.);
run;
I understand how to use pointer control to search for a phrase in the raw data and then read the value into a SAS variable. I need to know how to tell SAS to stop reading the raw data when it encounters a particular phrase.
For example in the below code I want to read the data only between phrases Start and Stop. So the Jelly should not be part of the output
data work.two;
input @"1" Name :$32.;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
You cannot really combine those into a single pass through the file. The problem is that the @'1' will skip past the line with STOP in it, so there is no way your data step will see it.
Pre-process the file.
filename copy temp;
data _null_;
file copy ;
retain start 0 ;
input ;
if index(_infile_,'Start') then start=1;
if start then put _infile_;
if index(_infile_,'Stop') then stop;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
data work.two;
infile copy ;
input @"1" Name :$32. @@;
run;
You can make the logic to detect what parts of the source file to include as complex as you need.
All names are in the second position from the right of each row, so the name can be extracted with the SCAN function; if the row contains 'Stop', the loop stops.
data work.two;
input @@;
Name=scan(_infile_,-2);
if indexw(_infile_,'Stop')>0 then stop;
input;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
I have a dataset which has 1 column and n rows like this:
Dataset1:
Column1
--------
AAA AAA
BBB BBB
CCC CCC
DDD DDD
EEE EEE
I want to make from this data 1 row like:
"AAA AAA"n "BBB BBB"n "CCC CCC"n "DDD DDD"n "EEE EEE"n
I will do this in a macro.
I tried the CATX function, but it removes spaces from the data.
I used a DO block like this:
.
.
.
.
data _NULL_;
if &I^=1 then do;
frstclmn=&frstclmn||""""||"&clmn"||""""||"n ";
end;
run;
.
.
.
But I couldn't assign a variable to itself inside the DO block in the data step.
How can I do this? Thanks
Edit:
%MACRO result;
data _NULL_;
set &LIB_NAME..column_list;
retain namelist;
length namelist $5000;
namelist=catx(' ',namelist,cats('"',name,'"n'));
run;
---how can I use "namelist" variable here ? out of data statement.---
%MEND result;
This code runs perfectly. Now I want to use the namelist variable outside this data step. If I print it with %put &namelist=; it shows the wrong result in the macro. I want to use the value of this variable in other statements of the macro.
It's not clear to me what output you seek. Perhaps this will give you some hints.
data names;
input name $32.;
cards;
AAA AAA
BBB BBB
CCC CCC
DDD DDD
EEE EEE
;;;;
run;
proc sql noprint;
select nliteral(name) into :namelist separated by ' ' from names;
quit;
run;
%put NOTE: &=namelist;
NOTE: NAMELIST="AAA AAA"N "BBB BBB"N "CCC CCC"N "DDD DDD"N "EEE EEE"N
The sql method data _null_ shows above is the better method, but if you're going to do it in data step, use ' ' as your delimiter.
data _NULL_;
set sashelp.class;
retain namelist;
length namelist $500;
namelist=catx(' ',namelist,cats('"',name,'"n'));
put namelist=;
run;
Of course you could use quote, or nliteral, both to better effect.
data _NULL_;
set sashelp.class;
retain namelist;
length namelist $500;
namelist=catx(' ',namelist,nliteral(name));
put namelist=;
run;
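To make the data step result available outside the step, as the question asks, one option is to copy the final value into a macro variable with CALL SYMPUTX on the last observation. A sketch under the same assumptions as the step above (sashelp.class as input, namelist as the macro variable name):

```sas
data _null_;
  set sashelp.class end=last;
  length namelist $500;
  retain namelist;
  namelist = catx(' ', namelist, nliteral(name));
  /* On the last row, store the accumulated list in macro variable NAMELIST */
  if last then call symputx('namelist', namelist);
run;

%put NOTE: &=namelist;
```

Inside a macro, CALL SYMPUTX also accepts a third argument ('G' or 'L') to control whether the macro variable is created in the global or the local symbol table.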