I was wondering if it is possible to split a variable into multiple variables. I have one big variable that I would like to split into multiple columns.
"0060175052";"Three Chords and the Truth: Hope, Heartbreak, and Changing Fortunes in Nashville";"Laurence Leamer";"1997","Harpercollins"
I would like to split it where the semicolons are. So:
v1 = "0060175052"
v2 = "Three Chords and the Truth: Hope, Heartbreak, and Changing Fortunes in Nashville"
v3 = "Laurence Leamer"
v4 = "1997"
v5 = "Harpercollins"
Sofie:
How you split the value depends on where it is coming from.
Reading a data file
For the case of reading a text file with an INPUT statement you indicate the delimiter in the INFILE statement, for example:
INFILE *input-file* DSD DLM=';';
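For readers who want to check the expected field boundaries outside SAS, here is a minimal Python sketch of the same idea. It assumes every field in the record is separated by a semicolon and wrapped in double quotes; the `line` value is the sample from the question written that way, and Python's `csv` module stands in for the INFILE options (it is an analogy, not what SAS does internally).

```python
import csv
import io

# Sample record from the question, assumed to use ';' between every field.
line = ('"0060175052";"Three Chords and the Truth: Hope, Heartbreak, and '
        'Changing Fortunes in Nashville";"Laurence Leamer";"1997";"Harpercollins"')

# delimiter=';' plays the role of DLM=';'. The default quote character '"'
# removes the surrounding quotes, roughly as DSD does.
reader = csv.reader(io.StringIO(line), delimiter=";")
fields = next(reader)

for number, field in enumerate(fields, start=1):
    print(f"v{number} = {field}")
```

The colon and commas inside the title do not split the field, because only the semicolon is treated as a delimiter.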
Parsing a data value
For the case of a variable in an existing data set the SCAN function can extract different parts of a string.
v1 = SCAN (big_variable, 1, ';');
...
v5 = SCAN (big_variable, 5, ';');
If the big variable values can contain consecutive semi-colons that indicate a blank value you will need to use the M modifier in the modifiers argument. For example:
v1 = SCAN (big_variable, 1, ';', 'M');
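The difference the M modifier makes can be sketched in Python. The `scan` helper below is hypothetical and only imitates the two behaviours described above (consecutive delimiters collapsed versus kept as empty values); it does not reproduce the full SAS function.

```python
def scan(text, n, delimiter=";", keep_empty=False):
    """Return the n-th (1-based) delimited piece of text, or '' if absent."""
    parts = text.split(delimiter)
    if not keep_empty:
        # Imitate the default: consecutive delimiters act as a single one.
        parts = [p for p in parts if p != ""]
    return parts[n - 1] if 1 <= n <= len(parts) else ""


record = "A;;C"
second_default = scan(record, 2)                     # 'C'
second_modified = scan(record, 2, keep_empty=True)   # '' (the blank value)
print(repr(second_default), repr(second_modified))
```

Without the modifier the blank second value disappears and every later piece shifts one position to the left, which is usually not what you want for column-like data.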
For only five parts you probably don't need to array-ify the process. If the string has many parts to split, an array would be used to reduce coding repetition:
attrib v1-v20 length=$200;
array v v1-v20;
do index = 1 to dim(v);
v(index) = SCAN (big_variable, index, ';');
end;
More advanced scanning techniques would use Perl regular expressions as surfaced by the SAS PRX* call routines and functions -- such as PRXPARSE, PRXMATCH, PRXNEXT, etc...
I have data in Stata with information coded as string words. I need to keep a row if any variable is named "R45851"; if not, I do not need the row at all.
I think this is about finding observations (rows, in spreadsheet jargon) with a value (not a name) within one or more variables. One recipe is
gen found = 0
foreach v of var I* {
replace found = (`v' == "R45851") if !found
}
list if found
If the varlist (here I*) contains numeric variables, filter first, or put capture in front of the replace.
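The same "flag a row if any of several string variables equals the value" logic, sketched in Python for comparison. The rows and the column names `I1` and `I2` are invented for illustration, since the original data was only shown as an image.

```python
# Invented sample rows; only the columns starting with "I" are searched,
# mirroring the varlist I* in the Stata recipe.
rows = [
    {"id": 1, "I1": "R45851", "I2": "K219"},
    {"id": 2, "I1": "J189", "I2": "E119"},
    {"id": 3, "I1": "A419", "I2": "R45851"},
]
target = "R45851"

columns = [name for name in rows[0] if name.startswith("I")]

# Keep a row as soon as any of the selected columns matches the target.
found = [row for row in rows if any(row[c] == target for c in columns)]

for row in found:
    print(row)
```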
I have a data step where I have a few columns that need tied to one other column.
I have tried using multiple "from" statements and " to" statements and a couple other permutations of that, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns showing bill accounts in the "from" column, and the other values in the "to", but instead I get a bill account and its customer number, with some "..."'s for the other values.
Issue
This is most likely because SAS has two datatypes and the first time the to variable is set up, it has the value of customer_number. At your second to statement you attempt to set to to have the value of email_addr. Assuming email_addr is a character variable, two things can happen here:
Customer_number is a number - to has already been set up as a number, so SAS cannot force to to become a character. An error like this may appear:
NOTE: Invalid numeric data, 'me#mywebsite.com' , at line 15 column 8.
to=. _ERROR_=1 _N_=1
Customer_number is a character - to has been set up as a character, but without explicitly defining its length, if it happens to be shorter than the value of email_addr then the email address will be truncated. SAS will not show an error if this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me#mydomain.com';
put to=;
run;
to=me#m
to is set with a length of 4, and SAS does not expand it to fit the new data.
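As a rough mental model only, the truncation can be imitated in Python with a toy class. `FixedChar` is a hypothetical helper, not anything SAS provides, and it fixes the length at the first assignment, whereas SAS decides it when the step is compiled.

```python
class FixedChar:
    """Toy model of a fixed-length character variable (hypothetical helper)."""

    def __init__(self):
        self.length = None  # unknown until the first assignment
        self.value = ""

    def assign(self, text):
        if self.length is None:
            self.length = len(text)  # the first assignment fixes the length
        # Later values are truncated, or blank-padded, to that length.
        self.value = text[:self.length].ljust(self.length)
        return self.value


to = FixedChar()
to.assign("hiya")
print(to.assign("me#mydomain.com"))  # prints me#m
```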
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location
The set statement adds the variables from the first observation of the dataset specified to a space in memory called the PDV, inheriting lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
===================================================================
010101 | 758|me#my.com |John Smith
The to statement adds another variable inheriting the characteristics of customer_number
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |758
(to is either char length 3 or a numeric)
Subsequent to statements will not alter the characteristics of the variable and SAS will continue processing
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |me#
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from, because as far as I can see, you only ever get the values for this variable from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;
First I have created this table:
data rmlib.tableXML;
input XMLCol1 $ 1-10 XMLCol2 $ 11-20 XMLCol3 $ 21-30 XMLCol4 $ 31-40 XMLCol5 $ 41-50 XMLCol6 $ 51-60;
datalines;
| AAAAA A||AABAAAAA|| BAAAAA|| AAAAAA||AAAAAAA ||AAAA |
;
run;
I want to clean, concatenate and export. I have written the following code
data rmlib.tableXML_LARGO;
file CleanXML lrecl=90000;
set rmlib.tableXML;
array XMLCol{6} ;
array bits{6};
array sqlvars{6};
do i = 1 to 6;
*bits{i}=%largo(XMLCol{i})-2;
%let bits =input(%largo(XMLCol{i})-2,comma16.5);
sqlvars{i} = substr(XMLCol{i},2,&bits.);
put sqlvars{i} &char10.. @;
end;
run;
The macro largo counts how many characters I have:
%macro largo(num);
length(put(&num.,32500.))
%mend;
What I need, instead of a fixed char10, is for this number (10) to be the length of each string, so that I have something like
put sqlvars{i} &char&bits.. @;
I don't know if it is possible, but I can't get it to work.
I would like to see something like
AAAAA AAABAAAAA BAAAAA AAAAAAAAAAAAA AAAA
It is important to me to keep the spaces (this is only an example taken from an XML extract). In addition, I will replace (for example) "B" with "XPM", so the size will change after cleaning the text; that is why I need the char length to be flexible.
Thank you for your time
Julen
I'm still not quite sure what you want to achieve, but if you want to combine the text from multiple variables into one variable, then you could do something along these lines:
proc sql;
select name into :names separated by '||'
from dictionary.columns
where 1=1
and upcase(libname)='YOURLIBNAME'
and upcase(memname)='YOURTABLENAME';
quit;
data work.testing;
length resultvar $ 32000;
set YOURLIBNAME.YOURTABLENAME;
resultvar = &names;
resultvar2 = compress(resultvar,'|');
run;
I wasn't able to test this, but it should work if you replace YOURLIBNAME and YOURTABLENAME with your own library and table names. I'm not 100% sure that compress will preserve the spaces in the text, but I think it should.
The format $VARYING. <length-variable> is a good candidate for solving this output problem.
The example below presumes that you have a number of variables whose values are bounded by vertical bars, and that you want to write the concatenation of the values, without the bounding bars, to a file.
data have;
file "c:\temp\want.txt" lrecl=9000;
length xmlcol1-xmlcol6 $100;
array values xmlcol1-xmlcol6 ;
xmlcol1 = '| A |';
xmlcol2 = '|A BB|';
xmlcol3 = '|A BB|';
xmlcol4 = '|A BBXC|';
xmlcol5 = '|DD |';
xmlcol6 = '| ZZZ |';
do index = 1 to dim(values);
value = substr(values[index], 2); * ignore presumed opening vertical bar;
value_length = length(value)-1; * length with still presumed closing vertical bar excluded;
put value $varying. value_length @; * send to file the value excluding the presumed closing vertical bar;
end;
run;
You have some coding errors that make it difficult to understand what you want to do.
Your %largo() macro doesn't make any sense. There is no format 32500. The only reason it runs in your code is that you are applying the format to a character variable instead of a number, so SAS automatically converts it to the $32500. format instead.
The %LET statement that you have hidden in the middle of your data step will execute BEFORE the data step runs. So it would be less confusing to move it before the data step.
Once the call to %largo() has been replaced, your macro variable BITS will contain this text:
%let bits =input(length(put(XMLCol{i},32500.))-2,comma16.5);
Which you then use inside a line of code. So that line will end up being this SAS code.
sqlvars{i} = substr(XMLCol{i},2,input(length(put(XMLCol{i},$32500.))-2,comma16.5));
Which seems to me to be a really roundabout way to do this:
sqlvars{i} = substr(XMLCol{i},2,length(XMLCol{i})-2);
Since SAS stores character variables as fixed length, it will pad the value stored. So what you need to do is to remember the length so that you can use it later when you write out the value. So perhaps you should just create another array of numeric variables where you can store the lengths.
sqllen{i} = length(XMLCol{i})-2;
sqlvars{i} = substr(XMLCol{i},2,sqllen{i});
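The clean-then-concatenate logic can also be sketched in Python, where strings are not padded, so no separate length has to be remembered. The sample values and the "B" to "XPM" replacement are taken from this thread; `strip_bars` is a hypothetical helper name.

```python
# Bar-bounded sample values borrowed from the thread.
values = ["| A |", "|A BB|", "|A BB|", "|A BBXC|", "|DD |", "| ZZZ |"]


def strip_bars(value):
    """Drop the presumed opening and closing bar, keeping inner spaces."""
    if len(value) >= 2 and value[0] == "|" and value[-1] == "|":
        return value[1:-1]
    return value


# Clean each value; the replacement changes the length, which is harmless
# here because each piece keeps its own length.
cleaned = [strip_bars(v).replace("B", "XPM") for v in values]

# Concatenate without any separator so leading and trailing spaces survive.
record = "".join(cleaned)
print(repr(record))
```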
I'm trying to create a code that will open a file with a list of numbers in it and then take those numbers and smooth them as many times as the user wants. I have it opening and reading the file, but it will not transpose the numbers. In this format it gives this error: TypeError: unsupported operand type(s) for /: 'str' and 'float'. I also need to figure out how to make it transpose the numbers the amount of times the user asks it to. The list of numbers I used in my .txt file is [3, 8, 5, 7, 1].
Here is exactly what I am trying to get it to do:
Ask the user for a filename
Read all floating point data from file into a list
Ask the user how many smoothing passes to make
Display smoothed results with two decimal places
Use functions where appropriate
Algorithm:
Never change the first or last value
Compute new values for all other values by averaging the value with its two neighbors
Here is what I have so far:
filename = raw_input('What is the filename?: ')
inFile = open(filename)
data = inFile.read()
print data
data2 = data[:]
print data2
data2[1]=(data[0]+data[1]+data[2])/3.0
print data2
data2[2]=(data[1]+data[2]+data[3])/3.0
print data2
data2[3]=(data[2]+data[3]+data[4])/3.0
print data2
You almost certainly don't want to be manually indexing the list items. Instead, use a loop:
data2 = data[:]
for i in range(1, len(data)-1):
    data2[i] = sum(data[i-1:i+2])/3.0
data = data2
You can then put that code inside another loop, so that you smooth repeatedly:
smooth_steps = int(raw_input("How many times do you want to smooth the data?"))
for _ in range(smooth_steps):
    # code from above goes here
Note that my code above assumes that you have read numeric values into the data list. However, the code you've shown doesn't do this. You simply use data = inFile.read() which means data is a string. You need to actually parse your file in some way to get a list of numbers.
In your immediate example, where the file contains a Python formatted list literal, you could use eval (or ast.literal_eval if you wanted to be a bit safer). But if this data is going to be used by any other program, you'll probably want a more widely supported format, like CSV, JSON or YAML (all of which have parsers available in Python).
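Putting the pieces together, a minimal sketch might look like the following. It is written for Python 3 (so `print` is a function), it uses a literal string in place of the file contents, and the function names `smooth_once` and `smooth` are my own.

```python
import ast


def smooth_once(values):
    """Average each interior value with its two neighbours; keep the ends."""
    result = list(values)
    for i in range(1, len(values) - 1):
        result[i] = sum(values[i - 1:i + 2]) / 3.0
    return result


def smooth(values, passes):
    """Apply smooth_once the requested number of times."""
    for _ in range(passes):
        values = smooth_once(values)
    return values


text = "[3, 8, 5, 7, 1]"  # stand-in for what inFile.read() would return
data = [float(x) for x in ast.literal_eval(text)]

smoothed = smooth(data, 1)
print(", ".join(f"{x:.2f}" for x in smoothed))  # 3.00, 5.33, 6.67, 4.33, 1.00
```

Each pass builds a new list from the old one, so a value that was already smoothed in the current pass never feeds into its neighbour's average.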
Noob question:
I have the output of a complex matrix done in Fortran, the contents looks like this:
(-0.594209719263636,1.463867815703586E-006)
(-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814)
(0.592814294881930,4.069892201461069E-002)
I want to read and use this data in a julia program.
No, I don't want to change the writing format, I would like to learn how to strip off
the "trash" characters like '(', or ','. This may be useful for arbitrary Input files.
I have tried with the following code:
file = open(pathtofilename, "r")
data_str = readall(ifile)
data_numbers_str = split(data_str)
data_numbers = split(data_numbers_str, ['('])
However, the manual is not quite self-explanatory [http://docs.julialang.org/en/release-0.2/stdlib/base/?highlight=split].
Here is what I'd do
data = "(-0.594209719263636,1.463867815703586E-006) (-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814) (0.592814294881930,4.069892201461069E-002)"
function pair_to_complex(pair)
nums = float(split(pair[2:end-1], ","))
return Complex(nums...)
end
numbers = map(pair_to_complex, split(data, " "))
To explain
The pair[2:end-1] removes the parentheses
I then split that on the , to get an array with two numbers, still as strings
I convert them to Float64 with float(), obtaining an array of floats
I make a new complex number. The ... splats the array out so it provides the two arguments to Complex - I could have done Complex(nums[1],nums[2])
I then apply this logic using map to every term in the data.
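For comparison, the same parsing steps can be sketched in Python. This version uses a regular expression instead of slicing, so it does not depend on how the pairs are spread over lines; the `PAIR` pattern is my own and assumes each number is written without embedded spaces.

```python
import re

# The data from the question, joined into one string.
data = ("(-0.594209719263636,1.463867815703586E-006) "
        "(-0.783378034185788,-0.182301028756558) "
        "(-0.794024313844809,0.128219337674814) "
        "(0.592814294881930,4.069892201461069E-002)")

# One "(real,imag)" pair: two runs of non-comma, non-parenthesis characters.
PAIR = re.compile(r"\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)")

# float() accepts the E-006 style exponent, so no extra cleanup is needed.
numbers = [complex(float(real), float(imag))
           for real, imag in PAIR.findall(data)]

for z in numbers:
    print(z)
```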