How to re-determine column type in Stata

I had a column of numbers in Stata that was, however, read in as string because it contained the string value "nan" for one of the numbers. I have since replaced this with a missing value, so the column now contains only numbers, albeit all in string format. What is the command to re-determine the type of the column?

Terminology: "columns" in Stata are always called variables.
Whether a variable is numeric or string is in the first instance a matter of variable type, also called storage type. A display format is assigned on top of that; "format" in Stata does not mean variable type.
With data like this
clear
input str5 stryit
"1"
"2"
"42"
"666"
"NAN"
end
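As a quick check of what type Stata currently assigns, describe reports each variable's storage type and display format, and ds can list variables by type, for example:
describe stryit            // storage type (here str5) and display format
ds, has(type string)       // list all string variables in the dataset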
There are several prudent rules.
Check to see what kinds of observations wouldn't produce numeric values if coerced:
tab stryit if missing(real(stryit))
If there are many such kinds, you might need to rethink the approach.
Always leave the original variable as it came, unless and until you are sure that you no longer need it. So use destring with the force option if you like, but generate a new variable. In your case that would be fine.
destring stryit, force gen(ntryit1)
Better than using force is to be explicit about your conversion rules. That leaves a record of what you did (assuming naturally that you keep a record of all commands used in any serious analysis):
destring stryit, ignore("NA") gen(ntryit2)
You can explicitly change problematic values before destring. An advantage of that, like the previous rule, is that you have a record of what you did.
clonevar stryit2 = stryit
replace stryit2 = "." if stryit2 == "NAN"
destring stryit2, gen(ntryit3)
Check to see that results make sense:
list
     +------------------------------------------------+
     | stryit   ntryit2   ntryit1   stryit2   ntryit3 |
     |------------------------------------------------|
  1. |      1         1         1         1         1 |
  2. |      2         2         2         2         2 |
  3. |     42        42        42        42        42 |
  4. |    666       666       666       666       666 |
  5. |    NAN         .         .         .         . |
     +------------------------------------------------+
Disclaimer: original author of destring

Related

Get rid of characters between two characters in Splunk

I'm currently facing a little problem.
I'm a beginner with Splunk, and I need to print a temperature in a single value widget.
I want the temperature to have °C at the end.
When I do this: | eval value = value +"°C"
the printed value is 80.00 °C.
I want 80°C to be printed.
I also tried the Major value and trend option, which is supposed to let me add a unit after the value, but it renders the unit very small compared to the temperature value.
Try the eval function round() first (presuming "value" is just a number):
| eval value=round(value)+"°C"
Alternatively ... use replace():
| eval value=replace(value,"\.[^°]+","")
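If you want to sanity-check either expression outside the dashboard, a quick hypothetical test is (makeresults just fabricates a single event, and the field name is assumed):
| makeresults
| eval value="80.00"
| eval value=round(value)."°C"
| table value
Here . is eval's explicit string-concatenation operator, which sidesteps any ambiguity over whether + means addition or concatenation.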

PostgreSQL - tricky regular expression - what am I missing?

I have data as follows - please see the fiddle here for all data and code below:
INSERT INTO t VALUES
('|0|34| first zero'),
('|45|0| second zero'),
('|0|0| both zeroes');
I want to SELECT from the start of the line:
the 1st character in the line is a pipe (|)
the next characters are a valid (possibly negative, with at most one minus sign) INTEGER
after the valid INT, another pipe
then another valid INT
then a pipe
The rest of the line can be anything at all - including sequences of pipe, INT, pipe, INT - but these are not to be SELECTed!
I'm using a regex to try and SELECT the valid INTEGERs. A single ZERO is also a valid reading - one ZERO and one ZERO only!
The valid integers must be from between the first 3 pipe (|) characters and not elsewhere in the line - i.e.
^|3|3|adfasfadf |555|6666| -- tuple (3, 3) is valid
but
^|--567|-765| adfasdf -- tuple (--567, -765) is invalid - two minus signs!
and
^|This is stuff.... |34|56| -- tuple (34, 56) is invalid - doesn't start pipe, int, pipe, int!
Now, my regexes (so far) are as follows:
SELECT
SUBSTRING(a, '^\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n1,
SUBSTRING(a, '^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n2,
a
FROM t;
and the results I'm getting for my 3 records of interest are:
n1    n2      a
0     NULL    |0|34| first zero     -- don't want NULL, want 34
45    0       |45|0| second zero    -- OK!
0     NULL    |0|0| both zeroes     -- don't want NULL, want 0
3     3       |3|3| some stuff here
...
... other data snipped - but working OK!
...
Now, the reason it works for the middle one is that I have (0{1}| followed by the other parts of the regex in both the upper and the lower pattern!
So that means: take one and only one ZERO, OR... the other parts of the regex. Fine, I've got that much!
However, and this is the crux of my problem, when I try to change:
'^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
to
'^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
Notice the 0{1}| bit I've added near the beginning of my regex. This should allow one and only one ZERO at the beginning of the second string (preceded by a literal pipe (|)), OR the rest - the | at the end of that 5-character snippet being, in this case, part of the regex rather than a literal pipe.
But the result I get is unchanged for the first 3 records shown above, and it now messes up many records further down - one example is a record like this:
|--567|-765|A test of bad negatives...
which obviously should fail: it gave (NULL, NULL) before, but now it returns (NULL, -765) - the first SUBSTRING still fails while the second returns -765. If the first fails, I want the second to fail too!
I'm at a loss to understand why adding 0{1}|... should have this effect, and I'm also at a loss to understand why my (0, NULL), (45, 0) and (0, NULL) don't give me (0, 0), (45, 0) and (0, 0) as I would expect.
The 0{1}| snippet appears to work fine in the capturing groups, but not outside - is this the problem? Is there a problem with PostgreSQL's regex implementation?
All I did was add a bit to the regex which said as well as what you've accepted before, please accept one and only one leading ZERO!
I have a feeling there's something about regexes I'm missing - so my question is as follows:
could I please receive an explanation as to what's going on with my regex at the moment?
could I please get a corrected regex that will work for INTEGERs as I've indicated. I know there are alternatives, but I'd like to get to the bottom of the mistake I'm making here and, finally
is there an optimum/best method to achieve what I want using regexes? This one was sort of cobbled together and then added to as further necessary conditions became clearer.
I would want any answer(s) to work with the fiddle I've supplied.
Should you require any further information, please don't hesitate to ask! This is not a simple "please give me a regex for INTs" question - my primary interest is in fixing this one to gain understanding!
The root cause is alternation precedence: | binds more loosely than anything else in a regex, so '^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|' is read as two top-level branches - '^\|0{1}' OR '[-+]?[1-9]{1}\d*\|(...)\|' - not as an optional leading zero bolted onto the original pattern. For a line such as |0|34| only the first branch can match, so the capturing group never takes part in the match and SUBSTRING returns NULL; for |--567|-765| the second branch is free to match anywhere in the line, which is why -765 leaks through. The zero alternative has to stay inside a parenthesised group. This is standard POSIX regex behaviour, not a quirk of PostgreSQL's implementation. With that in mind, the patterns can also be simplified:
SELECT
SUBSTRING(a, '^\|(0|[+-]?[1-9][0-9]*)\|[+-]?[0-9]+\|') AS n1,
SUBSTRING(a, '^\|[+-]?[0-9]+\|(0|[+-]?[1-9][0-9]*)\|') AS n2,
a
FROM t;
n1 | n2 | a
:--- | :--- | :--------------------------------------------------------------
0 | 34 | |0|34| first zero
45 | 0 | |45|0| second zero
0 | 0 | |0|0| both zeroes
3 | 3 | |3|3| some stuff here
null | null | |SE + 18.5D some other stuff
-567 | -765 | |-567|-765|A test of negatives...
null | null | |--567|-765|A test of bad negatives...
null | null | |000|00|A test of zeroes...
54 | 45 | |54|45| yet more stuff
32 | 23 | |32|23| yet more |78|78| stuff
null | null | |This is more text |11|111|22222||| and stuff |||||||
null | null | |1 1|1 1 1|22222|
null | null | |71253412|ahgsdfhgasfghasf
null | null | |aadfsd|34|Fails if first fails - deliberate - Unix philosophy!
db<>fiddle here
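As a footnote on the precedence point, the effect can be reproduced in isolation with one-off queries like these (the literal and the patterns are lifted from the question and answer above; they are illustrative and not part of the fiddle):
-- The question's modified n2 pattern: the top-level | splits it into the branches
-- '^\|0{1}' and '[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'. For this row only the
-- first branch can match, the group never participates, and the result is NULL.
SELECT SUBSTRING('|0|34| first zero',
                 '^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n2;
-- Keeping the zero alternative inside the group, as in the simplified pattern, returns 34.
SELECT SUBSTRING('|0|34| first zero',
                 '^\|[+-]?[0-9]+\|(0|[+-]?[1-9][0-9]*)\|') AS n2;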

How to build a proper H2O word2vec training_frame

How do I build a H2O word2vec training_frame that distinguishes between different document/sentences etc.?
As far as I can tell from the very limited documentation I have found, you simply supply one long list of words, such as
'This' 'is' 'the' 'first' 'This' 'is' 'number' 'two'
However, it would make sense to be able to distinguish between them - ideally something like this:
Name | ID
This | 1
is | 1
the | 1
first | 1
This | 2
is | 2
number | 2
two | 2
Is that possible?
word2vec is a type of unsupervised learning: it turns string data into numbers. So to do a classification you need to do a two-step process:
word2vec for strings to numbers
any supervised learning technique for numbers to categories
The documentation contains links to a categorization example in each of R and Python. This tutorial shows the same process on a different data set (and there should be an H2O World 2017 video that goes with it).
By the way, in your original example, you don't just supply the words; the sentences are separated by NA. If you give h2o.tokenize() a vector of sentences, it will make this format for you. So your example would actually be:
'This' 'is' 'the' 'first' NA 'This' 'is' 'number' 'two'
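For what it's worth, a minimal R sketch of that workflow might look like this (the frame contents and the word2vec parameters are illustrative, and a running H2O cluster is assumed):
library(h2o)
h2o.init()

# Two sentences held in one string column
sentences <- as.h2o(data.frame(text = c("This is the first", "This is number two"),
                               stringsAsFactors = FALSE))

# h2o.tokenize() returns one word per row, with NA rows separating the sentences
words <- h2o.tokenize(sentences$text, " ")

# The tokenized frame is then used directly as the word2vec training frame
w2v <- h2o.word2vec(words, min_word_freq = 1, epochs = 5)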

split single variable value in two

I have dataset a:
data q7;
input trt$;
cards;
a150
b250
c300
400
abc180
;
run;
We have to create dataset b like this
trt dose
a150 150mg
b250 250mg
c300 300mg
400 400mg
abc180 180mg
A new dose variable is added and mg is written after each numeric value.
Here is my solution. Basically, use the compress function to keep (hence the 'k' modifier) only the digits from the trt variable. From there it is just a case of concatenating mg onto the digits.
data want;
set q7;
dose = cats(compress(trt,'0123456789','k'),'mg');
run;
The compress function's default behaviour is to return a character string with the specified characters removed from the original string.
So compress(trt,'0123456789') would have removed all digits from the trt variable.
However, compress comes with a battery of modifiers that let the user alter that default behaviour.
In your case we want to keep the digits regardless of how many letters precede them, so I used the 'k' modifier to keep, rather than remove, the listed characters, in this case 0123456789.
For a full list of modifiers, please see the documentation for compress.
cats is one of the many functions SAS has for concatenating strings; passing the compress result as the first argument and mg as the second concatenates the two to produce your desired result.
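To make the modifier behaviour concrete, a throwaway check along these lines (the variable names are just for illustration) shows the difference:
data _null_;
    trt = 'abc180';
    no_digits   = compress(trt, '0123456789');        /* digits removed -> abc   */
    only_digits = compress(trt, '0123456789', 'k');   /* digits kept    -> 180   */
    dose        = cats(only_digits, 'mg');            /* concatenated   -> 180mg */
    put no_digits= only_digits= dose=;
run;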
hope it helps

need to substring out numeric data from a character field in SAS

I have a data set with 2 variables: a subject id number and a result. The result is a character variable. It was read in from an excel spreadsheet. Most results are numbers, but some of the results have a letter after them which was serving as a footnote in the excel file. I need to get rid of the letters after the numbers so I can convert the data to numeric for analysis. How can I do this? Below is some code to create an example dataset of the structure that I'm talking about.
data test;
input id result $ ;
datalines;
1 13
2 15
3 20
4 25c
5 75
6 99c
7 89b
8 10a
9 100
10 67
;
run;
Have a look at the compress and input functions.
num = input(compress(result, , "dk"), best.);
input converts character to numeric, interpreting the data using the informat you provide (best. here).
compress can be used to strip certain characters from a string; here the d modifier adds the numeric digits to the list of characters, and the k modifier requests that the listed characters be kept rather than removed, so only the digits survive.
You may have to tweak the compress arguments a bit to deal with more complicated cases such as decimal points.
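Applied to the test dataset above, something along these lines should give you a numeric copy of the result (the new dataset and variable names are illustrative):
data test_num;
    set test;
    result_num = input(compress(result, , 'dk'), best.);   /* '25c' -> 25, '99c' -> 99 */
    /* if results may contain decimal points, keep those too: compress(result, '.', 'dk') */
run;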