destring 18 digit number returns rounding error - stata

I have IDs with a length of 18 as strings and I want to transform them to a numeric variable.
I tried the following code:
destring var, replace
That returns the variable with a numeric format. However, the last digit of the ID includes a rounding error. E.g.: 123456789123456000 --> 123456789123456001
How can I destring my values without any change in the ID?

I can't reproduce that as this shows:
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen test = "123456789123456000"
. destring, gen(check)
test: all characters numeric; check generated as double
. l
     +--------------------------------+
     |               test       check |
     |--------------------------------|
  1. | 123456789123456000   1.235e+17 |
     +--------------------------------+
. format check %23.0f
. l
     +-----------------------------------------+
     |               test                check |
     |-----------------------------------------|
  1. | 123456789123456000   123456789123456000 |
     +-----------------------------------------+
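The other way round does produce an error: the string "123456789123456001" maps to 123456789123456000. In essence, you are bumping up against what can be held exactly in a double, which uses 8 bytes for each number. A double is only guaranteed to hold integers exactly up to 2^53 (9,007,199,254,740,992), so not every 18-digit ID survives the conversion.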
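The same limit can be checked outside Stata. Here is a minimal sketch in Python, whose float is also an 8-byte IEEE 754 double; the helper name survives_double is made up for this illustration:

```python
def survives_double(id_string: str) -> bool:
    """Return True if the digit string round-trips through an 8-byte double."""
    return int(float(id_string)) == int(id_string)


# Every integer up to 2**53 is exact in a double; above that only some are.
print(2 ** 53)
print(survives_double("123456789123456000"))  # exactly representable
print(survives_double("123456789123456001"))  # rounds to ...000
```

If a check like this fails for any ID, keeping the variable as a string is the only way to preserve every digit.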

Related

Trimming whitespace in string variable

I have a variable acusip that looks like this:
000000111111
I need to eliminate blank spaces at the end.
I know this function, but do not know the command:
strrtrim("acusip ")="acusip"
Any idea how to solve this issue?
The following works for me:
clear
set obs 1
generate acusip1 = "000000111111 "
generate acusip2 = strtrim(acusip1)
list
     +-----------------------------------+
     |            acusip1        acusip2 |
     |-----------------------------------|
  1. | 000000111111         000000111111 |
     +-----------------------------------+
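For comparison, Stata's strtrim() removes both leading and trailing blanks, while strrtrim() removes trailing blanks only. A rough Python analogue of that difference, using a made-up value with blanks on both sides:

```python
acusip1 = "  000000111111   "  # hypothetical value with blanks on both sides

# Like strrtrim(): only trailing blanks are removed.
print(repr(acusip1.rstrip(" ")))

# Like strtrim(): leading and trailing blanks are removed.
print(repr(acusip1.strip(" ")))
```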

Python spark extract characters from dataframe

I have a dataframe in spark, something like this:
ID | Column
------ | ----
1 | STRINGOFLETTERS
2 | SOMEOTHERCHARACTERS
3 | ANOTHERSTRING
4 | EXAMPLEEXAMPLE
What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:
ID | New Column
------ | ------
1 | STRIN_F
2 | SOMEO_E
3 | ANOTH_S
4 | EXAMP_E
I can't use the following code, because the values in the columns differ, and I don't want to split on a specific character, but on the 6th character:
import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))
Thanks all!
Use something like this:
from pyspark.sql.functions import concat, lit

df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                   lit('_'),
                                   df.Column.substr(8, 1)))
This uses the column method substr, which takes a 1-based start position and a length, and the function concat to join the pieces.
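To check the positions without a Spark session, the same indexing can be sketched in plain Python. The helpers substr and make_key below are hypothetical stand-ins written for this example, not part of PySpark:

```python
def substr(text: str, start: int, length: int) -> str:
    """Mimic a 1-based substr(start, length) on a plain Python string."""
    begin = start - 1  # convert the 1-based position to a 0-based index
    return text[begin:begin + length]


def make_key(value: str) -> str:
    """First five characters, an underscore, then the 8th character."""
    return substr(value, 1, 5) + "_" + substr(value, 8, 1)


for value in ["STRINGOFLETTERS", "SOMEOTHERCHARACTERS",
              "ANOTHERSTRING", "EXAMPLEEXAMPLE"]:
    print(value, "->", make_key(value))
```

The printed keys match the desired output in the question (STRIN_F, SOMEO_E, ANOTH_S, EXAMP_E).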

How to remove all of the contents of a string except the first character?

I have a data set with first name, middle name, and last name. I'm going to merge it with another data set matching on the same variables.
In one data set the variable mi looks like:
Lowell
Ann
Carl
A
Fran
Allen
And I want it to look like:
L
A
C
A
F
A
I tried this:
gen mi2 = substr(mi, 2, length(mi))
This does the opposite of what I want, but it's the closest I've been able to get. I know this is probably a really easy problem, but I'm stumped at the moment.
You are on the right track with substr. See the example below:
clear
input str10 mi
Lowell
Ann
Carl
A
Fran
Allen
end
gen mi2 = substr(mi,1,1)
list, sep(0)
     +--------------+
     |     mi   mi2 |
     |--------------|
  1. | Lowell     L |
  2. |    Ann     A |
  3. |   Carl     C |
  4. |      A     A |
  5. |   Fran     F |
  6. |  Allen     A |
     +--------------+
The second and third arguments to substr are the starting position and number of characters respectively. In this case, you want to start at the first character, and take one character, so substr(mi, 1, 1) is what you need.
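To see why the original attempt did "the opposite", here is the same pair of operations sketched with Python slices (an analogy only; Stata positions are 1-based, Python indices are 0-based):

```python
names = ["Lowell", "Ann", "Carl", "A", "Fran", "Allen"]

# Analogue of substr(mi, 1, 1): start at the first character, take one.
initials = [name[:1] for name in names]

# Analogue of substr(mi, 2, length(mi)): start at the second character
# and keep the rest, i.e. everything except the initial.
remainders = [name[1:] for name in names]

print(initials)
print(remainders)
```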

How to match sub pattern in Robot Framework?

I am doing following things in RFW:
STEP 1 : I need to match the "NUM_FLOWS" value from the following command output.
STEP 2 : If its "Zero - 0" , Testcase should FAIL. If its NON-ZERO, Test case is PASS.
Sample command output:
router-7F2C13#show app stats gmail on TEST/switch1234-15E8CC
--------------------------------------------------------------------------------
APPLICATION BYTES_IN BYTES_OUT NUM_FLOWS
--------------------------------------------------------------------------------
gmail 0 0 4
--------------------------------------------------------------------------------
router-7F2C13#
How to do this with "Should Match Regexp" and "Should Match" keywords? How to check only that number sub-pattern? (Example: In the above command output, NUM_FLOWS is NON-ZERO, Then testcase should PASS.)
Please help me to achieve this.
Thanks in advance.
My New robot file content:
Write show dpi app stats BitTorrent_encrypted on AVC/ap7532-15E8CC
${raw_text} Read Until Regexp .*#
${data[0].num_flows} 0
| | ${data}= | parse output | ${raw_text}
| | Should not be equal as integers | ${data[0].num_flows} | 0
| | ... | Expected num_flows to be non-zero but it was zero | values=False
There are many ways to solve this. A simple way is to use robot's regular expression keywords to look for "gmail" at the start of a line, and then expect two numbers and then the number 0 (zero) followed by the end of the line. This assumes that a) NUM_FLOWS is always the last column, and b) there is only one line that begins with "gmail". I don't know if those are valid assumptions or not.
Because the data spans multiple lines, the pattern includes (?m) (the multiline flag) so that $ means "end of line" in addition to "end of string".
| | Should not match regexp | ${data} | (?m)\\s+gmail\\s+\\d+\\s+\\d+\\s+0\\s*$
| | ... | Expected non-zero value in the fourth column for gmail, but it was zero.
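The pattern can be tried out in plain Python before putting it in a test. This is a sketch that assumes the command output looks like the sample in the question; the doubled backslashes are Robot escaping, so in Python a raw string with single backslashes is used. The names ZERO_FLOWS, has_zero_flows and SAMPLE are made up for the example:

```python
import re

# Same pattern as in the keyword call, without the Robot-level escaping.
ZERO_FLOWS = re.compile(r"(?m)\s+gmail\s+\d+\s+\d+\s+0\s*$")


def has_zero_flows(output: str) -> bool:
    """Return True if the gmail row ends in a NUM_FLOWS value of 0."""
    return ZERO_FLOWS.search(output) is not None


SAMPLE = "\n".join([
    "router-7F2C13#show app stats gmail on TEST/switch1234-15E8CC",
    "-" * 80,
    "APPLICATION BYTES_IN BYTES_OUT NUM_FLOWS",
    "-" * 80,
    "gmail 0 0 4",
    "-" * 80,
    "router-7F2C13#",
])

print(has_zero_flows(SAMPLE))                                       # NUM_FLOWS is 4
print(has_zero_flows(SAMPLE.replace("gmail 0 0 4", "gmail 0 0 0")))  # NUM_FLOWS is 0
```

Without (?m), the $ would only match at the very end of the output, after the prompt, so the zero-flows case would never be detected.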
There are plenty of other ways to solve the problem. For example, if you need to check for other values in other columns, you might want to write a python keyword that parses the data and returns some sort of data structure.
Here's a quick example. It's not bulletproof, and makes some assumptions about the data passed in. I wouldn't use it in production, but it illustrates the technique. The keyword returns a list of items, and each item is a custom object with four attributes: name, bytes_in, bytes_out and num_flows:
# python library
import re

def parse_output(data):
    class Data(object):
        def __init__(self, raw_text):
            # split on runs of whitespace (r'\s*' would split between
            # every character on Python 3.7+)
            columns = re.split(r'\s+', raw_text.strip())
            self.name = columns[0]
            self.bytes_in = int(columns[1])
            self.bytes_out = int(columns[2])
            self.num_flows = int(columns[3])

    lines = data.split("\n")
    result = []
    # skip first four lines and the last two
    for line in lines[4:-2]:
        result.append(Data(line))
    return result
Using it in a test:
*** Test Cases ***
| | # <put your code here to get the data from the >
| | # <router and store it in ${raw_text} >
| | ${raw_text}= | ...
| | ${data}= | parse output | ${raw_text}
| | Should not be equal as integers | ${data[0].num_flows} | 0
| | ... | Expected num_flows to be non-zero but it was zero | values=False

Stata Regular expressions extracting numerical values

I have some data that looks like this
var1
h 01 .00 .0 abc
d 1.0 .0 14.0abc
1,0.0 0.0 .0abc
It should be noted that the last three alpha values are the same, and I am hoping to extract all the numerical values within the string. The code that I'm using looks like this
gen x1=regexs(1) if regexm(var1,"([0-9]+) [ ]*(abc)*$")
However, this code only extracts the numbers before the abc term and stops after a space or a .. For example, only 0 before abc is extracted from the first term. I was wondering whether there is a way to handle this and extract all the numerical values before the alpha characters.
As @Roberto Ferrer points out, your question isn't very clear, but here is an example using moss from SSC:
. clear
. input str16 var1
var1
1. "h 01 .00 .0 abc"
2. "d 1.0 .0 14.0abc"
3. "1,0.0 0.0 .0abc"
4. end
. moss var1, regex match("([0-9]+\.*[0-9]*|\.[0-9]+)")
. l _match*
     +---------------------------------------+
     | _match1   _match2   _match3   _match4 |
     |---------------------------------------|
  1. |      01       .00        .0           |
  2. |     1.0        .0      14.0           |
  3. |       1       0.0       0.0        .0 |
     +---------------------------------------+
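The regular expression itself is not specific to moss. As a rough cross-check, here is the same pattern applied with Python's re.findall, which returns every non-overlapping match in a string (a sketch for comparison, not a replacement for the Stata code):

```python
import re

# Either digits with an optional decimal part, or a leading dot followed by digits.
NUMBER = re.compile(r"([0-9]+\.*[0-9]*|\.[0-9]+)")

rows = ["h 01 .00 .0 abc", "d 1.0 .0 14.0abc", "1,0.0 0.0 .0abc"]

for row in rows:
    print(row, "->", NUMBER.findall(row))
```

Note that the comma in the third row is not part of the pattern, so "1,0.0" is split into "1" and "0.0", just as in the moss output.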