I have a comma-separated string and I need to match all the commas in this string except for the commas inside the double-quotes. I'm using regex for this.
,,,,"8000000,B767-200","B767-200","Boeing 767-200","ACFT",,,,,,,,,,,,,,,,,,,,,,,,,,
I tried the following regex patterns. They work in online regex testers, but none of them works in PL/SQL.
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
(?!\B"[^"]*),(?![^"]*"\B)
I'm using the REGEXP_INSTR function inside a PL/SQL procedure to find the positions of the commas. Can someone suggest a regex pattern that works in PL/SQL for this purpose, or help me write one?
Thank you.
Oracle does not support look-ahead and non-capturing groups so you will need to match the quotes.
Assuming each field is either a non-quoted string or a quoted string (which could contain escaped quotes), you can use the regular expression:
([^",]*|"(\\"|[^"])*"),
Which you could use like this:
WITH matches ( id, csv, start_pos, comma_pos, idx, num_matches ) AS (
SELECT id,
csv,
1,
REGEXP_INSTR( csv, '([^",]*|"(\\"|[^"])*"),', 1, 1, 1, NULL ) - 1,
1,
REGEXP_COUNT( csv, '([^",]*|"(\\"|[^"])*"),' )
FROM test_data
UNION ALL
SELECT id,
csv,
REGEXP_INSTR( csv, '([^",]*|"(\\"|[^"])*"),', 1, idx + 1, 0, NULL ),
REGEXP_INSTR( csv, '([^",]*|"(\\"|[^"])*"),', 1, idx + 1, 1, NULL ) - 1,
idx + 1,
num_matches
FROM matches
WHERE idx < num_matches
)
SELECT id,
idx,
start_pos,
comma_pos,
SUBSTR( csv, start_pos, comma_pos - start_pos ) AS value
FROM matches
so for your test data:
CREATE TABLE test_data ( id, csv ) AS
SELECT 1, ',,,,"8000000,B767-200","B767-200","Boeing 767-200","ACFT",,,,,,,,,,,,,,,,,,,,,,,,,,' FROM DUAL
which outputs:
ID | IDX | START_POS | COMMA_POS | VALUE
-: | --: | --------: | --------: | :-----------------
1 | 1 | 1 | 1 | null
1 | 2 | 2 | 2 | null
1 | 3 | 3 | 3 | null
1 | 4 | 4 | 4 | null
1 | 5 | 5 | 23 | "8000000,B767-200"
1 | 6 | 24 | 34 | "B767-200"
1 | 7 | 35 | 51 | "Boeing 767-200"
1 | 8 | 52 | 58 | "ACFT"
1 | 9 | 59 | 59 | null
1 | 10 | 60 | 60 | null
1 | 11 | 61 | 61 | null
1 | 12 | 62 | 62 | null
1 | 13 | 63 | 63 | null
1 | 14 | 64 | 64 | null
1 | 15 | 65 | 65 | null
1 | 16 | 66 | 66 | null
1 | 17 | 67 | 67 | null
1 | 18 | 68 | 68 | null
1 | 19 | 69 | 69 | null
1 | 20 | 70 | 70 | null
1 | 21 | 71 | 71 | null
1 | 22 | 72 | 72 | null
1 | 23 | 73 | 73 | null
1 | 24 | 74 | 74 | null
1 | 25 | 75 | 75 | null
1 | 26 | 76 | 76 | null
1 | 27 | 77 | 77 | null
1 | 28 | 78 | 78 | null
1 | 29 | 79 | 79 | null
1 | 30 | 80 | 80 | null
1 | 31 | 81 | 81 | null
1 | 32 | 82 | 82 | null
1 | 33 | 83 | 83 | null
(Note: you wanted to match the commas, and this regular expression does exactly that; it does not match the final value in the comma-delimited list, as there is no terminating comma. If you want that value too, use the regular expression ([^",]*|"(\\"|[^"])*")(,|$) instead.)
If you want it in a procedure then:
CREATE PROCEDURE extract_csv_value(
i_csv IN VARCHAR2,
i_index IN INTEGER,
o_value OUT VARCHAR2
)
IS
BEGIN
o_value := REGEXP_SUBSTR( i_csv, '([^",]*|"(\\"|[^"])*")(,|$)', 1, i_index, NULL, 1 );
IF SUBSTR( o_value, 1, 1 ) = '"' THEN
o_value := REPLACE( SUBSTR( o_value, 2, LENGTH( o_value ) - 2 ), '\"', '"' );
END IF;
END;
/
then:
DECLARE
csv VARCHAR2(4000) := ',,,,"8000000,B767-200","B767-200","Boeing 767-200","ACFT",,,,,,,,,,,,,,,,,,,,,,,,,,';
value VARCHAR2(100);
BEGIN
FOR i IN 1 .. 10 LOOP
extract_csv_value( csv, i, value );
DBMS_OUTPUT.PUT_LINE( LPAD( i, 2, ' ' ) || ' ' || value );
END LOOP;
END;
/
outputs:
1
2
3
4
5 8000000,B767-200
6 B767-200
7 Boeing 767-200
8 ACFT
9
10
I tried to solve it without using a regex, so check whether the following procedure works as expected.
CREATE OR REPLACE PROCEDURE p_extract(p_string IN VARCHAR) AS
TYPE table_result IS TABLE OF VARCHAR2(255) INDEX BY PLS_INTEGER;
t_retval table_result;
opening BOOLEAN := FALSE;
cnt INTEGER := 1;
I INTEGER := 1;
j INTEGER := 1;
BEGIN
WHILE cnt <= LENGTH(p_string) AND cnt <> 0 LOOP
IF substr(p_string, cnt, 1) = '"' THEN
opening := NOT opening;
END IF;
IF opening THEN
I := instr(p_string, '"', cnt + 1, 1);
t_retval(t_retval.COUNT + 1) := substr(p_string, cnt, I - cnt + 1);
END IF;
cnt := instr(p_string, '"', cnt + 1, 1);
END LOOP;
FOR K IN t_retval.FIRST..t_retval.LAST LOOP
dbms_output.put_line(t_retval(K));
END LOOP;
END;
Test it.
BEGIN
p_extract(',,,,"8000000,B767-200","B767-200","Boeing 767-200","ACFT",,,,,,,,,,,,,,,,,,,,,,,,,,');
END;
--OUTPUT
/*
"8000000,B767-200"
"B767-200"
"Boeing 767-200"
"ACFT"
*/
However, this won't work if the first or last " is missing.
How do you extract all values containing part of a particular number and then delete them?
I have data where the IDs have different lengths, and I want to find all the IDs that end in a particular number. For example, if the ID has either "-00" or "02" or "-01" at the end, I want to pull it so I can see the hit rate for those IDs, and then delete that suffix from the ID. Is there a more efficient way to write this code?
I tried to use the substring function to slice it to get the result, but there is some other ID along with the specified position.
Code:
Proc sql;
Create table work.data1 AS
SELECT Product, Amount_sold, Price_per_unit,
CASE WHEN Product Contains "Pen" and Lenghth(ID) >= 9 Then ID = SUBSTR(ID,1,9)
WHEN Product Contains "Book" and Lenghth(ID) >= 11 Then ID = SUBSTR(ID,1,11)
WHEN Product Contains "Folder" and Lenghth(ID) >= 12 Then ID = SUBSTR(ID,1,12)
...
END AS ID
FROM A
Quit;
Have:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229-01 | Book | 20 | 5 |
| ABC134475472 02 | Folder | 29 | 7 |
| AB-1235674467-00 | Pencil | 26 | 1 |
| 69598346-02 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Wanted the final result:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229 | Book | 20 | 5 |
| ABC134475472 | Folder | 29 | 7 |
| AB-1235674467 | Pencil | 26 | 1 |
| 69598346 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Test whether the string has any embedded spaces or hyphens and whether the last word, when delimited by space or hyphen, is 00, 01 or 02. If both are true, chop off the last three characters.
data have;
infile cards dsd dlm='|' truncover ;
input id :$20. product :$20. amount_sold price_per_unit;
cards;
123456789 | Pen | 30 | 2 |
63495837229-01 | Book | 20 | 5 |
ABC134475472 02 | Folder | 29 | 7 |
AB-1235674467-00 | Pencil | 26 | 1 |
69598346-02 | Correction pen | 15 | 1.50 |
6970457688 | Highlighter | 15 | 2 |
584028467 | Color pencil | 15 | 10 |
;
data want;
set have ;
if indexc(trim(id),'- ') and scan(id,-1,'- ') in ('00' '01' '02') then
id = substrn(id,1,length(id)-3)
;
run;
Result
amount_ price_
Obs id product sold per_unit
1 123456789 Pen 30 2.0
2 63495837229 Book 20 5.0
3 ABC134475472 Folder 29 7.0
4 AB-1235674467 Pencil 26 1.0
5 69598346 Correction pen 15 1.5
6 6970457688 Highlighter 15 2.0
7 584028467 Color pencil 15 10.0
There may be other solutions, but you will need some string functions. Here I used substr, reverse (which reverses the string) and indexc (which returns the position of the first occurrence of any of the given characters in the string):
data have;
input text $20.;
datalines;
12345678
AB-142353 00
AU-234343-02
132453 02
221344-09
;
run;
data want (drop=reverted pos);
set have;
if countw(text) gt 1
then do;
reverted=strip(reverse(text));
pos=indexc(reverted,'- ')+1;
new=strip(reverse(substr(reverted,pos)));
end;
else new=text;
run;
I have data that look like this:
| Country | Year | Firm | Profit |
|---------|------|------|--------|
| A | 1 | 1 | 10 |
| A | 1 | 2 | 20 |
| A | 1 | 3 | 30 |
| A | 1 | 4 | 40 |
I want to create a new variable that, for each firm i, sums max(Profit_j - Profit_i, 0) over all other firms j in the same country and year.
For example, the value of the variable for firm 1 would be:
max(20 - 10, 0) + max(30 - 10, 0) + max(40 - 10, 0)
How can I do this in Stata by country and year?
Below is a direct solution to your problem (note the use of dataex for providing example data):
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Country float(Year Firm Profit)
"A" 1 1 10
"A" 1 2 20
"A" 1 3 30
"A" 1 4 40
end
generate Wanted = -Profit
bysort Country Year (Wanted): replace Wanted = sum(Profit) - _n * Profit
list
+-----------------------------------------+
| Country Year Firm Profit Wanted |
|-----------------------------------------|
1. | A 1 4 40 0 |
2. | A 1 3 30 10 |
3. | A 1 2 20 30 |
4. | A 1 1 10 60 |
+-----------------------------------------+
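The logic behind it is the following: within each Country and Year group the observations are sorted by descending Profit, so the firms whose profit is at least as large as that of the firm in row _n are exactly rows 1 to _n. The running sum sum(Profit) adds up their profits, and subtracting _n * Profit leaves the sum of the positive differences. Firms further down the group have smaller or equal profits and contribute zero.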
Note: This was the first answer posted. It didn't avoid the pitfall of taking the OP's algebra literally and wanting to implement the calculation in terms of maxima within groups. But I realised after posting that there must be a much simpler way of doing it and @Romalpa Akzo got there, which is excellent. I undeleted this on request because it does show some machinery for looping over groups and implementing a calculation for each group with a customised Mata function.
Here I write a Mata function to return the wanted result for a group and then loop over the groups to populate a pre-defined variable.
To test the code for a dataset with more than one group, I use mpg from Stata's auto toy dataset.
mata :
void wanted (string scalar varname, string scalar usename, string scalar resultname) {
real scalar i
real colvector x, result, zero
result = x = st_data(., varname, usename)
zero = J(rows(x), 1, 0)
for(i = 1; i <= rows(x); i++) {
result[i] = sum(rowmax((x :- x[i], zero)))
}
st_store(., resultname, usename, result)
}
end
sysuse auto, clear
sort foreign rep78 mpg
egen group = group(foreign rep78), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1 / `G' {
replace touse = group == `g'
mata : wanted("mpg", "touse", "wanted")
}
How did that work out? Here are some results:
. list mpg wanted group if foreign, sepby(group)
+--------------------------+
| mpg wanted group |
|--------------------------|
53. | 21 7 Foreign 3 |
54. | 23 3 Foreign 3 |
55. | 26 0 Foreign 3 |
|--------------------------|
56. | 21 35 Foreign 4 |
57. | 23 19 Foreign 4 |
58. | 23 19 Foreign 4 |
59. | 24 13 Foreign 4 |
60. | 25 8 Foreign 4 |
61. | 25 8 Foreign 4 |
62. | 25 8 Foreign 4 |
63. | 28 2 Foreign 4 |
64. | 30 0 Foreign 4 |
|--------------------------|
65. | 17 84 Foreign 5 |
66. | 17 84 Foreign 5 |
67. | 18 77 Foreign 5 |
68. | 18 77 Foreign 5 |
69. | 25 42 Foreign 5 |
70. | 31 18 Foreign 5 |
71. | 35 6 Foreign 5 |
72. | 35 6 Foreign 5 |
73. | 41 0 Foreign 5 |
|--------------------------|
74. | 14 . . |
+--------------------------+
So, how would that be applied to your data?
clear
input str1 Country Year Firm Profit
A 1 1 10
A 1 2 20
A 1 3 30
A 1 4 40
end
egen group = group(Country Year), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1/`G' {
replace touse = group == `g'
mata: wanted("Profit", "touse", "wanted")
}
Results:
. list Firm Profit wanted, sepby(group)
+------------------------+
| Firm Profit wanted |
|------------------------|
1. | 1 10 60 |
2. | 2 20 30 |
3. | 3 30 10 |
4. | 4 40 0 |
+------------------------+
I am trying to essentially count in a new base, specified by a given alphabet. So for the following params:
#define ALPHABET "abc"
#define ALPHABET_LEN 3
I would get the following result:
0 | a
1 | b
2 | c
3 | aa
4 | ab
...
9 | ca
10| cb
I have tried to do this with the following code:
#include <iostream>
#include <string>
#define ALPHABET "abc"
#define ALPHABET_LEN 3
int main()
{
for (int i = 0; i <= 10; i++) {
int l = 0;
int n_ = i;
int n = i;
char secret[4];
while (n > 0) {
secret[l] = ALPHABET[n%ALPHABET_LEN];
n /= ALPHABET_LEN;
l++;
}
std::cout << i << " | " << secret << std::endl;
}
}
Unfortunately, this prints the following:
0 |
1 | b
2 | c
3 | ab
4 | bb
5 | cb
6 | ac
7 | bc
8 | cc
9 | aab
10 | bab
This is not the expected pattern. How can I fix this, and is there a better way than just peeling off the next character using mod?
This is a challenging algorithm problem. The description of the problem is misleading. A base conversion suggests that there will be a 'zero' value which would be something like this:
a, b, c, a0, aa, ab, ac, b0, b1, b2, c0, etc.
However, in this problem, 'a' represents 'zero', but 'a' also is the first digit.
Looking at this as a base conversion creates a rabbit hole of complexity.
Rather, one has to figure out the algorithm to calculate each successive digit and ignore the idea of base conversion.
The first approach is to come up with a formula for each digit, which looks like this:
const int LEN = ALPHABET_LEN; // Use a shorter variable name
std::string secret(1, ALPHABET[i % LEN]); // last (rightmost) digit, computed first
if(i > LEN - 1) {
    secret = ALPHABET[(i/LEN - 1)%LEN] + secret;
}
if(i > LEN * (LEN+1) - 1) {
    secret = ALPHABET[((i/LEN - 1)/LEN - 1)%LEN] + secret;
}
if(i > LEN * (LEN * (LEN+1) + 1) - 1) {
    secret = ALPHABET[(((i/LEN - 1)/LEN - 1)/LEN - 1)%LEN] + secret;
}
As you work out the formula for each successive digit, you realize that every new digit repeats the same step: divide the previous quotient by LEN and subtract 1. The subtraction is needed because there is no separate zero digit.
This approach works, but it requires an additional if statement for each successive digit. For i = 1 .. 10 that is fine, but for i = 10,000,000 it would take a long series of if statements.
However, after discovering the pattern, it is now a little easier to create a while() statement that avoids the need for an endless series of if statements.
#include <iostream>
#include <string>
//works as #define, but I like string so I can easily test with
// longer strings such as "abcd"
std::string ALPHABET {"abc"};
int main()
{
const int LEN = ALPHABET.size(); // Shorten the var name
for (int i = 0; i <= 100; i++) { // use i <= 100 to verify the algorithm
int number = i; // save the number that we are working on for this row
std::string secret = "";
secret += ALPHABET[i % LEN];
while(number / LEN > 0){
number = number/LEN - 1; // subtract 1 because there is no separate zero digit
secret = ALPHABET[number%LEN] + secret;
}
std::cout << i << " | " << secret << std::endl;
}
}
The output will be:
0 | a
1 | b
2 | c
3 | aa
4 | ab
5 | ac
6 | ba
7 | bb
8 | bc
9 | ca
10 | cb
11 | cc
12 | aaa
13 | aab
14 | aac
15 | aba
16 | abb
17 | abc
18 | aca
19 | acb
20 | acc
21 | baa
22 | bab
23 | bac
24 | bba
25 | bbb
26 | bbc
27 | bca
28 | bcb
29 | bcc
30 | caa
31 | cab
32 | cac
33 | cba
34 | cbb
35 | cbc
36 | cca
37 | ccb
38 | ccc
39 | aaaa
40 | aaab
41 | aaac
42 | aaba
43 | aabb
44 | aabc
45 | aaca
46 | aacb
47 | aacc
48 | abaa
49 | abab
50 | abac
51 | abba
52 | abbb
53 | abbc
54 | abca
55 | abcb
56 | abcc
57 | acaa
58 | acab
59 | acac
60 | acba
61 | acbb
62 | acbc
63 | acca
64 | accb
65 | accc
66 | baaa
67 | baab
68 | baac
69 | baba
70 | babb
71 | babc
72 | baca
73 | bacb
74 | bacc
75 | bbaa
76 | bbab
77 | bbac
78 | bbba
79 | bbbb
80 | bbbc
81 | bbca
82 | bbcb
83 | bbcc
84 | bcaa
85 | bcab
86 | bcac
87 | bcba
88 | bcbb
89 | bcbc
90 | bcca
91 | bccb
92 | bccc
93 | caaa
94 | caab
95 | caac
96 | caba
97 | cabb
98 | cabc
99 | caca
100 | cacb
int i = 0;
int n = i;
while (n > 0) {
Clearly, in the first iteration of the outer loop (i = 0), the inner loop does not run at all, so secret never receives the a it should contain.
P.S. You fail to null-terminate secret, so the behaviour of the program is undefined when you insert it into the output stream.
I have the following (sorted) variable:
35
35
37
37
37
40
I want to create a new variable which will increment by one when a new number comes up in the original variable.
For example:
35 1
35 1
37 2
37 2
37 2
40 3
I thought about using the by or bysort commands but none of them seems to solve the problem. This looks like something many people need, but I couldn't find an answer.
You are just counting how often a value differs from the previous value. This works also for observation 1 as any reference to a value for observation 0 is returned as missing, so in your example 35 is not equal to missing.
clear
input x
35
35
37
37
37
40
end
gen new = sum(x != x[_n-1])
list, sepby(new)
+----------+
| x new |
|----------|
1. | 35 1 |
2. | 35 1 |
|----------|
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|----------|
6. | 40 3 |
+----------+
by would be pertinent if you had blocks of observations to be treated separately. One underlying principle here is that true or false comparisons (here, whether two values are unequal) are evaluated as 1 if true and 0 if false.
@Nick beat me to it by a couple of minutes, but here's another, cleaner way of doing this:
clear
input foo
35
35
37
37
37
40
end
egen counter = group(foo)
list
+---------------+
| foo counter |
|---------------|
1. | 35 1 |
2. | 35 1 |
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|---------------|
6. | 40 3 |
+---------------+
This approach uses the egen command and its associated group() function.
There are also a couple of options for this function, with missing being perhaps the most useful.
From the command's help file:
"...missing indicates that missing values in varlist (either . or "") are to be treated like any other value when assigning groups, instead of as missing values being assigned to the group missing..."
clear
input foo
35
35
.
37
37
37
40
.
end
egen counter = group(foo), missing
sort foo
list
+---------------+
| foo counter |
|---------------|
1. | 35 1 |
2. | 35 1 |
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|---------------|
6. | 40 3 |
7. | . 4 |
8. | . 4 |
+---------------+
Instead of:
drop counter
egen counter = group(foo)
sort foo
list
+---------------+
| foo counter |
|---------------|
1. | 35 1 |
2. | 35 1 |
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|---------------|
6. | 40 3 |
7. | . . |
8. | . . |
+---------------+
Another option is label:
"... The label option returns integers from 1 up according to the distinct groups of varlist in sorted order. The integers are labeled with the values of varlist or the value labels, if they exist..."
Using the example without the missing values:
egen counter = group(foo), label
list
+---------------+
| foo counter |
|---------------|
1. | 35 35 |
2. | 35 35 |
3. | 37 37 |
4. | 37 37 |
5. | 37 37 |
|---------------|
6. | 40 40 |
+---------------+
I am looking at the following code from this link:
https://www.geeksforgeeks.org/divide-and-conquer-set-2-karatsuba-algorithm-for-fast-multiplication/
// The main function that adds two bit sequences and returns the addition
string addBitStrings( string first, string second )
{
string result; // To store the sum bits
// make the lengths same before adding
int length = makeEqualLength(first, second);
int carry = 0; // Initialize carry
// Add all bits one by one
for (int i = length-1 ; i >= 0 ; i--)
{
int firstBit = first.at(i) - '0';
int secondBit = second.at(i) - '0';
// boolean expression for sum of 3 bits
int sum = (firstBit ^ secondBit ^ carry)+'0';
result = (char)sum + result;
// boolean expression for 3-bit addition
carry = (firstBit&secondBit) | (secondBit&carry) | (firstBit&carry);
}
// if overflow, then add a leading 1
if (carry) result = '1' + result;
return result;
}
I am having difficulty understanding the following expression:
// boolean expression for sum of 3 bits
int sum = (firstBit ^ secondBit ^ carry)+'0';
and this other expression:
// boolean expression for 3-bit addition
carry = (firstBit&secondBit) | (secondBit&carry) | (firstBit&carry);
What is the difference between the two? What are they trying to achieve?
Thanks
To understand this, a table with all possible combinations may help. (Luckily, the number of combinations is very limited for bits.)
Starting with AND (&), OR (|), XOR (^):
 a | b | a & b | (a | b) | a ^ b
---+---+-------+---------+-------
 0 | 0 |   0   |    0    |   0
 0 | 1 |   0   |    1    |   1
 1 | 0 |   0   |    1    |   1
 1 | 1 |   1   |    1    |   0
Putting it together:
 a | b | carry | a + b + carry | a ^ b ^ carry | a & b | b & carry | a & carry | ((a & b) | (a & carry) | (b & carry))
---+---+-------+---------------+---------------+-------+-----------+-----------+---------------------------------------
 0 | 0 |   0   |      00       |       0       |   0   |     0     |     0     |                   0
 0 | 0 |   1   |      01       |       1       |   0   |     0     |     0     |                   0
 0 | 1 |   0   |      01       |       1       |   0   |     0     |     0     |                   0
 0 | 1 |   1   |      10       |       0       |   0   |     1     |     0     |                   1
 1 | 0 |   0   |      01       |       1       |   0   |     0     |     0     |                   0
 1 | 0 |   1   |      10       |       0       |   0   |     0     |     1     |                   1
 1 | 1 |   0   |      10       |       0       |   1   |     0     |     0     |                   1
 1 | 1 |   1   |      11       |       1       |   1   |     1     |     1     |                   1
Note how the last digit of a + b + carry exactly matches the result of a ^ b ^ carry, and how a & b | a & carry | b & carry matches the first digit of a + b + carry.
The last detail is that adding '0' (the ASCII code of the digit 0) to the result (0 or 1) translates it back to the corresponding ASCII character ('0' or '1').