How to parse a txt file using REGEXP_SUBSTR in Oracle?

I have a lot of txt cards in the same format, and I need to parse them to extract some values.
I don't understand how to use REGEXP_SUBSTR in Oracle. Please help me write SQL statements that return the values I have marked between ** symbols (for example: first string, 02/02/11, AA11223344, etc.):
From: "abc def (**first string**)" <email#site.com>
02/01/2011 09:27 First Date : **02/02/11**
Player : BILL BROWN ID : **AA11223344**
At : YELLOW STREET. CD Number : **A11223FER**
Code :
BUYS : **123M** (M) of AAA 0 02/02/11 Owner : **England**
Shiped : **02/04/11**
Number : **11.223344** Digi : **1.2370000**
Ddate: **02/04/11**
Notes : **First line here**
* Size : **USD 11,222,333.44**
* Own ( **0 days** ): **0.00**
* Total : USD **222,333,444.55**

You can apply the regular expression repeatedly by using a hierarchical query: at each level, you look for the LEVEL-th occurrence of the pattern in your string.
Pay attention to the non-greedy quantifier in the pattern (the ? following the *), which stops each match at the nearest closing **, and to the 'n' match parameter, which lets the dot match newline characters.
with test as (
select 'From: "abc def (**first string**)" <email#site.com>
02/01/2011 09:27 First Date : **02/02/11**
Player : BILL BROWN ID : **AA11223344**
At : YELLOW STREET. CD Number : **A11223FER**
Code :
BUYS : **123M** (M) of AAA 0 02/02/11 Owner : **England**
Shiped : **02/04/11**
Number : **11.223344** Digi : **1.2370000**
Ddate: **02/04/11**
Notes : **First line here**
* Size : **USD 11,222,333.44**
* Own ( **0 days** ): **0.00**
* Total : USD **222,333,444.55** ' as txt
from dual
)
select TRIM('*' FROM regexp_substr(txt, '\*\*(.*?)\*\*', 1, LEVEL, 'n')) as val
from test
CONNECT BY regexp_substr(txt, '\*\*(.*?)\*\*', 1, LEVEL, 'n') is not null
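For comparison outside the database, the same non-greedy idea can be sketched in Python. This is only an illustration, not part of the Oracle solution: the names CARD, MARKED and values are made up here, the card text is a shortened copy of the sample, and re.DOTALL is assumed to play the role of Oracle's 'n' parameter.

```python
import re

# Shortened copy of the sample card; the ** markers delimit the wanted values.
CARD = '''From: "abc def (**first string**)" <email#site.com>
02/01/2011 09:27 First Date : **02/02/11**
Player : BILL BROWN ID : **AA11223344**
* Own ( **0 days** ): **0.00**'''

# .*? is the non-greedy form: each match stops at the nearest closing **.
MARKED = re.compile(r"\*\*(.*?)\*\*", re.DOTALL)
# The greedy form runs from the first ** to the very last ** in the text.
GREEDY = re.compile(r"\*\*(.*)\*\*", re.DOTALL)

values = MARKED.findall(CARD)
greedy_values = GREEDY.findall(CARD)

print(values)
print(len(greedy_values))
```

The greedy variant returns a single oversized match, which is why the non-greedy quantifier matters in the query above.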

Related

Regex to capture data type and field length in multiline definition file

I tried this on Regex101 but I couldn't figure this out. I have a definition file that contains many definitions (sample output below). I'm trying to find only definitions where the datatype is equal to 4 and the maxlength is 0.
I'm close but my regex will match too much.
Here is what I have:
/(datatype\s+: 4[\s\S]*?(maxlength\s+:\s0))/g
This does match the cases that I want, but it also matches the case where datatype is 4 and maxlength is not 0, continuing until it finds the next occurrence of maxlength = 0.
Sample data (sorry, it's long):
field {
id : 536870914
name : Set Field (Submit Mode)
datatype : 4
fieldtype : 1
create-mode : 2
option : 2
timestamp : 1489159658
owner : John Smith
last-changed : John Smith
length-units : 0
maxlength : 0
clob-store-opt : 0
menu-style : 1
qbe-match-op : 1
fulltext-optns : 0
permission : 12\1
}
field {
id : 536870915
name : Schema Name
datatype : 4
fieldtype : 1
create-mode : 2
option : 1
timestamp : 1165057260
owner : John Smith
last-changed : John Smith
length-units : 0
maxlength : 30
clob-store-opt : 0
menu-style : 1
qbe-match-op : 1
fulltext-optns : 0
permission : 12\1
}
field {
id : 536870916
name : Type
datatype : 4
fieldtype : 1
create-mode : 2
option : 1
timestamp : 1165057260
owner : John Smith
last-changed : John Smith
length-units : 0
maxlength : 30
clob-store-opt : 0
menu-style : 2
qbe-match-op : 1
fulltext-optns : 0
permission : 12\1
}
field {
id : 536870917
name : Set Field (Query Mode)
datatype : 4
fieldtype : 1
create-mode : 2
option : 2
timestamp : 1489159658
owner : John Smith
last-changed : John Smith
length-units : 0
maxlength : 0
clob-store-opt : 0
menu-style : 1
qbe-match-op : 1
fulltext-optns : 0
permission : 12\1
}
Also note that this is a very limited sample and there could be hundreds of these fields with different values.
You need to temper the [\s\S]*? with a negative lookahead so as to build a tempered greedy token:
/(datatype\s+: 4\b(?:(?!datatype\s+: \d)[\s\S])*?(maxlength\s+:\s0\b))/g
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo
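The (?:(?!datatype\s+: \d)[\s\S])*? part matches any character ([\s\S]), zero or more times and as few times as possible, provided that the character does not start a datatype\s+: \d sequence (the word datatype, one or more whitespace characters (\s+), a colon, a space, and a digit (\d)). The match therefore cannot run past the start of the next field's datatype line.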
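The difference between the plain lazy pattern and the tempered one can be checked with a small Python sketch. The SAMPLE text below is a trimmed-down, hypothetical version of the definition file (three fields, single spaces around the colons), and the names NAIVE and TEMPERED are only for illustration.

```python
import re

# Trimmed, hypothetical stand-in for the definition file.
SAMPLE = """field {
id : 1
datatype : 4
maxlength : 0
}
field {
id : 2
datatype : 4
maxlength : 30
}
field {
id : 3
datatype : 4
maxlength : 0
}
"""

# Plain lazy pattern: may run across a field boundary to reach "maxlength : 0".
NAIVE = re.compile(r"datatype\s+: 4\b[\s\S]*?maxlength\s+:\s0\b")

# Tempered pattern: each consumed character must not start another datatype line.
TEMPERED = re.compile(
    r"datatype\s+: 4\b(?:(?!datatype\s+: \d)[\s\S])*?maxlength\s+:\s0\b"
)

naive_matches = NAIVE.findall(SAMPLE)
tempered_matches = TEMPERED.findall(SAMPLE)

print(len(naive_matches), len(tempered_matches))
```

Both patterns find two matches here, but the second naive match starts in field 2 and swallows its "maxlength : 30" line, while each tempered match stays inside a single field.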

How to create a flat dictionary in Python?

I have a text which I want to convert into a dictionary.
Here's the format of the text :
Apple 0
orange 5:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
orange 6:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
Apple 1
orange 12:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
orange 13:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
I want to convert it into a dictionary like this:
dic_text = {'apple-0-orange-5-text1' : 'random text','apple-0-orange-5-text2' : 'random text','apple-0-orange-5-text3' : 'random text','apple-0-orange-5-text4' : 'random text','apple-0-orange-6-text1' : 'random text','apple-0-orange-6-text2' : 'random text','apple-0-orange-6-text3' : 'random text','apple-0-orange-6-text4' : 'random text','apple-1-orange-12-text1' : 'random text','apple-1-orange-12-text2' : 'random text','apple-1-orange-12-text3' : 'random text','apple-1-orange-12-text4' : 'random text','apple-1-orange-13-text1' : 'random text','apple-1-orange-13-text2' : 'random text','apple-1-orange-13-text3' : 'random text','apple-1-orange-13-text4' : 'random text'}
Can anyone tell me a generic way of building a dictionary like the one above?
Assuming the following information that you did not provide (please edit the question clarifying if this holds or not):
That all the elements are on separate lines
That all the elements take at most one line (so random text does not span multiple lines)
That you want the keys in lowercase
That you do not want to preserve the whitespace at beginning/end of the keys and random text
random text cannot be just whitespace
The "Apple X" line does not contain a :
The "orange Y" line is the only kind of line that ends in : (possibly followed by whitespace), so random text cannot end in :.
After an "Apple X" line there is always an "orange Y" line (possibly after some empty lines).
Then you can do something like this:
def build_dict(iterable):
    result = {}
    main_key = None
    sub_key = None
    for line in iterable:
        # remove whitespace at beginning/end of line
        line = line.strip()
        if not line:
            # throw away empty lines
            continue
        elif ':' not in line:
            # we found an "Apple X" line, transform that into apple-X
            main_key = '-'.join(line.lower().split())
            sub_key = None
        elif line[-1] == ':':
            # we found an "orange Y:" line; drop the trailing colon to get orange-Y
            sub_key = '-'.join(line[:-1].lower().split())
        else:
            # add a `textX : random_text` element; split on the first colon only
            key, value = line.split(':', 1)
            result['-'.join([main_key, sub_key, key.strip()])] = value.strip()
    return result
So you keep track of the current Apple X value in main_key and the current orange Y value in sub_key. Every following text X : random_text line is then split on the colon, the three key parts are joined with hyphens, and the value is saved in the dictionary under that combined key.
If the assumptions I made do not hold then you have to handle things like multiline values etc, which depends on exactly the format of the file.
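As a quick check of the state-tracking approach, here is a compact, self-contained sketch under the same assumptions. The function name flatten, the SAMPLE text, and the URL inside it are made up for illustration; the URL is there only to show a value that itself contains a colon.

```python
def flatten(lines):
    """Build a flat dict keyed by apple-X-orange-Y-textN (sketch only)."""
    result, main_key, sub_key = {}, None, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if ":" not in line:
            # "Apple X" line: start a new main key and reset the sub key
            main_key, sub_key = "-".join(line.lower().split()), None
        elif line.endswith(":"):
            # "orange Y:" line: drop the trailing colon before joining
            sub_key = "-".join(line[:-1].lower().split())
        else:
            # "textN : value" line: split on the first colon only
            key, value = line.split(":", 1)
            result["-".join([main_key, sub_key, key.strip()])] = value.strip()
    return result


SAMPLE = """Apple 0
orange 5:
text1 : random text
text2 : see http://example.com
Apple 1
orange 12:
text1 : random text
"""

flat = flatten(SAMPLE.splitlines())
print(flat)
```

Any iterable of lines works as input, so an open file object could be passed in place of SAMPLE.splitlines().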

Can we extract dynamic data from a string using regex?

I want to validate and get the data for the following tags (9F03, 9F02, 9C) using regex:
9F02060000000060009F03070000000010009C0101
The string above is in tag-length-value format.
9F02, 9F03 and 9C are the tags. The tags have a fixed length, but their position and value in the string can vary.
Immediately after each tag comes the length, in bytes, of the value stored for that tag.
For example:
9F02=tag
06=Length in bytes
000000006000= value
Standard regex cannot count very well; in that respect it behaves like a state machine.
If the number of possible lengths is small, though, you can spell out each possibility as an alternative in the regex and use a separate regex query for each tag:
/9F02(01..|02....|03......)/
/9C(01..|02....)/
... And so on.
Example here.
http://rubular.com/r/euHRxeTLqH
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx {
    public static void main(String[] args) {
        String s = "9F02060000000060009F03070000000010009C0101";
        String regEx = "(9F02|9F03|9C)";
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println("Tag : " + m.group());
            String length = s.substring(m.end(), m.end() + 2);
            System.out.println("Length : " + length);
            int valueEndIndex = m.end() + 3 + Integer.parseInt(length);
            String value = s.substring(m.end() + 3, valueEndIndex);
            System.out.println("Value : " + value);
        }
    }
}
This code will give you the following output:
Tag : 9F02
Length : 06
Value : 000000
Tag : 9F03
Length : 07
Value : 0000000
Tag : 9C
Length : 01
Value : 1
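I am not sure about the byte length you mention, so this code treats the length as a character count and starts reading one character after the length field. If the length is in bytes, each byte takes two hex characters, so the value would start at m.end() + 2 and span 2 × length characters. Either way, this code should help you get started.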
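If the length really is a byte count, the string can also be walked tag by tag instead of searched with a regex. The Python sketch below is one way to do that. It assumes a one-byte length written as two hex characters, and the names KNOWN_TAGS and parse_tlv are hypothetical. The input differs from the question's sample in one place: the 9F03 length is written as 06 rather than 07 so that it agrees with the 6-byte value that follows it.

```python
KNOWN_TAGS = ("9F02", "9F03", "9C")


def parse_tlv(data, tags=KNOWN_TAGS):
    """Walk a hex TLV string and return {tag: hex value} (sketch only)."""
    result = {}
    pos = 0
    while pos < len(data):
        # A known tag must start exactly at the current position.
        tag = next((t for t in tags if data.startswith(t, pos)), None)
        if tag is None:
            raise ValueError(f"unknown tag at offset {pos}")
        pos += len(tag)
        # Assumption: one length byte, i.e. two hex characters.
        length = int(data[pos:pos + 2], 16)
        pos += 2
        # Each value byte is two hex characters.
        value = data[pos:pos + 2 * length]
        if len(value) != 2 * length:
            raise ValueError(f"truncated value for tag {tag}")
        result[tag] = value
        pos += 2 * length
    return result


# Question's sample with the 9F03 length adjusted to 06 (assumption, see above).
parsed = parse_tlv("9F02060000000060009F03060000000010009C0101")
print(parsed)
```

Because the position always jumps past the whole value, tag-like bytes inside a value (for example a 9C in the middle of an amount) are never mistaken for a tag.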

Use string comparisons to split a column in R

As far as I could find, this question hasn't been asked before.
I have a dataframe column called Product. This column has the company name as well as product model in just one column.
product.df <- data.frame("Product" = c("Company1 123M UG", "Company1 234M-I", "Company2 763-87-U","Company2 777-87", "Company3 Name1 87M", "Company3 Name1 O77M", "Company3 Name1 765-U MP"))
I want to split the company names and product model numbers from this single column into two columns. I need a function that can find the words shared between rows and classify them as the company name, with the remaining characters as the product model number. As far as I can tell, no two rows have the same model number. So for the case above, I would get this result:
new.product.df <- data.frame("CompanyName" = c("Company1", "Company1", "Company2","Company2", "Company3 Name1", "Company3 Name1", "Company3 Name1"), "Model" = c("123M UG", "234M-I", "763-87-U", "777-87", "87M", "O77M", "765-U MP"))
I need a function that can compare two strings and return me similar continuous letters and dissimilar letters.
If you're guaranteed that the first word is always the company name, then simply do a fixed split on the first space, limited to at most two pieces:
require(stringi)
stri_split_fixed(product.df[,1], ' ', n=2)
or:
apply(product.df, 2, function(...) { stri_split_fixed(..., ' ', n=2) } )
[1] "Company1" "123M UG"
[1] "Company1" "234M-I"
[1] "Company2" "763-87-U"
[1] "Company2" "777-87"
[1] "Company3" "Name1 87M"
[1] "Company3" "Name1 O77M"
[1] "Company3" "Name1 765-U MP"
Try this
new.product.df <- data.frame(company=
unlist(lapply(strsplit(as.character(product.df$Product), split=" .[0-9]"), function(x) x[1])),
name =
unlist(lapply(strsplit(as.character(product.df$Product), split="[1|2] "), function(x) x[2]))
)
According to your data, the separator between company and product is the first space character, so the first step is to convert that first space into something else (__ in this example). The reason for this is explained further down.
This is your actual data:
Product
1 Company1 123M UG
2 Company1 234M-I
3 Company2 763-87-U
4 Company2 777-87
5 Company3 Name1 87M
6 Company3 Name1 O77M
7 Company3 Name1 765-U MP
This code does the conversion (sub replaces only the first match):
product.df$Product <- sub(product.df$Product , pattern = " " , replacement = "__" ,
perl = TRUE)
The data should then look like this:
Product
1 Company1__123M UG
2 Company1__234M-I
3 Company2__763-87-U
4 Company2__777-87
5 Company3__Name1 87M
6 Company3__Name1 O77M
7 Company3__Name1 765-U MP
Then use the tidyr library to separate the column of this new data frame:
library("tidyr")
new.product.df <- separate( product.df , Product , c("Company" , "Model") , sep = "__")
The reason for converting the first space to __ is that the rest of the string may also contain spaces (for example 123M UG and Name1 87M). Separating on a plain space would split those as well and cause an error later, so the first step exists to avoid that problem when separating the column.
Of course, it would be better to separate directly on the first occurrence of the space character, but I don't know how, because the separator regex appears to be applied globally by default, so any suggestions are welcome.
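None of the answers above recovers "Company3 Name1" as a company name, because they all split on the first space. The original request (treat the words shared between rows as the company) can be sketched as follows. It is written in Python rather than R purely for illustration. The function name split_company_model is hypothetical, and it assumes every entry is non-empty, that rows of the same company share their first word, and that a company with only one row keeps just its last word as the model.

```python
from collections import defaultdict


def split_company_model(products):
    """Split each entry into (company, model) using shared leading words."""
    rows = [p.split() for p in products]

    # Group rows by their first word.
    groups = defaultdict(list)
    for words in rows:
        groups[words[0]].append(words)

    # For each group, count how many leading words all of its rows share.
    prefix_len = {}
    for first, members in groups.items():
        shared = 0
        for column in zip(*members):
            if len(set(column)) != 1:
                break
            shared += 1
        # Always leave at least one word for the model.
        shortest = min(len(words) for words in members)
        prefix_len[first] = max(1, min(shared, shortest - 1))

    return [
        (" ".join(w[:prefix_len[w[0]]]), " ".join(w[prefix_len[w[0]]:]))
        for w in rows
    ]


PRODUCTS = [
    "Company1 123M UG", "Company1 234M-I", "Company2 763-87-U",
    "Company2 777-87", "Company3 Name1 87M", "Company3 Name1 O77M",
    "Company3 Name1 765-U MP",
]
pairs = split_company_model(PRODUCTS)
print(pairs)
```

The comparison is done word by word rather than letter by letter, which avoids splitting in the middle of a model number when two models happen to start with the same character.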

Regular Expression: Remove the middle string

I have the following sentence:
** DATE : 04/12/2014 * TIME: 07:49:42 **
I only want to capture 04/12/2014 07:49:42.
I've tried this .*DATE : ([0-9\/]+.*TIME: [0-9\:]+)
But I got this: "04/12/2014 * TIME: 07:49:42."
How can I remove " * TIME:"?
I need it in pure regex, so I'm testing at http://www.regexr.com/.
[\* ]*DATE\s*:\s*([0-9/]+)[ \*]*TIME\s*:\s*([0-9:]+)[ \*]*
In the replacement, use $1 $2 and you will get what you want.
What about:
/DATE\s*:\s*([0-9/]+).+TIME\s*:\s*([0-9:]+)/
This should do the trick:
/\*\* +DATE : ([\d\/]+) +\* TIME: ([\d:]+) +\*\*/
This returns the date and the time as two captured groups. For example, using JavaScript:
var s = "** DATE : 04/12/2014 * TIME: 07:49:42 **",
re = /\*\* +DATE : ([\d\/]+) +\* TIME: ([\d:]+) +\*\*/;
re.exec(s); // returns [s, "04/12/2014", "07:49:42"]
Breakdown:
\*\* +DATE : (note the space after the colon) matches up to "DATE : "
([\d\/]+) matches digits and slashes and captures them as the first group
+\* TIME: matches up to "TIME: "
([\d:]+) captures the time by matching digits and colons
+\*\*/ finishes off the sequence
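A single capture group cannot skip text in the middle of its match, which is why all of the answers use two groups and join them afterwards. A minimal Python sketch of that idea, with the hypothetical names LINE, PATTERN and stamp:

```python
import re

LINE = "** DATE : 04/12/2014 * TIME: 07:49:42 **"

# Two capture groups; the lazy .*? skips whatever sits between date and TIME.
PATTERN = re.compile(r"DATE\s*:\s*([0-9/]+).*?TIME\s*:\s*([0-9:]+)")

match = PATTERN.search(LINE)
# Join the two captured pieces with a single space.
stamp = f"{match.group(1)} {match.group(2)}"
print(stamp)
```

In a replace-based tool such as regexr, the same result comes from matching the whole line and using the two groups in the replacement string.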