regular expression select everything after word - regex

I am trying to select everything after CORP ACT OPTION NO., but my regular expression stops as soon as it meets a /.
My regular expression at the moment is CORP ACT OPTION NO.([^/]*)
CORP ACT REFERENCE : 007XS0212069115
SENDER'S REFERENCE : 1212070800330001
FUNCTION OF MESSAGE : NEW MESSAGE
CORP ACT EVENT : INTEREST PAYMENT
PLACE OF SAFEKEEPING : US
ISIN : XS0212069115
ISIN/DESCRIPTION : KFW 4.750 071212 GBP
METHOD OF INTEREST COMPUTATION : A006
EX-DATE : 20121207
RECORD DATE : 20121206
CORP ACT OPTION NO. : 001
CORPORATE ACTION OPTION CODE : CASH
CURRENCY OPTION : GBP
RESULTING AMT : GBP617,5
PAYMENT DATE : 20121207
EXCHANGE RATE : GBP/GBP/1,
INTEREST RATE : 4,75
SAFEKEEPING ACCOUNT : 000000000000
CONFIRMED BALANCE : FAMT/13000,
CREDIT/DEBIT IND : CREDIT
How can I select everything? Many thanks.

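Your character class [^/] excludes the slash, so the match stops there. Use [\s\S] instead, which matches any character, including newlines: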
CORP ACT OPTION NO\.([\s\S]*)
See it here in action: http://regexr.com?33vd0
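As a rough illustration, the difference between the two character classes can be checked in Python. This is only a sketch: the `message` string is a shortened copy of the sample from the question, and the variable names are made up for the example.

```python
import re

# Shortened sample of the message from the question.
message = (
    "RECORD DATE : 20121206\n"
    "CORP ACT OPTION NO. : 001\n"
    "EXCHANGE RATE : GBP/GBP/1,\n"
    "CREDIT/DEBIT IND : CREDIT\n"
)

# [^/]* stops at the first slash, which is why the original attempt is cut short.
stops_at_slash = re.search(r"CORP ACT OPTION NO\.([^/]*)", message).group(1)

# [\s\S]* matches any character, newlines included, up to the end of the input.
everything = re.search(r"CORP ACT OPTION NO\.([\s\S]*)", message).group(1)

print(repr(stops_at_slash))
print(repr(everything))
```

Note that the captured group starts right after the literal "NO.", so it still contains the leading " : " of that line.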


How to get the invoice number and Customer PO using a regex from the below text? [duplicate]

Credit Notes - Original
Invoice Number : 362/867/88540129 Customer PO : 124698753
Invoice Date : 2019-10-17 Reference A : BONNEVILLE POWER
Sales Order : UTS003832 ADMINISTRATIO
Business Partner : BP0042488 Customer Tax : 12-9871234
For the invoice number you may try:
(?<=\bInvoice Number : )\S+
And for the customer PO:
(?<=\bCustomer PO : )\d+
This is just an abstract regex solution though. Most likely, if you were using a programming language, you would take a different approach.
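In a programming language, one common alternative to lookbehinds is a plain capture group. The following Python sketch assumes the text is available as a string; the variable names are illustrative only.

```python
import re

text = (
    "Credit Notes - Original\n"
    "Invoice Number : 362/867/88540129 Customer PO : 124698753\n"
    "Invoice Date : 2019-10-17 Reference A : BONNEVILLE POWER\n"
)

# The label is matched literally and only the value is captured.
invoice = re.search(r"\bInvoice Number : (\S+)", text)
customer_po = re.search(r"\bCustomer PO : (\d+)", text)

# Guard against a missing label instead of assuming a match.
invoice_number = invoice.group(1) if invoice else None
po_number = customer_po.group(1) if customer_po else None

print(invoice_number, po_number)
```

Capture groups also work in engines that lack lookbehind support, which makes this form a little more portable.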

Error: "Function REGEXEXTRACT parameter 2 value "..." is not a valid regular expression

I have a problem writing a regex in Google Sheets.
I have one cell containing text with several pieces of information:
D ID : d_************
T ID : t_************
Date : **/**/2019
O ID : ************
And I'd like to have every ID in different cells.
I tried this:
=REGEXEXTRACT(L9,"\[(?<=D ID : )(.*)(?=T)]\")
But I got the error in the title
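Google Sheets uses the RE2 regex engine, which does not support lookarounds such as (?<=...) and (?=...). Try splitting the cell into lines instead: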
=ARRAYFORMULA(REGEXEXTRACT(FILTER(SPLIT(A1; CHAR(10));
REGEXMATCH(SPLIT(A1; CHAR(10)); "ID")); ": (.*)"))
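For readers who find the nested formula hard to follow, the same three steps (split on line breaks, keep the lines that mention "ID", capture what follows ": ") can be sketched in Python. The IDs in `cell` are hypothetical placeholders, since the question masks the real values.

```python
import re

# Hypothetical cell content; the real IDs are masked in the question.
cell = "D ID : d_123\nT ID : t_456\nDate : 01/02/2019\nO ID : 789"

ids = [
    re.search(r": (.*)", line).group(1)   # REGEXEXTRACT(...; ": (.*)")
    for line in cell.split("\n")          # SPLIT(A1; CHAR(10))
    if re.search(r"ID", line)             # FILTER(...; REGEXMATCH(...; "ID"))
]

print(ids)
```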

Regex capture lines A, B, or C in any order only when not preceded by D

I have a file with content something like this:
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
Then later in the file, it has something like this:
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever I am looking to extract, but only in the subject company section--not the reporting owner section. These lines may be in any order, or not present, but I want to capture their values if they are present.
The regex I was trying to build looks like this:
(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
The desired results would be as follows:
conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "0000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"
Any flavor of regex is acceptable for this, as I'll adapt to whatever works best for the scenario. Thank you for reading about my quandary and for any guidance you can offer.
I ended up figuring it out on my own. Try it out here.
/SUBJECT COMPANY:\s+COMPANY DATA:(?:\s+(?:(?:COMPANY CONFORMED NAME:\s+(?'conformed_name'[^\n]+))|(?:CENTRAL INDEX KEY:\s+(?'CIK'\d{10}))|(?:STANDARD INDUSTRIAL CLASSIFICATION:\s+(?'assigned_SIC'[^\n]+))|(?:IRS NUMBER:\s+?(?'IRS_number'\w{2}-?\w{7,8}))|(?:STATE OF INCORPORATION:\s+(?'state_of_incorporation'\w{2}))|(?:FISCAL YEAR END:\s+(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))\n))+/s
To match only the company section, and only when preceded by “SUBJECT COMPANY”, use a look behind:
(?<=SUBJECT COMPANY:\t\n \n )(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
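Since any flavor is acceptable, a two-step variant may be easier to maintain than a single pattern: first isolate the subject company section, then search it for each field on its own. The Python sketch below is not the approach from either answer above. It assumes that the section ends at the next line starting with "<" (or at the end of the file), and it simplifies some field patterns; `field_patterns` and the other names are made up for illustration.

```python
import re

filing = """SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
"""

# Step 1: cut out the subject company section (assumed to end at the next
# line that starts with "<", or at the end of the input).
section = re.search(r"SUBJECT COMPANY:(.*?)(?=^<|\Z)", filing, re.S | re.M)

# Step 2: look for each field independently, so order and absence do not matter.
field_patterns = {
    "conformed_name": r"COMPANY CONFORMED NAME:[ \t]*(.+)",
    "cik": r"CENTRAL INDEX KEY:[ \t]*(\d{10})",
    "irs_number": r"IRS NUMBER:[ \t]*(\w{2}-?\w{7,8})",
    "fiscal_year_end": r"FISCAL YEAR END:[ \t]*(\d{4})",
}

fields = {}
if section:
    for name, pattern in field_patterns.items():
        match = re.search(pattern, section.group(1))
        if match:
            fields[name] = match.group(1).strip()

print(fields)
```

Fields that are missing from the section are simply left out of `fields`, and the reporting owner block is never searched.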

Use regex recursively taking indentation level into account

I am trying to parse a custom input file for a simulation code I am writing. It consists of nested "objects" with properties and values (see the link).
Here is an example file and the regex I am using currently.
([^:#\n]*):?([^#\n]*)#?.*\n
It is built so that each match is a line with two capture groups, one for the property and one for its value. It also excludes "#" and ":" from the character set, as they are the comment delimiter and the property:value delimiter respectively.
How can I modify my regex so as to match the structure recursively? That is, if line n+1 has a higher indentation level than line n, it should be matched as a subgroup of line n's match.
I am working on Octave, which uses PCRE regex format.
I asked if you have control over the data format because as it is, the data is very easy to parse with YAML instead of regex.
The only problem is that the object is not well formed:
1) Take the regions object, for example: it has many attributes that are all called layer. I think your intention is to build a list of layers rather than a lot of properties with the same name.
2) Consider now each layer property that has a corresponding value. Following each layer are orphan attributes that I presume belong to that layer.
With these ideas in mind, if you format your object following YAML rules, it becomes a breeze to parse.
I know that you are working in Octave, but consider the modifications I made to your data and how easy it is to parse, in this case with Python.
DATA AS YOU HAVE IT NOW
case :
    name : tandem solar cell
    options :
        verbose : true
        t_stamp : system
    units :
        energy : eV
        length : nm
        time : s
        tension : V
        temperature: K
        mqty : mole
        light : cd
    regions :
        layer : Glass
            geometry:
                thick : 80 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt
        layer : FTO
            geometry:
                thick : 10 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt
MODIFIED DATA TO COMPLY WITH YAML SYNTAX
case :
    name : tandem solar cell
    options :
        verbose : true
        t_stamp : system # a sample comment
    units :
        energy : eV
        length : nm
        time : s
        tension : V
        temperature: K
        mqty : mole
        light : cd
    regions :
        - layer : Glass # ADDED THE - TO MAKE IT A LIST OF LAYERS
          geometry : # AND KEEP INDENTATION PROPERLY
              thick : 80 nm
              npoints : 10
          optical :
              nk_file : vacuum.txt
        - layer : FTO
          geometry:
              thick : 10 nm
              npoints : 10
          optical :
              nk_file : vacuum.txt
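With only these instructions you get your object parsed:
import yaml
# `text` is a string holding the YAML-formatted data shown above
data = yaml.safe_load(text)  # recent PyYAML versions require a Loader for yaml.load; safe_load needs none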
""" your data would be parsed as:
{'case': {'name': 'tandem solar cell',
'options': {'t_stamp': 'system', 'verbose': True},
'regions': [{'geometry': {'npoints': 10, 'thick': '80 nm'},
'layer': 'Glass',
'optical': {'nk_file': 'vacuum.txt'}},
{'geometry': {'npoints': 10, 'thick': '10 nm'},
'layer': 'FTO',
'optical': {'nk_file': 'vacuum.txt'}}],
'units': {'energy': 'eV',
'length': 'nm',
'light': 'cd',
'mqty': 'mole',
'temperature': 'K',
'tension': 'V',
'time': 's'}}}
"""

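If changing the data format is not an option, the nesting rule from the question (a more deeply indented line belongs to the line above it) can also be handled without a recursive regex, by reading line by line and keeping a stack of open blocks. The following is a minimal Python sketch of that idea, not a port of the Octave regex. The function name `parse_indented` is hypothetical, and the sketch assumes unique keys per block, so repeated layer entries would overwrite each other.

```python
def parse_indented(text):
    """Parse 'key : value' lines into nested dicts based on indentation."""
    root = {}
    stack = [(-1, root)]  # (indent of the line that opened the dict, dict)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        key, _, value = line.strip().partition(":")
        key, value = key.strip(), value.strip()
        # Close every block that is indented at least as much as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[key] = value
        else:
            # A key without a value opens a new nested block.
            child = {}
            parent[key] = child
            stack.append((indent, child))
    return root


sample = """\
case :
    name : tandem solar cell
    options :
        verbose : true   # a comment
    units :
        energy : eV
"""

parsed = parse_indented(sample)
print(parsed)
```

All values stay strings here; converting "true" or "10" to native types would be an extra step.
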
Regex to find everything in between

I have the following regex, which works when there is no leading \d,"There is 1 interface on the system:
or a trailing ",2017-01-...
Here is the regex:
(?m)(?<_KEY_1>\w+[^:]+?):\s(?<_VAL_1>[^\r\n]+)$
Here is a sample of what I am trying to parse:
1,"There is 1 interface on the system:
Name : Mobile Broadband Connection
Description : Qualcomm Gobi 2000 HS-USB Mobile Broadband Device 250F
GUID : {1234567-12CD-1BC1-A012-C1A1234CBE12}
Physical Address : 00:a0:c6:00:00:00
State : Connected
Device type : Mobile Broadband device is embedded in the system
Cellular class : CDMA
Device Id : A1000001234f67
Manufacturer : Qualcomm Incorporated
Model : Qualcomm Gobi 2000
Firmware Version : 09010091
Provider Name : Verizon Wireless
Roaming : Not roaming
Signal : 67%",2017-01-20T16:00:07.000-0700
I am trying to extract the field names and values (for example, Cellular class would equal CDMA), but only for the fields that come after
1,"There is 1 interface on the system: (where the leading 1 increments: 1, 2, 3, 4 and so on)
and before the trailing ",2017-01....
Any help is much appreciated!
You could use look-ahead to ensure that the strings you match come before a ",\d sequence, and do not include a ". The latter would ensure you will only match between double quotes, of which the second has the pattern ",\d:
/^\h*(?<_KEY_1>[\w\h]+?)\h*:\h*(?<_VAL_1>[^\r\n"]+)(?="|$)(?=[^"]*",\d)/gm
See it on regex101
NB: I put the g and m modifiers at the end, but if your environment requires them at the start with (?m) notation, that will work too of course.
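If the pattern has to run in Python, it needs two small adaptations, since Python's re module has no \h and writes named groups as (?P<name>...). The sketch below replaces \h with [ \t] and uses a shortened copy of the sample record; it is an adaptation under those assumptions, not the exact pattern tested on regex101.

```python
import re

# Shortened copy of the record from the question.
record = (
    '1,"There is 1 interface on the system:\n'
    "Name : Mobile Broadband Connection\n"
    "Physical Address : 00:a0:c6:00:00:00\n"
    "Cellular class : CDMA\n"
    'Signal : 67%",2017-01-20T16:00:07.000-0700\n'
)

pattern = re.compile(
    r'^[ \t]*(?P<_KEY_1>[\w \t]+?)[ \t]*:[ \t]*(?P<_VAL_1>[^\r\n"]+)'
    r'(?="|$)(?=[^"]*",\d)',  # must still be followed by the closing ",<digit>
    re.MULTILINE,
)

pairs = {m.group("_KEY_1"): m.group("_VAL_1") for m in pattern.finditer(record)}
print(pairs)
```

The header line is skipped because its first field is followed by a comma rather than a colon, and values containing colons (such as the physical address) are kept whole.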
Your example string seems to be a record from a CSV file. This is how I would accomplish the task with Python (2.7 or 3.x):
import csv

with open('file.csv', 'r') as fh:
    reader = csv.reader(fh)
    results = []
    for fields in reader:
        # fields[1] is the quoted multi-line block of each record
        lines = fields[1].splitlines()
        # skip the first line ("There is 1 interface ...") and split each remaining line at the first colon
        keyvals = [list(map(str.strip, line.split(':', 1))) for line in lines[1:]]
        results.append(keyvals)
print(results)
It can be done in a similar way with other languages.
You haven't responded to my comments or any of the answers, but here is my answer - try
^\s*(?<_KEY_1>[\w\s]+?)\s*:\s*(?<_VAL_1>[^\r\n"]+).*$
See it here at regex101.