Match text from two different lines using Regex

Hi.
I have an accounting system called Adra Matcher, and I want to use it to match pairs of lines that share some of the same text. An example of the shared text is "INV 7895852", but the number will always differ from pair to pair.
Two lines that contain the same 11 characters (INV -------) should match.
Here are examples of one type of line:
20220830 USD 792 CHQ# 4545 INV 8888828
20220830 USD 1440 CHQ# 4585 INV 8924216
20220830 USD 1152 CHQ# 2584 INV 8812578
20220830 USD 324 CHQ# 66088287 INV 8945308
Those lines are to be matched with these lines:
20220830 Public Schools INV 888828
20220830 COLLEGE INV 8924216
20220830 School District INV 8812578 check
20220830 School District CHQ 1584255 INV 8945308 check
The only thing that is sure to be the same in both lines is INV and the 7 digits after it.
Can someone help me with this code?
Maren
I tried several different expressions, but some didn't work at all, and one matched lines that didn't have the same number in the text.
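One way to think about the pairing is to pull the invoice number out of each line and use it as a lookup key. The following is only a minimal Python sketch of that idea, not Adra Matcher syntax; the names `invoice_key` and `pair_lines` are made up for illustration, and it assumes the reference is always "INV" followed by exactly 7 digits.

```python
import re

# Assumption: the reference is always "INV", whitespace, then exactly 7 digits.
INV_RE = re.compile(r"INV\s+(\d{7})\b")

def invoice_key(line):
    """Return the 7-digit invoice number found in a line, or None."""
    m = INV_RE.search(line)
    return m.group(1) if m else None

def pair_lines(first_set, second_set):
    """Pair each line of first_set with the second_set line sharing its invoice number."""
    by_key = {}
    for line in second_set:
        key = invoice_key(line)
        if key is not None:
            by_key[key] = line
    pairs = []
    for line in first_set:
        key = invoice_key(line)
        if key in by_key:
            pairs.append((line, by_key[key]))
    return pairs

bank = [
    "20220830 USD 792 CHQ# 4545 INV 8888828",
    "20220830 USD 1440 CHQ# 4585 INV 8924216",
    "20220830 USD 1152 CHQ# 2584 INV 8812578",
    "20220830 USD 324 CHQ# 66088287 INV 8945308",
]
ledger = [
    "20220830 Public Schools INV 888828",
    "20220830 COLLEGE INV 8924216",
    "20220830 School District INV 8812578 check",
    "20220830 School District CHQ 1584255 INV 8945308 check",
]

pairs = pair_lines(bank, ledger)
for left, right in pairs:
    print(left, "<->", right)
```

With the sample data as posted, three pairs come out. The line "20220830 Public Schools INV 888828" has only six digits after INV, so under the 7-digit assumption it stays unpaired.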

Related

Extract the last 1 or 2 alphabetic character(s) from a quantity (mg,g,ml,l,cm,mm,m) when preceded by a digit

I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.

You could also make the pattern a bit more specific by matching either a digit or a space, and then capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
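To see what the capture group returns outside of Sheets, here is a small Python sketch that applies the same pattern to a few of the sample rows. The helper name `unit_of` is only illustrative, and Python's `re` module stands in for the engine behind REGEXEXTRACT, which may differ in other details.

```python
import re

# Same pattern as in the formula above: a digit or space, then a known unit at the end.
UNIT_RE = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")

def unit_of(text):
    """Return the captured unit, or an empty string when nothing matches."""
    m = UNIT_RE.search(text)
    return m.group(1) if m else ""

samples = [
    "Colgate Plax Spearmint Alcohol Free Mouthwash 500ml",
    "Peckish Tangy BBQ Rice Crackers 100g",
    "Banana Chips kg",
    "Coca-Cola No Sugar Soft Drink 2.25l",
    "Morning Harvest Caged Size 7 Eggs 12pk",
]

units = [unit_of(s) for s in samples]
print(units)
```

Because the pattern has exactly one capturing group, only one value comes back per row, which is what avoids the two-column result in the question.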

Match 1 optional letter, then 1 letter anchored to end:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")
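A quick Python check of this shorter pattern, again only a sketch with Python's `re` standing in for the Sheets engine. The helper name `trailing_letters` and the input "Bread loaf" are made up to show the trade-off: the pattern returns the last one or two letters of any text, whether or not they are a unit.

```python
import re

# One optional letter followed by one required letter, anchored to the end.
TAIL_RE = re.compile(r"[a-zA-Z]?[a-zA-Z]$")

def trailing_letters(text):
    """Return the last one or two letters of the text, or an empty string."""
    m = TAIL_RE.search(text)
    return m.group(0) if m else ""

print(trailing_letters("Value Lite Milk 2l"))
print(trailing_letters("Banana Chips kg"))
print(trailing_letters("Bread loaf"))  # hypothetical row without a unit
```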

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after each of them. Unfortunately there are cases where the street information to be extracted is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET and the Country, separated by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.

If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
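As a rough check, the same expression can be run in Python, where `re.MULTILINE` plays the role of the 'multiline' option and `finditer` the role of 'global'. This is a sketch under the assumption that Python's `re` behaves like PCRE for this particular pattern; the variable names are illustrative.

```python
import re

# The pattern from the answer; MULTILINE lets ^ and $ match at each line boundary.
ADDRESS_RE = re.compile(r"^(.*?)(?=\s+(USA|Canada)\s*$)", re.MULTILINE)

text = "\n".join([
    "12H MARKET ST. Canada",
    "123 SW 4TH Street USA",
    "1234 WEST SAND LAKE RD ?567 USA",
    "# USA",
    "1234 E MAIN STREET USA",
])

# Group 1 holds everything before the whitespace that precedes the country.
addresses = [m.group(1) for m in ADDRESS_RE.finditer(text)]
for address in addresses:
    print(repr(address))
```

Both the single-space case ("ST. Canada") and the multi-space case give an address without trailing whitespace, because the lookahead consumes nothing and the lazy `.*?` stops at the first position where the rest of the line is only whitespace plus a country.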

Match until specific word...

Cannot for the life of me figure this out...
I am parsing a recipe:
Steak and Eggs
Serves 1
150g potatoes, diced
2 tsp olive oil
½ white onion, peeled and chopped
Salt and pepper
200g sirloin steak or rump steak, trimmed of fat
2 large eggs
Reduced carb meal
Family friendly
This is the first part of my regex, which matches the title and the serving number:
(\D*\n*)(?:\nServes )(\d\n)
I want to stop matching when it hits the categories at the end. I've worked out these may contain 'Reduced' 'Family' 'Quick' etc.
I have then tried to do this as so:
/(\D*\n*)(?:\nServes )(\d\n)((.*\n)*)(?:Carb|Reduced|Quick|Family)/
However, if there are two tags, as in the example, 'Reduced Carb meal' will be included as the Reg Ex continues until the 'Family' line.
Any help would be appreciated as I've been on this for 2 hours!

You simply need to change your non capturing group to a negative lookahead, like this:
/(\D*\n*)(?:\nServes )(\d\n)(.*(?:\n(?!Carb|Reduced|Quick|Family).*)*)/
Edited according to @bobble bubble's suggestion, give him the credit!
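To illustrate how the negative lookahead stops the ingredient block, here is a small Python sketch that runs the expression (without the surrounding slashes) on the recipe from the question. Python's `re` is assumed to treat this pattern the same way as the original engine, and the variable names are illustrative.

```python
import re

RECIPE_RE = re.compile(
    r"(\D*\n*)(?:\nServes )(\d\n)(.*(?:\n(?!Carb|Reduced|Quick|Family).*)*)"
)

recipe = "\n".join([
    "Steak and Eggs",
    "Serves 1",
    "150g potatoes, diced",
    "2 tsp olive oil",
    "½ white onion, peeled and chopped",
    "Salt and pepper",
    "200g sirloin steak or rump steak, trimmed of fat",
    "2 large eggs",
    "Reduced carb meal",
    "Family friendly",
])

m = RECIPE_RE.search(recipe)
title = m.group(1)                    # text before the "Serves" line
servings = m.group(2).strip()         # the single digit after "Serves "
ingredients = m.group(3).splitlines() # lines up to the first tag line

print(title, servings)
print(ingredients)
```

Each further line is only consumed if the text right after the newline does not start with one of the tag words, so the match ends before "Reduced carb meal" and never reaches "Family friendly".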

Bookmark Lines with positive $ amounts in Notepad++

I have a text doc with about 9000 lines. The data is alpha numeric. Within the doc, there are approximately 150 lines I need to identify. The only common factor is that each contains a dollar amount. I've tried multiple Regex searches, and just can't get it right.
INVALID PAYMENT AMT
013 1887000 CRJ 0.00 03/04/2015-01222015 - Code 938
INVALID PAYMENT AMT
019 0 ,CRJ 426.72 03/06/2015-01282015 - Code 628
In the example above, I need to bookmark the line with the 426.72. I don't care about the other 3 lines. Every line I need in the document has a positive dollar amount.

Perhaps:
(([1-9][0-9]*)\.([0-9]*[1-9][0-9]|00)*)|(0\.([0-9]*[1-9][0-9]))
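A quick way to sanity-check the expression before using it in Notepad++ is to run it over the sample lines. The Python sketch below does that, assuming Python's `re` treats this pattern like Notepad++'s engine; the variable names are illustrative.

```python
import re

# The expression suggested above, unchanged.
AMOUNT_RE = re.compile(
    r"(([1-9][0-9]*)\.([0-9]*[1-9][0-9]|00)*)|(0\.([0-9]*[1-9][0-9]))"
)

lines = [
    "INVALID PAYMENT AMT",
    "013 1887000 CRJ 0.00 03/04/2015-01222015 - Code 938",
    "INVALID PAYMENT AMT",
    "019 0 ,CRJ 426.72 03/06/2015-01282015 - Code 628",
]

# Keep only the lines in which the pattern finds a match.
flagged = [line for line in lines if AMOUNT_RE.search(line)]
for line in flagged:
    print(line)
```

On these four lines only the 426.72 line is flagged. One thing worth checking against the real data: an amount such as 0.05 is not matched by this expression, so lines with amounts under ten cents would be missed.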

Regular Expression for group

I have a text where I need to find 3 groups of strings.
I tried the expression \r?\n\r?\n\r?[0-9A-Z].*\d{7} but it finds only 2 strings instead of 3.
It should highlight 00170784, HEDINV and 00173575, but I get only 00170784 and 00173575.
This is the text:
BUY
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.375
SELL
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.5938

May I suggest using this instead?
^\d{8}$|^[A-Z]{6}$
It has two alternatives it looks for. One is an 8-digit sequence that fills a whole line. The other is a 6-letter sequence that fills a whole line. With multiline mode enabled, so that ^ and $ match at line boundaries, that grabs what you're looking for, unless there's a specific reason you're using all those line-break matches.
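A short Python sketch of this suggestion, run on the first block of the sample text. `re.MULTILINE` is needed so that `^` and `$` anchor to each line rather than to the whole text; the variable names are illustrative.

```python
import re

# Whole-line match: either exactly 8 digits or exactly 6 uppercase letters.
ID_RE = re.compile(r"^\d{8}$|^[A-Z]{6}$", re.MULTILINE)

text = "\n".join([
    "BUY",
    "USM4",
    "200 contracts",
    "04/28/2014 15:50",
    "00170784",
    "56",
    "contracts",
    "HEDINV",
    "64",
    "contracts",
    "00173575",
    "80",
    "contracts",
    "At average price of USD 134.375",
])

found = ID_RE.findall(text)
print(found)
```

Since the pattern has no capturing groups, `findall` returns the full matches in the order they appear.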
