Regular Expressions to extract certain information - regex

I have a body of information: a very large text file, roughly 200k lines, built by merging thousands of pages of PDF text (extracted via OCR). The file contains 'meeting minutes' from a medical board. Within this information is a recurring pattern of critical information that looks like this:
##-## (this is a numbered designation of the 'case')
ACTION: [.....] (this is a sentence that describes what procedure or action is being taken with this 'case')
DECISION [.....] (this is a sentence that describes the outcome or decision of a medical board about this specific case and action)
Here is a live example (with some data scrambled, for obvious medical-privacy reasons):
06-02 Cancer and bubblegum trials Primary Investigator:
"Dr. Strangelove, Ph.D."
"ACTION: At the January 4, 2015 meeting, request for review and approval of the Application for Initial Review"
and attachments for the above-referenced study.
"DECISION: After discussing the risks and safety of the human subjects that will take part in this study, the Board"
approved the submitted documents and initiation of the study. Waiver of Consent granted.
"Approval Period: January 4, 2015 – January 3, 2016"
"Total = 6. Vote: For = 6, Against = 0, Abstain = 0"
My need is to extract very simple key information that would end up looking like:
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
So the key criteria are the ##-## field and whatever sentence follows the keywords ACTION & DECISION.
So far, using regular expressions in TextWrangler, I am able to match
(\d\d-\d\d) or (ACTION) or (DECISION). What I am having a hard time figuring out is how to select all the other text and delete it, or simply copy this grouping into another file.
I plan to use regular expressions and anything else in a Bash script run inside TextWrangler. Any help is greatly appreciated, as I am a noob with regular expressions and a novice with Bash scripting.

Assuming there is a minor mistake in your input file (DECISION: ... instead of DECISION ...), you can achieve this easily using awk. All we have to do is check whether a line starts with DECISION, ACTION, or ##-##. A regular expression for this is /^(##-##)|^(ACTION)|^(DECISION)/. The resulting awk one-liner is as follows:
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' /path/to/file
Example usage:
$ head -n7 file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
Here is a live example (with some data scrambled for obvious medical
information reasons)
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
If the action and decision data is between square brackets, you'll need another regex to extract the information; in that case, leave a comment.
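Since the real case lines start with digits like 06-02 rather than the literal ##-##, and the OCR wraps the ACTION/DECISION lines in quotes, the one-liner above can be adapted. Here is a sketch (the sample file below is condensed from the example in the question; the /tmp paths are illustrative):

```shell
# Create a small sample file (stand-in for the real 200k-line minutes file).
cat > /tmp/minutes_sample.txt <<'EOF'
06-02 Cancer and bubblegum trials Primary Investigator:
"Dr. Strangelove, Ph.D."
"ACTION: At the January 4, 2015 meeting, request for review and approval"
and attachments for the above-referenced study.
"DECISION: the Board approved the submitted documents."
Approval Period: January 4, 2015
EOF

# Keep only case numbers (two digits, dash, two digits) plus ACTION and
# DECISION lines, allowing an optional leading quote left over from OCR.
awk '/^[0-9][0-9]-[0-9][0-9]|^"?ACTION|^"?DECISION/' /tmp/minutes_sample.txt > /tmp/minutes_extract.txt
cat /tmp/minutes_extract.txt
```

The output file then contains just the three lines of interest per case, ready for further cleanup.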

Related

Find All String Occurrences, Except The Last One Found, and Remove Them

I am using Google Docs to open Walmart receipts that I email to myself. The Walmart store that I use 99.9% of the time seems to have made a firmware update to its Ingenico POS terminal that makes it display a running SUBTOTAL after each item is identified by the scanner. Here are some images to support my question.
The POS terminal looks like this:
The second image is the electronic receipt, which I email myself from their iOS app. It is presumably taken from the POS terminal, because it has the extra running SUBTOTAL lines after each item, just like the POS terminal screen shows. It has been doing this for a few months and I've been given no reason to believe, by management, that it will be corrected any time soon.
The final image is my actual paper receipt. This is printed from the register; it's the one that you walk out with and show the greeter/exit person so they can check your buggy and the items you've purchased.
Note that it does not show the extra SUBTOTAL.
I open the electronic receipt in a Google Document and their automatic OCR spits out the text of the receipt. It does a pretty darn good job; I'd say it's 95%+ accurate with these receipts. I apply a very crude little regex that reformats these electronic receipts so that I can enter them into a database and use that data for my family's budgeting, taxes, and so forth. That has been working very well for me, albeit I would like to further automate the process, but that's for a different question some day, perhaps.
Right now, that little crude regex no longer formats the receipt into something usable for me.
What I would like to do is remove the extra SUBTOTALs from the (broken) electronic receipt but leave the last SUBTOTAL alone. I highlighted the last SUBTOTAL on the receipt, which is always there and should remain.
I have seen two other questions that are similar but I could not apply them to my situation. One of them was:
Remove all occurrences except the last one
What have I tried?
The following regex works in the online tester at regex101.com:
\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})
It took me a while to come up with that regex from searching around, but essentially I want it to find all of the SUBTOTAL literals with a preceding newline and any decimal amount (from 0.01 to 999.99), and replace what it finds with a newline. Then my other regex can work on the result like it used to before the firmware update to the POS terminal.
The regex correctly identifies every SUBTOTAL (including the last one) on the regex101.com site. I can apply a substitution of "\n" and I am back to seeing receipt data I can work with, but there are two issues:
1) I can't replicate this using Google Apps Script.
Here is my example:
function myFunction() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText()
    .match('\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})')[1]
    .replace(/%/mgi, "%\n");
  body.clear();
  body.setText(newText);
}
2) Even if I got the above code to work, I would still have the issue of wanting to leave the last SUBTOTAL intact.
Here is a Google Doc that I have set up to experiment with:
https://docs.google.com/document/d/11bOJp2rmWJkvPG1FCAGsQ_n7MqTmsEdhDQtDXDY-52s/edit?usp=sharing
I use this regular expression.
// JavaScript Syntax
'/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g'
Also I make a script for google docs. You can use this Google Doc and see the results.
function deleting_subs() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText();
  var out = newText.replace(/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g, '');
  // This makes the resulting text more readable.
  out = out.replace(/R /g, 'R\n');
  body.clear();
  body.setText(out);
}
To execute the script, open the Google Doc file and click on:
Add-ons.
Del_subs -> Deleting Subs.
Tip: after executing the add-on (Deleting Subs), undo the document edit, so that other users can return to the previous version of the text.
Hope this helps.
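Neither snippet above keeps only the final SUBTOTAL on its own. One way to do that in plain JavaScript is a lookahead that deletes a SUBTOTAL line only when another SUBTOTAL still follows it, so the last one survives. A sketch (the receipt string here is invented sample data, not a real receipt):

```javascript
// Invented sample receipt text; real OCR output will differ.
var receipt = "ITEM A 1.00\nSUBTOTAL 1.00\nITEM B 2.50\nSUBTOTAL 3.50\nTOTAL 3.50";

// Delete a "\nSUBTOTAL <amount>" chunk only if another "\nSUBTOTAL"
// appears later in the string, so the final SUBTOTAL is left intact.
var cleaned = receipt.replace(/\nSUBTOTAL\s\d{1,3}\.\d{2}(?=[\s\S]*\nSUBTOTAL)/g, "");

console.log(cleaned);
```

The same replace call should drop into the Apps Script function in place of the unconditional one, since Apps Script uses ordinary JavaScript regexes.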

How do I match a group of text under a title that changes

I'm new to regex, and everyone here has been an awesome resource for help, but I'm running up against a wall: no matter what, I can't seem to get the grouping to work.
I'm looking to match the name of the room and the products and services that belong to that room. The number of rooms can vary, as can the names; the description of the product or service may change, but the line will always start with "Product" or "Service".
If anyone can point me in the right direction it would be truly appreciated.
Master Bedroom
Product description of the product
Product description of the product
Service description of the service
Kitchen
Product description of the product
Services description of the service
You will probably get better results if you can use a regex alongside a bit of postprocessing. For example, the following regex will match all of the service/product lines:
(Product|Service[s]?)(.*)
But you will still need to get the name of the header. You could perhaps start with something like this:
(.*)\n((Product|Service[s]?)(.*)\n)+
In which case your capturing groups will include the name of the heading and then ALL of the lines in that section; you can then split and process each with the first regex I provided.
If you're able to share which programming language/tool you're using to run this processing, I can help you write the code to split the data correctly from the first regex.
You can look at this regex in action at regexr:
For the input string:
Master Bedroom
Product Bedknobs, cheap
Product Beautiful carpet polish
Service Free pillow sharpening
Kitchen
Product Sink grease
Services Inexpensive cucumber delivery
You will get the following groups:
Master Bedroom
Product Bedknobs, cheap
Product Beautiful carpet polish
Service Free pillow sharpening
and
Kitchen
Product Sink grease
Services Inexpensive cucumber delivery
[edit] note that this regex WILL capture the "Product/Service" string as its own group... Figured you could always throw it away if you didn't need it, but it doesn't hurt to have access to it after parsing :)
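Since no language was specified, here is a rough sketch in Python of the split-then-parse idea (the section regex is adapted from the one above; variable names are illustrative):

```python
import re

text = """Master Bedroom
Product Bedknobs, cheap
Product Beautiful carpet polish
Service Free pillow sharpening
Kitchen
Product Sink grease
Services Inexpensive cucumber delivery"""

# One match per section: the heading line, then every Product/Service line.
section_re = re.compile(r'(.+)\n((?:(?:Product|Services?) .*\n?)+)')

sections = {}
for m in section_re.finditer(text):
    room = m.group(1)
    # Split the block into lines, then each line into keyword + description.
    items = [line.split(' ', 1) for line in m.group(2).strip().splitlines()]
    sections[room] = items

print(sections)
```

Each room name maps to a list of [keyword, description] pairs, so the "Product/Service" token can be kept or discarded as needed.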

PDI - Multiple file input based on date in filename

I'm working with a project using Kettle (PDI).
I have to input multiple file of .csv or .xls and insert it into DB.
The file names are AAMMDDBBBB, where AA is a code for the city and BBBB is a code for the shop. MMDD is a date in MM-DD format. For example: LA0326F5CA.csv.
The regexp I use in the input file steps looks like LA.*\.csv or DT.*\.xls, which returns all the files to insert into the DB.
Can you show me how to select just the files for yesterday (based on the MMDD in the file name)?
As you need some "complex" logic in your selection, you cannot filter based on a regexp alone. I suggest you first read all the filenames, then filter them based on their "age", then read the files based on the selected filenames.
In detail:
Use the Get File Names step with the same regexp you currently use (LA.*\.csv or DT.*\.xls). You may be more restrictive at that stage, with a regexp like LA\d{4}.{4}\.csv, to ensure MM and DD are numbers and BBBB is exactly 4 characters.
Filter based on the date. You can do this with a Java Filter, but it is an order of magnitude easier to use a JavaScript step to compute the "age" of your file and then a Filter rows step to keep only the files from yesterday.
To compute the age of the file and extract the MM and DD, you can use (other methods are available):
var regexp = filename.match(/..(\d\d)(\d\d).*/);
if (regexp) {
  // JavaScript Date months are 0-based, so subtract 1 from MM.
  var age = new Date() - new Date(2018, regexp[1] - 1, regexp[2]);
  age = age / 1000 / 60 / 60 / 24;
}
If you are not familiar with JavaScript regexps: the match tests the
filename against the regexp and keeps the values of the parentheses
in an array. If the test succeeds (which you must explicitly check to
avoid a run-time failure), use the values of the match to compute the
corresponding date, and subtract it from today's date to get the age.
This age is in milliseconds, which is converted to days.
Use the Text File Input or Excel Input step with the option Accept file from previous step. Note that CSV Input does not have this option, but the more powerful Text File Input does.
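Outside Kettle, the age computation in step 2 can be sketched and checked as plain JavaScript (the function name, the injected "today", and the fixed year are illustrative assumptions, not PDI API):

```javascript
// Compute how many days old a file is, based on the MMDD in its name.
// Assumes names like "LA0626F5CA.csv" (AA + MMDD + BBBB) and a known year.
function fileAgeInDays(filename, today, year) {
  var m = filename.match(/^..(\d\d)(\d\d)/);
  if (!m) return null; // name does not follow the AAMMDDBBBB pattern
  // JavaScript Date months are 0-based, so subtract 1 from MM.
  var fileDate = new Date(year, Number(m[1]) - 1, Number(m[2]));
  return (today - fileDate) / 1000 / 60 / 60 / 24;
}

// A file dated 06-26 is one day old on 2018-06-27, so a Filter rows
// condition of age == 1 keeps exactly yesterday's files.
console.log(fileAgeInDays("LA0626F5CA.csv", new Date(2018, 5, 27), 2018));
```

Inside the Modified Java Script Value step, the same logic runs per row with the filename field as input and the age as a new output field.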
Well, I replaced the Java Filter with a Modified Java Script Value step and it works fine now.
Another question: how can I increase the performance and speed of my current transformation (right now I have 2 transformations for 2 cities)? My Insert / Update step makes the transformation slow: it needs almost 1 hour and 30 minutes to process 500k rows of data with a lot of fields (300 MB), and this is not all of my data. If it ran faster and my company liked it, I would do this with 10 TB of data per year, which is a lot of transformations and rows. I need suggestions about this.

Text Analysis Tools

I am currently building a data table in Base SAS and using an index function to flag certain company names embedded in a paragraph of text in a column. If the company name exists, I flag it with a one. When I've looked into the paragraphs in more detail, this simple approach doesn't work. Take this example:
"John Smith advised Coca-Cola on its merger with Pepsi". I'm searching on both Coca-Cola and Pepsi but only want to flag Coca-Cola in this example, as John Smith "advised" them. I don't want both Coca-Cola and Pepsi flagged with a "1". I understand that I can write code that takes the words after certain anchor words, such as "advised" or "represented", and that does work. But what happens if a record simply lists all the companies they have advised, without using an anchor word to identify them? Are there any tools out there that can do this automatically with AI?
Thanks
Chris

Google Cloud Natural Language API - Sentence Extraction ( Python 2.7)

I am working with the Google Cloud Natural Language API . My goal is to extract the sentences and sentiment inside a larger block of text and run sentiment analysis on them.
I am getting the following "unexpected indent" error. Based on my research, it doesn't appear to be a "basic" indent error (such as a rogue space, etc.).
print('Sentence {} has a sentiment score of {}'.format(index, sentence_sentiment))
IndentationError: unexpected indent
The following line of code inside the for loop (see full code below) is causing the problem. If I remove it, the issue goes away.
print(sentence.content)
Also, if I move this print statement outside the loop, I don't get an error, but only the last sentence of the large block of text is printed (as could be expected).
I am totally new to programming - so if someone can explain what I am doing wrong in very simple terms and point me in the right direction I would be really appreciative.
Full script below
Mike
from google.cloud import language

text = 'Terrible, Terrible service. I cant believe how bad this was.'
client = language.Client()
document = client.document_from_text(text)
sent_analysis = document.analyze_sentiment()
sentiment = sent_analysis.sentiment
annotations = document.annotate_text(include_sentiment=True, include_syntax=True, include_entities=True)

print('this is the full text to be analysed:')
print(text)
print('Here is the sentiment score and magnitude for the full text')
print(sentiment.score, sentiment.magnitude)

# now for the individual sentence analyses
for index, sentence in enumerate(annotations.sentences):
    sentence_sentiment = sentence.sentiment.score
    print(sentence.content)
    print('Sentence {} has a sentiment score of {}'.format(index, sentence_sentiment))
This looks completely correct, though there may be a tab/space issue lurking there that did not survive being posted in your question. Can you get your text editor to display whitespace characters? There is usually an option for that. If it is a Python-aware editor, there will be an option to change tabs to spaces.
You may be able to make the problem go away by deleting the line
print(sentence.content)
and changing the following one to
print('{}\nSentence {} has a sentiment score of {}'.format(sentence.content, index, sentence_sentiment))
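The loop shape can be checked without the API by substituting plain data for the annotation objects (the sentence list and scores below are invented stand-ins, not real API output):

```python
# Stand-in data mimicking annotations.sentences: (content, score) pairs.
sentences = [
    ('Terrible, Terrible service.', -0.9),
    ('I cant believe how bad this was.', -0.8),
]

lines = []
for index, (content, score) in enumerate(sentences):
    # Both statements sit at the same indentation depth inside the loop,
    # using spaces only -- mixing tabs and spaces triggers IndentationError.
    lines.append('{}\nSentence {} has a sentiment score of {}'.format(content, index, score))

print('\n'.join(lines))
```

If this version runs but the original still raises IndentationError, the original file almost certainly mixes tabs and spaces on those two lines.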