Find All String Occurrences, Except The Last One Found, and Remove Them - regex

I am using Google Docs to open Walmart receipts that I email to myself. The Walmart store that I use 99.9% of the time seems to have made some firmware update to the Ingenico POS terminal that makes it display a running SUBTOTAL after each item is identified by the scanner. Here are some images to support my question.
The POS terminal looks like this:
The second image is the electronic receipt which I email myself from their iOS app. It is presumably taken from the POS terminal, because it has the extra running SUBTOTAL lines after each item, just like the POS terminal screen shows. It has been doing this for a few months and I've been given no reason to believe, by management, that it will be corrected any time soon.
The final image is my actual paper receipt. This is printed at the register; it's the one that you walk out with and show the greeter/exit person so they can check your buggy and the items you've purchased.
Note that it does not show the extra SUBTOTAL.
I open the electronic receipt in a Google Document and their automatic OCR spits out the text of the receipt. It does a pretty darn good job; I'd say it's 95%+ accurate with these receipts. I apply a very crude little regex that reformats these electronic receipts so that I can enter them into a database and use that data for my family's budgeting, taxes, and so forth. That has been working very well for me, although I would like to further automate the process, but that's for a different question some day, perhaps.
Right now, that little crude regex no longer formats the receipt into something usable for me.
What I would like to do is to remove the extra SUBTOTALS from the (broken) electronic receipt but leave the last SUBTOTAL alone. I highlighted the last SUBTOTAL on the receipt, which is always there, and should remain.
I have seen two other questions that are similar but I could not apply them to my situation. One of them was:
Remove all occurrences except the last one
What have I tried?
The following regex works in the online tester at regex101.com:
\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})
It took me a while to come up with that regex from searching around, but essentially I want it to find all of the SUBTOTAL literals with a preceding new-line and any decimal amount (from 0.01 to 999.99), and I just want to replace each match with a new-line. Then I can let my other regex creation work on the text like it used to before the firmware update to the POS terminal.
The regex correctly identifies every SUBTOTAL (including the last one) on the regex101.com site. I can apply a substitution of "\n" and I am back to seeing receipt data I can work with, but there were two issues:
1) I can't replicate this using Google Apps Script.
Here is my example:
function myFunction() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText()
    .match('\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})')[1]
    .replace(/%/mgi, "%\n");
  body.clear();
  body.setText(newText);
}
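(Side note: one likely reason this snippet fails is that the pattern is passed as a plain string, so the string literal consumes the backslashes, turning \d into a plain d; also, match() without the g flag plus [1] asks for a capture group the pattern does not have. A minimal sketch of the plain removal with a regex literal instead, untested against real receipts:)
function stripSubtotals() {
  var body = DocumentApp.getActiveDocument().getBody();
  // A regex literal keeps the backslashes intact; the g flag replaces every match.
  var re = /\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})/g;
  body.setText(body.getText().replace(re, '\n'));
}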
2) If I were to get the above code to work, I still have the issue of wanting to leave the last SUBTOTAL intact.
Here is a Google Doc that I have set up to experiment with:
https://docs.google.com/document/d/11bOJp2rmWJkvPG1FCAGsQ_n7MqTmsEdhDQtDXDY-52s/edit?usp=sharing

I use this regular expression.
// JavaScript syntax
/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g
I also made a script for Google Docs. You can use this Google Doc and see the results.
function deleting_subs() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText();
  var out = newText.replace(/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g, '');
  // This is needed to make the resulting text more readable.
  out = out.replace(/R /g, 'R\n');
  body.clear();
  body.setText(out);
}
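That removes every matching SUBTOTAL. For the question's second issue, keeping the final SUBTOTAL intact, one option is to count the matches first and skip the last one. A minimal sketch, untested, using the tab-separated pattern the question verified on regex101:
function removeAllButLastSubtotal(text) {
  // The question's pattern: new-line, the SUBTOTAL literal, a tab, a decimal amount.
  var re = /\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})/g;
  var matches = text.match(re);
  if (!matches || matches.length < 2) return text; // nothing extra to remove
  var seen = 0;
  return text.replace(re, function (m) {
    seen++;
    // Keep only the final occurrence; every earlier one becomes a bare new-line.
    return seen === matches.length ? m : '\n';
  });
}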
To execute the script, open the Google Doc and click:
Add-ons -> Del_subs -> Deleting Subs.
Tip: after running the add-on (Deleting Subs), undo the document edit; that way other users can return to the previous version of the text.
Hope this helps.

Related

Parse Days in Status field from Jira Cloud for Google Sheets

I am using the Jira Cloud for Sheets add-on in order to get the Days in Status field from Jira. It seems to have the following syntax, from this post:
<STATUS_ID>_*:*_<NUMBER_OF_TIMES_ISSUE_WAS_IN_THIS_STATUS>_*:*_<SECONDS>_*|
Here is an example:
10060_*:*_1_*:*_1121033406_*|*_3_*:*_1_*:*_7409_*|*_10000_*:*_1_*:*_270003163_*|*_10088_*:*_1_*:*_2595005_*|*_10087_*:*_1_*:*_1126144_*|*_10001_*:*_1_*:*_0
I am trying to extract, for example, how many times the issue was in the In QA status and the duration it spent in a given status. I am dealing with parsing this pattern to obtain this information and return it using an ARRAYFORMULA. Days in Status information is provided only once the issue is completed (in the Done status); otherwise, no information is provided. And if the issue is in the Done status but never transitioned through a given status, that status will not appear in the Days in Status string.
I am trying to use REGEXEXTRACT function to match a pattern for example:
=REGEXEXTRACT(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
and it returns an empty value, where I expect 10060. It caught my attention that when I use the REGEXMATCH function it returns TRUE:
=REGEXMATCH(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
so the syntax is not clear. Google refers, as a reference for regular expressions, to the following documentation. It seems to be an issue with the vertical bar |: per this documentation it is a special character that should be represented like this, \v, but this doesn't work; the REGEXMATCH returns FALSE. I have been trying to find an online regex tester that implements the Google Sheets syntax (RE2); I found ReGo, but I don't know whether it is a valid one.
I was also trying to use the SPLIT function like this:
=query(SPLIT(C2, "_*:*_"), "SELECT Col1")
but it seems to be a more complicated approach for getting all the values I need from the Days in Status string, although it does separate all the values from the pattern above well. In this case, I am getting the first Status ID. The number of columns returned by SPLIT will vary, because it depends on how many statuses the issue transitioned through on its way to the Done status.
It seems to be a complex task given all the issues I have encountered, but maybe some of you have dealt with this before and can advise some ideas. It requires properly parsing the information and then extracting it into specific columns using the ARRAYFORMULA function, where it applies for a given status from the Status column.
Here is a Google Sheets sample with the input information. I would like to populate Times In QA (column C) and Duration in QA (column D; the duration is provided in seconds and I would need days, but this is a minor task) for the In QA status; the same would then apply for the rest of the statuses. I added the tab Settings for mapping the Status IDs to my status names; I would need to use a lookup function to match the Status column in the Jira Issues tab. I would like to have a solution without adding helper columns; maybe it will require some script (see the sketch after the link below).
https://docs.google.com/spreadsheets/d/1ys6oiel1aJkQR9nfxWJsmEyd7XiNkVB-omcNL0ohckY/edit?usp=sharing
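For the script route, a minimal Apps Script sketch of parsing one Days in Status string on its two separators; the function name and the returned field names are illustrative, not part of the add-on:
function parseDaysInStatus(s) {
  // Status entries are separated by "_*|*_"; inside an entry the fields
  // are separated by "_*:*_": statusId, timesInStatus, seconds.
  return s.split('_*|*_').map(function (entry) {
    var parts = entry.split('_*:*_');
    return {
      statusId: parts[0],
      times: Number(parts[1]),
      days: Number(parts[2]) / 86400 // seconds -> days
    };
  });
}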
try:
=INDEX(IFERROR(1/(1/QUERY(1*IFNA(REGEXEXTRACT(C2:C, "10087.{5}(\d+).{5}(\d+)")),
"select Col1,Col2/86400 label Col2/86400''"))))
...so after we do REGEXEXTRACT (where .{5} skips the five-character _*:*_ separator on each side of the capture), some rows, those which cannot be extracted from, will output an #N/A error, so we wrap it in IFNA to remove those errors. Then we multiply by 1 to convert everything into numeric values (regex works with and always outputs plain text). Then we use QUERY to convert the 2nd column from seconds into days in one go. At this point every row has some value, so to get rid of zeros for the rows we don't need (like rows 2, 3, 5, 8, 9, etc.) and keep the output numeric, we use the IFERROR(1/(1/ wrapping. And finally, we use INDEX or ARRAYFORMULA to process our array.

Google Cloud Natural Language API - Sentence Extraction ( Python 2.7)

I am working with the Google Cloud Natural Language API. My goal is to extract the individual sentences inside a larger block of text and run sentiment analysis on each of them.
I am getting the following "unexpected indent" error. Based on my research, it doesn't appear to be a "basic" indent error (such as a rogue space, etc.).
print('Sentence {} has a sentiment score of {}'.format(index,sentence_sentiment)
IndentationError: unexpected indent
The following line of code inside the for loop (see full code below) is causing the problem. If I remove it, the issue goes away:
print(sentence.content)
Also, if I move this print statement outside the loop, I don't get an error, but only the last sentence of the large block of text is printed (as could be expected).
I am totally new to programming, so if someone can explain what I am doing wrong in very simple terms and point me in the right direction, I would be really appreciative.
Full script below
Mike
from google.cloud import language
text = 'Terrible, Terrible service. I cant believe how bad this was.'
client = language.Client()
document = client.document_from_text(text)
sent_analysis = document.analyze_sentiment()
sentiment = sent_analysis.sentiment
annotations = document.annotate_text(include_sentiment=True, include_syntax=True, include_entities=True)
print('this is the full text to be analysed:')
print(text)
print('Here is the sentiment score and magnitude for the full text')
print(sentiment.score, sentiment.magnitude)
#now for the individual sentence analyses
for index, sentence in enumerate(annotations.sentences):
    sentence_sentiment = sentence.sentiment.score
    print(sentence.content)
    print('Sentence {} has a sentiment score of {}'.format(index, sentence_sentiment))
This looks completely correct, though there may be a tab/space issue lurking there that did not survive being posted in your question. Can you get your text editor to display whitespace characters? There is usually an option for that. If it is a Python-aware editor, there will be an option to change tabs to spaces.
You may be able to make the problem go away by deleting the line
print(sentence.content)
and changing the following one to
print('{}\nSentence {} has a sentiment score of {}'.format(sentence.content, index, sentence_sentiment))
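Putting those two suggestions together (and retyping the indentation with spaces), the loop would presumably end up as:
for index, sentence in enumerate(annotations.sentences):
    sentence_sentiment = sentence.sentiment.score
    print('{}\nSentence {} has a sentiment score of {}'.format(
        sentence.content, index, sentence_sentiment))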

Regular Expressions to extract certain information

I have a body of information: a very large text file that is roughly 200k lines. This text file was built by merging thousands of pages of PDF text (extracted via OCR, obviously). This information is 'meeting minutes' from a medical board. Within this information is a recurring pattern of critical information, such as:
##-## (this is a numbered designation of the 'case')
ACTION: [.....] (this is a sentence that describes what procedure or action is being taken with this 'case')
DECISION [.....] (this is a sentence that describes the outcome or decision of a medical board about this specific case and action)
Here is a live example (with some data scrambled, for obvious medical-information reasons):
06-02 Cancer and bubblegum trials Primary Investigator:
"Dr. Strangelove, Ph.D."
"ACTION: At the January 4, 2015 meeting, request for review and approval of the Application for Initial Review"
and attachments for the above-referenced study.
"DECISION: After discussing the risks and safety of the human subjects that will take part in this study, the Board"
approved the submitted documents and initiation of the study. Waiver of Consent granted.
"Approval Period: January 4, 2015 – January 3, 2016"
"Total = 6. Vote: For = 6, Against = 0, Abstain = 0"
My need is to extract very simple key information that would end up looking like:
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
So the key criteria are the ##-## field and whatever sentence follows the keywords ACTION and DECISION.
So far, by using regular expressions in TextWrangler, I am able to match (\d\d-\d\d) or (ACTION) or (DECISION). What I am having a hard time doing is figuring out how to select all the other text and delete it, or simply copy this grouping and put it into another file.
I plan to use regular expressions and anything else in a Bash script that is run inside TextWrangler. Any help is greatly appreciated, as I am a noob with regular expressions, and a novice with Bash scripting.
Assuming there is a minor mistake in your input file (DECISION: ... instead of DECISION ...), you could easily achieve this using awk. All we have to do is check whether a line starts with DECISION, ACTION or ##-##. A regular expression for this is /^(##-##)|^(ACTION)|^(DECISION)/. The resulting awk one-liner is as follows:
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' /path/to/file
Example usage:
$ head -n7 file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
Here is a live example (with some data scrambled for obvious medical
information reasons)
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
If the data of the action and decision is between square brackets, you'll need another regex to extract the information; in that case, leave a comment.
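Note that ^(##-##) matches those five characters literally, which fits the sample file above; if the real case numbers are digits, as in the 06-02 example from the question, the first alternative would presumably need a digit pattern instead, something like:
$ awk '/^[0-9]{2}-[0-9]{2}/ || /^ACTION/ || /^DECISION/ { print }' /path/to/file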

Get goal conversion rate by Time on Page by Page group?

I want to answer this question:
Does the average time on page A (or more accurately page group A) affect the conversion rate of goal B?
So far in the GUI I have:
A) Created an advanced segment of Time on Page >= 120 ("per hit" option):
http://grab.by/tKOA
B) Modified the segment to also add a filter for Page = regex matching my group:
http://grab.by/tKOU
...But I don't know if this gives me the results I'm after; that is, whether they are accurate.
I have some other ideas, including assigning the page group as a funnel step and then segmenting by Time on Page; I'm still waiting on data to come in for that one.
I want to know if there is a better solution, or if I'm on the right track.
Drewdavid,
Your approach is quite smart and correct, I would say; however, keep in mind that in this context you are mixing different scopes:
Time on Page is a page-level metric
Page seen is a visit-level dimension
What you would get in your report is the average time on page calculated from all the pages that were seen during visits which met the regex condition set in the filter (that's what the segment does: it includes all the pages, not just those that you want to filter on). I know this can be confusing, but see this article, which gives more examples and goes into greater detail.
To achieve what you are after, remove the segment filter and simply use the advanced filtering above the report table (and choose exactly the same regex you mentioned in your question).
Hope this helps!

Yahoo Pipes trimming all item titles

After a lot of hard work, I have created two Yahoo Pipes I will be using.
One of them has a minor problem, however: I am trimming the title length down to leave enough room for a "..." and a link to fit within a tweet.
It trims the first post correctly; however, it trims all of the posts after that to 0 length (before adding a bit of extra text to the end).
The problem is that I'm not using a loop for all items after a certain point. The reason for that is that the output of a loop is always items, and I need the output to be a number at a certain point so that I can feed that number in as a variable to trim the length by. The pipe can be found here: http://pipes.yahoo.com/pipes/pipe.info?_id=3e6c3c6b2d23d8ce0cf66cb3efc5fb56
Typically, I am inserting an RSS feed in the top box, something like "new blog post:" in the middle, and "#business #hashtags" in the last box.
If you can see any way I can have this Yahoo Pipe work for all posts rather than just the top one, please let me know. It's not a big deal, as for the moment I am only ever posting the top post to Twitter; however, there may come a point where I need all of them looking the same.