Regex: Recognizing and removing lists and menus - regex

Before I write my own method, I am curious whether there is a regex that can help me.
The Context
I am cleaning raw text prior to running statistical analyses on the terms. The text is from websites and thus includes menus (many menus from many websites).
A typical list/menu appears as follows (Except with one line break between items):
STUDENT SERVICES
Guidance & Support
Core Services
Admissions & Records
Financial Aid
Counseling
Assessment Testing
Kickstart Orientation
Tutoring
Career & Transfer Center
Student Welcome Center
The Task at Hand
I want to remove all lists
I need to remove text blocks where there is a line break after every first second, third or fourth word, but only if this pattern repeats 3 or more times consecutively (I don't want to remove single short sentences such as "Students always succeed.")
Can a regex identify this pattern?
NOTE: I am working in java.
UPDATE with sample text
[[[I WANT TO REMOVE THIS LIST]]]
Offices & Services
Student Services
Activities & Athletics
Records & Registration
Costs & Financial Aid
Compliance & Diversity
Alumni
Faculty/Staff Resources
BMCC Foundation
Human Resources
BMCC Homepage>Academics>Health Education>Course Listings
[[[I WANT TO REMOVE THIS LIST]]]
Health Education Home
Course Listings
Faculty
[[[I WANT TO REMOVE THIS LIST]]]
Community Health Education
Gerontology
School Health Education
Public Health
Visit Admissions
Course Listings
[[[I WANT TO KEEP TEXT BELOW]]]
The following courses are offered by the Department of Health Education.
2CRS., 2HRS, 0 LAB HRS.
HED 100
Health Education
This is an introductory survey course to health education. The course provides students with the knowledge, skills, and behavioral models to enhance their physical, emotional, social, intellectual and spiritual health as well as facilitate their health decision-making ability. The primary areas of instruction include: health and wellness; stress; human sexuality; alcohol, tobacco and substance abuse; nutrition and weight management; and physical fitness. Students who have completed HED 110 - Comprehensive Health Education will not receive credit for this course.
3CRS., 3HRS, 0 LAB HRS.
HED 110
Comprehensive Health Education
This course in health educations offers a comprehensive approach that provides students with the knowledge, skills, and behavioral models to enhance their physical, emotional, social, intellectual and spiritual health as well as facilitate their health decision-making ability. Areas of specialization include: alcohol, tobacco and abused substances, mental and emotional health, human sexuality and family living, nutrition, physical fitness, cardiovascular health, environmental health and health care delivery. HED 110 fulfills all degree requirements for HE 100. Students who have completed HED 100 - Health Education will not receive credit for this course.

Assuming the part about the number of words is not important, try a regex pattern of (([A-Za-z& ])*(\n|\r|\r\n)){5,}, example here.
Change that five quantifier as needed, that is just an example. A five would not match two lines with an extra newline or a three line list without an ending new line.

Related

Google Cloud VideoIntelligence Speech Transcription - Transcription Size

I use Google Cloud Speech Transcription as following :
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]
operation = video_client.annotate_video(gs_video_path, features=features)
result = operation.result(timeout=3600)
And I present the transcript and store the transcript in Django Objects using PostgreSQL as following :
transcriptions = response.annotation_results[0].speech_transcriptions
for transcription in transcriptions:
best_alternative = transcription.alternatives[0]
confidence = best_alternative.confidence
transcript = best_alternative.transcript
if SpeechTranscript.objects.filter(text = transcript).count() == 0:
SpeechTranscript.objects.create(text = transcript,
confidence = confidence)
print(f"Adding -> {confidence:4.10%} | {transcript.strip()}")
else:
pass
For instance the following is the text that I receive from a sample video :
94.9425220490% | I refuse to where is it short sleeve dress shirt. I'm just not going there the president of the United States is a visit to Walter Reed hospital in mid-july format was the combination of weeks of cajoling by trump staff and allies to get the presents for both public health and political perspective wearing a mask to protect against the spread of covid-19 reported in advance of that watery trip and I quote one presidential aide to the president to set an example for a supporters by wearing a mask and the visit.
94.3865835667% | Mask wearing is because well science our best way to slow the spread of the coronavirus. Yes trump or Matthew or 3 but if you know what he said while doing sell it still anybody's guess about what can you really think about NASCAR here is what probably have a mass give you probably have a hospital especially and that particular setting were you talking to a lot of people I think it's but I do believe it. Have a a time and a place very special line trump saying I've never been against masks but I do believe they have a time and a place that isn't exactly a ringing endorsement for mask wearing.
94.8513686657% | Republican skip this isn't it up to four men over the perfumer's that wine about time and place should be a blinking red warning light for people who think debate over whether last for you for next coronavirus. They are is finally behind us time in a place lined everything you need to know about weird Trump is like headed next time he'll get watery because it was a hospital and will continue to express not so scepticism to wear masks in public house new CDC guidelines recommending that mask to be worn inside and one social this thing is it possible outside he sent this?
92.9862976074% | He wearing a face mask as agreed presidents prime minister's dictators Kings Queens and somehow. I don't see it for myself literally main door he responded this way back backstage, but they said you didn't need it trump went to Michigan to this later and he appeared in which personality approaching Mark former vice president Joe Biden
94.6677267551% | In his microwave fighting for wearing a mask and he walked onto the stage where it is massive mask there's nobody understands and there's any takes it off you like to have it hanging off you. I think it makes them feel good frankly if you want to know the truth who's got the largest basket together. Seen it because trump thinks that maths make him and people generally I guess what a week or something is resistant wearing one in public from 1 today which has had a correlation between the erosion of the public's confidence and trump have the corner coronavirus and his number is SE6 a second term in the 67.
94.9921131134% | The coronavirus pandemic in the heels of national and swings they both lots of them that show trump slipping further and further behind former vice president Joe Biden when it comes to General Election good policy would seem to make for good politics at all virtually every infectious disease expert believes that wearing masks in public is our best to contain the spread of coronavirus until a vaccine would do well to listen to buy on this one a mare is the point we make episode every Tuesday and Thursday make sure to check them all out.
What is the predicted size of a transcript that is generated within the speech transcription results. What decides the size of each transcript ? What is the max and minimum character length ? How should I design my SQL table column size, in order to be prepared for the expected transcript size ?
As I mentioned in the comments, the Video Intelligence transcripts are splits with roughly 50-60 seconds from the video.
I have created a Public Issue Tracker case, link, so the product team can clarify this information within the documentation. Although, I do not have an eta for this request, I encourage you to follow the case's thread.

How to mask credit card number mask in a text?

I have a form on my website and my customers send message to me with this form. Sometimes they write their credit card number on the message. So this is really critical. I want to mask these credit card numbers. But of course card numbers don't come on a regular basis.
Example 1: 1111222233334444
Example 2: 4444 3333 2222 1111
Example 3: 4444-3333-2222-1111
Example 4: 4444 - 3333 - 2222 - 1111
Example 5: 4444--3333--2222--1111
So I can mask for example 1, 2 and 3. But if there are more than one space or dash between numbers I can't.
And this is my last regex:
preg_replace("/(?:\b| )([3456]\d{3})([ -]+){0,1}\d{4}([ -]+){0,1}\d{4}([ -]+){0,1}(\d{0})/", "$1********$2", $a1);
And results for this regex:
Result 1: 4444********1111
Result 2: 4444******** 1111
Result 3: 4444********-1111
Result 4: 4444******** - 1111
Result 5: 4444********--1111
So what should I do in regex? Thanks.
May I suggest that you separate validation of your credit card number from the presentation of that number to your users via the UI? Assuming you have only stored valid credit card numbers, then it is probably safe to assume that every number has at least 8 digits. If so, then you can just use a blanket regex to only display the first 4 and last 8 digits:
$cc = "4444--3333--2222--1111";
echo preg_replace("/(\d{4}).*(\d{4})/", "$1********$2", $cc);
4444********1111
Demo
You might point out that this puts the same number of stars in between every card number. But, then again, this is a good thing, because it makes it even harder for a snooper to fish out what the real unmasked number actually is.
Edit:
Here is a smarter regex which will star out the middle portion of any number, leaving only the first and last 4 characters visible:
$cc = "4444--3333--2222--1111";
echo preg_replace("/(?<=.{4}).(?=.{4})/", "*", $cc);
4444**************1111
Note that this solution would not remove anything from 11114444 as a theoretical input.
How to mask credit card number mask in a text [with regex]?
Don't.
Sometimes they write their credit card number on the message.
They really shouldn't. Don't encourage this behavior. It is not PCI compliant:
What is PCI Compliance?
The Payment Card Industry Data Security Standard (PCI DSS) applies to companies of any size that accept credit card payments. If your company intends to accept card payment, and store, process and transmit cardholder data, you need to host your data securely with a PCI compliant hosting provider.
When you accept credit card data via a website, do so using an approved service provider like Stripe, PayPal, BlueSnap, SecurionPay, etc. These services are immensely popular not because it's hard to make payment systems, but because they're hard to make right (and legal). They all have PHP API's, so you can have people enter credit card data that you never see, and still charge them for amounts that you agree upon.
For example, if you were using Stripe and you wish to inform your customer what credit card they signed up with, their card object has a last4 property that gives the last four digits of the card: At this point you never knew the full credit card number, and you didn't even have to consider whether giving the first four and the last four was a security violation.
Further guidelines:
Never store electronic track data or the card security number in any form
While you may have a business reason for storing credit card information, processing regulations specifically forbid the storage of a card’s security code or any “track data” contained in the magnetic strip on the back of a credit card.
The card security number, called by many acronyms including CVV2, CID, and CSC, is the three digit number on the back of Visa/MasterCard/Discover cards or the 4 digit number on the front of American Express cards. It is designed to provide a way for merchants to know whether a customer authorizing a transaction over the phone or via the Internet actually has the card in their possession. This approach only works if the security code is never stored with the card number. Electronic storage makes this easy. You simply do not create a field for the security code. For paper storage, you need to redact (cross out with a dark pen to make unreadable) the security code after you successfully process the transaction and before you store a paper authorization form. [...]
Clearly you should store neither security codes nor track data purposely. But, you need to make sure you don’t store it inadvertently as well. To do this, be certain to use only approved hardware and software. [...]
Make sure all electronic storage of credit card account numbers is encrypted and all paper storage is secured
[...] Electronic storage of credit card numbers is also common if, for example, you process recurring or repeat transactions. If you do this, you need to make certain that you never store these files unencrypted. You need to make certain that any electronic storage is encrypted using a robust encryption algorithm. That way, if your computer is stolen or if someone in your office gains unauthorized access, you have some level of protection for the credit card numbers.
There are many service providers that offer secure storage—either as a standalone service or as part of a payment processing package. These services typically provide you with a “Token” for a card number they store. You can store the token in any unsecured file. When you’re ready to process a payment, you simply send the service provider the token and it retrieves the full card number for the sole purpose of processing the payment. (It’s technically more complicated than that, but you get the idea.) Just be certain to use a PCI DSS Verified provider [...]
Check the next regex \b([3-6]\d{3})(?: *-* *\d{4}){2} *-* *(\d{4})\b.

How can I restrict the output of an Amazon Machine Learning model? (Predicting cricket team results)

I am trying to predict match winner based on the historical data set as shown below,
The data set comprises of IPL seasons and Team_Name_id vs Opponent Team are the team names in IPL. I have set the match id as Row id and created the model. When running realtime testing, the result is not as expected (shown below)
Target is set as Match_winner_id.
Am I missing any configurations? Please help
The model is working perfectly correctly. There's just two problems:
Your input data is not very good
There's no way for the model to know that only one of those two teams should win
Data Quality
A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.
For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.
However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:
Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
Match date: Useless information unless the outcome varies in proportion to time (eg a team continually gets better)
Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
Venue: Only relevant if a particular team always wins at a given venue
Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
Score: Again, not known until the actual game, so no good for future predictions.
Man of the Match: Not known for future games.
Umpire: How does an umpire influence the result of a game?
City: Yes, given that home teams often have an advantage.
You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
Picking only one of the two teams
When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
You are predicting for two given teams and you are listing the teams in either Column C or Column D and this makes no sense -- the result is the same if you swapped the teams between columns, but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.
What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome -- either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask them model to predict the result for that team, given the input data, which would be win/lose or a points above/below the other team.
But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.

Understanding "Real world modelling" program

Few days now I've got new project to do related with a "real world modelling" program.
Here's how it looks like:
A visit to a psychologist (Use queue). Experts provides psychologist's advice, some of them (n) forms therapeutic groups of k people (GrT - duration of group therapy in hours), other experts (m) takes individual patients (InT - duration of individual therapy in hours). Each newly came patient (new patient's appearance probability is p1, recurring patients comes after period of time (h)) can choose to go to a psychologist providing individual therapies, or to group therapies. If group therapy session is full, patients who are wishing to participate in group sessions must wait. Recurring patients wishing to go to group sessions can start a session with smaller group, but can't go to same session with newly came patients. It has been observed that patients who took individual therapy are recovering faster than those, who chose group sessions(they will need less sessions), but there are exceptions - due to social interaction factor, some patients (probability p2) recover h percent faster than those, who choose individual therapy. Individual session costs InC, group session GrC. You need to assess what therapeutic approach patient should choose optimizing with their resources, and how many and what specialists should hire a health care facility.
Here's my approach to this problem:
Read text file containing Names, Surnames, money willing to spend and place everything in queue structure.
Find which group is better for patient by generating random number for p2probability and using it, we'll find if patient recover faster in individual or group therapy. IMO factor sequence here: Money(looking, if patient can afford individual therapy sessions) > p2 (should patient take group sessions if it's better for him).
By looking how many patients there are in queue, we can find how many psychologists we'll need. (Are there any other factors here? What if we are short of experts?)
Problems that I can't understand: how do I implement p1 probability of new patients appearance if I write every patient into a text file and put them in a queue? How many therapy sessions does it take for patient to recover (static number?)?
Am I missing something? Basically it's open question and there could be no right answer. If anyone have any suggestions how to build this program to better one, I'd be glad to take it!
Programming language I'm using: C++
If you want to break up a task, analyse it and prepare it for coding, you could :
Firstly make a Block diagram, representing program flow control.
Followed by Pseudo code implementation.
P.S. update the question following the above and when you reach the "code stage", there, definitely, will be more help.

user profiling and outlier detection

I have a dataset including 1 million customers. They are splitted into some categories like electronics customers,food and Beverage customers etc. Group names present customers' profiles.
each customer has different behaviours. For instance suppose that an electronic customer buys one electronic devices at least when he goes shopping. This transaction repeats randomly or continuously. So that I present each transaction by numerical codes.
(Value of transaction, volume of trans., transaction type, etc..) = (100,200,1)
for each transaction I have this vector above.it means every customer has a different trade behaviour.
I want to find out whether each customer has a pattern? Do we have outliers?
it is a profiling problem basically.
which analysis do you recommend?
Can you be more specific? What are you trying to get out of the analysis exactly? Buying patterns, customers that are outliers, purchases that are outliers?
If you want to determine which items are bought together, group the transactions together, just listing the items purchased at the same time and do shopping basket analysis, using the apriori algorithm or similar.
If you want to find similar customers, using k nearest neighbor or k means against a vector representing a customer's buy patterns (probably just the items bought). You can do this on individual transactions also to compare transactions.
To determine outliers, you can use a density based clustering algorithm (e.g. DBSCAN) to cluster customers together that are close to one another, and look at those customers that are not in clusters to determine outliers also.