How to find dates in any text with regex? [closed] - regex

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have text extracted by an OCR program. So far I have managed to get every element I wanted except the date. In some cases the date looks like ASDICA>31.04.2019END($>, and in others it is surrounded by spaces (which is easy to extract). My question:
Is there any quick function without nested for loops to parse through the text and extract dates?
My first amateur thought was to build a list of common date separators, scan the text, save the positions of the separators found, and then look at their neighbours to build a date.
This took a lot of time and proved troublesome because I keep hitting escape characters due to the OCR's behavior.
My ideal output would be 31/04/2019, but I can handle the symbol replacement as long as I get a list of the dates in the text.

To begin with, ASDICA>31.04.2019END($> does not contain a valid date :) April has only 30 days.
But to answer your question, you can use the dateutil module, specifically the parser.parse function:
from dateutil import parser
# Parse a date from the string; fuzzy=True lets the parser skip the non-date text around it
print(parser.parse('ASDICA>31.01.2019END($>', fuzzy=True))
The output will be 2019-01-31 00:00:00
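Since the question asks for a regex and a list of dates, here is a minimal sketch using only the standard library. The pattern, the `find_dates` name, and the accepted separators (dot, slash, hyphen, space) are assumptions made for illustration, not part of the answer above. Candidates are validated with `datetime.strptime`, so an impossible date such as 31.04.2019 is dropped.

```python
import re
from datetime import datetime

# Assumed format: day, month, four-digit year, separated by ".", "/", "-" or a space.
# The lookarounds stop the match from starting or ending inside a longer run of digits.
DATE_RE = re.compile(r"(?<!\d)(\d{1,2})[./\- ](\d{1,2})[./\- ](\d{4})(?!\d)")


def find_dates(text):
    """Return every valid date found in text, normalised to DD/MM/YYYY."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        candidate = f"{int(day):02d}/{int(month):02d}/{year}"
        try:
            # strptime raises ValueError for impossible dates such as 31/04/2019.
            datetime.strptime(candidate, "%d/%m/%Y")
        except ValueError:
            continue
        found.append(candidate)
    return found


print(find_dates("ASDICA>31.01.2019END($>"))
```

Unlike the fuzzy parser, this returns all dates in the text rather than a single one, at the cost of only recognising the numeric day-month-year layout assumed here.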

Related

Conditional Regex for Percentage based values [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed last year.
I've never been very good at regex, but I really need to grab the percentage information from these log entries; however, the warn/critical message moves around depending on where the warning was located in either the In or the Out utilization. I just can't figure out the regex. Here are two example entries that show both in and out issues:
["XXXXXXX"], (up), MAC: XX:XX:XX:XX:XX:XX, Speed: 2 GBit/s, In: 0 Bit/s (0%), Out: 6.53 GBit/s (warn/crit at 1.6 GBit/s/1.8 GBit/s) (326.45%)(!!)
["XXXXXXX"], (up), MAC: XX:XX:XX:XX:XX:XX, Speed: 2 GBit/s, In: 0 Bit/s (warn/crit at 1.6 GBit/s/1.8 GBit/s) (95.45%), Out: 6.53 GBit/s (32.00%)(!!)
Ultimately I need to use capture groups to capture both the in and out utilization percentage. But every regex I try only finds a single percentage. Help on this would be greatly appreciated. Thanks in advance.
EDIT SHOWING EXPECTED RESULT:
For each line, the capture groups should identify the In and Out values so the program can see both utilizations. The program expects a key-value pair from every log entry, like the following:
IN:0% OUT:326.45%
IN:95.45% OUT:32.00%
Do you need something like this?
In:.+\(([0-9\.]+%)\).+Out:.+\(([0-9\.]+%)
If you just need to pull out the percentage values, this should help:
https://regex101.com/r/B9pZeO/1
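As a quick check, the pattern above can be run from Python against the two log entries in the question. This is only a sketch: the `format_utilization` helper is a hypothetical name, and the output format follows the expected result shown in the question.

```python
import re

# Pattern taken from the answer: group 1 is the In percentage, group 2 the Out percentage.
UTILIZATION_RE = re.compile(r"In:.+\(([0-9\.]+%)\).+Out:.+\(([0-9\.]+%)")


def format_utilization(line):
    """Return 'IN:<pct> OUT:<pct>' for a log entry, or None if it does not match."""
    match = UTILIZATION_RE.search(line)
    if match is None:
        return None
    return f"IN:{match.group(1)} OUT:{match.group(2)}"


LOG_LINES = [
    '["XXXXXXX"], (up), MAC: XX:XX:XX:XX:XX:XX, Speed: 2 GBit/s, In: 0 Bit/s (0%), '
    'Out: 6.53 GBit/s (warn/crit at 1.6 GBit/s/1.8 GBit/s) (326.45%)(!!)',
    '["XXXXXXX"], (up), MAC: XX:XX:XX:XX:XX:XX, Speed: 2 GBit/s, '
    'In: 0 Bit/s (warn/crit at 1.6 GBit/s/1.8 GBit/s) (95.45%), Out: 6.53 GBit/s (32.00%)(!!)',
]

for line in LOG_LINES:
    print(format_utilization(line))
```

The warn/crit parenthesis is skipped in both positions because it does not consist solely of digits, dots, and a percent sign.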

How to match regex pattern multiple times in Pyspark? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
The email data below is stored in a single column.
The requirement is to extract only the text from "Call Example" through "additional details".
Input:
Summary:
Below are the details:
Call Example:
dialFromNumber:***** dialToNumber:***** date:*** time:*** additional details:xxxx
Please check out the call details.
Second Call Example:
dialFromNumber:*****
dialToNumber:*****
date:***
time:***
additional details:xxxx
Some random text.
Output:
Both call examples need to be written to a new column 'Calldetails1', in two separate rows, using PySpark.
Call Example:
dialFromNumber:***** dialToNumber:***** date:*** time:*** additional details:xxxx
Call Example:
dialFromNumber:*****
dialToNumber:*****
date:***
time:***
additional details:xxxx
The regexp_extract call I used to extract from Call Example to additional details:
result = df.withColumn('result',regexp_extract('comments','(?s)(?=Call Example)(.?additional details:\s[\w+])',1))
It works for only one match. Please suggest a way to match globally in Python.
As mentioned in the chat:
(?=Call Example)([\w\s:\*]+?[\S])$
(?=Call Example) is a lookahead asserting that the match begins at the text Call Example.
[\w\s:\*]+? lazily matches one or more word characters, whitespace characters, colons, or asterisks, and [\S]$ then requires a non-whitespace character at the end of the line.
See also, on extracting multiple captured groups using PySpark:
https://stackoverflow.com/questions/58930893/extracting-several-regex-matches-in-pyspark
https://stackoverflow.com/questions/54597183/i-have-an-issue-with-regex-extract-with-multiple-matches
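To illustrate what matching "globally" means here, the following is a plain-Python sketch using `re.finditer`, not PySpark code. The pattern and the `extract_call_details` name are assumptions for illustration; the pattern differs from the one in the answer and simply runs from each "Call Example:" to the end of the line holding "additional details:". A function like this could in principle be wrapped in a UDF and the resulting list exploded into rows, but that step is not shown or tested here.

```python
import re

# Assumed pattern: from "Call Example:" lazily up to "additional details:",
# then the rest of that line. DOTALL lets "." cross line breaks.
CALL_RE = re.compile(r"Call Example:.*?additional details:[^\n]*", re.DOTALL)


def extract_call_details(comments):
    """Return every 'Call Example ... additional details' block in the text."""
    return [match.group(0) for match in CALL_RE.finditer(comments)]


SAMPLE = """Summary:
Below are the details:
Call Example:
dialFromNumber:***** dialToNumber:***** date:*** time:*** additional details:xxxx
Please check out the call details.
Second Call Example:
dialFromNumber:*****
dialToNumber:*****
date:***
time:***
additional details:xxxx
Some random text."""

for block in extract_call_details(SAMPLE):
    print(block)
    print("---")
```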

Replace the words "can't, don't" by "can not, do not" using python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I need to replace words like "{can't, don't, won't}" with "{can not, do not, will not}" using Python.
The problem is:
"can't" can be detected by checking for the suffix "n't", so we can replace "n't" with "not".
But how can we turn "ca" into "can", so that "can't" becomes "can not" after the split?
Since the rules of English are large and sometimes inconsistent, your best bet is probably just to set up full word maps rather than trying to figure out on the fly which letters are represented by the apostrophe.
In other words, a dictionary with values like:
can't -> can not
don't -> do not
won't -> will not
:
oughtn't -> ought not
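A word map like this might be applied in Python as sketched below. The `CONTRACTIONS` dictionary holds only the four entries listed above, and `expand_contractions` is a hypothetical helper name; a real map would need many more entries.

```python
import re

# Full-word map, limited to the entries shown in the answer.
CONTRACTIONS = {
    "can't": "can not",
    "don't": "do not",
    "won't": "will not",
    "oughtn't": "ought not",
}

# One alternation of all keys, bounded so that only whole words are replaced.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expansion (original case is not preserved)."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I can't and won't go, so don't ask."))
```

Because the lookup is by whole word, irregular forms such as "won't" need no special rule.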

Recognizing patterns given a set of sentences [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I have a text file with lots of sentences. These sentences can occur in patterns. How do I recognize these patterns?
For example:
i woke up in the morning
i went to school
i played football
i came back home
i woke up in the morning
i went to school
i played basketball
At this point I want the program to say that "I played football" should have appeared.
This task is a bit complicated, but you can try this simple code to get the idea and extend it if you find it useful:
// The input sentences, separated by periods
String sampleString1 = "xyz";
// split() takes a regular expression, so the period must be escaped
String[] sampleString2 = sampleString1.split("\\.");
// The pattern to match: the first 14 characters of the first sentence (or all of it if shorter)
String first = sampleString2[0].trim();
String prefix = first.substring(0, Math.min(14, first.length()));
// Start at 1 to skip the pattern sentence; stop before length to stay in bounds
for (int i = 1; i < sampleString2.length; i++) {
    if (sampleString2[i].trim().startsWith(prefix)) {
        // Code to execute for a sentence that matches the pattern
        System.out.println("Sentence matching with pattern :: " + sampleString2[i]);
    }
}
If the pattern to be matched is the first sentence of the sequence, try this code.
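For the example in the question, a different and very simple approach is to remember which sentence followed each sentence the first time it was seen, and to report any later occurrence where a different sentence follows. The sketch below assumes one sentence per list element; `find_deviations` is a hypothetical name, and this is not the method used in the answer above.

```python
def find_deviations(sentences):
    """Return (actual, expected) pairs where a sentence's follow-up differs from the first one seen."""
    expected_next = {}
    deviations = []
    for previous, current in zip(sentences, sentences[1:]):
        if previous in expected_next and expected_next[previous] != current:
            # A different follow-up was recorded earlier for this sentence.
            deviations.append((current, expected_next[previous]))
        else:
            # First time this sentence is seen: remember what came after it.
            expected_next.setdefault(previous, current)
    return deviations


SENTENCES = [
    "i woke up in the morning",
    "i went to school",
    "i played football",
    "i came back home",
    "i woke up in the morning",
    "i went to school",
    "i played basketball",
]

for actual, expected in find_deviations(SENTENCES):
    print(f'Saw "{actual}" but "{expected}" should have appeared')
```

This only looks one sentence back, so it will not handle patterns that depend on longer context.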

RegExp Remove content outside of commas [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Alright, so I have a database that returns information in this format:
ID, Display name, Likes Cake, Likes Coffee, Likes Dogs
So a record would look something like this:
1,anonymous,1,0,1
Now, it's not very popular, so I would like to show the people who have answered. That means I want everything outside the !'s in "1,!anonymous!,1,0,1" gone. I looked around and found a RegExp that removes everything outside quotes, but putting all the display names in quotes is a lot of work and I'm rather impatient.
So a RegExp that erases the numbers, so I could put the usernames up, would be delicious.
Well, you could do something like this:
Replace '^[^,]+,([^,]+).*' with '$1'
How it looks exactly in your language may vary, of course.
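In Python, for instance, a version of this replacement might look like the sketch below. It writes out the comma between the first two fields explicitly and uses `\1` where other languages use `$1`; the `second_field` name is hypothetical.

```python
import re


def second_field(line):
    """Keep only the second comma-separated field of a line."""
    # Skip the first field and its comma, capture the second field, drop the rest.
    return re.sub(r"^[^,]+,([^,]+).*", r"\1", line)


print(second_field("1,anonymous,1,0,1"))
```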
But in your case this looks like CSV, so isn't parsing the CSV file easier in that case? E.g. in PowerShell you could do
Import-Csv foo.csv | select 'Display name'
and likewise for other languages that have such parsing built in. Besides, most other options may break depending on the input, because fields in CSV may contain commas too, which breaks both the regex above and a naïve splitting method.
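The same idea in Python, as a minimal sketch with the standard `csv` module. The `display_names` helper and the assumption that the display name is the second column (index 1) are illustrative; the second sample row is made up to show a quoted field containing a comma.

```python
import csv
import io


def display_names(csv_text, column=1):
    """Return the value of one column from every row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip rows that are too short to contain the requested column.
    return [row[column] for row in reader if len(row) > column]


SAMPLE = '1,anonymous,1,0,1\n2,"Doe, Jane",0,1,1\n'
print(display_names(SAMPLE))
```

A plain split on commas would cut the quoted name in the second row in two, which is the failure mode described above.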
You can split the database result string and then get the relevant array index.
string dbString = "1,anonymous,1,0,1";
string username = dbString.Split(',')[1];
//value of username will be "anonymous"