I have been struggling to write a regex that extracts the information below, which is divided into 3 parts by the ",". So far only the first and second parts (Friday and the date) have succeeded.
Friday, 26 Apr 2013, 18:30
I hope someone has the experience.
Best regards
Why not simply split the string and trim the excess whitespace of the individual parts? For example, verbosely written in C#:
string input = "Friday, 26 Apr 2013, 18:30";
string[] parts = input.Split(',');
for (int i = 0; i < parts.Length; i++)
{
parts[i] = parts[i].Trim();
}
Console.WriteLine(parts[0]); // "Friday"
Console.WriteLine(parts[1]); // "26 Apr 2013"
Console.WriteLine(parts[2]); // "18:30"
If you really want to use a regular expression for this, ^(.*),(.*),(.*)$ should work:
string input = "Friday, 26 Apr 2013, 18:30";
Regex regex = new Regex("^(.*),(.*),(.*)$", RegexOptions.Singleline);
Match match = regex.Match(input);
Console.WriteLine(match.Groups[1].Value.Trim()); // "Friday"
Console.WriteLine(match.Groups[2].Value.Trim()); // "26 Apr 2013"
Console.WriteLine(match.Groups[3].Value.Trim()); // "18:30"
Adding appropriate error checking is left as an exercise for the reader.
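For comparison, here is a minimal sketch of the same split-and-trim idea in Python, with one possible form of the error checking mentioned above. The function name split_date_parts and the choice to raise ValueError are my own assumptions, not part of the original answer.

```python
def split_date_parts(text):
    """Split 'weekday, date, time' on commas and trim each part."""
    parts = [part.strip() for part in text.split(",")]
    # Hypothetical error check: insist on exactly three parts.
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got {len(parts)}")
    return parts


print(split_date_parts("Friday, 26 Apr 2013, 18:30"))
```

Input with more or fewer commas raises instead of silently returning the wrong number of fields.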
The following regex matches this whole part:
, 18:30
I hope someone has the experience.
Best regards
,+\s[0-9]+:[0-9]+ \r*.*
But yeah, that's kind of ultra specific to this ", Hour:Minutes [...]" format. You should do a split if you're using PHP, or the equivalent in your language.
I think what you really want is something like this:
from datetime import datetime
s="Friday, 26 Apr 2013, 18:30"
d=datetime.strptime(s, "%A, %d %b %Y, %H:%M")
d
Out[7]: datetime.datetime(2013, 4, 26, 18, 30)
See the strptime and date format docs for details :)
Edit: sorry, I was somehow assuming you were using Python. Other languages have similar idioms though, e.g. PHP's date_parse, C#'s DateTime.Parse, etc.
You didn't specify a language so I'm going to answer this with a standard REGEX approach.
(?<=(^|,\s+)).+?(?=(,|$)) will work for you.
Let me break down what it's doing.
(?<=(^|,\s+)) - Look behind for the start of the string or a comma followed by whitespace, but don't include it in the match. All matches must have this in front of them.
.+? - Grab all characters, but don't be greedy.
(?=(,|$)) - Look ahead for the end of the string or a comma, again without including it in the match. All matches must be followed by this.
When run on your test case of Friday, 26 Apr 2013, 18:30, I get 3 matches:
Friday
26 Apr 2013
18:30
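One caveat if you try this in Python: the built-in re module rejects variable-width lookbehind such as (?<=(^|,\s+)). The sketch below is an adaptation, not the pattern above: it moves the leading delimiter into a non-capturing group and captures the part itself. The names PART_RE and comma_parts are made up for illustration.

```python
import re

# The leading delimiter is consumed by a non-capturing group instead of a
# lookbehind; the trailing delimiter is still checked with a lookahead.
PART_RE = re.compile(r"(?:^|,\s+)(.+?)(?=,|$)")


def comma_parts(text):
    """Return the captured parts between commas."""
    return PART_RE.findall(text)


print(comma_parts("Friday, 26 Apr 2013, 18:30"))
```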
Like m01's answer, you could try this approach with C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
namespace TestDate
{
class Program
{
static void Main(string[] args)
{
string dateString = "Friday, 26 Apr 2013, 18:30"; // Modified from MSDN
string format = "dddd, dd MMM yyyy, HH:mm";
DateTime dateTime = DateTime.ParseExact(dateString, format, CultureInfo.InvariantCulture);
Console.WriteLine(dateTime);
Console.Read();
}
}
}
This prints the date and time in the format configured on the user's machine. For me it printed out 4/26/2013 6:30:00 PM.
In Google Sheets I want to reformat this datetime Mon, 08 Mar 2021 10:57:15 GMT into this 08/03/2021.
Using RegEx I achieve the goal with
=to_date(datevalue(REGEXEXTRACT("Mon, 08 Mar 2021 10:57:15 GMT","\b[0-9]{2}\s\D{3}\s[0-9]{4}\b")))
But how can I do it without RegEx? This datetime format seems to be a classic one. Can it really be that no built-in formula can do it? I rather think I am missing the right knowledge here...
Please try the following formula and format as date
=TRIM(LEFT(INDEX(SPLIT(K13,","),,2),12))*1
(do adjust according to your locale)
Another option is to use Custom Script.
Example:
Code:
function formatDate(date) {
return Utilities.formatDate(new Date(date), "GMT", "dd/MM/yyyy"); // lowercase yyyy = calendar year; YYYY is the week-based year
}
Formula in B1: =formatDate(A1)
Reference:
Custom Functions in Google Sheets
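If the data can be pre-processed outside of Sheets, the same conversion can be sketched in Python. This is only an illustration of the date handling, not a Sheets formula; the function name to_day_month_year is an assumption. It relies on the standard library's email.utils.parsedate_to_datetime, which understands this "Mon, 08 Mar 2021 10:57:15 GMT" style of date.

```python
from email.utils import parsedate_to_datetime


def to_day_month_year(http_date):
    """Convert a date like 'Mon, 08 Mar 2021 10:57:15 GMT' to dd/mm/yyyy."""
    return parsedate_to_datetime(http_date).strftime("%d/%m/%Y")


print(to_day_month_year("Mon, 08 Mar 2021 10:57:15 GMT"))  # 08/03/2021
```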
Here is what I need to do (for clarity):
Take a PDF file (link at the bottom).
Then parse only the information under each header into a DataGridView.
I couldn't think of a way to do this (seeing as there is no native way to handle PDFs),
so my only thought was to convert it to a txt document, then (somehow) take the text from that document and put it into the DataGridView.
So, using iTextSharp, I first convert the PDF to a text file, which keeps "most" of its formatting (see link below).
This is the source for that:
Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
Dim mPageCount As Integer = mPDFreader.NumberOfPages()
Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
'Create the text file.
Dim fs As FileStream = File.Create(mTXT)
Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
For i As Integer = 1 To mPageCount
strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
fs.Write(info, 0, info.Length)
Next
fs.Close()
However, I only need the "lines" of information. So everything should look like this:
63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS
In order to do that, I needed to use RegEx to remove everything I didn't want.
The RegEx is
(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)
Here is the code I used.
Private Sub Fixtext()
Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
While (True)
Dim line As String = reader.ReadLine()
If line = Nothing Then
Return
End If
Dim match As Match = regex.Match(line)
If match.Success Then
Dim value As String = match.Groups(1).Value
Console.WriteLine(line)
End If
End While
End Using
End Sub
The results are "close" but not exactly the way I need them. In some cases records are "crammed" together and there are still parts left behind. An example would be:
90 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS
491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS
Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS
493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS
The format I actually need is (again) one I can use to import the data into a DataGridView later,
so each line needs to be
[number][ID][ID2][Date][Notes]
[number][ID][ID2][Date][Notes]
[number][ID][ID2][Date][Notes]
[number][ID][ID2][Date][Notes]
Using this "concept", this is an example of what I need (I know this doesn't work, but I want something along these lines that does):
Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
While (True)
Dim line As String = reader.ReadLine()
If line = Nothing Then
Return
End If
Dim match As Match = regex.Match(line)
If match.Success Then
Dim value As String = match.Groups(1).Value
Dim s As String = value
s = s.Replace(" Tracking Id Forms Required Order Id RTS done on Notes", Nothing)
s = s.Replace("EXPRESS ", "EXPRESS")
s = s.Replace("EXPRESS", "EXPRESS" & vbCrLf)
Console.WriteLine(line)
End If
End While
End Using
Here is a "brief" explanation with files included.
Copy of the original PDF (this is the PDF being converted to .txt using iText).
I am only doing this because I can't think of another way (outside of paying for a 3rd party tool to convert a PDF to XLS).
https://drive.google.com/file/d/1iHMM_G4UBUlKaa44-Wb00F_9ZdG-vYpM/view?usp=sharing
Using the "iText method" I mentioned above, this is the converted output file:
https://drive.google.com/file/d/10dgJDFW5XlhsB0_0QAWQvtimsDoMllx-/view?usp=sharing
I then use the Regex mentioned above to parse out what I don't need.
However, it isn't working.
So my questions are (for "clarity"):
1. Is this the only or best method to do what I need done (convert the PDF to text, remove what I don't need, then put that information into a DataGridView)? Or is there another, cleaner, better method?
2. (If not 1) How can I make this work? Is something wrong with my RegEx or my logic? Am I missing something better/cleaner that someone can help me see?
3. (If 2, not 1) What is the best way to take the results and place them in the proper DataGridView column?
Final statement: It doesn't have to be this method. I will take "ANY" method that lets me do what I need; the cleaner the better. However, I have to avoid 3rd party libraries that are free with limitations, as well as paid 3rd party libraries. That leaves me with limited options (e.g. PDFBox, iText, iTextSharp). And it has to get me from a PDF (like the sample above) to that table information in a DataGridView or even a ListView.
I will take any help and I am more than appreciative. Also, I re-asked this question because a mod closed my original question, stating it wasn't clear what I needed. I did try in both cases to make the question as "thorough" as possible, but I do hope this is "clearer" so it doesn't get closed abruptly.
I cheated a bit by correcting the text file. It goes a little wonky at page breaks and misses starting a new line. Perhaps you can correct that with iTextSharp or the hard-to-maintain regex.
I made a class to hold the data. The property names become the column headers in the DataGridView.
I read all the lines in the text file into an array. I checked whether the first character of each line was a digit, then split the line into another array on the space character. Next I created a new Tracking object, filling in all its properties with the parameterized constructor.
Finally, I checked if the line contained a comma and added that bit of text to the Notes property. The completed object is added to the list.
After the loop, the list is bound to the grid.
Public Class Tracking
Public Property Number As Integer
Public Property ID As String
Public Property ID2 As String
Public Property TrackDate As Date
Public Property Notes As String
Public Sub New(TNumber As Integer, TID As String, TID2 As String, TDate As DateTime, TNotes As String)
Number = TNumber
ID = TID
ID2 = TID2
TrackDate = TDate
Notes = TNotes
End Sub
End Class
Private Sub OPCode()
Dim lst As New List(Of Tracking)
Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
For Each line In lines
If Char.IsDigit(line(0)) Then
Dim parts = line.Split(" "c)
Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.CurrentCulture), parts(8))
If line.Contains(",") Then
T.Notes &= line.Substring(line.IndexOf(","))
End If
lst.Add(T)
End If
Next
DataGridView1.DataSource = lst
End Sub
EDIT
To pinpoint the error let's try...
Private Sub OPCode()
Dim lst As New List(Of Tracking)
Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
For Each line In lines
If Char.IsDigit(line(0)) Then
Dim parts = line.Split(" "c)
If parts.Length < 9 Then
Debug.Print(line)
MessageBox.Show($"We have a line that does not include all fields.")
Exit Sub
End If
Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.CurrentCulture), parts(8))
If line.Contains(",") Then
T.Notes &= line.Substring(line.IndexOf(","))
End If
lst.Add(T)
End If
Next
DataGridView1.DataSource = lst
End Sub
Try this regex and see if this works according to your requirement:
\b[0-9].*(FMPC|OD).*(EXPRESS|Replacement\sOrder)\b
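For the "crammed" lines in the question, scanning the whole text for records instead of matching line by line may help. Below is a rough sketch of that idea in Python (not VB.NET). The field shapes (FMP + one letter + 10 digits, OD + 18 digits, an optional ", Replacement order" note) are assumptions inferred from the sample lines, and the names RECORD_RE, Tracking and parse_records are made up for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Field shapes are guessed from the sample lines, not taken from a spec.
RECORD_RE = re.compile(
    r"(\d+)\s+"                    # running number
    r"(FMP[A-Z]\d{10})\s+"         # tracking ID, e.g. FMPC0847535411
    r"(OD\d{18})\s+"               # order ID, e.g. OD119523523152105000
    r"([A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2} [AP]M)\s+"  # date and time
    r"(EXPRESS(?:,\s*Replacement order)?)"                   # notes
)


@dataclass
class Tracking:
    number: int
    tracking_id: str
    order_id: str
    track_date: datetime
    notes: str


def parse_records(text):
    """Find every record in the text, even when records share a line."""
    records = []
    for match in RECORD_RE.finditer(text):
        number, tracking_id, order_id, stamp, notes = match.groups()
        # %b and %p depend on the locale; this assumes English month names.
        when = datetime.strptime(stamp, "%b %d, %Y %I:%M %p")
        records.append(Tracking(int(number), tracking_id, order_id, when, notes))
    return records
```

Because finditer searches the whole text, leftover header fragments and missing line breaks between records are simply skipped over rather than breaking the parse.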
Imagine having 500 strings like this with different dates:
The certificate has expired on 02/05/2014 15:43:01 UTC.
Given that this is a string and I am using PowerShell, I need to treat the date (02/05/2014) as an object so I can use operators (-lt, -gt).
Is RegEx the only way of doing this? And in that case, can anyone help me find the first 6 numbers (which change every time) using RegEx?
>$regexStr = "(?<date>\d{2}\/\d{2}\/\d{4})"
>$testStr = "The certificate has expired on 02/05/2014 15:43:01 UTC."
>$testStr -match $regexStr
# $Matches will contain the regex group called "date"
>$Matches.date
02/05/2014
>$date = Get-Date ($Matches.date)
>$date
Wednesday, February 5, 2014 12:00:00 AM
If you need to parse the date string with another format you can do:
>$dateObj = [datetime]::ParseExact($Matches.date,"dd/MM/yyyy",$null)
>$dateObj.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True DateTime System.ValueType
Hope that helps
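The same two steps (pull the date out with a named group, then parse it with an explicit format) can be sketched in Python for comparison. The function name extract_expiry is hypothetical, and the day-first default is an assumption: 02/05/2014 is ambiguous, so pass the format that matches your data.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(?P<date>\d{2}/\d{2}/\d{4})")


def extract_expiry(message, fmt="%d/%m/%Y"):
    """Return the first dd/mm/yyyy-looking date in message as a datetime, or None."""
    match = DATE_RE.search(message)
    if match is None:
        return None
    return datetime.strptime(match.group("date"), fmt)


expiry = extract_expiry("The certificate has expired on 02/05/2014 15:43:01 UTC.")
print(expiry < datetime(2015, 1, 1))  # datetime objects compare directly
```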
I want to separate the time and date from this string using REGEX because I feel it is the only way I can separate them. But I am not really familiar with how to do it; maybe someone can help me out here.
The original string: Your item was delivered in or at the mailbox at 3:34 pm on September 1, 2016 in TEXAS, MT 59102
The output I want to achieve/populate:
lv_time = 3:34 pm
lv_date = September 1, 2016
With the code I tried, I am only able to cut it like this:
lv_status = Your item was delivered in or at the mailbox at
lv_time = 3
lv_date = :34 pm on September 1, 2016 in TEXAS, MT 59102.
Here's the code I have so far:
DATA: lv_status TYPE string,
lv_time TYPE string,
lv_date TYPE string,
lv_off TYPE i.
lv_status = 'Your item was delivered in or at the mailbox at 3:34 pm on September 1, 2016 in TEXAS, MT 59102.'.
FIND REGEX '(\d+)\s*(.*)' IN lv_status SUBMATCHES lv_time lv_date MATCH OFFSET lv_off.
lv_status = lv_status(lv_off).
You asked for it, here it comes:
\b((1[0-2]|0?[1-9]):([0-5][0-9]) ([AaPp][Mm])) on (January|February|March|April|May|June|July|August|September|October|November|December)\D?(\d{1,2}\D?)?\D?((?:19[7-9]\d|20\d{2})|\d{2})
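This accepts times in 12-hour hh:mm am/pm format, and dates with a full month name (January to December), an optional day, and either a four-digit year from 1970 to 2099 or a two-digit year.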
Each part is captured in its own group.
The demo shows a version that allows abbreviated month names:
Demo
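If the text always has the shape "at <time> on <Month> <day>, <year>", a much looser pattern may be enough. Here is a small sketch in Python rather than ABAP; the names STAMP_RE and time_and_date are made up, and the pattern is deliberately simpler than the one above (it does not validate hours, minutes or month names).

```python
import re

# Group 1: time such as "3:34 pm"; group 2: date such as "September 1, 2016".
STAMP_RE = re.compile(r"(\d{1,2}:\d{2} [ap]m) on ([A-Z][a-z]+ \d{1,2}, \d{4})")


def time_and_date(status):
    """Return (time, date) found in a delivery status message, or None."""
    match = STAMP_RE.search(status)
    if match is None:
        return None
    return match.group(1), match.group(2)


status = ("Your item was delivered in or at the mailbox at 3:34 pm "
          "on September 1, 2016 in TEXAS, MT 59102.")
print(time_and_date(status))
```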
The query
SELECT REGEXP_SUBSTR('Outstanding Trade Ticket Report_08 Apr 14.xlsx', '\_(.*)\.') AS FILE_DATE FROM DUAL
gives the OUTPUT:
_08 Apr 14.
Please advise on the correct regex for getting the date without the surrounding characters.
I can use RTRIM and LTRIM but want to try it using regex.
You can use:
SELECT REGEXP_SUBSTR('Outstanding Trade Ticket Report_08 Apr 14.xlsx', '\_(.*)\.',
1, 1, NULL, 1) from dual
The last argument is used to determine which matched group to return.
Link to Fiddler
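The idea of returning only the captured group rather than the whole match carries over to other regex engines. As an illustration in Python (the function name file_date is an assumption), group(1) plays the role of the last REGEXP_SUBSTR argument:

```python
import re


def file_date(filename):
    """Return the text between the underscore and the last dot, or None."""
    match = re.search(r"_(.*)\.", filename)
    # group(0) would be the whole match including "_" and "."; group(1) is
    # only the part inside the parentheses.
    return match.group(1) if match else None


print(file_date("Outstanding Trade Ticket Report_08 Apr 14.xlsx"))
```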