grep first n lines only

grep first n lines only - regex

I'm facing a problem greping the right date within a letter as a document.
Reason is to grep the date of document creation and not any further date within the text.
Usaly the dokument hold information about the company, my address, customer number, bill number....
and the date by when it was created.
Mayby a greeting and/or text maybe within dates again.
Often the date at begin of the document has different look as following.
December 1999 instead of 3.12.1999 as example.
If I grep the date in case of pattern
'(([0-9][0-9]{,1}\.)\s+('Januar'|'Februar'|'März'|'April'|'Mai'|'Juni'|'Juli'|'August'|'September'|'Oktober'|'November'|'Dezember')\s+([1-9][0-9][0-9][0-9]{1,}))'
sometimes get the wrong date as creation date. Reason is the different writing of dates in the documents.
Example 1 is what I usualy get and it works fine as I search for the date (creation date) with correct pattern.
Example 2 is in problem as I get a date, but it's NOT creation date which would be the 1st date. I get instead another date matching the pattern out from the text.
Example 1
Example 2
I could use different pattern '(([0-9][0-9]{,1}\.)([0-9][0-9]{,1}\.)([1-9][0-9][0-9][0-9]{1,}))' grepping the correct date in example 2 but then I would get same issue for example 1.
My idea was to search in first n lines only if pattern match take the date otherwise use different pattern.
I don't get the rule for pdfgrep using the first n lines only what would give me the possibility to use different pattern.
Has anybody an idea how to fix it?
Cheers, bdream

With GNU grep:
-m NUM: Stop reading a file after NUM matching lines.

Alternatively to GNU grep learn to use GNU gawk, specifically designed for such tasks.
Consider also learning python or GNU guile (then read SICP).

Related

Issues while processing zeroes found in CSV input file with Perl

Friends:
I have to process a CSV file, using Perl language and produce an Excel as output, using the Excel::Writer::XSLX module. This is not a homework but a real life problem, where I cannot download whichever Perl version (actually, I need to use Perl 5.6), or whichever Perl module (I have a limited set of them). My OS is UNIX. I can also use (embedding in Perl) ksh and csh (with some limitation, as I have found so far). Please, limit your answers to the tools I have available. Thanks in advance!
Even though I am not a Perl developer, but coming from other languages, I have already done my work. However, the customer is asking for extra processing where I am getting stuck on.
1) The stones in the road I found are coming from two sides: from Perl and from Excel particular styles of processing data. I already found a workaround to handle the Excel, but -as mentioned in the subject- I have difficulties while processing zeroes found in CSV input file. To handle the Excel, I am using the '0 way which is the final way for data representation that Excel seems to have while using the # formatting style.
2) Scenario:
I need to catch standalone zeroes which might be present in whichever line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question to avoid loosing your valuable time. I am providing more details after my question:
Research and question:
I tried to use Perl regex to find standalone "0" and replace them by whichever string, planning to replace them back to "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv`
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
Are working; but only from command line. They aren't working when I call them from the Perl script as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
Do not know why... I have already tried using exec and eval, instead of system, with the same results.
Note that I have a ton of regex that work perfectly with the same structure, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried using backticks and qx//, without success. Note that qx// and backticks have not the same behavior, since qx// is complaining about the boundaries \b because of the forward slash.
I have tried using sed -i, but my System is rejecting -i as invalid flag (do not know if this happens in all UNIX, but at least happens in the one at work. However is accepting perl -i).
I have tried embedding awk (which is working from command line), in this way:
system `awk -F ',' -v OFS=',' '$1 == \"0\" { $1 = "string" }1' myfile.csv > myfile_copy.csv
But this works only for the first column (in command line) and, other than having the disadvantage of having extra copy file, Perl is complaining for > redirection, assuming it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk is working from command line, but only 5 columns. But not in Perl using #.
All the combinations of exec and eval have also been tested without success.
I have also tried passing to system each one of the awk components, as arguments, separated by commas, but did not find any valid way to pass the redirector (>), since Perl is rejecting it because of the mentioned reason.
Using another approach, I noticed that the "standalone zeroes" seem to be "swallowed" by the Text::CSV module, thus, I get rid off it, and turned back to a traditional looping in csv line by line and a spliter for commas, preserving the zeroes in that way. However I found the "mystery" of isdual in Perl, and because of the limitation of modules I have, I cannot use the Dumper. Then, I also explored the guts of binaries in Perl and tried the $x ^ $x, which was deprecated since version 5.22 but valid till that version (I said mine is 5.6). This is useful to catch numbers vs strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) does not returns TRUE when $x = 0. [UPDATE: I tried this in a devoted Perl script, just for this purpose, and it is working. I believe that my probable wrong conclusion ("not returning TRUE") was obtained when I did not still realize that Text::CSV was swallowing my zeroes. Doing new tests...].
I will appreciate very much your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database which is handover to me and I pickup programmatically from a folder. Dynamic means that it might have whichever amount of tables, whichever amount of columns in each table, whichever names as column headers, whichever amount of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So, I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. Fist row is header.
b) I call "standalone zeroes" those "clean" zeroes which are coming in the CSV file, from second row onwards, between commas, like 0, (if the case is the first position in the row) or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row: 0,J5,alfa,0, which in this particular case, are "words" or "strings". In this case, 4 names (note that two of them are zeroes, which required to be treated as strings). Thus, we have a 4 names-columns example (Alfa,Alfa1,Beta,Gamma are headers for those columns, but only in this scenario). From that point onwards, in the second row, you can see floating point (*.00) numbers and, among them, you can see 2 zeroes, which are numbers. Finally, in the third line, you can read M1,0,X888,Z, which are the names for the first 4 columns. Note, please, that the 4th column in the second row has 0 as name, while the 4th column in the third row has ZZ as name.
Summary: as a general picture, I have a table-report divided in 2 parts, from left to right: 4 columns for names, and 8 columns for numbers.
Always the first M columns are names and the last N columns are numbers.
- It is unknown which number is M: which amount of columns devoted for words / strings I will receive.
- It is unknown which number is N: which amount of columns devoted for numbers I will receive.
- It is KNOWN that, after the M amount of columns ends, always starts N, and this is constant for all the rows.

I have done a quick research on Perl boundaries for regex ( \b ), and I have not found any relevant information regarding if it applies or not in Perl 5.6.
However, since you are using and old Perl version, try the traditional UNIX / Linux style (I mean, what Perl inherits from Shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the work doing the change at the start of the each line in your CSV file, if matches.
Or, maybe better (if you have those "standalone" zeroes, and want avoid any unwanted change in some "leading zeroes" string):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added the comma, after the zero; and, of course, after the string].
Note that the first regex should work; the second one is just a "caveat", to be cautious.

Extract all data in between two double quotes

I'm trying to use a powershell regex to pull version data from the AssemblyInfo.cs file. The regex below is my best attempt, however it only pulls the string [assembly: AssemblyVersion(". I've put this regex into a couple web regex testers and it LOOKS like it's doing what I want, however this is my first crack at using a powershell regex so I could be looking at it wrong.
$s = '[assembly: AssemblyVersion("1.0.0.0")]'
$prog = [regex]::match($s, '([^"]+)"').Groups[1].Value

You also need to include the starting double quotes otherwise it would start capturing from the start until the first " is reached.
$prog = [regex]::match($s, '"([^"]+)"').Groups[1].Value
^

Try this regex "([^"]+)"
Regex101 Demo

Regular expressions can get hard to read, so best practice is to make them as simple as they can be while still solving all possible cases you might see. You are trying to retrieve the only numerical sequence in the entire string, so we should look for that and bypass using groups.
$s = '[assembly: AssemblyVersion("1.0.0.0")]'
$prog = [regex]::match($s, '[\d\.]+').Value
$prog
1.0.0.0

For the generic solution of data between double quotes, the other answers are great. If I were parsing AssemblyInfo.cs for the version string however, I would be more explicit.
$versionString = [regex]::match($s, 'AssemblyVersion.*([0-9].[0-9].[0-9].[0-9])').Groups[1].Value
$version = [version]$versionString
$versionString
1.0.0.0
$version
Major Minor Build Revision
----- ----- ----- --------
1 0 0 0
Update/Edit:
Related to parsing the version (again, if this is not a generic question about parsing text between double quotes) is that I would not actually have a version in the format of M.m.b.r in my file because I have always found that Major.minor are enough, and by using a format like 1.2.* gives you some extra information without any effort.
See Compile date and time and Can I generate the compile date in my C# code to determine the expiry for a demo version?.
When using a * for the third and fourth part of the assembly version, then these two parts are set automatically at compile time to the following values:
third part is the number of days since 2000-01-01
fourth part is the number of seconds since midnight divided by two (although some MSDN pages say it is a random number)
Something to think about I guess in the larger picture of versions, requiring 1.2.*, allowing 1.2, or 1.2.3, or only accepting 1.2.3.4, etc.

Using RegEx to Find a Block of Text

I'm attempting to block a long string of unnecessary text that's on every page of a document.
Ex: "36075 This is another page and this is the date March 4 2013"
I know this must be very simple, but I'm hoping there is a way to block text verbatim. Is the only way to block this text by using a lot of /d/s/w+/+ etc or is there is a way to say, "match 36075 This is another page and this is the date March 4 2013".
This would be SO HELPFUL to know. Thank you for helping!

From what you wrote I assume you need to get leading numbers from string, to do it you just need to use this pattern: ^\d+ which from this input:
36075 This is another page and this is the date March 4 2013
will return this:
36075
For future, in case of such questions please provide example string and expected output. As well as what you have tried.

I realized the issue I was having. I didn't need to use RegEx. The program I was using has the functionality to match specific words or groups of words and pronounce them differently. What I discovered is that it will not match the words unless the word groups are input exactly the way the program typically reads them.
Ergo --> The channel saw
the end of the British hold over
Would have to be listed as one group for, "The channel saw" and a second group for "the end of the British hold over"
In addition, there were some numbers --> 11960_30_o_ho_
and if the program naturally read 119 and then 60_3 and then _o_ho_ then three strings would need to be input for each section.
A few frustrating hours later, problem solved :) Thank you for your assistance.

RegEx Date Validation for M/D/YYYY and MM/DD/YYYY (InfoPath)

I'm looking for some RegEx for a custom pattern validation for a date field in InfoPath 2010. The accepted date format is m/d/yyyy or mm/dd/yyyy.
Attempt 1: (\d{1,2})/(\d{1,2})/(\d{4})
Attempt 2: (0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)
Had better luck with attempt 1, and not much at all with attempt 2.

I've been having some date and time validation issues with InfoPath 2010 and regex pattern matching can be a useful approach. A basic regex for validating m/d/yyyy (without catering for the specific days in a month and allowing for '0' prefix to month or day) would be something like the following (untested):
(0?[1-9]|1[012])\/(0?[1-9]|[12][0-9]|3[01])\/\d{4}
For something more sophisticated you could have a look at this SO answer.
However, in InfoPath the format of the date displayed can be completely different to the internal format and it is this internal format that your regex needs to match. If you drop a calculated field on your form and set it to the date field you want to validate you will see something like:
2013-05-08T12:13:14
So a regular expression (again ignoring specific days per month) required to validate the date component of this is:
\d{4}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])
But this won't match against the example date because it doesn't account for the time portion following the "T". So the trick is to use an expression to perform the match against the date substring only, e.g. in my case the following works:
not(xdUtil:Match(substring-before(dfs:dataFields/my:SharePointListItem_RW/my:DateCreated, "T"), "\d{4}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])"))

I tried the following and it worked:
\d{4}-\d{1,2}-\d{1,2}
As David pointed out the internal format might be different than the one displayed because when I tried \d\d/\d\d/\d\d\d\d it didn't work even though it caters to the displayed format of the date.

I had the same problem.
I used a rule on the date field to set another hidden text field to
string(datefield).
That always came out YYYY-MM-DD which is not too hard to create a regex against. I used this one.
((19|20)\d\d)-(0?[1-9]|1[012])-(0?[1-9]|[12][0-9]|3[01])
Remember that it has to be an XML Regex which has some restrictions.
Then I set another rule on the hidden field to set a Boolean IsDateValid.

Using regex to eliminate chunks in a file (categorized events in iCal file)

I have one .ics file from which I would like to create individual new .ics files depending on the event categories (I can't get egroupware to export only events of one category, I want to create new calendars depending on category). My intended approach is to repeatedly eliminate all events but those of one category and then save the file using EditPad Lite 7 (Windows).
I am struggling to get the regular expression right. .+? is still too greedy and negating the string (e.g. to eliminate all but events from one category) doesn't work either.
Sample
BEGIN:VEVENT
DESCRIPTION:Event 2
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 3
CATEGORIES:Sports
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 4
END:VEVENT
The regex BEGIN:VEVENT.+?CATEGORIES:Sports.+?END:VEVENT should only match sports events but it catches everything from the first BEGINto the first ENDfollowing the category.
Edit: negating doesn't work either: BEGIN:VEVENT.+?((?!CATEGORIES:Sports).).+?END:VEVENT.
What am I missing? Any pointers are highly appreciated.

I guess newlines are removed or ignored, because your regex does not care about them.
I only have a correction to the match after CATEGORIES
BEGIN:VEVENT.+?CATEGORIES:Sports.*?END:VEVENT
^
Zero or more
The first part of your regex looks good, maybe the regex engine in EditPad is not so good.
Try it with a different editor or scripting language (like Eclipse or perl or Notepad+ or Notepad2)
You could split the input and then grep the matching Sports events
#sportevents = grep /Sports/, split /END:VEVENT/, $input
map $_.="END:VEVENT", #sportevents
This was perl, maybe you can launch a script from EditPad to do it.
The second line just restores the END:VEVENT that was stripped during split.

OK. Solved it. I found something here which can be used to split ics files. I tweaked it to use the category rather than the summary in the file name and then merged the individually generated files according to category. I added the usual ics header and footer to all files and, voilà, I had individual calendar files.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js