Find a specific date match using a regular expression in shell?

I have a bunch of file names in one consolidated file (main_file). The names follow different naming conventions, but they all have one thing in common: a date in the format produced by date +%Y%m%d (for example, 20151202), which appears somewhere in the middle of the name.
The contents of main_file look like this:
DTC_by_PV_201511220000_raw_out.snappy
Belle_Tire_201511230000_raw_out.snappy
Goodyear_Tire_201511220200_raw_out.snappy
Sams_Club_201511230000_raw_out.snappy
eTire_All_201511230200_raw_out.snappy
I want to figure out a regular expression that I can use in a shell script to read main_file and generate a separate file for each date found in it.
In this case we should get 2 files:
1. for date 20151122, containing:
DTC_by_PV_201511220000_raw_out.snappy
Goodyear_Tire_201511220200_raw_out.snappy
2. for date 20151123, containing:
Belle_Tire_201511230000_raw_out.snappy
Sams_Club_201511230000_raw_out.snappy
eTire_All_201511230200_raw_out.snappy
Note: the file names contain the date followed by the hour and minutes (for example, in 201511230200, 20151123 is the date and 0200 is 2 am).

An awk one-liner:
awk -F_ '{i=substr($(NF-2),1,8);dates[i]=dates[i] $0 "\n"}END{for(d in dates)printf "%s", dates[d] > d}' main_file
This creates one file per date, named after the date, containing only the lines with that date.
In expanded form, the first action runs on every line:
{
i=substr($(NF-2),1,8);
dates[i]=dates[i] $0 "\n"
}
It finds the date part of the line, which is two fields back from the last field when the field separator is an underscore. It keeps only the date (not the time) by cutting the first 8 characters with substr. Then it appends the whole line to the array element for that date.
At the end, for each date found, it writes the collected lines to a file named after the date. printf is used rather than print because each entry already ends with a newline, and print would add a blank line at the end of every file.
END {
for(d in dates)
printf "%s", dates[d] > d
}
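Since the question asks for a regular expression, here is a minimal sketch that locates the date by pattern instead of by field position. It assumes the timestamp is always exactly 12 digits delimited by underscores and that sed supports -E (extended regular expressions). The function name split_by_date is hypothetical.

```shell
# Hypothetical helper: split_by_date FILE
# Appends each line of FILE to a file named after the 8-digit date found
# in an underscore-delimited 12-digit timestamp (YYYYMMDDhhmm).
split_by_date() {
  while IFS= read -r line; do
    # Capture the first 8 digits of the timestamp; prints nothing if absent.
    d=$(printf '%s\n' "$line" | sed -nE 's/.*_([0-9]{8})[0-9]{4}_.*/\1/p')
    if [ -n "$d" ]; then
      printf '%s\n' "$line" >> "$d"
    fi
  done < "$1"
}
```

Run it as split_by_date main_file from the directory where the output files should go. Because it appends with >>, old output files should be removed before a rerun, and since it starts one sed process per line it will be slower than the awk version on large inputs.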

Related

Using sed command with delimiter character

I have a text file as below containing multiple lines with = in the middle of each line.
User name = user1
Date expire = Oct 20, 2019
I want to find Date expire and replace the right side of = which is the date with something else via sed. For example, Oct 25, 2019.
I know basic usage of sed 's/foo/bar/g' but that is used for fixed strings. I want to change part of the sentence by detecting a special character.
How can I do that?
You could try the following:
sed '/Date expire/s/\(.*= \).*/\1your_new_text_here/' Input_file
This uses sed's capture groups, which store matched text in a temporary buffer. Everything up to and including "= " is stored in the first buffer, and the rest of the line is matched without being stored. The whole line is then replaced with the first buffer (\1) followed by the new value. Since \1 already ends with the space after =, no extra space is needed before the new text.
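As a small sketch of the same substitution applied to the sample from the question, the hypothetical helper set_expiry below takes the new value as its first argument. It assumes the new value contains no /, & or \ characters, since those are special in a sed replacement.

```shell
# Hypothetical helper: set_expiry NEW_VALUE [FILE...]
# Replaces everything after "= " on lines that start with "Date expire".
# Assumes NEW_VALUE has no "/", "&" or "\" (special in a sed replacement).
set_expiry() {
  new=$1
  shift
  sed "/^Date expire/s/\(.*= \).*/\1$new/" "$@"
}
```

For example, set_expiry 'Oct 25, 2019' Input_file prints the file with only the Date expire line changed; other lines pass through untouched.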

Get part of a string based on conditions using regex

For the life of me, I can't figure out the combination of regular expression characters to parse the part of the string I want. Each string is one line out of 400 thousand lines (in no particular order), which I find inside a for loop by matching a unique number taken from an array.
For every string I'm trying to get a date number (such as 20151212 below).
Given the following examples of the strings (pulled from a CSV file with 400k++ lines of strings):
String1:
314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,
String2:
365422,johnd@blankity.com,John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,
String3:
6231,,,,31248,U51523144,,,CB,,,,,,,
There are several complications here...
Some names have a "," in them, so it makes it more than 15 commas.
We don't know the value of the date, just that it is a date format such as (get-date).tostring("yyyyMMdd")
For those who can think of a better way...
We are given two CSV files to match. Algorithmic steps:
Look in the CSV file 1 for the ID Number (found on the 2nd column)
** No ID Numbers will be blank for CSV file 1
Look in CSV file 2 and match the ID number from CSV file 1. On this same line, get the date. Once we have the date, append it in the 5th column of CSV file 1, on the same row as the ID number
** Note: CSV file 2 will have $null for some of the values in the ID number column
I'm open to suggestions (including using the Import-Csv cmdlet, although I am not too familiar yet with its flags or with the syntax of for loops over those values).
You could try something like this:
,(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),
This will match all dates in the given format from 1900 - 2099. It is also specific enough to rule out most other random numbers, although without a larger sample of data, it's impossible to say.
Then in PowerShell:
gc data.csv | where { $_ -match ",((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
In the PowerShell match we added capturing parentheses around the part we want, and we reference that group by its number in the $matches index.
If you are only interested in matching one line based on a preceding id you could use a lookbehind. For example,
$id=314513; # Or maybe U23481
gc c:\temp\reg.txt | where { $_ -match "(?<=$id.*),((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
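The same date pattern can also be tried outside PowerShell. The following is only a sketch: it assumes a grep that supports -o (a common extension found in GNU and BSD grep, not required by POSIX), and the function name extract_dates is hypothetical.

```shell
# Hypothetical helper: extract_dates [FILE...]
# Prints every comma-delimited yyyyMMdd value (years 1900-2099), one per line.
# Assumes grep supports -o (print only the matched text).
extract_dates() {
  grep -oE ',(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),' "$@" |
    tr -d ','
}
```

Applied to the three sample strings from the question, this prints a date for String1 and String2 and nothing for String3, which has no date field.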

Ordering output files

I have a file containing a large number of protein sequences. Each sequence is headed by an initial "protein ID number" (GI number for those that know). I am using an awk command that allows me to print between two regular expressions. Using this, I can enter a list of GI numbers into one regex field where each GI number is separated by a "|". The second regex matches a marker (ABC123) that I added after every protein so that awk knows where each record ends.
Therefore the code I am using is as follows
awk '/GI1|GI2|GI3|GI4|GIX.../,/ABC123/' database.txt > output.txt
As you can see from the above code, I am searching within database.txt and writing a new file. The problem is, when I open output.txt the list of GI's is in the wrong order. In output.txt I need them to occur in the same order as they occur in the first regex field i.e
GI1
GI2
GI3...
Instead, they occur in the order which they are found in database.txt, so in output.txt they look all jumbled i.e
GI3
GI4
GI1
GI2
GI5
Does anyone know how I can get the list of GIs in the output file to match the same order as the list of GIs I input in the 1st regex field?
Try this command,
awk '/GI1|GI2|GI3|GI4|GIX.../,/ABC123/' database.txt | sort -k1.3,1.3 > output.txt
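Now output.txt contains the sorted list.
The specification 1.3,1.3 says that the sort key starts at field 1, character 3 and ends at the same place, so lines are ordered by that single character. Note that sort reorders every line of the output independently, so this only helps when each record is a single line and the IDs differ at that character.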
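When records span several lines, one alternative is to run the range pattern once per ID, in the order you want. This is a sketch under stated assumptions, not part of the answer above: the function name extract_in_order is hypothetical, ABC123 is the end marker from the question, and IDs are matched as plain substrings (so GI1 would also match a line containing GI10).

```shell
# Hypothetical helper: extract_in_order DATABASE ID...
# Prints each record (from the line containing ID through the ABC123 marker)
# in the order the IDs are given, reading DATABASE once per ID.
extract_in_order() {
  db=$1
  shift
  for id in "$@"; do
    # index() does a literal substring match, so ID is not treated as a regex.
    awk -v id="$id" 'index($0, id), /ABC123/' "$db"
  done
}
```

For example, extract_in_order database.txt GI1 GI2 GI3 > output.txt writes the GI1 record first regardless of where it sits in database.txt. The cost is one pass over the file per ID, which may matter for very long ID lists.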

How would I parse in a bash script date_value _space_ date_value

I am trying to import a tsv file into a mysql db but I am having trouble since the file has no unique delimiters to identify where a new row starts. The only unique identifier is a date followed by a space followed by time. Example: 6/19/2010 16:04:43
Could someone please point me in the right direction or help me make a bash script that puts a semicolon ";" in front of that string. So the end result will be ;6/19/2010 16:04:43
The tricky part is that in this file there will be other date fields and other time fields but this is the only string that will have a space in between the two.
sed 's#[0-9]\{1,2\}/[0-9]\{1,2\}/[0-9]\{4\} #;&#g' file > resultfile
Test before using.
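A slightly stricter variant also requires the time after the space, which reduces the chance of marking a date that merely happens to be followed by a space. This is a sketch: it assumes times are always written as h:mm:ss or hh:mm:ss and that sed supports -E, and the function name mark_row_starts is hypothetical.

```shell
# Hypothetical helper: mark_row_starts [FILE...]
# Puts ";" in front of every "date space time" string, e.g. 6/19/2010 16:04:43.
# Dates or times that stand alone (not joined by a single space) are untouched.
mark_row_starts() {
  sed -E 's#[0-9]{1,2}/[0-9]{1,2}/[0-9]{4} [0-9]{1,2}:[0-9]{2}:[0-9]{2}#;&#g' "$@"
}
```

Usage would be mark_row_starts file > resultfile, followed by a spot check of a few rows before importing.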

Regular Expression to replace a pattern at runtime(C#3.0)

I have a requirement.
I have some files in a folder, among which some file names look like
EUDataFiles20100503.txt, MigrateFiles20101006.txt.
Basically these are the files that I need to work upon.
Now I have a config file where it is mentioned as the file pattern type as
EUDataFilesYYYYMMDD, MigrateFilesYYYYMMDD.
The idea is that the user can configure the file pattern, and based on that pattern I need to search for the matching files present in the folder.
That is, at runtime YYYYMMDD gets replaced by the year, month, and day values. It does not matter which dates appear (there is no time stamp, only dates).
The EUDataFiles or MigrateFiles prefixes will always be there (they are fixed).
For example, if the folder has a file named EUDataFile20100504.txt (year 2010, month 05, day 04), I should ignore it because it is not EUDataFiles20100504.txt (note that the configured name is plural, Files and not File, so the system should ignore this file).
Similarly, if the pattern is given as EUDataFilesYYYYMMDD and the file present is of the form EUDataFilesYYYYDDMM, the system should also ignore it.
How can I solve this problem? Is it doable using regular expression(Replacing the pattern at runtime)?
If so can anyone be good enough in helping me out?
I am using C#3.0 and dotnet framework 3.5.
Thanks
You could construct a regex from your basic file name plus (depending on the pattern) sub-regexes.
The sub-regexes could be
string yyyy = @"\d{4}";
(unless you want to restrict it to a certain year range)
string mm = @"(1[0-2]|0[1-9])";
string dd = @"(3[01]|[12][0-9]|0[1-9])";
Build your regex by adding them in the correct order:
string re = @"\AEUDataFiles" + yyyy + mm + dd + @"\.txt\Z";
Then you can check whether the filename(s) you've found match the regex:
bool foundMatch = Regex.IsMatch(subjectString, re);
Of course, this isn't a validation for correct dates (20100231 would pass), but that's probably not a problem in this case.
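The idea of expanding the configured placeholder into sub-regexes can be prototyped quickly in shell before writing the C# version. This is only an illustrative sketch: the function names pattern_to_regex and match_files are hypothetical, the .txt extension is assumed, and the fixed prefix is assumed to contain no regex metacharacters.

```shell
# Hypothetical helper: pattern_to_regex PATTERN
# Expands the YYYYMMDD placeholder into year, month and day sub-regexes.
pattern_to_regex() {
  printf '%s\n' "$1" |
    sed 's/YYYYMMDD/[0-9]{4}(1[0-2]|0[1-9])(3[01]|[12][0-9]|0[1-9])/'
}

# Hypothetical helper: match_files PATTERN NAME...
# Prints only the names that match PATTERN followed by ".txt".
match_files() {
  re="^$(pattern_to_regex "$1")\.txt\$"
  shift
  printf '%s\n' "$@" | grep -E "$re"
}
```

For example, match_files EUDataFilesYYYYMMDD EUDataFiles20100503.txt EUDataFile20100504.txt keeps only the first name, because the second lacks the plural "Files". As with the C# version, an impossible date such as 20100231 still passes.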