I have a small C++ program that reads its arguments from bash.
Let's say that I have a folder with some files with two different extensions.
Example: file1.ext1 file1.ext2 file2.ext1 file2.ext2 ...
if I execute the program with this command: ./readargs *.ext1
it will read all the files with the .ext1 extension.
if I execute ./readargs *.ext1 *.ext2 it will read all the .ext1 files and then all the .ext2 files.
My question is: how can I execute the program so that it reads the files in this order: file1.ext1 file1.ext2 file2.ext1 file2.ext2 ... Can I handle that from the command line, or do I need to handle it in the code?
If the names of your files really are of the form file1.ext1 file1.ext2 file2.ext1 file2.ext2, then you can sort them with
printf '%s\n' *.{ext1,ext2} | sort -u
(printf emits one name per line, so sort has separate lines to work on; echo would put everything on a single line and sort would have nothing to do).
For example, with these files:
$ ll | grep ext
23880775 0 -rw-r--r-- 1 user users 0 Apr 29 13:28 file1.ext1
23880789 0 -rw-r--r-- 1 user users 0 Apr 29 13:28 file1.ext2
23880787 0 -rw-r--r-- 1 user users 0 Apr 29 13:28 file2.ext1
23880784 0 -rw-r--r-- 1 user users 0 Apr 29 13:28 file2.ext2
$ printf '%s\n' *.{ext1,ext2} | sort -u
file1.ext1
file1.ext2
file2.ext1
file2.ext2
Then you copy the output and call your program. But if you do in fact need the .ext1 file before the .ext2 file for each name, then you have to either make sure that .ext1 sorts alphabetically before .ext2 or use another sorting criterion.
Optionally you could also adapt your executable to put its command-line arguments into the correct order itself, but if you already have a working executable I'd recommend the first solution as a workaround.
edit: a single glob that matches both extensions also comes back in lexical order, because the shell sorts each expansion:
$ echo *.ext[12]
file1.ext1 file1.ext2 file2.ext1 file2.ext2
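Since the shell already sorts each glob expansion, the interleaved order can be had straight from the command line; a minimal sketch, assuming the file names contain no whitespace:
./readargs *.ext[12]                                  # one glob, expanded in sorted order
./readargs $(printf '%s\n' *.{ext1,ext2} | sort -u)   # or reuse the sort pipeline above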
$ ls -l
total 4
drwxr-xr-x 2 t domain users 4096 Nov 3 17:55 original
lrwxrwxrwx 1 t domain users 8 Nov 3 17:56 symbolic -> original
Here symbolic is a symbolic link pointing to the original folder.
Contents of the original folder:
$ ls -l original/
total 8
-rw-r--r-- 2 t domain users 4096 Nov 3 17:55 mydoc.docx
I have a file path in my code like:
std::string fileName = "/home/Downloads/symbolic/mydoc.docx";
path filePath(fileName);
How can I check whether fileName is a symbolic link?
is_symlink(filePath) is returning false and read_symlink(filePath) is returning an empty path.
I want to use canonical only if it is a symbolic link, like this:
if (is_symlink(filePath))   // this is returning false; any other alternative?
{
    newFilePath = canonical(filePath);
}
According to the man page, is_symlink needs a filesystem::path as its parameter, not a std::string. You may want to try that.
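For what it's worth, a minimal C++17 sketch (assuming std::filesystem; Boost.Filesystem behaves the same way here). Note that is_symlink only examines the last component of its argument, which is why it reports false for this path: mydoc.docx itself is a regular file, and the link is the parent directory symbolic. canonical, on the other hand, resolves links in every component, so it can simply be called unconditionally on an existing path:
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    // the link is the directory "symbolic", not the leaf "mydoc.docx"
    fs::path filePath{"/home/Downloads/symbolic/mydoc.docx"};
    std::cout << std::boolalpha << fs::is_symlink(filePath) << '\n';  // false

    // canonical() resolves symlinks in every component of an existing path
    fs::path newFilePath = fs::canonical(filePath);
    std::cout << newFilePath << '\n';
}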
I have three CSV files containing different data for a common set of objects. These represent data about distinct collections of items at work, and the objects have unique codes. The number of files is not important, so I will set this problem up with two. I have a handy recipe for joining these files using join, but the cleaning part is killing me.
File A snippet - contains unique data. Also the cataloging error E B.
B 547
J 65
EB 289
E B 1
CO 8900
ZX 7
File B snippet - unique data about a different dimension of the objects.
B 5
ZX 67
SD 4
CO 76
J 54
EB 10
Note that file B contains a code not in common with file A.
Now I submit to you the "official" canon of codes designated for this set of objects:
B
CO
ZX
J
EB
Note that File B contains a non-canonical code (SD) with data. It needs to be captured and documented, as does the bad code in File A.
End goal: run trend and stats on the collections using the various fields from the multiple reports. They mostly match the canon but there are oddballs due to cataloging errors and codes that are no longer in use.
End goal result after merge/join:
B 547 5
J 65 54
EB 289 10
CO 8900 76
ZX 7 67
So my first idea was to use grep -F -f, using the canonical codes as a search list, and then merge with join. The problem is that with one-letter codes it's too inclusive. It seems like a job for awk, which can work with tab delimiters and regex the oddball codes, but I'm not sure how to get awk to use a list to sift other files. Will join alone handle all this? Maybe I merge with join or paste, then sift out the weirdos? Which method is the least brittle and most likely to handle edge cases like the drunk cataloger?
If you're thinking, "Dude, this is better done with Perl or Python ...etc.". I'm all ears. No rules, I just need to deliver!
Your question says the data is CSV, but based on your samples I'm assuming it's TSV. I'm also assuming E B should end up in the outlier output and that missing values should be filled with 0.
Given those assumptions, the following may be sufficient:
sort -t $'\t' -k 1b,1 fileA > fileA.sorted && sort -t $'\t' -k 1b,1 fileB > fileB.sorted
join -t $'\t' -a1 -a2 -e0 -o auto fileA.sorted fileB.sorted > out
grep -f codes out > out-canon
grep -vf codes out > out-oddball
The content of file codes:
^B\s
^CO\s
^ZX\s
^J\s
^EB\s
Result:
$ cat out-canon
B 547 5
CO 8900 76
EB 289 10
J 65 54
ZX 7 67
$ cat out-oddball
E B 1 0
SD 0 4
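Note that -o auto (which pads the output with the -e fill value automatically) is a GNU extension; if your join lacks it, the same effect comes from listing the output fields explicitly, e.g.:
join -t $'\t' -a1 -a2 -e0 -o 0,1.2,2.2 fileA.sorted fileB.sorted > out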
Try this (GNU awk):
awk '
BEGIN { FS = OFS = "\t" }
ARGIND == 1 { c[$1]++ }        # first file: canonical codes
ARGIND == 2 { b[$1] = $2 }     # second file: values from fileB
ARGIND == 3 {                  # third file: fileA
    if (c[$1]) { print $1, $2, b[$1] + 0; delete b[$1] }
    else if (tolower($1) ~ /[a-z]+ +[a-z]+/) { print > "error.fileA" }
    else { print > "oddball.fileA" }
}
END {
    for (i in b) {
        print i, 0, b[i] " (? maybe?)"
        print i, b[i] > "oddball.fileB"
    }
}' codes fileB fileA
It will create error.fileA and oddball.fileA (if such lines exist) and oddball.fileB.
The normal output is not written to a file; redirect it with > yourself once the results look right:
B 547 5
J 65 54
EB 289 10
CO 8900 76
ZX 7 67
SD 0 4 (? maybe?)
I had a hard time reading your description, so I'm not sure this is exactly what you want.
Anyway, it's easy to adapt this awk code.
If ARGIND does not work (it is specific to GNU awk), you can switch to tests like FILENAME == "codes", or FILENAME == ARGV[1].
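A self-contained sketch of that portable FILENAME dispatch, runnable in any POSIX awk (the print statements stand in for the per-file actions above):
awk 'FILENAME == ARGV[1] { print "from codes:", $0; next }
     FILENAME == ARGV[2] { print "from fileB:", $0; next }
     { print "from fileA:", $0 }' codes fileB fileA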
I am working with DHS data, which involves various data files with consistent naming located in different folders. Each folder contains data for a specific country and survey year.
I would like to import datasets whose names contain the component 'HR'; for example, ETHR41FL.DTA. The 'HR' part is consistent, but the other components of the name vary by country and survey year. I need to work with one dataset at a time and then move to the next, so I believe an automated search would be helpful.
Running the command below gives:
dir "*.dta"
42.6M 5/17/07 10:49 ETBR41FL.dta
19.4M 7/17/06 12:32 ETHR41FL.DTA
60.5M 7/17/06 12:33 ETIR41FL.DTA
10.6M 7/17/06 12:33 ETKR41FL.DTA
234.4k 4/05/07 12:36 ETWI41FL.DTA
I have tried the following approach which did not go through as desired and might not be the best or most direct approach:
local datafiles : dir . files "*.dta" //store file names in a macro
di `datafiles'
etbr41fl.dtaethr41fl.dtaetir41fl.dtaetkr41fl.dtaetwi41fl.dta
The next step, I think, would be to store the value of the macro datafiles above in a variable (since strupper does not seem to work with macros, only variables), then convert to uppercase and extract the string ETHR41FL.dta. However, I encounter a problem when I do this:
local datafiles : dir . files "*.dta" //store file names in a macro
gen datafiles= `datafiles'
invalid '"ethr41fl.dta'
If I try the command below it works but gives a variable of empty values:
local datafiles : dir . files "*.dta" //store file names in a macro
gen datafiles= "`datafiles'"
How can I store the components of datafiles into a new variable?
If this works I could then extract the required string using a regular expression and import the dataset:
gen targetfile= regexs(0) if(regexm(`datafiles', "[A-Z][A-Z][H][R][0-9][0-9][A-Z][A-Z]"))
However, I would also appreciate a different approach.
Following Nick's advice to continue working with local macros rather than putting filenames into Stata variables, here is some technique to accomplish your stated objective. I agree with Nick that you can ignore the capitalization of filenames reported by Windows, which uses a case-insensitive filesystem. My example will work on case-sensitive filesystems, but will match any upper-, lower-, or mixed-case filenames.
. dir *.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 a space.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etbr41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 ethr41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etir41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etkr41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etwi41fl.dta
. local datafiles : dir . files "*.dta"
. di `"`datafiles'"'
"a space.dta" "etbr41fl.dta" "ethr41fl.dta" "etir41fl.dta" "etkr41fl.dta" "etwi41fl.dta"
. foreach file of local datafiles {
2. display "`file' testing"
3. if regexm(upper("`file'"),"[A-Z][A-Z][H][R][0-9][0-9][A-Z][A-Z]") {
4. display "`file' matched!"
5. // process file here
. }
6. }
a space.dta testing
etbr41fl.dta testing
ethr41fl.dta testing
ethr41fl.dta matched!
etir41fl.dta testing
etkr41fl.dta testing
etwi41fl.dta testing
You can use filelist (from SSC) to create a dataset of file names. You can then leverage the full set of Stata data management tools to identify the file you want to target. To install filelist, type in Stata's command window:
ssc install filelist
Here's a quick example with datasets that follow the example provided:
. filelist, norecur
Number of files found = 6
. list if strpos(upper(filename),".DTA")
+---------------------------------+
| dirname filename fsize |
|---------------------------------|
1. | . ETBR41FL.dta 12,207 |
2. | . ETHR41FL.DTA 12,207 |
3. | . ETIR41FL.DTA 12,207 |
4. | . ETKR41FL.DTA 12,207 |
5. | . ETWI41FL.DTA 12,207 |
+---------------------------------+
. keep if regexm(upper(filename), "[A-Z][A-Z][H][R][0-9][0-9][A-Z][A-Z]")
(5 observations deleted)
. list
+---------------------------------+
| dirname filename fsize |
|---------------------------------|
1. | . ETHR41FL.DTA 12,207 |
+---------------------------------+
.
. * with only one observation in memory, use immediate macro expansion
. * to form the file name to read in memory
. use "`=filename'", clear
(1978 Automobile Data)
. describe, short
Contains data from ETHR41FL.DTA
obs: 74 1978 Automobile Data
vars: 12 18 Jan 2016 11:58
size: 3,182
Sorted by: foreign
I find the question very puzzling as it is about extracting a particular filename; but if you know the filename you want, you can just type it directly. You may need to revise your question if the point is different.
However, let's discuss some technique.
Putting file names inside Stata variables (meaning, strictly, columns in the dataset) is possible in principle, but it is only rarely the best idea. You should keep going in the direction you started, namely defining and then manipulating local macros.
In this case the varying element can be extracted by inspection, but let's show how to remove some common elements:
. local names etbr41fl.dta ethr41fl.dta etir41fl.dta etkr41fl.dta etwi41fl.dta
. local names : subinstr local names ".dta" "", all
. local names : subinstr local names "et" "", all
. di "`names'"
br41fl hr41fl ir41fl kr41fl wi41fl
That's enough to show more technique, namely that you can loop over such names. In fact, with the construct you illustrate, you can do that anyway, and neither regular expressions nor anything else is needed:
. local datafiles : dir . files "*.dta"
. foreach f of local datafiles {
... using "`f'"
}
. foreach n of local names {
... using "et`n'.dta"
}
The examples here show a detail of supplying literal strings, namely that double quotes are often needed as delimiters (and are rarely harmful).
Note. Upper case and lower case in file names is probably irrelevant here. Stata will translate.
Note. You say that
. gen datafiles = "`datafiles'"
gives empty values. That's likely to be because you executed that statement in a context where the local macro was not visible. Common examples are: executing one command from a do-file editor window and another from the main command window; executing commands one by one from a do-file editor window. That's why local macros are so named; they are only visible within the same block of code.
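A minimal illustration of the scope rule, reusing the commands from above: run both lines as a single block (e.g., select both in the do-file editor and execute them together), and the macro is still visible when it is expanded:
local datafiles : dir . files "*.dta"
display `"`datafiles'"'   // non-empty, because both lines ran in one execution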
In this particular case you do not really need to use a regular expression.
The strmatch() function will do the job equally well:
local datafiles etbr41fl.dta ethr41fl.dta etir41fl.dta etkr41fl.dta etwi41fl.dta
foreach x of local datafiles {
if strmatch(upper("`x'"), "*HR*") display "`x'"
}
ethr41fl.dta
The use of the upper() function is optional.
I would like to gzip log files, but I cannot work out how to use a regular expression in my command.
My log files look like this; they roll every hour.
-rw-r--r-- 1 aus nds 191353 Sep 28 01:59 fubar.log.20150928-01
-rw-r--r-- 1 aus nds 191058 Sep 28 02:59 fubar.log.20150928-02
-rw-r--r-- 1 aus nds 190991 Sep 28 03:59 fubar.log.20150928-03
-rw-r--r-- 1 aus nds 191388 Sep 28 04:59 fubar.log.20150928-04
My script:
FUBAR_DATE=$(date -d "days ago" +"%Y%m%d ")
fubar_file="/apps/fubar/logs/fubar.log."$AUS_DATE"-^[0-9]"
/bin/gzip $fubar_file
I have tried a few variants of the regex, but without success. Can you see the simple error in my code?
Thanks in advance.
I did:
$ fubar_file="./fubar.log."${FUBAR_DATE%% }"-[0-9][0-9]"
and it worked for me.
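Two details make that work: ${FUBAR_DATE%% } strips the trailing space that the +"%Y%m%d " date format appends, and -[0-9][0-9] is a shell glob rather than a regex, so it must be left unquoted when gzip is invoked so the shell can expand it:
fubar_file="./fubar.log."${FUBAR_DATE%% }"-[0-9][0-9]"
gzip $fubar_file   # unquoted on purpose: the glob expands here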
Why not make fubar_file an array holding the matching log file names, and then use a loop to gzip them individually? Then, presuming AUS_DATE contains 20150928:
# FUBAR_DATE=$(date -d "days ago" +"%Y%m%d ") # not needed for gzip
fubar_file=( /apps/fubar/logs/fubar.log.$AUS_DATE-[0-9][0-9] )
for i in "${fubar_file[@]}"; do
gzip "$i"
done
or if you do not need to preserve the filenames in the array for later use, just gzip the files with a for loop:
for i in /apps/fubar/logs/fubar.log.$AUS_DATE-[0-9][0-9]; do
gzip "$i"
done
or, simply use find to match the files and gzip them:
find /apps/fubar/logs -type f -name "fubar.log.$AUS_DATE-[0-9][0-9]" -execdir gzip '{}' +
Note: all answers presume AUS_DATE contains 20150928.
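For completeness, a sketch of setting AUS_DATE to yesterday's date (assuming GNU date, as the question uses; note that the question's date -d "days ago" is missing a count, and its +"%Y%m%d " format appends a trailing space):
AUS_DATE=$(date -d "1 day ago" +%Y%m%d)   # e.g. 20150928, no trailing space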
I am currently storing the lines below in a file named google.txt. I want to separate these lines and store the separated strings in arrays.
For the first line, that would be:
@qf_file = q33AgCEv006441
@date = Tue Apr 3 16:12
@junk_message = User unknown
@rf_number = ngandotra@nkn.in
Each record ends at the last email address (the @rf_number entries).
q33AgCEv006441 1038 Tue Apr 3 16:12 <test10-list-bounces@lsmgr.nic.in>
(User unknown)
<ngandotra@nkn.in>
q33BDrP9007220 50153 Tue Apr 3 16:43 <karuvoolam-list-bounces@lsmgr.nic.in>
(Deferred: 451 4.2.1 mailbox temporarily disabled: paond.tndt)
<paond.tndta@nic.in>
q33BDrPB007220 50153 Tue Apr 3 16:43 <karuvoolam-list-bounces@lsmgr.nic.in>
(User unknown)
<paocorp.tndta@nic.in>
<dtocbe@tn.nic.in>
<dtodgl@nic.in>
q33BDrPA007220 50153 Tue Apr 3 16:43 <karuvoolam-list-bounces@lsmgr.nic.in>
(User unknown)
<dtokar@nic.in>
<dtocbe@nic.in>
q2VDWKkY010407 2221878 Sat Mar 31 19:37 <dhc-list-bounces@lsmgr.nic.in>
(host map: lookup (now-india.net.in): deferred)
<arjunpan@now-india.net.in>
q2VDWKkR010407 2221878 Sat Mar 31 19:31 <dhc-list-bounces@lsmgr.nic.in>
(host map: lookup (aaplawoffices.in): deferred)
<amit.bhagat@aaplawoffices.in>
q2U8qZM7026999 360205 Fri Mar 30 14:38 <dhc-list-bounces@lsmgr.nic.in>
(host map: lookup (now-india.net.in): deferred)
<arjunpan@now-india.net.in>
<amit.bhagat@aaplawoffices.in>
q2TEWWE4013920 2175270 Thu Mar 29 20:30 <dhc-list-bounces@lsmgr.nic.in>
(host map: lookup (now-india.net.in): deferred)
<arjunpan@now-india.net.in>
<amit.bhagat@aaplawoffices.in>
Untested Perl script:
Let's call this script parser.pl:
$file = shift;
open(IN, "<$file") or die "Cannot open file: $file for reading ($!)\n";
while (<IN>) {
    push(@qf_file, /^\w+/g);                                   # queue id at the start of the line
    push(@date, /(?:Sat|Sun|Mon|Tue|Wed|Thu|Fri)[\w\s:]+/g);   # day-of-week plus timestamp
    push(@junk_message, /(?<=\().+(?=\)\s*<)/g);               # text inside (...) when followed by <
    push(@rf_number, /(?<=<)[^>]+(?=>\s*$)/g);                 # the <email> at the end of the line
}
close(IN);
This assumes the last email on the line should be the "rf_number" for that line. Note that emails may be tricky to print, as they contain an @ character, and Perl is more than happy to interpolate a non-existent array for you :-)
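For example, dumping two of the arrays after the loop shows that same interpolation at work (it is also why a literal address in a double-quoted string would need its @ escaped as \@):
print "qf_file: @qf_file\n";       # arrays interpolate inside double quotes
print "rf_number: @rf_number\n";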
To call this from the command line:
parser.pl google.txt