How to print a geological legend using Generic Mapping Tools? - gmt

I want to print a geological legend for a map on which I draw the lithology.
I use this for drawing the lithology:
gmt psxy "${map_repo}lithology.gmt" $reg $proj -L -aZ=Code -C$colornames -Zzval -O -K >> $1
With lithology.gmt:
# @R119.31526783/122.10616999/21.89310323/25.63488771
# @Je4326
# @Jp"+proj=longlat +datum=WGS84 +no_defs "
# @Jw"GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]]"
# @NGid|Code|Name|Note|Time|Abbrev
# @Tinteger|integer|string|string|string|string
# @D1|7010|安山岩|安山岩|更新世|α4
# @P
122.07853538 25.62400028
122.07684956 25.62493342
122.07546161 25.62601781
122.07289488 25.62674301
122.07229883 25.62762748
122.07266463 25.63064419
122.07335464 25.63202019
122.07432764 25.63344375
122.07726916 25.63393434
122.07953347 25.63447953
122.07987601 25.63488771
122.08179566 25.6348216
122.08337494 25.6345549
122.08500313 25.63346865
122.08510312 25.63203494
122.08292455 25.62846982
122.08222477 25.62601977
122.08153516 25.62469572
122.08011729 25.62404049
122.07853538 25.62400028
# @D2|7010|安山岩|安山岩|更新世|α4
# @P
122.10424213 25.48121473
122.10308919 25.48213736
122.10247849 25.4831236
122.10257125 25.48480864
122.10320436 25.48627749
122.10372775 25.48781035
What I would like is to draw rectangles filled with the code color for each lithology appearing on my map (the file contains many other polygons that do not appear on it; my map is a subset).
I was thinking of selecting the polygons appearing on my map with:
gmt select "${map_repo}lithology.gmt" $reg > test.txt
But I end up with only the points and no attributes:
120.72416514 24.16392437
120.72699135 24.15006498
120.72789158 24.13938717
120.72816222 24.13157955
120.72753278 24.1287816
120.72716142 24.1241218
120.72501945 24.11549495
120.72299921 24.10943261
120.71692221 24.09594517
120.71198933 24.08603074
120.7046417 24.07681128
120.70084365 24.07121021
120.69854872 24.06590545
120.69589904 24.05832681
120.69312775 24.04864854
120.69237669 24.04398786
And even after having the subset of polygons appearing on my map, I am not sure how to proceed to give the correct color value to one rectangle.
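One way to keep the attributes that gmt select drops is to read the attribute comment lines directly. The following is a minimal Python sketch, not a GMT feature: the function name codes_in_region is hypothetical, it assumes the fields of each D line follow the order of the N header line (Gid|Code|Name|...), and it treats a polygon as visible when at least one vertex lies inside the region, which can miss a polygon that covers the whole map without having a vertex in it.

```python
import re


def codes_in_region(text, west, east, south, north):
    """Return sorted (code, name) pairs of polygons with a vertex inside the region."""
    found = set()
    current = None  # (code, name) of the polygon currently being read
    for raw in text.splitlines():
        line = raw.strip()
        # Attribute line such as "# @D1|7010|...": Gid|Code|Name|Note|Time|Abbrev
        m = re.match(r"#\s*[@#]D(.*)", line)
        if m:
            fields = m.group(1).split("|")
            current = (int(fields[1]), fields[2])
            continue
        # Skip blank lines, other comment lines and segment headers
        if not line or line.startswith(("#", ">")):
            continue
        lon, lat = map(float, line.split()[:2])
        if current and west <= lon <= east and south <= lat <= north:
            found.add(current)
    return sorted(found)
```

Each returned code could then be looked up in the same color table used for the map and written out as one legend entry per lithology.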


Comparing filenames and determine their incremental digits

Imagine I have a sequence of files, e.g.:
When the filenames are known, I can match against the filenames with a regular expression like:
Because I know the incremental pattern.
But what would be a generic approach to this? I mean, how can I take two file names out of the list, compare them, and find out where in the file name the counting part is, taking into account any other digits that can occur in the filename (the 400 in this case)?
Goal: I want to run the script against various file sequences to check, for example, for missing files, so finding the numbering scheme should be the first step. File sequences can occur in many different fashions, e.g.:
test_1.jpg (simple counting suffix)
segment9_400_av.ts (counting part in between, with other static digits)
01_trees_00008.dpx (padded with zeros)
Edit 2: My problem can probably be described more simply. With a given set of files, I want to:
1. Find out whether they are a numbered sequence of files, following the rules below
2. Get the first file number, the last file number and the file count
3. Detect missing files (gaps in the sequence)
Rules:
- As melpomene summarized in his answer, the file names differ in only one substring, which consists only of digits
- The counting digits can occur anywhere in the filename
- The digits can be padded with 0's (see example above)
I can do #2 and #3; what I am struggling with is #1 as a starting point.
You tagged this question regex, so here's a regex-based solution:
use strict;
use warnings;
my $name1 = 'segment12_400_av.ts';
my $name2 = 'segment10_400_av.ts';
if (
    "$name1\0$name2" =~ m{
        ( \D*+ (?: \d++ \D++ )* )  # prefix
        ( \d++ )                   # numeric segment 1
        ( [^\0]* )                 # suffix
        \0                         # separator
        \1                         # prefix
        ( \d++ )                   # numeric segment 2
        \3                         # suffix
        \z                         # nothing may follow the suffix
    }x
) {
    print <<_EOT_;
Result of comparing "$name1" and "$name2"
Common prefix: $1
Common suffix: $3
Varying numeric parts: $2 / $4
Position of varying numeric part: $-[2]
_EOT_
}
This prints:
Result of comparing "segment12_400_av.ts" and "segment10_400_av.ts"
Common prefix: segment
Common suffix: _400_av.ts
Varying numeric parts: 12 / 10
Position of varying numeric part: 7
It assumes that
the strings are different (guard the condition with $name1 ne $name2 && ... if that's not guaranteed)
there's only one substring that's different between the input strings (otherwise it won't find any match)
the differing substring consists of digits only
all digits surrounding the first point of difference are part of the varying increment (e.g. the example above recognizes segment as the common prefix, not segment1)
The idea is to combine the two names into a single string (separated by NUL, which is unambiguous because filenames can't contain \0), then let the regex engine do the hard work of finding the longest common prefix (using greediness and backtracking).
Because we're in a regex, we can get a bit more fancy than just finding the longest common prefix: We can make sure that the prefix doesn't end with a digit (see the segment1 vs. segment case above) and we can verify that the suffix is also the same.
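For comparison, the same idea can be sketched in Python. This is only an illustration under assumptions: the names PAIR and varying_number are hypothetical, and because older versions of Python's re module lack possessive quantifiers, the sketch uses negative lookaheads instead to make sure each numeric segment covers a whole run of digits.

```python
import re

# Verbose-mode pattern applied to "name1\x00name2".
PAIR = re.compile(
    r"""
    \A
    ( \D* (?: \d+ \D+ )* )   # common prefix; can never end in a digit
    ( \d+ ) (?! \d )         # whole digit run from the first name
    ( [^\x00]* )             # common suffix
    \x00                     # separator between the two names
    \1                       # the same prefix again
    ( \d+ ) (?! \d )         # whole digit run from the second name
    \3                       # the same suffix again
    \Z
    """,
    re.VERBOSE | re.ASCII,
)


def varying_number(name1, name2):
    """Return details of the single varying digit run, or None if there is none."""
    if name1 == name2:
        return None
    m = PAIR.match(name1 + "\x00" + name2)
    if m is None:
        return None
    return {
        "prefix": m.group(1),
        "suffix": m.group(3),
        "numbers": (m.group(2), m.group(4)),
        "position": m.start(2),
    }
```

Calling varying_number('segment12_400_av.ts', 'segment10_400_av.ts') reports the prefix segment, the suffix _400_av.ts and position 7, matching the Perl output above.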
See if this works for you:
use strict;
use warnings;
sub compare {
    my ( $f1, $f2 ) = @_;
    my @f1 = split /(\d+)/sxm, $f1;
    my @f2 = split /(\d+)/sxm, $f2;
    my $i = 0;
    my $out1 = q{};
    my $out2 = q{};
    foreach my $p (@f1) {
        if ( $p eq $f2[$i] ) {
            $out1 .= $p;
            $out2 .= $p;
        }
        else {
            $out1 .= sprintf ' ((%s)) ', $p;
            $out2 .= sprintf ' ((%s)) ', $f2[$i];
        }
        $i++;
    }
    print $out1 . "\n";
    print $out2 . "\n";
    return;
}
print "Test1:\n";
compare( 'segment8_400_av.ts', 'segment9_400_av.ts' );
print "\n\nTest2:\n";
compare( 'segment999_8_400_av.ts', 'segment999_9_400_av.ts' );
You basically split the strings on runs of digits, then loop through the pieces and compare each pair. If they are equal, you accumulate. If not, you highlight the differences and accumulate.
Output (I'm using ((number)) for the highlight)
segment ((8)) _400_av.ts
segment ((9)) _400_av.ts
segment999_ ((8)) _400_av.ts
segment999_ ((9)) _400_av.ts
I assume that only the counter differs across the strings
use warnings;
use strict;
use feature 'say';
my ($fn1, $fn2) = ('segment8_400_av.ts', 'segment12_400_av.ts');
# Collect all numbers from all strings
my @nums = map { [ /([0-9]+)/g ] } ($fn1, $fn2);
my ($n, $pos); # which number in the string, at what position
# Find which differ
NUMS:
for my $j (1..$#nums) { # strings
    for my $i (0..$#{$nums[0]}) { # numbers in a string
        if ($nums[$j]->[$i] != $nums[0]->[$i]) { # it is i-th number
            $n = $i;
            my @starts; # start offset of every number in the string
            push @starts, $-[0] while $fn1 =~ /[0-9]+/g;
            $pos = $starts[$i];
            say "It is $i-th number in a string. Position: $pos";
            last NUMS;
        }
    }
}
We loop over the array with arrayrefs of numbers found in each string, and over elements of each arrayref (eg [8, 400]). Each number in a string (0th or 1st or ...) is compared to its counterpart in the 0-th string (array element); all other numbers are the same.
The number of interest is the one that differs and we record which number in a string it is ($n-th).
Then its position in the string is found by matching again and using the @- regex variable, which holds match start offsets, to get the offset of the start of the $n-th number. This part may be unneeded; while the question edits helped, I am still unsure whether the position is useful.
Prints, with position counting from 0
It is 0-th number in a string. Position: 7
Note that, once it is found that it is the $i-th number, we can't use index to find its position; a number earlier in the string may happen to be the same as the $i-th one.
To test, modify input strings by adding the same number to each, before the one of interest.
Per question update, to examine the sequence (for missing files for instance), with the above findings you can collect counters for all strings in an array with hashrefs (num => filename)
use Data::Dump qw(dd);
my @seq = map { +{ $nums[$_]->[$n] => $fnames[$_] } } 0..$#fnames;
dd \@seq;
where @fnames contains the filenames (like the two picked for the example above, $fn1 and $fn2) and @nums is built from all of them. This assumes that the file list was sorted to begin with; add a sort if it wasn't:
my @seq =
    sort { (keys %$a)[0] <=> (keys %$b)[0] }
    map { +{ $nums[$_]->[$n] => $fnames[$_] } }
    0..$#fnames;
The order is maintained by array.
Adding this to the above example (with two strings) adds to the print
{ 8 => "segment8_400_av.ts" },
{ 12 => "segment12_400_av.ts" },
With this, all goals in "Edit 2" should be straightforward.
I suggest that you build a regex pattern by changing all digit sequences to (\d+) and then see which captured values have changed
For instance, with segment8_400_av.ts and
segment9_400_av.ts you would generate a pattern /segment(\d+)_(\d+)_av\.ts/. Note that s/\d+/(\d+)/g will return the number of numeric fields, which you will need for the subsequent check
The first would capture 8 and 400 while the second would capture 9 and 400. 8 is different from 9, so it is in that region of the string where the number varies
I can't really write much code as you don't say what sort of result you want from this process
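The pattern-building idea can be carried through to the goals listed in "Edit 2". The following Python sketch is one possible reading of it, not code from any of the answers; the function name describe_sequence and the shape of the returned dictionary are hypothetical, and it assumes every name in the list fits the template derived from the first name.

```python
import re


def describe_sequence(names):
    """Locate the one numeric field that varies and summarise the sequence."""
    # Turn the first name into a pattern with one group per digit run,
    # e.g. "segment8_400_av.ts" -> r"segment(\d+)_(\d+)_av\.ts".
    parts = re.split(r"(\d+)", names[0])
    pattern = "".join(
        r"(\d+)" if i % 2 else re.escape(part) for i, part in enumerate(parts)
    )
    rows = []
    for name in names:
        m = re.fullmatch(pattern, name)
        if m is None:
            return None  # this name does not fit the template
        rows.append(m.groups())
    # A field varies if it takes more than one value across the names.
    varying = [i for i in range(len(rows[0])) if len({r[i] for r in rows}) > 1]
    if len(varying) != 1:
        return None  # not a sequence with exactly one counter
    numbers = sorted(int(r[varying[0]]) for r in rows)
    missing = sorted(set(range(numbers[0], numbers[-1] + 1)) - set(numbers))
    return {
        "field": varying[0],
        "first": numbers[0],
        "last": numbers[-1],
        "count": len(numbers),
        "missing": missing,
    }
```

Converting the captured strings with int() makes zero-padded counters such as 00008 compare correctly.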

Reading mailing addresses of varying length from a text file using regular expressions

I am trying to read a text file and collect addresses from it. Here's an example of one of the entries in the text file:
Electrical Vendor Contact: John Smith Phone #: 123-456-7890
Address: 1234 ADDRESS ROAD Ship To:
Suite 123 ,
Nowhere, CA United States 12345
Phone: 234-567-8901 E-Mail:
Fax: 345-678-9012 Web Address:
Acct. No: 123456 Monthly Due Date: Days Until Due
Tax ID: Fed 1099 Exempt Discount On Assets Only
G/L Liab. Override:
G/L Default Exp:
I cannot wrap my head around how to search for and store the address for each of these entries when the number of lines in the address varies. Currently, I have a generator that reads each line of the file. The get_addrs() method then looks for markers such as the Address: and Ship keywords to signify when an address needs to be stored, and I use a regular expression to search for ZIP codes in the line following a line with the Address: keyword. I think I've figured out how to successfully save the second line for all addresses using that method. However, in a few addresses there is a suite number or other piece of information that makes the address three lines instead of two. I'm not sure how to account for this; I tried expanding my save_previous() method to three lines, but I can't get it quite right. Here's the code with which I was able to successfully save all of the two-line addresses:
import re
class GetAddress():
    def __init__(self):
        self.line1 = []
        self.line2 = []
        self.s_line1 = []
        self.addr_index = 0
        self.ship_index = 0
        self.no_ship = False
        self.addr_here = False
        self.prev_line = []
        self.us_zip = ''
    # Check if there is a shipping address.
    def set_no_ship(self, line):
        try:
            self.no_ship = line.index(',') == len(line) - 1
        except ValueError:
            # No lone ',' token in this line (handler body assumed).
            pass
    # Save two lines at a time to see whether or not the previous
    # line contains 'Address:' and 'Ship'.
    def save_previous(self, line):
        self.prev_line += [line]
        if len(self.prev_line) > 2:
            del self.prev_line[0]
    def get_addrs(self, line):
        self.addr_here = 'Address:' in line and 'Ship' in line
        self.po_box = False
        self.no_ship = False
        self.addr_index = 0
        self.ship_index = 0
        self.zip1_index = 0
        # Check if 'Address:' and 'Ship' are in the previous line.
        self.prev_addr = (
            'Address:' in self.prev_line[0]
            and 'Ship' in self.prev_line[0])
        if self.addr_here:
            self.po_box = 'Box' in line or 'BOX' in line
            self.addr_index = line.index('Address:') + 1
            self.ship_index = line.index('Ship')
            # Get the contents of the line between 'Address:' and
            # 'Ship' if both words are present in this line.
            if self.addr_index != self.ship_index:
                self.line1 += [' '.join(line[self.addr_index:self.ship_index])]
            else:
                self.line1 += ['']
        if len(self.prev_line) > 1 and self.prev_addr:
            self.po_box = 'Box' in line or 'BOX' in line
            self.us_zip = re.search(r'(\d{5}(-\d{4})?)', ' '.join(line))
            if self.us_zip and not self.po_box:
                # Argument assumed: the list token equal to the matched ZIP code.
                self.zip1_index = line.index(self.us_zip.group(0))
            if self.no_ship:
                self.line2 += [' '.join(line[:line.index(',')])]
            elif self.zip1_index and not self.no_ship:
                self.line2 += [' '.join(line[:self.zip1_index + 1])]
            elif len(self.line1) > 0 and not self.line1[-1]:
                self.line2 += ['']
# Create a generator to read each line of the file.
def read_gen(infile):
    with open(infile, 'r') as file:
        for line in file:
            yield line.split()
infile = 'Vendor List.txt'
info = GetAddress()
for i, line in enumerate(read_gen(infile)):
    # Loop body assumed: remember the line, then look for an address.
    info.save_previous(line)
    info.get_addrs(line)
I am still a beginner in Python so I'm sure a lot of my code may be redundant or unnecessary. I'd love some feedback as to how I might make this simpler and shorter while capturing both two and three line addresses.
I also posted this question to Reddit and u/Binary101010 pointed out that the text file is fixed width, and that it may be possible to slice each line in a way that only selects the necessary address information. Using this insight I added some functionality to the generator function, and I was able to produce the desired effect with the following code:
infile = 'Vendor List.txt'
# Create a generator with differing modes to read the specified lines of the file.
def read_gen(infile, mode=0, start=0, end=0, rows=None):
    with open(infile, 'r') as file:
        for i, line in enumerate(file):
            # Read to the end of the line if no end argument is given.
            stop = end if end else len(line)
            # Mode 0 gives all lines of the file
            if mode == 0:
                yield line[start:stop]
            # Mode 1 gives specific lines from the file using the rows keyword
            # argument. Make sure rows is formatted as [start_row, end_row].
            # rows list should only ever be length 2.
            elif mode == 1:
                # Return the current line if the index falls between the
                # specified rows.
                if rows and rows[0] <= i < rows[1]:
                    yield line[start:stop]
class GetAddress:
    def __init__(self):
        # Allow access to infile for use in set_addresses().
        global infile
        self.address_indices = list()
        self.phone_indices = list()
        self.addresses = list()
        self.count = 0
    def get(self, i, line):
        # Search for appropriate substrings and set indices accordingly.
        if 'Address:' in line[18:26]:
            self.address_indices += [i]
        if 'Phone:' in line[18:24]:
            self.phone_indices += [i]
        # Add address to list if both necessary indices have been collected.
        if i in self.phone_indices:
            self.set_addresses()
    def set_addresses(self):
        self.address = list()
        start = self.address_indices[self.count]
        end = self.phone_indices[self.count]
        # Create a generator that only yields substrings for rows between given
        # indices. infile and mode=1 are assumed here; the column bounds
        # (start=/end=) used in the original call are not known and are left out.
        self.generator = read_gen(
            infile,
            mode=1,
            rows=[start, end])
        # Collect each line of the address from the generator and remove
        # unnecessary spaces.
        for element in range(start, end):
            self.address += [next(self.generator).strip()]
        # This document has a header on each page and a portion of that is
        # collected in the address substring. Search for the header substring
        # and remove the corresponding elements from self.address.
        if len(self.address) > 3 and not self.address[-1]:
            self.address = self.address[:self.address.index('header text')]
        self.addresses += [self.address]
        self.count += 1
info = GetAddress()
for i, line in enumerate(read_gen(infile)):
    info.get(i, line)
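The fixed-width idea can also be shown in a more compact form. The sketch below is an illustration only: the function name extract_addresses and the column ranges label_cols and addr_cols are hypothetical and would have to be measured in the real file, and it assumes every address block starts on the line labelled Address: and ends just before the line labelled Phone:.

```python
def extract_addresses(lines, label_cols=(0, 9), addr_cols=(9, 40)):
    """Collect address blocks of any length from fixed-width lines.

    label_cols and addr_cols are assumed column ranges, not values taken
    from the vendor file.
    """
    addresses = []
    current = None  # lines of the address being collected, if any
    for line in lines:
        label = line[label_cols[0]:label_cols[1]].strip()
        if label == "Address:":
            current = []  # a new address block starts here
        elif label == "Phone:" and current is not None:
            addresses.append(current)  # the block ends at the phone line
            current = None
        if current is not None:
            # Slice out the address columns and drop padding and a lone comma.
            text = line[addr_cols[0]:addr_cols[1]].strip(" ,")
            if text:
                current.append(text)
    return addresses
```

Because the block is delimited by the two labels rather than by a line count, two-line and three-line addresses are handled the same way.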

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way to separate the columns is to find the character columns that contain only spaces on every line. How can we identify these columns and replace them with a unique separator such as a comma?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
If we find all continuous columns with one or more white spaces (nothing else) and replace them with , (all the column) the problem will be solved.
Better explanation of the question by josifoski :
Per block of matrix characters, if all are 'space' then all block should be replaced vertically with one , on every line.
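$ cat tst.awk
BEGIN { FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }   # one character per field; read the file twice
NR==FNR {                                # first pass: classify every character position
    for (i=1;i<=NF;i++) {
        if ($i == " ") { space[i] }
        else { nonSpace[i] }
    }
    next
}
FNR==1 {                                 # second pass starts: keep only all-space positions
    for (i in nonSpace) { delete space[i] }
}
{
    for (i in space) { $i = "," }
    gsub(/,+/,",")                       # collapse adjacent separator columns into one
    print
}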
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""}                         # Empty field separator so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} # First pass: a[i] counts the non-space characters
                                              # at position i. Skips all further commands
                                              # for the first file.
{                                             # Second pass (same file, read a second time)
    for(i=1;i<=NF;i++)                        # Loop through the fields
        if(!a[i]){                            # If position i is a space on every line
            $i=","                            # Change field to ","
            x=i                               # Set x to field number
            while(x<NF && !a[x+1]){           # While the next position is also all spaces
                $(++x)=""                     # Change that field to nothing
            }
            i=x                               # Set i to x so it does not do those fields again
        }
}1' test{,}                                   # Print, and use the same file twice
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
library(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)
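The rule quoted in the question (a block of character columns that are spaces on every line becomes one separator) can also be written out step by step. This is a minimal Python sketch under the assumption that the whole file fits in memory; the function name split_on_blank_columns is hypothetical.

```python
def split_on_blank_columns(lines, sep=","):
    """Replace each run of character columns that are blank on every line with sep."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # blank[i] is True when column i holds a space on every line.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    result = []
    for row in padded:
        chars = []
        for i, ch in enumerate(row):
            if not blank[i]:
                chars.append(ch)  # ordinary data character
            elif i == 0 or not blank[i - 1]:
                chars.append(sep)  # first column of a blank run
        result.append("".join(chars).rstrip())  # drop the padding again
    return result


sample = ["aaa b b ccc 345", "ddd fgt f u 3456", "e r der der 5 674"]
print("\n".join(split_on_blank_columns(sample)))
```

Spaces inside a value, such as the one in "b b", survive because that column holds a non-space character on some other line.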

Sequentially replace multiple places matching single pattern in a string with different replacements

Using stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
x <- "hello,world??your,make|[]world,hello,pos"
with a different replacement, e.g. increasing numbers
"1,2??3,4|[]5,6,7"
Note that simple separators cannot be assumed; the practical use case is more complicated.
stringr::str_replace_all does not seem to work, because
str_replace_all(x, "(\\w+)", 1:7)
produces a vector in which each replacement is applied to all words, and because the input has uncertain and/or duplicate entries, so that
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0, # initialise the counter before any substitution
fun = function(t, x) t$v <- v + 1) # increment the counter and return it for each match
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
# match: hello world your make world hello pos
# context: , ?? , |[] , ,
# number: 1==== 2==== 3=== 4=== 5==== 6==== 7==
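Here is a base R solution. It should also be vectorized.
x <- "hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split <- strsplit(x, "")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res <- rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values == "a"] <- 1
#replace run values by increasing number
rle_res$values[rle_res$values == "a"] <- seq_len(sum(rle_res$values == "a"))
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res), collapse = "")
#[1] "1,2??3,4|[]5,6,7"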
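The technique common to these answers is a replacement function that keeps a counter across matches. For readers more familiar with Python, a minimal sketch of that idea follows; the function name number_matches is hypothetical, and the pattern \w+ is taken from the answers above.

```python
import itertools
import re


def number_matches(text, pattern=r"\w+"):
    """Replace the n-th match of pattern with the number n, counting from 1."""
    counter = itertools.count(1)
    # re.sub calls the function once per match, from left to right.
    return re.sub(pattern, lambda match: str(next(counter)), text)


print(number_matches("hello,world??your,make|[]world,hello,pos"))
```

A fresh counter is created on every call, so repeated calls each start again at 1.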

Perl regex & data extraction/manipulation

I'm not sure where to start with this one... my client gets stock figures from his supplier but they are now being sent in a different format, here is a sample snippet:
[["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"],["L","1119"],["XL","1069"],["XXL","412"],["3XL","171"]]],["BOT",[["Bottle","15451A"]],[["S","226"],["M","425"],["L","772"],["XL","509"],["XXL","163"]]],["BUR",[["Burgundy","73002E"]],[["S","402"],["M","530"],["L","356"],["XL","257"],["XXL","79"]]],["DNA",[["Deep Navy","000F33"]],[["S","699"],["M","1161"],["L","1645"],["XL","1032"],["XXL","350"]]],["EME",[["Emerald","0DAB5E"]],[["S","392"],["M","567"],["L","613"],["XL","431"],["XXL","97"]]],["HEA",[["Heather","C0D4D7"]],[["S","374"],["M","447"],["L","731"],["XL","386"],["XXL","115"],["3XL","26"]]],["KEL",[["Kelly","0FFF00"]],[["S","167"],["M","285"],["L","200"],["XL","98"],["XXL","45"]]],["NAV",[["Navy","002466"]],[["S","451"],["M","1389"],["L","1719"],["XL","1088"],["XXL","378"],["3XL","177"]]],["NPU",[["Purple","560D55"]],[["S","347"],["M","553"],["L","691"],["XL","230"],["XXL","101"]]],["ORA",[["Orange","FF4700"]],[["S","125"],["M","273"],["L","158"],["XL","98"],["XXL","98"]]],["RED",[["Red","FF002E"]],[["S","972"],["M","1186"],["L","1246"],["XL","889"],["XXL","184"]]],["ROY",[["Royal","1500CE"]],[["S","1078"],["M","1346"],["L","1102"],["XL","818"],["XXL","135"]]],["SKY",[["Sky","91E3FF"]],[["S","567"],["M","919"],["L","879"],["XL","498"],["XXL","240"]]],["SUN",[["Sunflower","FFC700"]],[["S","843"],["M","1409"],["L","1032"],["XL","560"],["XXL","53"]]],["WHI",[["White","FFFFFF"]],[["S","631"],["M","2217"],["L","1666"],["XL","847"],["XXL","410"],["3XL","74"]]]]
Firstly, the initial [ and the final ] can be removed.
Then it needs to be broken down into segments of colours, i.e.:
["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"],["L","1119"],["XL","1069"],["XXL","412"],["3XL","171"]]]
The BLK is needed here; the next block [["Black","0F1315"]] can be disregarded.
Next I need to take the stock data for each size, ["S","813"] etc.
Therefore I should have data such as:
$col = BLK
$size = S
$qty = 813
$col = BLK
$size = M
$qty = 1378
and repeat this for every colour segment in the data.
The number of colour segments in the data will vary, as will the number of sizing segments within each. The number of sizing segments will also vary from colour to colour, i.e. there may be 6 sizes for BLK but only 5 for RED.
The data will be written out while in the loop for these so something like print "$col:$size:$qty" will be fine as this would then be in a format ready to be processed.
Sorry for the long message, I just can't seem to get my head round this today!!
This looks like valid JSON to me, why not use a JSON parser instead of trying to solve this with a regex?
use JSON;
my $json_string = '[["BLK",[["Black","0F1315"]],[["S","813"...<snip>';
my $deserialized = from_json( $json_string );
Then you can iterate over the array and extract the pieces of information you need.
Building on Tim Pietzcker's answer:
my $deserialized = from_json( $json_string );
foreach my $group ( @$deserialized ) {
    my ( $color, undef, $sizes ) = @$group;
    print join( ":", $color, @$_ ), "\n" for @$sizes;
}
(And yes, for this particular format, eval should do as well as from_json, although the latter is safer. However, you should really try to find an official spec for the format: is it really JSON or something else?)
Assuming you have your data in $str, then eval(EXPR) (Danger Will Robinson!) and process the resulting data structure:
my $struct = eval $str;
foreach my $cref (@$struct) {
    my($color, undef, $sizerefs) = @$cref; # 3 elements in each top level
    foreach my $sizeref (@$sizerefs) {
        my($size, $qty) = @$sizeref;
        print "$color:$size:$qty\n";
    }
}
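The same walk over the parsed structure looks much the same in other languages. As an illustration only, here is a Python sketch using the standard json module; the function name stock_lines is hypothetical, and it assumes, as the answers do, that the supplier data really is valid JSON with three elements per colour.

```python
import json


def stock_lines(json_text):
    """Return one "colour:size:quantity" string per size entry."""
    rows = []
    # Each top-level element is [colour code, colour names, size list].
    for colour, _names, sizes in json.loads(json_text):
        for size, qty in sizes:
            rows.append(f"{colour}:{size}:{qty}")
    return rows


sample = '[["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"]]],["BOT",[["Bottle","15451A"]],[["S","226"]]]]'
print("\n".join(stock_lines(sample)))
```

The number of colours and the number of sizes per colour can both vary, since each loop simply runs over whatever the parsed lists contain.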