Regular Expression to deal with text issue - regex

I have the following text sample:
Sample Supplier 123
 AP Invoices 123456 -229.00
 AP Invoices 235435 337.00
 AP Invoices 444323 228.00
 AP Invoices 576432 248.00
It's from a text file with 21,000 lines, which lists invoices against a supplier.
The pattern is always the same on each block of invoices against each supplier, where:
The supplier name starts at the beginning of a line
The invoices begin to be listed 2 rows down from the supplier name, indented by one space.
I wondered if I can use a Regular Expression (I'm using TextPad as a Text Editor on a Windows PC) to:
Prefix each invoice line with a tab (\t)
Put the supplier name in front of the tab, so each invoice line now starts with the supplier name and a tab, where the supplier name is taken from 2 rows above the start of each block of invoices
Delete the supplier name line from above the invoice block.
Expected output:
Sample Supplier 123 AP Invoices 123456 -229.00
Sample Supplier 123 AP Invoices 235435 337.00
Sample Supplier 123 AP Invoices 444323 228.00
Sample Supplier 123 AP Invoices 576432 248.00
I realise I am probably asking for "the moon on a stick" here, but the alternative is to go through a 21,000 line text file and copy and paste the data into Excel, which might not be a very good use of my time.
Maybe I can't do it using a simple regular expression, or maybe it's simply not possible at all.
Any advice would be much appreciated.
Thanks

I would use a simple Python script to solve this issue:
currentheader = ""
with open("yourfile.txt") as f:
    with open("newfile.txt", "w") as fw:
        for line in f:
            if len(line.strip()) == 0:
                continue
            elif line[0] != " ":  # new header
                currentheader = line[:-1]
            else:
                fw.write(currentheader + "\t" + line[1:])
For this to work on Windows, you will have to install Python; the script should work with both Python 2 and 3. After installing Python, save the script as "dealWithTextIssue.py" in the folder your file is in, open a command line (Win+R, cmd, Enter), navigate to that folder using cd foldername if necessary, and then type python dealWithTextIssue.py.

I think this isn't solvable just with regex, you'll have to do some programming. I made a little script in PHP:
$string = <<<EOL
Sample Supplier 123
 AP Invoices 123456 -229.00
 AP Invoices 235435 337.00
 AP Invoices 444323 228.00
 AP Invoices 576432 248.00
Second Supplier
 A B C D
 B F
EOL;
$header = '';
$array = preg_split("~[\n\r]+~", $string);
foreach ($array as $value) {
    // Skip empty or whitespace-only lines
    if (trim($value) === '') {
        continue;
    }
    // Invoice lines are indented by one space; anything else is a supplier name
    if ($value[0] === ' ') {
        // Supplier name, then a tab, then the invoice line
        echo $header."\t".trim($value).PHP_EOL;
    }
    else {
        $header = $value;
    }
}
You can see it at work for example here after clicking on execute code.

Related

Need to extract data which is in tabular format from a python list

         Team A            Team B
name     xyz               abc
addres   345,JH colony     43,JK colony
Phone    76576             87866
name     pqr               ijk
addres   345,ab colony     43,JKkk colony
Phone    7666666           873336
Above, I have 2 teams with the name, address and phone number of each player in a list. There are no tables as such, but the data I read is in tabular format: Team A and Team B are the 2nd and 3rd columns, and the 1st column holds the tags name, address and phone.
My objective is to fetch only the names of the players, grouped by team name. In this example there are 2 players in each team; it can be between 1 and 2. Can someone share a solution using regular expressions? I tried a bit, but that gives me random results, such as Team B players in Team A.
This should work for you. In future, give more detail on your input string; I have assumed the columns are separated by spaces. If the input uses tabs, try replacing them with four spaces. I have added an extra row that covers a more difficult case.
Warning: If Team B has more players than Team A, it will probably put the extra players in Team A. But it will depend on the exact formatting.
import re
pdf_string = '''         Team A            Team B
name     xyz               abc
addres   345,JH colony     43,JK colony
Phone    76576             87866
name     pqr               ijk
addres   345,ab colony     43,JKkk colony
Phone    7666666           873336
name     forename surname
addres   345,ab colony
Phone    7666666           '''
lines_untrimmed = pdf_string.split('\n')
lines = [line.strip() for line in lines_untrimmed]
space_string = ' ' * 3  # 3 spaces to allow spaces between names and teams
# This can be performed as a one liner below, but I wrote it out for an explanation
lines_csv = []
for line in lines:
    line_comma_spaced = re.sub(space_string + '+', ',', line)
    line_item_list = line_comma_spaced.split(',')
    lines_csv.append(line_item_list)
# lines_csv = [re.sub(space_string + '+', ',', line).split(',') for line in lines]
teams = lines_csv[0]
team_dict = {team: [] for team in teams}
for line in lines_csv:
    if 'name' in line:
        line_abbv = line[1:]  # [1:] to remove name
        for i, team in enumerate(teams):
            if i < len(line_abbv):  # this will prevent an error if there are fewer names than teams
                team_dict[team].append(line_abbv[i])
print(team_dict)
This will give the output:
{'Team A': ['xyz', 'pqr', 'forename surname'], 'Team B': ['abc', 'ijk']}
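To get around the limitation in the warning above, one option is to slice each row at the column offsets of the team names in the header, instead of splitting on runs of spaces. This is only a sketch: it assumes the text really is fixed-width (every value starts in the same column as its team heading), and `names_by_team`, `row` and the sample rows are made up for illustration.

```python
def names_by_team(text, teams=("Team A", "Team B")):
    """Group 'name' cells by team using the header's column offsets.

    Assumes fixed-width text: each value starts at the same column as
    its team heading (hypothetical helper, not from the answer above).
    """
    lines = [line for line in text.split("\n") if line.strip()]
    header = lines[0]
    starts = [header.index(team) for team in teams]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    result = {team: [] for team in teams}
    for line in lines[1:]:
        if not line.startswith("name"):
            continue
        for team, start, end in zip(teams, starts, ends):
            cell = line[start:end].strip()
            if cell:  # an empty cell means no player for this team in this row
                result[team].append(cell)
    return result


def row(label, a, b):
    # Build one fixed-width row: 8-char label column, 16-char Team A column.
    return "{:<8}{:<16}{}".format(label, a, b)


sample = "\n".join([
    row("", "Team A", "Team B"),
    row("name", "xyz", "abc"),
    row("addres", "345,JH colony", "43,JK colony"),
    row("name", "", "ijk"),  # Team B has a second player, Team A does not
])
print(names_by_team(sample))
```

Because the slice positions come from the header, a player that exists only for Team B stays in Team B. If the real input is not aligned like this, the offsets will be wrong and the split-on-spaces version is the safer bet.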

Rename files in Powershell with a reference file

Sorry for previous confusion...
I've spent several hours today trying to write a powershell script that will pull a client ID off a PDF from system #1 (example, Smith,John_H123_20171012.pdf where the client ID is the H#### value), then look it up in an Excel spreadsheet that contains the client ID in system 1 and system 2, then rename the file to the format needed for system 2 (xxx_0000000123_yyy.pdf).
One gotcha is that the client # is 2-4 digits in system 2 and is always preceded by 0's.
Using Powershell and regular expressions.
This is the first part I am trying to use for my initial rename:
Get-ChildItem -Filter *.pdf | Foreach-Object{
    $pattern = "_H(.*?)_2"
    $OrionID = [regex]::Match($file, $pattern).Groups[1].value
    Rename-Item -NewName $OrionID
}
It is not accepting "NewName" because it states it is an empty string. I have run:
Get-Variable | select name,value,Description
And new name shows up as a name but with no value. How can I pass the output from the Regex into the rename?
Run this code line by line in a debugger and you will understand how it works.
#Starts an Excel process, you can see Excel.exe as background process
$processExcel = New-Object -com Excel.Application
#If you set it to $False you won't see what's going on in the Excel app
$processExcel.visible = $True
$filePath="C:\somePath\file.xls"
#Open $filePath file
$Workbook=$processExcel.Workbooks.Open($filePath)
#Select sheet 1
$sheet = $Workbook.Worksheets.Item(1)
#Select sheet with name "Name of some sheet"
$sheetTwo = $Workbook.Worksheets.Item("Name of some sheet")
#This stores the text of cell A3 (row 3, column 1) in the variable
$cellString = $sheet.cells.item(3,1).text
#This sets cell D1 (row 1, column 4) to the variable's value
$sheet.cells.item(1,4) = $cellString
#Iterate through all the sheet
$lastUsedRow = $sheet.UsedRange.Rows.count
$LastUsedColumn = $sheet.UsedRange.Columns.count
for ($i = 1;$i -le $lastUsedRow; $i++){
    for ($j = 1;$j -le $LastUsedColumn; $j++){
        $otherString = $sheet.cells.item($i,$j).text
    }
}
#Create new Workbook and add sheet to it
$newWorkBook = $processExcel.Workbooks.Add()
$newWorkBook.worksheets.add()
$newSheet = $newWorkBook.worksheets.item(1)
$newSheet.name="SomeName"
#Close the workbook; $True saves the changes, $False closes without saving
$Workbook.close($True)
#$Workbook.SaveAs("C:\newPath\newFile.xls",56) #You can also save under a new name; 56 is the file format code (xlExcel8, i.e. .xls)
$newWorkBook.close($False)
#Closes Excel app
$processExcel.Quit()
#This code is to remove the Excel process from the OS, this does not always work.
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($processExcel)
Remove-Variable processExcel
I ended up using a utility called "Bulk Rename Utility" and Excel. I can run the various renaming regex's through BRU and add the reference .txt file after some Excel formatting.
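For anyone who would rather script it, here is a rough sketch of the same idea in Python rather than PowerShell. It assumes the spreadsheet has been exported to a simple lookup from system 1 ID to system 2 client number (the `id_map` dict below), that the system 2 number is zero-padded to 10 digits as in the example name, and that `xxx` and `yyy` are literal placeholders. The helper name and sample values are made up.

```python
import re

# System 1 client ID: "H" followed by digits, between underscores.
ID_PATTERN = re.compile(r"_(H\d+)_")


def system2_name(filename, id_map, prefix="xxx", suffix="yyy"):
    """Return the system 2 file name, or None if the ID is missing or unknown."""
    match = ID_PATTERN.search(filename)
    if match is None:
        return None
    client_number = id_map.get(match.group(1))
    if client_number is None:
        return None
    # Zero-pad the system 2 client number to 10 digits.
    return "{}_{:010d}_{}.pdf".format(prefix, int(client_number), suffix)


# Hypothetical lookup; in practice it would be read from the spreadsheet.
id_map = {"H123": 123}
print(system2_name("Smith,John_H123_20171012.pdf", id_map))
```

The function only builds the new name; the actual rename could then be done with `os.rename` for every file where the result is not None.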

Print line if any of these words are matched

I have a text file with 1000+ lines, each one representing a news article about a topic that I'm researching. Several hundred lines/articles in this dataset are not about the topic, however, and I need to remove these.
I've used grep to remove many of them (grep -vwE "(wordA|wordB)" test8.txt > test9.txt), but I now need to go through the rest manually.
I have a working code that finds all lines that do not contain a certain word, prints this line to me, and asks if it should be removed or not. It works well, but I'd like to include several other words. E.g. let's say my research topic is meat eating trends. I hope to write a script that prints lines that do not contain 'chicken' or 'pork' or 'beef', so I can manually verify if the lines/articles are about the relevant topic.
I know I can do this with elif, but I wonder if there is a better and simpler way? E.g. I tried if "chicken" or "beef" not in line: but it did not work.
Here's the code I have:
orgfile = 'text9.txt'
newfile = 'test10.txt'
newFile = open(newfile, 'wb')
with open("test9.txt") as f:
    for num, line in enumerate(f, 1):
        if "chicken" not in line:
            print "{} {}".format(line.split(',')[0], num)
            testVar = raw_input("1 = delete, enter = skip.")
            testVar = testVar.replace('', '0')
            testVar = int(testVar)
            if testVar == 10:
                print ''
                os.linesep
            else:
                f = open(newfile,'ab')
                f.write(line)
                f.close()
        else:
            f = open(newfile,'ab')
            f.write(line)
            f.close()
Edit: I tried Pieter's answer to this question but it does not work here, presumably because I am not working with integers.
You can use any or all with a generator expression. For example:
>>> key_words = {"chicken", "beef"}
>>> test_texts = ["the price of beef is too high", "the chicken farm now open", "tomorrow there is a lunar eclipse", "bla"]
>>> for title in test_texts:
...     if any(key in title for key in key_words):
...         print(title)
...
the price of beef is too high
the chicken farm now open
>>> for title in test_texts:
...     if not any(key in title for key in key_words):
...         print(title)
...
tomorrow there is a lunar eclipse
bla
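As for why `if "chicken" or "beef" not in line:` did not work: Python parses it as `"chicken" or ("beef" not in line)`, and a non-empty string is always truthy, so the condition is true for every line. Here is a small sketch of the check wrapped in a function; the function name and sample lines are only for illustration.

```python
KEYWORDS = ("chicken", "pork", "beef")


def needs_review(line, keywords=KEYWORDS):
    """True when the line contains none of the keywords."""
    return not any(word in line for word in keywords)


# The attempted condition: "chicken" is truthy on its own, so the `or`
# short-circuits and the `not in` test is never even evaluated.
always_true = bool("chicken" or "beef" not in "a line about chicken")

print(always_true)
print(needs_review("tomorrow there is a lunar eclipse"))
print(needs_review("the price of beef is too high"))
```

In the loop from the question, `if needs_review(line):` would replace `if "chicken" not in line:`.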

AWStats multiple columns in extra section

I have an AWStats running and the reports are built from IIS logfiles.
I have an extra section to view all the actions of the executed perlscripts on the site.
The config looks like this:
ExtraSectionName1="Actions"
ExtraSectionCodeFilter1="200 304"
ExtraSectionCondition1="URL,\/cgi\-bin\/.+\.pl"
ExtraSectionFirstColumnTitle1="Action"
ExtraSectionFirstColumnValues1="QUERY_STRING,action=([a-zA-Z0-9]+)"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionStatTypes1=HPB
ExtraSectionAddAverageRow1=0
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=20
MinHitExtra1=1
The output looks like this:
Action Pages Hits
foo 1234 1234
bar 5678 5678
But there are some actions with the same name in different perl scripts.
I would need this:
Script Action Pages Hits
foo.pl foo 1234 1234
bar.pl foo 1234 1234
foo.pl bar 5678 5678
bar.pl bar 5678 5678
Does anyone know how to create such a report?
EDIT:
I did some more research and all forum posts I've found say that it is not possible to have two columns in an extra section without hacking in awstats.pl
Now I am trying to put it into one column using URLWITHQUERY to output something like this:
Action Pages Hits
foo.pl?action=foo 1234 1234
foo.pl?action=bar 1234 1234
bar.pl?action=foo 5678 5678
...
The new problem is that the query has more parameters than action, which are unordered.
I tried this
ExtraSectionFirstColumnValues1="URLWITHQUERY,([a-zA-Z0-9]+\.pl\?).*(action=[a-zA-Z0-9]+)"
but AWStats only gets the value from the first bracket pair and ignores the rest. I think it internally works with $1 provided by the perl regex 'magic'.
Any ideas?
maybe?
ExtraSectionFirstColumnTitle1="Script"
ExtraSectionFirstColumnValues1="URL,\/cgi\-bin\/(.+\.pl)"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionFirstColumnTitle2="Action"
ExtraSectionFirstColumnValues2="QUERY_STRING,action=([a-zA-Z0-9]+)"
ExtraSectionFirstColumnFormat2="%s"
I've found a solution.
awstats.pl fetches the data for the specified extra sections in line 19664 - 19750
This is my modification:
# Line 19693 - 19701 in awstats.pl (AWStats version 7 Revision 1.971)
elsif ( $rowkeytype eq 'URLWITHQUERY' ) {
    if ( "$urlwithnoquery$tokenquery$standalonequery" =~
        /$rowkeytypeval/ )
    {
        $rowkeyval = "$1$2"; # I simply added a $2 for the second capture group
        $rowkeyok = 1;
        last;
    }
}
This will get the first and the second capture group specified in the ExtraSectionFirstColumnValuesX regex.
Example:
ExtraSectionFirstColumnValues1="URLWITHQUERY,([a-zA-Z0-9]+\.pl\?).*(action=[a-zA-Z0-9]+)"
Needless to say that you need to add a $3 $4 $5 ... if you need more groups.
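To check what the two groups capture before patching awstats.pl, the pattern can be tried outside AWStats. The sketch below uses Python's `re` module rather than Perl, on a made-up URL with unordered parameters, and joins the two groups the way the `"$1$2"` modification does. The function name and the sample request are assumptions.

```python
import re

# Same pattern as in ExtraSectionFirstColumnValues1 (the part after "URLWITHQUERY,").
pattern = re.compile(r"([a-zA-Z0-9]+\.pl\?).*(action=[a-zA-Z0-9]+)")


def row_key(url_with_query):
    """Join capture groups 1 and 2, mimicking "$1$2"; None if there is no match."""
    match = pattern.search(url_with_query)
    if match is None:
        return None
    return match.group(1) + match.group(2)


# Hypothetical request; "action" is neither the first nor the last parameter.
print(row_key("/cgi-bin/foo.pl?lang=en&action=bar&page=2"))
```

The `.*` between the groups is what skips over whatever parameters come before `action`, so the key comes out the same regardless of parameter order.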

How to get Place ID of city from Latitude/Longitude using Facebook API

I need to find the Facebook place for the city for many lat/long points. The actual points refer to personal addresses, so there are no exact place ID's to look for as in the case of a business.
For testing, I was looking for the town of Red Feather Lakes, CO.
The graph search function will return a lot of places, but does not return cities. (Example)
Raw FQL does not let you search by lat/long, and has no concept of "nearby" anyway. (Example)
An FQL query by ID reveals that there is at least a "Display Subtext" field which indicates that the object is a city. (Example)
Thanks for any help. I have over 80 years of dated and geotagged photos of my dad that he would love to see on his timeline!
EDIT
Cities are not in the place table, they are only in the page table.
There is an undocumented distance() FQL function, but it only works in the place table. (Via this SO answer.)
This works:
SELECT name,description,geometry,latitude,longitude, display_subtext
FROM place
WHERE distance(latitude, longitude, "40.801985", "-105.593719") < 50000
But this gives an error "distance is not valid in table page":
SELECT page_id,name,description,type,location
FROM page
WHERE distance(
location.latitude,location.longitude,
"40.801985", "-105.593719") < 50000
It's a glorious hack, but this code works. The trick is to make two queries. First we look for places near our point. This returns a lot of business places. We then take the city of one of these places, and use it to look in the page table for that city's page. There seems to be a standard naming convention for cities, but it differs between US and non-US cities.
Some small cities have various spellings in the place table, so the code loops through the returned places until it finds a match in the page table.
$fb_token = 'YOUR_TOKEN';
// Test point 1: Red Feather Lakes, Colorado
$lat = '40.8078';
$long = '-105.579';
// Test point 2: Karlsruhe, Germany (overrides the values above; comment out the pair you don't need)
$lat = '49.037868';
$long = '8.350124';
$states_arr = array('AL'=>"Alabama",'AK'=>"Alaska",'AZ'=>"Arizona",'AR'=>"Arkansas",'CA'=>"California",'CO'=>"Colorado",'CT'=>"Connecticut",'DE'=>"Delaware",'FL'=>"Florida",'GA'=>"Georgia",'HI'=>"Hawaii",'ID'=>"Idaho",'IL'=>"Illinois", 'IN'=>"Indiana", 'IA'=>"Iowa", 'KS'=>"Kansas",'KY'=>"Kentucky",'LA'=>"Louisiana",'ME'=>"Maine",'MD'=>"Maryland", 'MA'=>"Massachusetts",'MI'=>"Michigan",'MN'=>"Minnesota",'MS'=>"Mississippi",'MO'=>"Missouri",'MT'=>"Montana",'NE'=>"Nebraska",'NV'=>"Nevada",'NH'=>"New Hampshire",'NJ'=>"New Jersey",'NM'=>"New Mexico",'NY'=>"New York",'NC'=>"North Carolina",'ND'=>"North Dakota",'OH'=>"Ohio",'OK'=>"Oklahoma", 'OR'=>"Oregon",'PA'=>"Pennsylvania",'RI'=>"Rhode Island",'SC'=>"South Carolina",'SD'=>"South Dakota",'TN'=>"Tennessee",'TX'=>"Texas",'UT'=>"Utah",'VT'=>"Vermont",'VA'=>"Virginia",'WA'=>"Washington",'DC'=>"Washington D.C.",'WV'=>"West Virginia",'WI'=>"Wisconsin",'WY'=>"Wyoming");
$place_search = json_decode(file_get_contents('https://graph.facebook.com/search?type=place&center=' . $lat . ',' . $long . '&distance=10000&access_token=' . $fb_token));
foreach ($place_search->data as $place) {
    if (!empty($place->location->city)) {
        $city = $place->location->city;
        $state = $place->location->state;
        $country = $place->location->country;
        if ($country == 'United States') {
            $city_name = $city . ', ' . $states_arr[$state]; // e.g. 'Chicago, Illinois'
        }
        else {
            $city_name = $city . ', ' . $country; // e.g. 'Rome, Italy'
        }
        $fql = 'SELECT page_id,name,description,type,location FROM page WHERE type="CITY" and name="' . $city_name . '"';
        // Keep the page lookup in its own variable so it doesn't shadow the loop variable
        $pages = json_decode(file_get_contents('https://graph.facebook.com/fql?q=' . rawurlencode($fql) . '&access_token=' . $fb_token));
        if (count($pages->data) > 0) {
            // We found it!
            print_r($pages);
            break;
        }
        else {
            // No luck, try the next place
            print ("Couldn't find " . $city_name . "\n");
        }
    }
}
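The naming convention the code relies on can be isolated in a small helper. This Python sketch only mirrors the string building from the PHP above and makes no API calls; `city_page_name` and the abbreviated state table are illustrative, not part of any Facebook API.

```python
# Abbreviated state table for illustration; the PHP above has the full list.
STATES = {"CO": "Colorado", "IL": "Illinois"}


def city_page_name(city, state, country, states=STATES):
    """Build the page name used to look up a city.

    US cities use the full state name, other cities use the country name.
    """
    if country == "United States":
        return "{}, {}".format(city, states[state])
    return "{}, {}".format(city, country)


print(city_page_name("Chicago", "IL", "United States"))  # Chicago, Illinois
print(city_page_name("Rome", None, "Italy"))             # Rome, Italy
```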
I found this solution worked for me when looking for a page for the closest city to the specified latitude/longitude. For some reason LIMIT 1 didn't return the closest city so I bumped up the limit and then took the first result.
SELECT page_id
FROM place
WHERE is_city and distance(latitude, longitude, "<latitude>", "<longitude>") < 100000
ORDER BY distance(latitude, longitude, "<latitude>", "<longitude>")
LIMIT 20