Extract hive count string using regex

I am trying to get the total number of records in a Hive table using paramiko. I know we can use PyHive or pyhs2, but they require certain configuration, and it will take a lot of time to get that done by my IT team.
So I am using paramiko to execute the command below and get the count:
beeline -u jdbc:hive2://localhost:10000 -n hive -e 'select count(*) from table_name'
And I get the following result:
+----------+--+
|   _c0    |
+----------+--+
| 1232322  |
+----------+--+
I need to extract this count from the output.
I have tried the following code and regex, but it's not working:
pattern="""
+----------+--+
| _c0 |
+----------+--+
| [0-9]* |
+----------+--+
"""
import re
import paramiko
si, so, se = ssh_con.exec_command("beeline -u jdbc:hive2://localhost:10000 -n hive -e 'select count(*) from table_name'")
output = so.read().decode()  # read the channel once; a second read() would return an empty string
print(output)
print(re.match(pattern, output))
I am able to retrieve the output and print it; I am just looking for a regular expression to extract the count.

In Beeline, the result can be displayed in different formats. By default, the result is printed as a table with a header. You can remove the header and the table borders, so there is no need to parse the result with a regexp. Add these options: --showHeader=false --outputformat=tsv2
beeline --showHeader=false --outputformat=tsv2 -u jdbc:hive2://localhost:10000 -n hive -e 'select count(*) from table_name'
Read more details about Output Formats.
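For completeness, here is a minimal sketch of the paramiko side under those options, assuming ssh_con is the already connected paramiko.SSHClient from the question and that beeline writes only the query result to stdout (its progress logs normally go to stderr):
import paramiko  # ssh_con is assumed to be an already connected paramiko.SSHClient

cmd = ("beeline --showHeader=false --outputformat=tsv2 "
       "-u jdbc:hive2://localhost:10000 -n hive "
       "-e 'select count(*) from table_name'")
si, so, se = ssh_con.exec_command(cmd)
count = int(so.read().decode().strip())  # tsv2 with no header leaves just the bare number
print(count)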

You mean to match the whole string and just extract the number below the table header.
Here is a regex that fixes your approach:
^\+-+\+--\+\n\| *\w+ *\|\n\+-+\+--\+\n\| *(\d+) *\|\n\+-+\+--\+$
See the regex demo. The \w+ matches one or more word chars, so it matches any column header (such as _c0).
However, it seems all you need is a regex to match a number between | ... |.
Use
result = ''
m = re.search(r'\|\s*(\d+)\s*\|', so.read().decode())
if m:
    result = m.group(1)
See this regex demo.
Details
\| - a | char
\s* - 0+ whitespaces
(\d+) - Group 1: one or more digits
\s*\| - 0+ whitespaces and a | char.
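As a self-contained check, here is the same search run against the sample output from the question (hard-coded here in place of so.read()):
import re

sample = """
+----------+--+
|   _c0    |
+----------+--+
| 1232322  |
+----------+--+
"""

result = ''
m = re.search(r'\|\s*(\d+)\s*\|', sample)
if m:
    result = m.group(1)
print(result)  # 1232322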

Related

How do I get the regex to return more accurately?

I am trying to get the regex right to return one or the other but not both:
When I run for example the following:
aws secretsmanager list-secrets | jq -r ".SecretList[] | select(.Name|match(\"example-*\")) | .Name "
it returns
example-secret_key
as well as
examplecompany-secret_key
How can I modify the command to return one and not the other? Thanks
example-* matches strings that contain example followed by zero or more -.
^example- matches strings that start with example-.
jq -r '.SecretList[].Name | select( test( "^example-" ) )'
A regular expression is not a shell glob/wildcard. * in a regex does not mean "anything", but rather "whatever came before is repeated 0 or more times". . matches a single arbitrary character, and .* matches 0 or more arbitrary characters.
If you want to match "example-" and don't care what comes after, simply use the regex example-. If you want to match "example-", then anything or nothing, then "_key", use the regex example-.*_key.
jq -r '.SecretList[].Name | select(test("example-"))'
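The difference between the patterns is easy to reproduce in any regex engine (jq's test/match use regular expressions with the same basic semantics); here is a small illustration using Python's re with the two names from the question:
import re

names = ["example-secret_key", "examplecompany-secret_key"]

# "example-*" means "example" followed by zero or more "-", so it also matches
# the second name (with zero dashes after "example").
print([n for n in names if re.search(r"example-*", n)])
# ['example-secret_key', 'examplecompany-secret_key']

# "^example-" anchors at the start and requires the literal dash.
print([n for n in names if re.search(r"^example-", n)])
# ['example-secret_key']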

Splunk query not endswith

I am just getting into learning Splunk queries, and I'm trying to grab data from a myfile.csv file based on a regex.
In particular, I want to print only the rows where the column fqdn does not end with udc.net or htc.com.
Below is my query, which works, but I'm writing the regex twice.
| inputlookup myfile.csv
| regex support_group="^mygroup-Linux$"
| regex u_sec_dom="^Normal Secure$"
| regex fqdn!=".*?udc.net$"
| regex fqdn!=".*?htc.com$"
| where match(fqdn,".")
I tried to combine them, separated with a |, but it's not working:
| regex fqdn!="(.*?udc.net | ".*?htc.com)$"
You can do this with a search and where clause:
| inputlookup myfile.csv
| search support_group="mygroup-Linux" u_sec_dom="Normal Secure"
| where !match(fqdn,"udc.net$") AND !match(fqdn,"htc.com$")
Or just a single search clause:
| inputlookup myfile.csv
| search support_group="mygroup-Linux" u_sec_dom="Normal Secure" NOT (fqdn IN("*udc.net","*htc.com"))
You can also rewrite the IN() thusly:
(fqdn="*udc.net" OR fqdn="*htc.com")
The combined regex will work if you omit the spaces on either side of the |. The extra spaces become part of the regex and prevent matches.
There's no need for the final where command; it only checks that fqdn matches . (any single character), and Splunk by default will display all matching events.
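To see why the spaces matter, here is the same alternation tried both ways with Python's re (the hostnames below are made up for illustration; Splunk's regex and match commands use PCRE-style patterns, so the behaviour is the same):
import re

fqdns = ["web01.udc.net", "app02.htc.com", "db03.example.org"]

with_spaces = r"(.*?udc.net | .*?htc.com)$"     # the spaces are literal characters in the pattern
without_spaces = r"(.*?udc.net|.*?htc.com)$"

print([f for f in fqdns if re.search(with_spaces, f)])     # [] -- nothing matches
print([f for f in fqdns if re.search(without_spaces, f)])  # ['web01.udc.net', 'app02.htc.com']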

Trying to filter out unique post_ids from a CSV

So I have a multi-column CSV, which is basically a Facebook dataset, and it has a column containing post IDs in this format:
pageid_postid (eg: 943554_3942952 or sometimes _29472_2847847)
I've been tasked with counting the unique "postids" in the column, as there are multiple posts for a single page. To give you some context, these are some lines from the column:
post_id
86680728811_272953252761568
86680728811_273859942672742
86680728811_10150499874478812
86680728811_244555465618151
86680728811_252342804833247
_22228735667216_1015116180247221722
_22228735667216_1015116223698221722
_22228735667216_1015179722271221722
_22228735667216_1015179767034221722
_22228735667216_1015179907764721722
_22228735667216_1015194803861221722
As you can see above, there are 2 "pageids" and then several "postids" corresponding to the page and I want to grab the postids (numbers after the underscore).
To achieve this, I whipped up the following command:
cat FB_Dataset.csv | cut -f2 -d , | grep "/_?[0-9]+_[0-9]+\gm" | wc -l
("f2" because the postid's are in the second column)
My regex gives me 0 results, and I think I am not using a "grep-friendly" regex. I did try it on an online regex tester and it worked properly. Also, I do not know how to tackle this for multiple pageids, so any help would be wonderful.
grep does not support regex literals, i.e. /pattern/gmi-like notation.
In order to extract matches, and not just return matching lines, you need to pass the -o option that is placed right after grep.
Besides, here you want to extract chunks of digits at the end of lines only, so the pattern you need is a POSIX ERE [0-9]+$ (add the -E option to enable ERE, else + will be treated as a literal + symbol). As you need unique occurrences only, pipe the output to sort -u (uniq alone only removes adjacent duplicate lines):
cat FB_Dataset.csv | cut -f2 -d , | grep -oE '[0-9]+$' | sort -u
Adding | wc -l will return the unique match count.
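If you would rather do the counting in Python, a rough equivalent (the file name and column position are taken from the question) might look like this; the csv module also copes with quoted commas that would confuse cut:
import csv
import re

post_ids = set()
with open("FB_Dataset.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) < 2:
            continue
        # digits at the end of the second column, i.e. the postid after the last underscore
        m = re.search(r"[0-9]+$", row[1])
        if m:
            post_ids.add(m.group())

print(len(post_ids))  # number of unique postids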

string extraction and dupes filtering mac OS X

I have a bunch of files with SQL logs. I'm looking to extract all occurrences of the following pattern.
The SQL log has SQL that looks something like this:
sel *
from DB.T1;
update DB.T1;
delete from DB.T2;
collect stats on
DB.T3 index (a,b,c);
sel count(*) from Db.T1;
sel count(*) from db . T2;
sel count(*) from db.t2;
I want to scan through the files starting with logs_ and extract all the unique tables that follow the string DB./db./Db./dB.
As you can see, there is white space after db in a few instances.
The output I'm expecting is a deduped list
T1, T2, T3
I'm on Mac OS X.
This is what I was able to get; I could not get past this:
grep -o -i 'tb.*\Z' *logs_* | uniq
This gives empty results. I was using \Z because I want to match up to the end of the string (and not the end of the line).
Need help to build the right command.
Something like:
grep -E -o -i 'DB ?\. ?[A-Z0-9$_]+' | cut -d . -f 2 | tr -d ' ' | sort -u
\Z is not supported by grep, as far as I can tell. And in languages that do support it, it really means until the end of the string, not the end of some "word" in the string. So you need to explicitly match the table name in your grep.
I use -E to use grep's extended regular expressions, which makes + and ? recognized as regex metacharacters. This isn't absolutely necessary; you could leave off the -E and use \+ and \? instead.
The regular expression DB ?\. ?[A-Z0-9$_]+ (or DB \?\. \?[A-Z0-9$_]\+ if you leave off the -E flag) matches:
the literal characters "DB" (case insensitively, because of -i)
an optional space
a literal "."
an optional space
one or more of any ascii letters, digits, $ or _ (the characters that can appear in an unquoted mysql table name)
cut removes the database name, tr removes spaces before the table name, and sort -u returns just the unique table names. (uniq by itself does not do this; it only removes lines that are duplicates of the previous line, so it would only have done what you want if you had sorted first.)
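If you prefer to do this in a script, here is a rough Python sketch of the same idea (the logs_* glob and the character class are assumptions based on the question):
import glob
import re

# DB / db / Db / dB, optional space around the dot, then an unquoted table name.
pattern = re.compile(r"\bdb\s*\.\s*([A-Za-z0-9$_]+)", re.IGNORECASE)

tables = set()
for path in glob.glob("logs_*"):
    with open(path) as f:
        for line in f:
            for name in pattern.findall(line):
                tables.add(name.upper())  # normalize case so T2 and t2 count once

print(sorted(tables))  # e.g. ['T1', 'T2', 'T3']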

Regex Match and Replace path to file

I am trying to do a regex match and replace for hotfile.com links to mp3 files I have in my database (WordPress).
I used to use hotfile for streaming mp3 files on my site; now I have switched to a CDN. Could someone kindly help me out with this:
Replace: http//hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3
With: http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3
I have been trying a simple search and replace like this: http//hotfile.com/dl/%/%/, but it's not working.
It would have been easier to perform a search and replace if hotfile.com didn't have different folders for all the files; below are 2 examples of the problem:
http//hotfile.com/dl/155490069/c7932d4/
http//hotfile.com/dl/165490070/c8745e7/
I have over 500 files to replace.
Thanks
Since you must be using MySQL for your WordPress database, you can do this replacement either by regex, as you asked:
Regex pattern : #http://(www.)?hotfile.com/\w+/\w+/\w+/#
Replacement pattern: http//p.music.cdndomain.com/vod/music.folder/2010/
An alternate simpler solution would be to extract the mp3 file name using simple string functions of mysql e.g.
Use SUBSTRING or SUBSTRING_INDEX to extract the file name of your mp3 file, i.e. find the string after the last occurrence of "/" in your hotfile url.
Use CONCAT to append the retrieved file name to the new url prefix and update it in the database.
Here is an example, you can appropriately change it for your database:
mysql> select * from test_songs;
+---------------------------------------------------------------+
| song_url                                                      |
+---------------------------------------------------------------+
| http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3     |
| http://www.hotfile.com/dl/123412312/dd732d4/mp3_song_name.mp3 |
+---------------------------------------------------------------+
Taking substrings:
mysql> select SUBSTRING_INDEX(song_url,"/",-1) from test_songs;
+----------------------------------+
| SUBSTRING_INDEX(song_url,"/",-1) |
+----------------------------------+
| mp3_file_name.mp3                |
| mp3_song_name.mp3                |
+----------------------------------+
2 rows in set (0.03 sec)
Creating final update query:
mysql> Update test_songs set song_url =
CONCAT("http//p.music.cdndomain.com/vod/music.folder/2010/",
SUBSTRING_INDEX(song_url,"/",-1)) ;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2 Changed: 2 Warnings: 0
Checking the results :
mysql> select * from test_songs;
+----------------------------------------------------------------------+
| song_url                                                             |
+----------------------------------------------------------------------+
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3  |
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_song_name.mp3  |
+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
Done !
Something as simple as http://regex101.com/r/lK9wH4 should work:
/^.+\/(.+)$/ and replace with <your_new_url>\1.
Good luck.
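If you end up doing the rewrite in a script instead of directly in MySQL, the same idea with Python's re.sub might look like this (the URLs are the ones from the question, written without the colon after http exactly as posted; adjust the prefix to whatever is actually stored in your database):
import re

old = "http//hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3"

# Swap the hotfile prefix (dl plus the two per-file folder segments) for the CDN prefix,
# keeping only the trailing file name.
new = re.sub(r"http//(?:www\.)?hotfile\.com/dl/[^/]+/[^/]+/",
             "http//p.music.cdndomain.com/vod/music.folder/2010/",
             old)
print(new)  # http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3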
You can use Notepad++ to search and replace across all your files.
For this particular sample, search and replace in regex mode:
search "http//hotfile.com/(.*)/(.*.mp3) "
replace "http//p.music.cdndomain.com/vod/music.folder/2010/\2 "
remove the quote marks but keep the space at the end
updated: screencapture for notepad++