Friday, July 19, 2019

Importance of ? in Regex

It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, the * is greedy - it will match all the way to the end, then backtrack until it can match 1, leaving you with 1010000000001.
With 1.*?1, the * is non-greedy: it first matches nothing, then takes in one character at a time until the following 1 matches, eventually yielding 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
A similar pattern for text between angle brackets is <([^>]*)> - matching anything but a greater-than sign (strictly speaking, zero or more characters other than > between < and >).
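The same behavior can be sketched with Python's re module, using the input string from above:

```python
import re

s = "101000000000100"

# Greedy: .* consumes to the end of the string, then backtracks
# until the final 1 of the pattern can match.
greedy = re.match(r"1.*1", s).group()   # '1010000000001'

# Non-greedy: .*? consumes as little as possible before the next 1.
lazy = re.match(r"1.*?1", s).group()    # '101'

print(greedy)
print(lazy)
```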



Find the text between two characters using Regex

scala> import scala.util.matching.Regex
import scala.util.matching.Regex

scala> val keyValPattern: Regex = "(?<=\\().*?(?=\\))".r
keyValPattern: scala.util.matching.Regex = (?<=\().*?(?=\))

scala> val input: String ="heelo(geloo))lodk"
input: String = heelo(geloo))lodk

scala> println(keyValPattern findFirstIn input)
Some(geloo)

scala> val keyValPattern: Regex = "(?<=\\().*?(?<=\\))".r
keyValPattern: scala.util.matching.Regex = (?<=\().*?(?<=\))

scala> println(keyValPattern findFirstIn input)
Some(geloo))

Note: (?<=\\() is a lookbehind and (?=\\)) is a lookahead - they assert that ( precedes, or ) follows, the current position without consuming those characters, while .*? lazily matches the text in between. When the trailing assertion is a lookbehind (?<=\\)) instead of a lookahead, the first ) is consumed as part of the match, which is why the second pattern returns geloo).
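The same lookaround patterns can be sketched in Python with the same input string:

```python
import re

s = "heelo(geloo))lodk"

# Trailing lookahead: the closing ')' is asserted but not consumed.
lookahead = re.search(r"(?<=\().*?(?=\))", s).group()    # 'geloo'

# Trailing lookbehind: .*? must consume the first ')' so that the
# lookbehind can see it, so ')' lands inside the match.
lookbehind = re.search(r"(?<=\().*?(?<=\))", s).group()  # 'geloo)'

print(lookahead)
print(lookbehind)
```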

Monday, July 15, 2019

StructType and StructFields

StructType — Data Type for Schema Definition

StructType is a built-in data type that is a collection of StructFields.
StructType is used to define a schema or part of one.
You can compare two StructType instances to see whether they are equal.
import org.apache.spark.sql.types.StructType

val schemaUntyped = new StructType()
  .add("a", "int")
  .add("b", "string")

import org.apache.spark.sql.types.{IntegerType, StringType}
val schemaTyped = new StructType()
  .add("a", IntegerType)
  .add("b", StringType)

scala> schemaUntyped == schemaTyped
res0: Boolean = true

Monday, July 8, 2019

Hive Partitioning


//create a staging table without a partition.
hive> drop table emp_det_stage;
OK
Time taken: 0.079 seconds

hive> create table emp_det_stage(name string,dept string,exp int, loc string) row format delimited fields terminated by ',';
OK
Time taken: 0.088 seconds

hive> Load data local Inpath "/home/hadoop/Partition.csv" overwrite into table emp_det_stage;
Loading data to table default.emp_det_stage
OK
Time taken: 0.42 seconds
//view the loaded data
hive> select * from emp_det_stage;
OK
emp_det_stage.name      emp_det_stage.dept      emp_det_stage.exp       emp_det_stage.loc
Kartheek        BI      5       Hyd
Raj     Apps    5       Mas
mahesh  BI      5       Hyd
Denesh  BI      6       Hyd
Rajesh  Frontend        7       KOL
Time taken: 0.125 seconds, Fetched: 5 row(s)

//create the actual table and load a static partition:

hive> create table emp_det_part(name string,dept string,exp int) partitioned by (loc string);
OK
Time taken: 0.067 seconds
hive>  insert overwrite table emp_det_part partition(loc='Hyd') select name,dept,exp from emp_det_stage where loc='Hyd';

//verify data
hive> dfs -ls /user/hive/warehouse/emp_det_part/loc=Hyd;
Found 1 items
-rwxrwxrwt   1 hadoop hadoop         38 2019-07-08 12:08 /user/hive/warehouse/emp_det_part/loc=Hyd/000000_0

//load the same table using dynamic partitioning:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table emp_det_part partition(loc) select * from emp_det_stage;

//verify data files
hive> dfs -ls /user/hive/warehouse/emp_det_part/
    > ;
Found 3 items
drwxrwxrwt   - hadoop hadoop          0 2019-07-08 12:10 /user/hive/warehouse/emp_det_part/loc=Hyd
drwxrwxrwt   - hadoop hadoop          0 2019-07-08 12:10 /user/hive/warehouse/emp_det_part/loc=KOL
drwxrwxrwt   - hadoop hadoop          0 2019-07-08 12:10 /user/hive/warehouse/emp_det_part/loc=Mas




Functions in Hive

hive> desc T_UNSTRUCTURE;
OK
col_name        data_type       comment
emp_id                  int
name                    map<string,string>
addr                    struct<City:string,Pin:int>
skill_set               array<string>
Time taken: 0.027 seconds, Fetched: 4 row(s)


hive> select size(skill_set),array_contains(skill_set,'Hadoop'),sort_array(skill_set),concat_ws("$",skill_set) from T_UNSTRUCTURE;
OK
_c0     _c1     _c2     _c3
2       false   ["'Hadoop'","'OBIEE'"]  'Hadoop'$'OBIEE'
2       false   ["'Chocolate'","'oracle'"]      'oracle'$'Chocolate'
Time taken: 0.201 seconds, Fetched: 2 row(s)

Explode

Explode: a UDTF (user-defined table-generating function) which, combined with the LATERAL VIEW clause of a SELECT statement, flattens collection columns into one row per element.
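The flattening that LATERAL VIEW explode performs can be sketched in plain Python, using hypothetical rows that mirror the table below:

```python
# Rows of (emp_id, skill_set), mirroring T_UNSTRUCTURE.
rows = [
    (10, ["Hadoop", "OBIEE"]),
    (20, ["oracle", "Chocolate"]),
]

# LATERAL VIEW explode(skill_set): one output row per array element,
# with the scalar column repeated alongside each element.
exploded = [(emp_id, skill) for emp_id, skills in rows for skill in skills]

for emp_id, skill in exploded:
    print(emp_id, skill)
```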

hive> desc T_UNSTRUCTURE;
OK
col_name        data_type       comment
emp_id                  int
name                    map<string,string>
addr                    struct<City:string,Pin:int>
skill_set               array<string>
Time taken: 0.027 seconds, Fetched: 4 row(s)


hive> select * from T_UNSTRUCTURE;
OK
t_unstructure.emp_id    t_unstructure.name      t_unstructure.addr      t_unstructure.skill_set
10      {"first":"Amit","Last":"Mishra"}        {"city":"Blr","pin":1}  ["'Hadoop'","'OBIEE'"]
20      {"first":"Ramesh","Last":"Nayak"}       {"city":"Mas","pin":2}  ["'oracle'","'Chocolate'"]
Time taken: 0.32 seconds, Fetched: 2 row(s)


hive> select emp_id,skill from T_UNSTRUCTURE Lateral view explode(skill_set) temp_table as skill;
OK
emp_id  skill
10      'Hadoop'
10      'OBIEE'
20      'oracle'
20      'Chocolate'
Time taken: 0.104 seconds, Fetched: 4 row(s)

Hive configuration to show table header

set hive.cli.print.header=True;

Create table with Collections columns

 Sample Unstructure.csv:

10 first:Amit,Last:Mishra Blr,1 'Hadoop','OBIEE'
20 first:Ramesh,Last:Nayak Mas,2 'oracle','Chocolate'

Move the above file to the local file system.

Create the table in the Hive shell:

hive> create table T_UNSTRUCTURE (emp_id int,name map<string,string>,addr struct<City:String,Pin:int>,skill_set array<string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';


Load data into the table with the below statement.

LOAD DATA LOCAL INPATH '/home/hadoop/Unstructure.csv' OVERWRITE INTO table T_UNSTRUCTURE;

Look at the description of the table

hive> desc T_UNSTRUCTURE;
OK
col_name        data_type       comment
emp_id                  int
name                    map<string,string>
addr                    struct<City:string,Pin:int>
skill_set               array<string>

Access the elements of the table.

hive> select * from T_UNSTRUCTURE;
OK
t_unstructure.emp_id    t_unstructure.name      t_unstructure.addr      t_unstructure.skill_set
10      {"first":"Amit","Last":"Mishra"}        {"city":"Blr","pin":1}  ["'Hadoop'","'OBIEE'"]
20      {"first":"Ramesh","Last":"Nayak"}       {"city":"Mas","pin":2}  ["'oracle'","'Chocolate'"]
Time taken: 0.216 seconds, Fetched: 2 row(s)



Saturday, June 15, 2019

Convert a python dictionary to comma separated key list and comma separated value list

I have a dictionary as follows:
d = {'a': '1', 'b': '2', 'c': '3'}
To convert it into a comma-separated keys string key_string and a comma-separated values string val_string, I do the following (Python 3 - on Python 2, use iteritems() instead of items()):
key_list = []
val_list = []

for key, value in d.items():
    key_list.append(key)
    val_list.append(value)

key_string = ','.join(key_list)
val_string = ','.join(val_list)
The result is
key_string = "a,b,c"
val_string = "1,2,3"
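In Python 3 the same result can be obtained without an explicit loop, since dicts preserve insertion order (guaranteed from 3.7 on):

```python
d = {'a': '1', 'b': '2', 'c': '3'}

# Join the keys and values directly; both iterate in insertion order.
key_string = ','.join(d.keys())    # 'a,b,c'
val_string = ','.join(d.values())  # '1,2,3'

print(key_string)
print(val_string)
```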

Tuesday, June 4, 2019

Java Compiler

Javac is the Java compiler - it compiles your Java code into bytecode.
JVM is the Java Virtual Machine - it runs/interprets/translates bytecode into native machine code.
JIT is the Just-In-Time compiler - it compiles hot bytecode sequences to machine code at runtime before executing them natively. Its main purpose is heavy performance optimization.
So now, let's answer the questions:
1) JVM: is it a compiler or an interpreter? - Ans: an interpreter (with a JIT compiler inside).
2) What about the JIT compiler that exists inside the JVM? - Ans: see above.
3) What exactly is the JVM? - Ans:
  • The JVM is a virtual platform that resides in RAM.
  • Its class loader component loads .class files into RAM.
  • The bytecode verifier component checks for access-restriction violations in your code (one of the principal reasons Java is secure).
  • Next, the execution engine component converts the bytecode into executable machine code.

Sunday, June 2, 2019

Switching Python versions in Anaconda


You can easily maintain separate environments for Python 2 programs and Python 3 programs on the same computer, without worrying about the programs interacting with each other. Switching to an environment is called activating it.
Please follow the link below:
https://docs.anaconda.com/anaconda/user-guide/tasks/switch-environment/

Wednesday, February 20, 2019

Dictionary Update values

d = {1: "one", 2: "three"}
d1 = {2: "two"}

# updates the value of key 2
d.update(d1)
print(d)

d1 = {3: "three"}

# adds element with key 3
d.update(d1)
print(d)

O/P:

{1: 'one', 2: 'two'}
{1: 'one', 2: 'two', 3: 'three'}

Removing Punctuation from a String

Problem
You have a text and you want to remove punctuations from it. Example:
in:
"Hello! It is time to remove punctuations here. It is easy, you will see."

out:
"Hello It is time to remove punctuations here It is easy you will see"
Solution
Let’s see a Python 3 solution:
>>> import string
>>> tr = str.maketrans("", "", string.punctuation)
>>> s = "Hello! It is time to remove punctuations here. It is easy, you will see."
>>> s.translate(tr)
'Hello It is time to remove punctuations here It is easy you will see'
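An alternative sketch using a regular expression - note that [^\w\s] is not exactly equivalent to string.punctuation (for example, it leaves underscores alone), but it gives the same result here:

```python
import re

s = "Hello! It is time to remove punctuations here. It is easy, you will see."

# Drop every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", s)
print(cleaned)
```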