Friday, July 19, 2019

Importance of ? in Regex

It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, the * is greedy - it will match all the way to the end, then backtrack until it can match 1, leaving you with 1010000000001.
With 1.*?1, the * is non-greedy: it first matches nothing, then takes in one character at a time until the following 1 matches, eventually yielding 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
A similar pattern for text between angle brackets is <([^>]*)> - matching anything but a greater-than sign (strictly speaking, zero or more characters other than > between < and >).
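The same behavior can be sketched with Python's re module, using the input string from above:

```python
import re

s = "101000000000100"

# Greedy: .* consumes to the end of the string, then backtracks
# until the final 1 of the pattern can match.
greedy = re.match(r"1.*1", s).group()   # '1010000000001'

# Non-greedy: .*? consumes as little as possible before the next 1.
lazy = re.match(r"1.*?1", s).group()    # '101'

print(greedy)
print(lazy)
```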



Find the text between two characters using Regex

scala> import scala.util.matching.Regex
import scala.util.matching.Regex

scala> val keyValPattern: Regex = "(?<=\\().*?(?=\\))".r
keyValPattern: scala.util.matching.Regex = (?<=\().*?(?=\))

scala> val input: String ="heelo(geloo))lodk"
input: String = heelo(geloo))lodk

scala> println(keyValPattern findFirstIn input)
Some(geloo)

scala> val keyValPattern: Regex = "(?<=\\().*?(?<=\\))".r
keyValPattern: scala.util.matching.Regex = (?<=\().*?(?<=\))

scala> println(keyValPattern findFirstIn input)
Some(geloo))

Note: (?<=\\() is a lookbehind and (?=\\)) is a lookahead - they assert that ( precedes, or ) follows, the current position without consuming those characters, while .*? lazily matches the text in between. When the trailing assertion is a lookbehind (?<=\\)) instead of a lookahead, the first ) is consumed as part of the match, which is why the second pattern returns geloo).
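The same lookaround patterns can be sketched in Python with the same input string:

```python
import re

s = "heelo(geloo))lodk"

# Trailing lookahead: the closing ')' is asserted but not consumed.
lookahead = re.search(r"(?<=\().*?(?=\))", s).group()    # 'geloo'

# Trailing lookbehind: .*? must consume the first ')' so that the
# lookbehind can see it, so ')' lands inside the match.
lookbehind = re.search(r"(?<=\().*?(?<=\))", s).group()  # 'geloo)'

print(lookahead)
print(lookbehind)
```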

Monday, July 15, 2019

StructType and StructFields

StructType — Data Type for Schema Definition

StructType is a built-in data type that is a collection of StructFields.
StructType is used to define a schema or part of one.
You can compare two StructType instances to see whether they are equal.
import org.apache.spark.sql.types.StructType

val schemaUntyped = new StructType()
  .add("a", "int")
  .add("b", "string")

import org.apache.spark.sql.types.{IntegerType, StringType}
val schemaTyped = new StructType()
  .add("a", IntegerType)
  .add("b", StringType)

scala> schemaUntyped == schemaTyped
res0: Boolean = true

Monday, July 8, 2019

Hive Partitioning


//create a staging table without a partition.
hive> drop table emp_det_stage;
OK
Time taken: 0.079 seconds

hive> create table emp_det_stage(name string,dept string,exp int, loc string) row format delimited fields terminated by ',';
OK
Time taken: 0.088 seconds

hive> Load data local Inpath "/home/hadoop/Partition.csv" overwrite into table emp_det_stage;
Loading data to table default.emp_det_stage
OK
Time taken: 0.42 seconds
//view the loaded data
hive> select * from emp_det_stage;
OK
emp_det_stage.name      emp_det_stage.dept      emp_det_stage.exp       emp_det_stage.loc
Kartheek        BI      5       Hyd
Raj     Apps    5       Mas
mahesh  BI      5       Hyd
Denesh  BI      6       Hyd
Rajesh  Frontend        7       KOL
Time taken: 0.125 seconds, Fetched: 5 row(s)

//create the actual table and load a static partition:

hive> create table emp_det_part(name string,dept string,exp int) partitioned by (loc string);
OK
Time taken: 0.067 seconds
hive>  insert overwrite table emp_det_part partition(loc='Hyd') select name,dept,exp from emp_det_stage where loc='Hyd';

//verify data
hive> dfs -ls /user/hive/warehouse/emp_det_part/loc=Hyd;
Found 1 items
-rwxrwxrwt   1 hadoop hadoop         38 2019-07-08 12:08 /user/hive/warehouse/emp_det_part/loc=Hyd/000000_0

//load the same table using dynamic partitioning:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table emp_det_part partition(loc) select * from emp_det_stage;

//verify data files
hive> dfs -ls /user/hive/warehouse/emp_det_part/
    > ;
Found 3 items
drwxrwxrwt   - hadoop hadoop          0 2019-07-08 12:10 /user/hive/warehouse/emp_det_part/loc=Hyd
drwxrwxrwt   - hadoop hadoop          0 2019-07-08 12:10 /user/hive/warehouse/emp_det_part/loc=KOL
drwxrwxrwt   - hadoop hadoop          0 2019-07-08 12:10 /user/hive/warehouse/emp_det_part/loc=Mas




Functions in Hive

hive> desc T_UNSTRUCTURE;
OK
col_name        data_type       comment
emp_id                  int
name                    map<string,string>
addr                    struct<City:string,Pin:int>
skill_set               array<string>
Time taken: 0.027 seconds, Fetched: 4 row(s)


hive> select size(skill_set),array_contains(skill_set,'Hadoop'),sort_array(skill_set),concat_ws("$",skill_set) from T_UNSTRUCTURE;
OK
_c0     _c1     _c2     _c3
2       false   ["'Hadoop'","'OBIEE'"]  'Hadoop'$'OBIEE'
2       false   ["'Chocolate'","'oracle'"]      'oracle'$'Chocolate'
Time taken: 0.201 seconds, Fetched: 2 row(s)

Explode

Explode: a UDTF (user-defined table-generating function) which, combined with the LATERAL VIEW clause of a SELECT statement, flattens collection columns into one row per element.
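The flattening that LATERAL VIEW explode performs can be sketched in plain Python, using hypothetical rows that mirror the table below:

```python
# Rows of (emp_id, skill_set), mirroring T_UNSTRUCTURE.
rows = [
    (10, ["Hadoop", "OBIEE"]),
    (20, ["oracle", "Chocolate"]),
]

# LATERAL VIEW explode(skill_set): one output row per array element,
# with the scalar column repeated alongside each element.
exploded = [(emp_id, skill) for emp_id, skills in rows for skill in skills]

for emp_id, skill in exploded:
    print(emp_id, skill)
```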

hive> desc T_UNSTRUCTURE;
OK
col_name        data_type       comment
emp_id                  int
name                    map<string,string>
addr                    struct<City:string,Pin:int>
skill_set               array<string>
Time taken: 0.027 seconds, Fetched: 4 row(s)


hive> select * from T_UNSTRUCTURE;
OK
t_unstructure.emp_id    t_unstructure.name      t_unstructure.addr      t_unstructure.skill_set
10      {"first":"Amit","Last":"Mishra"}        {"city":"Blr","pin":1}  ["'Hadoop'","'OBIEE'"]
20      {"first":"Ramesh","Last":"Nayak"}       {"city":"Mas","pin":2}  ["'oracle'","'Chocolate'"]
Time taken: 0.32 seconds, Fetched: 2 row(s)


hive> select emp_id,skill from T_UNSTRUCTURE Lateral view explode(skill_set) temp_table as skill;
OK
emp_id  skill
10      'Hadoop'
10      'OBIEE'
20      'oracle'
20      'Chocolate'
Time taken: 0.104 seconds, Fetched: 4 row(s)

Hive configuration to show table header

set hive.cli.print.header=True;

Create table with Collections columns

 Sample Unstructure.csv:

10 first:Amit,Last:Mishra Blr,1 'Hadoop','OBIEE'
20 first:Ramesh,Last:Nayak Mas,2 'oracle','Chocolate'

Move the above file to the local file system.

Create the table in the Hive shell:

hive> create table T_UNSTRUCTURE (emp_id int,name map<string,string>,addr struct<City:String,Pin:int>,skill_set array<string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';


Load data into the table with the below statement.

LOAD DATA LOCAL INPATH '/home/hadoop/Unstructure.csv' OVERWRITE INTO table T_UNSTRUCTURE;

Look at the description of the table

hive> desc T_UNSTRUCTURE;
OK
col_name        data_type       comment
emp_id                  int
name                    map<string,string>
addr                    struct<City:string,Pin:int>
skill_set               array<string>

Access the elements of the table.

hive> select * from T_UNSTRUCTURE;
OK
t_unstructure.emp_id    t_unstructure.name      t_unstructure.addr      t_unstructure.skill_set
10      {"first":"Amit","Last":"Mishra"}        {"city":"Blr","pin":1}  ["'Hadoop'","'OBIEE'"]
20      {"first":"Ramesh","Last":"Nayak"}       {"city":"Mas","pin":2}  ["'oracle'","'Chocolate'"]
Time taken: 0.216 seconds, Fetched: 2 row(s)



Saturday, June 15, 2019

Convert a python dictionary to comma separated key list and comma separated value list

I have a dictionary as follows:
d = {'a': '1', 'b': '2', 'c': '3'}
To convert it into a comma-separated keys string key_string and a comma-separated values string val_string, I do the following (Python 3 - on Python 2, use iteritems() instead of items()):
key_list = []
val_list = []

for key, value in d.items():
    key_list.append(key)
    val_list.append(value)

key_string = ','.join(key_list)
val_string = ','.join(val_list)
The result is
key_string = "a,b,c"
val_string = "1,2,3"
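In Python 3 the same result can be obtained without an explicit loop, since dicts preserve insertion order (guaranteed from 3.7 on):

```python
d = {'a': '1', 'b': '2', 'c': '3'}

# Join the keys and values directly; both iterate in insertion order.
key_string = ','.join(d.keys())    # 'a,b,c'
val_string = ','.join(d.values())  # '1,2,3'

print(key_string)
print(val_string)
```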

Tuesday, June 4, 2019

Java Compiler

Javac is the Java compiler - it compiles your Java code into bytecode.
JVM is the Java Virtual Machine - it runs/interprets/translates bytecode into native machine code.
JIT is the Just-In-Time compiler - it compiles hot bytecode sequences to machine code at runtime before executing them natively. Its main purpose is heavy performance optimization.
So now, let's answer the questions:
1) JVM: is it a compiler or an interpreter? - Ans: an interpreter (with a JIT compiler inside).
2) What about the JIT compiler that exists inside the JVM? - Ans: see above.
3) What exactly is the JVM? - Ans:
  • The JVM is a virtual platform that resides in RAM.
  • Its class loader component loads .class files into RAM.
  • The bytecode verifier component checks for access-restriction violations in your code (one of the principal reasons Java is secure).
  • Next, the execution engine component converts the bytecode into executable machine code.

Sunday, June 2, 2019

Switching Python versions in Anaconda


You can easily maintain separate environments for Python 2 programs and Python 3 programs on the same computer, without worrying about the programs interacting with each other. Switching to an environment is called activating it.
Please follow the link below:
https://docs.anaconda.com/anaconda/user-guide/tasks/switch-environment/

Wednesday, February 20, 2019

Dictionary Update values

d = {1: "one", 2: "three"}
d1 = {2: "two"}

# updates the value of key 2
d.update(d1)
print(d)

d1 = {3: "three"}

# adds element with key 3
d.update(d1)
print(d)

O/P:

{1: 'one', 2: 'two'}
{1: 'one', 2: 'two', 3: 'three'}

Removing Punctuation from a String

Problem
You have a text and you want to remove punctuations from it. Example:
in:
"Hello! It is time to remove punctuations here. It is easy, you will see."

out:
"Hello It is time to remove punctuations here It is easy you will see"
Solution
Let’s see a Python 3 solution:
>>> import string
>>> tr = str.maketrans("", "", string.punctuation)
>>> s = "Hello! It is time to remove punctuations here. It is easy, you will see."
>>> s.translate(tr)
'Hello It is time to remove punctuations here It is easy you will see'
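An alternative sketch using a regular expression - note that [^\w\s] is not exactly equivalent to string.punctuation (for example, it leaves underscores alone), but it gives the same result here:

```python
import re

s = "Hello! It is time to remove punctuations here. It is easy, you will see."

# Drop every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", s)
print(cleaned)
```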