Bigdata and data science by Kartheek Dachepalli: 2023

Wednesday, October 18, 2023

pyspark code to get estimated size of dataframe in bytes

from pyspark.sql import SparkSession

import sys
# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameSize").getOrCreate()

# Create a PySpark DataFrame
data = [(1, "John"), (2, "Alice"), (3, "Bob")]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)

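# Estimate the size of the DataFrame in bytes.
# Note: sys.getsizeof measures each value as a Python object, so this is a rough
# estimate and not the size Spark uses in memory or on disk.
size_in_bytes = df.rdd.flatMap(lambda x: x).map(lambda x: sys.getsizeof(x) if x is not None else 0).sum()
print(f"Estimated size of the DataFrame: {size_in_bytes} bytes")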

# Stop the Spark session
spark.stop()

Wednesday, July 19, 2023

replaceWhere

To replace only part of the contents of a table, or of the files at a given path, the following options are available.

The replaceWhere option atomically replaces all records that match a given predicate.
You can replace directories of data based on how tables are partitioned using dynamic partition overwrites.

Python:

# Parentheses let the chained calls span several lines
(replace_data.write
  .mode("overwrite")
  .option("replaceWhere", "start_date >= '2017-01-02' AND end_date <= '2017-01-30'")
  .save("/tmp1/delta/events")
)

SQL:

INSERT INTO TABLE events REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31' SELECT * FROM replace_data

Friday, July 14, 2023

The following magic commands are available in Databricks notebooks:

%python
%sql
%scala
%sh
%fs → Alternatively, one can use dbutils.fs
%md

Note:

The magic command must be the first line in the cell
A cell can contain only one magic command
Magic commands are case sensitive
When you change the notebook's default language, existing cells are automatically prefixed with the magic command of the previous default language, so they keep working.

Saturday, March 11, 2023

TensorFlow general methods

# Method to get the shape of a TensorFlow tensor

# Method to apply a function to all elements

# Defining variables and constants in TensorFlow for a matrix of elements

# Casting, activation functions

Wednesday, March 8, 2023

Synchronously shuffle X,Y

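import numpy as np

# Assumes X has shape (n_x, m), Y has shape (1, m), and seed is an integer defined earlier
np.random.seed(seed)
m = X.shape[1]  # number of training examples

# Apply the same column permutation to X and Y so each example keeps its label
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1, m))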
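
A quick way to check that the two arrays stay aligned is to run the same steps on toy data. The sketch below wraps the shuffle in a hypothetical helper, shuffle_in_unison (the name and the toy arrays are illustrative assumptions, not part of the original snippet), and assumes X has shape (n_x, m) and Y has shape (1, m).

```python
import numpy as np

def shuffle_in_unison(X, Y, seed=0):
    """Shuffle the columns of X and Y with one shared permutation."""
    np.random.seed(seed)
    m = X.shape[1]  # number of training examples
    permutation = np.random.permutation(m)
    return X[:, permutation], Y[:, permutation].reshape((1, m))

# Toy data: column i of X is [i, 10 * i] and its label is i
X = np.array([[0, 1, 2, 3, 4], [0, 10, 20, 30, 40]])
Y = np.array([[0, 1, 2, 3, 4]])

shuffled_X, shuffled_Y = shuffle_in_unison(X, Y, seed=1)

# Every shuffled column should still sit next to its original label
print(np.array_equal(shuffled_X[0], shuffled_Y[0]))  # True
```

Because both arrays are indexed with the same permutation, the pairing between an example and its label survives the shuffle.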

Saturday, March 4, 2023

Dropout

Dropout is a widely used regularization technique that is specific to deep learning.

It randomly shuts down some neurons in each iteration.

At each iteration, you shut down (set to zero) each neuron of a layer with probability 1−keep_prob, or keep it with probability keep_prob (50% here).

The dropped neurons contribute to neither the forward nor the backward propagation of that iteration.

When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of any one other neuron, because that other neuron might be shut down at any time.

Dropout is a regularization technique.
You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
Apply dropout both during forward and backward propagation.
During training, divide the activations of each dropout layer by keep_prob to keep their expected value unchanged. For example, if keep_prob is 0.5, then on average half the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute. Dividing by 0.5 is equivalent to multiplying by 2, which restores the original expected value. The same argument holds for values of keep_prob other than 0.5.
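
The points above can be sketched in NumPy as "inverted dropout". This is a minimal illustration, not a full layer implementation: the function names, the rng argument, and the toy activations are assumptions made for the example.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A; return the result and the mask."""
    D = rng.random(A.shape) < keep_prob  # True means the neuron is kept
    A_dropped = A * D / keep_prob        # zero the dropped units, rescale the rest
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Reuse the forward mask so dropped neurons also receive zero gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 10000))  # toy activations, all equal to 1

A_dropped, D = dropout_forward(A, keep_prob=0.8, rng=rng)
dA_dropped = dropout_backward(np.ones_like(A), D, keep_prob=0.8)

# The mean activation stays close to the original mean of 1.0
print(A.mean(), A_dropped.mean())
```

At test time neither function is called, so the activations pass through unchanged.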

L2 Regularization

m = number of training examples

l = layer index

k, j = row and column indices of the weight matrix (ranging over its shape)
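
The formula these symbols belong to is not preserved above. A commonly used form of the L2 penalty, stated here as an assumption rather than as the original formula, is $\frac{\lambda}{2m}\sum_l\sum_k\sum_j \left(W_{k,j}^{[l]}\right)^2$, which is added to the unregularized cost. A minimal sketch, where lambd, the weights list, and the toy matrices are hypothetical names and values:

```python
import numpy as np

def l2_regularization_cost(weights, lambd, m):
    """Return (lambd / (2 * m)) times the sum of squared entries of all weight matrices."""
    squared_sum = sum(np.sum(np.square(W)) for W in weights)
    return lambd / (2 * m) * squared_sum

# Toy weight matrices for a two-layer network
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])

penalty = l2_regularization_cost([W1, W2], lambd=0.1, m=5)
print(penalty)
```

Here the squared entries sum to 15.25, so the penalty is 0.1 / 10 × 15.25, about 0.1525.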

Friday, March 3, 2023

python - Initialization of weights

The main difference between a Gaussian random variable (numpy.random.randn()) and a uniform random variable (numpy.random.rand()) is the distribution of the generated random numbers:

numpy.random.rand() produces numbers from a uniform distribution over [0, 1).
numpy.random.randn() produces numbers from a standard normal distribution (mean 0, variance 1).

When used for weight initialization, randn() keeps most of the weights away from the extremes and concentrates them near the center of the range.

An intuitive way to see it is, for example, if you take the sigmoid() activation function.

You'll remember that the slope of the sigmoid is extremely small where its output is near 0 or near 1, so weights near those extremes converge much more slowly to the solution, and having most of them near the center speeds up convergence.
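
Both claims are easy to check numerically. The sketch below is only an illustration: the sample size, the seed, and the helper names sigmoid and sigmoid_slope are assumptions made for the example.

```python
import numpy as np

np.random.seed(0)
uniform = np.random.rand(100000)   # uniform over [0, 1)
normal = np.random.randn(100000)   # standard normal: mean 0, std 1

# Uniform samples cover the whole range; normal samples cluster around the mean
print(uniform.min(), uniform.max())
print(np.mean(np.abs(normal) < 1))  # roughly 0.68 of samples lie within one std

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid, s * (1 - s)."""
    s = sigmoid(z)
    return s * (1 - s)

# The slope peaks at z = 0 and is tiny once the output saturates near 1
print(sigmoid_slope(0.0), sigmoid_slope(6.0))
```

The slope at z = 0 is 0.25, while at z = 6 (output close to 1) it is below 0.01, which is why saturated units learn slowly.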

Initialization of weights

The weights $W^{[l]}$ should be initialized randomly to break symmetry.
However, it's okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
Initializing weights to very large random values doesn't work well.
Initializing with small random values should do better.
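
The four points above can be sketched as a small helper. The function name, the layer_dims list, and the 0.01 scale factor are illustrative assumptions; other scaling schemes exist and may suit deeper networks better.

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    """Small random weights (to break symmetry) and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # W[l] has shape (units in layer l, units in layer l - 1)
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        # Zero biases are fine because the random weights already break symmetry
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters([3, 4, 1])
print(params["W1"].shape, params["b1"].shape)  # (4, 3) (4, 1)
```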

Wednesday, March 1, 2023

python code to plot cost

import numpy as np
import matplotlib.pyplot as plt

# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline

def plot_costs(costs, learning_rate=0.0075):
    """Plot the training cost recorded every hundred iterations."""
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

# Assuming "costs" is a list of costs recorded every hundred training iterations

# Calling the function with some learning rate
plot_costs(costs, learning_rate)

Output: [plot of cost vs. iterations (per hundreds), titled with the learning rate]

Deep Learning methodology using gradient descent

Usual Deep Learning methodology to build the model:

1. Initialize parameters / Define hyperparameters
2. Loop for num_iterations:

a. Forward propagation

b. Compute cost function

c. Backward propagation

d. Update parameters (using parameters, and grads from backprop)

3. Use trained parameters to predict labels
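
The steps above can be sketched with the simplest possible model, a single sigmoid unit (logistic regression) trained by gradient descent. This is an illustration under stated assumptions, not a general deep network: the function names, the toy data, and the hyperparameter values are all hypothetical. Zero initialization is acceptable here only because a single unit has no symmetry to break.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, num_iterations=1000, learning_rate=0.1):
    # 1. Initialize parameters
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    costs = []
    # 2. Loop for num_iterations
    for i in range(num_iterations):
        A = sigmoid(w.T @ X + b)                                  # a. forward propagation
        cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # b. cross-entropy cost
        dZ = A - Y                                                # c. backward propagation
        dw = X @ dZ.T / m
        db = np.sum(dZ) / m
        w -= learning_rate * dw                                   # d. update parameters
        b -= learning_rate * db
        if i % 100 == 0:
            costs.append(cost)  # record the cost every hundred iterations
    return w, b, costs

def predict(w, b, X):
    # 3. Use trained parameters to predict labels
    return (sigmoid(w.T @ X + b) > 0.5).astype(int)

# Toy data: the label is 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(int)

w, b, costs = train(X, Y)
accuracy = np.mean(predict(w, b, X) == Y)
print(costs[0], costs[-1], accuracy)
```

The recorded costs list has the same "per hundreds" spacing that the cost-plotting snippet expects.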