from pyspark.sql import SparkSession
Bigdata and data science by Kartheek Dachepalli
Wednesday, October 18, 2023
pyspark code to get estimated size of dataframe in bytes
Wednesday, July 19, 2023
replaceWhere
To replace part of the contents of a table, or of the files at a path, the options below can be used.
The replaceWhere option atomically replaces all records that match a given predicate. You can replace directories of data based on how tables are partitioned using dynamic partition overwrites.
Python:
(replace_data.write
    .mode("overwrite")
    .option("replaceWhere", "start_date >= '2017-01-02' AND end_date <= '2017-01-30'")
    .save("/tmp1/delta/events"))
SQL:
INSERT INTO TABLE events
REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31'
SELECT * FROM replace_data
Friday, July 14, 2023
Magic commands
The following magic commands are available in Databricks:
- %python
- %sql
- %scala
- %sh
- %fs → Alternatively, one can use dbutils.fs
- %md
Note:
- The first line in the cell must be the magic command
- One cell allows only one magic command
- Magic commands are case sensitive
- When you change the default language, cells written in the previous default language are automatically prefixed with that language's magic command so they keep working.
Saturday, March 11, 2023
Tensorflow general methods
# method to get the shape of a TensorFlow tensor
Wednesday, March 8, 2023
Synchronously shuffle X,Y
import numpy as np
np.random.seed(seed)
m = X.shape[1] # number of training examples
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1, m))
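The snippet above relies on X, Y and seed already being defined. A self-contained version might look like the sketch below. It assumes X has shape (n_x, m) and Y has shape (1, m), with one training example per column. The function name shuffle_in_unison and the toy data are illustrative only.

```python
import numpy as np

def shuffle_in_unison(X, Y, seed=0):
    """Shuffle the columns of X and Y using the same permutation."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of training examples
    permutation = rng.permutation(m)    # one permutation shared by X and Y
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))
    return shuffled_X, shuffled_Y

# Toy data: column j of X is [j, 10 + j] and its label is j
X = np.vstack([np.arange(5), np.arange(5) + 10])
Y = np.arange(5).reshape(1, 5)

shuffled_X, shuffled_Y = shuffle_in_unison(X, Y, seed=1)
print(shuffled_X)
print(shuffled_Y)
```

Because the same permutation indexes both arrays, each column of shuffled_X still lines up with its label in shuffled_Y.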
Saturday, March 4, 2023
Dropout
- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide the activations of each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob takes values other than 0.5.
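The points above can be sketched in NumPy as follows. This is a minimal illustration of inverted dropout, not a full layer implementation; the function names dropout_forward and dropout_backward and the toy activations are assumptions made for the example.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A (training time only)."""
    mask = rng.random(A.shape) < keep_prob   # True where a unit is kept
    A_dropped = A * mask / keep_prob         # rescale to preserve the expected value
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and scaling to the gradients."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((100, 1000))                     # toy activations, all equal to 1
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)

# The mean activation stays close to 1 even though about 20% of units are zeroed
print(A.mean(), A_dropped.mean())
```

At test time neither function is called, so the activations pass through unchanged.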
Friday, March 3, 2023
python - Initialization of weights
The main difference between a Gaussian random variable (numpy.random.randn()) and a uniform random variable (numpy.random.rand()) is the distribution of the generated random numbers:
- numpy.random.rand() produces numbers from a uniform distribution over [0, 1).
- numpy.random.randn() produces numbers from a standard normal distribution (mean 0, variance 1).
When used for weight initialization, randn() keeps most of the weights away from the extremes and concentrates them around the center of the range.
An intuitive way to see this is to consider the sigmoid() activation function.
You'll remember that its slope is extremely small where the output is near 0 or near 1, so weights that push units toward those extremes will converge much more slowly to the solution, and having most of them near the center will speed up convergence.
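A quick way to see the difference between the two generators is to draw samples from each and compare their ranges and spread. The sample size and seed below are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)
uniform_samples = np.random.rand(10000)    # uniform over [0, 1)
normal_samples = np.random.randn(10000)    # standard normal

# rand() never leaves [0, 1) and is spread evenly across it
print(uniform_samples.min(), uniform_samples.max())

# randn() is centered on 0 with standard deviation close to 1
print(normal_samples.mean(), normal_samples.std())

# Fraction of normal samples within one standard deviation of the center
frac_within_one = np.mean(np.abs(normal_samples) < 1)
print(frac_within_one)
```

Most of the randn() samples fall near the center, which is the property the weight-initialization argument above relies on.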
Initialization of weights
- The weights 𝑊[𝑙] should be initialized randomly to break symmetry.
- However, it's okay to initialize the biases 𝑏[𝑙] to zeros. Symmetry is still broken so long as 𝑊[𝑙] is initialized randomly.
- Initializing weights to very large random values doesn't work well.
- Initializing with small random values should do better.
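These rules could be written as the sketch below. The function name initialize_parameters, the layer sizes, and the scale factor 0.01 are assumptions chosen for illustration; other scaling schemes exist and may suit deeper networks better.

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    """Small random weights and zero biases for each layer.

    layer_dims lists the number of units per layer, input layer first.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # Random weights break symmetry; the small scale avoids large initial values
        parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        # Zero biases are fine because the weights already differ between units
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters([3, 4, 1])
for name, value in parameters.items():
    print(name, value.shape)
```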
Wednesday, March 1, 2023
python code to plot cost
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def plot_costs(costs, learning_rate=0.0075):
    # np.squeeze drops singleton dimensions so the costs plot as a 1-D series
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

# Assuming "costs" is a list of costs recorded every hundred training iterations
# Call the function with the learning rate used for training
plot_costs(costs, learning_rate)
Deep Learning methodology using gradient descent
Usual Deep Learning methodology to build the model:
- Initialize parameters / Define hyperparameters
- Loop for num_iterations:
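The body of the loop is not shown above. A common structure is forward propagation, cost computation, backward propagation, and a parameter update. The sketch below assumes that structure and uses logistic regression as the simplest possible model; the function names, toy data, and hyperparameter values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, learning_rate=0.1, num_iterations=1000):
    """Logistic regression by gradient descent. X: (n_x, m), Y: (1, m)."""
    n_x, m = X.shape
    # Initialize parameters (zeros are fine for a single logistic unit)
    w = np.zeros((n_x, 1))
    b = 0.0
    costs = []
    for i in range(num_iterations):
        A = sigmoid(w.T @ X + b)                                   # forward propagation
        cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))   # cross-entropy cost
        dZ = A - Y                                                 # backward propagation
        dw = X @ dZ.T / m
        db = np.sum(dZ) / m
        w -= learning_rate * dw                                    # parameter update
        b -= learning_rate * db
        if i % 100 == 0:
            costs.append(cost)                                     # record every 100 iterations
    return w, b, costs

# Toy data: the label is 1 when x is positive
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])

w, b, costs = train(X, Y)
predictions = (sigmoid(w.T @ X + b) > 0.5).astype(int)
print(predictions)
```

The returned costs list is recorded once per hundred iterations, which matches the input expected by a cost-plotting helper.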
Sunday, December 18, 2022
split data set to train, cross validation and test sets
print(f"the shape of the original set (input) is: {x.shape}")
print(f"the shape of the original set (target) is: {y.shape}\n")
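The split itself is not shown above. One way to do it with NumPy alone is sketched below, assuming a 60/20/20 split and that x and y are arrays with one example per row. The function name train_cv_test_split, the fractions, and the toy data are assumptions for illustration.

```python
import numpy as np

def train_cv_test_split(x, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Randomly split x and y into train, cross validation and test sets."""
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    idx = rng.permutation(m)                 # shuffle the row indices once
    n_train = int(round(train_frac * m))
    n_cv = int(round(cv_frac * m))
    train_idx = idx[:n_train]
    cv_idx = idx[n_train:n_train + n_cv]
    test_idx = idx[n_train + n_cv:]          # whatever remains goes to the test set
    return (x[train_idx], y[train_idx]), (x[cv_idx], y[cv_idx]), (x[test_idx], y[test_idx])

# Toy data: 50 examples with target y = 2x
x = np.arange(50, dtype=float).reshape(50, 1)
y = 2 * x

(x_train, y_train), (x_cv, y_cv), (x_test, y_test) = train_cv_test_split(x, y)
print(f"the shape of the training set (input) is: {x_train.shape}")
print(f"the shape of the cross validation set (input) is: {x_cv.shape}")
print(f"the shape of the test set (input) is: {x_test.shape}")
```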
Tuesday, December 13, 2022
Epochs and batches
The number of epochs is passed when fitting (training) the model, as below.
Example: model.fit(X, Y, epochs=100)
In the fit statement above, the number of epochs was set to 100. This specifies that the entire data set should be applied during training 100 times. During training, you see output describing the progress of training that looks like this:
Epoch 1/100
157/157 [==============================] - 0s 1ms/step - loss: 2.2770
The first line, Epoch 1/100, describes which epoch the model is currently running. For efficiency, the training data set is broken into 'batches'. The default size of a batch in TensorFlow is 32. If the training set (X_train) has 5000 examples, that gives 5000 / 32 = 156.25, which rounds up to 157 batches.
The notation on the 2nd line, 157/157 [====, describes which batch has been executed.
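The batch count can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are the ones used in the text.

```python
import math

num_examples = 5000    # size of X_train in the example above
batch_size = 32        # default batch size

# Every batch is full except possibly the last, so round up
batches_per_epoch = math.ceil(num_examples / batch_size)

# The last batch holds whatever is left after the full batches
last_batch_size = num_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)   # prints: 157 8
```

So each epoch runs 156 full batches of 32 examples plus one final batch of 8.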