print(f"the shape of the original set (input) is: {x.shape}")
print(f"the shape of the original set (target) is: {y.shape}\n")
print(f"the shape of the original set (input) is: {x.shape}")
print(f"the shape of the original set (target) is: {y.shape}\n")
We provide the number of epochs while fitting/training the model, as below.
Example: model.fit(X, Y, epochs=100)
In the fit statement above, the number of epochs was set to 100. This specifies that the entire data set should be passed through the model 100 times during training. During training, you see output describing the progress of training that looks like this:
Epoch 1/100
157/157 [==============================] - 0s 1ms/step - loss: 2.2770
The first line, Epoch 1/100, describes which epoch the model is currently running. For efficiency, the training data set is broken into 'batches'. The default batch size in TensorFlow is 32. If a model has 5000 training examples (X_train), this works out to roughly 157 batches (5000 / 32 = 156.25, rounded up to 157).
The notation on the second line, 157/157 [====, describes which batch has just been executed.
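As a quick check of where the 157 comes from (a sketch, assuming 5000 examples and the default batch size of 32):
import math
m = 5000          # number of training examples in X_train
batch_size = 32   # TensorFlow/Keras default batch size
num_batches = math.ceil(m / batch_size)
print(num_batches)  # 157, matching the 157/157 in the progress bar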
Library for derivatives:
from sympy import symbols, diff
Let's try this out and look at the derivative of a simple function.
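As an illustrative sketch (the function J(w) = w**2 here is a hypothetical example, not one taken from these notes):
from sympy import symbols, diff
w = symbols('w')
J = w**2                 # hypothetical example function
dJ_dw = diff(J, w)       # symbolic derivative
print(dJ_dw)             # 2*w
print(dJ_dw.subs(w, 3))  # evaluate the derivative at w = 3 -> 6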
Let's consider the following simple neural network:
import keras.backend as K
from keras.models import Model
from keras.layers import Input, Dense
input_layer = Input((10,))
layer_1 = Dense(10)(input_layer)
layer_2 = Dense(20)(layer_1)
layer_3 = Dense(5)(layer_2)
output_layer = Dense(1)(layer_3)
model = Model(inputs=input_layer, outputs=output_layer)
# some random input
import numpy as np
features = np.random.rand(100,10)
and assume this model has been trained.
# With a Keras function, get the outputs of all the layers
get_all_layer_outputs = K.function([model.layers[0].input],
                                   [l.output for l in model.layers])
layer_output = get_all_layer_outputs([features])
# layer_output is a list containing the output of every layer.
# If the model is trained, the outputs are computed with the trained weights;
# otherwise they are computed with the initial (random) weights.
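For instance, you can inspect the shape of each layer's output (a small usage sketch):
for i, out in enumerate(layer_output):
    print(f"layer {i} output shape: {out.shape}")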
Regression output layer:
When developing a neural network to solve a regression problem, the output layer should have exactly one node. Here we are not trying to map inputs to a variety of class labels, but rather trying to predict a single continuous target value for each sample. Therefore, our network should have one output node to return one – and exactly one – output prediction for each sample in our dataset.
The activation function for a regression problem will be linear. This can be defined by using activation='linear' or by leaving it unspecified to employ the default parameter value activation=None.
Linear activation function: the linear activation function, also known as "no activation" or the "identity function" (multiplied by 1.0), is where the activation is proportional to the input. The function doesn't do anything to the weighted sum of the input; it simply spits out the value it was given.
Evaluation metrics for regression: MSE is the most commonly used loss function; other options, such as MAE, are also available. See the sketch below.
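A minimal sketch of a regression network, assuming 10 input features (the layer sizes here are illustrative):
from keras.models import Sequential
from keras.layers import Dense
reg_model = Sequential([
    Dense(16, activation='relu', input_shape=(10,)),
    Dense(1, activation='linear')   # exactly one output node, linear activation
])
# MSE is the usual loss for regression; MAE is another common option
reg_model.compile(optimizer='adam', loss='mse', metrics=['mae'])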
Classification output layer:
If your target resides in a single vector of 0s and 1s, the number of output nodes in your neural network will be 1, and the activation function used on the final layer should be sigmoid. On the other hand, if your target is a matrix of one-hot-encoded vectors, your output layer should have one node per class (2 in the binary example below), and the activation function on the final layer should be softmax. For binary classification, the last layer is usually equivalent to logistic regression (a single node with sigmoid activation) deciding the class output.
Example: if Y has category values of (Yes, No), then one-hot encoding gives 2 columns, encoded_yes and encoded_no. In this case we need 2 neurons in the output layer, one for each output column.
Evaluation metrics for classification:
The loss function used for binary classification problems is determined by the data format as well. When dealing with a single target vector of 0s and 1s, you should use BinaryCrossentropy as the loss function. When your target variable is stored as One-Hot-Encoded values, you should use the CategoricalCrossentropy loss function.
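A sketch contrasting the two setups (the hidden layer sizes are illustrative assumptions):
from keras.models import Sequential
from keras.layers import Dense
# Target is a single vector of 0s and 1s
binary_model = Sequential([
    Dense(16, activation='relu', input_shape=(10,)),
    Dense(1, activation='sigmoid')   # single node, sigmoid
])
binary_model.compile(optimizer='adam', loss='binary_crossentropy')
# Target is one-hot encoded (e.g. encoded_yes, encoded_no)
onehot_model = Sequential([
    Dense(16, activation='relu', input_shape=(10,)),
    Dense(2, activation='softmax')   # one node per class, softmax
])
onehot_model.compile(optimizer='adam', loss='categorical_crossentropy')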
Reference:
https://www.enthought.com/blog/neural-network-output-layer/
Options:
1. Collect more data.
2. Select features (use only the important features that contribute to the prediction). This is called "feature selection".
3. Reduce the size of the parameters, so the model reduces the importance of individual features. This is called "regularization"; see the sketch after this list.
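A minimal sketch of option 3 in Keras, assuming L2 (ridge) regularization; the penalty strength of 0.01 is an arbitrary illustrative value:
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
# The L2 penalty shrinks the weights, reducing each feature's influence
regularized_model = Sequential([
    Dense(16, activation='relu', input_shape=(10,), kernel_regularizer=l2(0.01)),
    Dense(1, kernel_regularizer=l2(0.01))
])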
We can divide these options into two categories.
The first category is data files. For data files, Spark only adds the specified files into the containers; no further commands are executed. There are two options in this category:
--archives: with this option you can submit archives, and Spark will extract the files in them for you. Spark supports zip, tar, and similar formats.
--files: with this option you can submit files; Spark will put them in the container and do nothing else. sc.addFile is the programmatic API for this one.
The second category is code dependencies. In a Spark application, a code dependency can be a JVM dependency, or a Python dependency for a PySpark application.
--jars: this option is used to submit JVM dependencies as jar files. Spark adds these jars to the CLASSPATH automatically, so the JVM can load them.
--py-files: this option is used to submit Python dependencies; they can be .py, .egg, or .zip files. Spark adds these files to the PYTHONPATH, so the Python interpreter can find them. sc.addPyFile is the programmatic API for this one.
PS: a single .py file is added into a __pyfiles__ folder; the others are added into the CWD.
All four of these options can specify multiple files, separated by ",", and for each file you can specify an alias using the {URL}#{ALIAS} format. Don't specify an alias in the --py-files option, because Spark won't add the alias to the PYTHONPATH.
Example:
--archives abc.zip#new_abc,cde.zip#new_cde
Spark will extract abc.zip and cde.zip and create new_abc and new_cde folders in the container.
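The programmatic counterparts look like this (a sketch; the file names here are hypothetical):
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.appName("deps-example").getOrCreate()
sc = spark.sparkContext
sc.addPyFile("helpers.py")      # added to the PYTHONPATH on the executors
sc.addFile("lookup_table.csv")  # shipped to each container as-is
# Resolve the local path of a shipped file on an executor
path = SparkFiles.get("lookup_table.csv")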
Let's create some sample sine wave data and add some noise to it.
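A sketch of one way to do this (the sample count and noise scale are arbitrary choices):
import numpy as np
x = np.linspace(0, 4 * np.pi, 200)              # 200 evenly spaced points
y_clean = np.sin(x)                             # clean sine wave
noise = np.random.normal(0, 0.1, size=x.shape)  # Gaussian noise
y = y_clean + noise                             # noisy sine wave data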
Three different techniques for feature scaling:
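The three techniques are not spelled out in these notes; as a sketch, here are three commonly used ones (min-max scaling, z-score standardization, and mean normalization), which may or may not be the three intended:
import numpy as np
x = np.random.rand(100) * 50                       # hypothetical feature values
x_minmax = (x - x.min()) / (x.max() - x.min())     # min-max scaling to [0, 1]
x_zscore = (x - x.mean()) / x.std()                # z-score standardization
x_meannorm = (x - x.mean()) / (x.max() - x.min())  # mean normalization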