Bigdata and data science by Kartheek Dachepalli: Split a dataset into training, cross-validation, and test sets

Sunday, December 18, 2022

from sklearn.model_selection import train_test_split

# x (inputs) and y (targets) are assumed to be NumPy arrays that have already been loaded.

print(f"the shape of the original set (input) is: {x.shape}")

print(f"the shape of the original set (target) is: {y.shape}\n")

# Get 60% of the dataset as the training set. Put the remaining 40% in temporary variables.

x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Split the 40% subset in half: 20% of the original data for cross validation and 20% for the test set

x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

# Delete temporary variables

del x_, y_

print(f"the shape of the training set (input) is: {x_train.shape}")

print(f"the shape of the training set (target) is: {y_train.shape}\n")

print(f"the shape of the cross validation set (input) is: {x_cv.shape}")

print(f"the shape of the cross validation set (target) is: {y_cv.shape}\n")

print(f"the shape of the test set (input) is: {x_test.shape}")

print(f"the shape of the test set (target) is: {y_test.shape}")
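
The snippet above expects x and y to exist already. As a minimal, self-contained sketch of the same two-step split, the version below builds a small synthetic dataset first. The 100-sample array and the linear target are made up purely for illustration and are not from the original post; everything else mirrors the calls above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data (an assumption for this sketch): 100 samples, one feature each.
x = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

# Step 1: keep 60% for training, hold out the remaining 40%.
x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Step 2: split the held-out 40% in half, giving 20% of the original data each
# to the cross validation set and the test set.
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

# Delete temporary variables
del x_, y_

# Report each subset's size and its share of the original dataset.
for name, part in [("training", x_train), ("cross validation", x_cv), ("test", x_test)]:
    share = part.shape[0] / x.shape[0]
    print(f"the {name} set has {part.shape[0]} samples ({share:.0%} of the data)")
```

Because the second call operates on the 40% remainder, test_size=0.50 there corresponds to 0.40 x 0.50 = 0.20 of the original data, so the overall split is 60/20/20. Fixing random_state makes the shuffle reproducible from run to run.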
