
K-Nearest Neighbors

All you need to know about KNN.

“A man is known for the company he keeps.”

A perfect opening line, I must say, for introducing K-Nearest Neighbors. Yes, that's how simple the concept behind KNN is: it classifies a data point based on its few nearest neighbors. How many neighbors? That is what we decide. It looks like you already know most of what there is to know about this simple model, so let's dive in for a much closer look. Before moving on, it's important to know that KNN can be used for both classification and regression problems. We will first understand how it works for a classification problem, which will make the regression case easier to visualize.

KNN Classifier

The data we are going to use is the Breast Cancer Wisconsin (Diagnostic) Data Set. It has 30 attributes corresponding to real-valued features computed for the cell nucleus under consideration. A total of 569 such samples are present in the data, of which 357 are classified as 'benign' (harmless) and the remaining 212 as 'malignant' (harmful). The diagnosis column contains 'M' or 'B' for malignant and benign cancers respectively. I have changed these values to 1 and 0 respectively, for easier analysis. Also, for the sake of this post, I will use only two attributes from the data, 'mean radius' and 'mean texture'. This will later help us visualize the decision boundaries drawn by KNN. Here's what the final data looks like (after shuffling):
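If you want to reproduce this frame yourself, here is a minimal sketch assuming scikit-learn's bundled copy of the dataset (the post's own loading code isn't shown). Note that load_breast_cancer encodes malignant as 0 and benign as 1, so the target is flipped below to match the malignant=1, benign=0 convention described above.

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled copy of the Wisconsin (Diagnostic) data set
cancer = load_breast_cancer()

# Keep only the two attributes used in this post
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)[['mean radius', 'mean texture']]

# load_breast_cancer encodes malignant as 0 and benign as 1;
# flip it so malignant = 1 and benign = 0, as described above
data['diagnosis'] = 1 - cancer.target

# Shuffle the rows before splitting
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
print(data.head())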

Let’s code the KNN:

# Defining X and y
X = data.drop('diagnosis', axis=1)
y = data.diagnosis

# Splitting data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Importing and fitting KNN classifier for k=3
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predicting results using the test data set
pred = knn.predict(X_test)

# Evaluating accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

The above code should give you the following output, possibly with slight variation:

0.8601398601398601

What just happened? When the trained KNN classifies a sample, it takes the following steps:

1. Calculate the distance between the sample and every training sample, using a metric such as the Euclidean distance.
2. Sort these distances in ascending order.
3. Choose the top K values from the sorted distances.
4. Assign the sample the most frequent class among those K nearest neighbors.

Let's visualize how KNN drew a decision boundary on the train data set and how the same boundary is then used to classify the test data set.
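Before we look at the boundary, note that the four steps above translate almost line-for-line into code. Here is a minimal from-scratch sketch, assuming NumPy arrays as inputs; it is not scikit-learn's actual implementation (which accelerates the neighbor search with tree structures), just the recipe taken literally:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, sample, k=3):
    # 1. Euclidean distance from the sample to every training point
    distances = np.sqrt(((X_train - sample) ** 2).sum(axis=1))
    # 2. & 3. Sort the distances and keep the indices of the k smallest
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote among the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]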


With a training accuracy of 93% and a test accuracy of 86%, our model may have overfit here. Why so?

When the value of K (the number of neighbors) is too low, the model picks up only the points closest to the given sample, forming a very complex decision boundary as shown above. Such a model fails to generalize well on the test data set, and thus shows poor results there. The problem can be solved by tuning the n_neighbors parameter. As we increase the number of neighbors, the model starts to generalize better, but increasing the value too far degrades performance again. It is therefore important to find an optimal value of K at which the model classifies the test data well. Let's observe the train and test accuracies as we increase the number of neighbors.
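The post doesn't show the loop it used for this sweep, but a minimal sketch could look like the following, reusing the train/test split from above (the range of K values tried here is a free choice):

# Sweeping n_neighbors and recording train/test accuracy
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 21)
train_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

best_k = k_values[int(np.argmax(test_scores))]
print(f"Best test accuracy at k={best_k}")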

Best results are observed at k=11

The above result is best visualized in the following plot.

The plot shows an overall upward trend in test accuracy up to a point, after which the accuracy starts declining again. The peak marks the optimal number of nearest neighbors, which in this case is 11, with a test accuracy of 90%. Let's plot the decision boundary again for k=11 and see how it looks. We have improved the results by fine-tuning the number of neighbors, and the decision boundary drawn by KNN is now much smoother and generalizes well to the test data.
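The post's own plotting code isn't shown, but one way to draw such a boundary with matplotlib is to evaluate the classifier on a fine grid over the two features; a minimal sketch:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=11).fit(X_train.values, y_train)

# Evaluate the classifier on a fine grid over the two features
xx, yy = np.meshgrid(
    np.linspace(X['mean radius'].min() - 1, X['mean radius'].max() + 1, 200),
    np.linspace(X['mean texture'].min() - 1, X['mean texture'].max() + 1, 200),
)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions and overlay the test points
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_test['mean radius'], X_test['mean texture'], c=y_test, edgecolor='k')
plt.xlabel('mean radius')
plt.ylabel('mean texture')
plt.title('KNN decision boundary, k=11')
plt.show()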

Let’s now understand how KNN is used for regression.

KNN Regressor

While the KNN classifier returns the mode of the classes of the nearest K neighbors, the KNN regressor returns the mean of their values. We will use an advertising data set to understand KNN regression. Here are the first few rows of TV budget and sales.
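To make the mode-versus-mean distinction concrete before we touch the advertising data, here is a tiny toy example (the numbers are made up purely for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy 1-D data: for a query at x=2.5, the 3 nearest points are x=1, 2, 3
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])

# Classifier: majority class among the 3 neighbors (0, 0, 1) -> 0
clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, [0, 0, 1, 1])
print(clf.predict([[2.5]]))  # [0] -- the mode

# Regressor: average target among the 3 neighbors (4, 6, 8) -> 6
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, [4.0, 6.0, 8.0, 100.0])
print(reg.predict([[2.5]]))  # [6.] -- the mean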

# Defining X and y
X_ad = ad.TV.values.reshape(-1, 1)
y_ad = ad.sales

# Splitting data into train and test
train_x, test_x, train_y, test_y = train_test_split(X_ad, y_ad, test_size=0.25, random_state=42)

# Running KNN for various values of n_neighbors and storing results
from sklearn.neighbors import KNeighborsRegressor
import pandas as pd

knn_r_acc = []
for i in range(1, 17):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(train_x, train_y)
    test_score = knn.score(test_x, test_y)
    train_score = knn.score(train_x, train_y)
    knn_r_acc.append((i, test_score, train_score))

df = pd.DataFrame(knn_r_acc, columns=['K', 'Test Score', 'Train Score'])
print(df)


The above code runs KNN for various values of K (from 1 to 16) and stores the train and test scores in a DataFrame. Let's see how these scores vary as we increase the value of n_neighbors (or K).

Best results at K=4

At K=1, the KNN follows the training data very closely and thus shows a high training score. The test score, by comparison, is quite low, indicating overfitting. Let's visualize how the KNN draws the regression path for different values of K.

As K increases, the KNN fits a smoother curve to the data. This is because a higher value of K averages over more data points, reducing the jaggedness of the prediction and, with it, the overall complexity and flexibility of the model. As we saw earlier, increasing the value of K improves the score up to a certain point, after which it starts dropping again. This can be better understood in the following plot. As the figure shows, the model yields the best results at K=4. I have used R² to evaluate the model, and this was the best we could get; our data set is simply too small and scattered.
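For reference, regression paths like the ones above can be reproduced with a short matplotlib sketch such as the following (the post's own plotting code isn't shown, and the K values chosen here are just illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# A dense grid of TV budgets to trace each model's prediction curve
grid = np.linspace(X_ad.min(), X_ad.max(), 500).reshape(-1, 1)

plt.scatter(train_x, train_y, s=10, alpha=0.4, label='train data')
for k in [1, 4, 16]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(train_x, train_y)
    plt.plot(grid, knn.predict(grid), label=f'K={k}')

plt.xlabel('TV budget')
plt.ylabel('sales')
plt.legend()
plt.show()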

Some other important points to know about KNN:

- The KNN classifier has no specialized training phase: it simply stores all the training samples in memory and uses them directly at classification time.
- KNN is a non-parametric algorithm, as it makes no assumptions about the training data. This makes it useful for problems with non-linear data.
- KNN can be computationally expensive in both time and storage if the data is very large, because it has to keep the entire training set around to work. This is generally not the case with other supervised learning models.
- KNN can be very sensitive to the scale of the data, since it relies on computing distances. Features with a larger scale dominate the calculated distances and can produce poor results, so it is advisable to scale the data before running KNN.

That's all for this post. I hope you had a good time learning KNN. For more, stay tuned.
