kNN

 

K-Nearest Neighbors 

You are the average of your k closest friends

KNN can be used for both classification and regression problems, but it is more widely used for classification. It is a simple algorithm that stores all available cases and classifies a new case by a majority vote of its k nearest neighbors: the new case is assigned to the class that is most common amongst those k neighbors (i.e. the mode of their labels), as measured by a distance function.

The fact that k-NN doesn’t require a pre-defined parametric function f(X) relating Y to X makes it well-suited for situations where the relationship is too complex to be expressed with a simple linear model.
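
To make the majority-vote idea concrete, here is a minimal from-scratch sketch in NumPy. The function knn_predict and the toy arrays are made up for illustration, not part of any library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from x_new to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote (mode) of their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated classes in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> 0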

  • Does more computation at test time than at training time
  • Performs much better when all of the features are on the same scale
  • Works well with a small number of input variables but struggles as the number of inputs grows
  • Makes no assumptions about the functional form of the problem being solved
  • Can be used for imputing missing values of both categorical and continuous variables (see the imputation sketch after this list)
  • Areas to use:
    • Works well for fraud detection (the model updates instantly, since there is no training step)
    • Predicting house prices (literally the nearest neighbors) or other domains where physical proximity matters
    • Filling in missing values
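
For the imputation use case above, a minimal sketch with scikit-learn's KNNImputer; the toy matrix is made up, and each np.nan is filled in from the nearest rows.

import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries marked as np.nan
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing value is replaced by the mean of that feature
# across the 2 nearest rows (nearness measured on the observed features)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)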

 

[Figure: k-NN simplified, classification and continuous (regression) examples]

 

  • The distance function can be Euclidean, Manhattan, or Minkowski for continuous variables (see the sketch after this list)
  • Hamming distance for categorical variables
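
As a quick sketch, all of the metrics above are available in scipy.spatial.distance; the vectors u and v are arbitrary examples.

from scipy.spatial import distance

u = [1, 2, 3]
v = [4, 6, 3]

print(distance.euclidean(u, v))       # sqrt(3^2 + 4^2 + 0^2) = 5.0
print(distance.cityblock(u, v))       # Manhattan: 3 + 4 + 0 = 7
print(distance.minkowski(u, v, p=3))  # Minkowski distance of order 3
print(distance.hamming([1, 0, 1], [1, 1, 1]))  # share of mismatching positions = 1/3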

[Figure: Euclidean and Manhattan distance formulas]

 

 

Starting from the Pythagorean theorem, c² = a² + b², and solving for c, we find the length of the hypotenuse by taking the square root of the sum of the squared lengths of a and b, where a and b are the orthogonal sides of the triangle (i.e. they are at a 90-degree angle from one another, going in perpendicular directions in space). For example, with legs of length 3 and 4, the Euclidean distance is √(9 + 16) = 5, while the Manhattan distance is 3 + 4 = 7. Green line = Euclidean distance, blue line = Manhattan distance.

This idea of finding the length of the hypotenuse given vectors in two orthogonal directions generalizes to many dimensions, and this is how we derive the formula for Euclidean distance d(p, q) between points p and q in n-dimensional space:

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
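
A quick numeric check of this formula with NumPy (the points p and q are arbitrary): the explicit sum of squared differences matches the built-in vector norm.

import numpy as np

p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([2.0, 4.0, 6.0, 8.0])

# Square root of the sum of squared coordinate differences
d_manual = np.sqrt(np.sum((p - q) ** 2))
# The same distance via the built-in Euclidean norm
d_builtin = np.linalg.norm(p - q)
print(d_manual, d_builtin)  # both ~5.477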

  • Variables should be normalized, otherwise variables with larger ranges can dominate the distance and bias the results
  • Invest in the pre-processing stage before running KNN, e.g. outlier and noise removal
  • Choose the best k with cross-validation (see the sketch after this list)
  • As you increase k, bias increases and variance decreases
  • A small k can lead to overfitting
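
A rough sketch of the normalization plus cross-validation advice with scikit-learn, assuming training data X and y already exist: the Pipeline scales features first, and GridSearchCV picks k by 5-fold cross-validation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Scale features first so no single variable dominates the distance, then fit k-NN
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# 5-fold cross-validation over a small grid of k values
grid = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)  # X, y are assumed to be the training predictors and labels
print(grid.best_params_, grid.best_score_)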


R:

# caret's knn3() gives a formula interface for k-NN classification
# (assumes df contains the target column y_train plus the predictors)
library(caret)
fit <- knn3(y_train ~ ., data = df, k = 5)
summary(fit)
predicted <- predict(fit, x_test, type = "class")

P:

# Import library
from sklearn.neighbors import KNeighborsClassifier
# Assumes you have X (predictors) and y (target) for the training data set and x_test (predictors) for the test data set
# Create the KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)