K-Nearest Neighbors
You are the average of your k closest friends
Can be used for both classification and regression problems, but is more widely used for classification. KNN is a simple algorithm that stores all available cases and classifies a new case by a majority vote of its k nearest neighbors: the case is assigned to the class most common among those neighbors (the mode), where "nearest" is measured by a distance function. For regression, the prediction is the average of the neighbors' target values.
The fact that k-NN doesn’t require a pre-defined parametric function f(X) relating Y to X makes it well-suited for situations where the relationship is too complex to be expressed with a simple linear model.
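The majority-vote procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name knn_predict and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    # Distance from x_new to every stored training case
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two small, well-separated clusters
X_train = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

label_near_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
label_near_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
```

Note that there is no training step: all the work happens at prediction time, when distances to every stored case are computed.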
- Does more computation at test time than at train time
- Performs much better if all features are on the same scale
- Works well with a small number of input variables but struggles with many inputs
- Makes no assumptions about the functional form of the problem being solved
- Can be used for imputing missing values of both categorical and continuous variables
- Areas to use:
  - Fraud detection (the model updates instantly, since new cases are simply added to the stored data)
  - Predicting house prices (literally near neighbors) and other domains where physical proximity matters
  - Filling missing values
- Distance functions for continuous variables: Euclidean, Manhattan, Minkowski
- Hamming distance for categorical variables
- Variables should be normalized, otherwise variables with larger ranges can bias the distance
- Spend more effort on the pre-processing stage (e.g., outlier and noise removal) before applying KNN
- Choose the best k with cross-validation
- Increasing k increases bias and decreases variance
- Small k can lead to overfitting
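The distance functions and the normalization point from the list above can be illustrated with a small NumPy sketch. The helper names (minkowski, hamming), the age/income features, and the assumed feature ranges are all made up for this example.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def hamming(a, b):
    """Number of positions where two categorical vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7.0
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16) = 5.0

mismatches = hamming(["red", "small", "round"], ["red", "large", "round"])  # 1

# Effect of scale: features are (age in years, income in dollars), made-up values
query = np.array([30.0, 50_000.0])
candidates = np.array([
    [60.0, 50_500.0],  # very different age, similar income
    [31.0, 52_000.0],  # similar age, somewhat different income
])

# Unscaled, income dominates the distance, so candidate 0 looks nearest
raw_dists = np.linalg.norm(candidates - query, axis=1)
raw_nearest = int(np.argmin(raw_dists))

# Dividing by assumed feature ranges puts both features on a comparable scale
ranges = np.array([50.0, 100_000.0])
scaled_dists = np.linalg.norm((candidates - query) / ranges, axis=1)
scaled_nearest = int(np.argmin(scaled_dists))
```

Before scaling, the 30-year age gap is swamped by a 500-dollar income gap; after scaling, the candidate that is close on both features becomes the nearest neighbor.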
R:
P:
# Import library
from sklearn.neighbors import KNeighborsClassifier
# Assumes you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create the KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)  # mean accuracy on the training data
# Predict output
predicted = model.predict(x_test)
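
The snippet above hard-codes n_neighbors=6. Following the notes on normalization and on choosing k with cross-validation, one possible approach is to wrap a scaler and the classifier in a pipeline and search over k. This is a sketch under assumptions: the iris dataset stands in for your own X and y, and the candidate values of k and the metrics are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with your own X and y
X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values (arbitrary): small k risks overfitting, large k adds bias
k_values = [1, 3, 5, 7, 9, 11, 15]
metrics = ["euclidean", "manhattan"]
param_grid = {"knn__n_neighbors": k_values, "knn__metric": metrics}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
best_metric = search.best_params_["knn__metric"]
best_score = search.best_score_  # mean cross-validated accuracy of the best setting

# The refit best pipeline can be used for prediction directly
predicted = search.predict(X[:5])
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the scaling statistics.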