K Nearest Neighbours
What is K-nearest neighbours?
K-nearest neighbours is a supervised machine learning algorithm used for both classification and regression tasks. It assigns a category or a numerical value to a new, unknown data point based on the known outputs of the data points that are spatially closest to it within a structured dataset.
What does the "K" represent in the algorithm?
The "K" represents a predefined integer that dictates exactly how many neighboring data points the algorithm must evaluate when making a prediction. For example, if a data scientist sets K to 5, the algorithm will identify the 5 closest existing data points to the new input. It then uses the majority category or the average numerical value of those 5 specific points to compute the final output.
How does the algorithm computationally determine which data points are the closest?
The algorithm determines proximity by calculating the exact geometric distance between the new data point and every existing data point in the dataset. This is typically achieved using standard mathematical distance metrics, primarily Euclidean distance, which measures the direct straight-line distance between points in a multi-dimensional space, or Manhattan distance, which measures distance along grid-like axes.
What is the theoretical classification of this algorithm?
Theoretically, K-nearest neighbours is classified as a "lazy learning" and "non-parametric" algorithm. It is considered "lazy" because it does not construct an abstract mathematical model during a dedicated training phase. Instead, it memorizes the entire dataset and performs all distance computations only at the exact moment a prediction is requested. It is "non-parametric" because it makes zero underlying statistical assumptions about how the data is distributed.
Which programming languages and libraries are utilized to execute K-nearest neighbours?
Python is the dominant programming language for executing this algorithm. Developers rely on the scikit-learn machine learning library. Depending on the objective, they import and utilize either the KNeighborsClassifier module to predict discrete categories or the KNeighborsRegressor module to predict continuous numerical values.
How is K-nearest neighbours utilized in the field of Data Science?
In data science, KNN is frequently applied to missing data imputation and pattern recognition tasks, such as credit risk assessment. For example, a data scientist at a bank receives an application from a new client and needs to predict the probability of loan default. The algorithm plots the new client into a dataset based on numerical variables such as annual income, current debt amount, and years of employment. If K is set to 15, the algorithm locates the 15 historical clients with the most mathematically similar financial profiles. If 12 of those 15 closest clients successfully repaid their loans, the algorithm classifies the new client as a low credit risk based on strict proximity.