Consider a start-up whose business model is selling rooms (not houses or whole apartments) to clients, and which has a portfolio of 12,000 potential clients interested in this purchasing model.
Python, Pandas, Scikit-learn, Numpy, Seaborn, Matplotlib, Plotly, Streamlit.
Recommender Systems: Recommender Systems are algorithms that attempt to "predict" the next items (products, songs, etc.) that a particular user will want to purchase. These systems are based on machine learning and are used in applications ranging from e-commerce to streaming platforms, to help users find relevant content and improve their experience. All recommendation systems have one thing in common: in order to make predictions, they need to define and quantify the similarity between items or users. The term "distance" is used within the context of recommenders to quantify the similarity or difference between observations. If observations are represented in a k-dimensional space, where k is the number of variables associated with each observation (item or user), then the more similar two observations are, the closer they will be; hence the term "distance". From a mathematical perspective, this series of variables associated with each observation is called a "vector". There are several types of recommendation systems, such as Popularity-Based, Content-Based, Collaborative Filtering, or Hybrid.
Popularity-Based: Recommendations are based on the "popularity" of products. For example, global "best sellers" will be offered equally to all users without taking advantage of personalization. It is easy to implement and, in some cases, effective.
Content-Based: Based on products visited by the user, an attempt is made to "guess" what the user is looking for and offer similar products. Our case applies to this type of recommendation system.
Collaborative Filtering: This is the most innovative approach, as it uses information from the "crowd" to identify similar profiles and learns from the data to recommend products individually.
Similarity: A measure used to calculate how alike two or more vectors are, where vectors can represent any type of data: text, points in an image, a song, products, etc. Each value within the vector is called a "dimension"; so, for example, a two-element vector is a two-dimensional vector. Multiple distances are used to calculate the similarity between vectors in different types of scenarios; among them, the following stand out: Euclidean Distance, Cosine Distance and Similarity, Pearson Correlation, Spearman Correlation, and the Jaccard Index, among others. In this case, Cosine Similarity was applied.
Cosine Similarity: (Not to be confused with Cosine Distance) It tests how similar two vectors of numbers (such as scores or characteristics) are by looking at the angle between them. If the angle is small, it means the vectors are very similar, even if they have different lengths. It helps you understand how closely related two things are based on their direction. This means that the cosine of the angle between two vectors can be interpreted as a measure of the similarity of their orientations, regardless of their magnitudes.
The "dot product" or scalar product between two vectors u and v is defined as:

u · v = ||u|| · ||v|| · cos(θ)
Where:
u · v is the scalar product or "dot product" between the vectors u and v. (The NumPy library provides a very efficient dot-product function, np.dot.)
||u|| and ||v|| are the norms (or magnitudes) of the vectors u and v, respectively. (NumPy's np.linalg.norm is a very efficient norm calculation and is the recommended option.)
From the above formula, the cosine of the angle can be solved as:

cos(θ) = (u · v) / (||u|| · ||v||)
This formula is what is used as the Cosine Similarity, and the resulting values range between -1 and 1. So, if two vectors form an angle of:
0º: They have exactly the same orientation and their cosine takes the value of 1, that is, the vectors are perfectly aligned, indicating maximum similarity.
90º: They are perpendicular and their cosine is 0, indicating orthogonality.
180º: They have opposite orientations and their cosine is -1, indicating that they are diametrically opposite, reflecting maximum dissimilarity.
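The three cases above can be verified with a minimal sketch using NumPy's np.dot and np.linalg.norm, with hypothetical 2-dimensional vectors chosen to be aligned, perpendicular, and opposite:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])   # perpendicular to u (90º)
w = np.array([-2.0, 0.0])  # opposite orientation to u (180º)

print(cosine_similarity(u, u))  # 1.0: same orientation, maximum similarity
print(cosine_similarity(u, v))  # 0.0: orthogonal
print(cosine_similarity(u, w))  # -1.0: maximum dissimilarity
```

Note that w is twice as long as u, yet the similarity to u is still exactly -1: only the orientation matters, not the magnitude.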
Cosine Distance: Measures the "dissimilarity" between two vectors A and B based on the cosine of the angle between them. It can be defined as 1 minus the Cosine Similarity between A and B, as shown in the following formula:

Cosine Distance(A, B) = 1 − Cosine Similarity(A, B)
This formula quantifies the Cosine Distance, ranging from 0 to 2.
Cosine Distance = 0 means the vectors are perfectly aligned (with no angle between them), indicating maximum similarity.
Cosine Distance = 2 suggests the vectors are diametrically opposed, indicating maximum dissimilarity.
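The 0-to-2 range can be checked with a short NumPy sketch (the vectors are hypothetical; a scaled copy gives distance 0, a negated copy gives distance 2):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the Cosine Similarity, so the result ranges from 0 to 2.
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - sim

a = np.array([3.0, 4.0])
print(cosine_distance(a, 2 * a))  # 0.0: perfectly aligned, maximum similarity
print(cosine_distance(a, -a))     # 2.0: diametrically opposed
```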
Use of Python with its libraries such as Pandas for loading and managing data through DataFrames; Numpy for calculating the dot product and the similarity matrix; Scikit-learn for preprocessing with One Hot Encoding; Seaborn, Matplotlib, and Plotly for creating graphs; and Streamlit for developing the application's front-end.
Knowledge acquired about Recommender Systems and their possible uses.
Understanding the concept of similarity between vectors. Applied to this case, where two vectors represent two potential owners, this similarity is determined by a distance measure such as Cosine Similarity, which is the one used here.
Recognize the difference between Cosine Similarity and Cosine Distance. Cosine Distance measures the dissimilarity between vectors by calculating the cosine of the angle between them; Cosine Similarity quantifies how similar two vectors are based on the cosine of the same angle.
Improved knowledge of other possible distances to use in this kind of problem.
Problem: The owners of these rooms will be sharing the apartment with other owners. The startup's idea is to ensure a high degree of compatibility between the characteristics, tastes, and interests of the different owners who will be sharing the same apartment. With 12,000 owners, this task cannot be done manually, but it can be automated using a Data Science algorithm.
We have 12,000 potential clients, and for each of them, we have the answers to 17 questions (17 dimensions) that each client completed to allow us to evaluate their tastes and interests. Therefore, we have a data set or a series of data for each of those 12,000 potential clients; that is, for each potential client, we have an associated 17-dimensional vector.
To simplify the problem, the solution is restricted to finding the set of remaining owners of an apartment, assuming that one owner already exists. Otherwise, finding all the owners of an apartment would require at least one other complementary model to begin configuring the group for each apartment.
Starting from the 12,000 potential-client vectors, the similarity between them is calculated. What these similarity measures tell us is that the more similar two specific vectors are to each other, the more similar the two potential owners they represent, that is, the more compatible they are with each other.
Calculating the similarity between the vectors involves starting from a matrix of owners by questions, where each row corresponds to a 17-dimensional vector (one dimension per question), and moving on to another resulting symmetric matrix of owners by owners containing the calculated similarity value for each pair.
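This owners-by-questions to owners-by-owners step can be sketched with Scikit-learn's cosine_similarity, here on random hypothetical data in place of the real encoded answers:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Hypothetical owners-by-questions matrix: 6 owners, 17 encoded answers each.
owners = rng.random((6, 17))

# Resulting owners-by-owners matrix of pairwise cosine similarities.
sim = cosine_similarity(owners)
print(sim.shape)                     # (6, 6)
print(np.allclose(sim, sim.T))       # True: the matrix is symmetric
print(np.allclose(np.diag(sim), 1))  # True: each owner vs. themselves is 1
```

With the full dataset the same call produces a 12,000 × 12,000 matrix.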
The similarity measure between vectors chosen for this case is Cosine Similarity (one of the most commonly used).
The resulting matrix of owners by owners is symmetric because the similarity of owner A to owner B is the same as the similarity of owner B to owner A. In addition, every owner is maximally similar to themselves, so the cell that relates an owner to themselves (the diagonal of the matrix) is always equal to 1. Therefore, we only need one half of the resulting matrix, either above or below the diagonal. Example: if owner M has a similarity of 0.8 to owner A, and owner C has a similarity of 0.17 to owner A, then, looking at the rest of the owners with respect to owner A, owner M is more similar to owner A than owner C is, since it fits better (0.8 > 0.17, closer to 1). In this way, we can provide the startup with a list of the most compatible room owners for an apartment (A) so that a manual selection can be made from there, perhaps narrowing the 12,000 potential co-owners down to the 10 or 15 whose data tells us they would be the most compatible. Finally, a person can review these results and select two or three to be offered the remaining rooms in that apartment.
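The shortlisting step described above can be sketched as follows, again with random hypothetical data in place of the real questionnaire answers; top_k_compatible is an illustrative helper, not part of the project's code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)
owners = rng.random((100, 17))  # hypothetical encoded answers for 100 owners
sim = cosine_similarity(owners)

def top_k_compatible(sim_matrix, owner_idx, k=10):
    """Indices of the k owners most similar to owner_idx (excluding themselves)."""
    scores = sim_matrix[owner_idx].copy()
    scores[owner_idx] = -np.inf          # skip the owner's own diagonal entry
    return np.argsort(scores)[::-1][:k]  # highest similarity first

shortlist = top_k_compatible(sim, owner_idx=0, k=10)
print(shortlist)  # the 10 best candidate co-owners for owner 0
```

A person can then review this shortlist and pick the two or three candidates to be offered the remaining rooms.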
The algorithm is based on three components (Python files):
main.py: Contains the front-end logic built with Streamlit.
utilidades.py: Contains utility functions used across the calculations.
logica.py: This is where the similarity matrix is calculated and the recommendations are generated.
All the code and the entire application can be found in this GitHub repository.