Blind Dating Ensemble Classifier

Posted on Oct 28, 2022

Introduction

Dating is hard, especially if one is bad at first impressions. What's even harder is that, according to Fisman et al. (2006), men and women look for different things. According to the study, men primarily choose women based on attractiveness and dislike women who have more intelligence or ambition than they do. Women, on the other hand, treat race and intelligence as key factors in their decision and prefer men who come from affluent neighborhoods. Do these differences also change across other factors?

This project uses the Fisman et al. (2006) data set, which can be found on Kaggle, to create a Plotly Dash app that lets users explore how different features affect the probability of a match, the event in which both people choose each other. The 2006 study examined 20 speed dating events in which about 20 men and 20 women per event engaged in rounds of 5 minute speed dates.

However, what model should be used? Do we optimize for accuracy, recall, or precision? The Blind Dating Ensemble Classifier uses a customizable Ensemble Vote Classifier. Users can pick from a Logistic Regression model, 2 different K-Nearest Neighbor models, 2 Gradient Boosting models, 2 Decision Tree models, and 2 Random Forest models.

Currently, the app times out on processes that run longer than 30 seconds. Thus, I used a small ensemble composed of the logistic regression model and both decision trees to answer the following questions:

  • Does racial similarity/difference affect match success?
  • Does one's answer to "Out of the 20 people you will meet, how many do you expect will be interested in dating you" affect match success?
  • Does how frequently one goes out affect match success?

Methods

I. Data Preparation

I was asked to mention how to run the notebooks and Python scripts in order to go from raw data to presenting in the app. The other blogs I have written up to this point require only one file in the repo to be run. For this project, one must run these files in the following order: notebooks/eda.ipynb (Methods I.a - I.c), notebooks/machineLearning.ipynb (Methods I.c - I.h, Methods II, and Methods III), src/collectionPipeLine.py (Methods III), and app.py (Methods IV).

Note that running eda.ipynb will delete and recreate empty data/processedData and data/plotlyDashData directories, and running machineLearning.ipynb without datingTrain.csv, datingTest.csv, and datingFull.csv in data/processedData will do the same. Any data that should not be deleted should be kept in the root data folder.

Readers who want to skip over how the sausage is made and go straight to how the app works can scroll down to Methods IV. With that said, here is the path from raw data to app in more detail.

I.a) Getting locations (notebooks/eda.ipynb)

The first thing to do is add locations via Nominatim. In the raw data, zipcodes are stored as numbers formatted with commas, which causes two problems. First, some international zipcodes start with one or more zeros, and those leading zeros were lost when the data set was stored: a zipcode like 00045 is stored as 45. Second, United States zipcodes are unfortunately stored as 92,069 instead of 92069. Nominatim will not understand either form. To fix this, the zipcodes are turned into strings, all commas are removed, and zeros are prepended to any string shorter than five characters until its length is five.

Once the zipcodes are prepared, the code loops through each row. It first checks whether the zipcode value or the from value (which could be a city, state, country, or a combination of those elements) is already in locationsDictionary. If so, it assigns the latitude and longitude of that location to lats and lons respectively. If not, the code calls Nominatim's API to get coordinates for the zipcode. If the coordinates are None, the code reattempts using the from value. Once a location is retrieved using either the zipcode value or the from value, the coordinates are stored in locationsDictionary. This repeats until the code has attempted to retrieve the lats and lons for each row.
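The zipcode cleanup and the cache-then-geocode loop can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function names are hypothetical, and the geocoder is passed in as a plain function (here a fake one with made-up coordinates) so the sketch runs without calling Nominatim.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]


def normalize_zipcode(raw) -> str:
    """Turn values such as 45 or "92,069" into five-character strings."""
    text = str(raw).replace(",", "").strip()
    return text.zfill(5)  # prepend zeros until the length is five


def lookup_location(
    zipcode: str,
    from_value: str,
    cache: Dict[str, Coordinates],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Return (lat, lon): try the cache first, then the zipcode, then the from value."""
    for key in (zipcode, from_value):
        if key in cache:
            return cache[key]
    for key in (zipcode, from_value):
        coordinates = geocode(key)
        if coordinates is not None:
            cache[key] = coordinates  # remember successful lookups
            return coordinates
    return None


def fake_geocode(query: str) -> Optional[Coordinates]:
    """Hypothetical stand-in for a Nominatim call; coordinates are made up."""
    known = {"Example City": (10.0, 20.0)}
    return known.get(query)


locations_dictionary: Dict[str, Coordinates] = {}
zipcode = normalize_zipcode("92,069")
coords = lookup_location(zipcode, "Example City", locations_dictionary, fake_geocode)
```

In this run the zipcode lookup returns None, so the from value is tried and its coordinates are cached for later rows.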

I.b) Feature selection (notebooks/eda.ipynb)

After getting locations, I chose the variables that might be of interest. The chosen columns are stored in a list dubbed columnList.

I.c) Strings to Numbers and Numbers to Strings (notebooks/eda.ipynb & notebooks/machineLearning.ipynb)

Of the variables in columnList, there are numbers that represent categories or codes; these variables are put in a list I call nonBinaryCategoricalList. Lastly, there are numbers, such as SAT scores, tuition, and income, that can exceed 1000 and were therefore stored as strings with commas. To continue with data processing, the variables in stringToFloatList needed to be cast as floats, and the numerical values in nonBinaryCategoricalList needed to be cast as strings.

I.d) Dummification and Train-test-split (notebooks/machineLearning.ipynb)

I proceed to turn all string columns into dummies, on the condition that the column undergoing dummification has fewer than 25 unique values. Individuals came from various cities and states, attended different undergraduate institutions, and have many different career titles. The names of all those different attributes might have a psychological effect on a person; however, making hundreds of dummy variables for these varied entries would bog down any of the machine learners chosen for this project. The original column names that got dummified became keys in dummyDictionary, with a list of the new dummy column names as the corresponding values. In this project, I did not drop the first dummy column. This way, I can keep track of all values and do not have to remember or recover what the default dummy value was.
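A minimal sketch of this step with pandas is below. The function name and the toy data are hypothetical, and leaving high-cardinality string columns untouched is an assumption of the sketch, since the text only says they are not dummified.

```python
import pandas as pd

MAX_UNIQUE = 25  # cardinality threshold taken from the text


def dummify(frame: pd.DataFrame, max_unique: int = MAX_UNIQUE):
    """One-hot encode low-cardinality string columns, keeping every level."""
    dummy_dictionary = {}
    result = frame.copy()
    for column in frame.columns:
        if not pd.api.types.is_string_dtype(frame[column]):
            continue  # only string columns are dummified
        if frame[column].nunique() >= max_unique:
            continue  # assumption: high-cardinality columns are left as they are
        # drop_first is left at its default (False) so no level is hidden
        dummies = pd.get_dummies(frame[column], prefix=column, dtype=int)
        dummy_dictionary[column] = list(dummies.columns)
        result = pd.concat([result.drop(columns=column), dummies], axis=1)
    return result, dummy_dictionary


toy = pd.DataFrame({"race": ["1", "2", "1"], "age": [25, 30, 27]})
encoded, dummy_dictionary = dummify(toy)
```

Here dummy_dictionary maps "race" to its new dummy column names, which is the lookup the app later needs to connect a raw column name to its dummy columns.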

Once dummies have been added, a train-test-split can occur. We save datingTrain (train data), datingTest (test data), and datingFull (full dataset).

I.e) Joining to partner (notebooks/machineLearning.ipynb)

After train-test-splitting, I use util.joinToPartner. Again, the target variable of interest is match, the event where candidate and partner say yes to each other. Each row focuses mostly on the candidate. We see why the candidate made their decision, but we're missing half of the deciding factor of the match: the partner's data. This is the first reason why I also kept a full dataset in addition to the train and test datasets. Each row is a pairing between a candidate and a partner. Each candidate is identified with an iid, and the partner's iid is listed as pid. Thus, grabbing the row from datingFull where datingFull.iid = datingTrain.pid and datingFull.pid = datingTrain.iid will give us the partner's attributes and decisions. Partner columns are suffixed with "_o" (e.g., "age" means candidate age, while "age_o" means partner age).
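The lookup described above amounts to a self-join on swapped keys. The following sketch shows one way to express it with pandas; it is not the actual util.joinToPartner, and the function name and toy columns are illustrative only.

```python
import pandas as pd


def join_to_partner(candidates: pd.DataFrame, full: pd.DataFrame) -> pd.DataFrame:
    """Attach each partner's own row, suffixing partner columns with "_o"."""
    partner = full.rename(columns=lambda name: f"{name}_o")
    # Partner row: full.iid == candidates.pid and full.pid == candidates.iid
    return candidates.merge(
        partner,
        how="left",
        left_on=["iid", "pid"],
        right_on=["pid_o", "iid_o"],
    )


# Toy data: one pairing, listed once from each person's point of view.
dating_full = pd.DataFrame(
    {"iid": [1, 2], "pid": [2, 1], "age": [24, 27], "dec": [1, 0]}
)
dating_train = dating_full[dating_full["iid"] == 1]
joined = join_to_partner(dating_train, dating_full)
```

Joining against the full dataset rather than the training split matters because a candidate's partner row may have landed in the test split.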

I.f) Get distances (notebooks/machineLearning.ipynb)

Using the lats and lons from Methods I.a and the lats_o and lons_o from Methods I.e, I use geopy.distance.great_circle to get the distance in miles between the candidate's location and the partner's location.
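For readers without geopy installed, the same kind of distance can be approximated with the haversine formula. This is a stand-in sketch, not the project's code: the function name and the Earth radius constant are assumptions, and results may differ slightly from geopy's.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude along the equator
partner_distance = great_circle_miles(0.0, 0.0, 0.0, 1.0)
```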

I.g) Fix Ambiguous Scores (notebooks/machineLearning.ipynb)

There were various questions asked before the speed dating event began. Five of them were trait and preference questions. Paraphrased, these 5 questions were:

  1. Regarding attractiveness, sincerity, intelligence, fun, ambition, and shared interests, what is important in a potential partner?
  2. How would most candidates of the opposite gender answer question 1?
  3. On a scale of 1-10 for each attribute, how would you rate yourself in attractiveness, sincerity, intelligence, fun, and ambition?
  4. How would most candidates of your gender answer question 1?
  5. On a scale of 1-10 for each attribute, how would others rate you in attractiveness, sincerity, intelligence, fun, and ambition?

Of the 20 waves, 16 were asked to answer questions 1, 2, and 4 using a 100 point budget, allocating more points to the more important attributes and fewer points to the less important ones. Waves 6-9 were instead asked to rate the importance of each attribute on a scale of 1-10. Even though the instructions asked candidates in waves 1-5 and 10-20 to ensure their answers added up to 100, The Data Science Book of Love Kaggle notebook notes that the score budget was not strictly enforced. Lastly, some waves were paused halfway through the night so candidates could reflect on the pairs they had met so far and re-answer questions 1 and 3 on scales of 1-10. Candidates who received those questions halfway through the speed dating experience would have had a conscious change of mind, which could have influenced match probability.

To fix this, machineLearning.ipynb calls util.FixAmbigiousScores, which has 2 parts. The first part replaces the original answers to questions 1 and 3 with the halfway answers for pairings in which the candidate answered the halfway questions and the round of the pairing ("order") was at least 50% of the number of pairing rounds ("round") for that night. The second part rescales the answers to questions 1, 2, and 4 by proportionally allocating the 100 points based on the scores provided by the candidate for each question, regardless of wave.
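The proportional rescaling in the second part can be illustrated with a few lines of Python. This is a sketch under simple assumptions (non-negative scores, no missing values), not the body of util.FixAmbigiousScores, and the function name and sample answers are made up.

```python
def rescale_to_budget(scores, budget=100.0):
    """Scale non-negative scores proportionally so that they sum to `budget`."""
    total = sum(scores)
    if total == 0:
        return list(scores)  # nothing to distribute; leave the answers unchanged
    return [budget * score / total for score in scores]


# A made-up 1-10 style answer to question 1 (six attributes)
scale_answer = [8, 6, 9, 7, 5, 5]
rescaled = rescale_to_budget(scale_answer)

# A made-up budget answer that does not quite add up to 100
loose_budget_answer = [30, 20, 20, 15, 10, 10]
fixed = rescale_to_budget(loose_budget_answer)
```

Because the same function is applied regardless of wave, answers on the 1-10 scale and answers that overshoot or undershoot the budget all end up on a common 100 point scale.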

I.h) Replace nans with modes and means (notebooks/machineLearning.ipynb)

Lastly, I created a nanReplacementDictionary that stores a default value for each column to replace nans in the training, test, and full datasets. These nans are replaced before the data frame is passed into training, testing, or full models. For number columns with fewer than 25 unique values, the nan replacement value is the mode of that column in the training set. All other columns receive the mean of that column in the training set.
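A small pandas sketch of this rule follows. The function name and toy values are hypothetical, and it assumes every column is numeric by this point in the pipeline.

```python
import pandas as pd

MAX_UNIQUE = 25  # same cardinality threshold as in the dummification step


def build_nan_replacements(train: pd.DataFrame, max_unique: int = MAX_UNIQUE) -> dict:
    """Mode for low-cardinality columns, mean otherwise, from training data only."""
    replacements = {}
    for column in train.columns:
        values = train[column].dropna()
        if values.empty:
            continue  # no information to build a default from
        if values.nunique() < max_unique:
            replacements[column] = values.mode().iloc[0]
        else:
            replacements[column] = values.mean()
    return replacements


# Toy training data: go_out has 2 distinct values, income has 29.
train = pd.DataFrame(
    {
        "go_out": [1.0] * 20 + [2.0] * 9 + [None],
        "income": [float(i) for i in range(29)] + [None],
    }
)
nan_replacement_dictionary = build_nan_replacements(train)

test = pd.DataFrame({"go_out": [None, 3.0], "income": [None, 50.0]})
filled_test = test.fillna(nan_replacement_dictionary)
```

Computing the defaults from the training set alone and reusing them for the test and full data keeps information from the test rows out of the preprocessing.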

II. Model Preparation (notebooks/machineLearning.ipynb)

II.a) Logistic Regression

Originally this was supposed to be a parameterless Logistic Regression. However, the fit was not converging. Thus, make_pipeline was called to standardize the data via a StandardScaler() and then fit a Logistic Regression with the max iterations parameter set to 1e9.
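In scikit-learn terms, the pipeline described here looks roughly like the sketch below. The synthetic data is a stand-in for the processed datingTrain, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook fits on the processed datingTrain.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# max_iter has to be an integer, so 1e9 is written as 10**9 here.
log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10**9))
log_model.fit(X, y)

# Probability of the positive class (a match) for each row
match_probability = log_model.predict_proba(X)[:, 1]
```

Standardizing first puts every feature on a comparable scale, which is usually what lets the solver converge long before the iteration cap is reached.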

II.b) K-Nearest Neighbors

Self-improvement advice often repeats the claim that people become the average of the five people they associate with the most. Thus, a K-Nearest Neighbor model with n_neighbors=5 felt appropriate to include. A common rule of thumb for n_neighbors in K-Nearest Neighbor models is the square root of the number of data points examined. Low n_neighbors values tend to overfit (high variance). High n_neighbors values tend to underfit (high bias). Adding both to an ensemble would get the best of both worlds: the overfit model could find edge cases, and the underfit model could paint a general picture of the data points in a certain cluster. As a result, I included knn5 and knnsqrtn as options for the ensemble.

II.c) Gradient Boosting Classifiers

Gradient Boosting Classifiers start with a weak learner and improve it in steps. At each step, the model looks at the gradient of the loss and makes a correction proportional to the gradient times a learning rate. Lower learning rates take longer to minimize the error. Higher learning rates tend to overshoot and miss the target. Knowing this, I provided 2 Gradient Boosters: gradientdeci (learning rate = 0.1) and gradientdeka (learning rate = 10).

II.d) Decision Trees

Decision trees were not originally considered for the ensemble set. However, seeing that recallForest, which will be mentioned in Methods II.e, was getting low recall, I realized that forests cannot be extreme enough to catch edge cases. However, a single tree can. The first tree, dubbed recallTree, grid searched for best recall with the following potential parameters:

  • criterion: gini, entropy, and log_loss
  • max_depth: sqrtfeatures, log2features, and thirdGeometricTerm
  • max_features: sqrtfeatures, log2features, and thirdGeometricTerm

Sqrtfeatures is the square root of the total number of features, cast as an integer. Log2features is the log2 of the total number of features, cast as an integer. ThirdGeometricTerm is calculated by taking the larger of sqrtfeatures and log2features, dividing it by the smaller one, multiplying that quotient by the larger term, and casting the result as an integer. When ordered from smallest to largest, the three values form (up to integer rounding) a finite geometric sequence whose last value is the third geometric term.
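As a worked example of the three values, the sketch below follows the description literally. The function name is hypothetical, and it assumes at least 2 features so that log2features is not zero.

```python
import math


def depth_and_feature_options(n_features: int):
    """Return [smaller, larger, thirdGeometricTerm] for a given feature count."""
    sqrt_features = int(math.sqrt(n_features))
    log2_features = int(math.log2(n_features))
    larger = max(sqrt_features, log2_features)
    smaller = min(sqrt_features, log2_features)
    # common ratio (larger / smaller) applied once more to the larger term
    third_geometric_term = int(larger / smaller * larger)
    return [smaller, larger, third_geometric_term]


options_for_100 = depth_and_feature_options(100)  # sqrt -> 10, log2 -> 6
options_for_64 = depth_and_feature_options(64)    # sqrt -> 8, log2 -> 6
```

With 100 features the options are 6, 10, and 16, so the ratio between neighbors is roughly 1.67 each time; the integer cast is why the sequence is only approximately geometric.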

After recallTree has picked its parameters, the parameters it drafted are removed from the selection, and preciseTree grid searches through the remaining parameters. recallTree will have the best recall, but preciseTree will only have the best precision achievable with the remaining parameters. In my period of experimentation with the forest equivalents of these trees, recall scores were always dwarfed by precision scores. Letting recall draft first gave the best recall possible without damaging precision's chances of scoring well.
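The draft can be pictured as removing the recall winner's values from the grid before the second search. The sketch below is one reading of that step: it assumes removal happens per parameter value, and the grid values and the "winning" parameters are made up for illustration.

```python
def remove_drafted(param_grid: dict, drafted: dict) -> dict:
    """Drop each drafted value from its parameter's list of candidates."""
    return {
        name: [
            value
            for value in values
            if name not in drafted or value != drafted[name]
        ]
        for name, values in param_grid.items()
    }


param_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [6, 10, 16],
    "max_features": [6, 10, 16],
}

# Hypothetical result of the recall-scored search, e.g. the best_params_
# of a grid search that used scoring="recall".
recall_params = {"criterion": "entropy", "max_depth": 16, "max_features": 6}

# Grid left over for the precision-scored search
precision_grid = remove_drafted(param_grid, recall_params)
```

The remaining grid would then be passed to a second grid search scored on precision, so the two trees can never end up with identical settings.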

II.e) Random Forests

The random forest preparation process is identical to the decision tree preparation, with one addition: recallForest and preciseForest get to add the n_estimators options 100, 200, and 300 to their grid searches.

III. Dash App preparation (notebooks/machineLearning.ipynb & src/collectionPipeLine.py)

After training the models, the remainder of machineLearning.ipynb is used for initial testing and analysis of each model, individually and together as a 9 model VotingClassifier, on a processed datingTest. This notebook also applies the data cleaning process to datingFull. Once datingFull.csv is processed, the notebook passes the following from data/processedData into data/plotlyDashData:

  • datingTrain.csv to train the models for collectionPipeLine.py and app.py
  • datingTest.csv to provide testing analysis for collectionPipeLine.py and app.py
  • datingFull.csv to assist collectionPipeLine.py and app.py in feature analysis for distribution, statistics, and correlation figures in the app
  • columnDataDictionary.json to assist collectionPipeLine.py and app.py in populating the featureSelect dropdown in featureAnalysis page and identifying partner columns
  • dummyDictionary.json to help collectionPipeLine.py and app.py associate raw non-dummy column names in the featureSelect dropdown with the dummy columns in the data
  • treeParams.json to pass the preferred parameters of recallTree and preciseTree to collectionPipeLine.py and app.py
  • forestParams.json to pass the preferred parameters of recallForest and preciseForest to collectionPipeLine.py and app.py

Before continuing the discussion of collectionPipeLine.py, I need to discuss two JSON files that are not in the data folder but sit at the root of the repository, for reasons explained in this paragraph. When somebody uses my app and looks at dummified values, such as race, field of study, or how often somebody goes on a date, they would like to see what each value represents. No one will understand what race_1, field_cd_10, and date_7 mean when they look at the figures unless they download the zip file from Kaggle and open the speed dating data key document. Additionally, users would like to see descriptions of what the non-knn submodels report as their top 10 deciding features. I leave it as an exercise for the reader to imagine what the feature analysis page would look like if I used column names without descriptions. Hence, I manually created dummyValueDictionary.json (which associates the dummy columns with the values they represent) and descriptionDictionary.json (which describes everything else that is feature related). I keep these at the root of the repository because I faced .gitignore inconsistencies between the main and deployment branches and lost these two dictionaries while switching branches. Due to time constraints, and thinking this was going to be a one-time creation of each JSON file, I wrote no code to generate them. I am glad that online regex tools, JupyterLab, and the JSON formatting and editing features in Visual Studio Code exist. Readers who have the time and patience to reinvent these wheels while using my code can feel free to do so; those who do not can access these dictionary JSON files from the repository.

Knowing that these two dictionaries exist in the root folder, we can now segue back to collectionPipeLine.py, which looks for the data/plotlyDashData files mentioned at the beginning of this section and the two JSON dictionaries in the root folder. To reduce computation on the site, calculations that can be done once and do not need to be redone on user interaction are done in collectionPipeLine.py. Running collectionPipeLine.py fills data/plotlyDashData with the following:

  • malePredictions.csv, femalePredictions.csv, and overallPredictions.csv allow app.py to grab the prediction values of each trained submodel for the corresponding data frame.
  • collectionDictionary.json saves 5 json convertible objects:
    • modelDescriptionDictionary describes the submodels
    • matrixDictionary creates the confusion matrices for the submodels and the 9 model fullEnsemble Classifier
    • metricsTable provides the accuracy, recall, and precision of the submodels and the 9 model fullEnsemble Classifier
    • significantFeaturesDictionary gives the descriptions of the top 10 deciding factors for each non-knn submodel. The logistic regression uses the features tied to the 10 largest absolute value coefficients. The other six models use their top 10 feature importances.
    • featureSelectOptions is a list of dictionaries used for featureSelectDropdown. Label is the feature name and description. Value is the feature name.
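
The top 10 selection for significantFeaturesDictionary can be sketched with one helper that ranks by absolute value. This is an illustration only: the function name is hypothetical and the coefficient values are made up, though the column names come from the dataset.

```python
def top_features(names, weights, k=10):
    """Return the k feature names with the largest absolute weights."""
    ranked = sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up logistic regression coefficients for three columns
names = ["match_es", "go_out", "expnum"]
coefficients = [0.2, -1.5, 0.7]
top_two = top_features(names, coefficients, k=2)
```

The same helper works for tree-based models, since feature importances are non-negative and taking the absolute value leaves them unchanged.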

IV. Dash App Functionality And Figures (app.py)

At last, we have arrived at the app and its figures. (We thank those who skipped ahead to this section for their patience in waiting for us while we toured the sausage factory.) There are two main pages: Matchmakers and Feature Analysis.

IV.a) Matchmakers

The name, Matchmakers, harks back to the original proposal of this project. The project changed focus from a matchmaking app to a data science app when I realized I needed to write this paper, which requires me to provide a tool to analyze the dataset.

The first section is ensemble metrics, which displays the confusion matrix on the left alongside the accuracy, recall, and precision metrics of the ensemble. As a reminder:

  • accuracy = (TP + TN)/(TP + TN + FP + FN), which measures correctness on both positive and negative values
  • recall = TP/(TP + FN), which measures how well the model is able to grab all positive values
  • precision = TP/(TP + FP), which measures the correctness of all positive guesses
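
These three formulas translate directly into code. In the sketch below, the function name is hypothetical and returning 0 when a denominator is empty is an assumption; the example counts are the test set totals quoted in the Discussion (1413 failed pairings and 263 matches), applied to a model that predicts failure for every row.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, and precision from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Assumption: report 0.0 when there is nothing to divide by.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# A model that predicts "no match" for all 1676 test rows
always_no = classification_metrics(tp=0, tn=1413, fp=0, fn=263)
```

This shows how a model can reach about 0.84 accuracy while having zero recall and zero precision, which is why accuracy alone is a poor guide on data where matches are rare.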

The sections following the ensemble describe the submodels in the same way, except that if the submodel is not a knn, its top 10 deciding features are displayed below the matrix and metrics bar chart.

Toggling the check boxes above will show and hide submodel information and will also update the ensemble and its metrics.

IV.b) Feature Analysis

After models have been selected, the user can pick a feature from the featureSelect dropdown. The dropdown triggers callbacks that update three or nine figures.

IV.b.i) Samerace

Samerace is one of the three parameters shared between candidate and partner. Unlike its siblings, partnerDistance and int_corr, which are continuous and have variety, samerace is binary. Thus, it is very similar to the dummy feature screens in IV.b.ii. Since each pairing is listed twice, once where the male is the candidate and once where the female is the candidate, showing separate male and female sections is redundant. Thus, this is the only feature that shows one section of three figures. Its distribution figure is a bar chart of the number of rows where the couple is of the same race and the number of rows where the couple is of different races. The statistics figure is a bar chart displaying the average probability and standard error for same race pairs and different race pairs. Lastly, the correlation graph is a color scale scatter plot where the data point coordinates are (samerace value, prediction probability from the selected ensemble) and the color is the actual value.

IV.b.ii) Dummy features

Selecting a dummy feature will update nine figures composed of the combination of genderType data frame – male, female, and overall – and feature analysis figure type – distribution, statistics, and correlation. The distribution plot is a bar chart showing the counts of rows with that value. For statistics, the bar chart shows the mean and standard error of predicted probabilities of each possible value for that feature in the data frame with the selected ensemble. Lastly, the correlation bar chart shows the R correlation value between the predicted probability of the selected ensemble versus the feature choice.
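The statistics bar chart needs a mean and a standard error per feature value. A minimal sketch of that calculation is below; the function names and the probabilities are made up, and it assumes the standard error is the sample standard deviation divided by the square root of the group size.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean and standard error (sample stdev / sqrt(n)) of a list of numbers."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, float("nan")  # standard error is undefined for one value
    return mean, statistics.stdev(values) / math.sqrt(len(values))


def summarize_by_value(feature_values, predicted_probabilities):
    """Group predicted probabilities by feature value and summarize each group."""
    groups = {}
    for value, probability in zip(feature_values, predicted_probabilities):
        groups.setdefault(value, []).append(probability)
    return {
        value: mean_and_standard_error(probabilities)
        for value, probabilities in groups.items()
    }


# Made-up ensemble probabilities for a binary dummy column
feature_values = [0, 0, 0, 1, 1, 1]
predicted = [0.1, 0.2, 0.3, 0.2, 0.2, 0.2]
summary = summarize_by_value(feature_values, predicted)
```

Each entry of summary is a (mean, standard error) pair, which maps onto a bar height and its error bar.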

IV.b.iii) Continuous features

Like the dummy features, selecting a continuous variable will update nine figures based on the combination of genderType data frame and figure type. The distribution figure of a continuous variable is a Plotly figure factory distplot displaying the histogram and KDE of the column data. The statistics plot is a scatter plot of the selected ensemble's predicted probability versus the feature value. Finally, the correlation bar chart displays the correlation of the feature values not only with the selected ensemble's predicted probabilities but also with the predicted probabilities of each submodel in the ensemble.

Results

I. Model evaluations

I.a) fullEnsemble

The results of its confusion matrix give the full 9 model ensemble a 0.84 in accuracy, a 0.05 in recall, and a 0.57 in precision.

I.b) logModel

logModel achieved a 0.84 in accuracy, 0.45 in precision, and a 0.08 in recall. Features of interest when using this model are race_o, career_c, go_out, match_es, movies_o, matches_es_o, career_c_o, movies, field_cd_o, and go_out_o.

I.c) knn5

knn5 gets a 0.82 in accuracy, 0.31 in precision, and a 0.11 in recall.

I.d) knnsqrtn

knnsqrtn earns a 0.84 in accuracy, a 0 in precision, and a 0 in recall.

I.e) gradientdeci

gradientdeci performs with 0.85 accuracy, 0.63 precision, and 0.08 recall. Features of interest include match_es_o, match_es, partnerDistance, shar2_1_o, lats_o, shar2_1, lats, expnum_o, attr1_1, and attr1_1_o.

I.f) gradientdeka

The model, gradientdeka, scores 0.17, 0.16, and 0.98 in accuracy, precision, and recall respectively. gradientdeka suggests that users look at fun2_1_o, attr1_1, match_es_o, match_es, exercise_o, fun5_1_o, sinc1_1, int1_1, sha_1_1_o, and attr2_1.

I.g) preciseTree

preciseTree has an accuracy of 0.84, a precision of 0.37, and a recall of 0.03. It suggests looking at matches_o, expnum, income, sinc1_1, intel1_1, music_o, attr4_1, yoga_o, go_out, and intel2_1.

I.h) recallTree

recallTree scores 0.75, 0.22, and 0.25 for accuracy, precision, and recall. Users should look at int_corr, match_es, partnerDistance, lats_o, match_es_o, fun2_1, clubbing_o, tv_o, sinc2_1, and lats.

I.i) preciseForest

The accuracy, precision, and recall scores of preciseForest are 0.84, 0.5, and 0.02. Features of interest with this model are match_es, match_es_o, partnerDistance, int_corr, lats_o, lats, lons, lons_o, attr1_1_o, and attr2_1.

I.j) recallForest

recallForest's accuracy is 0.85. Its precision is 0.57. Lastly, its recall is 0.06. When using recallForest, users might be interested in partnerDistance, match_es_o, match_es, int_corr, lats_o, lats, lons_o, lons, attr1_1_o, and attr2_1.

I.k) slimEnsemble

The ensemble consisting of logModel, preciseTree, and recallTree scores 0.84 in accuracy, 0.57 in precision, and 0.05 in recall.

II. Samerace: The boolean value that expresses if the pair belong or do not belong to the same race

There were 3316 pairs that were of the same race. The remaining 5062 pairs were of different races. Same race pairs had a 0.170 +/- 0.003 probability of matching. Different race pairs had a 0.165 +/- 0.002 probability of matching. The Spearman correlation for the samerace value, which was either zero or one, was 0.03, with a p-value that rounds to 0.0.

III. Expnum: Candidate's answers to "Out of the 20 people you will meet, how many do you expect will be interested in dating you"

When males were asked the question, "Out of the 20 people you will meet, how many do you expect will be interested in dating you", the average answer was 5.57 +/- 0.04. When examining the individual components, recallTree saw a correlation value of 0.01 (p=0.71), preciseTree saw a correlation of 0.07 (p=0.0), and logModel scored the correlation at 0.17 (p=0.9). Together, the slimEnsemble observed a Spearman correlation of 0.12 (p=0.0) for men's answers to the expected number of likes question.

Looking at the women's numbers, they expected that 5.54 +/- 0.03 people would be interested in dating them. Among the slimEnsemble components, recallTree reported a 0.02 correlation (p=0.33), preciseTree relayed a 0.04 correlation (p=0.01), and logModel interpreted a 0.6 correlation (p=0.01). Together, the slimEnsemble calculates a correlation of 0.04 (p=0.01) between women's responses to "Out of the 20 people you will meet, how many do you expect will be interested in dating you" and their match probability.

In general, the average candidate expected 5.55 +/- 0.02 people to be interested in them. recallTree found a 0.01 (p=0.34) correlation between match probability and responses. preciseTree and logModel determined the correlation between response and match probability to be 0.06 (p=0.0) and 0.11 (p=0.0), respectively. The collective of the three models put the correlation between people's responses to "Out of the 20 people you will meet, how many do you expect will be interested in dating you" and their match probability at 0.10 (p=0.0).

Looking at all three scatterplots, there was a variety of combinations of predicted match probability and expnum answer, but no clear trend is visible.

IV. Go_Out: How frequently a candidate goes out (not necessarily on a date)

When asked how often they go out, 20 men left the answer blank, 19 men said almost never, 78 men reported several times a year, 124 men claimed once a month, 169 men stated twice a month, 888 men answered once per week, 1628 men said twice per week, and 1268 boasted several times a week. In the same order, the match probabilities for each go_out response were 0.19 +/- 0.14, 0.06 +/- 0.02, 0.16 +/- 0.02, 0.11 +/- 0.01, 0.117 +/- 0.009, 0.150 +/- 0.005, 0.162 +/- 0.004, and 0.198 +/- 0.005. Likewise, their Spearman R correlation values and p-values are 0.03 (p=0.05), -0.07 (p=0.0), 0.04 (p=0.01), -0.09 (p=0.0), -0.10 (p=0.0), -0.05 (p=0.0), 0.0 (p=0.96), and 0.12 (p=0.0).

When women were asked about their social frequency, 59 did not specify it, 18 sheltered it at almost never, 21 preserved it at several times a year, 40 reserved it at once a month, 281 incorporated it twice a month, 1061 kept it at once a week, 1362 maintained it at twice a week, and 1342 flaunted it at several times a week. In the same order as they are listed, the prediction probabilities are 0.19 +/- 0.02, 0.044 +/- 0.003, 0.14 +/- 0.02, 0.20 +/- 0.03, 0.150 +/- 0.009, 0.154 +/- 0.004, 0.159 +/- 0.004, and 0.184 +/- 0.005. Similarly, the correlations and p-values are 0.02 (p=0.31), -0.10 (p=0.0), -0.01 (p=0.57), 0.02 (p=0.3), -0.02 (p=0.21), -0.11 (p=0.0), -0.06 (p=0.0), and 0.18 (p=0.0).

In total, the speed dating experiment surveyed 79 people who left that answer blank, 37 people who almost never go out, 99 people who go out several times a year, 164 people who go out once a month, 450 who go out twice a month, 1949 who go out once a week, 2990 who go out twice a week, and 2610 who go out several times a week. Using the same order, the probability of match values are 0.18 +/- 0.02, 0.030 +/- 0.002, 0.17 +/- 0.01, 0.14 +/- 0.01, 0.136 +/- 0.06, 0.166 +/- 0.003, and 0.189 +/- 0.003. The corresponding R and p values are -0.01 (p=0.5), -0.09 (p=0.0), 0.02 (p=0.1), -0.02 (p=0.14), -0.06 (p=0.0), -0.07 (p=0.0), 0.02 (p=0.11), and 0.09 (p=0.0).

Discussion

I. Model evaluations

When looking at the confusion matrices, one can see there are a total of 1413 failed pairings and 263 successful matches in the test data set. Only 16% of the test data are actual successful matches. This 16% success rate also holds for the full and training sets.

Reviewing the test scores (which are the scores displayed in the app) against the training scores (which were overlooked until writing the methods section) shows that all models except recallTree and both forests seem balanced in accuracy; recallTree and both forests, on the other hand, are overfit. All models, minus gradientdeka, score well in accuracy.

With regard to precision, all models except knnsqrtn and gradientdeka seem to be overfit; knnsqrtn and gradientdeka are balanced between training and testing. Excluding gradientdeka, knnsqrtn, and both trees, the remaining models do OK in testing precision.

Lastly, in examining recall, fullEnsemble, knn5, recallTree, preciseForest, recallForest, and slimEnsemble are overfit, while logModel, knnsqrtn, gradientdeci, gradientdeka, and preciseTree are balanced. Training recall went well for gradientdeka, recallTree, and recallForest, but gradientdeka was the only model that had good testing recall.

Knnsqrtn is an interesting model. It is an overly biased model that predicts everything as a failure. It fails everything because, when it predicts a classification label for a datum, it bases its decision on the k nearest neighbors, where k is the square root of the number of training data points, i.e. sqrt(n_training). That calculation shows that each data point is classified by its 82 closest pairings. With only 16% of the data being successes, the probability of finding at least 41 nearest neighbors that can point a single row to a success target is very low, and is calculable via Bayes' theorem.
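To get a feel for how low that probability is, one can use a rough binomial approximation. This is not the Bayes' theorem calculation mentioned above, and it rests on a strong assumption: that each of the 82 neighbors is an independent draw with a 16% chance of being a success, which real nearest neighbors are not. The function name is hypothetical.

```python
import math


def binomial_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


# Chance that at least 41 of 82 independent neighbors are successes
probability_of_success_vote = binomial_tail(n=82, k_min=41, p=0.16)
```

Under this simplification the probability is vanishingly small, which is consistent with knnsqrtn never predicting a match. Clustering of successful pairings would raise it, so the figure is best read as an intuition aid.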

The extreme counterpart to this is gradientdeka, which stamps success on almost all of the test data. This might be due to gradientdeka's learning rate of 10, which would tend to overshoot until it lands on a random minimum instead of the local minimum.

Excluding the overly pessimistic knnsqrtn and the overly optimistic gradientdeka, all models, including both fullEnsemble and slimEnsemble, score well in accuracy because they are able to identify the majority of match failures. These 9 clear-headed models also have moderate to high testing precision due to their low false positive counts. Lastly, the seven sober sub-models individually do not do well with recall because only 16% of the training data is available to help identify the 16% of the test data that are successes. VotingClassifiers that exclude the inebriated models, like slimEnsemble, will easily identify failures because the submodels agree on what the failures are, but they will struggle to identify the 16% because the submodels disagree on which pairs are successful. In fullEnsemble, which includes one model that always votes down and one model that mostly votes up, those two models balance each other whenever the mostly-up model votes up. When the mostly-up model votes down, the pairing in question has an automatic two votes against it, with the remaining 7 models, which are highly trained in identifying failure, deciding the pair's fate.

II. Samerace: Does racial similarity/difference affect match success?

Knowing the number of data points, the means, and the standard errors, one can perform a t-test to determine statistical significance. Consulting the online ttest2 calculator on GraphPad, the two-tailed p-value is 1.00, showing no significant difference between same race and interracial pairings. Correlation analysis suggests that there is a significant 0.03 correlation in favor of same race pairs, but is a 0.03 correlation score really a correlation? The t-test for the statistics graph and the scatterplot seem to say it is not.
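For readers who prefer code to an online calculator, a comparison of two means from their standard errors can be sketched as below. This is an approximation, not the calculator's method: it uses an unequal-variance statistic and a normal approximation to the p-value (reasonable with thousands of rows per group), and it works from the rounded figures quoted in the Results, so its output need not match the calculator's.

```python
import math


def compare_means(mean_a, sem_a, mean_b, sem_b):
    """Two-sided test of a difference in means, given each mean's standard error."""
    statistic = (mean_a - mean_b) / math.sqrt(sem_a**2 + sem_b**2)
    # Normal approximation to the two-sided p-value
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value


# Same race pairs vs different race pairs, from the rounded values in Results II
statistic, p_value = compare_means(0.170, 0.003, 0.165, 0.002)
is_significant = p_value < 0.05
```

Working from standard errors directly avoids having to convert them back to standard deviations, which is a common source of mistakes when entering summary data into a calculator.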

III. Expnum: Does one's answer to "Out of the 20 people you will meet, how many do you expect will be interested in dating you" affect match success?

The distribution of expnum values in both the distribution and the statistics plots shows that a lot of NaN replacement was used: for a continuous variable like expnum, each missing value is replaced with the mean of the non-NaN values. All nine graphs are very similar to each other, showing only small correlations with success, as in the same vs different race discussion (Discussion II). The notable difference is that the slimEnsemble’s correlation score for both genders is 0.1, which falls between the men’s 0.12 and the women’s 0.04.
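
Mean replacement of missing values can be sketched in a few lines. The helper `impute_with_mean` and the sample answers are hypothetical and are not taken from the project's notebooks.

```python
from math import isnan, nan
from statistics import mean

def impute_with_mean(values):
    """Replace NaN entries with the mean of the non-NaN entries."""
    observed = [v for v in values if not isnan(v)]
    fill = mean(observed)
    return [fill if isnan(v) else v for v in values]

# Hypothetical expnum answers; nan marks a missing response.
expnum = [2.0, nan, 10.0, nan, 6.0]
imputed = impute_with_mean(expnum)
print(imputed)
```

Every imputed row receives the same value, so heavy imputation shows up as a pile of points at the mean in a distribution plot.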

These correlations are small, but they are more notable than the correlation score calculated in the samerace analysis. At this point I acknowledge that there is a significant but small correlation between expnum and match probability, but more investigation should be done before reaching any conclusions.
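
A correlation can be both small and statistically significant when the sample is large. The sketch below converts a Pearson correlation r from n points into an approximate two-sided p-value, using the standard t statistic for a correlation and a normal approximation to its distribution. The sample sizes are hypothetical; only the r values echo the ones quoted in the text.

```python
from math import erfc, sqrt

def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson correlation r from n points."""
    t_stat = r * sqrt((n - 2) / (1 - r**2))
    return erfc(abs(t_stat) / sqrt(2))  # normal approximation to the t distribution

# n values are hypothetical; r values echo the correlations quoted in the text.
for r, n in [(0.03, 8000), (0.03, 100), (0.10, 8000)]:
    print(f"r={r:.2f}, n={n}: p ~ {correlation_p_value(r, n):.4f}")
```

The same r of 0.03 passes a 0.05 threshold with thousands of rows but not with a hundred, which is why significance alone says little about the strength of the relationship.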

IV. Go_Out: Does how frequent one goes out affect match success?

Seeing that the best correlation on the three correlation graphs is just above 0.1, I’m going to ignore the correlation graphs for the reasons given in Discussion II and Discussion III. Correlation makes sense for continuous values like expnum, but it does not make sense for boolean and dummy values like samerace and go_out.

To make things easier, I used a for loop to run an individual t-test between the probability values of each pair of go_out values, for each gender dataset. Below are the tables displaying the t-test p-values between each go_out value for each gender type.
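
A loop of this kind might look like the sketch below. It is not the code from the project's notebooks: the helper names, the group labels, and the probability values are hypothetical, and the p-value uses a normal approximation, whereas a library routine such as SciPy's `scipy.stats.ttest_ind` would use the t distribution.

```python
from itertools import combinations
from math import erfc, sqrt
from statistics import mean, variance

def welch_p_value(sample_a, sample_b):
    """Approximate two-sided p-value for a difference in means (large samples)."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    t_stat = (mean(sample_a) - mean(sample_b)) / se
    return erfc(abs(t_stat) / sqrt(2))

def pairwise_p_values(groups):
    """Map each pair of group labels to the p-value of its t-test."""
    return {
        (a, b): welch_p_value(groups[a], groups[b])
        for a, b in combinations(groups, 2)
    }

# Hypothetical predicted match probabilities per go_out answer.
groups = {
    "several times a week": [0.30, 0.32, 0.28, 0.31, 0.29] * 10,
    "once a month": [0.15, 0.17, 0.13, 0.16, 0.14] * 10,
    "almost never": [0.15, 0.16, 0.14, 0.17, 0.13] * 10,
}
p_table = pairwise_p_values(groups)
for pair, p in p_table.items():
    print(pair, round(p, 4))
```

Running the loop once per gender dataset would produce one p-value table per gender, with each cell holding the comparison between two go_out answers.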

Based on the male p-value table, there are significant differences between the following value pairs:

  • Several times a week vs (twice a week / once a week / twice a month / once a month / almost never)
  • Twice a week vs (once a week / twice a month / once a month / almost never)
  • Once a week vs (twice a month / once a month / almost never)
  • Twice a month vs Several times a year
  • Once a month vs (Several times a year / None Specified)
  • Several times a year vs Almost Never

With these rejected null hypotheses we can assert the following alternative hypotheses regarding probability of match for males:

  • P(Sev x per wk) > P(2x per wk)
  • P(Sev x per wk) > P(1x per wk)
  • P(Sev x per wk) > P(1x per mo)
  • P(Sev x per wk) > P(Almost Never)
  • P(2x per wk) > P(1x per wk)
  • P(2x per wk) > P(1x per mo)
  • P(2x per wk) > P(Almost Never)
  • P(1x per wk) > P(2x per mo)
  • P(1x per wk) > P(1x per mo)
  • P(1x per wk) > P(Almost Never)
  • P(Sev x per yr) > P(2x per mo)
  • P(Sev x per yr) > P(1x per mo)
  • P(None Specified) > P(1x per mo)
  • P(Sev x per yr) > P(Almost Never)

For females, the significant differences can be found in:

  • Several times a week vs (twice a week / once a week / twice a month / almost never)
  • Twice a week vs almost never
  • Once a week vs (once a month / almost never)
  • Twice a month vs almost never
  • Once a month vs almost never
  • Several times a year vs almost never
  • Almost never vs blank

With those tests, the relations between female match probabilities based on social frequency are as follows:

  • P(Sev x per wk) > P(2x per wk)
  • P(Sev x per wk) > P(1x per wk)
  • P(Sev x per wk) > P(1x per mo)
  • P(Sev x per wk) > P(Almost Never)
  • P(2x per wk) > P(Almost Never)
  • P(1x per wk) > P(1x per mo)
  • P(1x per wk) > P(Almost Never)
  • P(2x per mo) > P(Almost Never)
  • P(1x per mo) > P(Almost Never)
  • P(Sev x per yr) > P(Almost Never)
  • P(None Specified) > P(Almost Never)

With both genders combined, the significant differences appear in the p-values of the following t-tests:

  • Several times a week vs (Twice a week / once a week / twice a month / once a month / almost never)
  • Twice a week vs (once a week / twice a month / once a month / almost never)
  • Once a week vs almost never
  • Twice a month vs almost never
  • Once a month vs (several times a year / almost never)
  • Several times a year vs almost never
  • Almost never vs left blank

For both genders, the match probability relationships are as follows:

  • P(Sev x per wk) > P(2x per wk)
  • P(Sev x per wk) > P(1x per wk)
  • P(Sev x per wk) > P(2x per mo)
  • P(Sev x per wk) > P(1x per mo)
  • P(Sev x per wk) > P(Almost Never)
  • P(2x per wk) > P(1x per wk)
  • P(2x per wk) > P(2x per mo)
  • P(2x per wk) > P(1x per mo)
  • P(2x per wk) > P(Almost Never)
  • P(1x per wk) > P(Almost Never)
  • P(2x per mo) > P(Almost Never)
  • P(1x per mo) > P(Sev x per yr)
  • P(1x per mo) > P(Almost Never)
  • P(Sev x per yr) > P(Almost Never)
  • P(None Specified) > P(Almost Never)

In broad strokes, the trend seems to be that being more social increases match probability, with some exceptions.

Conclusion and Future Works

Using the logistic regression, recallTree, and preciseTree models on a speed dating dataset where only 16% of the pairs are matches, I investigated the samerace and expnum variables and the dummy variables listed under go_out. In this investigation I conclude:

  • There is no statistically significant difference between matching within the same race and matching in an interracial pairing.
  • There is a small correlation between how one answers "Out of the 20 people you will meet, how many do you expect will be interested in dating you?" and the probability of matching.
  • There is a general trend illustrating that higher social frequencies increase the probability of matching, with some exceptions.

Many learning experiences occurred while creating this project, preparing for the presentation, and writing this blog. The first order of business is optimizing calculations in the deployment code. The app is deployed on Heroku, which has a 30 second timeout that cannot be adjusted, so calculations that run over 30 seconds crash the app. This is the main reason I chose to examine the questions in this blog with the slimEnsemble, which requires few comparisons, few steps, and few estimators. However, users would be interested in experimenting with the larger ensembles. This will be fixed as soon as possible.

After that is fixed, all other changes are optional, but my own curiosity pushes me to improve this app. Thus, at my leisure I will create and deploy a follow-up project. This project will have the following differences:

  • The name of the app needs to be changed. This is a speed dating dataset, not a blind dating dataset.
  • knnsqrtn and gradientdeka need to be replaced. These two models are too extreme in their methods. I will be experimenting and searching for a biased knn model that can vote at least somewhat optimistically and a high learning rate gradient boosting classifier that will consider more fail cases.
  • Certain figures need to be improved or replaced
    • The model metrics figure needs training metrics added to show how well each model fits.
    • The samerace correlation color coded scatterplot is good at illustrating that there are successes and failures both when pairing within the same race and when pairing interracially. However, there could be a better way of displaying that. I am open to suggestions on what I can do with this figure.
    • The statistics figure for continuous variables (see the expnum section) can probably be color coded according to the actual match result. This could give us a clearer trend of what is happening than a cloud of blue dots. Also, seeing that most of the data lines up in columns, grouping figures like box plots and violin plots could be considered.
    • Correlation figures for dummy data, like go_out, need to be replaced with t-test p-value tables. Dummy data correlation does not make sense: most of the input values for dummy variables are ones and zeroes, and a binary answer should not be correlated with a continuous output. Looking at the t-test p-values helps identify significant differences between dummy values in a feature.

If the newly deployed app is satisfactory, I'll explore some other variables and consider writing a follow-up blog. Here are some areas I would like to explore:

  • matche_es: This continuous variable consists of the candidate's answers to "How many matches do you estimate you will get?". It came up multiple times when examining feature importances. Like expnum, it is a measurement of dating confidence.
  • partnerDistance, lats, lats_o, lons, lons_o, and income: Distance between user hometowns does play a key role in picking a partner. Additionally, latitude and longitude themselves could also play a role. James Clear, author of Atomic Habits, expresses in one of his blogs that Eurasia was able to advance faster than Africa and the Americas due to continent shape, because East-West travel is easier than North-South travel. This is due to the climate being more consistent when traveling East-West than when traveling North-South. Maybe the climate differences between latitudes could affect partner match. When talking about longitudes in America, YouTuber RealLifeLore points out in his video on the 100th Meridian that plant life, water resources, and urban/rural development are different between regions East and West of the 100th meridian. Add the factor of time zones and what was mentioned about latitude, and one can see that the latitude and longitude coordinates of a pair's hometowns could affect matching. In fact, the income variable in the speed dating dataset is not the income of the participant, but the median income of the participant's home zipcode. Location related variables might be key features to look at in the next blog.
  • Interests and questionnaire related questions: There is a lot to explore just in understanding people's interests and how they perceive the dating market.

There are a lot of interesting aspects to view with this dataset alone, let alone how to approach modeling it. Hopefully, through studying it, we can find helpful insights for those who struggle in trying to get into the dating market. After all, matching is just the beginning of the difficulties of a relationship. Like how the models use different algorithms to calculate the data, we humans use the algorithms we have been given from youth to find a partner we predict as compatible. Alain de Botton wrote a New York Times essay called "Why you will marry the wrong person". In a 30 minute YouTube video with the same title explaining that inflammatory essay, he elaborates that we don't look for compatibility, but for familiarity, and that the best relationships are the ones that involve learning, teaching, and compromise. Hopefully, we can use the training from our past and present relationships, as well as learning from the ensemble of our peers and their training, to be prepared for new relationships, the testing data from our future.

References

App Link: https://blinddatingensembleclassifier.herokuapp.com/ (REMINDER: CALCULATIONS OVER 30 SECONDS CURRENTLY CRASH THE APP. USE SMALLER ENSEMBLES UNTIL THIS IS FIXED)

Github Link: https://github.com/GGSimmons1992/datingSelectionClassifier

About Author

Gary Simmons

I am a software developer who is aspiring to become a data scientist. Degrees: Applied Physics BS (California State University-San Marcos, May 2014), Physics MS (University of Colorado-Boulder, May 2018). linkedin: https://www.linkedin.com/in/ggsimmons92/ github: https://github.com/GGSimmons1992