Swift for TensorFlow: Choosing your dataset
- by Tomás Ruiz López
- November 03, 2020
- swift • machine learning • tensorflow
- 7 minutes to read.
This is the second article in an ongoing series, Swift for TensorFlow.
In the introductory post of this series, we presented why TensorFlow is moving to Swift, and how you can get started using it in a Google Colab or in Xcode. Now, it’s time to roll up our sleeves and start coding our first Machine Learning model using Swift and TensorFlow!
Picking a dataset
An essential part of any Machine Learning project is having data to learn from. Fortunately, the Machine Learning community offers a multitude of sample datasets that you can use in your projects, and even to compare the performance of your model with others.
If you are looking for datasets for your project, a fantastic place to find them, and even engage in competitions, is Kaggle. There, you can download datasets in multiple formats and check the solutions other people have provided for the same problem. It is a good learning resource and a place to validate some of the ideas you are trying to apply.
In this post, I chose a dataset describing the passengers of the Titanic. We will build a model to predict whether a person survived, based on the available data from this dataset.
Knowing your data
Getting familiar with the data you are dealing with is an important part of every Machine Learning project. Before coding your model, you need to select which parts of your data are going to be useful for the task you are trying to accomplish. Having tons of data doesn’t mean you have to use all of it. There is an entire discipline called Feature Engineering dedicated to deciding which features from your dataset are good candidates to create a solid model.
Moreover, in some cases, datasets are incomplete, need to be cleaned, or some features are not represented in a way that can be used by a Machine Learning model. In those cases, you will have to transform your data to make it easily usable by your model.
If we take a look at our Titanic dataset, we can see that it contains 11 features, plus a label indicating if the person survived or not. Let’s go over the features and decide whether it makes sense to use each of them in our model.
Passenger identifier
The first column is a passenger identifier. This feature is unique to each passenger and arbitrarily assigned, so it has no relationship to whether a passenger survived or not. We can therefore discard it and not use it in our model.
Class
This column tells us if the passenger was traveling in 1st, 2nd, or 3rd class. This is an indicator of the socio-economic level of the passenger, and it may be related, for instance, to the proximity to lifeboats. It is represented as an integer, so we can use this feature as is. Another possibility would be to use a one-hot encoding; that is, instead of having a single column describing the class with a 1, 2, or 3, we could have three columns, each one representing one of the classes, and having a 1 or a 0 on each of them, depending on the class the passenger was traveling in.
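The one-hot alternative can be sketched in a few lines of Swift. The function name is hypothetical, not part of any library; it only illustrates the encoding described above:

```swift
// One-hot encode the passenger class (1, 2, or 3) into three columns.
// A minimal sketch; the function name is hypothetical.
func oneHotEncodeClass(_ pclass: Int) -> [Float] {
    var encoded: [Float] = [0, 0, 0]
    encoded[pclass - 1] = 1  // classes are 1-indexed in the dataset
    return encoded
}

// A 2nd-class passenger becomes [0, 1, 0].
```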
Gender
Passengers’ gender may also be related to their survivability. However, the original dataset describes the gender as a string (‘male’ or ‘female’). We need to transform this column and encode values as 0 for men and 1 for women. We could also do a one-hot encoding of this feature.
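The string-to-number transformation is a one-liner; a hedged sketch (the helper name is my own, and it assumes the two string values the dataset uses):

```swift
// Encode the gender strings used in the dataset as 0 (male) / 1 (female).
// Hypothetical helper, assuming only the two values present in the data.
func encodeGender(_ gender: String) -> Float {
    return gender == "female" ? 1 : 0
}
```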
Age
Age is probably also related to survival, and could be used as it is. However, if we inspect the dataset, we see it is missing in a significant percentage of rows (around 20%). We could discard these entries from the training set, but that is a significant amount of data, so we can decide to fix this by imputing a default value. A possibility would be to use the mean age of the dataset; however, the distribution of values in this dataset is not normal, but slightly skewed, so imputing the mean may give biased results. Another possibility, which is the one I applied, is to set missing values to the median value (28 years of age). You can compute all these values yourself, but Kaggle already offers some statistical values for each column in the dataset.
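Median imputation can be sketched as follows, modeling missing ages as `nil`. The helper name is hypothetical; it computes the median from the known values and fills the gaps with it:

```swift
// Replace missing ages (nil) with the median of the known values.
// A minimal sketch; in the Titanic dataset the median age is 28.
func imputeAges(_ ages: [Float?]) -> [Float] {
    let known = ages.compactMap { $0 }.sorted()
    let mid = known.count / 2
    let median: Float = known.count % 2 == 0
        ? (known[mid - 1] + known[mid]) / 2
        : known[mid]
    return ages.map { $0 ?? median }
}
```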
Siblings, spouses, parents, and children
There are two columns indicating the number of siblings/spouses, and parents/children each passenger was traveling with. These features may be related to the chance of surviving the catastrophe, but, in this case, I decided to do something slightly different. Instead of using the features as they appear in the dataset, I computed a new column based on the two. This new column corresponds to a derived feature, namely Traveling alone, which has a 0 if the person was traveling alone (i.e., both original columns were set to 0), or a 1 otherwise.
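The derived feature follows directly from the two original columns; a small sketch with a hypothetical helper name:

```swift
// Derived feature as described above: 0 when the passenger was
// traveling alone (both counts are 0), 1 otherwise.
func travelingAlone(siblingsSpouses: Int, parentsChildren: Int) -> Float {
    return (siblingsSpouses == 0 && parentsChildren == 0) ? 0 : 1
}
```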
Ticket number
The dataset includes the ticket number for each passenger, but this value is arbitrarily assigned and unrelated to survival, so it seems safe to discard it.
Fare
This feature corresponds to the amount of money a passenger paid to board the Titanic, and is probably highly correlated to the class they were traveling in, but we can include it as well in the features to train our model.
Cabin
This feature could be related to the proximity to a lifeboat, but it is only available for 23% of the entries. In this case, it does not seem possible to fill those entries with a default value (our model would not learn anything from it), and using this feature would mean discarding 77% of our dataset, so we will ignore it.
Port of embarkation
Finally, there is a column to tell us which port the passenger embarked at (Southampton, Queenstown, or Cherbourg), represented by the first letter of the city. I decided to include this as a feature, representing cities with integer numbers from 0 to 2.
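This mapping can be written as a small switch. The specific 0–2 assignment below is an assumption for illustration; any consistent mapping works, as long as it is applied uniformly:

```swift
// Map the port's initial letter to an integer in 0...2.
// Hypothetical helper; the particular ordering is an assumption.
func encodePort(_ port: String) -> Float? {
    switch port {
    case "S": return 0  // Southampton
    case "Q": return 1  // Queenstown
    case "C": return 2  // Cherbourg
    default:  return nil  // unknown / missing value
    }
}
```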
Splitting the dataset
Now that we have decided which features we want to use when we train our model, we need to split the dataset into two sets: the training set and the dev set. The former will be used to learn to categorize from it, whereas the latter will be used to check how well our model performs after being trained.
A typical question is how much data you should put into each set. It depends on how much data you have in total. A typical division is done on an 80-20 basis (i.e., 80% of the data goes to the training set, 20% to the dev set). However, this is not so critical if you have a huge amount of data. In such a case, you should put as much data as needed into the dev set to have clear confidence that your model performs well, and the rest should go into the training set. In any case, it is very important that the data distribution in both sets is the same, and close to what you will find in reality when you use your trained model.
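A simple way to do the 80-20 split is to shuffle the rows first (so both sets are drawn from the same distribution) and then cut at the 80% mark. A generic sketch, with a hypothetical function name:

```swift
// Split rows into a training set and a dev set.
// Shuffling first keeps both sets drawn from the same distribution.
func trainDevSplit<T>(_ rows: [T], trainFraction: Double = 0.8) -> (train: [T], dev: [T]) {
    let shuffled = rows.shuffled()
    let cut = Int(Double(rows.count) * trainFraction)
    return (Array(shuffled[..<cut]), Array(shuffled[cut...]))
}
```

For 100 rows and the default fraction, this yields 80 training rows and 20 dev rows.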
In this post, we have performed an initial step towards having our first Machine Learning model to predict if a person would survive the catastrophe of the Titanic or not. Having a good dataset is crucial, but knowing which features you should use is equally important. Before starting to code, take time to get to know the dataset and extract metrics about it. Investing time in this stage of the project will help you get a more reliable model.