Swift for TensorFlow: Reading your dataset

This is the third article in the Swift for TensorFlow series. Make sure to read Introduction to Swift for TensorFlow and Swift for TensorFlow: Choosing your dataset before continuing.

After choosing a dataset that includes all of the passenger data from the Titanic catastrophe, we spent some time looking into the features that this dataset provides, how useful they are for predicting survival of a passenger, and how we can represent them to be readable in our Machine Learning project using Swift for TensorFlow.

In this post, we’ll learn how we can read these features from our CSV files and use them to train our model. To do so, we will leverage some Python libraries that are available to us thanks to the PythonKit module.

Representing a batch of examples

We mentioned in our previous article that a typical Machine Learning project has two datasets (or three in some cases), namely the training and dev/test sets. The purpose of these sets is:

  • Training set: used to fit the learning model to the provided data.
  • Dev set: used to evaluate the accuracy during the training process.
  • Test set: used to evaluate the accuracy after the training process.

A dataset consists of multiple rows, called examples. Each example corresponds to a row in our CSV file and has multiple features and a label. In our dataset, the label indicates whether the passenger survived, and the features are the traveling class, sex, age, fare, embarkation port, and whether the person was traveling alone.

We can represent this information as follows:

import TensorFlow

struct TitanicBatch {
  let features: Tensor<Float>
  let labels: Tensor<Int32>
}

You may be wondering what a Tensor is in the code above. A Tensor is a multidimensional array that is used to represent the information flowing through a TensorFlow model. In our case, a batch of a single example from the Titanic dataset would look like:

TitanicBatch(
  // Class, Sex, Age, Fare, Port, Travel alone
  features: [3, 0, 29, 9.4833, 0, 1],
  labels: [0]
)

However, in order to speed things up, TensorFlow (and other Machine Learning libraries) takes advantage of matrix multiplication, as our computers have specific hardware (GPUs or TPUs) to perform SIMD (Single Instruction, Multiple Data) operations in a very performant way. For this reason, we need to tell TensorFlow how we can stack multiple examples into a single Tensor representation. This is done using the Collatable protocol, and we can add our conformance as:

extension TitanicBatch: Collatable {
  init<BatchSamples: Collection>(collating samples: BatchSamples)
  where TitanicBatch == BatchSamples.Element {
    self.features = Tensor<Float>(stacking: samples.map(\.features))
    self.labels = Tensor<Int32>(stacking: samples.map(\.labels))
  }
}

What this code does is provide an initializer that, given a collection of single-example batches, stacks all of them into a single batch.
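To make the effect of collating concrete, here is a minimal sketch that uses plain Swift arrays instead of Tensor, so it runs without the TensorFlow module. PlainExample and PlainBatch are hypothetical stand-ins, not part of Swift for TensorFlow, and the second example's values are made up for illustration.

```swift
// Hypothetical plain-Swift stand-ins; the article's real code uses Tensor and Collatable.
struct PlainExample {
  let features: [Float]
  let label: Int32
}

struct PlainBatch {
  let features: [[Float]]  // one row per example: shape [batchSize, 6]
  let labels: [Int32]      // one label per example: shape [batchSize]

  // Same shape of initializer as the Collatable requirement:
  // take a collection of single examples and stack them.
  init<Samples: Collection>(collating samples: Samples)
  where Samples.Element == PlainExample {
    self.features = samples.map(\.features)
    self.labels = samples.map(\.label)
  }
}

let batch = PlainBatch(collating: [
  // Class, Sex, Age, Fare, Port, Travel alone
  PlainExample(features: [3, 0, 29, 9.4833, 0, 1], label: 0),
  PlainExample(features: [1, 1, 38, 71.2833, 1, 0], label: 1),  // made-up values
])
print(batch.features.count, batch.features[0].count)  // prints "2 6"
```

Two examples with six features each become one batch with two rows and six columns, which is the same idea as stacking Tensors along a new leading dimension.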

Reading CSV files

Now that we have a representation model, we can use it to load the examples from our CSV files. We can write a function that, given a file name, returns an array of TitanicBatch, where each item in the array is a single example from the dataset. Keeping one example per item lets us split the array into mini-batches that are processed together and collated thanks to our conformance to Collatable. It also lets us shuffle the array after every epoch (one full pass over the training set) so that the examples are not always processed in the same order.
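The shuffling and mini-batching idea can be sketched with the standard library alone. The miniBatches helper below is a hypothetical name introduced for illustration, and the integers stand in for single-example batches; this shows the idea rather than how Swift for TensorFlow implements it.

```swift
// Split `items` into consecutive chunks of at most `size` elements.
// `miniBatches` is a hypothetical helper, not a Swift for TensorFlow API.
func miniBatches<T>(of items: [T], size: Int) -> [[T]] {
  precondition(size > 0, "batch size must be positive")
  return stride(from: 0, to: items.count, by: size).map { start in
    Array(items[start..<min(start + size, items.count)])
  }
}

let examples = Array(0..<10)  // stand-ins for single-example batches

for epoch in 1...2 {
  // Reshuffle at the start of every epoch so the batches differ between passes.
  let shuffled = examples.shuffled()
  let batches = miniBatches(of: shuffled, size: 4)
  print("epoch \(epoch):", batches)
}
```

With ten examples and a batch size of four, every epoch yields batches of four, four, and two elements; only the order of the examples changes between epochs.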

The implementation of this function is:

import PythonKit

func readTitanic(filename: String) -> [TitanicBatch] {
  // (1) Import numpy
  let np = Python.import("numpy")

  // (2) Read features
  let features = np.loadtxt(
    filename,
    delimiter: ";",
    skiprows: 1,
    usecols: [1, 2, 3, 6, 7, 8],
    dtype: Float.numpyScalarTypes.first!)

  // (3) Read labels
  let labels = np.loadtxt(
    filename,
    delimiter: ";",
    skiprows: 1,
    usecols: [0],
    dtype: Int32.numpyScalarTypes.first!)

  // (4) Create Tensors
  guard let featuresTensor = Tensor<Float>(numpy: features),
        let labelsTensor = Tensor<Int32>(numpy: labels) else {
    fatalError("Could not load dataset \(filename)")
  }

  // (5) Use our batch representation
  return zip(
    featuresTensor.unstacked(),
    labelsTensor.unstacked()
  ).map { pair in
    TitanicBatch(features: pair.0, labels: pair.1)
  }
}

Let’s break it down:

  1. First, we import the numpy library from Python. This is possible thanks to the PythonKit module.
  2. We load the features from the CSV file using np.loadtxt. We have to provide the filename, which delimiter is used (a semicolon in our case), how many rows to skip (1, as our files contain a header row), the indices of the columns we want to read, and the type of data we are reading.
  3. Using the same API, we read the labels as integers.
  4. We create Tensors for features and labels from the Python objects we just read; there is a specific initializer in Tensor to use this type of data.
  5. We create an array of batches using the representation we defined above.
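To see what the delimiter, skiprows, and usecols arguments select, here is a plain-Swift sketch of the same column selection on an inline string. It is not how np.loadtxt is implemented; the header names, the two unused columns, and the second data row are made up for illustration, and parseRows is a hypothetical helper.

```swift
// Hypothetical sample data: header names, columns 4 and 5, and the second row are made up.
let csv = """
label;class;sex;age;unusedA;unusedB;fare;port;alone
0;3;0;29;0;0;9.4833;0;1
1;1;1;38;1;0;71.2833;1;0
"""

let featureColumns = [1, 2, 3, 6, 7, 8]  // same indices as usecols for the features
let labelColumn = 0                      // same index as usecols for the labels

// Parse semicolon-delimited text, skipping the header row and malformed rows.
func parseRows(_ text: String) -> [(features: [Float], label: Int32)] {
  text.split(separator: "\n")
    .dropFirst()  // skip the header, like skiprows: 1
    .compactMap { line -> (features: [Float], label: Int32)? in
      let fields = line.split(separator: ";", omittingEmptySubsequences: false)
      guard fields.count > featureColumns.max()!,
            let label = Int32(fields[labelColumn]) else { return nil }
      let features = featureColumns.compactMap { Float(fields[$0]) }
      guard features.count == featureColumns.count else { return nil }
      return (features: features, label: label)
    }
}

let rows = parseRows(csv)
print(rows.count)  // prints "2"
```

Each parsed row carries the six selected feature values plus its label, which mirrors the pairs that readTitanic zips together in step 5.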

With this function, we would be able to read both our training and dev/test sets just by passing the file name we want to read.

Conclusions

With this, we have our data in place to start training our Machine Learning model. There are other ways to read the data; you could do it in plain Swift, for example. In this post, however, I wanted to showcase the integration of Swift and Python, and how we don’t need to reinvent the wheel if there is already a tool for the task we want to accomplish.
