Emilio’s Blog

Handling Larger-than-memory Datasets in PyTorch Lightning: A Practical Guide

2024-03-20T16:00:00+00:00

In the world of deep learning, data is king. The more data we have, the better our models can learn and predict. However, this abundance of data can also pose a significant challenge. What happens when our datasets are so large that they don’t fit into our system’s RAM? This is a common issue faced by many machine learning engineers, and it’s the problem we’ll tackle in this blog post.

The solution to this problem lies in the way we store and load our data. Instead of trying to load all the data into memory at once, we can save it on disk and load only the samples we need when creating each batch. This approach, known as on-demand loading, allows us to work with datasets that are much larger than our available memory.

The major challenge that we face in this endeavor is speed. Loading data from disk is significantly slower than loading it from memory, so we need to ensure that this process is as efficient as possible to prevent it from becoming a bottleneck in our training pipeline.

In the following sections, we’ll delve deeper into these challenges and provide practical examples of how to overcome them using PyTorch Lightning. Here’s an UML diagram of how the solution will look like, I hope you will find it useful to navigate the post:

Pytorch Lightning

I am a big fan of Pytorch Lightning. It significantly reduces boilerplate code by providing a rich set of features, while maintaining a high degree of extensibility, modularity, and structure. One of the key components of PyTorch Lightning is the LightningModule, which encapsulates the core logic of the training process, including the forward pass, training step, validation step, and more. This is complemented by the LightningDataModule, which is responsible for organizing the data loading code. This clear separation of responsibilities between the data module and the training module makes the code easier to write, understand, and maintain.

As mentioned, LightningDataModule class is a blueprint for how to organize your data loading code, and it’s where we’ll implement our on-demand loading solution. The LightningDataModule class has several important methods that we’ll be using:

prepare_data: This method is called only once and is the place to download your data and perform any one-time preprocessing steps. It’s important to note that this method does not have access to the state of the LightningDataModule class, so it should not be used to set any instance variables.
setup: This method is called on every GPU in multi-GPU training and is used to perform any setup steps that require access to the dataset. For example, you might use this method to calculate the mean and standard deviation of your data for normalization purposes, splitting data into train and test set, etc.
train_dataloader and val_dataloader: These methods return the data loaders for the training and validation sets, respectively. They are called at the beginning of each epoch.

The lifecycle of the data module in a typical training run in PyTorch Lightning is as follows:

prepare_data is called to download and preprocess the data.
setup is called to perform any setup steps.
The training loop begins, and train_dataloader is called to get the training data loader.
The validation loop begins, and val_dataloader is called to get the validation data loader.
Steps 3 and 4 are repeated for each epoch.

In the next section, we’ll see how we can use these methods to implement on-demand loading for large datasets.

Structure of the data module

Our solution to handling large datasets in PyTorch Lightning involves decoupling data preparation and data storage, and weaving them together in the data module. This allows us to easily change the storage method and the data pre-processing independently as the complexity of the application grows.

Specifically, Data preparation is offloaded to a DataPreparer object. This object retrieves samples from their original source, such as a remote database or the internet, and prepares each sample as necessary. Preparation could involve tasks such as normalizing numerical data, tokenizing text data, or resizing and normalizing images. The important thing is that all, or most, expensive pre-processing is done in this stage, rather than during training.

Once prepared, each sample is handled by a DataStorage object, which saves the sample on disk. In addition to samples, a DatasetInfo object is used to store basic information about the dataset. This information is necessary during the setup phase and could include the number of samples, the number of features, information necessary for stratified splitting for cross-validation, or a list of tokens for NLP applications.

In the setup method, we first load the dataset information from the storage. Then, we split the data into training and validation sets. The exact splitting method is not shown here but would depend on the specific requirements of your application. Finally, we create Dataset objects for the training and validation sets, which can be used to retrieve the data during training.

Here’s how this structure looks in code:

from typing import TypeVar, Generic, List, Tuple


TSample = TypeVar("TSample")
TInfo = TypeVar("TInfo")


class DataModule(Generic[TSample, TInfo], LightningDataModule):
    """
    A LightningDataModule that decouples data preparation and storage.
    """
    def __init__(
        self, 
        storage: DataStorage[TSample, TInfo],
        preparer: DataPreparer[TSample]
    ):
        """
        Initializes the data module.

        Args:
            storage (DataStorage): The object responsible for storing the
             data.
            preparer (DataPreparer): The object responsible for preparing
             the data.
        """
        self._preparer = preparer
        self._storage = storage

    def prepare_data(self) -> None:
        """
        Prepares the data by retrieving and preparing samples, then storing
        them on disk.
        """
        if self._storage.is_prepared():
            return

        self._storage.start_preparation()
        for sample in self._preparer.prepare_data():
            self._storage.save_sample(sample)

        info = self._preparer.get_dataset_info()
        self._storage.finish_preparation(info)

    def setup(self, stage: str) -> None:
        """
        Sets up the data module by loading the dataset information and
        splitting the data into training and validation sets.

        Args:
            stage (str): The stage of the training process.
        """
        super().setup(stage)

        info = self._storage.load_dataset_info()
        
        train_idx, val_idx = self.split(info)
        
        self.train_dset = Dataset(train_idx, self._storage)
        self.val_dset = Dataset(val_idx, self._storage)
    
    def split(self, info: TInfo) -> Tuple[List[int], List[int]]:
        # TODO implement splitting as appropriate

Dataset interface

As you can see from the setup method, the dataset makes use of the storage object to access a subset of the data samples depending on the given indices. A basic implementation could be as follows:

class Dataset(Generic[TSample]):
    def __init__(self, indices: List[int], storage: DataStorage[TSample]):
        self._indices = indices
        self._storage = storage
    
    def __len__(self) -> int:
        return  len(self._indices)
    
    def __getitem__(self, idx: int) -> TSample:
        return self._storage.load_sample(self._indices[idx])

For this implementation, it is important to distinguish between global and local indices. While global indices uniquely identify each available sample and are needed to load samples from storage, local indices are specific to the training and validation datasets, and are used by Pytorch to request loading of a specific sample in a dataset.

For example, if we have 100 samples available we could use the first 80 for training and the last 20 for validation. In this case, the sample with local index 0 in the validation dataset will have global index 80, local index 1 is global index 81, local index 19 is global index 99, etc.

The dataset above is given on creation the global indices of the subset it represents, and performs this translation from local to global index in the __getitem__ methor before invoking the storage.

This distinction will also be important later on.

Data preparation interface

The DataPreparer interface defines the blueprint for a class that prepares data for consumption by a deep learning model. It has two abstract methods that need to be implemented by any concrete subclass:

prepare_data: This method is responsible for preparing the data. It should return an iterator over the samples in the dataset. Each sample is of a generic type TSample, which could be as simple as a tuple of tensors, or more complicated objects. I personally like to use dataclasses for this, but anything goes.
get_dataset_info: This method should return a DatasetInfo object that contains information about the dataset. This could include things like the number of samples, the number of classes, the shape of the input data, etc.

The interface is as follows:

from abc import ABC, abstractmethod
from typing import Iterator, TypeVar


class DataPreparer(ABC):
    """
    Abstract base class for a DataPreparer. A DataPreparer is responsible
    for preparing data for a DataLoader.
    """

    @abstractmethod
    def prepare_data(self) -> Iterator[TSample]:
        """
        This method is responsible for preparing the data. It should return
        an iterator over the samples in the dataset.
        """
        pass

    @abstractmethod
    def get_dataset_info(self) -> DatasetInfo:
        """
        This method should return a DatasetInfo object that contains
        information about the dataset.
        """
        return None

By defining a DataPreparer interface, we can create different subclasses for different types of data (e.g., image data, text data, etc.), each implementing the prepare_data and get_dataset_info methods in a way that is appropriate for that type of data. We could also create more complex DataPreparers that extend or re-use simpler DataPreparers, for example multi-modal applications could have a specific preparer for each modality. This makes our data loading code more flexible and reusable.

Data storage interface

The DataStorage interface defines the blueprint for a class that handles the storage and retrieval of data samples and dataset information. It has several abstract methods that need to be implemented by any concrete subclass:

is_prepared: This method checks if the data has already been prepared. It should return True if the data has been prepared, and False otherwise, and is used to avoid unnecessary data processing.
start_preparation: This method starts the data preparation process. It might be used to set up necessary resources or state before data preparation begins.
save_sample: This method saves a prepared sample. The exact way in which the sample is saved will depend on the specific implementation and underlying storage.
finish_preparation: This method finishes the data preparation process. It might be used to clean up resources or state after data preparation is complete.
load_dataset_info: This method loads the dataset information. The returned information should be the same as the one saved with finish_preparation.
load_sample: This method loads a sample. The sample should be the same as the one saved with save_sample.

The first three methods are used when saving the dataset, while the latter two are used to obtain saved samples during training.

The interface is as follows:

from abc import ABC, abstractmethod


class DataStorage(Generic[TSample, TInfo], ABC):
    """
    Abstract base class for a DataStorage. A DataStorage is responsible for 
    storing and retrieving data samples and dataset information.
    """

    @abstractmethod
    def is_prepared(self) -> bool:
        """
        This method checks if the data has already been prepared. It should 
        return True if the data has been prepared, and False otherwise.
        """
        pass

    @abstractmethod
    def start_preparation(self) -> None:
        """
        This method starts the data preparation process. It might be used to
        set up necessary resources or state before data preparation begins.
        """
        pass

    @abstractmethod
    def save_sample(self, sample: TSample) -> None:
        """
        This method saves a prepared sample. The exact way in which the
        sample is saved will depend on the specific implementation and
        underlying storage.

        Args:
            sample (TSample): The prepared sample to save.
        """
        pass

    @abstractmethod
    def finish_preparation(self, info: TInfo) -> None:
        """
        This method finishes the data preparation process. It might be used
        to clean up resources or state after data preparation is complete.

        Args:
            info (TInfo): The dataset information to save.
        """
        pass

    @abstractmethod
    def load_dataset_info(self) -> TInfo:
        """
        This method loads the dataset information saved previously.

        Returns:
            TInfo: The loaded dataset information.
        """
        pass

    @abstractmethod
    def load_sample(self, idx: int) -> TSample:
        """
        This method loads a sample. The sample should be the same as the
        one saved with `save_sample`.

        Args:
            idx (int): The index of the sample to load.

        Returns:
            TSample: The loaded sample.
        """
        pass

By defining a DataStorage interface, we can create different subclasses for different types of storage (e.g., in-memory storage, disk-based storage, cloud-based storage, etc.), each implementing the above methods in a way that is appropriate for that type of storage. This makes our data storage code more flexible and reusable, as we are going to see in the next sections.

In-memory data storage

Before going all-in on disk storage, let’s see a much simpler example.

In-memory data storage is the simplest and most efficient method for handling data when all samples fit into memory. In this case, we can save all samples into a single file and load the file only once when the first sample is requested. Then, we keep the file in memory so that loading all subsequent samples is very fast.

Here’s how this concept is implemented in the InMemoryDataStorage class:

import os


class InMemoryDataStorage(DataStorage[TSample, TInfo]):
    def __init__(self, datafile: str):

        self.datafile = datafile
        self._samples: Optional[List[TSample]] = None
        self._info = None

    def is_prepared(self) -> bool:
        return os.path.exists(self.datafile)

    def start_preparation(self) -> None:
        self._samples = []

    def save_sample(self, sample: TSample) -> None:
        if self._samples is None:
            raise RuntimeError("please call start_preparation before save_sample")
        self._samples.append(sample)

    def finish_preparation(self, info: TInfo) -> None:
        torch.save((self._samples, info), self.datafile)

    def load_dataset_info(self) -> TInfo:
        if self._info is None:
            self._samples, self._info = torch.load(self.datafile)
        return self._info

    def load_sample(self, idx: int) -> TSample:
        if self._samples is None:
            self._samples, self._info = torch.load(self.datafile)
        return self._samples[idx]

During data preparation, we append each sample to a list, and, once all samples have been prepared, they are saved to the specified file along with the dataset information.

When we need to load the dataset information or a specific sample, we first check if the data has been loaded into memory. If not, we load the data from the file. This ensures that the file is only loaded once, and all subsequent accesses performed from the object saved in memory.

On-disk data storage

The InMemoryDataStorage is an excellent solution when all data can be accommodated in memory. However, in instances where this is not feasible, we must resort to on-demand loading of samples from disk. Most importantly, the way in which sampels are stored significantly influences data retrieval speed.

Typically, file access incurs a roughly constant overhead, dependent on the storage technology, in addition to a variable delay based on the file size. Disks generally perform optimally when tasked with reading and writing large data chunks sequentially, as opposed to numerous small, random reads or writes.

In the context of Solid State Drives (SSDs), for instance, the hardware is usually capable of reading a minimum size of about 4 KB. Consequently, storing files smaller than this minimum size offers no speed advantage, as the SSD will still read the minimum size, regardless of the actual file size. Furthermore, SSDs comprise several flash memory chips that can be accessed simultaneously when working with sufficiently large files. However, smaller files would only access a single chip, thereby not benefiting from the hardware parallelism.

In the case of Hard Disk Drives (HDDs), file access begins with disk seeks, which involve moving the read/write head to the correct disk location. This mechanical operation takes a significant amount of time. However, once the initial seek is completed, sequential access is quite speedy, as the read/write head remains stationary while the disk platter spins beneath it.

The implication of these factors is that disks cannot achieve peak performance when frequently accessing small files. Therefore, saving each sample in a separate file is not the most efficient method.

For this reason, we instead create blocks of samples that are saved together in a single file. For instance, we could save 100, 1000, or even 10000 samples in the same file. The optimal number of samples per file depends on the final file size on disk, the speed of reading it, etc. Nonetheless, a good starting point could be 1000 samples per file.

Here’s how this concept is implemented in the OnDiskBlockDataStorage class:

import pickle


class OnDiskBlockDataStorage(DataStorage[TSample, TInfo]):
    def __init__(self, base_folder: str, block_size: int = 5000):
        self.base_folder = base_folder
        self.block_size = block_size
        self.datafile = os.path.join(base_folder, "dataset_info.pkl")
        
        self._info = None
        self._sample_count = self.block_count = 0
        self._current_saving_block: Optional[List[TSample]] = None
        self._loaded_block: Optional[List[TSample]] = None
        self._loaded_block_idx: Optional[int] = None

    def is_prepared(self) -> bool:
        return os.path.exists(self.datafile)

    def start_preparation(self) -> None:
        self._current_saving_block = []

    def save_sample(self, sample: TSample) -> None:
        if self._current_saving_block is None:
            raise RuntimeError(
                "please call start_preparation before saving samples"
            )

        self._current_saving_block.append(sample)
        self._sample_count += 1

        if len(self._current_saving_block) >= self.block_size:
            self._save_current_block_and_start_new()

    def _save_current_block_and_start_new(self) -> None:
        dest_path = self._block_path(self.block_count)
        dest_folder, _ = os.path.split(dest_path)
        os.makedirs(dest_folder, exist_ok=True)
        torch.save(self._current_saving_block, dest_path)
        self._current_saving_block = []
        self.block_count += 1

    def _block_path(self, block_id: int) -> str:
        return os.path.join(self.base_folder, "blocks", f"{block_id}.pt")

    def finish_preparation(self, info: TInfo) -> None:
        if self._current_saving_block:
            self._save_current_block_and_start_new()

        with open(self.datafile, "wb") as f:
            # use protocol 4 to save large obejcts
            pickle.dump(
                (info, self.block_count, self.block_size),
                f, protocol=4
            )

    def load_dataset_info(self) -> TInfo:
        if self._info is None:
            with open(self.datafile, "rb") as f:
                data = pickle.load(f)
                self._info, self.block_count, self.block_size = data

        return self._info

    def load_sample(self, idx: int) -> TSample:
        block_id = idx // self.block_size
        offset = idx % self.block_size
        if self._loaded_block_idx != block_id:
            block_path = self._block_path(block_id)
            self._loaded_block = torch.load(block_path)
            self._loaded_block_idx = block_id
        return self._loaded_block[offset]

During data preparation, we create an empty list to store the block of samples being constructed. When the number of samples in the list reaches the desired block size, we save all of these samples to a single file. Once all samples have been prepared, we also save the provided dataset information, block count, and block size to a separate file. This file also serves as sentinel to determine if the dataset preparation was already performed.

When we need to load a specific sample, we check if the corresponding block has been loaded into memory. If not, we first load the entire block from the file, then we return the sample that was requested.

Random sampling with block storage

Saving data in blocks does however pose an additional challenge when accessing samples in a random order.

Random sampling is crucial in training deep learning models because it helps to prevent overfitting and ensures that the model generalizes well. It does this by breaking potential correlations in the data and ensuring that each training batch is a good representation of the overall dataset. This randomness ensures that the model doesn’t learn the order of the training data, which could lead to poor performance on unseen data. In technical terms, random sampling is an unbiased estimator of the loss gradient with respect to the dataset, which is the reason why mini-batch training is possible.

However, when data is saved in blocks as we did above, entirely random access is rather inefficient as it requires loading an entire block from disk each time a single sample is needed, since samples in a random order are likely to belong to different blocks.

The solution to this problem is to build a custom sampler that selects blocks in a random order, then yields all samples in that block also in a random order. This approach maintains the benefits of random sampling while also taking advantage of the efficiency of block data loading.

While this solution is not perfectly random, as samples within the same block are more likely to appear in the same batch, it is typically good enough for practical purposes as long as the blocks are large enough, and the samples were divided into blocks randomly during preparation. In this case the batches will still contain a good variety of samples; for example, there are 2.3e60 different batches of 32 elements that can be constructed from a single blocks of size 1000.

This approach can be implemented using a custom sampler with PyTorch’s DataLoader. DataLoaders in PyTorch are used to load data in complex ways, such as multi-threaded data loading and custom sampling strategies. They use samplers to specify the sequence of indices/keys used in data loading.

There are several common types of samplers used in PyTorch:

SequentialSampler: This sampler loads data in a sequential order. It’s useful when you want to go through the dataset in the same order every time, most commonly used in the validation dataset.
RandomSampler: This sampler loads data in a random order. It’s useful when you want to shuffle your data, as should be done for teh training dataset.
BatchSampler: This sampler loads data in mini-batches. It’s useful when you want to load data in chunks rather than one sample at a time. It is typically composed with other samplers such as the ones described above.

In our case, we would write a custom sampler that selects blocks in a random order, and then selects samples within each block in a random order. We then combine this custom sampler with the BatchSampler and use it with the standard Pytorch DataLoader.

This is where the distinction between global and local sample indices described above with the dataset becomes relevant. The sampler also needs to return local indices, but do so in such a way that local indices in the same batch correspond to global indices that were stored in the same block.

class BlockSampler:
    """
    A custom sampler class that groups samples into blocks and yields
    samples from the same block before moving on to the next block.
    The blocks and samples within blocks can be accessed in a random or
    sequential order, based on the `shuffle` parameter.
    """

    def __init__(
        self,
        indices: List[int],
        block_size: int,
        shuffle: bool
    ) -> None:
        """
        Initializes the BlockSampler.

        Args:
            indices (List[int]): A list of global sample indices contained
             by the dataset.
            block_size (int): The number of samples in each block.
            shuffle (bool): If True, blocks and samples within blocks are
             accessed in a random order. If False, they are accessed
             sequentially.
        """
        self._block_size = block_size
        self._shuffle = shuffle
        self._indices = indices
        self._blocks: Dict[int, List[int]] = {}

        # use global indices to identify the blocks spanned by this
        # dataset, and store in each block the corresponding local index
        # of the sample
        for local_idx, global_idx in enumerate(indices):
            b = global_idx // block_size
            if b not in self._blocks:
                self._blocks[b] = []
            self._blocks[b].append(local_idx)

    def __len__(self) -> int:
        """
        Returns the total number of samples.
        """
        return len(self._indices)

    def __iter__(self) -> Iterator[int]:
        """
        Yields sample indices such that each block is only visited once.
        """
        block_sequence = self._sequence(self._blocks.keys())
        for block in block_sequence:
            sample_sequence = self._sequence(self._blocks[block])
            for sample in sample_sequence:
                yield sample

    def _sequence(self, indices: Sequence[int]) -> Iterator[int]:
        sorted_indices = sorted(indices)
        if self._shuffle:
            yield from np.random.choice(
                list(sorted_indices),
                size=len(sorted_indices),
                replace=False
            )
        else:
            yield from sorted_indices

The BlockSampler class is a custom sampler that groups samples into blocks and yields samples from the same block before moving on to the next block. This is achieved by dividing the global indices by the block size to get the block number for each sample, and then storing the local indices of the samples in the corresponding block.

Finally, we need to use this sampler, if appropriate, when creating the data loaders for the training and validation dataset:

class DataModule(LightningDataModule):

    # previous code ...

    def train_dataloader(self) -> DataLoader[DataSample]:
        sam = self._get_sampler(self.train_dset, shuffle=True)

        return DataLoader(
            self.train_dset,
            num_workers=self.num_workers,
            collate_fn=self.train_dset.collate,
            batch_size=None,
            sampler=sam,
        )

    def val_dataloader(self) -> DataLoader[DataSample]:
        sam = self._get_sampler(self.val_dset, shuffle=False)

        return DataLoader(
            self.val_dset,
            num_workers=self.num_workers,
            collate_fn=self.val_dset.collate,
            batch_size=None,
            sampler=sam,
        )

    def _get_sampler(self, dataset: Dataset, shuffle: bool) -> Any:
        """
        Returns a BatchSampler that uses a BlockSampler as its inner
        sampler if the storage saved data in blocks, otherwise a random
        or sequential sampler.

        Args:
            dataset (Dataset): The dataset for which to get the sampler.
            shuffle (bool): If True, samples are accessed a random order, otherwise they are accessed sequentially.

        Returns:
            BatchSampler: A BatchSampler.
        """
        if isinstance(self.storage, OnDiskBlockDataStorage):
            inner_sampler = BlockSampler(
                dataset.indices, self.storage.block_size, shuffle=shuffle
            )
        elif shuffle:
            inner_sampler = RandomSampler(dataset)
        else:
            inner_sampler = SequentialSampler(dataset)

        return BatchSampler(inner_sampler, self.batch_size, drop_last=False)

Conclusion

In this blog post we saw how to efficiently load data from disk in PyTorch Lightning when it does not all fit in memory. The solution involves saving groups of samples into a single file, and using a custom sampler to enable almost-random access to these samples while minimizing disk reads, by iterating over the blocks one at a time.

How to use Visual Studio Code to run and debug code on SLURM compute nodes

2023-12-05T10:00:00+00:00

If you’re a developer or data scientist using SLURM to handle your compute workloads, you have surely encountered issues in debugging your code on compute nodes. In this blog post, I share a simple solution for that, allowing you to develop and debug code running directly with the compute resources you need. Although the focus is on Visual Studio Code, the same approach can be applied to other IDEs that support remote development via SSH.

The problem with software development on SLURM

SLURM is a cluster manager that allows users to submit jobs to be executed on compute nodes with the appropriate resources. In principle, one should develop their program on a local machine, then upload it to the cluster, and submit jobs to execute it and obtain results. In practice, this is cumbersome and error-prone, as there are often compatibility issues between the local machine and the compute nodes on the cluster due to the different execution environments, such as operating systems, library versions, etc. It is therefore common for SLURM users to do their development on the cluster login node, and either (1) perform small test runs on the login node itself, or (2) test their code by submitting jobs. Both alternatives are not optimal: in the first case, the resources on the login node are different than those on the compute node and may not suffice to support many users developing concurrently, while in the second case it is impossible to debug the code from the integrated development environment (IDE), seriously hampering development.

In this post, I present a simple solution that solves both problems, allowing one to use the full power of IDE debugging directly on compute nodes. I will focus on Visual Studio Code, but the same trick should be applicable to other IDEs that support remote development via SSH (including, for example, PyCharm).

SSH access to compute nodes

An innocent solution would be to directly SSH to a compute node, but this is not a good idea because you would be able to “steal” all the resources on that node, defeating the very purpose of SLURM (which is to share resources among users). It is for this reason that some SLURM clusters do not even allow users to SSH into compute nodes. And if your SLURM clusters allows it, you should still be polite and not do it to get compute resources.

Actually, there is a way to achieve the same result while still respecting resource allocation: run a SSH server in a SLURM job! For this, we can use Dropbear, a lightweight SSH server that can be started by normal users.

Step 1: Set up dropbear

We will only install dropbear for our user on the login node, simply cloning the repository and compiling the binary. You can find the full instructions in the repository, but a basic installation will be like this:

> # we are executing this on the login node
> git clone https://github.com/mkj/dropbear
> cd dropbear
> # compile the server
> ./configure
> make PROGRAMS="dropbear dbclient dropbearkey dropbearconvert scp"
> # install binaries in a local folder
> mkdir install
> make install DESTDIR=install

If you do not have a compiler available, you can use a package manager such as Minoconda or Micromamba to install the compiler tools package for your user only.

After this, the dropbear binary will be in ~/dropbear/install/usr/local/sbin/dropbear. We keep everything in our home folder to avoid messing up with the login node and angering the sysadmins :)

The last step to prepare the server is to generate a key file:

> ~/dropbear/install/usr/local/bin/dropbearkey \
    -t ecdsa -s 521 -f ~/dropbear/install/server-key

Step 2: Start the SSH server in a SLURM job

Next, we submit a SLURM job that will run the SSH server. I do this by using the following script:

> cat run-vscode-server-gpu.sh
#!/bin/bash

#SBATCH --time 12:00:00
#SBATCH --job-name vscode-gpu
#SBATCH --cpus-per-task 8
#SBATCH --mem 32G
#SBATCH --partition gpu
#SBATCH --gres gpu:1
#SBATCH --output ~/vscode-gpu.log

DROPBEAR=~/dropbear/install

# dropbear arguments:
#  -r    Server key
#  -F    Don't fork into background
#  -E    Log to stderr rather than syslog
#  -w    Disallow root logins
#  -s    Disable password logins
#  -p    Port where to listen for connections
#  -P    Create pid file PidFile
$DROPBEAR/usr/local/sbin/dropbear \
    -r $DROPBEAR/server-key -F -E -w -s -p 64321 \
    -P $DROPBEAR/var/run/dropbear.pid
> # submit the job
> sbatch run-vscode-server-gpu.sh

Dropbear will use the authorized keys in ~/.ssh/authorized_keys to determine who can connect and who cannot, meaning that you do not have to worry about other users connecting to this SSH server.

Now, simply submit this job before your morning coffee and wait for it to start:

> squeue -u `whoami`
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          43029233       gpu vscode-g      edo  R       2:55      1 supergpu03

Remember the node where the server is running (supergpu03) as we need it later!

In case of troubles the log file should contain more information, but in case of success the server will patiently wait for connections:

> cat ~/vscode-cpu.log
[33449] Dec 05 09:52:03 Not backgrounding

Step 3: Configure Visual Studio Code

Before doing this, set up Visual Studio Code for remote development via SSH following the official guide, including SSH key-based authentication. Next, open the SSH config file by searching for “Remote-SSH: Open SSH Configuration File…” from the command palette (invoked via Ctrl+Shift+P), and add the following configuration:

# Login node - adapt to your cluster
Host hpc-login
  HostName login.cluster.com
  User edo

# Compute node where the dropbear is running. Note:
#  - The HostName must correspond to the one you saw in `squeue`
#  - ProxyJump instructs VS Code to connect to the compute node via the login node;
#    it is not necessary if you are able to directly connect to the compute node.
#  - The Port is the same we used in `run-vscode-server-gpu.sh`
Host hpc-compute
  HostName supergpu03
  ProxyJump hpc-login
  User edo
  Port 64321

Every time you start a new SSH server in this way you should make sure that the HostName of the hpc-compute host matches what is listed in squeue. Or you could try to run the server always on the same node via #SBATCH --nodelist supergpu03 in the server submit script, but you may have to wait for resources to free before your server can start.

Step 4: Connect to the compute node

Finally, connect to the server running on the compute node as you would usually do, i.e., by selecting “Remote-SSH: Connect to Host…” from the command palette and choosing hpc-compute as the target.

Conclusion

Now you can use all the power of Visual Studio Code with compute resources such as GPUs while respecting your resource allocation.

Happy debugging!

What is the average length of a queue of cars?

2023-11-01T10:00:00+00:00

Some time ago I was driving on a twisty mountain road, stuck in a slow-moving queue of cars as it was impossible to overtake safely. Out of boredom, I was wondering how many cars were in the queue, and, more generally, what would be the average length of queues in this road. Let’s find out!

First, let’s formalize the problem. Assume that the road has a single entry, no exits, and is infinitely long (poor drivers!). Furthermore, upon entering the road each vehicle moves forward at a given average speed. In this scenario, faster vehicles will eventually catch up with the slower ones in front of them, and, since overtakes are not possible, will slow down and queue behind them. After some time, a “steady state” is reached where several groups of vehicles form, each moving forward at the speed of the slowest vehicle in front of the queue. The question we want to answer is, therefore: what is the average length of these groups?

The intuitive (but wrong) approach

The first idea I had was rather intuitive, but as it turns out, wrong. Let the average speeds of the $i$-th vehicle entering the road be $X_i$, and assume that all $X_1\ldots,X_\infty$ are independent and identically distributed (i.i.d.). Following our assumptions above, a queue of $n$ vehicles will form if $X_1\leq X_2$, and $X_1 \leq X_3$, and $\ldots$, and $X_1\leq X_n$, and $X_1>X_{n+1}$. Since all variables are i.i.d., we can find the probability of all of these events to be true as the product of the individual probabilities:

\[p(N=n)=\begin{cases} 1 & n \leq 1 \\ \left[\prod_{i=2}^n p(X_i\geq X_1)\right] p(X_{n+1}Some thought before going on with the math should convince you that the final result does not depend on the distributions of the velocities. Different distributions would affect how quickly queues form, but not their length after an infinite amount of time. Indeed, since $X_1$ and $X_i$ are i.i.d., the probability that $X_1\leq X_i$ must be $1/2$. Actually, for this we do not even need independence, but only exchangeability (which is implied by independence, and therefore holds in our case). In our case, exchangeable random variables have the property that $p(X_1=x,X_i=x’)=p(X_1=x’,X_i=x)$. This kind of symmetry means that there is no “preferred” ordering of the two velocities, and therefore the probability that one is larger than the other can only be $1/2$ (you can verify this formally by explicitly writing down and solving an integral for that probability).

Since $p(X_1\leq X_i)=1/2$, expanding the equation above gives:

\[p(N=n)= p(X_{n+1}Which holds again for $n>1$; For example, there is a probability of $1/2$ that there are at least two cars. Finally, the expected value of $N$ is computed as

\[\mathbb{E}[N]=\sum_{n=1}^N n\cdot p(N=n)=\sum_{n=1}^N \frac{n}{2^n}=2\]

Therefore, the average number of cars in a queue is 2! Which definitely does not match my experience ;)

To conclude this (wrong, as we are going to see in a minute) solution, note that the derivation above was somewhat pedantic and brute-forced. With a little bit more insight, one could realize that, assuming the velocities to be i.i.d., the number of vehicles in a queue is a random variable with Geometric distribution. Each Bernoulli trial corresponds to a new vehicle entering the road and checking whether it is not faster than the queue of cars in front of it. A Geometric random variable with parameter $p=1/2$ has the same distribution and expectation that we derived above.

Simulation

As I hinted at the beginning, the reasoning above is actually wrong, and I only realized that because I implemented a simulation and found completely different results. Let’s dive in!

import seaborn as sns
from tqdm import trange
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

rnd = np.random.default_rng(2315)

First, we define a function to sample a random velocity $X_i$. For simplicity we sample from an uniform distribution, but you can easily change this one to verify that the average queue length does not depend on the distribution of the velocities.

def getv():
    ''' Returns the velocity of a random vehicle '''
    return rnd.uniform()

Next, we perform 100,000 simulations where we grow a queue as long as new cars are faster than the car in front:

sim_count = 100_000

queue_lengths = []
for _ in trange(sim_count):
    i, v0 = 1, getv()

    vi = getv()
    while vi >= v0:
        i += 1
        vi = getv()

    queue_lengths.append(i)

queue_lengths = np.array(queue_lengths)

100%|██████████| 100000/100000 [00:02<00:00, 33857.86it/s]

Let’s check some descriptive statistics of the queue lengths:

pd.Series(queue_lengths).describe()

count    100000.000000
mean         10.689810
std         200.154592
min           1.000000
25%           1.000000
50%           2.000000
75%           4.000000
max       22849.000000
dtype: float64

Half of the queues only contain two cars, as we also found above, however the average length of 11 cars is way off. Moreover, if the reasoning above was correct, observing a queue of 22,849 cars would be essentially impossible! Something is definitely wrong.

To confirm, let’s compare the empirical distribution of the queue lengths with our predictions:

plt.plot(
    sorted(queue_lengths),
    np.linspace(0, 1, len(queue_lengths)),
    label='Observed',
)

plt.step(
    np.arange(1, len(queue_lengths)),
    np.cumsum(0.5**np.arange(1, len(queue_lengths))),
    where='post',
    label='Computed',
)

plt.xscale('log')
plt.xlabel('Length')
plt.ylabel('CDF')
plt.legend()
plt.show()

Except for the case of $n\leq 2$, we are way off, and the predicted probability of longer queues decays way too fast.

The Correct Solution

Finding the right solution took me a while. To be honest, even in this moment I am not really sure whether I understand why the reasoning above is wrong.

Consider this: if you see a queue of twenty cars, what can you infer about the car in front? It must be pretty slow compared to the average, right? But if the queue only has two cars, the one in front cannot be that slow, as compared to everybody else. In fact, suppose that the car in front is slower than 80% of all drivers. Then, each new driver entering the road has a probability of 80% to be faster than the car in front. Therefore, in that case, the probability that there are $n$ cars in a queue equals $0.8^{(n-1)}\cdot 0.2$, where the last term accounts for the fact that the last car entering the road is even slower than the first one. In formal terms, for $n>1$:

\[p(N=n\vert X_1=x)= \left[\prod_{i=2}^n p(X_i\geq x)\right] p(X_{n+1}Which should look familiar! It is indeed what we found above, except that now we are conditioning on the value of $X_1$. The reasoning based on exchangeability, while formally correct, does not apply to this problem because the first car of the queue is fixed.

We can remove the dependence on $x$ by integrating it away:

\[p(N=n)=\int_{-\infty}^{\infty} p(N=n|X_1=x)p(X_1=x)\text{d}x\]

At first sight, this mighty integral does not seem approachable due to the large product it contains. However, remember that all $X_i$’s are i.i.d., therefore we can simplify this expression as follows:

\[p(N=n\vert X=x)= \left[\prod_{i=2}^n p(X\geq x)\right]p(XSince this only depends on the CDF of $X$, we can use the change of variable formula to get rid of the density of $X_1$, i.e., the term $p(X_1=x)$ in the integral above.

In general terms, the change of variable formula, or integration by substitution method, states that:

\[\int_a^b f(g(x))g'(x)\text{d}x=\int_{g(a)}^{g(b)}f(u)\text{d}u\]

where $u=g(x)$.

Here, we are going to use $u=g(x)=p(X\leq x)$, which means that $g’(x)=p(X=x)$, and obviously $f(g(x))=p(N=n|X=x)$. This makes $u$ an uniform random variable distributed between 0 and 1, and is known in statistics as the probability integral transform. With this substitution we obtain:

\[p(N=n)=\int_0^1 u (1-u)^{n-1} \text{d}u\]

If this transformation looks rather obscure to you, rest assured it is to me, too. But it is easy to justify it intuitively via the reasoning we did above: if the first car is in the slowest $u\%$ of all cars, then the probability that each new car is faster than that is $(1-u)\%$, and the probability of having $n$ cars in a queue is $u(1-u)^{n-1}$ (always accounting for the very last car that is even slower than the first one). And since we do not know what $u$ is, we have to try all possible values. We use the transformation above to work with quantiles instead of the actual velocity of the cars. This has a beautiful consequence:

Our results hold no matter what is the distribution of car velocities. In other words, no amount of driving lessons or better roads can influence the length of queues (assuming that roads are long enough for queues to grow).

To solve this we perform another change of variable with $v=1-u$ and $\text{d}u=-\text{d}v$ to obtain:

\[p(N=n)=\int_1^0 -(1-v) v^{n-1} \text{d}v=\int_1^0\left(v^n-v^{n-1}\right)\text{d}v\]

Now, the two pieces can be approached independently: given that the indefinite integral of $v^n$ is $v^{n+1}/(n+1)$, the solution is

\[p(N=n)= \frac{v^{n+1}}{n+1}\bigg\vert_1^0 -\frac{v^{n}}{n}\bigg\vert_1^0 =-\frac{1}{n+1}+\frac{1}{n} =\frac{1}{n(n+1)}\]

Before doing anything else, let’s compare this result with our earlier simulation:

plt.plot(
    sorted(queue_lengths),
    np.linspace(0, 1, len(queue_lengths)),
    label='Observed',
)

plt.step(
    np.arange(1, len(queue_lengths)),
    np.cumsum([1/(n*(n+1)) for n in range(1,len(queue_lengths))]),
    where='post',
    label='Computed',
)

plt.xscale('log')
plt.xlabel('Length')
plt.ylabel('CDF')
plt.legend()
plt.show()

They match beautifully! Here is another way of comparing the two distributions:

n = 50
plt.plot(
    1-np.cumsum([1/(n*(n+1)) for n in range(1,n)]),
    [1-np.mean(np.array(queue_lengths) <= i) for i in range(1,n)],
    'o'
)
plt.plot([0., .5], [0, .5], '--', label='y=x')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Observed probability')
plt.ylabel('Computed probability')
plt.legend()
plt.show()

In this chart, each dot is a specific queue length, and the $x$ and $y$ values are the observed and computed probabilities of a queue having that length. Again, we see great agreement.

Now that we are confident that we have a formula for distribution of the queue length, let’s compute its expected value:

\[\mathbb{E}[N]=\sum_{n=1}^\infty n\cdot p(N=n)=\sum_{n=1}^\infty \frac{n}{n(n+1)} =\sum_{n=1}^\infty \frac{1}{n+1}\]

Uh oh. This series diverges to infinity.

I am afraid that long queues will exist even in the most advanced alien society (as long as they are based on roads).

How to save locally the result of XPath queries in Firefox and Chrome

2023-08-11T10:00:00+00:00

It happens relatively often that, while browsing the internet like a normal person, I want to extract some data from a webpage, save it locally, and manipulate it in some way. Since it is an one-off operation, I really do not want to bother writing a web-scraper with Python. Instead, here is a simple way of doing this through the developer console in Firefox or Chrome!

Simple scraping tasks can often be achieved by navigating to a page and executing some Xpath queries to extract the elements of interest. Python and Selenium can be used to write complex web-scrapers to automate this kind of web navigation and data gathering, but this way is too cumbersome for small, one-off scraping tasks. I have been looking for a way of doing this directly in the developer console of my browser as I navigate to the page I am interested in, but while executing Xpath is trivial via $x('//some/path'), saving the results is not.

The trick

Until, at last, I found this solution on StackOverflow, allowing one to save objects as JSON directly from the console:

function downloadObjectAsJson(exportObj, exportName){
  var dataStr = "data:text/json;charset=utf-8," +
    encodeURIComponent(JSON.stringify(exportObj));
  var downloadAnchorNode = document.createElement('a');
  downloadAnchorNode.setAttribute("href",     dataStr);
  downloadAnchorNode.setAttribute("download", exportName + ".json");
  document.body.appendChild(downloadAnchorNode); // required for firefox
  downloadAnchorNode.click();
  downloadAnchorNode.remove();
}

Essentially, this snippet:

Serializes the object to be saved into JSON,
Adds to the page a temporary a element whose href attribute is set to the encoded data to be saved,
Simulates a click from the user, tricking the browser into downloading the data to a file,
Finally removes this element from the page.

Xpath queries executed via $x return arrays of HTML elements, which are not JSON-serializable. Converting them to an appropriate representation is however very easy:

function convertElementArrayToStringArray(element_array) {
  converted = [];

  for(var i = 0; i < element_array.length; i++) {
    if("outerHTML" in element_array[i]) {
      converted.push(element_array[i].outerHTML);
    }
    else {
      converted.push(element_array[i].nodeValue);
    }
  }

  return conv;
}

This function converts HTML nodes to their outerHTML representation, while keeping text nodes as they are.

Executing the query and saving the result is then just a matter of chaining these two functions:

function saveSelectorQuery(result) {
  var conv = convertElementArrayToStringArray(result);
  downloadObjectAsJson(conv, "selector-query");
}

Usage

For ease of use, here are the previous functions as a single snippet:

function downloadObjectAsJson(exportObj, exportName){
  var dataStr = "data:text/json;charset=utf-8," +
    encodeURIComponent(JSON.stringify(exportObj));
  var downloadAnchorNode = document.createElement('a');
  downloadAnchorNode.setAttribute("href",     dataStr);
  downloadAnchorNode.setAttribute("download", exportName + ".json");
  document.body.appendChild(downloadAnchorNode); // required for firefox
  downloadAnchorNode.click();
  downloadAnchorNode.remove();
}

function convertElementArrayToStringArray(element_array) {
  converted = [];

  for(var i = 0; i < element_array.length; i++) {
    if("outerHTML" in element_array[i]) {
      converted.push(element_array[i].outerHTML);
    }
    else {
      converted.push(element_array[i].nodeValue);
    }
  }

  return converted;
}

function saveSelectorQuery(result) {
  var conv = convertElementArrayToStringArray(result);
  downloadObjectAsJson(conv, "selector-query");
}

Simply copy-paste these into the developer console, then call the last function with your selector to download the results!

For example, executing saveSelectorQuery($x("//h2")) on this very web page (try it!) will download a file called selector-query.json with the following contents:

["The trick","Usage","Emilio's Blog"]

which are exactly the second-level headers in the post. To only get the titles of the headers, without the surrounding HTML, simply append ‘/text()’ at the end of the previous query.

After this, read the JSON file with your favorite programming language and have fun!

Simplest Implementation of Diffusion Models

2023-06-25T10:00:00+00:00

This tutorial presents the simplest possible implementation of diffusion models in plain pytorch, following the exposition of Ho 2020, Denoising Diffusion Probabilistic Models.¹

Generative models learn to generate new samples (e.g., images) starting from a latent variable following a tractable (i.e., simple) distribution. Diffusion models have recently emerged as a very powerful and capable type of generative models, underlying most of the latest astonishing examples of generative AI that have captured public imagination, such as Stable Diffusion,² Midjourney,³ and DALL.E⁴. Diffusion models do this by first establishing a simple way to transform samples from the distribution of interest (the images) to a Gaussian distribution, then training a neural network to reverse this process. In this way, the network learns how to transform samples from the Gaussian into samples from the distribution of interest.

Diffusion refers to the gradual corruption of training examples by repeatedly adding a small amount of noise, mimicking the way heat diffuses through a material until it reaches an uniform temperature. After a few hundreds or thousands of noise diffusion steps, the information in the original sample is completely lost, such that the result is indistinguishable from the Gaussian that we will use as a starting point to generate new samples. Figure 2 from the paper (Ho, 2020) demonstrates this process graphically:

Here, $x_0$ is the original sample, the image of a guy, and the process of adding noise is represented by the dashed arrow going from right to left, so that, after $T$ steps, only noise remains in $x_T$. The generative process is represented by the arrows going from left to right, from $x_T$ to $x_0$, and the generative model is denoted by $p_\theta$, while the noise-adding process is $q$.

In this tutorial, we will learn to generate samples from a very simple unidimensional distribution, so that we can easily visualize the generative process. Let’s start by generating some data:

import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import numpy as np
import torch
import seaborn as sns
import itertools
from tqdm.auto import tqdm

data_distribution = torch.distributions.mixture_same_family.MixtureSameFamily(
    torch.distributions.Categorical(torch.tensor([1, 2])),
    torch.distributions.Normal(torch.tensor([-4., 4.]), torch.tensor([1., 1.]))
)

dataset = data_distribution.sample(torch.Size([1000, 1]))
sns.histplot(dataset[:, 0])
plt.show()

This plot represents the data distribution, i.e., $q(x_0)$. As you can see, our training dataset contains samples from a mixture of two Gaussian distributions, where the component on the right is sampled twice as much frequently.

The forward diffusion process is in Equation 2 of the paper:

\[q(x_{1:T}|x_0):=\prod_{t=1}^T q(x_t|x_{t-1})\]

with each step adding Gaussian noise:

\[q(x_t|x_{t-1}):=\mathcal{N}(x_t | \sqrt{1-\beta_t}x_{t-1} ; \beta_t I)\]

The mean and variance of this distribution is chosen so that the end distribution of $x_T$ after the diffusion process is a zero-mean, unit-variance Gaussian, from which we can easily sample.

This process is easily implemented with a loop:

# we will keep these parameters fixed throughout
TIME_STEPS = 250
BETA = 0.02

def do_diffusion(data, steps=TIME_STEPS, beta=BETA):
    # perform diffusion following equation 2
    # returns a list of q(x(t)) and x(t)
    # starting from t=0 (i.e., the dataset)

    distributions, samples = [None], [data]
    xt = data
    for t in range(steps):
        q = torch.distributions.Normal(
            np.sqrt(1 - beta) * xt,
            np.sqrt(beta)
        )
        xt = q.sample()

        distributions.append(q)
        samples.append(xt)

    return distributions, samples

_, samples = do_diffusion(dataset)

We can visualize the diffusion process by plotting time on the $x$ axis, and the diffused samples on the $y$ axis:

for t in torch.stack(samples)[:, :, 0].T[:100]:
    plt.plot(t, c='navy', alpha=0.1)
plt.xlabel('Diffusion time')
plt.ylabel('Data')
plt.show()

As you can see, adding noise gradually transforms all samples into a Normal $\mathcal{N}(0,1)$ distribution. We are now ready to train a model to invert this process.

Training

To keep things as simple as possible, here we use the loss in Equation 3 in the paper without any of the optimizations presented later, which only play a role for complex, real-world distributions.

In this case, diffusion models are trained by first corrupting the training examples, then trying to reconstruct the cleaner examples from the noisy examples at each step of the corruption process. The loss is an upper bound on the negative log likelihood:

\[L := \mathbb{E}_q\left[ -\log p(x_T) -\sum_{t=1}^T \log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \right]\]

Where the generative model, also called reverse process, has form:

\[p_\theta(x_{t-1}|x_t):=\mathcal{N}(x_{t-1} ; \mu_\theta(x_t,t), \Sigma_\theta(x_t, t))\]

Note that we are training two neural networks, $\mu_\theta$ and $\Sigma_\theta$, which take as input a noisy sample $x_t$ and the step $t$, and try to predict the parameters of the distribution of the sample $x_{t-1}$ to which noise was added. Intuitively, we are training these networks to maximize the predicted probability of observing the uncorrputed example $x_{t-1}$ based on $x_t$, i.e., the term $p_\theta(x_{t-1}\vert x_t)$ in the loss, for each diffusion step. Remember that $x_t$ was generated earlier from $x_{t-1}$ by adding noise; the networks have to learn to undo the noise. The other terms in the loss involving $q(x_t\vert x_{t-1})$ are not necessary to learn a good generative model, since they are constant, but are useful as a “frame of reference” to make a “perfect” generative model achieve a loss of 0.

The loss is implemented in the function below. This function requires the entire diffusion trajectory for the training samples, as well as the two neural networks that define the inverse process:

def compute_loss(forward_distributions, forward_samples, mean_model, var_model):
    # here we compute the loss in equation 3
    # forward = q , reverse = p

    # loss for x(T)
    p = torch.distributions.Normal(
        torch.zeros(forward_samples[0].shape),
        torch.ones(forward_samples[0].shape)
    )
    loss = -p.log_prob(forward_samples[-1]).mean()

    for t in range(1, len(forward_samples)):
        xt = forward_samples[t]         # x(t)
        xprev = forward_samples[t - 1]  # x(t-1)
        q = forward_distributions[t]    # q( x(t) | x(t-1) )

        # normalize t between 0 and 1 and add it as a new column
        # to the inputs of the mu and sigma networks
        xin = torch.cat(
            (xt, (t / len(forward_samples)) * torch.ones(xt.shape[0], 1)),
            dim=1
        )
        # compute p( x(t-1) | x(t) ) as equation 1
        mu = mean_model(xin)
        sigma = var_model(xin)
        p = torch.distributions.Normal(mu, sigma)

        # add a term to the loss
        loss -= torch.mean(p.log_prob(xprev))
        loss += torch.mean(q.log_prob(xt))

    return loss / len(forward_samples)

Let us now define two very simple neural networks to predict the mean and variance. Both of these networks take two inputs: the noisy sample $x_t$ and the normalized time-step $t$. As you can see from the snippet above, the time-step is added as an additional column feature, and, since the input is also one-dimensional, the total input size is two.

mean_model = torch.nn.Sequential(
    torch.nn.Linear(2, 4), torch.nn.ReLU(),
    torch.nn.Linear(4, 1)
)
var_model = torch.nn.Sequential(
    torch.nn.Linear(2, 4), torch.nn.ReLU(),
    torch.nn.Linear(4, 1), torch.nn.Softplus()
)

Let’s now train them:

optim = torch.optim.AdamW(
    itertools.chain(mean_model.parameters(), var_model.parameters()),
    lr=1e-2, weight_decay=1e-6,
)

loss_history = []
bar = tqdm(range(1000))
for e in bar:
    forward_distributions, forward_samples = do_diffusion(dataset)

    optim.zero_grad()
    loss = compute_loss(
        forward_distributions, forward_samples, mean_model, var_model
    )
    loss.backward()
    optim.step()

    bar.set_description(f'Loss: {loss.item():.4f}')
    loss_history.append(loss.item())

We can make sure that the model has converged by inspecting the loss:

plt.plot(loss_history)
plt.yscale('log')
plt.ylabel('Loss')
plt.xlabel('Training step')
plt.show()

Sample generation

Finally, with the trained neural networks, we can generate new samples from the data distribution.

This process is very similar to the earlier diffusion process, except that here we start from a Normally-distributed $x_T$ and use the predicted mean and variance to gradually “remove” noise:

def sample_reverse(mean_model, var_model, count, steps=TIME_STEPS):
    p = torch.distributions.Normal(torch.zeros(count, 1), torch.ones(count, 1))
    xt = p.sample()
    sample_history = [xt]
    for t in range(steps, 0, -1):
        xin = torch.cat((xt, t * torch.ones(xt.shape) / steps), dim=1)
        p = torch.distributions.Normal(
            mean_model(xin), var_model(xin)
        )
        xt = p.sample()
        sample_history.append(xt)
    return sample_history

samps = torch.stack(sample_reverse(mean_model, var_model, 1000))

for t in samps[:,:,0].T[:200]:
    plt.plot(t, c='C%d' % int(t[-1] > 0), alpha=0.1)
plt.xlabel('Generation time')
plt.ylabel('Data')
plt.show()

And this is the distribution at the last step of generation:

sns.histplot(samps[-1, :, 0])
plt.show()

It is very similar to the initial data distribution, which means that our model has successfully learned to generate samples resembling the training dataset!

I hope you found this tutorial useful! You can download a notebook with this code here.

Temporary variables in Python

2023-04-19T10:00:00+00:00

Very easy using context managers!

The functions locals() and globals() return dictionaries containing the variables that are defined in the current local or global scope. What is cool is that variables can be declared, modified, and “undeclared” by modifying these dictionaries (see a tutorial here)!

To get a temporary variable, we can therefore build a context manager that adds and removes a variable to either the local or the global definitions, for example:

class TemporaryVariables:
    def __init__(self, dest, **kwargs):
        # dest is a dictionary, either from locals() or globals()
        # kwargs are the variables to define, and their values
        self._vars = kwargs
        self._old = dest
        self._backup = {}
        for k, v in self._vars.items():
            # for each variable...
            if k in self._old:
                # store the old value if overwriting
                self._backup[k] = self._old[k]
            # and set the new value
            self._old[k] = v

    def __enter__(self, *args, **kwargs):
        pass

    def __exit__(self, exc_type, exc_val, exc_tb):
        for k, v in self._vars.items():
            # for each variable...
            if k in self._backup:
                # restore the old value if it was overwritten
                self._old[k] = self._backup[k]
            else:
                # or "undefine" the variable if it was new
                self._old.pop(k)

Here’s how you would use this:

a = 'hello'
print(a)
with TemporaryVariables(locals(), a='world'):
    a = a + '!'
    print(a)
print(a)

Which prints:

hello
world!
hello

Due to the way we took the backup, variables that were not defined before the with block remain undefined after it, too:

with TemporaryVariables(locals(), x=42):
    print(x)
print(x)  # x was only defined in the with block

We now get an error when using x after whe with:

42

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [8], line 3
      1 with TemporaryVariables(locals(), x=42):
      2     print(x)
----> 3 print(x)

NameError: name 'x' is not defined

Finally, notice the difference between local and global scoping:

def f():
    print('inside f with b =', b)


def scope_test():
    with TemporaryVariables(globals(), b=2) as q:
        f()

    with TemporaryVariables(locals(), b=2) as q:
        f()

scope_test()

Now, only the variable b that is defined in the global scope can be used in functions called from within the with:

inside f, with b = 2

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)

Cell In [9], line 13
     10     with TemporaryVariables(locals(), b=2) as q:
     11         f()
---> 13 scope_test()


Cell In [9], line 11, in scope_test()
      7     f()
     10 with TemporaryVariables(locals(), b=2) as q:
---> 11     f()


Cell In [9], line 2, in f()
      1 def f():
----> 2     print('inside f, with b =', b)


NameError: name 'b' is not defined

Happy hacking!

Thoughts on the Impact of Large Language Models on Software Development

2023-04-10T10:00:00+00:00

I believe that large language models (LLMs) such as ChatGPT, Copilot, GPT-4, etc., will become ubiquitous in software development. This will ultimately lead to even more software being written, and of lesser quality, more bloated and with more bugs. Additionally, good software developers will become harder to find. Obviously, making predictions about the future is difficult, and I realize there are many points in my argument that one could argue about.

TL;DR: I will make a connection with the lemon market hypothesis,¹ and argue that LLMs will make it worse.

Software developers will still be around

First, let me get this straight: I do not think that LLMs will replace software developers. People who think this either have misconceptions on what developer work entails, or are (in my opinion) excessively optimistic about the rate of progress in artificial intelligence.

To clear up the misconception, coding is a relatively minor and fairly straightforward responsibility of software developers. The main function of developers is to translate business requirements, as defined in, for example, a user story in the Scrum Agile framework, into a formal specification for the machine. Business requirements in general are, by their nature, incomplete and ambiguous, because they rely on many unstated assumptions and a shared understanding of the world, including the domain in which the software operates, how the users interact with it, etc. In fact, if writing exhaustive and unambiguous business requirements was easy, Agile project management would not have been invented, and everybody would still be doing waterfall.

It is conceivable that, in the medium term, LLMs will acquire the ability to translate business requirements into formal specifications that are 80% correct 80% of the time, perhaps even inserting approximately correct code at the approximately correct position in the code-base, but the remaining 20% will necessarily require some form of human intervention, be it fixing the input to the LLM or its output (i.e., its prompt, or the code it generates). I think that most business folks will not be willing to do this on their own, because troubleshooting complex systems still requires an intimate knowledge of the system itself, much deeper than that of project managers, as well as lots of time and energy.

This is not the first time that a new tool promises to make developers obsolete, consider for example SQL and graphical programming languages: in the end, they did not replace developers simply because dealing with software is simply a complex task that requires dedicated people with a certain expertise. LLMs will make some things easier, especially for common use-cases such as REST-based CRUD applications (ugh), but the fact remains that troubleshooting takes time away from thinking at the business, hence some people, whose main responsibility is to build and troubleshoot software, will always be needed, and those people are known as software developers. The way they work may change, but the essence will not.

One could also think that the state of artificial intelligence will advance enough that LLMs will be able to deal with all of this complexity on their own. That could be possible, but at this time I do not think that anybody can give any sort of meaningful and informed answer, given how fast things are moving. I personally believe that, if this is even possible, it will not come from making LLMs even larger but will require additional breakthroughs about knowledge representation, causal reasoning, world models (yes, LLMs do seem to form internal world models, sometimes,² but to what extent and how effectively is still unknown), etc. In any case, I think that the Pareto principle,³ or the 80-20 rule, is a good heuristic to think at these situations. According to it, 80% of the feature take 20% of the effort, and the remaining 20% of the features need 80% of the effort to be done. I think that LLMs today haven’t even reached that initial 80% of features, and although they required herculean efforts to train, I believe that fine-tuning their abilities to deal with large and complex code-bases will take quite some time. But predictions are hard, so who knows.

Everybody will get better at coding

I believe that eventually LLMs will make most developers better, and that this gain will be largest for low-skilled developers and smallest for high-skilled ones. While there is no reliable measurement or even agreed-upon definition of developer skill, I would consider a highly-skilled developer somebody who can maintain and extend code-bases of at least a million lines of code for years on end without setting the whole thing on fire (yes, lines of code is not a very good measurement and complexity also depends on the language, but whatever, I think you get the idea). This leads me to the following conjectures:

In general, current average developers will become moderately more productive. Actually writing code is a relatively minor task of software developers, I would estimate it at 30% or so, with the remaining time spent in thinking and discussing what needs to be done, how to best do it, and verifying that it was done properly. In this case, assuming that LLMs will make them five times more productive (according to a recent study⁴ this number may be closer to twice as productive), the overall productivity gain will be 32%. Even assuming, generously, a 50% of purely coding workload, the productivity gain would be of 67%. These are considerable numbers for sure, but not as groundbreaking as some would lead you to believe. And yes, the demos of LLMs writing code for simple apps and websites are truly impressive, but the vast majority of the time is spent modifying old node, not writing new code. This forecast could also be somewhat less reliable for the medium term, in case LLMs really do revolutionize development beyond fancy auto-completion; for example, some demos already show LLMs completely automating certain tasks, such that it is not even necessary to write a single line of code.
Most developers will be forced to use LLMs, simply because missing on these productivity gains will put them into a lower league, with lower salary and worse colleagues.
The best developers will not become better developers by using LLMs The previous two points were related to speed, and while most developers will be faster at typing code, good developers will see limited benefits from LLMs. As I argued above, the main bottleneck for developers is not typing but thinking, and good developers do not need suggestions on that. Referring to the millions-lines-of-code criterion I mentioned above, the main problem of implementing new features and fixing bugs in such a code-base is figuring out which of the twenty ways of doing it is the best one. Choosing a sub-optimal way will not have any negative impact in the short term, but consistently choosing worse alternatives leads to a slow but sure accumulation of technical debt, i.e., a decrease in code quality that eventually reflects in slower velocity when implementing new features, and an increased rate of bugs. High-skilled developers know how to avoid or delay this as much as possible, and I believe that LLMs will not help much in this for quite a long time, because these choices depend on trade-offs and background knowledge about the domain and possible future directions of the software. Again, it may be possible to describe enough of this to a LLM that it provides a sensible solution, but the mere act of describing the issue accurately and comprehensively is already a skill that only very good developers have.
LLMs will create a lot of new, mostly low-skilled developers. LLMs will make learning to code possible for many people who would not otherwise be successful at it, for example because they would have given up in frustration, or giving them that little extra push they need before some concept “clicks” in their head. Most of these new coders will, however, not acquire a high-skilled status, because that still requires plenty of time, dedication, effort, discipline, and study. I believe that the only way to really learn and get good at practical skills, such as coding, is to make plenty of mistakes and learning from them. This takes time and dedication, with or without LLMs. At the same time, many more people will be able to write simple scripts and apps to scratch an itch they have, and in this regard LLMs could have a similar effect as that of Excel, i.e., empowering non-technical users to perform relatively simple tasks through programming.

Will the average quality of software developers decrease?

Given that I just presented four reasons as for why all developers will benefit from LLMs, you would think that obviously the average skill will increase, but that is not necessarily the case. To see why, consider that, according to the last point above, many new developers of lower-than-average skill would enter the software development market just because of LLMs. If they outnumber the “traditional” developers, who could leverage LLMs to upskill themselves but did not need them to land a developer job, then the average will go down.

Admittedly, this is likely the most controversial statement in this post, and I believe it is the hardest to argue for or against. However, just to prove that this is not a contradiction, but is in principle possible, consider the following simulation. Let there be three levels of developer skill, low, medium, and high, and a certain number of developers with each of level skill. Let us use a parameter, called skill factor, to determine how many developers are in a skill group compared to the lower group. For example, a skill factor of 0.2 means that the number of high-skill developers is 20% that of medium-skill developers, which itself is 20% that of low-skill developers. The introduction of LLMs will cause a certain fraction of developers to upskill and move to the next skill level. At the same time, it will also create new developers with a lower skill factor, i.e., with a distribution that is more skewed towards the lower end, according to the earlier conjecture. By adding these two factors together, we can compute the skill distribution of developers before and after the introduction of LLMs:

			Developer skill
	Skill Factor	Total Count	Low	Medium	High
Total devs before LLMs
Devs upskilled by LLMs
Devs created by LLMs
Total devs after LLMs	N/A

Feel free to adjust the numbers and try out different scenarios. The default settings assume that, before LLMs, each of the higher skill levels contains 30% as many developers as the lower skill, which result in about one high-skilled developer for every 17 low-skilled ones. The settings also assume that 10% of developers will be able to upskill by using LLMs; the more you are willing to assume that LLMs will disrupt software development, the larger this number should be. The best case being that all existing devs will upskill by using LLMs, eliminating low-skilled developers and creating about three medium-skilled ones for each high-skilled developer. Furthermore, the default settings assume that LLMs will create twice as many developers as currently existing, but with a skill factor of 20% that is lower than that of established developers. This factor depends on how easy it will be for “outsiders” to learn programming with LLMs, which is why it I assume it is lower than the factor for pre-LLMs developers.

The issue of measuring developer skill

Now that we have some ideas on how LLMs will influence the skill of developers, let’s try to think at what could happen to employers. Before completing the argument, however, I would like to briefly return to the issue of developer skill.

As I mentioned, measuring developer skill and productivity is a hard, open problem for which no good solution exists. Metrics such as number of commits, numbers of lines of code, number of bugs fixed, etc., are all easy to gamify, and while they do have diagnostic value they are hardly correlated with the real productivity which is essentially user value. Even understanding the skill of a software developer during job interviews is not easy, with whiteboard and leetcode-style problems merely filtering away those who did not spend an absurd amount of time preparing for irrelevant problems like those, and take-home assignments too simple to say much about the interviewee’s skills. I am also considerably simplifying the matter by assuming that there is a single dimension to skill.

To make things worse, I think that LLMs will make it harder to measure developer skill, especially for hiring decisions, by facilitating the creation of content, such as superficial posts on blogs and LinkedIn and buggy demo projects on GitHub, that low-skill developers can use to fool potential employers by giving an appearance of proficiency. Moreover, employed low-skill developers that rely too much on LLMs will generate technical debt at a faster rate, jeopardizing progress in the long term while still appearing, to the uninformed managers, to be performing at a higher skill level. Code reviews by and pair programming together with higher-skilled developers could prevent this from happening, but it will reduce the average productivity of the organization as the higher-skilled developers would spend less time coding and more time supervising.

And its impact on the market for software developers

A lemon market¹ is feedback loop that drives down the average quality of sold goods. The market of used cars and motorbikes is a typical example of a lemon market. It is difficult for a buyer to determine the quality of an used car, because it is determined by factors, such as the driving style of the previous owner(s), whether maintenance was performed properly and regularly, etc., that are not visible to the buyer, and easily falsified by the seller. Therefore, the rational buyer should assume a car is of average quality, and be prepared to spend an average price for it. Sellers of good cars would demand a price that is higher than the average, and will not be able to sell their high-quality car because buyers cannot ascertain that the car is, in fact, in better condition than most others. Therefore, as sellers of good cars cannot get a satisfactory price, they will choose not to sell the car after all, making the average quality of cars in the market lower, and leading buyers to revise their expectations, and thus their price, downwards. This will, in turn, leading to sellers of moderately good cars not to sell, and so on, creating a feedback loop. The name actually comes from a market of lemons (good cars) and peaches (bad cars), but I find the analogy with cars more intuitive. I also cannot fathom how one cannot possibly distinguish a lemon from a peach.

I would argue that the market for software developers is (approximately) a lemon market, and LLMs will only make it worse. In the case of developers, the buyers are companies hiring, the sellers are the developers looking for a job, and the product sold is their software development ability. A lemon market appears when the following conditions hold:⁵

Asymmetry of information, in which no buyers can accurately assess the value of a product through examination before sale is made and all sellers can more accurately assess the value of a product prior to sale; as discussed above, companies are definitely struggling to assess the skills of software developers both during hiring and when working. But to be fair, so is the same for other developers.
An incentive exists for the seller to pass off a low-quality product as a higher-quality one; if you have ever been looking for a job you know this is true.
Sellers have no credible disclosure technology (sellers with a great car have no way to disclose this credibly to buyers); if you have ever been looking for a job you know this is also true.
Either a continuum of seller qualities exists or the average seller type is sufficiently low (buyers are sufficiently pessimistic about the seller’s quality); the skill of software developers certainly lies on a continuum spread across several dimensions, although nobody knows in which way and how to quanify that (see above).
Deficiency of effective public quality assurances (by reputation or regulation and/or of effective guarantees/warranties); I think it is not controversial to say that a degree in Computer Science does not say much about the ability of a graduate. Past job experience could help, but is also a noisy signal.

Determining whether the market for software developers a lemon market is certainly not straightforward, and one could easily argue that it is not, especially in regard to points (1) and (5). That is a fair critique, however this is not a binary distinction, and realistically speaking every market has some degree of “lemon-ness” (or lemonade?). Anyways, my point is that LLMs will create more lemons and fewer peaches, or, to be more precise, the lemon-to-peach ratio will increase. This is in part due to the change in skill distribution, for example as simulated above, and in part due to the feedback loop inherent in lemon markets. To see why, consider that, if LLMs will actually make it harder to measure developer skill, both the information asymmetry (point 1) and the credibility of disclosure mechanisms (point 3) will get worse (I was not helped by a LLM to write this post, by the way, but will you believe it if I told you?). In the end, and especially if you think that the average skill of software developers will decrease, this translates to a more severe form of lemon market, with a stronger feedback loop driving peaches away and reducing developers’ salary.

The consequences of this will be that some developers will choose to do something else, alternative career paths that pay better or similarly but require less effort. At the same time, the increased supply of cheaper developers will enable companies to create even more software products, however, the number of bugs will increase in tandem as the average skill of developers will decrease. It is also possible, on the contrary, that LLMs will increase the average skill enough to offset the changes in information asymmetry and disclosure mechanisms, thus resulting in the opposite effect.

It is really hard to predict the future. Even if you do not agree with my conclusion, I hope that you enjoyed this line of thinking, and that I raised some interesting points for you to ponder about. If this is the case, feel free to share this article and/or get in touch. Obviously, other people who are smarter than I am also thought at these problems, and studied the impact of LLMs on the economy as a whole, so go read those as well.⁶⁷

Transitive coins

2023-03-20T10:00:00+00:00

Three coins each show heads with probability 3/5 and tails otherwise. The first coin gives 10 points for a head and 2 for a tail, the second gives 4 points for both head and tail, and the third gives 3 points for a head and 20 for a tail. You and your opponent each choose a coin; you cannot choose the same coin. Each of you tosses your coin and the person with the larger score wins 10$. Would you prefer to be the first to pick a coin or the second?

By the way, this problem is number 16 in Section 2.7 in the book “One thousands exercises in probability”. Intuitively I thought that going first would always be the best option, because it would allow the first player to choose the coin that gives the best chances of winning, while going second would put them at the mercy of their opponent. The surprising solution comes from computing the optimal strategy, so let’s get to it.

First, note the wrong, but intuitive, approach to the problem: go first and choose the coin that gives the best expected score. The expected scores would be 10x3/5+2x2/5=34/5 for the first coin, 4x3/5+4x2/5=20/5 for the second, and 3x3/5+20x2/5=49/25 for the third, thus this strategy would choose first and pick the third coin. However, in this way you have a probability of 3/5 of getting three points, which is worse than any outcome of the second coin and heads of the first coin, which also happens with probability 3/5. Therefore, this does not seem like a good strategy, as there is a larger probability of losing than winning.

Let’s instead pretend to be the first player, and compute the probability of winning for all possible choices of coin. For convenience, here’s a recap of how the score for each throw:

	Head (p=3/5)	Tail (p=2/5)
Coin 1	10	2
Coin 2	4	4
Coin 3	3	20

For the first case, assume that the first player picks coin 1 and the second player picks coin 2. All possible outcomes of this match-up are summarized in this table:

First player	Second player
Coin 1	Coin 2	Probability	First player wins
Head - 10 pt	Head - 4 pt	9/25	Yes
Head - 10 pt	Tail - 4 pt	6/25	Yes
Tail - 2 pt	Head - 4 pt	6/25	No
Tail - 2 pt	Tail - 4 pt	4/25	No

Where the probability of each outcome is the product of the two probabilities, 3/5 for heads and 2/5 for tails, since the coins are independent. In this case, the first player wins with probability 9/25+6/25=15/25 (we can sum the probabilities because the two events are mutually exclusive) and the second player wins with probability 1-15/25=10/25. This situation is clearly symmetric, in the sense that if the first player picks the second coin, and the second player picks the first coin, the victory probabilities are reversed, i.e., 10/25 for the first player and 15/25 for the second player.

The second match-up is coin 1 versus coin 3:

First player	Second player
Coin 1	Coin 3	Probability	First player wins
Head - 10 pt	Head - 3 pt	9/25	Yes
Head - 10 pt	Tail - 20 pt	6/25	No
Tail - 2 pt	Head - 3 pt	6/25	No
Tail - 2 pt	Tail - 20 pt	4/25	No

In this case, the first player only wins with probability 9/25 and the second with probability 16/25.

The last match-up is coin 2 versus coin 3:

First player	Second player
Coin 2	Coin 3	Probability	First player wins
Head - 4 pt	Head - 3 pt	9/25	Yes
Head - 4 pt	Tail - 20 pt	6/25	No
Tail - 4 pt	Head - 3 pt	6/25	Yes
Tail - 4 pt	Tail - 20 pt	4/25	No

With the victory probabilities of 15/25 and 10/25 for the first and second player respectively.

Let’s collect the probability of victory for the first player in a table, where the columns represent the choice of the first player, and rows represent the choice of the second player:

	Coin 1	Coin 2	Coin 3
Coin 1	-	10/25	16/25
Coin 2	15/25	-	10/25
Coin 3	9/25	15/25	-

Diagonal entries are blank because the players have to choose different coins. Let’s now analyze the strategy for the first player:

Suppose the first player chooses the first coin. The second player could pick the second coin, and win with probability 10/25, or choose the third coin, and win with probability 16/25. Thus, the second player would choose coin 3.
If the first player chooses coin 2, then the second player would choose coin 1 and win with probability 15/25.
Lastly, if the third player chooses the third coin, the second player would choose the coin 2, and win with probability 15/25.

In other words, the second player wins the game with probability of 15/25 or larger, therefore, the solution to the riddle is to go second.

As a side-note, the reasoning we performed above to find the best strategy is known in game theory as Minimax¹. Essentially, as the first player, we are looking for the option that results in the other player having the minimum maximum chance of winning. In other words, given the first player’s move, the second player rationally chooses the move that maximizes their chances of winning; therefore, as the first player, we should choose the move that minimizes the second player’s maximum victory chances. This principle underlies many methods in artificial intelligence under the name of adversarial training, in which two or more components of a system compete with each other. Notable examples are Generative Adversarial Networks (GANs, Goodfellow et al. 2014²), which are used to generate new and realistic samples imitating a set of given examples. GANs are composed by two separate components, a generator that generates new samples, and a discriminator that predicts whether the generated sample is real or artificial. These two networks compete with each other, the generator trying to fool the discriminator, and the discriminator trying to uncover the generator. When properly executed, the generator learns to fool the discriminator and to produce realistic samples at the same time.

https://en.wikipedia.org/wiki/Minimax ↩
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adversarial networks.” Communications of the ACM 63, no. 11 (2020): 139-144. ↩

Best time to buy and sell stocks

2023-02-04T10:00:00+00:00

There is a problem on LeetCode that goes like this: “You want to maximize your profit by choosing a single day to buy one stock and choosing a different day in the future to sell that stock. Find the maximum profit you can achieve from this transaction.” Where we can decide to sell “in the past” to maximize the profit. Finding the solution and why it works took me way too much effort (spoiler alert).

I have to admit, I only found the solution because I thought at a similar problem (finding the subarray with the largest sum) we studied in the class on algorithms and data structures during my bachelor’s degree. And I also have to admit, I was equally bewildered at that time.

Anyways, in this problem we are given the list of prices and we have to find the maximum possible profit we could achieve by buying at some point and selling after that point. The solution is surprisingly simple:

class Solution:
    def maxProfit(self, prices: List[int]) -> int:
        n = len(prices)
        best = 0
        i = 0
        j = 1
        while j < n:
            profit = prices[j] - prices[i]
            if profit > best:
                best = profit
            elif profit < 0:
                i = j
            j += 1

        return best

As an aside, it does not look very Pythonic because it’s written to be fast. In fact, this is in the top 8% fastest solutions (I also have no clue how to make it faster). A more stylish version would be something like this:

class Solution:
    def maxProfit(self, prices: List[int]) -> int:
        best = i = 0
        for j, x in enumerate(prices):
            profit = x - prices[i]
            best = max(best, profit)
            if profit < 0:
                i = j
        return best

But this is only in the top 20%.

So how does this work? We keep two cursors, i and j, corresponding to the times we buy and sell the stock. We use j to go forward in time, computing on every new day the profit we would make and if necessary updating the highest profit yet. So far so good, but then what happens is pretty weird: if we find a negative profit, we decide to restart buying today (at day j)! How on earth does this make sense? Imagine we are in this situation:

Then, you understand, moving forward with j but not with i will generate negative profits for a while. For sure, at some point we will reach a point of time k where the price has recovered to the same level as day i, but we could also have bought the stock in the dip between j and k and maybe that’s where the maximum profit will be! The key to understand the solution is that if we find a point in the future (after k) with better profits than the current best, we could make even more money by shifting i to the dip between j and k. Look:

Imagine we found a new best at time j', then clearly a better solution would be with i starting at the lowest point between j and k, and this is the purpose of the i = j statement. Importantly, i will always be at the bottom of a dip: the price immediately preceding i will be greater than or equal to it and the price immediately following i will be greater than it. Why? Because as long as prices keep going down (generating negative profits), i follows j until it gets to the bottom of the dip. Then, as soon as prices go up (generating positive profits), i will stay in the dip and j will move until it gets back down to the price at day i, at which point i follows j again to the bottom of the new dip.

I hope that writing this down will help me (and you!) remember this type of reasoning for the next similar problem. Feel free to apply this idea to the maximum subarray problem if you haven’t already, happy (leet)coding!

How do accumulating ETFs benefit individual investors exactly? Net asset value, authorized participants and creation/redemption mechanisms

2023-01-24T10:00:00+00:00

Even a superficial read about Exchange Traded Funds (ETFs) will reveal that there are two strategies by which ETFs handle dividends. Either they are passed to the individual ETF investors by the so-called distributing ETFs, or the fund keeps the dividends and promises to reinvest them into the market in what are called accumulating ETFs. You may now wonder, as an accumulating ETF investor, what tangible benefits do you get from this strategy?

The immediately obvious advantage of having the fund reinvest your dividends for you is that, well, you don’t forget to do it. The major reason to do this, however, is that your dividends are not taxed because, well, you never received them in the first place. For this reason, investing in accumulating ETFs will provide higher long term returns after taxes and fees.

However, as an investor in accumulating ETFs you may be puzzled to realize that the number of shares you own never goes up unless you yourself purchase some. Aren’t accumulating ETFs supposed to reinvest dividends? Somewhat naively, I thought that an accumulating ETF would give me dividends as additional shares rather than cash, but it doesn’t! So how do I benefit from dividends, how do I see that they exist? I dived into a moderately deep rabbit hole to understand why, and here I summarize what I found (jump to the conclusion at the end if you are impatient).

Net asset value (NAV)

To understand where dividends go, you need to know about the net asset value (NAV) of an ETF. Simply stated, the NAV is the net value of the fund, assets minus liabilities, divided by the number of shares. Imagine an ETF owning 80 shares of Company A and 20 shares of Company B. These shares are part of the fund’s assets, and if they are currently traded at 20 for Company A and 40 for Company B, then the fund owns 80x20+20x40=2400 in assets. Assuming for simplicity that the fund owns no cash (the other major type of asset) and has no liabilities, then its net value is also 2400, and if there are in total 100 circulating shares of this ETF then its NAV is 24.

Dividends increase the NAV of accumulating ETFs

Imagine that today is dividends day and that the accumulating ETF above receives 0.5 per share from Company A and 1 from Company B. Then, the ETF receives in total 80x0.5+20*1=60 in cash. Because of this additional cash, the assets of the funds increased from 2400 to 2460, and its NAV is now 24.6. Alice, owning 10 shares of this accumulating ETF, would still own 10 shares after dividends are issued by the two companies, and will not receive a single penny.

Dividends do not increase the NAV of distributing ETFs

While an accumulating ETF would keep that cash and increase its NAV, a distributing ETF would pass the cash to investors. As the ETF received 60 in dividends and is split into 100 shares, investors would receive 60/100=0.6 per share in dividends. Therefore Bob, owning 10 shares of the distributing ETF, would receive 6 in dividends.

Ah cool, the NAV increases. So?!

To recap, after dividends from accumulating and distributing ETFs are handled, Alice owns 10 shares with a NAV of 24.6, and Bob owns 10 shares with a NAV of 24 plus 6 in cash. Somehow it feels balanced, because both Alice and Bob own assets worth 246: for Alice 24.6x10=246 and for Bob 24x10+6=246. However, if you are Alice, you may feel something is missing: you received nothing from the accumulating ETF! The market price at which the ETF shares are traded is determined only by the laws of demand and offer and has nothing to do with the NAV, so Alice does not feel richer at all: even if the NAV of her fund increased, the market price did no.

Authorized participants (AP) pin market price to NAV

As described, the situation does look rather inconvenient, as the market price of an ETF is free to fluctuate according to market demand. But wait, if this is actually the case, how can ETFs track an index without straying, if they are also bought and sold like any other security? This is why authorized participants (AP) exist. APs are large financial institutions with lots of cash that sit between an ETF and the market and ensure that the market price does not deviate from the NAV by using two mechanisms called Creation and Redemption. Why do they do this? Because they make money in the process!

Creation and Redemption allow APs to control the market price

Creation and Redemption are two mechanism that respectively increase and decrease the number of available ETF shares. They allow controlling the market price of an ETF by modifying the supply side of the equation: creating shares increases supply and reduces the price, while redeeming shares reduces supply and increases the price. Conceptually, an ETF share is nothing more than a piece of paper which says “this paper is worth one share”, so in principle the owner of the ETF can create as many shares as they want. But obviously they just can’t create shares at will and distribute them around, because this will only devalue the existing shares without achieving anything more than angering investors.

Instead, shares are created by the fund selling them to an AP at the NAV price. This is extremely important: APs can purchase ETF shares at the NAV price! The NAV price!! NAV!!! The NAV is regularly reported publicly by the fund. So if you are a cash-strapped AP and notice that the market price of an ETF is higher than its NAV, what do you do? Obviously, you purchase ETF shares from the fund at the (lower) NAV price and sell them to the broader market for the (higher) market price, pocketing the difference! This is the creation mechanism in a nutshell. Redemption works in the opposite direction: if the market price is lower than the NAV, for example just after the fund receives dividends, APs purchase shares from the stock exchange and sell them to the fund at NAV price for profit. In our example above, the NAV increased to 24.6 after dividends while the market price remained at 24 (because APs matched it to the NAV before dividends), therefore an AP could purchase an ETF share in the stock market for 24 and redeem it with the fund for 24.6, with a profit of 2.5%.

I did not mention an important detail, namely that creating and redeeming shares is not performed with cash but rather with the underlying securities that make up the index tracked by the ETF. In our example, Company A was trading at 20 and Company B at 40. Following the 80/20 allocation above, the AP can exchange with the fund one ETF share, purchased for 24 in the stock market and redeemed for 24.6 with the fund, and receive (24.6x0.8)/20=0.984 shares of Company A and (24.6x0.2)/40=0.123 shares of Company B. At this point the AP could sell these shares in the stock market for 20x0.984+40x0.123=24.6 in cash, thus realizing the same gain of 2.5% or simply keep those shares and use them in the future to create ETF shares when the market price is higher than the NAV. There are several other reasons why ETFs benefit from APs, including lower ETF fees; read more here.

But then the market price of accumulating ETFs should be higher than that of its distributing siblings!

Right!? Because the NAV of accumulating ETFs keeps increasing while that of distributing ETFS does not, and APs will match the market price to the NAV. And indeed accumulating ETFs are pricier! Had I bothered to check before starting to read about NAV and APs I would have saved one afternoon (but I’d be slightly more ignorant). This is the relative change in market price during the last five years of an accumulating ETF, in green, and a distributing ETF, in blue, both tracking the MSCI World index:

As you can see, on August 14th, 2022 the market price of the accumulating ETF increased by 65.55% compared to January 1st, 2018, while in the same period the distributing ETF only increased by 52.63%. You can also see, at the bottom left, that dividends are not reinvested in the distributing ETF. Not including dividends gives the historical market price, while including them gives the return of an investor. By reinvesting dividends, the total (pre-tax, without fees) returns of investors in accumulating and distributing ETFs are identical: accumulating ETFs investors will own fewer but pricier shares, while distributing ETFs investors will own more, cheaper, shares, and this balances out so that the total assets are worth exactly the same.

Conclusion

The question I started with was: “where do my dividends go when investing in accumulating ETFs, and how do I benefit from them?” I was wondering this because of a misconception that led me to think that dividends would reach me as additional shares rather than cash. Instead, as an investor in accumulating ETFs I benefit by higher market prices compared to the equivalent distributing ETF, but the total value of the assets I own is the same.