About me
My name is Emilio. I am a postdoctoral researcher at Boehringer Ingelheim, a pharma company, where I use artificial intelligence to help my colleagues create safer drugs. I obtained my PhD at LMU Munich, Germany, where I studied ways to design more effective cancer vaccines.
Previously, I studied computer science and data science, and I have a minor degree in innovation & entrepreneurship. I also have experience in the industry as a data engineer, data scientist, and full-stack software engineer (curriculum vitae).
I write about
All posts
-
Handling Larger-than-memory Datasets in PyTorch Lightning: A Practical Guide
In the world of deep learning, data is king. The more data we have, the better our models can learn and predict. However, this abundance of data can also pose a significant challenge. What happens when our datasets are so large that they don’t fit into our system’s RAM? This is a common issue faced by many machine learning engineers, and it’s the problem we’ll tackle in this blog post. -
How to use Visual Studio Code to run and debug code on SLURM compute nodes
If you’re a developer or data scientist using SLURM to handle your compute workloads, you have surely encountered issues in debugging your code on compute nodes. In this blog post, I share a simple solution for that, allowing you to develop and debug code running directly with the compute resources you need. Although the focus is on Visual Studio Code, the same approach can be applied to other IDEs that support remote development via SSH. -
What is the average length of a queue of cars?
Some time ago I was driving on a twisty mountain road, stuck in a slow-moving queue of cars as it was impossible to overtake safely. Out of boredom, I started wondering how many cars were in the queue and, more generally, what the average length of the queues on this road would be. Let’s find out! -
How to save locally the result of XPath queries in Firefox and Chrome
It happens relatively often that, while browsing the internet like a normal person, I want to extract some data from a webpage, save it locally, and manipulate it in some way. Since it is a one-off operation, I really do not want to bother writing a web scraper in Python. Instead, here is a simple way of doing this through the developer console in Firefox or Chrome! -
Simplest Implementation of Diffusion Models
This tutorial presents the simplest possible implementation of diffusion models in plain PyTorch, following the exposition of Ho 2020, Denoising Diffusion Probabilistic Models (http://arxiv.org/abs/2006.11239). -
Temporary variables in Python
Very easy using context managers! -
Thoughts on the Impact of Large Language Models on Software Development
I believe that large language models (LLMs) such as ChatGPT, Copilot, GPT-4, etc., will become ubiquitous in software development. This will ultimately lead to even more software being written, and of lesser quality, more bloated and with more bugs. Additionally, good software developers will become harder to find. Obviously, making predictions about the future is difficult, and I realize there are many points in my argument that one could argue about. -
Transitive coins
Three coins each show heads with probability 3/5 and tails otherwise. The first coin gives 10 points for a head and 2 for a tail, the second gives 4 points for both head and tail, and the third gives 3 points for a head and 20 for a tail. You and your opponent each choose a coin; you cannot choose the same coin. Each of you tosses your coin and the person with the larger score wins 10$. Would you prefer to be the first to pick a coin or the second? -
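For readers who want to check their intuition, here is a minimal sketch (not taken from the post) that enumerates the four head/tail outcomes for any pair of coins and returns the exact probability that one coin outscores the other. The names `COINS` and `win_probability` are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

P_HEAD = Fraction(3, 5)

# Points for (head, tail), as given in the puzzle statement.
COINS = {
    "first": (10, 2),
    "second": (4, 4),
    "third": (3, 20),
}


def win_probability(mine, theirs):
    """Exact probability that coin `mine` scores strictly more than `theirs`."""
    outcomes = [(0, P_HEAD), (1, 1 - P_HEAD)]  # (index into points, probability)
    total = Fraction(0)
    for (i, p), (j, q) in product(outcomes, repeat=2):
        if COINS[mine][i] > COINS[theirs][j]:
            total += p * q
    return total


for mine, theirs in [("first", "second"), ("second", "third"), ("third", "first")]:
    print(mine, "beats", theirs, "with probability", win_probability(mine, theirs))
```

Because the two tosses are independent, the probability of each joint outcome is simply the product of the two individual probabilities.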
Best time to buy and sell stocks
There is a problem on LeetCode that goes like this: “You want to maximize your profit by choosing a single day to buy one stock and choosing a different day in the future to sell that stock. Find the maximum profit you can achieve from this transaction.” Where we can decide to sell “in the past” to maximize the profit. Finding the solution and why it works took me way too much effort (spoiler alert). -
How do accumulating ETFs benefit individual investors exactly? Net asset value, authorized participants and creation/redemption mechanisms
Even a superficial read about Exchange Traded Funds (ETFs) will reveal that there are two strategies by which ETFs handle dividends. Either they are passed to the individual ETF investors by the so-called distributing ETFs, or the fund keeps the dividends and promises to reinvest them into the market in what are called accumulating ETFs. You may now wonder, as an accumulating ETF investor, what tangible benefits do you get from this strategy? -
Automatically triggering make when editing files
Shortening the edit-build-test cycle as much as possible can greatly increase your productivity. This post presents a useful trick to run make, or any other command, every time a file is modified. I mainly use this when working with LaTeX to compile a PDF upon save, but I can imagine many other use cases. -
Go to conferences to meet people
I recently had the chance to attend the NeurIPS conference in person in New Orleans. Despite being a PhD student for almost four years, because of COVID and a research detour this was actually the first in-person conference I attended. Due to its scale, with more than 2,600 accepted papers in the main track and more than 10,000 people attending, it was an extremely overwhelming experience. -
The 37% rule
Suppose you want to pick the best item out of a collection, but you must decide after seeing each item whether to keep it and walk away or discard it and keep looking, losing this item forever. An apartment search in a competitive city is an example of this kind of decision process. The optimal stopping rule is to examine the first 37% of the items without committing, then choose the first item that is better than all the ones seen so far; incredibly, following this strategy one selects the absolute best item in 37% of the cases. But what happens when the best item is not selected? -
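The 37% figure can be checked numerically. The sketch below is my own illustration, not code from the post: `exact_success_probability` sums over the possible positions of the best item, and `simulate` plays the game on random orderings. Both assume that the first `r` items are always rejected (with `0 <= r < n`) and that you are stuck with the last item if nothing better ever shows up.

```python
import random


def exact_success_probability(n, r):
    """Probability of ending up with the best of n items when the first r are skipped."""
    if r == 0:
        return 1 / n  # we simply take the first item
    # If the best item sits at position k > r, we reach it only when the best
    # of the first k - 1 items falls within the first r, which has probability
    # r / (k - 1). Each position k has probability 1 / n.
    return sum(r / (k - 1) for k in range(r + 1, n + 1)) / n


def simulate(n, r, trials, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))  # higher is better, n - 1 is the best item
        rng.shuffle(ranks)
        best_seen = max(ranks[:r], default=-1)
        chosen = ranks[-1]  # fallback: forced to take the last item
        for item in ranks[r:]:
            if item > best_seen:
                chosen = item
                break
        wins += chosen == n - 1
    return wins / trials


print(exact_success_probability(100, 37))
print(simulate(100, 37, trials=20_000))
```

Trying a few values of `r` around 37 for `n = 100` shows how flat the optimum is: skipping a few more or a few fewer items barely changes the success probability.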
Running multiple experiments via Snakemake
After proposing a new method you have to evaluate it in a range of scenarios in order to understand its strengths and weaknesses. This usually means trying several variations of the method, on different datasets, multiple times, and analyzing the results. The total number of runs you have to do quickly explodes, and automation becomes essential to manage all these experiments. -
NetworkManager settings to improve GSM connection in the PinePhone
I am so happy with my PinePhone, but there is one particular issue that significantly decreased its usability: connecting to the GSM mobile network. Fortunately, with some simple tweaks to NetworkManager it is possible to make this operation much faster. -
Doing research is like doing a puzzle
I was gifted a puzzle for my birthday and while assembling it I noticed some similarities with the work of a researcher. Hear me out. -
PhD Metagame - Mistakes were made
Three years have passed, and I have slowly realized how far behind in the academic game I am compared to some of my peers. I knew relatively early on that I would eventually switch to industry, thus I consciously did not optimize for academic success. However, the extent of that gap is mind-blowing, and it is natural to wonder how some people could get so far ahead given a similar environment, time and resources. -
Teaching Variational Autoencoders
Trying to explain the fundamental concepts behind variational autoencoders made me realize something much deeper about learning and teaching. -
How to split money fairly after a vacation
After a week of fun and relaxation with friends, splurging money without second thoughts, it is time to do the sums and make sure everybody paid their fair share. How do we do this? -
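One simple way to settle up, which is not necessarily the method described in the post, is to compute how far each person is from an equal share and then greedily match people who owe money with people who are owed money. The function name `settle` and the example amounts are hypothetical, and amounts are assumed to be integers (for example cents) so that the arithmetic stays exact.

```python
from fractions import Fraction


def settle(paid):
    """Return (debtor, creditor, amount) transfers that equalize spending.

    `paid` maps each person to the integer amount they paid in total.
    """
    share = Fraction(sum(paid.values()), len(paid))
    balances = {name: Fraction(amount) - share for name, amount in paid.items()}
    # People below the equal share owe money, people above it are owed money.
    debtors = sorted((name, -b) for name, b in balances.items() if b < 0)
    creditors = sorted((name, b) for name, b in balances.items() if b > 0)

    transfers = []
    i = j = 0
    while i < len(debtors) and j < len(creditors):
        debtor, owes = debtors[i]
        creditor, due = creditors[j]
        amount = min(owes, due)
        transfers.append((debtor, creditor, amount))
        debtors[i] = (debtor, owes - amount)
        creditors[j] = (creditor, due - amount)
        if debtors[i][1] == 0:
            i += 1
        if creditors[j][1] == 0:
            j += 1
    return transfers


# Amounts in cents: Alice paid 120.00, Bob 60.00, Carol nothing.
print(settle({"alice": 12000, "bob": 6000, "carol": 0}))
```

This greedy matching needs at most one transfer fewer than the number of people, but it assumes that everybody should contribute equally; uneven shares would require weighting the balances.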
Get vaccinated now or take an antiviral after showing Covid19 symptoms?
Pfizer’s antiviral Covid19 drug (Paxlovid, a combination of nirmatrelvir and ritonavir) was shown to be 89% effective in reducing hospitalization and death when administered within three days of the onset of symptoms. As a reasonable person, you may now wonder whether it is safer to get a vaccine now, or to take that antiviral after showing Covid19 symptoms. This is hard to tell with the information currently available, but here I show how to think about this problem and get closer to an answer. -
A loss function for positive unlabeled learning
Positive unlabeled (PU) learning is a semi-supervised binary classification setting where no labeled negative examples are available to learn a classifier. This means that the dataset is composed of a set of labeled positive examples and a usually much larger set of unlabeled examples containing both positives and negatives. Despite the absence of labeled negatives, a special loss function exists to learn from PU data. -
Bouncing balls
A projectile is launched from the very center of the floor of a rectangular room that is 40 feet wide with a very high ceiling. The projectile hits the wall at a height exactly 10 feet above the floor, reflects off this wall (obeying the “angle of incidence equals angle of reflection” rule), hits the opposite wall, and reflects again, finally landing back exactly where it was launched, without hitting the ceiling. This is possible because the projectile does not travel along straight lines, but instead travels along parabolic segments due to gravity. When the projectile is at its highest point, how high above the floor is it? -
Eight tips to effectively supervise students during their Master's thesis
I am a fan of knowledge transfer between peers, teaching what I know to others and learning back from them. At University I frequently helped my fellow course mates with the material, so I was very interested in formally mentoring students when I started my PhD. Luckily my supervisor, who is really talented at this, agreed to let me help him with supervising some Master’s theses. In this article, also published as a Nature Career Column, I present eight lessons that I learned by watching him at work and trying on my own. -
Open Science Stories Podcast - Sharing source code
I had the unique opportunity of recording an episode for a podcast about Open Science hosted by my colleague Heidi Seibold. I talk about how making research artifacts open source can help professional and aspiring scientists to better understand your work. Here’s the transcript and my thoughts on it. -
Orange Slices
I found an interesting geometrical riddle on Twitter that I could not ignore. After I read the first few chapters of “The art of problem solving” I wanted to challenge myself, and this turned out to be a very nice problem with a neat solution. -
Embedding files into Jupyter notebooks with clickable download links
Sharing Jupyter notebooks or exporting them to HTML is a great way of sharing the results of an analysis with other stakeholders. Some analyses, however, produce additional data that cannot simply be shown in the notebook. In such cases, your only option is to send additional files along with the notebook. Or is it? -
The expectation maximization algorithm without the agonizing pain
My first introduction to the expectation maximization (EM) algorithm was a bit traumatic: rarely have I been left so clueless by a lecture. Even after reading the relevant chapter in Bishop’s venerable Machine Learning book a couple of times things were not so clear. After some more struggles it finally clicked, and it is really simple! -
Temporary variables in Python's list comprehension
Although list comprehensions are very handy, it is difficult to write non-trivial expressions, mostly because it is not possible to use variables to store temporary results. Or is it? -
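One common idiom, which may or may not be the one used in the post, is to loop over a single-element list so that a name gets bound inside the comprehension. The variables `values` and `roots` below are made up for illustration.

```python
import math

values = [1, 4, 9, 15, 16]

# Bind the square root to `root` by iterating over a one-element list, so it
# is computed once and can be reused both in the filter and in the output.
roots = [
    root
    for v in values
    for root in [math.sqrt(v)]
    if root == int(root)
]
print(roots)
```

Since Python 3.8, an assignment expression (`:=`) inside the `if` clause is another way to achieve a similar effect.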
Speedrunning the NeurIPS Bayesian Deep Learning Poster Session
Thanks to the virtual platform Gather.town, I could swiftly zip from poster to poster in the NeurIPS Bayesian Deep Learning meetups and write a short summary of all 71 posters in only 2:15 hours! -
Who cares about transposes?
Exchanging rows and columns of a matrix is hardly an inspiring operation. Yet in linear algebra we frequently take the transpose of a matrix. Why is that? -
How I almost lost one year worth of notes!
I accidentally deleted most newline characters in my 10-thousand-line, 65-thousand-word notes, painstakingly collected over more than a year of PhD! This is how I recovered them. -
One year of PhD: a retrospective
After one year, it is time to reflect on my time as a PhD student: what was good, what was bad, what was difficult, and how to improve. -
Using large numpy arrays and pandas dataframes with multiprocessing
Thanks to multiprocessing, it is relatively straightforward to write parallel code in Python. However, these processes communicate by copying and (de)serializing data, which can make parallel code even slower when large objects are passed back and forth. This post shows how to use shared memory to avoid all the copying and serializing, making it possible to have fast parallel code that works with large datasets. -
... but humans can learn with vErY fEw ExAmPlEs !!11!!!1!
Next time you hear somebody saying that artificial intelligence is flawed because it requires millions of examples to learn anything while humans only need very few, resist the urge of kicking them in the teeth and show them this list instead. -
A tricky question
A few months ago (it took me a while to wrap this up) I stumbled upon this question on the internet: Which answer in this list is the correct answer to this question? All of the below, None of the below, All of the above, One of the above, None of the above, or None of the above. -
Technology of vaccines for COVID-19
Currently, several vaccines for COVID-19 are undergoing clinical trials. They are based on a variety of innovative technological platforms, several of which have never been used in any licensed vaccine. This post analyzes five of them, and presents a layman explanation of their principles of operation. -
Joint epitope selection and spacer design for string-of-beads vaccines
We have just submitted a paper describing a new framework for vaccine design. Similarly to our previous project, the main innovation here is that we take a holistic approach to the design problem, and show how this improves the end result. -
Automatic differentiation from scratch
Automatic differentiation (AD) is one of the most important features present in modern frameworks such as TensorFlow, PyTorch, Theano, etc. AD has made parameter optimization through gradient descent an order of magnitude faster and easier, and drastically lowered the barrier of entry for people without a solid mathematical background. In spite of its utility, AD is surprisingly simple to implement, which is what we are going to do here. -
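To give a flavor of how little code is needed, here is a minimal reverse-mode sketch supporting only addition and multiplication of scalars. It is my own illustration and not the implementation from the post; the class name `Var` and its attributes are assumptions made for this example.

```python
class Var:
    """A scalar that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(
            self.value * other.value,
            ((self, other.value), (other, self.value)),
        )

    # Both operations are commutative, so the reflected versions are identical.
    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        """Accumulate d(self)/d(node) into node.grad for every ancestor node."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)  # `order` now lists every node after all of its parents
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: push this node's gradient back to its parents.
            for parent, local in node.parents:
                parent.grad += local * node.grad


x, y = Var(3.0), Var(2.0)
z = x * y + x * x
z.backward()
print(z.value, x.grad, y.grad)
```

Visiting nodes in reverse topological order ensures that a node's gradient is complete before it is propagated further, which matters whenever a variable is used more than once, like `x` here.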
Limits of single-hidden-layer neural networks
A few decades ago, researchers were trying to understand which kinds of shapes can be modeled by neural networks. Even after the universal approximation theorem was proven, they still wanted to know which kinds of decision regions (i.e. regions in the input space classified as positive) can be exactly reproduced, and which ones can only be approximated, with a neural network with a single hidden layer. -
Drug therapy for the Coronavirus
Last month, a new epidemic started in Wuhan, China, and quickly spread all over the world. As of now, there are reports of a few companies rushing to develop, or having already developed, a vaccine for this virus. -
Six months of PhD: a retrospective
In this post, I am going to reflect on my past six months as a PhD student: what went right, what went wrong, what was good, and what was bad. -
Flippin' Cards
Here’s a riddle for you: A friend brings you into a dark room and hands you a shuffled deck of 40 cards, 10 of which are facing up while the other 30 are facing down. Your task, she tells you, is to come out of the room holding two decks that have the same number of cards facing up. As it is dark in the room, you cannot see anything, and the cards cannot be distinguished by touch. How can you do it? -
The first project of my PhD
I am in the process of publishing the first paper I wrote in my PhD, and, as promised, with this post I am making an effort to explain this work to a less technical audience. -
(How) should researchers worry about ethics?
A few days ago I saw a tweet mentioning that NLP researchers should seriously start worrying about the ethical implications of their discoveries. That is easier said than done. -
How to predict aleatoric uncertainty for log-transformed data
Suppose you want to train a neural network (or any other model) on a regression problem with heteroscedastic (i.e. data-dependent) noise; a way to do that is to have the model predict both the mean and the variance of the output, and include this variance in the log likelihood. -
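As a reminder of what including the variance in the log likelihood means, here is a small sketch of the Gaussian negative log likelihood for a single observation. It covers only the plain Gaussian case, not the log-transformed setting the post is about, and the function name `gaussian_nll` and the choice of predicting the log-variance are assumptions made for this illustration.

```python
import math


def gaussian_nll(y, mean, log_var):
    """Negative log likelihood of y under a Normal(mean, exp(log_var)).

    Predicting the log-variance keeps the variance positive for any real output.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (y - mean) ** 2 / var)


# A large error is penalized less when the model also predicts a large variance,
# at the cost of the log-variance term growing.
print(gaussian_nll(3.0, 1.0, 0.0))
print(gaussian_nll(3.0, 1.0, math.log(4.0)))
```

With the variance fixed to one, this reduces to half the squared error plus a constant.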
A gentle introduction to natural language and genomics
During the research I am conducting for my next project I stumbled upon the intriguing idea of applying tools developed for Natural Language Processing (NLP) to bioinformatics. After all, the comparison seems to hold up: you can think of the DNA as a collection of books (genes), each of which contains several chapters (proteins) related to a certain topic. -
xkcd commentary - Frequentists vs. Bayesians
I found this xkcd comic hilarious and, at the same time, brilliant: -
What is my PhD about?
I started my PhD last month, after a not-so-good experience in the consulting industry that prompted me to reflect deeply on my values and goals. I thought writing a blog would be a good way to improve my communication skills and further disseminate my research beyond specialized conferences and journals.