{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "layout: \"post\"\n",
    "title: \"Using large numpy arrays and pandas dataframes with multiprocessing\"\n",
    "date: 2020-06-19 09:00:00 +0200\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thanks to multiprocessing, it is relatively straightforward to write parallel\n",
    "code in Python. However, these processes communicate by copying and\n",
    "(de)serializing data, which can make parallel code even slower when large\n",
    "objects are passed back and forth. This post shows how to use shared memory to\n",
    "avoid all the copying and serializing, making it possible to have fast parallel\n",
    "code that works with large datasets.\n",
    "\n",
    "<!-- more -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to demonstrate the problem empirically, let us create a large data-frame and do some processing on each row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import multiprocessing as mp\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data size: 32.4 MB\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Col-0</th>\n",
       "      <th>Col-1</th>\n",
       "      <th>Col-2</th>\n",
       "      <th>Col-3</th>\n",
       "      <th>Col-4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Idx-0</th>\n",
       "      <td>0.759636</td>\n",
       "      <td>0.475505</td>\n",
       "      <td>0.625045</td>\n",
       "      <td>0.806929</td>\n",
       "      <td>0.447964</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Idx-1</th>\n",
       "      <td>0.085767</td>\n",
       "      <td>0.038912</td>\n",
       "      <td>0.726519</td>\n",
       "      <td>0.771332</td>\n",
       "      <td>0.287777</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Idx-2</th>\n",
       "      <td>0.781693</td>\n",
       "      <td>0.589335</td>\n",
       "      <td>0.485391</td>\n",
       "      <td>0.090572</td>\n",
       "      <td>0.559625</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Idx-3</th>\n",
       "      <td>0.309681</td>\n",
       "      <td>0.307883</td>\n",
       "      <td>0.656973</td>\n",
       "      <td>0.104621</td>\n",
       "      <td>0.171662</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Idx-4</th>\n",
       "      <td>0.477405</td>\n",
       "      <td>0.211749</td>\n",
       "      <td>0.692536</td>\n",
       "      <td>0.574679</td>\n",
       "      <td>0.379362</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Col-0     Col-1     Col-2     Col-3     Col-4\n",
       "Idx-0  0.759636  0.475505  0.625045  0.806929  0.447964\n",
       "Idx-1  0.085767  0.038912  0.726519  0.771332  0.287777\n",
       "Idx-2  0.781693  0.589335  0.485391  0.090572  0.559625\n",
       "Idx-3  0.309681  0.307883  0.656973  0.104621  0.171662\n",
       "Idx-4  0.477405  0.211749  0.692536  0.574679  0.379362"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows, cols = 1000, 5000\n",
    "df = pd.DataFrame(\n",
    "    np.random.random(size=(rows, cols)),\n",
    "    columns=[f'Col-{i}' for i in range(cols)],\n",
    "    index=[f'Idx-{i}' for i in range(rows)]\n",
    ")\n",
    "\n",
    "print(f'Data size: {df.values.nbytes / 1024 / 1204:.1f} MB')\n",
    "df.iloc[:5, :5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following is a simple, and a bit silly, row transformation: we construct a random matrix of the same size as the dataframe, take the mean across rows, and compute the outer product between it and the specified row. It is not too heavy computationally, but (this is the important part), both its inputs and outputs are matrices of size $1000\\times5000$, which have five millions entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def do_work(args):\n",
    "    df, idx = args\n",
    "    data = np.random.random(size=(len(df), len(df.columns)))\n",
    "    result = np.outer(df.loc[idx], data.mean(axis=0))\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first set a baseline by transforming 250 random rows in a sequential fashion:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 250/250 [00:17<00:00, 13.98it/s]\n"
     ]
    }
   ],
   "source": [
    "process_rows = np.random.choice(len(df), 250)\n",
    "for i in tqdm(process_rows):\n",
    "    result = do_work((df, df.index[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is how you would naively transform this code to a parallel version using multiprocessing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 250/250 [01:37<00:00,  2.57it/s]\n"
     ]
    }
   ],
   "source": [
    "with mp.Pool() as pool:\n",
    "    tasks = ((df, df.index[idx]) for idx in process_rows)\n",
    "    result = pool.imap(do_work, tasks)\n",
    "    for res in tqdm(result, total=len(process_rows)):\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is much slower! As I hinted above, the problem is that the processes are exchanging a lot of data that has to be serialized, copied, and de-serialized. All of this takes time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The solution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thansk to [`shared_memory`](https://docs.python.org/3/library/multiprocessing.shared_memory.html), making this fast is a breeze! A caveat, though: it only works with Python 3.8 or above.\n",
    "\n",
    "We are first going to deal with plain numpy arrays, then build upon this to share pandas dataframes. The idea is to write a wrapper that takes care of moving data to and from the shared memory. I strongly encourage you to read the documentation I linked above, which explains this in more detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from multiprocessing.shared_memory import SharedMemory\n",
    "\n",
    "\n",
    "class SharedNumpyArray:\n",
    "    '''\n",
    "    Wraps a numpy array so that it can be shared quickly among processes,\n",
    "    avoiding unnecessary copying and (de)serializing.\n",
    "    '''\n",
    "    def __init__(self, array):\n",
    "        '''\n",
    "        Creates the shared memory and copies the array therein\n",
    "        '''\n",
    "        # create the shared memory location of the same size of the array\n",
    "        self._shared = SharedMemory(create=True, size=array.nbytes)\n",
    "        \n",
    "        # save data type and shape, necessary to read the data correctly\n",
    "        self._dtype, self._shape = array.dtype, array.shape\n",
    "        \n",
    "        # create a new numpy array that uses the shared memory we created.\n",
    "        # at first, it is filled with zeros\n",
    "        res = np.ndarray(\n",
    "            self._shape, dtype=self._dtype, buffer=self._shared.buf\n",
    "        )\n",
    "        \n",
    "        # copy data from the array to the shared memory. numpy will\n",
    "        # take care of copying everything in the correct format\n",
    "        res[:] = array[:]\n",
    "\n",
    "    def read(self):\n",
    "        '''\n",
    "        Reads the array from the shared memory without unnecessary copying.\n",
    "        '''\n",
    "        # simply create an array of the correct shape and type,\n",
    "        # using the shared memory location we created earlier\n",
    "        return np.ndarray(self._shape, self._dtype, buffer=self._shared.buf)\n",
    "\n",
    "    def copy(self):\n",
    "        '''\n",
    "        Returns a new copy of the array stored in shared memory.\n",
    "        '''\n",
    "        return np.copy(self.read_array())\n",
    "        \n",
    "    def unlink(self):\n",
    "        '''\n",
    "        Releases the allocated memory. Call when finished using the data,\n",
    "        or when the data was copied somewhere else.\n",
    "        '''\n",
    "        self._shared.close()\n",
    "        self._shared.unlink()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, use this class to wrap the array, and send this wrapped object as parameter to a process and/or as a return value of the process, it's that simple! \n",
    "\n",
    "In particular, note that the array itself is not saved in the object. That is the whole point, we do not want to move it around! We can move the shared memory, though, because doing so will not copy the underlying memory, only a reference to it will be moved. Also note the `unlink` function: you must not forget to call it whenever you are done working with the array, or, alternatively, when you stored a copy somewhere else. If you don't do this, that shared memory will never be disposed of, and eventually you will run out of memory.\n",
    "\n",
    "Using that wrapper, it is trivial to share a pandas dataframe: we wrap the values using the class above, and save index and columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SharedPandasDataFrame:\n",
    "    '''\n",
    "    Wraps a pandas dataframe so that it can be shared quickly among processes,\n",
    "    avoiding unnecessary copying and (de)serializing.\n",
    "    '''\n",
    "    def __init__(self, df):\n",
    "        '''\n",
    "        Creates the shared memory and copies the dataframe therein\n",
    "        '''\n",
    "        self._values = SharedNumpyArray(df.values)\n",
    "        self._index = df.index\n",
    "        self._columns = df.columns\n",
    "\n",
    "    def read(self):\n",
    "        '''\n",
    "        Reads the dataframe from the shared memory\n",
    "        without unnecessary copying.\n",
    "        '''\n",
    "        return pd.DataFrame(\n",
    "            self._values.read(),\n",
    "            index=self._index,\n",
    "            columns=self._columns\n",
    "        )\n",
    "    \n",
    "    def copy(self):\n",
    "        '''\n",
    "        Returns a new copy of the dataframe stored in shared memory.\n",
    "        '''\n",
    "        return pd.DataFrame(\n",
    "            self._values.copy(),\n",
    "            index=self._index,\n",
    "            columns=self._columns\n",
    "        )\n",
    "        \n",
    "    def unlink(self):\n",
    "        '''\n",
    "        Releases the allocated memory. Call when finished using the data,\n",
    "        or when the data was copied somewhere else.\n",
    "        '''\n",
    "        self._values.unlink()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is how to use them, and how quick it is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def work_fast(args):\n",
    "    shared_df, idx = args\n",
    "    \n",
    "    # read dataframe from shared memory\n",
    "    df = shared_df.read()\n",
    "    \n",
    "    # call old function\n",
    "    result = do_work((df, idx))\n",
    "    \n",
    "    # wrap and return the result\n",
    "    return SharedNumpyArray(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 250/250 [00:13<00:00, 17.94it/s]\n"
     ]
    }
   ],
   "source": [
    "shared_df = SharedPandasDataFrame(df)\n",
    "\n",
    "with mp.Pool() as pool:\n",
    "    tasks = ((shared_df, df.index[idx]) for idx in process_rows)\n",
    "    result = pool.imap(work_fast, tasks)\n",
    "    for res in tqdm(result, total=len(process_rows)):\n",
    "        res.unlink()  # IMPORTANT\n",
    "\n",
    "shared_df.unlink()  # IMPORTANT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wait a minute! you might say, this is barely faster than the single-process version! Well yes, but it is roughly seven times faster than the multiprocessing version :). You might have noticed that there is still some copying going on, after all: when we create the shared memory, we have to copy the array in there. Depending on the computations you perform in the worker process, you might be able to avoid this, e.g. by pre-allocating the shared memory and performing only in-place operations, but it strongly depends on exactly what and how you compute.\n",
    "\n",
    "The advantage of multiprocessing with shared memory becomes more apparent when workers perform more computations. For example, suppose we want to take the convolution of the whole dataframe with a $20\\times20$ filter made by $20^2$ random entries of the specified row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.ndimage import convolve\n",
    "\n",
    "def do_work(args):\n",
    "    df, idx = args\n",
    "    kernel_idx = np.random.choice(df.shape[1], 20 * 20)\n",
    "    kernel = df.loc[idx][kernel_idx].values.reshape((20, 20))\n",
    "    result = convolve(df.values, kernel)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, let us compare the sequential and multi-process versions. We can safely rule out the naive multiprocessing solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [00:46<00:00,  1.93s/it]\n"
     ]
    }
   ],
   "source": [
    "process_rows = np.random.choice(len(df), 24)\n",
    "for i in tqdm(process_rows):\n",
    "    result = do_work((df, df.index[i]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [00:08<00:00,  2.83it/s]\n"
     ]
    }
   ],
   "source": [
    "shared_df = SharedPandasDataFrame(df)\n",
    "\n",
    "with mp.Pool() as pool:\n",
    "    tasks = ((shared_df, df.index[idx]) for idx in process_rows)\n",
    "    result = pool.imap(work_fast, tasks)\n",
    "    for res in tqdm(result, total=len(process_rows)):\n",
    "        res.unlink()  # IMPORTANT\n",
    "\n",
    "shared_df.unlink()  # IMPORTANT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the computations take much longer than copying the result, the advantage becomes clearer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When is this solution (not) applicable?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As discussed above, there is still some copying involved, therefore it is not straightforward to tell when this solution might be faster. When the size of the result is large, but the computations required to obtain it are not so heavy, a sequential approach might be faster, but it is not clear where to draw the line.\n",
    "\n",
    "Another case to watch out is when you have large inputs, but small outputs. This solution is not necessary when you only read the input, but do not modify it. This is because the inputs follow a mechanism called [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write), i.e. are not copied _unless_ they are modified. This can be shown by slightly modifying the example above to return the sum of the convolution, instead of the convolution itself:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [00:09<00:00,  2.47it/s]\n"
     ]
    }
   ],
   "source": [
    "def do_work(args):\n",
    "    df, idx = args\n",
    "    kernel_idx = np.random.choice(df.shape[1], 20 * 20)\n",
    "    kernel = df.loc[idx][kernel_idx].values.reshape((20, 20))\n",
    "    result = convolve(df.values, kernel)\n",
    "    return result.sum()\n",
    "\n",
    "\n",
    "with mp.Pool() as pool:\n",
    "    tasks = ((df, df.index[idx]) for idx in process_rows)\n",
    "    result = pool.imap(do_work, tasks)\n",
    "    for res in tqdm(result, total=len(process_rows)):\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is now as fast as the code above returning the whole result of the convolution.\n",
    "\n",
    "However, if you have classes the trick above becomes necessary again. I honestly do not know why. This also happens if the worker function is defined outside of the class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shared\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [00:00<00:00, 10363.77it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Not shared\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [00:29<00:00,  1.21s/it]\n"
     ]
    }
   ],
   "source": [
    "class Worker:\n",
    "    def __init__(self):\n",
    "        self.data = np.random.random(size=(10000, 10000))\n",
    "\n",
    "    @staticmethod\n",
    "    def work_not_shared(args):\n",
    "        data, i = args\n",
    "        return data[i].mean()\n",
    "    \n",
    "    def run_not_shared(self):\n",
    "        with mp.Pool() as pool:\n",
    "            tasks = [[self.data, idx] for idx in range(24)]\n",
    "            result = pool.imap(self.work_not_shared, tasks)\n",
    "            for res in tqdm(result, total=len(tasks)):\n",
    "                pass\n",
    "    \n",
    "    @staticmethod\n",
    "    def work_shared(args):\n",
    "        data, i = args\n",
    "        return data.read()[i].mean()\n",
    "    \n",
    "    def run_shared(self):\n",
    "        shared = SharedNumpyArray(self.data)\n",
    "        with mp.Pool() as pool:\n",
    "            tasks = [[shared, idx] for idx in range(24)]\n",
    "            result = pool.imap(self.work_shared, tasks)\n",
    "            for res in tqdm(result, total=len(tasks)):\n",
    "                pass\n",
    "        shared.unlink()\n",
    "\n",
    "\n",
    "print('Shared')\n",
    "Worker().run_shared()\n",
    "\n",
    "print('Not shared')\n",
    "Worker().run_not_shared()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Happy multiprocessing!\n",
    "\n",
    "This blog post, by the way, is fully contained in a jupyter notebook downloadable from [here](/attachments/multiprocessing-large-objects.ipynb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:abm2]",
   "language": "python",
   "name": "conda-env-abm2-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}