{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "layout: \"post\"\n", "title: \"Using large numpy arrays and pandas dataframes with multiprocessing\"\n", "date: 2020-06-19 09:00:00 +0200\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to multiprocessing, it is relatively straightforward to write parallel\n", "code in Python. However, the worker processes communicate by copying and\n", "(de)serializing data, which can make parallel code slower than its sequential\n", "counterpart when large objects are passed back and forth. This post shows how\n", "to use shared memory to avoid all the copying and serializing, making it\n", "possible to have fast parallel code that works with large datasets.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To demonstrate the problem empirically, let us create a large DataFrame and do some processing on each row:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import multiprocessing as mp\n", "import numpy as np\n", "import pandas as pd\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data size: 32.4 MB\n" ] }, { "data": { "text/html": [ "
| | Col-0 | Col-1 | Col-2 | Col-3 | Col-4 |
|---|---|---|---|---|---|
| Idx-0 | 0.759636 | 0.475505 | 0.625045 | 0.806929 | 0.447964 |
| Idx-1 | 0.085767 | 0.038912 | 0.726519 | 0.771332 | 0.287777 |
| Idx-2 | 0.781693 | 0.589335 | 0.485391 | 0.090572 | 0.559625 |
| Idx-3 | 0.309681 | 0.307883 | 0.656973 | 0.104621 | 0.171662 |
| Idx-4 | 0.477405 | 0.211749 | 0.692536 | 0.574679 | 0.379362 |