ReCoDE_MCMCFF/learning/05 Adding functionality.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5ac56056-ca33-4f13-8e36-564b94144c1e",
   "metadata": {
    "tags": []
   },
   "source": [
    "<h1 align=\"center\">Markov Chain Monte Carlo for fun and profit</h1>\n",
    "<h1 align=\"center\"> 🎲 ⛓️ 👉 🧪 </h1>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "eb5d773e-4cc0-48ae-bb71-7ece7ab5f936",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from numba import jit\n",
    "\n",
    "# This loads some custom styles for matplotlib\n",
    "import json, matplotlib\n",
    "\n",
    "with open(\"assets/matplotlibrc.json\") as f:\n",
    "    matplotlib.rcParams.update(json.load(f))\n",
    "\n",
    "np.random.seed(\n",
    "    42\n",
    ")  # This makes our random numbers reproducable when the notebook is rerun in order"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "486f066c-f027-44e8-8937-8636a52f32fb",
   "metadata": {},
   "source": [
    "## Functionality\n",
    "\n",
    "The main thing we want to be able to do is to take measurements, the code as I have writting it doesn't really allow that because it only returns the final state in the chain. Let's say we have a measurement called `average_color(state)` that we want to average over the whole chain. We could just stick that inside our definition of `mcmc` but we know that we will likely make other measurements too and we don't want to keep writing new versions of our core functionality!\n",
    "\n",
    "## Exercise 1\n",
    "Have a think about how you would implement this and what options you have."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c28b0a86-28f8-426f-9013-70e962f02256",
   "metadata": {},
   "source": [
    "## Solution 1\n",
    "So I chatted with my mentors on this project on how to best do this and we came up with a few ideas:\n",
    "\n",
    "### Option 1: Just save all the states and return them\n",
    "\n",
    "The problem with this is the states are very big and we don't want to waste all that memory. For an NxN state that uses 8 bit integers (the smallest we can use in numpy) 1000 samples would already use 2.5Gb of memory! We will see later that we'd really like to be able to go a bit bigger than 50x50 and 1000 samples!\n",
    "\n",
    "### Option 2: Pass in a function to make measurements\n",
    "```python\n",
    "\n",
    "def mcmc(initial_state, steps, T, measurement, energy=energy):\n",
    "    ...\n",
    "\n",
    "    current_state = initial_state.copy()\n",
    "    E = N**2 * energy(current_state)\n",
    "    for i in range(steps):\n",
    "        measurements[i] = measurement(state)\n",
    "        ...\n",
    "\n",
    "    return measurements\n",
    "```\n",
    "\n",
    "This could work but it limits how we can store measurements and what shape and type they can be. What if we want to store our measurements in a numpy array? Or what if your measurement itself is a vector or and object that can't easily be stored in a numpy array? We would have to think carefully about what functionality we want."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7c9575f-2450-4298-a507-90f0c1b9b284",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Option 3: Use Inheritance\n",
    "```python\n",
    "# This class would define the basic functionality of performing MCMC\n",
    "class MCMCSampler(object):\n",
    "    def run(self, initial_state, steps, T):\n",
    "        ...\n",
    "        for i in range(steps):\n",
    "            self.measurement(state)\n",
    "\n",
    "       \n",
    "# This class would inherit from it and just implement the measurement\n",
    "class AverageColorSampler(MCMCSampler):\n",
    "    measurements = np.zeros(10)\n",
    "    index = 0\n",
    "    \n",
    "    def measurement(self, state):\n",
    "        self.measurements[self.index] = some_function(state)\n",
    "        self.index += 1\n",
    "        \n",
    "color_sampler = AverageColorSampler(...)\n",
    "measurements = color_sampler.run(...)\n",
    "```\n",
    "\n",
    "This would definitely work but I personally am not a huge fan of object oriented programming so I'm gonna skip this option!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d05d25d-c9ba-406d-9977-0ca4aeb430a7",
   "metadata": {},
   "source": [
    "## Option 4: Use a generator\n",
    "This is the approach I ended up settling on, we will use [python generator function](https://peps.python.org/pep-0255/). While you may not have come across generator functions before, you almost certainly will have come across generators, `range(n)` is a generator, `(i for i in [1,2,3])` is a generator. Generator functions are a way to build your own generators, by way of example here is range implemented as a generator function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5b17c054-230f-4188-98a6-51fd8fe5b437",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<generator object my_range at 0x7fab3acae4a0>, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def my_range(n):\n",
    "    \"Behaves like the builtin range function of one argument\"\n",
    "    i = 0\n",
    "    while i < n:\n",
    "        yield i  # sends i out to whatever function called us\n",
    "        i += 1\n",
    "    return  # let's python know that we have nothing else to give\n",
    "\n",
    "\n",
    "my_range(10), list(my_range(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b74fadbe-80c2-4a20-b651-0e47188b005a",
   "metadata": {},
   "source": [
    "This requires only a very small change to our mcmc function and suddenly we can do whatever we like with the states! While we're at it I'm going to add an aditional argument `stepsize` that allows us to only sample the state every `stepsize` MCMC steps. You'll see why we would want to set this to value greater than 1 in a moment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f73d6335-6514-45b1-9128-d72122d8b0b7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "19.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
      "10.8 ms ± 869 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
      "0x slowdown!\n"
     ]
    }
   ],
   "source": [
    "from MCFF.ising_model import energy, all_up_state\n",
    "from MCFF.mcmc import mcmc\n",
    "\n",
    "\n",
    "@jit(nopython=True, nogil=True)\n",
    "def mcmc_generator(initial_state, steps, T, stepsize=1000, energy=energy):\n",
    "    N, M = initial_state.shape\n",
    "    assert N == M\n",
    "\n",
    "    current_state = initial_state.copy()\n",
    "    E = N**2 * energy(current_state)\n",
    "    for _ in range(steps):\n",
    "        for _ in range(stepsize):\n",
    "            i, j = np.random.randint(N), np.random.randint(N)\n",
    "\n",
    "            # modify the state a little, here we just flip a random pixel\n",
    "            current_state[i, j] *= -1\n",
    "            new_E = N**2 * energy(current_state)\n",
    "\n",
    "            if (new_E < E) or np.exp(-(new_E - E) / T) > np.random.random():\n",
    "                E = new_E\n",
    "            else:\n",
    "                current_state[i, j] *= -1  # reject the change we made\n",
    "        yield current_state.copy()\n",
    "    return\n",
    "\n",
    "\n",
    "N_steps = 1000\n",
    "stepsize = 1\n",
    "initial_state = all_up_state(20)\n",
    "without_yield = %timeit -o mcmc(initial_state, steps = N_steps, T = 5)\n",
    "with_yield = %timeit -o [np.mean(s) for s in mcmc_generator(initial_state, T = 5, steps = N_steps, stepsize = 1)]\n",
    "print(f\"{with_yield.best / without_yield.best:.0f}x slowdown!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "132bf8d1-d341-494a-9881-689605a104b7",
   "metadata": {},
   "source": [
    "Fun fact: if you replace `yield current_state.copy()` with `yield current_state` your python kernel will crash when you run the code. I believe this is a bug in Numba that related to how pointers to numpy arrays work but let's not worry too much about it. \n",
    "\n",
    "We take a factor of two slowdown but that doesn't seem so much to pay for the fact we can now sample the state at every single step rather than just the last."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:recode]",
   "language": "python",
   "name": "conda-env-recode-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}