Split a lot of notebooks

2025-06-26 08:51:16 +02:00 · 2022-06-16 14:00:28 +02:00 · 2022-06-16 14:00:28 +02:00 · 1fc68aa565
commit 1fc68aa565
parent a772b43fa7
9 changed files with 1185 additions and 589 deletions
--- a/Introduction.ipynb
+++ b/Introduction.ipynb
--- a/codebase.ipynb
+++ b/codebase.ipynb
--- a/learning/02
+++ b/learning/02
@@ -11,7 +11,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
   "id": "6e51fe6c-a8b8-48ed-9e7f-70e18945e597",
   "metadata": {},
   "outputs": [],
@@ -31,12 +31,9 @@
   "id": "5e745b48-9cde-49d1-b91b-b04282f6d30d",
   "metadata": {},
   "source": [
-    "Before we proceed I want to get some tests running. We'll use these to test the correctness of our code and also to check that we don't break anything that previously worked as we make changes and improvments. Checking you haven't broken something is called a regression test.\n",
+    "# Testing\n",
    "\n",
-    "Table of Contents:\n",
-    "- Setting up the directory structure of the project\n",
-    "- Creating a python package\n",
-    "- Testing!"
+    "Before we proceed with writing any more code I want to put what we already have in a Python file and make it into an installable module. This will be useful both for importing code into these notebooks and for testing later."
   ]
  },
  {
@@ -44,7 +41,7 @@
   "id": "70c12053-ab4c-4a2b-a31a-16f01776419f",
   "metadata": {},
   "source": [
-    "# Directory Structure\n",
+    "## Directory Structure\n",
    "More info:\n",
    "- [General Python Packaging advice](https://packaging.python.org/en/latest/tutorials/packaging-projects/)\n",
    "- [Packaging for pytest](https://docs.pytest.org/en/6.2.x/goodpractices.html)\n",
@@ -92,7 +89,7 @@
    "\n",
    "```python\n",
    "from MCFF.ising_model import all_up_state, all_down_state, random_state\n",
-    "from MCFF import mcmc\n",
+    "from MCFF import mcmc  # once we've written this, that is!\n",
    "```\n",
    "\n",
    "`pyproject.toml` and `setup.cfg` are the current way to describe the metadata about a Python package, such as how it should be installed and who the author is, but typically you just copy the standard layouts and build from there. The empty `__init__.py` file flags that this folder is a Python module.\n",
@@ -147,6 +144,14 @@
    "```\n",
    "The dot means we should install MCFF from the current directory and `--editable` means to do it as an editable package so that we can edit the files in MCFF and not have to reinstall. This is really useful for development."
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3db75fb8-8229-41f4-8004-463b832cd4a4",
+   "metadata": {},
+   "source": [
+    "You can see what the files look like at this point at [this commit][add link here]. In the next notebook, we will finally write the Markov Chain Monte Carlo function!"
+   ]
  }
 ],
 "metadata": {
--- a/sampler.ipynb
+++ b/sampler.ipynb
--- a/Testing.ipynb
+++ b/Testing.ipynb
@@ -29,7 +29,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 1,
   "id": "d4082e07-c51f-46ba-9a5e-bf45c2c319ba",
   "metadata": {},
   "outputs": [
@@ -159,6 +159,7 @@
   "metadata": {},
   "source": [
    "## Autoformatters\n",
+    "Resources: [The Turing Way guide to code style](https://the-turing-way.netlify.app/reproducible-research/code-quality/code-quality-style.html)\n",
    "\n",
    "While we're doing things that will help keep our code clean and tidy in the future, I would recommend installing a code formatter like `black`. This is a program that enforces a particular formatting style on your code by simply doing it for you. At first this sounds a bit weird, but it has a few benefits:\n",
    "\n",
@@ -208,6 +209,14 @@
    "    return \" \".join([a, b, c / 2.0 + 3])"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "68cccdcc-b82e-4dec-bfc1-54072db8d762",
+   "metadata": {},
+   "source": [
+    "Finally, be aware that if you try to commit code with incorrect syntax then black will just error and prevent the commit. This is probably a good thing, but there may be the occasional time where that's a problem."
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "0ba4802e-40e9-4cd5-8877-7a51ad2224b1",
@@ -217,10 +226,46 @@
    "\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "e255b929-3ae7-4584-a2d4-5386d443a4af",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "...\n",
+    "\n",
+    "We can use a t-test to check that a sample from the MCMC sampler matches our expectations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "64cd4fd1-2a22-41ec-820f-5f5cac4a68b3",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# observations = average_color_data.mean(axis = -1)\n",
+    "\n",
+    "# from scipy import stats\n",
+    "# stats.ttest_1samp(observations[0], 0).pvalue < 1 / 50"
+   ]
+  },
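The commented-out cell above hints at the check: run a one-sample t-test on the measurements and require that it does not reject the expected mean. Here is a minimal sketch of that idea using only the standard library. It swaps `scipy.stats.ttest_1samp` for a hand-computed t statistic compared against a normal-approximation critical value, which is reasonable for large samples. The function name `mean_is_consistent_with` and the synthetic data are assumptions for illustration, not part of MCFF.

```python
import math
import random
import statistics


def mean_is_consistent_with(observations, expected, alpha=1 / 50):
    """Return True if the sample mean is statistically compatible with `expected`.

    Hypothetical helper: computes a one-sample t statistic and compares it with
    a two-sided critical value taken from the normal distribution, which is a
    good approximation to the t distribution for large samples.
    """
    n = len(observations)
    sample_mean = statistics.fmean(observations)
    # Standard error of the mean; this is zero (and the division fails) if
    # every observation is identical.
    standard_error = statistics.stdev(observations) / math.sqrt(n)
    t_statistic = (sample_mean - expected) / standard_error
    critical_value = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return abs(t_statistic) < critical_value


# Synthetic stand-in for measurements taken from an MCMC chain.
rng = random.Random(42)
observations = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(mean_is_consistent_with(observations, expected=0.0))
```

Keep in mind that a t-test assumes independent observations, while consecutive MCMC samples are correlated, so a check like this is more trustworthy when the samples are spaced well apart in the chain.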
+  {
+   "cell_type": "markdown",
+   "id": "0a38999f-e4c9-4b59-913c-35eb7fbffb4c",
+   "metadata": {},
+   "source": [
+    "## Test Driven Development\n",
+    "\n"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "38ce37fc-fd99-478b-9d34-6726af280bf0",
+   "id": "1b729c67-246d-4620-8e00-f355432c28a7",
   "metadata": {},
   "outputs": [],
   "source": []
--- a/functionality.ipynb
+++ b/functionality.ipynb
@@ -0,0 +1,243 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "5ac56056-ca33-4f13-8e36-564b94144c1e",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "<h1 align=\"center\">Markov Chain Monte Carlo for fun and profit</h1>\n",
+    "<h1 align=\"center\"> 🎲 ⛓️ 👉 🧪 </h1>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "eb5d773e-4cc0-48ae-bb71-7ece7ab5f936",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from numba import jit\n",
+    "\n",
+    "# This loads some custom styles for matplotlib\n",
+    "import json, matplotlib\n",
+    "\n",
+    "with open(\"assets/matplotlibrc.json\") as f:\n",
+    "    matplotlib.rcParams.update(json.load(f))\n",
+    "\n",
+    "np.random.seed(\n",
+    "    42\n",
+    ")  # This makes our random numbers reproducible when the notebook is rerun in order"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "486f066c-f027-44e8-8937-8636a52f32fb",
+   "metadata": {},
+   "source": [
+    "## Functionality\n",
+    "\n",
+    "The main thing we want to be able to do is to take measurements. The code as I have written it doesn't really allow that because it only returns the final state in the chain. Let's say we have a measurement called `average_color(state)` that we want to average over the whole chain. We could just stick that inside our definition of `mcmc`, but we know that we will likely make other measurements too and we don't want to keep writing new versions of our core functionality!\n",
+    "\n",
+    "## Exercise 1\n",
+    "Have a think about how you would implement this and what options you have."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c28b0a86-28f8-426f-9013-70e962f02256",
+   "metadata": {},
+   "source": [
+    "## Solution 1\n",
+    "So I chatted with my mentors on this project on how to best do this and we came up with a few ideas:\n",
+    "\n",
+    "### Option 1: Just save all the states and return them\n",
+    "\n",
+    "The problem with this is that the states are big and we don't want to waste all that memory. For a 50x50 state that uses 8 bit integers (the smallest we can use in numpy) 1000 samples would already use 2.5MB of memory, and the cost grows with both the system size and the number of samples. We will see later that we'd really like to be able to go a bit bigger than 50x50 and 1000 samples!\n",
+    "\n",
+    "### Option 2: Pass in a function to make measurements\n",
+    "```python\n",
+    "\n",
+    "def mcmc(initial_state, steps, T, measurement, energy=energy):\n",
+    "    ...\n",
+    "\n",
+    "    current_state = initial_state.copy()\n",
+    "    E = N**2 * energy(current_state)\n",
+    "    for i in range(steps):\n",
+    "        measurements[i] = measurement(current_state)\n",
+    "        ...\n",
+    "\n",
+    "    return measurements\n",
+    "```\n",
+    "\n",
+    "This could work but it limits how we can store measurements and what shape and type they can be. What if we want to store our measurements in a numpy array? Or what if your measurement itself is a vector or an object that can't easily be stored in a numpy array? We would have to think carefully about what functionality we want."
+   ]
+  },
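The memory cost of Option 1 is easy to estimate: one byte per site, times the number of sites, times the number of stored samples. A quick sketch of that arithmetic (the helper name `stored_states_bytes` is made up for this illustration):

```python
def stored_states_bytes(n, samples, bytes_per_site=1):
    """Memory needed to keep `samples` copies of an n x n state."""
    return n * n * bytes_per_site * samples


print(stored_states_bytes(50, 1000))      # 2500000 bytes, i.e. 2.5 MB
print(stored_states_bytes(1000, 10_000))  # 10000000000 bytes, i.e. 10 GB
```

The cost scales with the square of the linear system size, so storing every state stops being practical fairly quickly once the lattice and the chain both grow.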
+  {
+   "cell_type": "markdown",
+   "id": "c7c9575f-2450-4298-a507-90f0c1b9b284",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Option 3: Use Inheritance\n",
+    "```python\n",
+    "# This class would define the basic functionality of performing MCMC\n",
+    "class MCMCSampler(object):\n",
+    "    def run(self, initial_state, steps, T):\n",
+    "        ...\n",
+    "        for i in range(steps):\n",
+    "            self.measurement(state)\n",
+    "\n",
+    "       \n",
+    "# This class would inherit from it and just implement the measurement\n",
+    "class AverageColorSampler(MCMCSampler):\n",
+    "    measurements = np.zeros(10)\n",
+    "    index = 0\n",
+    "    \n",
+    "    def measurement(self, state):\n",
+    "        self.measurements[self.index] = some_function(state)\n",
+    "        self.index += 1\n",
+    "        \n",
+    "color_sampler = AverageColorSampler(...)\n",
+    "measurements = color_sampler.run(...)\n",
+    "```\n",
+    "\n",
+    "This would definitely work but I personally am not a huge fan of object oriented programming so I'm gonna skip this option!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7d05d25d-c9ba-406d-9977-0ca4aeb430a7",
+   "metadata": {},
+   "source": [
+    "### Option 4: Use a generator\n",
+    "This is the approach I ended up settling on: we will use a [Python generator function](https://peps.python.org/pep-0255/). While you may not have come across generator functions before, you almost certainly will have come across things that behave like them: `range(n)` is a lazy iterable that produces its values on demand, and `(i for i in [1,2,3])` is a generator expression. Generator functions are a way to build your own generators. By way of example, here is range implemented as a generator function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "5b17c054-230f-4188-98a6-51fd8fe5b437",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(<generator object my_range at 0x7fab3acae4a0>, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "def my_range(n):\n",
+    "    \"Behaves like the builtin range function of one argument\"\n",
+    "    i = 0\n",
+    "    while i < n:\n",
+    "        yield i  # sends i out to whatever function called us\n",
+    "        i += 1\n",
+    "    return  # lets Python know that we have nothing else to give\n",
+    "\n",
+    "\n",
+    "my_range(10), list(my_range(10))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b74fadbe-80c2-4a20-b651-0e47188b005a",
+   "metadata": {},
+   "source": [
+    "This requires only a very small change to our mcmc function and suddenly we can do whatever we like with the states! While we're at it I'm going to add an additional argument `stepsize` that allows us to only sample the state every `stepsize` MCMC steps. You'll see why we would want to set this to a value greater than 1 in a moment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "f73d6335-6514-45b1-9128-d72122d8b0b7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "19.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
+      "10.8 ms ± 869 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
+      "0x slowdown!\n"
+     ]
+    }
+   ],
+   "source": [
+    "from MCFF.ising_model import energy, all_up_state\n",
+    "from MCFF.mcmc import mcmc\n",
+    "\n",
+    "\n",
+    "@jit(nopython=True, nogil=True)\n",
+    "def mcmc_generator(initial_state, steps, T, stepsize=1000, energy=energy):\n",
+    "    N, M = initial_state.shape\n",
+    "    assert N == M\n",
+    "\n",
+    "    current_state = initial_state.copy()\n",
+    "    E = N**2 * energy(current_state)\n",
+    "    for _ in range(steps):\n",
+    "        for _ in range(stepsize):\n",
+    "            i, j = np.random.randint(N), np.random.randint(N)\n",
+    "\n",
+    "            # modify the state a little, here we just flip a random pixel\n",
+    "            current_state[i, j] *= -1\n",
+    "            new_E = N**2 * energy(current_state)\n",
+    "\n",
+    "            if (new_E < E) or np.exp(-(new_E - E) / T) > np.random.random():\n",
+    "                E = new_E\n",
+    "            else:\n",
+    "                current_state[i, j] *= -1  # reject the change we made\n",
+    "        yield current_state.copy()\n",
+    "    return\n",
+    "\n",
+    "\n",
+    "N_steps = 1000\n",
+    "stepsize = 1\n",
+    "initial_state = all_up_state(20)\n",
+    "without_yield = %timeit -o mcmc(initial_state, steps = N_steps, T = 5)\n",
+    "with_yield = %timeit -o [np.mean(s) for s in mcmc_generator(initial_state, T = 5, steps = N_steps, stepsize = stepsize)]\n",
+    "print(f\"{with_yield.best / without_yield.best:.0f}x slowdown!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "132bf8d1-d341-494a-9881-689605a104b7",
+   "metadata": {},
+   "source": [
+    "Fun fact: if you replace `yield current_state.copy()` with `yield current_state` your python kernel will crash when you run the code. I believe this is a bug in Numba that is related to how pointers to numpy arrays work, but let's not worry too much about it.\n",
+    "\n",
+    "We take a factor of two slowdown but that doesn't seem so much to pay for the fact we can now sample the state at every single step rather than just the last."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [conda env:recode]",
+   "language": "python",
+   "name": "conda-env-recode-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
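To make the benefit of the generator approach concrete, here is a minimal sketch of how such a generator can be consumed. It is pure Python so that it runs without numba or MCFF installed. `toy_chain` is a hypothetical stand-in for `mcmc_generator` that accepts every flip (no energy function, no Metropolis acceptance), and `average_color` is assumed here to be the mean spin value.

```python
import random
from itertools import islice


def toy_chain(initial_state, steps, stepsize=1, seed=0):
    """Hypothetical stand-in for `mcmc_generator`.

    Flips `stepsize` randomly chosen spins between samples and accepts every
    flip, so it has no energy function or Metropolis acceptance step.
    """
    rng = random.Random(seed)
    state = list(initial_state)  # work on a copy of the caller's state
    for _ in range(steps):
        for _ in range(stepsize):
            i = rng.randrange(len(state))
            state[i] *= -1
        yield list(state)  # yield a copy so later flips don't alter it


def average_color(state):
    """Mean spin value of a state whose entries are +1 or -1."""
    return sum(state) / len(state)


initial_state = [1] * 16
# The caller chooses what to measure and how to store it.
colors = [average_color(s) for s in toy_chain(initial_state, steps=5, stepsize=3)]
# Generators are lazy, so we can also stop early without running the full chain.
first_two = list(islice(toy_chain(initial_state, steps=1000), 2))
print(len(colors), len(first_two))  # 5 2
```

The sampling loop never needs to know which measurement is being made or how the results are stored. That decision stays with the calling code, which is the flexibility that Options 2 and 3 struggled to provide.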
--- a/learning/06
+++ b/learning/06
--- a/outputs.ipynb
+++ b/outputs.ipynb
--- a/science.ipynb
+++ b/science.ipynb
@@ -0,0 +1,33 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "94c1cebd-8bb4-4f19-bdb3-c503cd620d22",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [conda env:recode]",
+   "language": "python",
+   "name": "conda-env-recode-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}