From 63523481e89ae8c8f74a900ae43b035e3312f9c8 Mon Sep 17 00:00:00 2001 From: Tom Hodson Date: Thu, 30 Jun 2022 11:57:05 +0100 Subject: [PATCH] Updates! --- docs/learning/01 Introduction.ipynb | 18 +- docs/learning/04 Testing.ipynb | 219 +++++++++++++----- .../08 Doing Reproducible Science.ipynb | 4 +- environment.yml | 19 ++ environment_from_history.yml | 14 ++ environment_full.yml | 94 ++++++++ 6 files changed, 304 insertions(+), 64 deletions(-) create mode 100644 environment.yml create mode 100644 environment_from_history.yml create mode 100644 environment_full.yml diff --git a/docs/learning/01 Introduction.ipynb b/docs/learning/01 Introduction.ipynb index 0afc7b7..6148bf5 100644 --- a/docs/learning/01 Introduction.ipynb +++ b/docs/learning/01 Introduction.ipynb @@ -21,13 +21,27 @@ "I would also suggest you setup a python environment just for this. You can use your preferred method to do this, but I will recomend `conda` because it's both what I currently use and what is recommeded by Imperial: LINK \n", "\n", "```bash\n", - "#make a new conda environment named recode, with python 3.9 and the packages in requirements.txt\n", - "conda env create --name recode python=3.9 --file requirements.txt\n", + "#make a new conda environment from the specification in environment.yml\n", + "conda env create --file environment.yml\n", "\n", "#activate the environment\n", "conda activate recode\n", "```\n", "\n", + "If you'd prefer to keep this environment nicely stored away in this repository, you can save it in a folder called env by running\n", + "\n", + "```bash\n", + "conda env create --prefix env --file environment.yml\n", + "conda activate ./env #you have to run this in the enclosing directory of course!\n", + "```\n", + "\n", + "Further Reading on how to set up an environment:\n", + "- [Software Carpentry](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/02-working-with-environments/index.html) More practical.\n", + "- [The Turing
Way](https://the-turing-way.netlify.app/reproducible-research/renv.html) Discusses why we use environments.\n", + "- [Essential Software Engineering for Researchers](https://imperialcollegelondon.github.io/grad_school_software_engineering_course/l1-01-tools-I/index.html) Quick overview.\n", + "\n", + "\n", + "\n", "## The Problem\n", "\n", "So without further ado lets talk about the problem we'll be working on, you don't necessaryily need to understand the full details of this to learn the important lessons but I will give a quick summary here. We want to simulate a physical model called the **Ising model**, which is famous in physics because it's about the simplest thing you can come up with that displays a phase transition, a special kind of shift between two different behaviours." diff --git a/docs/learning/04 Testing.ipynb b/docs/learning/04 Testing.ipynb index 07046c1..46b4079 100644 --- a/docs/learning/04 Testing.ipynb +++ b/docs/learning/04 Testing.ipynb @@ -18,7 +18,12 @@ "tags": [] }, "source": [ - "Ok we can finally start writing and running some tests! Check out the [pytest website](https://docs.pytest.org/en/7.1.x/getting-started.html) for a tutorial on how to write tests in pytest and head over to the [Turing Way](https://the-turing-way.netlify.app/reproducible-research/testing.html) for a great introduction to testing in general. 
\n", + "Further reading on Testing:\n", + "- [The official Pytest docs](https://docs.pytest.org/en/7.1.x/getting-started.html)\n", + "- [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/testing.html)\n", + "- [Essential Software Engineering for Researchers](https://imperialcollegelondon.github.io/grad_school_software_engineering_course/l2-01-testing_overview/index.html)\n", + "\n", + "Ok we can finally start writing and running some tests!\n", "\n", "I copied some of the initial tests that we did in chapter 1 into `test_energy.py` installed pytest into my development environment with `pip install pytest`. If you're using conda you need to use `conda install pytest` and now I can run the `pytest` command in the ReCoDE_MCFF directory. Pytest will automatically discover our tests and run them, to do this it relies on their being python files with functions named `test_\\*` which it will run.\n", "\n", @@ -134,10 +139,9 @@ "id": "d70a8934-a58d-4aa6-afca-35fee23bf851", "metadata": {}, "source": [ - "## Advanced Testing Methods: Hypothesis\n", + "## Advanced Testing Methods: Property Based Testing\n", "\n", - "\n", - "I won't do into huge detail here but I thought it would be nice to make you aware of a nice library called `Hypothesis` that helps with this problem of finding edge cases. `Hypothesis` gives you tools to generate randomised inputs to functions, so as long as you can come up with some way to verify the output is correct (or just that the code doens't throw and error!) then this can be a powerful method of testing. \n", + "I won't go into huge detail here, but I thought it would be nice to make you aware of a library called `Hypothesis` that helps with this problem of finding edge cases. `Hypothesis` gives you tools to generate randomised inputs to functions, so as long as you can come up with some way to verify the output is correct or has the correct _properties_ (or just that the code doesn't throw an error!)
then this can be a powerful method of testing. \n", "\n", "\n", "Take a look in `test_energy_using_hypothesis.py`\n", @@ -153,13 +157,159 @@ "You tell Hypothesis how to generate the test data, in this case we use some numpy specifc code to generate 2 dimensional arrays with `dtype = int` and entries randomly sampled from `[1, -1]`. We use the same trick as before of checking two implementations against one another." ] }, + { + "cell_type": "markdown", + "id": "128cc7f1-8c9c-44f9-8c6a-ec16acb6fc68", + "metadata": { + "tags": [] + }, + "source": [ + "## Testing Stochastic Code\n", + "\n", + "We have an interesting problem here: most testing assumes that for the same inputs we will always get the same outputs, but our MCMC sampler is a stochastic algorithm. So how can we test it? I can see three main routes we can take:\n", + "\n", + "- Fix the seed of the random number generator to make it deterministic\n", + "- Do statistical tests on the output \n", + "- Use property based testing (see above)\n", + "\n", + "### Fixed Seeds\n", + "The random number generators we typically use are really pseudo-random number generators: given a value called a seed they generate a deterministic pattern that looks for most purposes like a random sequence. Typically the seed is determined by something that is _more random_ such as a physical random number generator. However, if we fix the seed we can create reproducible plots and test our code more easily!"
+ ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "c28d257e-5466-4cf0-9381-46f1cbdeaf8e", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([ 0.55326059, 0.21760061, -0.05798999, -2.31893609, 0.43149417,\n", + " -2.12627978, 0.90992122, 0.60596557, 0.83005665, 0.82769834]),\n", + " array([-0.57820285, -0.65570117, 1.60871517, -0.83666294, 2.03363763,\n", + " 0.44904314, 0.31099544, -0.85810422, -0.87923828, 0.96426779]))" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "seed = [\n", + " 2937053738,\n", + " 1783364611,\n", + " 3145507090,\n", + "] # generated once with rng.integers(2**63, size = 3) and then saved\n", + "\n", + "# New Style\n", + "# numba doesn't yet support this so I haven't used it in our code\n", + "# but if you aren't using numba then you should get used to this new style\n", + "from numpy.random import default_rng\n", + "\n", + "rng = default_rng(seed=23)\n", + "vals = rng.standard_normal(10)\n", + "\n", + "# Old style\n", + "from numpy import random\n", + "\n", + "random.seed(seed)\n", + "vals2 = random.standard_normal(10)\n", + "\n", + "vals, vals2 # note that the two styles do not give the same results" + ] + }, + { + "cell_type": "markdown", + "id": "fb281250-0f08-43a8-bcb2-4b9e2c262cd9", + "metadata": {}, + "source": [ + "However, this has a major drawback: for the output to stay the same we must always generate the same random numbers in the same order and use them in the same way. This is a problem because we might want to change our MCMC sampler in a way that changes how it calls the rng but still want to compare it to the previous version.
In this case we have to use statistical tests instead.\n", + "\n", + "### Statistical Tests\n", + "If we want to verify that two different implementations of our algorithm agree or that the output matches our expectations, we can use something like a t-test to check our samples. Now this gets complicated very fast but bear with me for this simple example:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "64cd4fd1-2a22-41ec-820f-5f5cac4a68b3", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'MCFF'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", + "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mMCFF\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmcmc\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m mcmc_generator\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m### The measurement we will make ###\u001b[39;00m\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21maverage_color\u001b[39m(state):\n", + "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'MCFF'" + ] + } + ], + "source": [ + "from MCFF.mcmc import mcmc_generator\n", + "\n", + "### The measurement we will make ###\n", + "def average_color(state):\n", + " return np.mean(state)\n", + "\n", + "\n", + "### Simulation Inputs ###\n", + "N = 10 # Use an NxN system\n", + "T = 1000 # What temperatures to use\n", + "steps = 200 # How many times to sample the state\n", + "stepsize = N**2 # How many individual Monte Carlo flips to do in between each sample\n", + "initial_state = np.ones(shape=(N, N)) # the initial state to use\n", + "\n", + "### Simulation Code ###\n", + "average_color_data = np.array(\n", +
" [\n", + " average_color(s)\n", + " for s in mcmc_generator(initial_state, steps=steps, stepsize=stepsize, T=T)\n", + " ]\n", + ")\n", + "\n", + "\n", + "from scipy import stats\n", + "\n", + "stats.ttest_1samp(average_color_data, 0).pvalue < 1 / 50" + ] + }, + { + "cell_type": "markdown", + "id": "041557a7-1965-4bea-9e61-4d0d6df335e8", + "metadata": {}, + "source": [ + "## Test Driven Development\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b729c67-246d-4620-8e00-f355432c28a7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "id": "91fa6842-f214-4fd3-bbe5-0f20d7d0d2cc", "metadata": {}, "source": [ "## Autoformaters\n", - "Resources: [](https://the-turing-way.netlify.app/reproducible-research/code-quality/code-quality-style.html)\n", + "Further reading on the topic of autoformatters:\n", + "- [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/code-quality/code-quality-style.html)\n", + "- [Essential Software Engineering for Researchers](https://imperialcollegelondon.github.io/grad_school_software_engineering_course/l1-02-tools-II/index.html)\n", "\n", "While we're doing things that will help keep our code clean and tidy in the future, I would recommend installing a code formatter like `black`. This is a program that enforces a particular formatting style on your code by simply doing it for you. At first this sounds a bit weird, but it has a few benefits:\n", "\n", @@ -216,66 +366,13 @@ "source": [ "Finally, be aware that if you try to commit code with incorrect syntax then black will just error and prevent it, this is probably a good thing but there may be the occasional time where that's a problem." 
] - }, - { - "cell_type": "markdown", - "id": "0ba4802e-40e9-4cd5-8877-7a51ad2224b1", - "metadata": {}, - "source": [ - "## Testing Stochastic Code\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "e255b929-3ae7-4584-a2d4-5386d443a4af", - "metadata": { - "tags": [] - }, - "source": [ - "...\n", - "\n", - "We can use a t-test to check a sample from the MCMC sampler matches our expectations. " - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "64cd4fd1-2a22-41ec-820f-5f5cac4a68b3", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# observations = average_color_data.mean(axis = -1)\n", - "\n", - "# from scipy import stats\n", - "# stats.ttest_1samp(observations[0], 0).pvalue < 1 / 50" - ] - }, - { - "cell_type": "markdown", - "id": "0a38999f-e4c9-4b59-913c-35eb7fbffb4c", - "metadata": {}, - "source": [ - "## Test Driven Development\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1b729c67-246d-4620-8e00-f355432c28a7", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python [conda env:jupyter3.9] *", + "display_name": "Python [conda env:recode]", "language": "python", - "name": "conda-env-jupyter3.9-py" + "name": "conda-env-recode-py" }, "language_info": { "codemirror_mode": { @@ -287,7 +384,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.7" + "version": "3.9.12" } }, "nbformat": 4, diff --git a/docs/learning/08 Doing Reproducible Science.ipynb b/docs/learning/08 Doing Reproducible Science.ipynb index 3b77c47..b02793d 100644 --- a/docs/learning/08 Doing Reproducible Science.ipynb +++ b/docs/learning/08 Doing Reproducible Science.ipynb @@ -14,7 +14,9 @@ "id": "94c1cebd-8bb4-4f19-bdb3-c503cd620d22", "metadata": {}, "source": [ - "# Doing Reproducible Science" + "# Doing Reproducible Science\n", + "Further reading on software reproducibility:\n", + "- []()\n" ] } ], diff --git
a/environment.yml b/environment.yml new file mode 100644 index 0000000..59472be --- /dev/null +++ b/environment.yml @@ -0,0 +1,19 @@ +name: recode + +channels: + - defaults + - conda-forge + +dependencies: + - python=3.9 + - pytest=7.1 + - pytest-cov=3.0 + - ipykernel=6.9 + - numpy=1.21 + - scipy=1.7 + - matplotlib=3.5 + - numba=0.55 + - pre-commit + - pip=21.2 + - pip: + - --editable . #install MCFF from the local repository using pip and do it in editable mode diff --git a/environment_from_history.yml b/environment_from_history.yml new file mode 100644 index 0000000..cfbe90a --- /dev/null +++ b/environment_from_history.yml @@ -0,0 +1,14 @@ +name: recode + +channels: + - defaults + +dependencies: + - python=3.9 + - pytest + - pytest-cov + - ipykernel + - numpy + - scipy + - matplotlib + - numba diff --git a/environment_full.yml b/environment_full.yml new file mode 100644 index 0000000..663d5e9 --- /dev/null +++ b/environment_full.yml @@ -0,0 +1,94 @@ +name: recode +channels: + - defaults +dependencies: + - appnope=0.1.2=py39hecd8cb5_1001 + - asttokens=2.0.5=pyhd3eb1b0_0 + - attrs=21.4.0=pyhd3eb1b0_0 + - backcall=0.2.0=pyhd3eb1b0_0 + - blas=1.0=mkl + - brotli=1.0.9=hb1e8313_2 + - ca-certificates=2022.4.26=hecd8cb5_0 + - certifi=2022.6.15=py39hecd8cb5_0 + - coverage=6.3.2=py39hca72f7f_0 + - cycler=0.11.0=pyhd3eb1b0_0 + - debugpy=1.5.1=py39he9d5cce_0 + - decorator=5.1.1=pyhd3eb1b0_0 + - entrypoints=0.4=py39hecd8cb5_0 + - executing=0.8.3=pyhd3eb1b0_0 + - fonttools=4.25.0=pyhd3eb1b0_0 + - freetype=2.11.0=hd8bbffd_0 + - giflib=5.2.1=haf1e3a3_0 + - iniconfig=1.1.1=pyhd3eb1b0_0 + - intel-openmp=2021.4.0=hecd8cb5_3538 + - ipykernel=6.9.1=py39hecd8cb5_0 + - ipython=8.3.0=py39hecd8cb5_0 + - jedi=0.18.1=py39hecd8cb5_1 + - jpeg=9e=hca72f7f_0 + - jupyter_client=7.2.2=py39hecd8cb5_0 + - jupyter_core=4.10.0=py39hecd8cb5_0 + - kiwisolver=1.4.2=py39he9d5cce_0 + - lcms2=2.12=hf1fd2bf_0 + - libcxx=12.0.0=h2f01273_0 + - libffi=3.3=hb1e8313_2 + - libgfortran=3.0.1=h93005f0_2 + - 
libllvm11=11.1.0=h46f1229_1 + - libpng=1.6.37=ha441bb4_0 + - libsodium=1.0.18=h1de35cc_0 + - libtiff=4.2.0=hdb42f99_1 + - libwebp=1.2.2=h56c3ce4_0 + - libwebp-base=1.2.2=hca72f7f_0 + - llvm-openmp=12.0.0=h0dcd299_1 + - llvmlite=0.38.0=py39h8346a28_0 + - lz4-c=1.9.3=h23ab428_1 + - matplotlib=3.5.1=py39hecd8cb5_1 + - matplotlib-base=3.5.1=py39hfb0c5b7_1 + - matplotlib-inline=0.1.2=pyhd3eb1b0_2 + - mkl=2021.4.0=hecd8cb5_637 + - mkl-service=2.4.0=py39h9ed2024_0 + - mkl_fft=1.3.1=py39h4ab4a9b_0 + - mkl_random=1.2.2=py39hb2f4e1b_0 + - munkres=1.1.4=py_0 + - ncurses=6.3=hca72f7f_2 + - nest-asyncio=1.5.5=py39hecd8cb5_0 + - numba=0.55.1=py39hae1ba45_0 + - numpy=1.21.5=py39h2e5f0a9_3 + - numpy-base=1.21.5=py39h3b1a694_3 + - openssl=1.1.1p=hca72f7f_0 + - packaging=21.3=pyhd3eb1b0_0 + - parso=0.8.3=pyhd3eb1b0_0 + - pexpect=4.8.0=pyhd3eb1b0_3 + - pickleshare=0.7.5=pyhd3eb1b0_1003 + - pillow=9.0.1=py39hde71d04_0 + - pip=21.2.4=py39hecd8cb5_0 + - pluggy=1.0.0=py39hecd8cb5_1 + - prompt-toolkit=3.0.20=pyhd3eb1b0_0 + - ptyprocess=0.7.0=pyhd3eb1b0_2 + - pure_eval=0.2.2=pyhd3eb1b0_0 + - py=1.11.0=pyhd3eb1b0_0 + - pygments=2.11.2=pyhd3eb1b0_0 + - pyparsing=3.0.4=pyhd3eb1b0_0 + - pytest=7.1.2=py39hecd8cb5_0 + - pytest-cov=3.0.0=pyhd3eb1b0_0 + - python=3.9.12=hdfd78df_1 + - python-dateutil=2.8.2=pyhd3eb1b0_0 + - pyzmq=22.3.0=py39he9d5cce_2 + - readline=8.1.2=hca72f7f_1 + - scipy=1.7.3=py39h8c7af03_0 + - setuptools=61.2.0=py39hecd8cb5_0 + - six=1.16.0=pyhd3eb1b0_1 + - sqlite=3.38.5=h707629a_0 + - stack_data=0.2.0=pyhd3eb1b0_0 + - tbb=2021.5.0=haf03e11_0 + - tk=8.6.12=h5d9f67b_0 + - toml=0.10.2=pyhd3eb1b0_0 + - tomli=1.2.2=pyhd3eb1b0_0 + - tornado=6.1=py39h9ed2024_0 + - traitlets=5.1.1=pyhd3eb1b0_0 + - tzdata=2022a=hda174b7_0 + - wcwidth=0.2.5=pyhd3eb1b0_0 + - wheel=0.37.1=pyhd3eb1b0_0 + - xz=5.2.5=hca72f7f_1 + - zeromq=4.3.4=h23ab428_0 + - zlib=1.2.12=h4dc903c_2 + - zstd=1.5.2=hcb37349_0