Forward Scattering - The Weblog of Nicholas Chapman

Detecting reads from uninitialised heap memory in C++ programs at runtime
Posted 26 Aug 2025

Reading from uninitialised heap memory looks like this:

    unsigned int* v = (unsigned int*)malloc(sizeof(unsigned int) * 100);
    if(v[0] > 1)
        printf("boo");

These reads can be pretty hard bugs to detect in a lot of cases. Let's look at some tools that can help us find them.

First off, most malloc implementations in debug mode will fill the memory with a special byte pattern like 0xCDCDCDCD. Unfortunately this doesn't directly help you detect reads of such patterns.

The Microsoft Visual C++ compiler has an option /RTCu which checks for use of uninitialised variables at runtime. See 'Run-time error checks'. It only seems to work for stack variables though, and issues no runtime error for the code above.

There is also Address Sanitizer, which is built into Visual Studio these days (/fsanitize=address). That also fails to detect the bug, despite being able to detect a whole lot of other stuff (see https://github.com/google/sanitizers/wiki/addresssanitizer for a list of capabilities).

It looks like Memory Sanitizer would be able to detect this bug. Unfortunately it is not built into MSVC, and apparently requires the entire program, and all libraries linked into it, to be built with Memory Sanitizer as well.

So that leaves us with good old Valgrind, which has been around for more than 20 years! The nice thing about Valgrind is that it doesn't require compiling your program in any special way: just run your program from the command line in the usual way, prefixed by the 'valgrind' command. The downsides are that it doesn't work on Windows, and that it greatly slows down the execution speed of the program (by something like 20x). But it *will* find your read-from-uninitialised-heap-memory bug! Here's what the Valgrind console output looks like:

    ==216649== Conditional jump or move depends on uninitialised value(s)
    ==216649==    at 0xA51E40: TestSuite::test() (TestSuite.cpp:133)
    ==216649==    by 0x63E638: main (MainWindow.cpp:4369)

Wonderful! What Valgrind is telling us is that the program used some uninitialised data to decide whether to do a jump or not - e.g. whether the if() condition was true and the printf should be executed.

In conclusion, to detect reads from uninitialised heap memory, you will need to use either Memory Sanitizer or break out the old workhorse Valgrind.
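Once a tool has pointed at the offending allocation, the fix is usually to make the initial contents defined. A minimal sketch of two common options:

    // Option 1: calloc zeroes the returned memory.
    unsigned int* v = (unsigned int*)calloc(100, sizeof(unsigned int));

    // Option 2: value-initialisation - the trailing () zeroes the array elements.
    unsigned int* v2 = new unsigned int[100]();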
Fast acos approximation
Posted 5 Jun 2025

Continuing with the theme of fast function approximations (see the previous post on Fresnel curve approximations), using my custom program search algorithm, I have come up with a fast approximate acos (arccosine) function.

    double entropy = 0;
    for(const auto& it : counts) // counts: number of occurrences of each distinct value in filtered_data
    {
        const double p = (double)it.second / (double)filtered_data.size();
        entropy += -p * std::log2(p);
    }
    return entropy;

For my data this gives:

    Entropy: 9.115353669970485 bits
    Theoretical min compressed size: 1194767.6362303714 B

So the theoretical minimum is ~1.2 MB.
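These two figures are consistent with a data set of \(2^{20}\) values (an assumption, but one the numbers themselves imply):

$$ 9.115353669970485 \text{ bits} \times 2^{20} / 8 \approx 1194767.636 \text{ B} $$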
Zstd is only achieving ~1.4 MB, which is nowhere near the theoretical limit. This was quite disappointing to me, as I like zstd and use it a lot, and I know that zstd has some fancy ANS entropy encoding. The reason for this failure was explained to me by Conor Stokes as being due to the Lempel-Ziv match stage of zstd compression, which is not well suited to data that just needs entropy coding - apparently this stage introduces a lot of overhead. Interestingly, 7-zip compresses the data much better than zstd, reaching a size of 1,298,590 B.

So zstd is not the cure-all of data compression that I had hoped :) I would also be interested in a good C++/C entropy encoding library, if anyone knows of one. Edit: Comments on Hacker News here.

Introducing Substrata
Posted 25 May 2021

Substrata is my latest virtual world / metaverse. I've written a post about my vision / plans for it here: https://substrata.info/about_substrata

No warning for unused variable when declared const
Posted 12 Feb 2020

An issue I found with the Visual Studio 2019 C++ compiler: https://developercommunity.visualstudio.com/content/problem/785567/no-warning-for-unused-variable-when-declared-const.html

How to fix a bug
Posted 2 Oct 2019

Let's say you encounter a bug in the software you work on. Maybe a user informs you of the bug, or you find it yourself. Let's say this bug has been in your software for a little while, and is not a bug you just added in the last hour or so. This implies that your unit tests did not catch the bug. Here's the process I recommend for fixing the bug (a sketch of steps 3-5 follows the list):

1) Work out what the problem is.
2) Work out the solution, and edit your source files until the bug is fixed.
3) Since the bug was not caught by a unit test, make a unit test that will catch the bug.
4) Undo the bug fix in your source files, and make sure your new unit test catches the bug.
5) Redo the bug fix. Make sure the new unit test passes.
6) Commit the bug fix and the new unit test(s).
7) Think about how the bug got into your source code. Why wasn't it caught by unit tests? How can you improve your unit tests?
8) Think about whether there are likely to be any similar bugs in your source code. Can you add unit tests for those possible bugs?

This is not an exact prescription. The ordering of the steps is flexible, and in some cases some steps may be more aspirational than realistic.
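As an illustration of steps 3 to 5, a regression test might look like this (a minimal sketch - parseInt and the failing input are hypothetical stand-ins for whatever function the bug was in):

    #include <cassert>

    int parseInt(const char* s); // hypothetical function that contained the bug

    // Step 3: a unit test that catches the bug.
    // Step 4: with the fix undone, this assertion should fail.
    // Step 5: with the fix redone, it should pass.
    static void testParseIntLeadingPlus()
    {
        assert(parseInt("+42") == 42);
    }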
CryptoVoxels world rendered in Indigo
Posted 8 May 2019

I did some virtual wandering through the CryptoVoxels world (loaded into Indigo Renderer), taking some virtual snapshots with Indigo:

Realtime fractal program
Posted 24 Feb 2019

I open-sourced this fun little program: https://github.com/glaretechnologies/fractal It animates a Julia set in realtime using OpenGL shaders. It has some funky colouring techniques, so it looks a bit nicer than your average fractal.

Fast rotation of a vector by a quaternion
Posted 18 Jul 2018

There is a fast way to compute the rotation of a vector by a quaternion, apparently noted first by Fabian Giesen. See this blog post. Since the original description/derivation seems to have been lost in the sands of time (pages 404), I thought I would re-derive it. In the derivation below, I will follow the notation of Ken Shoemake in Quaternions.

Suppose we have a unit quaternion

$$q = [\mathbf{v}, w]$$

The rotation of a vector \(\mathbf{p}\) by the quaternion \(q\) is given by

$$ p' = q p q^* $$

where \(\mathbf{p}\) is treated as the quaternion

$$ [(p_x, p_y, p_z), 0] = [\mathbf{p}, 0] $$

i.e. as a quaternion with the w coordinate zero.

Expanding out \(q p q^*\) using the definition of quaternion multiplication gives

$$ p' = q p q^* = ([\mathbf{v}, w][\mathbf{p}, 0])[-\mathbf{v}, w] $$
$$ p' = [\mathbf{v} 0 + \mathbf{p}w + \mathbf{v} \times \mathbf{p}, \ w 0 - \mathbf{v}.\mathbf{p}][-\mathbf{v}, w] $$
$$ p' = [\mathbf{p}w + \mathbf{v} \times \mathbf{p}, \ - \mathbf{v}.\mathbf{p}] [-\mathbf{v}, w] $$
$$ p' = [(\mathbf{p}w + \mathbf{v} \times \mathbf{p})w + (-\mathbf{v})(- \mathbf{v}.\mathbf{p}) + ((\mathbf{p}w + \mathbf{v} \times \mathbf{p}) \times -\mathbf{v}), \ (- \mathbf{v}.\mathbf{p})w - (\mathbf{p}w + \mathbf{v} \times \mathbf{p}).(-\mathbf{v})] $$

Let's treat the w component of this first:

$$ p'_w = - (\mathbf{v}.\mathbf{p})w + (\mathbf{p}w).\mathbf{v} + (\mathbf{v} \times \mathbf{p}).\mathbf{v} $$
$$ p'_w = - w(\mathbf{v}.\mathbf{p}) + w(\mathbf{p}.\mathbf{v}) + (\mathbf{v} \times \mathbf{p}).\mathbf{v} $$

We know \(\mathbf{v} \times \mathbf{p}\) is orthogonal to \( \mathbf{v} \), so

$$(\mathbf{v} \times \mathbf{p}).\mathbf{v} = 0$$

Therefore

$$ p'_w = 0 $$

giving a resulting vector with zero w coordinate, as we expected. Continuing with the vector part of \(p'\):

$$ \mathbf{p'} = (\mathbf{p}w + \mathbf{v} \times \mathbf{p})w + (-\mathbf{v})(- \mathbf{v}.\mathbf{p}) + (\mathbf{p}w \times -\mathbf{v}) + ((\mathbf{v} \times \mathbf{p}) \times -\mathbf{v}) $$
$$ \mathbf{p'} = (\mathbf{p}w + \mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times \mathbf{p})w + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$
$$ \mathbf{p'} = \mathbf{p}w^2 + (\mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times \mathbf{p})w + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$
$$ \mathbf{p'} = \mathbf{p}w^2 + 2(\mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$

Remember that for a unit quaternion, \(\mathbf{v}\) is a (non-unit) vector around which the quaternion gives a rotation, in particular

$$||\mathbf{v}|| = \sin(\Omega) $$

where \( \Omega \) is half the angle of the quaternion rotation, e.g.

$$||\mathbf{v}|| = \sin(\Omega) = \sin(\theta/2)$$

So now we decompose the vector \( \mathbf{p} \) into vector projection and vector rejection (see Wikipedia) components with respect to the unit vector \( \hat{\mathbf{v}} = \mathbf{v}/||\mathbf{v}||\), giving

$$ \mathbf{p} = \hat{\mathbf{v}}(\hat{\mathbf{v}}.\mathbf{p}) + (-\hat{\mathbf{v}} \times (\hat{\mathbf{v}} \times \mathbf{p})) $$

Substituting in

$$\hat{\mathbf{v}} = \mathbf{v}/||\mathbf{v}|| = { \mathbf{v} \over \sin(\theta/2) } $$

gives

$$ \mathbf{p} = ({ \mathbf{v} \over \sin(\theta/2) })(({ \mathbf{v} \over \sin(\theta/2) }).\mathbf{p}) - (({ \mathbf{v} \over \sin(\theta/2) }) \times (({ \mathbf{v} \over \sin(\theta/2) }) \times \mathbf{p})) $$
$$ \mathbf{p} = { \mathbf{v}(\mathbf{v}.\mathbf{p}) \over \sin^2(\theta/2) } - { \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) \over \sin^2(\theta/2) } $$
$$ \mathbf{p} = { \mathbf{v}(\mathbf{v}.\mathbf{p}) - \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) \over \sin^2(\theta/2) } $$

or

$$ \mathbf{v}(\mathbf{v}.\mathbf{p}) = \sin^2(\theta/2)\mathbf{p} + \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) $$

Going back to our expression for \(\mathbf{p'}\):

$$ \mathbf{p'} = \mathbf{p}w^2 + 2(\mathbf{v} \times \mathbf{p})w + \mathbf{v}( \mathbf{v}.\mathbf{p}) + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$

Substituting in our expression for \(\mathbf{v}(\mathbf{v}.\mathbf{p})\) gives

$$ \mathbf{p'} = \mathbf{p}w^2 + 2(\mathbf{v} \times \mathbf{p})w + \sin^2(\theta/2)\mathbf{p} + \mathbf{v} \times (\mathbf{v} \times \mathbf{p}) + (\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$
$$ \mathbf{p'} = \mathbf{p}(w^2 + \sin^2(\theta/2)) + 2(\mathbf{v} \times \mathbf{p})w + 2(\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$

But for unit quaternions, the relationship between the w component and the rotation angle \(\theta\) is given by

$$w = \cos(\Omega) = \cos(\theta/2) $$

so

$$ \mathbf{p'} = \mathbf{p}(\cos^2(\theta/2) + \sin^2(\theta/2)) + 2(\mathbf{v} \times \mathbf{p})w + 2(\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$

And of course

$$ \cos^2(\theta/2) + \sin^2(\theta/2) = 1 $$

so

$$ \mathbf{p'} = \mathbf{p} + 2w(\mathbf{v} \times \mathbf{p}) + 2(\mathbf{v} \times (\mathbf{v} \times \mathbf{p})) $$

which is the simple, fast expression that we are after. You can also derive this from Rodrigues' rotation formula with some funky trig identities.
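In code, the final expression needs just two cross products and no quaternion multiplications (a minimal sketch, assuming a hypothetical Vec3 type with a cross() function and the usual arithmetic operators):

    // Rotate p by the unit quaternion q = [v, w], using
    // p' = p + 2w(v x p) + 2(v x (v x p)).
    Vec3 rotateByUnitQuaternion(const Vec3& v, float w, const Vec3& p)
    {
        const Vec3 t = cross(v, p);
        return p + t*(2.0f*w) + cross(v, t)*2.0f;
    }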
MSVC inliner and functions called just once
Posted 28 Nov 2017

I ran across this one today. Microsoft's Visual Studio C++ compiler (MSVC) doesn't inline functions marked as inline, even when the function is only called once (has exactly one call site).

A little background on inliners: the inliner is the part of the compiler that inlines function calls. For every function call site (function/procedure invocation) it needs to decide whether to inline the function there, replacing the function call with the function body. Inliners will often not inline a function if the function body is sufficiently large, since otherwise the code size will increase, resulting in slower compilation times, larger executables, and more pressure on the instruction cache. This is all well and good. However, there should be an exception: if a function has only one call site, then the function should generally be inlined, especially if the call site is not in conditionally executed code. Consider this program:

    #include <cmath>

    static inline float f(float x)
    {
        // Do some random computations, to make this function sufficiently complicated
        const float y = std::sin(x)*x + std::cos(x);
        float z = 0;
        for(int i=0; i<(int)x; ++i)
            z += std::pow(x, y);
        z += y;
        return z;
    }

    int main(int argc, char** argv)
    {
        return f(argc);
    }

The body of function f will be executed regardless of whether f is inlined. The only difference is whether the call instruction, function prologue and epilogue etc. are executed. Obviously we want to avoid that if possible, therefore f should be inlined here. The inliner should be smart enough to know that f is called exactly once, so there is no risk of code duplication. Note that f is marked as static, so the compiler doesn't have to worry about other translation units calling f.

Even with full optimisation and link-time code generation (/Ox and /LTCG), VS2015 does not inline the call to f. Clang of course does a better job and inlines the call: godbolt.org/g/njG8aL. VS2017 does inline f in the example above, but fails when f is made larger, so it looks like just the size threshold for inlining has changed.

This example is a toy program, but I ran across this issue in some performance-critical code I am writing, where a crucial function that has only a single call site is not being inlined. Here are the perf results from force-inlining this crucial function:

    crucial function non-inlined:
    Build took 3.0740 s
    Build took 3.0300 s

    crucial function force-inlined:
    Build took 2.8270 s
    Build took 2.8199 s

As you can see, there is a significant performance boost from inlining the function.

Workaround

You can (of course) force-inline a function with __forceinline. However it would be nice if we didn't need to rely on workarounds.
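Applied to the example above, the workaround is just a change of the function's declaration (a sketch - __forceinline is MSVC-specific):

    // __forceinline overrides MSVC's inlining cost heuristics (where inlining is possible at all).
    static __forceinline float f(float x)
    {
        const float y = std::sin(x)*x + std::cos(x);
        float z = 0;
        for(int i=0; i<(int)x; ++i)
            z += std::pow(x, y);
        z += y;
        return z;
    }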
Why does MSVC do this?

MSVC probably refuses to inline large functions that have just a single call site because it's possible that such functions are seldom-executed error-handling code, for example:

    res = doSomeCode();
    if(res == SOME_UNLIKELY_ERROR_CODE)
        doSomeComplicatedErrorHandlingCode();

In this case we don't want to inline doSomeComplicatedErrorHandlingCode, since it will just pollute the instruction cache with unexecuted code. It's also possible that the MSVC guys just never got around to this optimisation :) Regardless, MSVC could definitely use better heuristics to filter out this error-handling case from the other cases where it would be beneficial to inline (such as when the function is called unconditionally).

Conclusion

The MSVC inliner is missing an optimisation that can result in quite significant speedups: inlining a function when it is only called from one call site. Also, don't believe anyone when they say the compiler is always smarter than you when it comes to deciding when to inline. Issue filed with MS here, discussion on reddit here.

Poor Visual Studio code generation for reading from and writing to member variables
Posted 23 Jul 2017

I spend quite a lot of time trying to write fast code, but it is something of an uphill battle with Microsoft's Visual Studio C++ compiler. Here is an example of how MSVC struggles with member variable reads and writes in some cases.

Consider the following code. In the method writeToArray(), we write the value 'v' into the array pointed to by the member variable 'a'. The length of the array is stored in the member variable 'N'.

    struct TestPair { int first, second; };

    class CodeGenTestClass
    {
    public:
        CodeGenTestClass()
        {
            N = 1000000;
            a = new TestPair[N];
            v = 3;
        }

        __declspec(noinline) void writeToArray()
        {
            for(int i=0; i<N; ++i)
                a[i].first = v;
        }

        TestPair* a;
        int v;
        int N;
    };

Here is the disassembly of the inner loop of writeToArray(), with some comments added. Rcx here is storing the 'this' pointer.

    0000000140055360  mov rdx,qword ptr [rcx]        // Load this->a
    0000000140055363  lea r9,[r9+8]
    0000000140055367  mov eax,dword ptr [rcx+8]      // Load this->v into eax
    000000014005536A  inc r8d
    000000014005536D  mov dword ptr [r9+rdx-8],eax   // Store value in eax into array
    0000000140055372  cmp r8d,dword ptr [rcx+0Ch]    // Load this->N and compare with loop index
    0000000140055376  jl  js::CodeGenTestClass::writeToArray+10h (0140055360h)
    0000000140055378  ret

What you can see is that inside the loop, the values of 'a', 'v' and 'N' are repeatedly loaded from memory, which is wasteful. Let's compare with writeToArrayWithLocalVars(), which copies the member variables into local variables first:

    __declspec(noinline) void writeToArrayWithLocalVars()
    {
        TestPair* a_ = a;     // Load into local var
        int v_ = v;           // Load into local var
        const int N_ = N;     // Load into local var
        for(int i=0; i<N_; ++i)
            a_[i].first = v_;
    }

The inner loop it compiles to:

    .LBB1_4:   # =>This Inner Loop Header: Depth=1
        mov dword ptr [rdx + 8*rsi], ecx
        mov dword ptr [rdx + 8*rsi + 8], ecx
        mov dword ptr [rdx + 8*rsi + 16], ecx
        mov dword ptr [rdx + 8*rsi + 24], ecx
        mov dword ptr [rdx + 8*rsi + 32], ecx
        mov dword ptr [rdx + 8*rsi + 40], ecx
        mov dword ptr [rdx + 8*rsi + 48], ecx
        mov dword ptr [rdx + 8*rsi + 56], ecx
        add rsi, 8
        cmp rsi, r8
        jl  .LBB1_4

Now the values are held in registers, and the loop has been unrolled.

Why this happens

It's hard to say for sure without seeing the source code for MSVC, but I think it's probably a failure of alias analysis. Basically, a C++ compiler has to assume the worst; in particular it must assume that any pointer can be pointing at anything else in the program memory space, unless it can prove that this is not possible under the rules of the language (e.g. would be undefined behaviour). In this particular case we have two pointers in play - the 'this' pointer and the 'a' pointer - and since we have a write through the 'a' pointer, it looks like MSVC is unable to determine that 'a' does not point to 'v', or 'this', or 'N'.

To be able to prove that a write through 'a' does not overwrite the value in 'N' or 'v', MSVC needs to be able to do what is called alias analysis. I believe in this case it would be best done with type-based alias analysis (TBAA): since in C++ it is undefined behaviour to write through a pointer of one type (in this case TestPair*) and read through a pointer of another type (CodeGenTestClass* for the this pointer), the write through this->a cannot legally store a value that is then read back via this->v or this->N. Unfortunately MSVC's TBAA is either absent or not strong enough to work this out. (I may be wrong about the analysis pass required here - compiler experts, feel free to correct me!)

Moving the values into local variables, as in writeToArrayWithLocalVars(), allows the compiler to determine that they do not alias. (It can determine this quite simply by noting that the address of the local variables is never taken, therefore no aliasing pointers can point at them.) This allows the values to be placed in registers.

One thing to note is that MSVC can do the aliasing analysis and produce a fast loop when the 'a' array is of a simple type such as int instead of TestPair. (Edit: actually this is not the case, MSVC fails in this case also.)

Impact

This kind of code is pretty common in C++. I extracted this particular example from some hash table code I was writing. You will see this kind of problem with MSVC whenever you are writing to and reading from member variables (depending on the exact types etc.), so I would say this is a pretty serious performance/codegen problem for MSVC.

Edit: Comment thread on reddit. In this comment Gratian Lup clarifies that MSVC does not do TBAA.
Poor code generation / missed optimisations for placement-new object construction in Visual Studio 2015
Posted 19 Jul 2017

See the bug report.

Performance bugs - the dark matter of programming bugs
Posted 22 Mar 2017

There is a class of bugs lurking out there that I class as performance bugs. A performance bug is when the code computes the correct result, but runs slower than it should due to a programming mistake. The nefarious thing about performance bugs is that the user may never know they are there - the program appears to work correctly, carrying out the correct operations, showing the right thing on the screen or printing the right text. It just does it a bit more slowly than it should have. It takes an experienced programmer, with a reasonably accurate mental model of the problem and the correct solution, to know how fast the operation should have been performed, and hence whether the program is running slower than it should be.

I started documenting a few of the performance bugs I came across a few months ago. For example (on some platforms): the insert method of std::map is roughly 7 times slower than it should be, std::map::count() is about twice as slow as it should be, std::map::find() is 15% slower than it should be, and aligned malloc is a lot slower than it should be in VS2015.

Types of performance bugs

There are a few types of performance bugs I come across quite often (a sketch of the last kind follows this list):

* Doing the same work repeatedly and redundantly - it's pretty common to mistakenly perform the same operation multiple times when a single time would have sufficed, for example zeroing the same memory multiple times. This is surprisingly common once you start looking out for it.
* Performing unnecessary work - for example, in Qt, if you call dataChanged() on a single item in a data model, it updates the view for just that item. However, if you update 2 items, it updates the entire view rectangle, even if those items are right next to each other (e.g. two columns of the same row). See the bug report.
* Choosing a poor algorithm - for example an \(O(n^2)\) algorithm by accident. See Accidentally Quadratic.
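As a concrete illustration of that last category, here is a classic accidentally-quadratic pattern (a made-up example, not from any of the codebases mentioned above):

    #include <string>
    #include <vector>

    // Accidentally quadratic: erasing from the front of a vector is O(n),
    // so draining an n-element vector this way does ~n^2/2 element moves.
    std::string concatAndDrain(std::vector<std::string>& parts)
    {
        std::string result;
        while(!parts.empty())
        {
            result += parts.front();
            parts.erase(parts.begin());
        }
        return result;
    }

    // The linear version does the same job by just iterating over the elements:
    std::string concat(const std::vector<std::string>& parts)
    {
        std::string result;
        for(const auto& s : parts)
            result += s;
        return result;
    }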
Defining a performance bug

I don't have a precise and bulletproof definition of a performance bug, but it would be something like: a performance bug is one where the code is slower than the simple, textbook solution to the problem, due to performing redundant or unnecessary work, choosing the wrong algorithm, or a logic error. In some cases there will be a fine line between a performance bug and plain old unoptimised code.

How to find performance bugs

* Use a profiler.
* Try single-stepping through your program in a debugger, making a note of what work is being done. Is work being repeated? John Carmack suggests this method: "An exercise that I try to do every once in a while is to “step a frame” in the game, starting at some major point like common->Frame(), game->Frame(), or renderer->EndFrame(), and step into every function to try and walk the complete code coverage." There's also an interesting quote from Carmack about a near performance bug on that same page: "[...] when I was working on the Doom 3 BFG edition release, the exactly predicted off-by-one-frame-of-latency input sampling happened, and very nearly shipped. That was a cold-sweat moment for me: after all of my harping about latency and responsiveness, I almost shipped a title with a completely unnecessary frame of latency."
* Benchmark implementations of the same function against each other. If one of the implementations is a lot slower, it may indicate a performance bug in that implementation.

If you do find a perf bug in third-party code, make sure to file a bug report with the maintainer(s)! This way software gets faster for everyone.

Conclusion

Performance bugs are out there, lurking and unseen, like the dark matter that (possibly) pervades our universe. How many performance bugs are there in the wild? It's anyone's guess. What has been your experience with performance bugs? Discuss on Hacker News or Reddit.

Watch out - Qt headers suppress warnings with Visual Studio
Posted 2 Nov 2016

As I posted in the bug report on the Qt bug tracker:

We ran into this problem recently. In my opinion this is a somewhat serious problem with Qt that should be fixed. We noticed this after fixing a bug in our code: a function was incorrectly taking an int argument instead of a double argument, and we were passing a double in. But the usual warning was not reported. The usual warning would be something like:

    warning C4244: 'initializing' : conversion from 'double' to 'int', possible loss of data

After a little experimentation I determined that the Qt headers are effectively suppressing warnings for all code after they are included, due to the use of #pragma warning in qglobal.h, as noted in this bug report. I don't think it's a good idea for Qt to disable warnings like this such that it affects user code. If Qt wasn't doing this we would probably have found our bug much earlier. One possible solution is to use #pragma warning push at the start of Qt headers, then #pragma warning pop at the end.

In my opinion, 'polluting' the compiler warning state the way Qt does is bad form, and is similar to 'using namespace std' in a header file, which we all know is a bad idea. Although there is a QT_NO_WARNINGS option, I think it's a good idea to try fixing this with the pragma push/pop idea above. Having an option is better than nothing, but Qt should not have dodgy behaviour by default.

EDIT: Discussion on Reddit
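The push/pop fix would look something like this (a sketch of the general pattern, not actual Qt source):

    // At the top of the Qt header:
    #pragma warning(push)
    #pragma warning(disable: 4244) // whatever suppressions the header needs

    // ... the header's contents ...

    // At the bottom of the Qt header:
    #pragma warning(pop) // user code below gets its original warning state back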
Functional Programming and Performance
Posted 20 Oct 2016

I gave a talk at the Functional Programming North East meetup yesterday, on functional programming and performance. The talk was on ways functional programming can allow/produce fast programs, covering the following topics:

* Garbage collection and reference counting, and how this can be made easier with functional programming
* Parallelism and functional programming
* Some weaknesses of functional programming performance

Slides are available in Open Office format here, or as a PDF here. Thanks to FP North East for having me! Edit: Comments on reddit here.

Visualising the topology of the rotations with the help of Portals
Posted 10 Oct 2016

Summary: I present a way of visualising the topology of the space of rotations (SO(3)), using the analogy of portals from the computer game Portal. I use this analogy to explain Dirac's belt trick in a hopefully intuitive way.

The space of possible rotations can be thought of as a sphere, where every point in the sphere defines a rotation in the following way: the rotation is around a vector from the centre of the sphere to the point, in the counter-clockwise direction as usual, with the angle of rotation determined by the length of the vector. This space of rotations is known as the group SO(3) - the group of special orthogonal 3x3 matrices. (Note that in mathematics, rotations themselves don't say anything about paths from an initial to a final state. Rather, a rotation is more like an 'instantaneous' change in orientation.)

The radius of the sphere is \(\pi\), so the point at the top of the sphere corresponds to a counter-clockwise rotation of \(\pi\) radians (half a full rotation) around the vertical axis, and the point at the bottom of the sphere corresponds to a counter-clockwise rotation of \(\pi\) radians around the down vector. Since this vector points in the opposite direction to the vertical axis, the net result is the same rotation. This means that in the space of rotations these two points are (topologically) identical. Likewise, any two points directly opposite each other on the boundary of the sphere are identical. This can be visualised as a portal connecting any two opposite points, like in the computer game Portal:

In the computer game Portal, you can construct portals, a pair of which allow the player or objects to travel ('teleport') from one portal to the other. In other words they connect space a bit like a wormhole does in general relativity and science fiction. One portal is always orange, and this portal is connected to the other portal of the pair, which is always blue.

Now imagine a path through the space of rotations. Consider a path that goes from the bottom of the sphere to the top. This is a path from \(-\pi\) to \(\pi\) rotation around the upwards vertical axis. It is also a loop (closed path), since it ends up back at the same rotation. I like to think of this path as travelling through the portals. This path is shown in the following diagram:

A path from \(-\pi\) to \(\pi\) rotation

We can also illustrate this path with a series of teapots. Each teapot is rotated by an amount given by stepping along the path a certain distance. The left teapot corresponds to the bottom of the path. The middle teapot corresponds to the middle of the path, and just has the identity rotation. The right teapot corresponds to the top of the path. Note that the rotations 'wrap around', and the right teapot ends up with the same rotation as the left teapot.

Now the question arises: can we contract the loop to a single point? E.g. can we continuously deform the loop, so it gets smaller and smaller, until it reaches a single point? It's clear using the portal analogy that this is not possible. If we try to move one of the portals leftwards around the sphere, to bring it closer to the other portal, then the other portal moves rightwards, since the portals are constrained to be directly opposite each other! If you think about it, you will see that you can't contract a loop running through the portals to a point.

Now let's suppose that we have a path that crosses the space of rotations twice, e.g. a double rotation, or a rotation from \(-2\pi\) to \(2\pi\).

A path from \(-2\pi\) to \(2\pi\) rotation. Note that the path enters portal a and exits portal a', then enters portal b and then exits portal b'. The paths are shown slightly separated for clarity.
In the double rotation case, we can contract this loop to a single point! To do so, we use the following trick: we rotate one pair of portals continuously around the sphere, until they match up with the other pair. Then the loop shrinks to a point!

A deformation of a \(4\pi\) rotation, shown in four steps.

This can be demonstrated in an animation. As time advances in the animation, we move the portals around the sphere as above, until the portals meet, and the path has contracted to a single point, at which there is no change in rotation along the path.

Animation of a path through SO(3), starting with a \(-2\pi\) to \(2\pi\) rotation, deformed continuously until the path vanishes.

This continuous deformation of a path from a \(-2\pi\) to \(2\pi\) rotation to a point can be physically demonstrated by twisting a belt, or alternatively with the 'Balinese candle dance', both of which I encourage you to try! (See this video for an example.)

Further reading:
Rotation group SO(3) on Wikipedia
Plate trick on Wikipedia

Comments and corrections welcome!

Slow Visual Studio implementation of aligned malloc
Posted 3 Oct 2016

For some reason the aligned memory allocator supplied by VS 2015 is really slow. See the following results:

SSE::alignedMalloc is our implementation, which works in the standard way: it allocates the requested size + alignment, then moves the returned pointer up so that it is aligned, while storing the original pointer so it can be freed later. _mm_malloc is the VS function that returns aligned memory; it compiles to a call to the same function as _aligned_malloc. Finally, malloc is of course the standard, boring memory allocation function.

As you can see, for some weird reason, VS's _mm_malloc is much slower than malloc and our aligned alloc implementation (SSE::alignedMalloc). Issue filed with MS here.
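A minimal sketch of the standard technique described above (not the actual SSE::alignedMalloc source):

    #include <cstdlib>
    #include <cstdint>

    // Allocate size bytes aligned to 'alignment' (a power of two), storing the
    // original pointer just below the returned, aligned pointer.
    void* alignedMalloc(size_t size, size_t alignment)
    {
        void* original = malloc(size + alignment + sizeof(void*));
        if(!original)
            return nullptr;
        uintptr_t p = (uintptr_t)original + sizeof(void*);
        p = (p + alignment - 1) & ~(uintptr_t)(alignment - 1); // round up to alignment
        ((void**)p)[-1] = original; // remember the original pointer
        return (void*)p;
    }

    void alignedFree(void* p)
    {
        if(p)
            free(((void**)p)[-1]); // free via the stored original pointer
    }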
Electronic painter experiment #3
Posted 2 Oct 2016

Now with added 3d effect.

Electronic painter experiment #2
Posted 23 Sep 2016

Electronic painter experiment #1
Posted 22 Sep 2016

Azure website is so slow
Posted 21 Sep 2016

https://account.windowsazure.com/subscriptions takes 17 seconds to load. WTF Microsoft? You would think with all those computers it could allocate some to serving the account information efficiently.

Stack overflow internal compiler error in vs2015
Posted 17 Sep 2016

This one was a 'fun' issue to deal with. Actually, in my experience it's not that common to find a compiler issue that can be replicated with such a small amount of code, which is kind of satisfying. Compiling

    uint32 bitTest(uint32 y, uint32 N)
    {
        int shift = 1;
        for(uint32 i=0; i<N; ++i)
            shift = shift >> (shift + 1);
        return shift;
    }

in VS2015 update 3, with optimisations enabled, gives

    n:\indigo\trunk\utils\bitfield.cpp(21): fatal error C1063: compiler limit: compiler stack overflow

Issue report filed with MS here.

std::map performance bug #3 - slow find
Posted 8 Aug 2016

In the Visual Studio 2015 standard library, the implementation of std::map::find is unnecessarily slow. The underlying reason seems to be similar to that for bug #2: using operations on a generic tree class that don't take advantage of the restrictions placed on std::map - namely that entry keys are unique.

The issue in detail: std::map::find() calls std::_Tree::lower_bound, which calls _Lbound, which looks like

    _Nodeptr _Lbound(const _Other& _Keyval) const
    {   // find leftmost node not less than _Keyval
        _Nodeptr _Pnode = _Root();
        _Nodeptr _Wherenode = this->_Myhead();  // end() if search fails
        while (!this->_Isnil(_Pnode))
            if (_Compare(this->_Key(_Pnode), _Keyval))
                _Pnode = this->_Right(_Pnode);  // descend right subtree
            else
            {   // _Pnode not less than _Keyval, remember it
                _Wherenode = _Pnode;
                _Pnode = this->_Left(_Pnode);   // descend left subtree
            }
        return (_Wherenode);    // return best remembered candidate
    }

which is a traversal down a binary tree. The issue is that the traversal doesn't stop when an interior node has a key that is the same as the target key. This is presumably because, if you allow multiple entries with the same key, you need to keep traversing to get the leftmost of them. However, since we know that entry keys are unique in std::map, the traversal could stop as soon as it reaches an interior node with a matching key.

Proposed fix

The fix is to check the interior nodes' keys for equality with the target key, and return from the traversal immediately in that case.

    _Nodeptr _Lbound(const _Other& _Keyval) const
    {   // find leftmost node not less than _Keyval
        _Nodeptr _Pnode = _Root();
        _Nodeptr _Wherenode = this->_Myhead();  // end() if search fails
        while(!this->_Isnil(_Pnode))
            if(_Compare(this->_Key(_Pnode), _Keyval))
                _Pnode = this->_Right(_Pnode);  // descend right subtree
            else
            {   // NEW: else node >= keyval
                if(!_Compare(_Keyval, this->_Key(_Pnode))) // NEW: if !(keyval < node):
                    return _Pnode;                         // NEW: then keyval == node, so return it

                // _Pnode not less than _Keyval, remember it
                _Wherenode = _Pnode;
                _Pnode = this->_Left(_Pnode);   // descend left subtree
            }
        return (_Wherenode);    // return best remembered candidate
    }

This doesn't change the asymptotic time complexity of the find call - it's still O(log(n)) - but it does somewhat decrease the number of nodes that need to be traversed on average. It does require additional key comparisons, which could result in a net slowdown in some cases, but the cost of the unneeded memory accesses is likely to be higher than the cost of the additional comparisons in most cases, I believe, especially as the number of entries in the map increases and cache misses become more common.

Results with the proposed fix

I implemented the proposed fix above in a class copied from std::map, called modified_std::map. Doing 1M lookups in a map and modified map with 2^16 unique int keys:

    modified_std::map find: 0.045002 s, num_found: 1000000
    std::map find:          0.052951 s, num_found: 1000000

The modified map find method is approximately 15% faster. This isn't a huge amount, but it is worth optimising/fixing for such a commonly-used operation! There is a similar issue in _Eqrange.
std::map and std::unordered_map performance bug #2 - slow count
Posted 7 Aug 2016

How do you tell if an entry with a specific key has been inserted into a std::map or std::unordered_map? Historically the way was to call find and compare against end, e.g.

    if(m.find(x) != m.end()) // If present:
        // do something

With the introduction of C++11, there is another technique: calling count.

    if(m.count(x) != 0) // If present:
        // do something

count returns the number of entries with the given key. Note also that in std::map and std::unordered_map, count can only return 0 (in the case where the entry is not present) or 1 (when it is). Multiple entries with the same key are not allowed in a map or unordered_map.

You might think that both techniques for testing if an element with a given key is in a map would have similar performance, but this is not the case with some implementations. Performance results for VS2015, Release build:

    std::map count:           1.196503 s, keys_present: 10000000
    std::map find:            0.612619 s, keys_present: 10000000
    std::unordered_map count: 0.232709 s, keys_present: 10000000
    std::unordered_map find:  0.134992 s, keys_present: 10000000

What we see is that count takes about twice the time that find does. Why does count take so long? Stepping through the code in the VS debugger reveals the answer. count calls equal_range, which computes the range of elements with the same key. equal_range effectively does two tree traversals, one for the start of the range and one for the end. This is the reason why count takes approximately twice as long (at least for std::map). I haven't looked into why std::unordered_map::count is slow, but I assume it's for a similar reason.

Proposed solution

My proposed solution is to replace the implementation of std::map::count with something like the following:

    size_t count(Keyval& k)
    {
        return (find(k) == end()) ? 0 : 1;
    }

Based on the test results above, this should make count approximately two times faster for std::map.

More results

Clang (clang version 3.6.0 (trunk 225608)) on Linux:

    std::map count:           0.715613 s, keys_present: 10000000
    std::map find:            0.708204 s, keys_present: 10000000
    std::unordered_map count: 0.200249 s, keys_present: 10000000
    std::unordered_map find:  0.110208 s, keys_present: 10000000

Only std::unordered_map::count seems to suffer from this problem; std::map::count is fine.

Xcode (Apple LLVM version 7.3.0 (clang-703.0.31)):

    std::map count:           0.392112 s, keys_present: 10000000
    std::map find:            0.413565 s, keys_present: 10000000
    std::unordered_map count: 0.142392 s, keys_present: 10000000
    std::unordered_map find:  0.159157 s, keys_present: 10000000

This standard library does not seem to have the problem. Test source code: map_count_slowness.cpp. Edit: discussion on reddit here.

An apparent performance bug with some std::map and std::unordered_map implementations
Posted 6 Aug 2016

A few days ago my colleague Yves mentioned a perplexing finding to me: it was significantly faster to call find on a std::unordered_map, and then insert an item into the map only if the find call indicates it is not present, than to just attempt to insert the item directly with an insert call. This is strange, because the semantics of std::unordered_map::insert are that if an item with a matching key is already in the map, then the item is not updated, and insert returns a value indicating that it failed. Calling find, then insert, is effectively doing the lookup twice in the case where the item is not in the map already (and doing the lookup once when it is). Calling insert directly should only be doing the lookup once, and hence should be of comparable speed or faster. Here is some code demonstrating the problem:

    const int N = 1000000; // Num insertions to attempt
    const int num_unique_keys = 65536;

    // Method 1: just call insert():
    std::unordered_map<int, int> m;
    for(int i=0; i<N; ++i)
        m.insert(std::make_pair(i % num_unique_keys, i));

    // Method 2: call find() first, and insert() only if the key is not already present:
    std::unordered_map<int, int> m2;
    for(int i=0; i<N; ++i)
        if(m2.find(i % num_unique_keys) == m2.end())
            m2.insert(std::make_pair(i % num_unique_keys, i));
std::sort -> serial radix sort gives a ~6.2 times speedup.
serial radix sort -> parallel radix sort gives a ~2.1 times speedup.

This is way below a linear speedup over the serial radix sort, considering I am using 8 threads in the parallel sort. This likely indicates that memory bandwidth is the limiting factor.

Limitations of radix sort

Radix sort is not always the appropriate sorting choice. It has some memory and speed overheads that make it a poor choice for small arrays; once the number of elements gets smaller than 300 or so, I have measured std::sort to be faster. In addition, to be able to radix sort items, you need to be able to convert each key into a fixed-length string of bits, where the ordering on the bit-keys is the same as the desired ordering. Therefore you can't sort e.g. arbitrary-length strings with radix sort. (Although you could probably use some kind of hybrid sort: radix sort first, then finish with a quicksort or something like that.)
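To give a flavour of such bit-key conversions: for floats, the kind of transform described in the 'Radix Tricks' article below maps each float to a uint32 whose unsigned ordering matches the float ordering (a sketch; see the article for the full details and the inverse mapping):

    #include <cstdint>
    #include <cstring>

    // Flip the sign bit for positive floats, flip all bits for negative floats,
    // so that the resulting uint32 sorts in the same order as the source float.
    uint32_t floatToSortableBits(float x)
    {
        uint32_t f;
        std::memcpy(&f, &x, 4);
        const uint32_t mask = (uint32_t)(-(int32_t)(f >> 31)) | 0x80000000u;
        return f ^ mask;
    }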
Conclusion

Parallel radix sort is awesome! If you are really desperate to sort fast on the CPU and want to use/licence the code, send me an email. Otherwise we'll probably open-source this parallel radix sorting code once we've got Indigo 4 out the door. Implementing a parallel radix sort is also quite fun and educational. I might write an article on how to do it at some point, but writing your own is fun as well!

Further reading
'Radix sort' on Wikipedia
'Radix Sort Revisited' by Pierre Terdiman - has a nice technique for radix sorting floating point data.
'Radix Tricks' by Michael Herf - has another (better?) technique for radix sorting floating point data, and some other tricks.

Hash functions, one-way functions and P = NP
Posted 13 Feb 2016

Take a function \(f\) that maps from some bit string of length \(N\) to another bit string also of length \(N\). Suppose that this function is a 'hash function' with some properties we want:

1) Given an output value, it takes on average \(2^{N-1}\) evaluations of \(f\) on different inputs before we find an input giving that output value. Note that if we try all \(2^N\) different inputs, we will definitely find the input that gives the output we are searching for, if it exists. So we want our hash function to require half that number of evaluations on average, e.g. \({2^N}/2 = 2^{N-1}\).

2) \(f\) runs in polynomial time on the length of the input \(N\). All practical hash functions have this property.

So let's consider a search problem - call it \(S\). Instances of \(S\) consist of an output string \(y\) (of length \(N\) bits), and the problem is to find an input \(x\) such that \(f(x) = y\). The first property we are assuming of \(f\) implies that any algorithm that solves this problem runs in time \(2^{N-1}\), which is exponential in \(N\), so not polynomial in \(N\). So \(S\) is not an element of \(P\), the set of problems that are solvable in polynomial time.

Now, let's suppose that we have some proposed input value \(x\). We can verify that \(f(x) = y\) in polynomial time, since we have assumed that \(f\) runs in polynomial time. If \(f(x) \neq y\), the proposed solution will be rejected in polynomial time. This means that we have an algorithm (the verification algorithm) that verifies a proposed solution in polynomial time. But we know that the complexity class \(NP\) (non-deterministic polynomial) is the class of problems that can be verified in polynomial time. Therefore \(S\) is in \(NP\). Since we already proved (given our assumptions about \(f\)) that \(S\) is not in \(P\), we therefore have \(P \neq NP\).

So we have the following situation: the existence of a hash function with the properties we want implies \(P \neq NP\). Since no-one knows if \(P = NP\), we can deduce that no-one knows if hash functions with the properties we want exist.
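The search problem \(S\) is just a brute-force preimage search. A sketch (taking N = 64; the function f here is a hypothetical placeholder for a hash with the assumed properties):

    #include <cstdint>
    #include <optional>

    uint64_t f(uint64_t x); // hypothetical 64-bit-to-64-bit hash function

    // Brute-force preimage search: by property 1), this makes ~2^63 calls
    // to f on average before finding a preimage of y.
    std::optional<uint64_t> findPreimage(uint64_t y)
    {
        uint64_t x = 0;
        do
        {
            if(f(x) == y)
                return x;
        } while(++x != 0); // stops after all 2^64 inputs have been tried
        return std::nullopt; // no preimage exists
    }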
Fixed length outputs

Note that this proof only applies to hash functions from strings of N bits to strings of N bits. Some cryptographic hash functions in the real world are defined like this, such as the Skein hash function. However, most cryptographic hash functions produce a fixed-length output, such as SHA-256, which has a 256-bit output length. Regardless, I think the same general principle applies. As far as I know, no hash function with a fixed-length output has been proved to have the analogous properties to those listed above. In other words, no hash function with fixed output length N has been proved to take exponential time in N to find an input that maps to a given output.

One-way functions

This property of hash functions we are thinking about is also called the 'one-way' function property. A hash function with this property is one-way in the following sense: we can compute the output \(f(x)\) from the input \(x\) easily and efficiently. However, going in the 'other direction', e.g. computing the inverse \(f^{-1}(y)\), is not computable efficiently. Remarkably, we don't know if such one-way functions exist.

Implications

The implication of the above proof and comments is that no-one knows if cryptographic hash functions like the SHA hash functions really have the secure properties that people hope they do. Many aspects of modern software depend on such hash functions, such as message authentication codes, code signing of executables, and even systems such as Bitcoin, which uses the SHA-256 hash function in its proof-of-work scheme. That these hash functions are not proved to be cryptographically secure is somewhat well known, but it's not that uncommon to hear people assuming that the security or properties of such systems have actually been proved. This is quite similar to the situation with factoring of large numbers, which is used in public-key cryptography: factoring is suspected, or commonly treated as if it is, not solvable in polynomial time, even though this has not been proved.

More reading:
'Cryptographic hash functions' on Wikipedia
'One way functions' on Wikipedia
Introduction to Algorithms, Third Edition

Edit: Discussion on Reddit

Permutations and sorting
Posted 7 Feb 2016

In this blog post I'm going to write a little bit about permutations, and some connections with information theory and sorting algorithms, including giving a sketch-proof of the \(n \log n\) lower bound on comparison-based sorting.

A permutation is a re-ordering of a list. For example, (1, 3, 2), (3, 2, 1) and (2, 1, 3) are all permutations of the list (1, 2, 3).

Let's say you have a list with \(n\) elements in it (e.g. the length of the list is \(n\)). How many different ways of ordering the list (e.g. how many permutations) are there? When picking the first element of your permutation, you have n elements to choose from. Pick one. Then when picking the 2nd element of your permutation, you have one fewer to choose from, e.g. \(n-1\) elements. By the time you come to choose the last element of your permutation, you have only one element left. So the total number of different permutations is

$$n \times (n - 1) \times (n - 2) \times ... \times 3 \times 2 \times 1 $$

This is the definition of n factorial:

$$n! = n \times (n - 1) \times (n - 2) \times ... \times 3 \times 2 \times 1$$

So there are \(n!\) permutations of a list of n elements.

Let's say we want to store which permutation we have chosen in computer memory. Since there are \(n!\) possibilities, we need to use \(\log_2(n!)\) bits to store this information. (In general, to store a choice that has x possibilities, you need to use \(\log_2(x)\) bits. For example, with 256 possibilities, you need to use \(\log_2(256) = 8\) bits.)

One of the most important features of the log function is the following rule: the 'log of products' is the 'sum of the logs'. In other words,

$$\log(a \times b) = \log(a) + \log(b)$$

Let's have a look at the expression \(\log_2(n!)\), substituting in the definition of \(n!\):

$$\log_2(n!) = \log_2(n \times (n-1) \times (n-2) \times ... \times 2 \times 1)$$

Using the rule above to break up the log:

$$\log_2(n!) = \\ \log_2(n \times (n-1) \times (n-2) \times ... \times 2 \times 1) = \\ \log_2(n) + \log_2(n-1) + \log_2(n-2) + ... + \log_2(2) + \log_2(1) $$

The last line can be interpreted in an interesting way - it gives the number of bits required when storing each element choice for the permutation separately. For example, \(\log_2(n)\) is the information needed to store the first choice, which was from \(n\) options. \(\log_2(n-1)\) is the information needed to store the second choice, which was from \(n-1\) options. \( \log_2(1) \) is the information needed to store the last choice, which has only 1 option, e.g. there was no choice for it. Happily, \(\log_2(1) = 0\) - no information is needed for this null choice!

This is one of the main reasons (perhaps the only reason?) why the log function pops up in information theory. When you have two independent choices to make, the number of possibilities is the product of the two numbers of possibilities for each choice. But the information required for both is the sum of the two individual pieces of information. The log function is the only function with this property.

Sorting algorithms

The task of a sorting algorithm is to take a permutation of some original sorted list, and return the original sorted list. Consider a comparison-based sorting algorithm, which at each step can compare two elements together with the less-than (<) operator. At each step it can eliminate up to half of the possible permutations. In other words, it can gather one bit of information with each comparison. So to gather all the bits of information needed to determine which permutation it is looking at, it needs to make \(\log_2(n!)\) comparisons. It could do it less efficiently, but this is the lower bound on the number of comparisons which can handle the worst case. (This is a very hand-wavy 'proof', but I hope you get the general idea.)

Stirling's approximation

Stirling's approximation tells us that

$$\ln(n!) \approx n \ln(n) - n$$

Here \(\ln(x)\) is the natural logarithm function, e.g. \(\log\) with base \(e\). We can convert natural logarithms to log base 2 by dividing by \(\ln(2)\), which is about 0.6931, e.g.

$$\log_2(x) = \ln(x) / \ln(2)$$

or

$$\log_2(x) \times \ln(2) = \ln(x)$$

Using this rule in the previous equation gives:

$$\ln(n!) \approx n \ln(n) - n \\ \log_2(n!) \times 0.6931 \approx n \log_2(n) \times 0.6931 - n \\ \log_2(n!) \approx n \log_2(n) - n / 0.6931 $$

When you are doing asymptotic complexity analysis, constants don't matter. So the optimal asymptotic complexity of a comparison-based sort algorithm is \(n \log_2(n)\), which is the same asymptotic complexity as \(n \ln(n)\).

Let's look at some actual data points.
For \(n = 10\):

    n: 10
    n!: 3628800
    log_2(n!): 21.791061114716953
    n*log_2(n): 33.219280948873624
    Stirling's approximation (n log_2(n) - n / log(2)): 18.79233053998399

For \(n = 32\):

    n: 32
    n!: 2.631308369336935e35
    log_2(n!): 117.66326365236031
    n*log_2(n): 160
    Stirling's approximation (n log_2(n) - n / log(2)): 113.83375869155317

For \(n = 150\):

    n: 150
    n!: 5.7133839564458505e262
    log_2(n!): 872.8595063470792
    n*log_2(n): 1084.322803574382
    Stirling's approximation (n log_2(n) - n / log(2)): 867.9185474410376

One interesting thing to note is that n*log_2(n) is quite a poor estimate of the number of comparisons needed (it is quite a lot too large), at least for these relatively small n values. The (ceiling'd) log_2(n!) values give the actual number of comparisons needed.

So there you go - some curious things about permutations and sorting. I hope some people found this interesting. If there are any bits of maths that you don't understand, let me know and I will go into more depth. Corrections welcome of course.
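These values are easy to reproduce (a minimal sketch; summing \(\log_2(k)\) avoids overflowing n! for large n):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        for(int n : {10, 32, 150})
        {
            double log2_fac = 0; // log_2(n!) = sum of log_2(k) for k = 1..n
            for(int k = 1; k <= n; ++k)
                log2_fac += std::log2((double)k);

            const double n_log2_n = n * std::log2((double)n);
            const double stirling = n_log2_n - n / std::log(2.0);
            std::printf("n: %d, log_2(n!): %f, n*log_2(n): %f, Stirling: %f\n",
                n, log2_fac, n_log2_n, stirling);
        }
        return 0;
    }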
Slow pscp upload speeds
Posted 27 Jan 2016

I sent this email to the PuTTY guys, but I haven't heard back from them, so I'll post it here.

First of all, thanks for your work on PuTTY. It's a very useful tool for us. However, pscp uploads much more slowly than it should for us. We are uploading from our office network (UK) to a server in a German data centre. Ping latency to the server is ~31ms. Uploading a ~100MB file, pscp stabilises at around 220kB/s:

    IndigoRenderer_x64_4.0.14 | 100580 kB | 218.2 kB/s | ETA: 00:00:43 | 91%

The computer it is running on is running Windows Server 2008, on an i7-2600, with 16GB RAM. In comparison, scp running on a Linux machine on the same LAN, uploading to the same server, gets around 8MB/s upload speed. speedtest.net shows a ~60Mb/s upload speed from our office. This indicates to me that the pscp upload speeds are a lot slower than they could be. Is this a problem you are aware of? Are there any plans to improve the upload speed?

Finding annoying programs that steal window focus
Posted 22 Nov 2015

Such programs are super annoying. I had this problem on my computer - some program was occasionally stealing window focus, resulting in being switched out of full-screen games at inopportune times etc. So I made a program to detect such annoying programs: https://github.com/glaretechnologies/focusmonitor. I figured out the offending program is C:\Users\nick\AppData\Local\Akamai\installer_no_upload_silent.exe. Dear Akamai: please fix your annoying program to not steal window focus. Thank you.

I have looked at this program that presumably does the same thing: http://www.happydroid.com/focus, but Windows Defender said it had a virus in it. Not sure if that is true, but I'm hesitant to run unsigned executables from random sources anyway. So I made my own.

False Sharing Curiosities
Posted 18 Oct 2015

I ran into an interesting effect this weekend while working on some parallel counting code. After a little experimentation I determined that it is likely to be a false sharing effect. False sharing is where the same cache line is read/modified by multiple different threads in a loop, and it can cause major performance degradation.

The counting code loop looked like this:

    for(size_t i=0; i<num; ++i)
        local_counts[data[i]]++;

Here is the test code:

    #include <thread>
    #include <vector>
    #include <random>
    #include <chrono>
    #include <iostream>
    #include <cstring>
    #include <xmmintrin.h> // for _mm_malloc / _mm_free

    void threadFunc(const int* data_, size_t num, size_t* local_counts_)
    {
        const int* const data = data_;
        size_t* const local_counts = local_counts_;
        for(size_t i=0; i<num; ++i)
            local_counts[data[i]]++;
    }

    void testWithPadding(const std::vector<int>& data, size_t padding_elems)
    {
        auto start_time = std::chrono::high_resolution_clock::now();

        static_assert(sizeof(size_t) == 8, "sizeof(size_t) == 8");
        const int num_threads = 4;

        // Allocate memory for counts, aligned to 128-byte boundaries.
        // Each thread gets 8 count elements, plus padding_elems elements of spacing.
        size_t* counts = (size_t*)_mm_malloc(num_threads * (8 + padding_elems) * sizeof(size_t), 128);
        std::memset(counts, 0, num_threads * (8 + padding_elems) * sizeof(size_t)); // zero counts

        std::thread threads[num_threads];
        const size_t slice = data.size() / num_threads;
        for(int i=0; i<num_threads; ++i)
            threads[i] = std::thread(threadFunc, data.data() + i*slice, slice, counts + i*(8 + padding_elems));

        for(int i=0; i<num_threads; ++i)
            threads[i].join();

        auto end_time = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end_time - start_time;
        std::cout << "padding: " << padding_elems*sizeof(size_t) << " B, elapsed: " << diff.count() << " s" << std::endl;

        _mm_free(counts);
    }

    void falseSharingTest()
    {
        // Generate random data to count
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, 7);
        const size_t DATA_SIZE = 1 << 26;
        std::vector<int> data(DATA_SIZE);
        for(size_t i=0; i<DATA_SIZE; ++i)
            data[i] = distribution(generator);

        // Run the count with various amounts of padding between each thread's counts.
        for(size_t padding_elems = 0; padding_elems <= 16; ++padding_elems)
            testWithPadding(data, padding_elems);
    }
bit_cast<T>(x) would get the bits of x, and reinterpret them as the bits of a value of type T. For example, you would be able to write

    float x = 1.0f;
    const uint32_t y = bit_cast<uint32_t>(x);

and then y would have the value 0x3f800000. From a generated-code point of view this would be a no-op. (Or maybe some instructions would have to be emitted for moving to/from floating point registers in this case?) Edit: a bit_cast is only valid if the source and destination types are the same size. Using bit_cast on types differing in size would give a compile-time error.

This kind of bit conversion needs to be done pretty often if you are writing low-level, high-performance code, or networking code. There are a number of ways to do something similar in C++ today, but they all have problems, either with correctness or elegance.

Current option 1: pointer casting

Possibly the most old-school way is to pointer cast, like this:

    float x = 1.0f;
    const uint32_t y = *(uint32_t*)&x;

The problem here is that it is undefined behaviour to write through one pointer type (float*) and read through another (uint32_t*), due to aliasing rules. Compilers may whinge about it and/or generate code that does not do what you want. For example, GCC says "warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]" (see code example here: https://goo.gl/wZcV2D).

Current option 2: union abusing

The trick here is to write into one of the union fields, then read from the other one.

    float x = 1.0f;
    union fi_union
    {
        float f;
        uint32_t i;
    };
    fi_union u;
    u.f = x;
    const uint32_t y = u.i;

This *may* work in practice, but it is syntactically ugly, is an abuse of the idea of unions, and may be undefined behaviour.

Current option 3: memcpy

The idea here is to just memcpy() from the source to the destination, and rely on the compiler to recognise what you are doing and not actually emit the memcpy function call. Example:

    template <class Dest, class Src>
    inline Dest bit_cast(const Src& x)
    {
        static_assert(sizeof(Src) == sizeof(Dest), "sizeof(Src) == sizeof(Dest)");
        Dest d;
        std::memcpy(&d, &x, sizeof(Dest));
        return d;
    }

This works well in practice, but is a little dangerous in the sense that you are relying on a compiler optimisation to avoid very slow code (e.g. it would be very slow if a function call was actually emitted).

Current (non-)option 4: static_cast

static_cast does not do what we want.

    float x = 1.0f;
    const uint32_t y = static_cast<uint32_t>(x);

puts the value '1' in y, not 0x3f800000 as we want.

Current (non-)option 5: reinterpret_cast

To be honest I'm not sure what reinterpret_cast is for. I never use it. It also doesn't do what we want:

    float x = 1.0f;
    const uint32_t y = reinterpret_cast<uint32_t>(x);

gives, in Visual Studio 2012:

    error C2440: 'reinterpret_cast' : cannot convert from 'float' to 'uint32_t'
    Conversion is a valid standard conversion, which can be performed implicitly or by use of
    static_cast, C-style cast or function-style cast

Semantics (added in an edit)

The built-in bit_cast would give the same result as the memcpy-based bit_cast function above. Because we know 'd' (the destination) is in a different memory location from 'x' (the source), there shouldn't be any aliasing problems.

Conclusion

So in conclusion, the alternatives are basically hacky workarounds for something missing from the original language. Older alternatives are increasingly rendered invalid by fussier optimising compilers. I think the problem should be solved once and for all with a built-in language feature: bit_cast. Edit: reddit discussion here. Edit: posted to the 'ISO C++ Standard - Future Proposals' list here.

A little Winter demo
Posted 11 Aug 2015

The work-in-progress inlining pass in Winter, combined with the closure and first-class function support described earlier, allows this program:

    def compose(function f, function g) : \(float x) : f(g(x))
    def addOne(float x) : x + 1.0
    def mulByTwo(float x) : x * 2.0

    def main(float x) float :
        let z = compose(addOne, mulByTwo) in
            z(x)

to compile to this:

    .LCPI0_0:
        .long 1065353216        # float 1
        .text
        .globl main_float_
        .align 16, 0x90
        .type main_float_,@function
    main_float_:                # @main_float_
    # BB#0:                     # %decr_function.exit40
        vaddss %xmm0, %xmm0, %xmm0
        movabsq $.LCPI0_0, %rax
        vaddss (%rax), %xmm0, %xmm0
        retq

The compose function isn't as pretty as in Haskell, but it works, and as you can see it can be inlined to produce fast code.
A little Winter demo
Posted 11 Aug 2015

The work-in-progress inlining pass in Winter, combined with the closure and first-class function support described earlier, allows this program:

def compose(function f, function g) : \(float x) : f(g(x))

def addOne(float x) : x + 1.0

def mulByTwo(float x) : x * 2.0

def main(float x) float :
    let z = compose(addOne, mulByTwo) in
    z(x)

to compile to this:

.LCPI0_0:
    .long 1065353216 # float 1
    .text
    .globl main_float_
    .align 16, 0x90
    .type main_float_,@function
main_float_: # @main_float_
# BB#0: # %decr_function.exit40
    vaddss %xmm0, %xmm0, %xmm0
    movabsq $.LCPI0_0, %rax
    vaddss (%rax), %xmm0, %xmm0
    retq

The compose function isn't as pretty as in Haskell, but it works, and as you can see it can be inlined to give fast code.

Closure support in Winter
Posted 6 Aug 2015

Winter now has proper runtime support for first-class functions, and for anonymous functions (lambda expressions) with lexical closures. A simple example:

def main() float : let f = \(float x) : x*x in f(2.0)

returns 4.0. In this example an anonymous function (that computes the square of its argument) is created and assigned to f. It is then called with the argument 2.0. Since functions are first class, they can be returned from functions:

def makeFunc(float x) function : \() : x

def main() float : let f = makeFunc(2.0) in f()

In this example makeFunc returns a function. To support lexical scoping, variables are captured:

def makeFunc() function : let a = [10, 11, 12, 13]va in \(int x) int : a[x]

def main(int x) int : let f = makeFunc() in f(x)

In the example above the anonymous function returned from makeFunc() captures the variable-length array a, which is indexed into when the function is actually called with f(x).

Closures are allocated on the heap by default. If they can be proven not to escape the current function (which is currently computed with a simple type-based escape analysis), they are allocated on the stack instead.

The runtime layout for a closure currently looks like this (in LLVM IR):

%anon_func_29_float__closure = type { i64, i64, float (float, %base_captured_var_struct*)*, void (%base_captured_var_struct*)*, %anon_func_29_float__captured_var_struct }

It has 5 fields: a refcount, some flags (stack- or heap-allocated), the function pointer, a destructor pointer for cleaning up the captured vars, and a structure containing the captured vars.

All this abstraction doesn't have to come at the cost of runtime inefficiency. Take this program:

def main(array<float, 4> a, array<float, 4> b) array<float, 4> : map(\(float x) : x*x, a)

It generates the optimised assembly:

    vmovups (%rdx), %xmm0
    vmulps %xmm0, %xmm0, %xmm0
    vmovups %xmm0, (%rcx)
    movq %rcx, %rax
    retq
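Written out as a C++ struct, that closure layout corresponds roughly to the following. This is just an illustrative sketch of the five fields described above, with invented names - it is not code from the Winter runtime:

#include <cstdint>

struct base_captured_var_struct; // opaque base type for captured variables

struct anon_func_29_float_closure
{
    int64_t refcount;                                 // reference count
    int64_t flags;                                    // e.g. whether stack- or heap-allocated
    float (*func)(float, base_captured_var_struct*);  // the function pointer
    void (*destructor)(base_captured_var_struct*);    // cleans up the captured vars
    struct { float x; } captured_vars;                // the captured variables themselves
};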
Unicode string handling in Winter
Posted 6 Jul 2015

There is a fantastic article by Matt Giuca about the mistakes programming language designers have made with regard to Unicode support (or the lack thereof): The importance of language-level abstract Unicode strings. He notes that only Haskell gets it right. (Maybe Rust has got it right since then?) Winter gets it right as well - strings are treated as sequences of Unicode characters, not sequences of bytes. The encoding (UTF-8) is a hidden implementation detail. Some example Winter code snippets:

codePoint(elem("a\u{393}c", 0))  # returns (int)'a'.
codePoint(elem("a\u{393}c", 1))  # returns 0x393.
codePoint(elem("a\u{393}c", 2))  # returns (int)'c'.

Note that \u{393} in the above is a Unicode escape sequence specifying the Unicode character U+0393.

Edit: Reddit discussion.

Don't use LLVM's BBVectorize pass
Posted 28 Jun 2015

Don't use it, it's buggy. I ran into a miscompilation, as described here: llvm.org/bugs/show_bug.cgi?id=23845, and it is going to be removed soon anyway: groups.google.com/forum/#!topic/llvm-dev/en-Rbu42K9w

Auto-vectorisation and parallelisation in Winter
Posted 24 May 2015

I have reached an interesting milestone in our functional programming language Winter - Winter can now do auto-parallelisation, auto-vectorisation, and even both at the same time :)

Auto-parallelisation means that code is automatically executed in parallel on multiple threads. Auto-vectorisation means that vector (SIMD) instructions are automatically used instead of scalar instructions, so that e.g. 4 or 8 multiplications can be done with one instruction.

Take some Winter code, which applies a function to an input array, generating a result array:

def square(float x) float : x*x
def main(array a, array b) array : map(square, a)

We're using a large array size here so we get a decent, measurable amount of work to do. The assembly from the inner loop looks like this:

.LBB2_3: # %vector.body
    # =>This Inner Loop Header: Depth=1
    vmovaps (%rdx,%rax), %ymm0
    vmovaps 32(%rdx,%rax), %ymm1
    vmovaps 64(%rdx,%rax), %ymm2
    vmovaps 96(%rdx,%rax), %ymm3
    vmulps %ymm0, %ymm0, %ymm0
    vmulps %ymm1, %ymm1, %ymm1
    vmulps %ymm2, %ymm2, %ymm2
    vmulps %ymm3, %ymm3, %ymm3
    vmovaps %ymm0, (%rcx,%rax)
    vmovaps %ymm1, 32(%rcx,%rax)
    vmovaps %ymm2, 64(%rcx,%rax)
    vmovaps %ymm3, 96(%rcx,%rax)
    subq $-128, %rax
    addq $-32, %r10
    jne .LBB2_3

Very nice code here - LLVM is doing a good job. You can see there are four 256-bit memory loads, then four 256-bit 8-element multiplications, then four stores. This code runs at memory speed, which is about 5.7 GB/s in this case on my machine. Even a much sloppier inner loop would still run at memory speed when there is this little work to do per element.

Resulting perf measurements are:

JITed code elapsed: 0.187149 s
JITed bandwidth: 5.73737 GiB/s
C++ ref elapsed: 0.196158 s
C++ ref bandwidth: 5.47385 GiB/s

Both the JITed Winter code and the C++ reference code are memory limited, so they run at similar speeds.

Taking another example, where more work is done per array element:

def f(float x) float : pow(x, 2.2)
def main(array a, array b) array : map(f, a)

pow() is a pretty slow function (100 cycles or more). This allows us to see the influence of auto-parallelisation. Winter auto-parallelises this map function, dividing the array into slices and passing each slice to a worker thread, on which the actual work is done.

With auto-parallelisation disabled, results are as follows:

pow(x, 2.2), single-threaded map():
JITed code elapsed: 6.32286 s
JITed bandwidth: 0.169819 GiB/s
C++ ref elapsed: 1.90434 s
C++ ref bandwidth: 0.563838 GiB/s

The reason the C++ version is faster is that Visual C++ vectorises the pow() function, whereas LLVM doesn't (I just filed a bug report on this here).

With auto-parallelisation enabled:

pow(x, 2.2), multi-threaded map():
JITed code elapsed: 1.35347 s
JITed bandwidth: 0.793327 GiB/s
C++ ref elapsed: 1.94178 s
C++ ref bandwidth: 0.552969 GiB/s

You can see from the JITed code result that the program runs about 4.67x faster, which is roughly what you would expect using 8 threads on 8 logical cores, considering that 4 of them are hyperthreading cores. No changes to the Winter code were required to use this parallelisation - it's completely automatic.

So there you have it - a proof-of-concept of auto-vectorisation and auto-parallelisation in Winter. The code is not robust yet - I just got it working this weekend. But I think it's quite a promising result. It also shows, I think, that functional programming languages can be just as fast as (or faster than) optimised, multithreaded C++ code.

Edit: Discussion on reddit.
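As a rough illustration of what the auto-parallelised map is doing, here is a hand-written C++ sketch of the slicing scheme described above. This is my illustrative code, not Winter's actual runtime; the slicing policy is an assumption:

#include <cmath>
#include <thread>
#include <vector>

// Divide the array into one contiguous slice per worker thread,
// and apply f(x) = pow(x, 2.2f) to each slice in parallel.
void parallelMapPow(const float* a, float* b, size_t N, int num_threads)
{
    std::vector<std::thread> threads;
    const size_t slice_size = N / num_threads;
    for(int t = 0; t < num_threads; ++t)
    {
        const size_t begin = t * slice_size;
        const size_t end   = (t == num_threads - 1) ? N : begin + slice_size; // last thread takes the remainder
        threads.emplace_back([=]() {
            for(size_t i = begin; i < end; ++i)
                b[i] = std::pow(a[i], 2.2f);
        });
    }
    for(std::thread& th : threads)
        th.join();
}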
Fuzz Testing
Posted 30 Apr 2015

I have spent the last week or so fixing bugs in Winter (our programming language) found by our new fuzz testing code. Fuzz testing is the process of sending lots of random input to a program, to try to uncover crashes or other undesirable behaviour. If you are writing a compiler, I highly recommend fuzz testing it - it seems to be a very effective way of finding bugs.

Some more details about our fuzzing code:

It's an in-process fuzzer - random programs are generated, compiled, and run if compilation succeeded (e.g. if the program is valid), all in the same process. Because it's in-process, as soon as a crash occurs, the fuzzing run is over. Each random program string is saved to disk before it is run, so to find the program that crashed the process, you just have to get the last line from the file on disk.

It's multithreaded, because you need all the power you can get to find bugs as fast as possible. The fuzzer runs through about 20,000 random programs per second per core, so running on 4 cores gives around 80,000 random programs per second. This relatively fast speed is possible because invalid programs are rejected early, before computationally intensive stuff like LLVM optimisation and codegen starts. Valid programs only run at about 300 programs per second per core, but most random programs are invalid, so the average speed comes out to around 20,000 programs per core per second.

The random program generation strategy is something like this (see the code sketch below for one possible implementation):

* Randomly pick a program from the unit test suite programs (several hundred, covering all language features).
* Then repeatedly mutate the program a few times, picking randomly from among a few options:
  1) Insert a random character at a random position in the program.
  2) Delete a random chunk of characters from the program.
  3) Copy a chunk of the program to another random position in the program.
  4) Read a section of a random other program from the test suite and insert it into the program.
  5) Insert a random language keyword or syntactical element (square bracket pair etc.).

As mentioned, I found this fuzzing code to be extremely effective at finding bugs. I found 40-50 crash bugs, which took something like a week to fix, and which triggered improvements and refactorings to avoid crashes. Our programming language Winter is a lot more robust now. If you are wondering how there can be so many bugs in a language compiler/runtime, bear in mind that Winter is quite complex, with generic functions, operator overloading, function overloading, type coercion, JITing, higher-order functions etc.

I also tried out American Fuzzy Lop (AFL). I ran it for 24 hours or so, in which time it got through 20% or so of the first pass. It didn't find any crashes. I'm not sure if this is because it's not very good at finding crashes, or if I had already removed all the crash bugs :) It's definitely a lot slower than our in-process multithreaded fuzzing though. Note that in theory AFL uses a more powerful kind of fuzzing - coverage-guided fuzzing, which can be considered a white-box approach, since it uses knowledge of which code paths are taken for each input program.

Therefore I would highly recommend integrating an in-process fuzzer into your test suite, and after that maybe check out AFL or similar.

(If you want to know more about Winter there are some presentation slides in PDF format from a talk I gave a while ago here: Indigo Shader Language and Winter)
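Here is a sketch of what one mutation step along those lines could look like. This is illustrative code only, not the actual Winter fuzzer - the helper names and the mutation details (chunk sizes, character ranges and so on) are invented:

#include <cstdlib>
#include <string>
#include <vector>

// Apply one randomly-chosen mutation to 'program'.
// 'test_suite' is the set of unit test suite programs to splice from.
std::string mutate(std::string program, const std::vector<std::string>& test_suite)
{
    const char* keywords[] = { "def", "let", "in", "map", "[]", "()" };
    switch(rand() % 5)
    {
    case 0: // 1) Insert a random (printable ASCII) character at a random position.
        program.insert(program.begin() + rand() % (program.size() + 1), (char)(32 + rand() % 95));
        break;
    case 1: // 2) Delete a random chunk of characters.
        if(!program.empty())
        {
            const size_t pos = rand() % program.size();
            program.erase(pos, rand() % (program.size() - pos + 1));
        }
        break;
    case 2: // 3) Copy a chunk of the program to another random position.
        if(!program.empty())
        {
            const size_t pos = rand() % program.size();
            const std::string chunk = program.substr(pos, rand() % (program.size() - pos) + 1);
            program.insert(rand() % (program.size() + 1), chunk);
        }
        break;
    case 3: // 4) Insert a section of a random other test-suite program.
        {
            const std::string& other = test_suite[rand() % test_suite.size()];
            if(!other.empty())
                program.insert(rand() % (program.size() + 1), other.substr(rand() % other.size(), 16));
        }
        break;
    case 4: // 5) Insert a random language keyword or syntactical element.
        program.insert(rand() % (program.size() + 1), keywords[rand() % 6]);
        break;
    }
    return program;
}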
Finding the fastest Turing Machine that computes a given function is uncomputable. (And the consequences for superoptimisation)
Posted 14 Mar 2015

Assume, for a contradiction, that we have some function optimise(f) that returns the fastest Turing Machine computing the (partial) function f. The exact definition of 'fastest' doesn't matter.

Now consider the function \( \operatorname{returnOneOnHalt}_f(x) \), which simulates the execution of f on input x. If f halts on x, then \( \operatorname{returnOneOnHalt}_f \) returns 1 and halts. If f doesn't halt on x, then \( \operatorname{returnOneOnHalt}_f \) doesn't halt either.

Now consider what happens when we run optimise() with \( \operatorname{returnOneOnHalt}_f \) as input, i.e. \( \operatorname{optimise}(\operatorname{returnOneOnHalt}_f) \).

For a given function f: if f halts on all inputs, then \( \operatorname{returnOneOnHalt}_f(x) = 1 \) for all x. Therefore the fastest Turing machine that computes this function is the one that just writes 1 to the tape and then halts. On the other hand, if f does not halt on all inputs, then there is at least one input value on which the optimised \( \operatorname{returnOneOnHalt}_f \) does not halt. This Turing machine cannot be the 'write one and halt' machine - for example, on that input it executes more than one step, since it does not halt.

Therefore we have created an effective procedure for determining whether f halts on all inputs - just look at the optimised \( \operatorname{returnOneOnHalt}_f \) and see if it is the 'write one to the tape and halt' machine. However, we know that such an effective procedure is impossible: deciding whether a machine halts on all inputs is undecidable, as it is at least as hard as the Halting Problem. Therefore we have a contradiction, and no such function optimise() can exist.

Pseudocode for the procedure looks like this:

function makeReturnsOneOnHalt(f)
    return function that simulates f, and when f halts, returns 1
end

function doesHaltOnAllInputs(f)
    returnsOneOnHalt_f = makeReturnsOneOnHalt(f)
    optimised_returnsOneOnHalt_f = optimise(returnsOneOnHalt_f)
    return optimised_returnsOneOnHalt_f == returnsOne
end

Consequences for superoptimisation

Superoptimisation is a compiler optimisation technique that is supposed to find the shortest (or fastest) program that computes a given function. In other words, it doesn't just make a program somewhat faster - it actually finds the optimal program. The term was introduced in the paper 'Superoptimizer -- A Look at the Smallest Program' by Henry Massalin in 1987. Interestingly, the paper defines superoptimisation as finding the shortest program, as opposed to the fastest.

If the programs being considered are Turing-complete (or 'Turing-machine equivalent'), then either kind of superoptimisation (optimising for program length or for program speed) becomes impossible (uncomputable). Finding the shortest program is uncomputable as a consequence of Kolmogorov complexity theory. As the proof above demonstrates, finding the fastest program that computes a given function is also uncomputable. This means that superoptimisation must necessarily be restricted to subsets of all possible functions, such as the total functions.

A limitation of LLVM with regard to marking sret functions as pure/readonly
Posted 14 Mar 2015

I found an interesting limitation of LLVM recently: llvm.org/bugs/show_bug.cgi?id=22853

Reassociation optimisations in Visual Studio 2012
Posted 30 Dec 2014

Multiplication in C++ is left-to-right associative, e.g. y = a * b * c means y = (a * b) * c. Note that although the mathematical operation of multiplication is associative, i.e. (a * b) * c = a * (b * c), the same is not true of floating point multiplication, due to intermediate rounding error. Therefore, in default circumstances, the compiler may not optimise y = a * b * c into y = a * (b * c), as it may give different results. However, if 'fast math' mode is enabled, compilers may in principle perform this 'reassociation'.

This can be important for a number of reasons - one is that b and c may be constants, while y = a * (b * c) is inside a loop. Reassociation then allows the b * c multiplication to be 'hoisted' outside the loop, so only one multiplication is done per iteration instead of two.

Clang performs this reassociation optimisation. For this function:

float f()
{
    const float b = 2.f;
    const float c = 3.f;
    float sum = 1.f;
    for(int i=0; i<128; ++i)
        sum += (sum + 1.f) * b * c;
    return sum;
}

the inner loop compiles to:

.LBB0_1: # =>This Inner Loop Header: Depth=1
    movaps %xmm0, %xmm3
    addss %xmm1, %xmm0
    mulss %xmm2, %xmm0
    addss %xmm3, %xmm0
    decl %eax
    jne .LBB0_1
    ret

(See the program on http://gcc.godbolt.org/) Note that there is only one multiplication (the mulss instruction) inside the loop, and that b * c is precomputed (note the 'float 6' constant).
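Spelled out at the source level, the reassociation Clang performs here is effectively the following rewrite (a hand-written sketch, not compiler output):

float f_reassociated()
{
    const float b = 2.f;
    const float c = 3.f;
    const float bc = b * c; // hoisted: b * c is computed once, outside the loop
    float sum = 1.f;
    for(int i=0; i<128; ++i)
        sum += (sum + 1.f) * bc; // one multiply per iteration instead of two
    return sum;
}

Under strict IEEE semantics (x * b) * c and x * (b * c) can round differently, which is exactly why the compiler needs fast-math permission to do this.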
Visual Studio 2012, however, gives:

static float func()
{
00000001408750A0  sub         rsp,18h
    const float b = 2.f;
    const float c = 3.f;
    float sum = 1.f;
00000001408750A4  movss       xmm4,dword ptr [__real@3f800000 (0140D9C67Ch)]
00000001408750AC  movss       xmm5,dword ptr [__real@40000000 (0140D9CBF8h)]
00000001408750B4  movaps      xmmword ptr [rsp],xmm6
00000001408750B8  mov         eax,80h
00000001408750BD  movaps      xmm3,xmm4
00000001408750C0  movss       xmm6,dword ptr [__real@40400000 (0140D9CD58h)]
00000001408750C8  nop         dword ptr [rax+rax]
    for(int i=0; i<128; ++i)
    ...

Note that b (2.0) and c (3.0) are loaded as two separate constants (the __real@40000000 and __real@40400000 loads) - there is no precomputed b * c constant - and the loop body keeps both multiplications. In other words, Visual Studio 2012 does not perform this reassociation.