https://lemire.me
Daniel Lemire's blog – Daniel Lemire is a computer science professor at the Data Science Laboratory of the Université du Québec (TÉLUQ) in Montreal. His research is focused on software performance.
The PhD Paradox: A Journey into Academia's Upside-Down World

Imagine a world where becoming a doctor isn't about years of rigorous study, but about showcasing your life's work. That's how doctorates used to roll. You'd write a book, make a groundbreaking discovery, and voila, a doctorate was yours. Fast forward to today, and we've flipped the script. Now, a PhD is less about what you've done and more about preparing you for a career in academia, often at a subsidized cost. Sounds great, right?

Here's the catch: this system works like a charm as long as universities are expanding. But what happens when they hit the brakes? You guessed it – a PhD glut. With more PhDs than professorships, the job market turns into a gladiatorial arena where only the most politically savvy survive. This isn't just about competition; it's about who can navigate the labyrinthine politics of academia.

Universities, with their pristine campuses and lofty ideals, market themselves as bastions of brilliance and nurturing. But peel back the curtain, and you might find a different story. Professors, often out of touch with the real world, teach subjects they've never truly experienced. Take entrepreneurship, for example. You'd think those teaching it would have started a business, right? Nope.
Many haven't even stepped into a startup, let alone run one. Then there's the publishing game. Tenured professors, the supposed engines of new knowledge, might not even produce a paper a year when you account for co-authorships. And when they do publish, well, let's just say the quality can be… variable. Even at the pinnacle of academia, like Harvard, the standards can slip, as seen with its former president's less-than-stellar publication record.

So why do we keep pushing our youth into this system? It's all about signaling. A degree, especially a PhD, is like a badge, a shiny sticker that says, "I'm educated." But here's the kicker – this badge might not make you more productive or happier. In fact, less time in school and more time in the real world could be the real recipe for success.

Imagine if we recruited professors not just for their academic credentials but for their real-world achievements. People who've actually built things that work could revolutionize how we teach software engineering or entrepreneurship. But we're not there yet. We're still caught in a system that values form over function, prestige over practicality. Our love affair with academia might be making us less productive and more miserable. Maybe it's time we rethought this whole PhD business, not as a rite of passage into an elite club, but as a tool for real-world impact. After all, isn't education supposed to prepare us for life, not just for more education?

Posted on September 11, 2024. Author: Daniel Lemire. 5 Comments on The PhD Paradox: A Journey into Academia's Upside-Down World

Replace strings by views when you can

C++ programmers tend to represent strings using the std::string class. Although the implementation might vary, each instance of an std::string might use 32 bytes. That is not a large amount of memory, but it can add up. In the Node.js runtime, as part of the build tools, there is a function which precomputes the string representation of all 16-bit integers, each followed by a comma.
std::vector<std::string> precompute_string() {
  size_t size = 1 << 16;
  std::vector<std::string> code_table(size);
  for (size_t i = 0; i < size; ++i) {
    code_table[i] = std::to_string(i) + ',';
  }
  return code_table;
}

Creating 65536 strings uses 2 megabytes on most systems. What could we do instead? We could create a single string buffer made of the string '0,1,2,…,65535,' and record the offsets within the string so that we can locate quickly where a given value is. That is, given the integer 500, I want to get immediately the location of the substring '500,'. I need two offsets: the offset of the current value and the offset of the next value. The unique string requires only 382106 bytes, which is quite a bit, but several times less than the 2 megabytes needed by the array of std::string instances. In C++, you might code it as follows. For simplicity, we can roll our own integer-to-string routine.

std::pair<std::array<char, 382106>, std::array<uint32_t, 65537>> precompute_string_fast() {
  std::array<char, 382106> str;
  std::array<uint32_t, 65537> off;
  off[0] = 0;
  char *p = &str[0];
  constexpr auto const_int_to_str = [](uint16_t value, char *s) -> uint32_t {
    uint32_t index = 0;
    do {
      s[index++] = '0' + (value % 10);
      value /= 10;
    } while (value != 0);
    for (uint32_t i = 0; i < index / 2; ++i) {
      char temp = s[i];
      s[i] = s[index - i - 1];
      s[index - i - 1] = temp;
    }
    s[index] = ',';
    return index + 1;
  };
  for (int i = 0; i < 65536; ++i) {
    size_t offset = const_int_to_str(i, p);
    p += offset;
    off[i + 1] = off[i] + offset;
  }
  return {str, off};
}

We do not actually need to store the offsets. We can compute them on the fly instead quite economically.
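For instance, here is a minimal sketch (mine, not from the original post) of how the offset of a value could be computed on the fly, using the fact that all values with the same number of digits occupy the same number of bytes in the big buffer:

// Sketch (not from the original post): offset of the substring for value i
// within the big '0,1,2,...,65535,' buffer. Values with d digits take d+1 bytes.
uint32_t offset_on_the_fly(uint32_t i) {
  if (i < 10)    { return 2 * i; }                 // values 0..9
  if (i < 100)   { return 20 + 3 * (i - 10); }     // values 10..99
  if (i < 1000)  { return 290 + 4 * (i - 100); }   // values 100..999
  if (i < 10000) { return 3890 + 5 * (i - 1000); } // values 1000..9999
  return 48890 + 6 * (i - 10000);                  // values 10000..65535
}

The constants are just the cumulative byte counts of the 2-, 3-, 4- and 5-digit ranges, so a handful of comparisons can replace the offset table.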
However, it takes some effort and unless you are stressed for memory, it is likely better to precompute the offsets as they require only 256 KB. I wrote a benchmark to see how long it takes to precompute the data. On my M2 MacBook, I get the following numbers:

array of std::string: 0.57 ms
one big string + offsets: 0.26 ms

So constructing just one string might be twice as fast. What about query time? Retrieving a reference to an std::string instance is trivial in the case of the array of std::string instances. For the big string case, we return a computed std::string_view instance.

auto GetCodeFast = [&fast_table, &offsets](uint16_t index) {
  return std::string_view(&fast_table[offsets[index]],
                          offsets[index + 1] - offsets[index]);
};

Though it looks like a fair amount of work, I find that processing 65536 randomly generated values is significantly faster with the one-big-string approach.

array of std::string (query): 0.12 ms
one big string + offsets (query): 0.08 ms

There are other ways to solve this problem. For example, instead of precomputing or computing offsets, we can use a fixed number of bytes per element (say 6 or 8 bytes) and either store the number of digits or compute it on the fly. You can also try to use even less memory by not storing a string like '66,' multiple times. All these variations have different trade-offs and which is best depends on your application. Generating many non-trivial objects is a performance anti-pattern. Put all your data together when you can.

Posted on September 9, 2024 (updated September 11, 2024). Author: Daniel Lemire. 12 Comments on Replace strings by views when you can

Compressing floating-point numbers quickly by converting them to brain floats

We sometimes have to work with a large quantity of floating-point numbers. This volume can be detrimental to performance. Thus we often want to compress these numbers. Large-language models routinely do so. A sensible approach is to convert them to brain floating-point numbers. These are 16-bit numbers that are often capable of representing accurately a wide range of numbers. Compared to the common 64-bit floating-point numbers, it is a net saving of 4×. You can do quite a bit better than a 4× factor by using statistical analysis over your dataset. However, brain floats have the benefit of being standard and they are supported at the CPU level.

If you have a recent AMD processor (Zen 4 or better) or a recent Intel server processor (Sapphire Rapids), you have fast instructions to convert 32-bit floating-point numbers to and from 16-bit brain float numbers. Specifically, you have access to Single Instruction, Multiple Data (SIMD) instructions that can convert several numbers at once (with one instruction). We have had instructions to go from 64-bit numbers to 32-bit numbers and back for some time. So we can go from 64-bit to 32-bit and then to 16-bit, and similarly in reverse. Using Intel intrinsic functions, in C, you might compress 8 numbers with this code:

// Load 8 double-precision floats
__m512d src_vec = _mm512_loadu_pd(&src[i]);
// Convert to 16-bit brain floats with rounding
__m128bh dst_vec = _mm256_cvtneps_pbh(
    _mm512_cvt_roundpd_ps(src_vec, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
// Store the result
_mm_storeu_si128((__m128i*)&dst[i], *(__m128i*)&dst_vec);

Going the other way and converting eight 16-bit numbers to eight 64-bit numbers is similar:

// Load 8 brain floats
__m128i src_vec = _mm_loadu_si128((__m128i*)&src[i]);
// Convert to double-precision floats
__m512d dst_vec = _mm512_cvtps_pd(_mm256_cvtpbh_ps(*(__m128bh*)&src_vec));
// Store the result
_mm512_storeu_pd(&dst[i], dst_vec);
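If your processor lacks these instructions, the conversion is easy to express in scalar code. Here is a minimal sketch (mine, not from the post) that keeps only the top 16 bits of a 32-bit float, with round-to-nearest-even and ignoring NaN edge cases:

#include <cstdint>
#include <cstring>

// A brain float is the upper half of a 32-bit float: same exponent, truncated mantissa.
uint16_t to_bfloat16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits += 0x7fff + ((bits >> 16) & 1); // round to nearest, ties to even
  return uint16_t(bits >> 16);
}

float from_bfloat16(uint16_t b) {
  uint32_t bits = uint32_t(b) << 16;
  float x;
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}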
How precise is it? I tried it on a geojson file representing the border of Canada: it is a collection of coordinates. The worst absolute error happens when the number -135.500305 is approximated by -136. The brain float format will not distinguish between the numbers -65.613617 and -66.282776, they both get represented as -66.5. Whether that's acceptable depends on your application. Is it fast? I wrote a benchmark where I convert all the numbers in the coordinates of the Canadian border. I use a Zen 4 processor (AMD EPYC 9R14 @ 2.6GHz) with GCC 13.

compression: 16 GB/s (2 billion floats/s)
decompression: 7.4 GB/s (0.9 billion floats/s)

It is likely that better code can get even better performance, but it is already quite fast.

Posted on September 2, 2024 (updated September 6, 2024). Author: Daniel Lemire. 5 Comments on Compressing floating-point numbers quickly by converting them to brain floats

Parsing tiny and very large floating-point values: a programming-language comparison

Most programming languages support floating-point numbers. You typically have the ability to turn a string into a floating-point number. E.g., "3.1416" could be parsed as a number close to pi. However, some strings cannot be represented exactly or at all. For example, "1e-1000" is too small and "1e1000" is too large for even 64-bit floating-point types. Most languages represent these strings as the number zero and the value 'infinity'. Let us consider Python as an example:

>>> float("1e-1000")
0.0
>>> float("1e1000")
inf

The Go language gives the same result, with the caveat that 1e1000 triggers an error (which you can ignore). Consider the following Go code:

package main

import (
	"fmt"
	"strconv"
)

func main() {
	f, err := strconv.ParseFloat("1e-1000", 64)
	fmt.Println(f, err)
	f, err = strconv.ParseFloat("1e1000", 64)
	fmt.Println(f, err)
}

It prints out:

0 <nil>
+Inf strconv.ParseFloat: parsing "1e1000": value out of range

The C language also gives you 0 and infinity, but both are considered out of range. Let us consider the following C code…

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  const char *p = "1e-1000 1e1000";
  printf("Parsing '%s':\n", p);
  char *end;
  for (double f = strtod(p, &end); p != end; f = strtod(p, &end)) {
    printf("'%.*s' -> ", (int)(end - p), p);
    p = end;
    if (errno == ERANGE) {
      printf("range error, got ");
      errno = 0;
    }
    printf("%f\n", f);
  }
}

It prints out:

Parsing '1e-1000 1e1000':
'1e-1000' -> range error, got 0.000000
' 1e1000' -> range error, got inf

What about C++? Let us consider the following code.

#include <charconv>
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
  for (std::string str : {"1e-1000", "1e1000"}) {
    double value = -1;
    printf("parsing %s\n", str.c_str());
    auto r = std::from_chars(str.data(), str.data() + str.size(), value);
    if (r.ec == std::errc::result_out_of_range) {
      printf("out of range ");
    }
    printf("%f\n", value);
  }
  return EXIT_SUCCESS;
}

What does this output? The answer is: "it depends". Under LLVM/libc++, the code does not build because it is still lacking support for floating-point parsing. (You are expected to use the C function strtod.) Under Visual Studio, you get the same result as C:

parsing 1e-1000
out of range 0.000000
parsing 1e1000
out of range inf

GCC/libstdc++ relies on the fast_float library which follows the same behavior. However, the GCC folks added a special-case handling, and they discard the parsed value before returning, thus you cannot distinguish 1e1000 and 1e-1000 when parsing strings with GCC/libstdc++: both are unknown values 'out of range'. You get the following with GCC:

parsing 1e-1000
out of range -1.000000
parsing 1e1000
out of range -1.000000

The -1 value is just the bogus value that I had used to initialize the variable. The C++ language architects are aware of this issue; search for "Floating point from_chars API does not distinguish between overflow and underflow". Interestingly, both GCC/libstdc++ and Microsoft Visual Studio will happily return infinity when parsing the string "inf" and not trigger an out-of-range error.

Posted on August 26, 2024. Author: Daniel Lemire. 3 Comments on Parsing tiny and very large floating-point values: a programming-language comparison

Faster random integer generation with batching

We often generate random integers. Quite often these numbers must be within an interval: e.g., an integer between 0 and 100. One application is a random shuffle. A standard algorithm for a fair random shuffle is the Knuth algorithm:

void shuffle(mytype *storage, uint64_t size) {
  for (uint64_t i = size; i > 1; i--) {
    uint64_t nextpos = random(i); // random value in [0,i)
    std::swap(storage[i - 1], storage[nextpos]);
  }
}

It is how the C++ function std::shuffle is implemented, as well as any similar standard shuffle function. Notice how it requires you to generate random integers in ever-changing intervals. Meanwhile, most modern (pseudo-)random integer functions generate 64-bit integers. Indeed, our processors are 64-bit machines running on 64-bit operating systems. Thus we must convert these 64-bit integers to whatever else we need, whether they are integers in an interval or a floating-point number. You might be tempted to use a modulo operation (random() % 100).
While it generates a random integer in the interval [0, 100), it also introduces a statistical bias. Further, if the value 100 is not known at compile-time (i.e., it is a changing value), it may compile down to a division instruction, which is relatively expensive. There is a better approach that requires no division most of the time and is used by many programming language standard libraries:

uint64_t random_bounded(uint64_t range) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range;
  leftover = (uint64_t)multiresult;
  if (leftover < range) {
    threshold = -range % range;
    while (leftover < threshold) {
      random64bit = rng();
      multiresult = random64bit * range;
      leftover = (uint64_t)multiresult;
    }
  }
  return (uint64_t)(multiresult >> 64); // [0, range)
}

If you are using GCC for C++ programming under Linux (libstdc++), you are already using this trick. If you are a macOS user with LLVM and libc++, you are probably relying on a slower approach. Is that the best we can do? Think about the scenario where you must generate an integer in the interval [0,100) and then an integer in the interval [0,99). Could you generate both such integers from a single 64-bit random integer? Indeed, you can, and you can avoid the division instructions once more. The code is somewhat similar although a bit more complicated. In theory, the algorithm needs to compute the product of the two ranges, but we only compute it if needed.

// product_bound can be any integer >= range1*range2
// it may be updated to become range1*range2
std::pair<uint64_t, uint64_t> random_bounded_2(uint64_t range1, uint64_t range2,
                                               uint64_t &product_bound) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range1;
  leftover = (uint64_t)multiresult;
  uint64_t result1 = (uint64_t)(multiresult >> 64); // [0, range1)
  multiresult = leftover * range2;
  leftover = (uint64_t)multiresult;
  uint64_t result2 = (uint64_t)(multiresult >> 64); // [0, range2)
  if (leftover < product_bound) {
    product_bound = range2 * range1;
    if (leftover < product_bound) {
      threshold = -product_bound % product_bound;
      while (leftover < threshold) {
        random64bit = rng();
        multiresult = random64bit * range1;
        leftover = (uint64_t)multiresult;
        result1 = (uint64_t)(multiresult >> 64); // [0, range1)
        multiresult = leftover * range2;
        leftover = (uint64_t)multiresult;
        result2 = (uint64_t)(multiresult >> 64); // [0, range2)
      }
    }
  }
  return std::make_pair(result1, result2);
}

It has one limitation: the product of the two ranges must fit in a 64-bit word. Still, it can be used to shuffle arrays, as long as they do not exceed roughly 1 billion elements:

void shuffle_2(mytype *storage, uint64_t size) {
  uint64_t i = size;
  for (; i > 1 << 30; i--) {
    uint64_t index = random_bounded(i); // index is in [0, i-1]
    std::swap(storage[i - 1], storage[index]);
  }
  // Batches of 2 for sizes up to 2^30 elements
  uint64_t product_bound = i * (i - 1);
  for (; i > 1; i -= 2) {
    auto [index1, index2] = random_bounded_2(i, i - 1, product_bound);
    // index1 is in [0, i-1]
    // index2 is in [0, i-2]
    std::swap(storage[i - 1], storage[index1]);
    std::swap(storage[i - 2], storage[index2]);
  }
}

Of course, you can extend this trick to 3, 4, 5, 6… elements, but it gets more complicated. Is it fast? Let us test with a standard random number generator in C++ (std::mt19937_64). I wrote a C++ benchmark.
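For instance, a minimal benchmark driver might look like the following sketch (mine, not the author's benchmark), using the templated batched_random::shuffle_2 from the appendix at the end of this post:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>
// assumes batched_random::shuffle_2 from the appendix below is available

int main() {
  std::vector<uint64_t> data(1000000);
  for (uint64_t i = 0; i < data.size(); i++) { data[i] = i; }
  std::mt19937_64 rng(1234);
  auto start = std::chrono::steady_clock::now();
  batched_random::shuffle_2(data.begin(), data.end(), rng);
  auto finish = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(finish - start).count();
  printf("%f ns/element\n", ns / data.size());
  return 0;
}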
Let us first consider an M2 MacBook with Apple/LLVM 15:

number of elements: standard shuffle (ns/element) / batched shuffle (ns/element)
10000: 12.4 / 2.4
1000000: 12.0 / 2.5

So we are four to five times faster than the standard library. Not bad. What about Linux with an Intel processor and GCC 12? Then our batched approach is about 30% faster than the standard library. Notice how the Linux/GCC base results are much better: that's because it is already using a well-optimized routine. Still: 30% is not a bad gain.

number of elements: standard shuffle (ns/element) / batched shuffle (ns/element)
10000: 2.2 / 1.7
1000000: 3.5 / 2.7

Of course, the batched generation of random numbers applies to other important operations. Shuffling is just the most obvious one. This is joint work with Nevin Brackett-Rozinsky.

References:
Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience (to appear)
Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, Volume 29 Issue 1, February 2019

Appendix: Code

#ifndef TEMPLATE_SHUFFLE_H
#define TEMPLATE_SHUFFLE_H
#include <cstdint>
#include <utility>

/**
 * Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation,
 * Software: Practice and Experience (to appear)
 * Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on
 * Modeling and Computer Simulation, Volume 29 Issue 1, February 2019
 */
namespace batched_random {

template <class URBG>
uint64_t random_bounded(uint64_t range, URBG &&rng) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng();
  multiresult = random64bit * range;
  leftover = (uint64_t)multiresult;
  if (leftover < range) {
    threshold = -range % range;
    while (leftover < threshold) {
      random64bit = rng();
      multiresult = random64bit * range;
      leftover = (uint64_t)multiresult;
    }
  }
  return (uint64_t)(multiresult >> 64); // [0, range)
}

// product_bound can be any integer >= range1*range2
// it may be updated to become range1*range2
template <class URBG>
std::pair<uint64_t, uint64_t> random_bounded_2(uint64_t range1, uint64_t range2,
                                               uint64_t &product_bound, URBG &&rng) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng();
  multiresult = random64bit * range1;
  leftover = (uint64_t)multiresult;
  uint64_t result1 = (uint64_t)(multiresult >> 64); // [0, range1)
  multiresult = leftover * range2;
  leftover = (uint64_t)multiresult;
  uint64_t result2 = (uint64_t)(multiresult >> 64); // [0, range2)
  if (leftover < product_bound) {
    product_bound = range2 * range1;
    if (leftover < product_bound) {
      threshold = -product_bound % product_bound;
      while (leftover < threshold) {
        random64bit = rng();
        multiresult = random64bit * range1;
        leftover = (uint64_t)multiresult;
        result1 = (uint64_t)(multiresult >> 64); // [0, range1)
        multiresult = leftover * range2;
        leftover = (uint64_t)multiresult;
        result2 = (uint64_t)(multiresult >> 64); // [0, range2)
      }
    }
  }
  return std::make_pair(result1, result2);
}

// This is a template function that shuffles the elements in the range [first, last).
//
// It is similar to std::shuffle, but it uses a different algorithm.
template <class RandomIt, class URBG>
extern void shuffle_2(RandomIt first, RandomIt last, URBG &&g) {
  uint64_t i = std::distance(first, last);
  for (; i > 1 << 30; i--) {
    uint64_t index = random_bounded(i, g); // index is in [0, i-1]
    std::iter_swap(first + i - 1, first + index);
  }
  // Batches of 2 for sizes up to 2^30 elements
  uint64_t product_bound = i * (i - 1);
  for (; i > 1; i -= 2) {
    auto [index1, index2] = random_bounded_2(i, i - 1, product_bound, g);
    // index1 is in [0, i-1]
    // index2 is in [0, i-2]
    std::iter_swap(first + i - 1, first + index1);
    std::iter_swap(first + i - 2, first + index2);
  }
}

} // namespace batched_random
#endif // TEMPLATE_SHUFFLE_H

Posted on August 17, 2024 (updated August 23, 2024). Author: Daniel Lemire. 15 Comments on Faster random integer generation with batching

Reflection-based JSON in C++ at Gigabytes per Second

JSON (JavaScript Object Notation) is a popular format for storing and transmitting data. It uses human-readable text to represent structured data in the form of attribute–value pairs and arrays. E.g., {"age":5, "name":"Daniel", "toys":["wooden dog", "little car"]}. Ingesting and producing JSON documents can be a performance bottleneck. Thankfully, a few JSON parsers such as simdjson have shown that we can process JSON at high speeds, reaching gigabytes per second. However, producing and ingesting JSON data can remain a chore in C++. The programmer often needs to address potential errors such as unexpected content. Yet, often, the programmer only needs to map the content to and from a native C/C++ data structure. E.g., our JSON example might correspond to the following C++ structure:

struct kid {
  int age;
  std::string name;
  std::vector<std::string> toys;
};

Hence, we would often like to write simple code such as

kid k; store_json(k, out);

or

kid k; load_json(k, in);

Ideally, we would like this code to be fast and safe. That is, we would like to write and read our data structures at gigabytes per second. Further, we want proper validation, especially since the JSON data might come from the Internet. A library such as simdjson already allows elegant deserialization, e.g., we can make the following example work:

struct kid {
  int age;
  std::string name;
  std::vector<std::string> toys;
};

void demo() {
  auto json_str = R"({"age": 12, "name": "John", "toys": ["car", "ball"]})"_padded;
  simdjson::ondemand::parser parser;
  auto doc = parser.iterate(json_str);
  kid k = doc.get<kid>();
}

However, for the time being, it requires a bit of effort on the part of the programmer, as they need to write a custom function. Similarly, going from the C++ structures to JSON can be made convenient, but typically only after the programmer has done a bit of work, writing custom code. E.g., in the popular JSON for Modern C++ (nlohmann/json) library, we need to provide glue functions:

void to_json(json& j, const person& p) {
  j = json{{"name", p.name}, {"address", p.address}, {"age", p.age}};
}
void from_json(const json& j, person& p) {
  j.at("name").get_to(p.name);
  j.at("address").get_to(p.address);
  j.at("age").get_to(p.age);
}
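With such glue functions in place, the conversions become one-liners. Here is a small usage sketch (mine, not from the post), assuming a person struct with name, address and age fields:

#include <nlohmann/json.hpp>
#include <string>
using json = nlohmann::json;

struct person {
  std::string name;
  std::string address;
  int age;
};

void to_json(json& j, const person& p) {
  j = json{{"name", p.name}, {"address", p.address}, {"age", p.age}};
}
void from_json(const json& j, person& p) {
  j.at("name").get_to(p.name);
  j.at("address").get_to(p.address);
  j.at("age").get_to(p.age);
}

void example() {
  person p{"Daniel", "Montreal", 42};
  json j = p;                  // calls to_json
  person q = j.get<person>();  // calls from_json
  std::string text = j.dump(); // serialize to a string
}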
We know that this extra effort (writing data-structure-aware code) is unnecessary because other programming languages (Java, C#, Zig, Rust, Python, etc.) allow us to 'magically' serialize and deserialize with almost no specialized code. One key missing feature in C++ is sufficiently powerful reflection. Reflection in programming languages refers to a mechanism that allows code to introspect its own structure. That is, you cannot currently tell C++, in a convenient manner, that if a class has a std::string name attribute, then it should automatically search for a string value matching the key name in the JSON.

Thankfully, C++ is soon getting reflection in C++26. And it is getting powerful reflection: reflective metaprogramming. That is, future versions of C++ will allow us to solve the serialization and deserialization problem we described at compile time: a software library can automate the production and consumption of JSON to and from a native data structure. Such a feature will simplify the life of the C++ programmer. Furthermore, because it can be done automatically as part of a well-tested library at compile time, it can generate fast and safe code. Herb Sutter wrote about this exciting development:

"This is huge, because reflection (including generation) will be by far the most impactful feature C++ has ever added since C++98, and it will dominate the next decade and more of C++ usage."

Though many programming languages have reflection, the reflective metaprogramming provided by the upcoming C++ standard is unique to our knowledge. Some programming languages have long had powerful reflection capabilities, but they are mostly a runtime feature: they do not allow the equivalent of generating code. Other programming languages allow compile-time logic generation but they do not always allow optimal performance. We expect that the C++ approach might be especially appropriate for high performance. Though current C++ compilers do not yet support reflection, we have access to prototypical compilers. In particular, engineers from Bloomberg maintain a fork of LLVM, the popular compiler framework. Though such an implementation should not be used in production and is subject to bugs and other limitations, it should be sufficient to test the performance: how fast can reflection-based JSON processing be in C++? Importantly, we are not immediately concerned with compilation speed: the Bloomberg fork is deliberately suboptimal in this respect.

To test out this new C++ feature, we wrote a prototype that can serialize and deserialize data structures to and from JSON strings automatically. Given an instance of kid, we can convert it to a JSON string without any effort, the compile-time reflection does the work:

kid k{12, "John", {"car", "ball"}};
std::print("My JSON is {}\n", simdjson::json_builder::to_json_string(k));

Benchmark Results with Speed-up

Automatically converting your C++ data types into JSON strings is convenient. But is it fast? To find out, we used two instances for benchmarking: an "Artificial instance" that was obtained using the JSON Generator website, and the twitter.json file that is used by many other serialization libraries for benchmarking. We test on two systems (Apple and Intel/Linux) with the experimental LLVM Bloomberg compiler.

M3 MAX MacBook Pro:
Twitter, serialization: nlohmann/json 100 MB/s; our C++26 serializer 1900 MB/s (19× speedup)
Artificial, serialization: nlohmann/json 50 MB/s; our C++26 serializer 1800 MB/s (36× speedup)

Intel Ice Lake:
Twitter, serialization: nlohmann/json 110 MB/s; our C++26 serializer 1800 MB/s (16× speedup)
Artificial, serialization: nlohmann/json 50 MB/s; our C++26 serializer 1400 MB/s (28× speedup)

Our benchmark showed that we are roughly 20× faster than nlohmann/json. This was achieved by combining the new reflection capabilities with some bit-twiddling tricks and a string_builder class to help minimize the overhead of memory allocations.
Importantly, our implementation requires little code. The C++26 reflection capabilities do all the hard work. In our benchmarks, we further compare against hand-written functions as well as an existing C++ reflection library (reflect-cpp). Hand-written code can be slightly faster than our reflection-based code, but the difference is small (10% to 20%). We can be twice as fast as reflect-cpp, although it is also a very fast library.

Conclusion

Our results suggest that C++26 with reflection is a powerful tool that should allow C++ programmers to easily serialize and deserialize data structures at gigabytes per second. In the near future, the simdjson library will adopt reflection for both serialization and deserialization. Indeed, we expect that this will help users get high performance in some important cases with little effort.

Credit

Joint work with Francisco Geiman Thiesen. We are grateful to @the-moisrex for teaching us about tag dispatching and initiating its adoption in the simdjson library. The simdjson library is a community-based effort.

Appendix: complete code example

struct kid {
  int age;
  std::string name;
  std::vector<std::string> toys;
};

/**
 * The following function would print:
 *
 * I am 12 years old
 * I have a car
 * I have a ball
 * My name is John
 * My JSON is {"age":12,"name":"John","toys":["car","ball"]}
 *
 */
void demo() {
  auto json_str = R"({"age": 12, "name": "John", "toys": ["car", "ball"]})"_padded;
  simdjson::ondemand::parser parser;
  auto doc = parser.iterate(json_str);
  kid k = doc.get<kid>();
  std::print("I am {} years old\n", k.age);
  for (const auto &toy : k.toys) {
    std::print("I have a {}\n", toy);
  }
  std::print("My name is {}\n", k.name);
  std::print("My JSON is {}\n", simdjson::json_builder::to_json_string(k));
}

References:
Langdale, Geoff, and Daniel Lemire. Parsing gigabytes of JSON per second, The VLDB Journal 28.6 (2019): 941-960.
Keiser, John, and Daniel Lemire. On-demand JSON: A better way to parse documents?, Software: Practice and Experience 54.6 (2024): 1074-1086.
Wyatt Childers, Peter Dimov, Dan Katz, Barry Revzin, Andrew Sutton, Faisal Vali, Daveed Vandevoorde, Reflection for C++26, WG21 P2996R5, 2024-08-14

Posted on August 13, 2024 (updated September 23, 2024). Author: Daniel Lemire. 2 Comments on Reflection-based JSON in C++ at Gigabytes per Second

Converting ASCII strings to lower case at crazy speeds with AVX-512

AMD Zen 4 and Zen 5, as well as recent server-side Intel processors, support an advanced set of instructions called AVX-512. They are powerful SIMD (Single Instruction, Multiple Data) instructions. Importantly, they allow 'masked' operations. That is, you can compute a mask and only do an operation on bytes indicated by the mask. Thus you can easily store only the first k bytes of a block of 64 bytes of memory with one instruction. Tony Finch recently described how you can take an ASCII string of arbitrary length and convert it to lower case quickly using AVX-512. Finch's result is that for both tiny and large strings, the AVX-512 approach is faster. In his work, Finch assumes that the length of the string is known up front. However, C strings are stored as a pointer to the beginning of the string with a null character (\0) indicating its end. Thus the string love is stored in memory as love\0. Can we extend his work to C strings? With AVX-512, you can load 64 bytes at a time, instead of loading individual bytes. In general, it is unsafe to read beyond the scope of allocated memory.
It may crash your application if you are loading from a memory page that does not belong to your process. How do you know when to stop reading blocks of 64 bytes? The trick is that it is always safe to do aligned loads. That is, if you load at an address that is divisible by 64, you will never cross a memory page boundary, because memory page sizes and addresses are always multiples of 64 on Intel and AMD systems.

To convert ASCII letters to lower case, we use the fact that the letters from A to Z form a continuous range of code point values (values stored in memory) in ASCII, and so do the letters from a to z. Thus if you can identify the upper-case letters, it suffices to add a constant to them to make them lower case. Finch wrote a function which converts 64 ASCII bytes to lower case once a block of 64 bytes (c) has been loaded:

static inline __m512i tolower64(__m512i c) {
  __m512i A = _mm512_set1_epi8('A');
  __m512i Z = _mm512_set1_epi8('Z');
  __m512i to_lower = _mm512_set1_epi8('a' - 'A');
  __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
  __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);
  __mmask64 is_upper = _kand_mask64(ge_A, le_Z);
  return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
}

This function efficiently converts a 64-byte block of characters (represented as a __m512i vector) to lowercase using SIMD instructions. The variables A and Z are vectors filled with the characters 'A' and 'Z' respectively. The variable to_lower contains the difference between 'a' and 'A', which is 32. The variable ge_A is a mask where bits are set to 1 if the corresponding element is greater than or equal to 'A'. The variable le_Z is a mask where bits are set to 1 if the corresponding element is less than or equal to 'Z'. The variable is_upper combines the two masks to identify characters that are both greater than or equal to 'A' and less than or equal to 'Z', indicating uppercase letters. In the final step, we add the value to_lower only for the values identified by the mask is_upper. This effectively converts uppercase letters to lowercase. LLVM might compile it to three instructions: vpaddb, vpcmpltub and vpaddb. Depending on the compiler, you might get better results with this equivalent alternative:

__m512i tolower64(__m512i c) {
  __m512i ca = _mm512_sub_epi8(c, _mm512_set1_epi8('A'));
  __mmask64 is_upper = _mm512_cmple_epu8_mask(ca, _mm512_set1_epi8('Z' - 'A'));
  __m512i to_lower = _mm512_set1_epi8('a' - 'A');
  return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
}

You may even use a generalized version based on table lookups:

__m512i tolower64(__m512i c) {
  __mmask64 le_7f = _mm512_cmple_epu8_mask(c, _mm512_set1_epi8(0x7f));
  __m512i byteconst_00_3f = _mm512_set_epi64(
      0x3f3e3d3c3b3a3938, 0x3736353433323130, 0x2f2e2d2c2b2a2928, 0x2726252423222120,
      0x1f1e1d1c1b1a1918, 0x1716151413121110, 0x0f0e0d0c0b0a0908, 0x0706050403020100);
  __m512i byteconst_40_7f = _mm512_set_epi64(
      0x7f7e7d7c7b7a7978, 0x7776757473727170, 0x6f6e6d6c6b6a6968, 0x6766656463626160,
      0x5f5e5d5c5b7a7978, 0x7776757473727170, 0x6f6e6d6c6b6a6968, 0x6766656463626140);
  return _mm512_mask2_permutex2var_epi8(byteconst_00_3f, c, le_7f, byteconst_40_7f);
}

Of course, we still need to use this function to process an actual string, not a block of 64 bytes. Let us first consider a naive function that does the same task, character by character:

size_t lower(char *srcorig) {
  char *p = srcorig;
  for (; *p; ++p) {
    *p = *p > 0x40 && *p < 0x5b ? *p | 0x20 : *p;
  }
  return p - srcorig;
}
This function uses the fact that instead of an addition, we can just do a bitwise OR to change the case of an ASCII letter. In this particular case, we do not null-terminate the result, but we return the length of the string. Let us now consider a possible AVX-512 implementation.

size_t lower64(const char *srcorig, char *dstorig) {
  uintptr_t address = reinterpret_cast<uintptr_t>(srcorig);
  uintptr_t aligned_address = address / 64 * 64;     // round down
  uintptr_t notincluded = address - aligned_address; // [0,64)
  const char *src;
  if (notincluded) {
    src = reinterpret_cast<const char *>(aligned_address);
    __mmask64 init_mask = _cvtu64_mask64((~UINT64_C(0)) << notincluded);
    __m512i src_v = _mm512_maskz_loadu_epi8(init_mask, src);
    __mmask64 is_zero = _mm512_mask_cmpeq_epu8_mask(init_mask, src_v, _mm512_setzero_si512());
    __m512i dst_v = tolower64(src_v);
    if (is_zero) {
      __mmask64 zero_mask = (is_zero - 1) ^ is_zero;
      _mm512_mask_storeu_epi8(dstorig - notincluded, zero_mask & init_mask, dst_v);
      return __tzcnt_u64(is_zero) + (src - srcorig);
    }
    _mm512_mask_storeu_epi8(dstorig - notincluded, init_mask, dst_v);
    src += 64;
    dstorig += 64 - notincluded;
  } else { // fast path
    src = reinterpret_cast<const char *>(srcorig);
    __m512i src_v = _mm512_loadu_epi8(src);
    __mmask64 is_zero = _mm512_cmpeq_epu8_mask(src_v, _mm512_setzero_si512());
    __m512i dst_v = tolower64(src_v);
    if (is_zero) {
      __mmask64 zero_mask = (is_zero - 1) ^ is_zero;
      _mm512_mask_storeu_epi8(dstorig, zero_mask, dst_v);
      return __tzcnt_u64(is_zero);
    }
    _mm512_storeu_epi8(dstorig, dst_v);
    src += 64;
    dstorig += 64;
  }
  while (true) {
    __m512i src_v = _mm512_loadu_epi8(src);
    __m512i dst_v = tolower64(src_v);
    __mmask64 is_zero = _mm512_cmpeq_epu8_mask(src_v, _mm512_setzero_si512());
    if (is_zero) {
      __mmask64 zero_mask = (is_zero - 1) ^ is_zero;
      _mm512_mask_storeu_epi8(dstorig, zero_mask, dst_v);
      return __tzcnt_u64(is_zero) + (src - srcorig);
    }
    _mm512_storeu_epi8(dstorig, dst_v);
    src += 64;
    dstorig += 64;
  }
}

The code converts a string of characters to lowercase using AVX-512 instructions. It works in 64-byte chunks for efficiency. We have two pointers as parameters: srcorig is a pointer to the original source string, dstorig is a pointer to the destination buffer for the lowercase string. Initially, we calculate the alignment offset of srcorig to a 64-byte boundary. We initialize pointers and masks based on the alignment offset. We have a fast path for the case where the string is already aligned on a 64-byte boundary. Initially, we load a 64-byte chunk into a __m512i vector, possibly reading prior to the beginning of the string. We convert the chunk to lowercase using tolower64. We also check whether an element is null; if that is the case, we store and return a string of length smaller than 64 bytes. In the main loop, we process 64-byte chunks until a null character is encountered. That is, we load a 64-byte chunk into an __m512i vector and we convert the chunk to lowercase using tolower64. We check if the loaded chunk contains a null character and end the process if that is the case, calculating and returning the number of processed characters. If not, we store the converted chunk to the destination buffer. The gotcha with this approach is that you will read before the beginning of the string if it is not already aligned on a 64-byte boundary and some tools might warn you. However, the code remains safe. You just have to tell your tool that the warnings should be omitted. How fast is the AVX-512 code?
I am using an Intel Ice Lake processor and LLVM 16. In my benchmark, I use fixed strings of various sizes. My benchmark repeatedly processes the same string, which omits the branch mispredictions that would occur in practice, so the real speed might be lower. I report the speed in GB/s.

string length N: naive (GB/s) / AVX-512 (GB/s)
4: 0.9 / 1.6
18: 0.9 / 4
145: 1.3 / 20
970: 1.3 / 34

Thus, as you can see, the AVX-512 approach can be 20 times faster than the conventional approach on small strings while remaining competitive on tiny strings. To my knowledge, only the AVX-512 instruction set allows this magical performance. It is a significant advantage for recent AMD and Intel processors. Sadly, Intel no longer includes AVX-512 in its non-server processors. Credit: I chatted with Robert Clausecker about these issues about a year ago.

Posted on August 3, 2024 (updated August 9, 2024). Author: Daniel Lemire. 13 Comments on Converting ASCII strings to lower case at crazy speeds with AVX-512

Evolution of iPhone storage capacity

People who should know better often underestimate how fast our storage capacity has grown. We have been able to get 1 TB of storage on iPhones for the last three generations.

2010 iPhone 4: 32 GB
2012 iPhone 5: 64 GB
2014 iPhone 6: 128 GB
2016 iPhone 7: 256 GB
2018 iPhone XS: 512 GB
2019 iPhone 11 Pro: 512 GB
2020 iPhone 12 Pro: 512 GB
2021 iPhone 13 Pro: 1 TB
2022 iPhone 14 Pro: 1 TB
2023 iPhone 15 Pro: 1 TB

Further reading: In 2011, I predicted that the iPhone would have 1 TB of storage in 2020.

Posted on July 28, 2024. Author: Daniel Lemire. Leave a comment on Evolution of iPhone storage capacity

Storage costs are plummeting

Storage costs are plummeting like a skydiver in freefall: between 10 and 100 times cheaper with each passing decade. Meanwhile, the programmer population is growing at a leisurely pace, like a tortoise in a marathon, increasing by about 50% per decade. And the Linux kernel? It is maybe doubling in size every ten years. The net result: we increasingly fill our storage with data (videos, images, model weights) while code is taking a backseat, fading into the background. We just cannot code fast enough to fill our increasingly large disks. Source: Our World in Data.

Posted on July 27, 2024 (updated July 29, 2024). Author: Daniel Lemire. 2 Comments on Storage costs are plummeting

How big are your docker images?

Docker is a standard to deploy software on the cloud. Developers start with an existing image and add their own code before deploying their systems. How big are typical uncompressed images?

python:alpine (latest, aarch64): 58 MiB
chainguard/bun (latest, aarch64): 90 MiB
node:alpine (latest, aarch64): 141 MiB
golang:alpine (latest, aarch64): 219 MiB

Method: docker inspect -f "{{ .Size }}" docker.io/library/myimage

Note: When you use Go, the resulting binary can be put into a new tiny image. I have used alpine images in this table because they are smaller, but many developers report problems with alpine images.

Posted on July 27, 2024 (updated August 7, 2024). Author: Daniel Lemire. 2 Comments on How big are your docker images?

How much of your binary executable is just ASCII text?

We sometimes use binary executables which can span megabytes. I wondered: how much text is contained in these binary files? To find out, I wrote a Python script which adds up the size of all sequences of at least 16 ASCII characters in the file. My heuristic is simple but is not quite perfect: some long sequences might not be actual text and some short ASCII strings might be missed. Nevertheless, it should be good enough to get some insight.
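The original script was written in Python; a rough C++ equivalent of the same heuristic might look like this sketch (mine, not the author's script):

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main(int argc, char **argv) {
  if (argc < 2) { std::cerr << "usage: asciiscan <binary>\n"; return 1; }
  std::ifstream in(argv[1], std::ios::binary);
  std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());
  size_t total = 0, run = 0;
  for (unsigned char c : bytes) {
    // treat printable ASCII, newlines and tabs as text characters
    bool printable = (c >= 0x20 && c < 0x7f) || c == '\n' || c == '\t';
    if (printable) { run++; } else { if (run >= 16) { total += run; } run = 0; }
  }
  if (run >= 16) { total += run; }
  std::cout << total << " bytes in ASCII runs of at least 16 characters\n";
  return 0;
}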
I downloaded macOS binaries for some popular JavaScript runtimes. I find that tens of megabytes are used for what is likely ASCII strings. About a third of the Node 22 binary is ASCII text. It is likely that much of this content is made of debug symbols and unminified JavaScript code.

Node 22: 33 MB
Node 20: 24 MB
Node 18: 18 MB
Deno 1.32: 22 MB
Bun 1.1: 5 MB

An alternative method is to use the strings command as in strings node | wc. It is a different metric but gives similar numbers.

Posted on July 27, 2024. Author: Daniel Lemire. 1 Comment on How much of your binary executable is just ASCII text?

Safer code in C++ with lifetime bounds

For better performance in software, we avoid unnecessary copies. To do so, we introduce references (or pointers). An example of this idea in C++ is the std::string_view class. As the name suggests, a std::string_view instance is merely a 'view': it points at some string, but it does not own or otherwise manage the underlying memory. The downside is that we must track ownership: when the owner of the memory is gone, we should not be left holding the std::string_view instance. With modern tools, it is trivial to detect such bugs (e.g., using a sanitizer). However, it would be nicer if the compiler could tell us right away. A few C++ compilers (Visual Studio and LLVM) support lifetime-bound annotations to help us. Let us consider an example. Suppose you would like to parse a URL (e.g., 'https://www.google.com/path'), but you are only interested in the host (e.g., 'www.google.com'). You might write code like so using the ada-url/ada parsing library:

std::string_view my_get_host(std::string_view url_string) {
  auto url = ada::parse(url_string).value();
  return url.get_host();
}

This code is not generally safe. The parser will store the result of the parse inside a temporary object but you are returning an std::string_view which points at it. You have a dangling reference. To get a recent version of LLVM/clang (18+) to warn us, we just need to annotate the function get_host like so:

#ifndef __has_cpp_attribute
#define ada_lifetime_bound
#elif __has_cpp_attribute(msvc::lifetimebound)
#define ada_lifetime_bound [[msvc::lifetimebound]]
#elif __has_cpp_attribute(clang::lifetimebound)
#define ada_lifetime_bound [[clang::lifetimebound]]
#elif __has_cpp_attribute(lifetimebound)
#define ada_lifetime_bound [[lifetimebound]]
#else
#define ada_lifetime_bound
#endif
...
std::string_view get_host() const noexcept ada_lifetime_bound;

You can review the complete code update on GitHub. And then we get a warning at compile time:

fun.cpp:8:10: warning: address of stack memory associated with local variable 'url' returned [-Wreturn-stack-address]
8 |   return url.get_host();

It is hardly perfect at this point in time. It does not always warn you, but progress is being made. This feature and others will help us catch errors sooner. Credit: Thanks to Denis Yaroshevskiy for making me aware of this new compiler feature. Further reading: LLVM documentation.

Posted on July 26, 2024. Author: Daniel Lemire. 4 Comments on Safer code in C++ with lifetime bounds

Does C++ allow template specialization by concepts?

Recent versions of C++ (C++20) have a new feature: concepts. A concept in C++ is a named set of requirements that a type must satisfy. E.g., 'act like a string' or 'act like a number'. In C++, we have two closely related terms: traits and concepts. For example, std::is_floating_point is a type trait that checks if a type is a floating-point type.
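For instance, a tiny illustration (mine, not from the post) of the trait in use:

#include <type_traits>

static_assert(std::is_floating_point<double>::value, "double is a floating-point type");
static_assert(!std::is_floating_point<int>::value, "int is not a floating-point type");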
A concept in C++ is used to specify requirements for template parameters. So std::floating_point is the concept corresponding to the trait std::is_floating_point. When used in conjunction with templates, concepts can be quite nice. Let us look at an example. Suppose that you want to define a generic function 'clear' which behaves one way on strings and some other way on other types. You might use the following code:

template <typename T>
void clear(T & t);

template <typename T>
concept not_string = !std::is_same_v<T, std::string>;

template <>
void clear(std::string & t) { t.clear(); }

template <typename T>
void clear(T& container) requires not_string<T> {
  for (auto& i : container) {
    i = typename T::value_type{};
  }
}

We have a generic clear function that takes a reference to any type T as an argument. We define a concept not_string which applies to any type except std::string. We specialize for the std::string case:

template <> void clear(std::string & t) { t.clear(); }

It directly calls the clear() member function of the string to empty its contents. Next we define

template <typename T> void clear(T& container) requires not_string<T> { ... }

It is a generic implementation of clear for any type T that satisfies the not_string concept. It iterates over the container and sets each element to its default value. When you call clear with a std::string, the specialized version is used, directly calling std::string::clear(). For any other type that satisfies the not_string concept (i.e., is not a std::string), the generic implementation is used. It iterates over the container and sets each element to its default value. This assumes that T is a container-like type with a value_type and iterator support. There are easier ways to achieve this result, but it illustrates the idea.

Up until recently, I thought that the template with the 'requires' keyword was a template specialization, just like when you specialize the template for the std::string type. Unfortunately, it may be more complicated. Let us repeat exactly the same code, but we put the clear function in a class 'A':

struct A {
  template <typename T>
  void clear(T & t);
};

template <>
void A::clear(std::string & t) { t.clear(); }

template <typename T>
void A::clear(T& container) requires not_string<T> {
  for (auto& i : container) {
    i = typename T::value_type{};
  }
}

This time, your compiler might fail with the following or the equivalent:

out-of-line definition of 'clear' does not match any declaration in 'A'

You might fix the issue by adding the template with the constraint to the class definition.

struct A {
  template <typename T>
  void clear(T & t);
  template <typename T>
  void clear(T& container) requires not_string<T>;
};

I find this unpleasant and inconvenient because you must put in your class declaration all of the concepts you plan on supporting. And if someone wants to support another concept, they need to change your class declaration to add it.

Posted on July 22, 2024 (updated July 23, 2024). Author: Daniel Lemire. 1 Comment on Does C++ allow template specialization by concepts?

Scan HTML even faster with SIMD instructions (C++ and C#)

Earlier this year, both major Web engines (WebKit/Safari and Chromium/Chrome/Edge/Brave) accelerated HTML parsing using SIMD instructions. These 'SIMD' instructions are special instructions, present in all our processors, that can process multiple bytes at once (e.g., 16 bytes). The problem that WebKit and Chromium solve is to jump to the next target character as fast as possible: one of <, &, \r and \0. In C++, you can use std::find_first_of for this problem whereas in C#, you might use the SearchValues class.
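As a baseline, the conventional C++ approach might look like this small sketch (mine, not from the post):

#include <algorithm>
#include <string_view>

// Return a pointer to the next <, &, carriage return or NUL, or 'end' if none is found.
const char *next_target(const char *start, const char *end) {
  constexpr std::string_view targets("<&\r\0", 4);
  return std::find_first_of(start, end, targets.begin(), targets.end());
}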
But we can do better with a bit of extra work. The WebKit and Chromium engineers use vectorized classification (Langdale and Lemire, 2019): we load blocks of bytes and identify the characters we seek using a few instructions. Then we identify the first such character loaded, if any. If no target character is found, we just move forward and load another block of characters. I reviewed the new methods used by the Web engines in C++ and C# in earlier posts, along with minor optimizations. The results are good, and they suggest that the WebKit and Chromium engineers were right to adopt this optimization. But can we do better? Let us see how the WebKit/Chromium approach looks in C#. Using ARM NEON intrinsics, the scanning function (here named Scan) can be written as follows:

static unsafe void Scan(ref byte* start, byte* end) {
  Vector128<byte> low_nibble_mask = Vector128.Create(
      0, 0, 0, 0, 0, 0, (byte)0x26, 0, 0, 0, 0, 0, (byte)0x3c, (byte)0xd, 0, 0);
  Vector128<byte> v0f = Vector128.Create((byte)0x0F);
  Vector128<byte> bit_mask = Vector128.Create(
      (byte)16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
  const int stride = 16;
  while (start + (stride - 1) < end) {
    Vector128<byte> data = AdvSimd.LoadVector128((byte*)start);
    Vector128<byte> lowpart = AdvSimd.Arm64.VectorTableLookup(low_nibble_mask, data & v0f);
    Vector128<byte> matchesones = AdvSimd.CompareEqual(lowpart, data);
    if (matchesones != Vector128<byte>.Zero) {
      Vector128<byte> matches = AdvSimd.And(bit_mask, matchesones);
      int m = AdvSimd.Arm64.MaxAcross(matches).ToScalar();
      start += 16 - m;
      return;
    }
    start += stride;
  }
  while (start < end) {
    if (*start == '<' || *start == '&' || *start == '\r' || *start == '\0') {
      return;
    }
    start++;
  }
}

The function takes two pointers (ref byte* start and byte* end) that mark the beginning and end of the byte array. The main loop continues as long as start is at least 16 bytes away from end. This ensures there's enough data for vectorized operations. We load in the variable 'data' 16 bytes from the memory pointed to by start. We use a vectorized lookup table and a comparison to quickly identify the target characters. The code checks if any element in matchesones is not zero. If there is a match, we locate the first one (out of 16 characters), advance start and return. If no match is found, we advance by 16 characters and repeat. We conclude with a fallback loop that processes the leftover data (less than 16 bytes). As an optimization, it is helpful to use a local variable for the reference to the first pointer. Doing so improves the performance substantially: C# is not happy when we repeatedly modify a reference. Thus, at the start of the function, you may set byte* mystart = start, use mystart throughout, and then, just before a return, you set start = mystart.

The .NET runtime library also has a fast SearchValues class to help search for characters.

SearchValues<byte> searchValues = SearchValues.Create(stackalloc byte[] { 0, 13, 38, 60 });
ReadOnlySpan<byte> data = allLinesUtf8;
while (!data.IsEmpty) {
  int first = data.IndexOfAny(searchValues);
  data = data.Slice(first >= 0 ? first + 1 : 1);
}

I wrote a benchmark (in C#) that you can run if you have an ARM-based processor.

Conventional: 1.4 GB/s
SearchValues: 4.2 GB/s
SIMD (ARM NEON): 6.3 GB/s

Incredibly, the SIMD-based function is over 4 times faster than the conventional function in these tests, and the accelerated C# function is about 15% slower than the C++ version. The non-SIMD C# version is also slightly slower than the C++ version.
Harold Aptroot provided support for x64 processors (up to AVX2), so I extended my benchmark to an Intel Ice Lake system:

Conventional: 1.0 GB/s
SearchValues: 3.8 GB/s
SIMD (x64 AVX2): 7.6 GB/s

This time, the SIMD version is over 7 times faster than the scalar version. In fact, it matches the performance numbers that I get with C/C++. It is also important for performance to write the code in such a way that the C# compiler tends to inline the scanning function, since it is called repeatedly. Initially, I had written the benchmark with some abstraction, using a delegate function, but it limited the best possible speed. In other words, .NET/C# allows you to write fast code using SIMD instructions. It may be well worth the effort.

Further reading: Miha Zupan opened a pull request in the Microsoft .NET runtime to improve the performance of SearchValues in cases like the one I considered. Suggested reading. I have a follow-up blog post where I explain how we can go much faster (29 GB/s).

Posted on July 5, 2024August 23, 2024Author Daniel LemireCategories 4 Comments on Scan HTML faster with SIMD instructions: .NET/C# Edition

How much memory does a call to ‘malloc’ allocate?

In C, we allocate memory on the heap using the malloc function. Other programming languages like C++ or Zig (e.g., std.heap.c_allocator) may call malloc underneath, so it is important to understand how malloc works. Furthermore, the same concepts apply broadly to other memory allocators. In theory, you could allocate just one byte like so:

char * buffer = (char*) malloc(1);

How much memory does this actually allocate? On modern systems, the request allocates virtual memory which may or may not be backed by actual (physical) memory. On many systems (e.g., Linux), the physical memory tends to be allocated lazily, as late as possible. Other systems such as Windows are more eager to allocate physical memory. It is also common and easy to provide your own memory allocators, so the behavior varies quite a bit. But how much virtual memory does my call to malloc(1) typically allocate?

There is likely some fixed overhead per allocation: you can expect about 8 bytes of metadata per allocation, although it could be less or more depending on the allocator. You cannot use this overhead in your own program: it is consumed by the system to keep track of the memory allocations. Even if you have a highly efficient allocator with little overhead per allocation, the pointer itself must typically be tracked and, on a 64-bit system, that represents 8 bytes of data.

If you ask for 1 byte, you are probably getting a larger chunk of usable memory: maybe between 16 bytes and 24 bytes. Indeed, most memory allocations are aligned (rounded up) so that the address is divisible by the alignment of max_align_t (typically 16 bytes), and there is a minimum size that you may get. Relatedly, the C language has a function called realloc which can be used to extend a memory allocation, often for free because the memory is already available.

You can ask how much memory is available. Under Linux, you can use the malloc_usable_size function, while under FreeBSD and macOS, you can use malloc_size. So I can write a small program that asks how much (virtual) memory was actually granted given a request. For one byte, my macOS laptop gives out 16 bytes while my x64 Linux server seemingly gives 24 bytes. If I plot the memory actually granted versus the memory requested, you see a staircase where, on average, you get 8 extra bytes of memory.
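Here is a minimal sketch of such a probe (Linux-specific: malloc_usable_size comes from glibc's <malloc.h>; on macOS you would use malloc_size from <malloc/malloc.h> instead):

#include <cstdio>
#include <cstdlib>
#include <malloc.h> // malloc_usable_size (Linux/glibc)

int main() {
  for (std::size_t request = 1; request <= 64; request++) {
    void *p = std::malloc(request);
    if (p == nullptr) { return EXIT_FAILURE; }
    // Report how many usable bytes the allocator actually granted.
    std::printf("requested %zu bytes, granted %zu bytes\n", request,
                malloc_usable_size(p));
    std::free(p);
  }
  return EXIT_SUCCESS;
}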
That is to be expected if the pointer returned must be at an address divisible by 16 (the alignment of max_align_t). Overall, you should avoid broad generalizations about how Linux or macOS work or compare, and simply keep in mind the broad picture.

What do we conclude? You should probably avoid allocating tiny blocks of memory (i.e., smaller than 16 bytes) on the heap. Furthermore, you may not want to optimize the allocation size down to a few bytes since it gets rounded up in any case. Finally, you should make use of realloc if you can, as you can often extend a memory region, at least by a few bytes, for free.

Posted on June 27, 2024July 5, 2024Author Daniel LemireCategories 7 Comments on How much memory does a call to ‘malloc’ allocate?

Performance tip: avoid unnecessary copies

Copying data in software is cheap, but it is not at all free. As you start optimizing your code, you might find that copies become a performance bottleneck. Let me be clear that copies really are cheap. It is often more performant to copy the data than to track the same memory across different threads. The case I am interested in is when copies turn a trivial operation into one that is relatively expensive.

Recently, the fast JavaScript runtime Bun optimized its base64 decoding routines. Base64 is a technique used to represent binary data, like images or files, as ASCII text. It is ubiquitous on the Internet (email formats, XML, JSON, HTML and so forth). In Node.js and in Bun, you can decode a base64 string with Buffer.from(s, "base64"). Importantly, when decoding base64 strings, you can be almost certain that you are dealing with a simple ASCII string. And systems like Node.js and Bun have optimized string representations for the ASCII case, using one byte per character.

We had optimized base64 decoding in Node.js some time ago (credit to Yagiz Nizipli for his work)… but I was surprised to learn that Bun was able to beat Node.js by a factor of three. Because both Node.js and Bun use the same base64 decoding library, I was confused. I mistakenly thought, based on quick profiling, that Node.js would copy the base64 data from an ASCII format (one byte per character) to a UTF-16 format (two bytes per character) despite our best attempts at avoiding copies. You can review my technically incorrect analysis on X. What I like is the second half of my analysis: the story illustrates why our software is slower than it should be. We have layers of abstractions to fight against. Sometimes you win, sometimes you lose. These layers are there for a reason, but they are not free. To make matters worse… these abstraction layers often thicken over time… and the friction goes up.

To be clear, I do not claim that the Node.js code is optimal. In fact, I know it can be better. But it is not trivial to make it go fast. I sometimes hear people say… “well, it is C++ and C++ is hard”. No. The C++ part is easy, relatively speaking. The difficulty is at a higher level. It is not a matter of syntax. It is a matter of architecture.

I say that I was technically incorrect. Why was I incorrect? It turns out that the copy was not happening as part of base64 decoding but in a completely separate function. There is an innocuous function in Node.js called StringBytes::Size which basically must provide an upper bound on the memory needed by the Buffer.from function.
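Conceptually, such an upper bound needs only the length of the base64 input, never its contents: every four base64 characters decode to at most three bytes. A minimal sketch of the arithmetic (the function name here is illustrative, not Node's actual code):

#include <cstddef>

// Upper bound on the decoded size of a base64 string with 'length' characters:
// every group of up to 4 input characters produces at most 3 output bytes.
std::size_t base64_decoded_size_upper_bound(std::size_t length) {
  return (length + 3) / 4 * 3;
}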
I went back in time and looked at how this function looked in the original Node.js (0.10):

case BASE64: {
  String::AsciiValue value(str);
  data_size = base64_decoded_size(*value, value.length());
  break;
}

Can you see what it does? It takes the string it receives, makes a copy, and from the copy, it computes how many bytes the decoded version will need. That original code tried to make an ‘ASCII’ copy, so I presume it created a copy using one byte per character. It was still a shame, but not terribly so. But very soon after (Node.js 0.11), it was changed to a version that converted to UTF-16 (two bytes per character) strings, for increased safety:

case BASE64: {
  String::Value value(str);
  data_size = base64_decoded_size(*value, value.length());
  break;
}

This new version is potentially much more expensive because it may copy an ASCII string to a temporary UTF-16 string. It uses more memory, and it is also a more complicated function. Internally, the JavaScript engine implements this with the C++ template std::copy_n. The generated code is probably fine, but it is hardly “down to the metal”. As long as everything else was not too optimized, it ran just fine… but once the base64 decoding itself was optimized, this simple length computation took up more than half of the running time.

So, in concrete terms, what are we talking about as far as performance goes? Let me consider a JavaScript benchmark:

import { bench, run } from "mitata";

function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64From = Buffer.from(base64Input, "base64");
  bench(`Buffer.from(${size} bytes, 'base64')`, () => {
    Buffer.from(base64Input, "base64");
  });
}

[1024 * 1024 * 8].forEach(s => makeBenchmark(s, false));
await run();

I install the latest version of Bun (bun upgrade --canary). I compare Node.js 22 with a patched version. I use my Apple MacBook for testing (ARM-based M2 processor). You can see that by simply avoiding the unnecessary copy, I boosted the base64 decoding from 2.5 GB/s to about 7 GB/s. Not bad for removing a single line of code.

function            time        speed
Node.js 18          6’400 µs    1.3 GB/s
Node.js 22          3’300 µs    2.5 GB/s
Node.js with fix    1’200 µs    7.0 GB/s
Bun                 950 µs      8.8 GB/s

Sometimes people observe at this point that the performance of Node.js 18 was already fine: 1.3 GB/s is plenty fast. It might be fast enough, but you must take into account that we are measuring a single operation that is likely part of a sequence of operations. In practice, you do not just ingest base64 data. You do some work before and some work after. Maybe you decoded a JPEG image that was stored in base64, and next you might need to decode the JPEG and push it to the screen. And so forth. To have an overall fast system, every component should be fast.

You may observe that Bun is still faster than Node.js, even after I claim to have patched this issue. Node.js likely has other architectural overheads that Bun does not have. Remember that both Node.js and Bun are using the same library in this instance: simdutf. It is maybe interesting to review Bun’s code (in Zig):

const outlen = bun.base64.decodeLen(slice);
const to = allocator.alloc(u8, outlen) catch return &[_]u8{};
const wrote = bun.base64.decode(to[0..outlen], slice).count;
return to[0..wrote];

It is far simpler than the equivalent in Node, where memory is allocated in one function, and then the resulting buffer is passed to another function where the decoding finally happens.
It is likely that Bun is faster because it has a simpler architecture, where it is easier to see where the unnecessary work happens.

Update. After the publication of this blog post, Leszek Swirski added v8::String::ValueView to v8, which should reduce the need for copies.

Update 2. I contributed a fix to Node.js for this issue.

Posted on June 22, 2024September 12, 2024Author Daniel LemireCategories Leave a comment on Performance tip: avoid unnecessary copies

Validating gigabytes of Unicode strings per second… in C#?

We have been working on a fast library to validate and transcode Unicode and other formats such as base64 in C++: simdutf. We wondered: could we achieve the same good results in C#? Microsoft’s .NET framework has made great strides in leveraging advanced instructions. For instance, if your processor supports AVX-512, you can instantiate 512-bit registers right in C#! The standard .NET runtime library effectively utilizes these features, demonstrating that they practice what they preach.

Most strings on the Internet are Unicode strings stored in UTF-8. When you ingest such strings (from disk or from the network), you need to validate them. To test the waters, we set our eyes on UTF-8 validation. With John Keiser, I helped design a fast UTF-8 validation algorithm for modern-day CPUs. We call the algorithm ‘lookup’. It may require less than one instruction per byte to validate even challenging input. The lookup validation algorithm is part of Oracle GraalVM, Google Fuchsia, the Node.js and Bun JavaScript runtimes, and so forth.

The .NET library has its own fast UTF-8 validation function: Utf8Utility.GetPointerToFirstInvalidByte. It is highly optimized. As the name implies, it finds the location of the first byte where an error might occur. It also computes some parameters from which you can tell how the input could be transcoded. The function is an internal function, but we can expose it by copying and pasting it. Could we beat the .NET runtime, at least some of the time? It seems that we can!

Maybe you want to know what our code looks like? Here is a simplified example where we load 64 bytes and check whether they are all ASCII:

Vector512<byte> currentBlock = Avx512F.LoadVector512(pInputBuffer + processedLength);
ulong mask = currentBlock.ExtractMostSignificantBits();
if (mask == 0) {
  // got 64 ASCII bytes
} else {
  // oh oh, got non-ASCII bytes
}

Of course, the whole algorithm is more complicated, but not that much more… It is maybe 30 lines of code. We implement various versions of the algorithm: one for ARM processors, one for older x64 processors, and so forth.

For benchmarking, we use valid strings. The first one we check is twitter.json, a JSON file that is mostly ASCII with some non-trivial Unicode content within strings. We also use various synthetic strings representative of various languages. On an Intel Ice Lake system, our validation function is up to 13 times faster than the standard library. On twitter.json, we are 2.4 times faster.

data set            SimdUnicode AVX-512 (GB/s)    .NET speed (GB/s)    speed up
Twitter.json        29                            12                   2.4 x
Arabic-Lipsum       12                            2.3                  5.2 x
Chinese-Lipsum      12                            3.9                  3.0 x
Emoji-Lipsum        12                            0.9                  13 x
Hebrew-Lipsum       12                            2.3                  5.2 x
Hindi-Lipsum        12                            2.1                  5.7 x
Japanese-Lipsum     10                            3.5                  2.9 x
Korean-Lipsum       10                            1.3                  7.7 x
Latin-Lipsum        76                            76                   —
Russian-Lipsum      12                            1.2                  10 x

On an Apple M2 system, our validation function is 1.5 to 4 times faster than the standard library.
data set            SimdUnicode speed (GB/s)    .NET speed (GB/s)    speed up
Twitter.json        25                          14                   1.8 x
Arabic-Lipsum       7.4                         3.5                  2.1 x
Chinese-Lipsum      7.4                         4.8                  1.5 x
Emoji-Lipsum        7.4                         2.5                  3.0 x
Hebrew-Lipsum       7.4                         3.5                  2.1 x
Hindi-Lipsum        7.3                         3.0                  2.4 x
Japanese-Lipsum     7.3                         4.6                  1.6 x
Korean-Lipsum       7.4                         1.8                  4.1 x
Latin-Lipsum        87                          38                   2.3 x
Russian-Lipsum      7.4                         2.7                  2.7 x

Observe how the standard library already provides a function that is quite fast: it can run at gigabytes per second. We are several times faster but, evidently, C# makes it possible to write highly optimized functions. You can run your own benchmarks by grabbing our code from https://github.com/simdutf/SimdUnicode/. It is a pleasure doing this performance-oriented work in C#. It is definitely one of my favorite programming languages right now.

One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best, and it is not necessary to apply SimdUnicode in all cases. I expect good performance on recent ARM-based Qualcomm processors. The SimdUnicode library is joint work with Nick Nuon.

Posted on June 20, 2024June 24, 2024Author Daniel LemireCategories 2 Comments on Validating gigabytes of Unicode strings per second… in C#?

Rolling your own fast matrix multiplication: loop order and vectorization

If you must multiply matrices, you should use dedicated libraries. However, we sometimes need to roll our own code. In C++, you can quickly write your own Matrix template:

template <typename T>
struct Matrix {
  Matrix(size_t rows, size_t cols)
      : data(new T[rows * cols]), rows(rows), cols(cols) {}
  T &operator()(size_t i, size_t j) { return data.get()[i * cols + j]; }
  const T &operator()(size_t i, size_t j) const {
    return data.get()[i * cols + j];
  }
  std::unique_ptr<T[]> data;
  size_t rows;
  size_t cols;
};

How do you implement a matrix multiplication? A matrix multiplication is a sequence of three nested loops.
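Here is a small usage sketch: multiply_ikj is one of the six loop orderings shown just below, and since all of them accumulate into c with +=, the result matrix must start out zeroed.

#include <cstddef>

// Assumes the Matrix template above and multiply_ikj below are in scope.
int main() {
  Matrix<double> a(2, 3), b(3, 2), c(2, 2);
  for (std::size_t i = 0; i < a.rows; i++)
    for (std::size_t j = 0; j < a.cols; j++) a(i, j) = 1.0;
  for (std::size_t i = 0; i < b.rows; i++)
    for (std::size_t j = 0; j < b.cols; j++) b(i, j) = 2.0;
  for (std::size_t i = 0; i < c.rows; i++)
    for (std::size_t j = 0; j < c.cols; j++) c(i, j) = 0.0; // zero the accumulator
  multiply_ikj(a, b, c);
  // Each entry of c is 1.0 * 2.0 summed over 3 terms, i.e., 6.0.
  return c(0, 0) == 6.0 ? 0 : 1;
}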
If you do not want to get fancy, you have therefore six possibilities:

template <typename T>
void multiply_ikj(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_ijk(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kij(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kji(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jki(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jik(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

If you use an optimizing compiler and you tell it to compile specifically for your processor, you should get some fast code, at least in some instances. Which order is best? The exact result depends on your data type (double, float, int), on the size of the matrices, on your compiler and on your hardware.

I wrote a benchmark where I use 100 by 100 matrices containing double values. I use GCC 12 (with full optimization, -O3) and an Intel Ice Lake processor. I tell the compiler to optimize for the exact processor I have, thus I expect that it will use advanced AVX-512 instructions when possible. The net result in my experiment is that the best ordering is ikj. The worst ordering is jik. If you were to compute the matrix multiplication manually and naively, you would need to do 100 times 100 times 100 multiplications, so 1 million multiplications and 1 million additions. Interestingly, the best ordering (ikj) uses roughly a million instructions to load the data, do the multiplications and additions, and store the results.

order   speed          frequency   instructions per cycle   instructions/mult.
ikj     7240 mult/s    2 GHz       3.3                      916k
ijk     3190 mult/s    2 GHz       2.4                      1526k
kij     3160 mult/s    2 GHz       2.5                      1561k
kji     3090 mult/s    2 GHz       2.4                      1519k
jki     3120 mult/s    3.2 GHz     3.5                      3526k
jik     1280 mult/s    3.2 GHz     1.7                      4331k

The resulting assembly shows that for ikj, the compiler generates the vfmadd213pd instruction. There are fancier routines that the compiler could use, so the result is likely far from optimal.

Further work. Justine Tunney suggests manually unrolling the outer loops. One might also use OpenMP to get good parallelism (e.g., #pragma omp parallel for collapse(2) if(m*n*k > 300000)).

Posted on June 13, 2024June 19, 2024Author Daniel LemireCategories 10 Comments on Rolling your own fast matrix multiplication: loop order and vectorization