Side effects, strings, and spawned processes

About a week ago, I was reading the book Programming in Scala, because my friend Nolan likes Scala a lot, and I wanted to give it a try. Chapter 7, "Built-in Control Structures," ends with an example of refactoring a program from an imperative style to a more functional style. As I was reading, the following paragraph made me stop and think:

"The imperative style reveals itself in Listing 7.18 in two ways. First, invoking printMultiTable has a side effect: printing a multiplication table to the standard output. In Listing 7.19, we refactored the function so that it returns the multiplication table as a string. Since the function no longer prints, we renamed it multiTable. As mentioned previously, one advantage of side-effect-free functions is they are easier to unit test. To test printMultiTable, you would need to somehow redefine print and println so you could check the output for correctness. You could test multiTable more easily, by checking its string result."

I should admit up front that I was approaching Scala with a serious bias against its host platform, the Java Virtual Machine. To me, the JVM had gained a reputation for being big and memory-hungry, and for encouraging systems that consisted of a single, monolithic process including everything and the kitchen sink. I further disliked the JVM because its memory usage and startup time made the Unix ideal of small, focused, short-lived processes untenable.

So I was predisposed to think that Scala programmers, constrained by the JVM, would approach some problems in the wrong way. And that is exactly what I thought when I read the paragraph quoted above. The authors stated that the imperative-style example would be difficult to test because it would require one to override Scala's basic functions for writing to standard output. And I thought, "Only if you run your tests inside the same process." It occurred to me that the Unixy way to test this program would be to run it from another process which would receive and check its output. I remembered reading about a tool called "expect" which could do tests this way.
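
To make that concrete, here is roughly what such a test could look like. I'm sketching it in Python rather than with expect, because Python is what I write every day; the program name and the expected output are made up.

    import subprocess
    import unittest


    class MultiTableOutputTest(unittest.TestCase):
        def test_prints_a_correct_product(self):
            # Run the program under test as a completely separate process and
            # capture whatever it writes to standard output.
            result = subprocess.run(
                ["python", "print_multi_table.py"],  # hypothetical program under test
                capture_output=True,
                text=True,
                check=True,
            )
            # From out here, "printing to standard output" isn't a side effect;
            # it's just the value the process handed back to us. The exact
            # string to check for depends on the program's output format.
            self.assertIn("56", result.stdout)


    if __name__ == "__main__":
        unittest.main()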

So it seemed to me that, approached the Unix way, writing to standard output wasn't a side effect at all. This got me thinking about shell programming. It occurred to me that from a shell script's point of view, a process invocation was like a function call, and what the process wrote to standard output wasn't a side effect, but simply the function's return value. Indeed, from this point of view, global variables and mutable collections inside a process wouldn't be a big deal, because they would be like local variables in a function. Accompanying these thoughts was a feeling of superiority; I felt pretty good about having come to see things the Unix way. I saw Scala as a nice language trapped in a terrible environment that warped one's perspective by making the Unix way impractical.

These thoughts stayed in the back of my mind as I moved on to other things. Over the next few days, I learned a little more about Scala and the available Web application frameworks for it. One difference between the Lift and Play frameworks was particularly striking. Whereas Play's default template system uses string concatenation (with auto-escaping of HTML entities to prevent XSS attacks), Lift's default template system parses the template into a DOM and manipulates that DOM. The approach that Play uses is much more common, but it seems to me that Lift's approach is more correct.

Contemplating this difference made me think of Glyph Lefkowitz's essay Data In, Garbage Out, in which he convincingly argued that structured data, including HTML, should be manipulated as structured data, not cobbled together (potentially incorrectly) by concatenating strings. I now think that most of us web developers, including those of us working with the more popular Python template systems, have gotten this wrong. The auto-escaping in our template systems may shield us from most, if not all, XSS attacks, but it's still all too easy for us to slip up and write a template that produces ill-formed markup.
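
To see the difference concretely, here's a toy sketch in Python, with made-up data and xml.etree standing in for a real templating system: concatenation will cheerfully emit unescaped, ill-formed markup, while building a tree and serializing it can't.

    import xml.etree.ElementTree as ET

    user_input = 'Bob says "hi" & waves <script>'  # untrusted data

    # String concatenation: nothing stops us from forgetting to escape the
    # data, or from mistyping the closing tag.
    fragment = "<p class='greeting'>" + user_input + "<p>"  # oops: unescaped and never closed

    # Tree construction: the structure is built rather than spelled out, so
    # the serializer escapes the text and closes every element for us.
    p = ET.Element("p", attrib={"class": "greeting"})
    p.text = user_input
    print(ET.tostring(p, encoding="unicode"))
    # <p class="greeting">Bob says "hi" &amp; waves &lt;script&gt;</p>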

The night after I reread Glyph's post, my thoughts returned to the question of whether writing text to standard output is a side effect to be avoided, or just the normal outcome of the short-lived processes that are the Unix ideal. And I suddenly realized that what I had considered the Unix ideal was in fact not ideal at all, because it consists of a lot of string manipulation. The Unix shell may provide a programming language, but for all practical purposes, it's a language in which the only data type is a string. That's no way to program. I think I already understood that on some level, since I don't do much shell programming on a daily basis and have often thought that larger shell scripts tend to be flimsy. But thinking about Glyph's blog post on strings drove the point home. Indeed, I realized that the authors of Programming in Scala didn't go far enough with the example that concludes chapter 7. A function that returns a multiplication table as a map (I believe the correct Scala type notation would be "Map[(Int, Int), Int]") would be far easier to test than a function that returns the table as a string.
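
Here's roughly what I mean, sketched in Python (my daily language) rather than Scala, with a plain dict standing in for that Map[(Int, Int), Int]; formatting the table as a string becomes a separate, purely presentational step.

    def multi_table(n=10):
        """Return the multiplication table as structured data: (row, column) -> product."""
        return {(i, j): i * j for i in range(1, n + 1) for j in range(1, n + 1)}


    def format_table(table):
        """Rendering to a string is a separate concern from computing the table."""
        n = max(i for i, _ in table)
        width = len(str(n * n)) + 1
        return "\n".join(
            "".join(str(table[(i, j)]).rjust(width) for j in range(1, n + 1))
            for i in range(1, n + 1)
        )


    # The interesting function is now trivial to test, with no string parsing:
    assert multi_table()[(7, 8)] == 56

A test for format_table is still worth having, but it only has to cover presentation, not arithmetic.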

With that, I got to thinking that Stanislav might have been right to call Unix archaic. He had linked to The UNIX-HATERS Handbook, so I read it. In chapter 8, "csh, pipes, and find," I found confirmation of my recent conclusion that passing strings between processes was a poor substitute for passing structured data between functions:

"At the risk of sounding like a hopeless dream keeper of the intergalactic space, we submit that the correct model is procedure call (either local or remote) in a language that allows first-class structures (which C gained during its adolescence) and functional composition. Pipes are good for simple hacks, like passing around simple text streams, but not for building robust software. For example, an early paper on pipes showed how a spelling checker could be implemented by piping together several simple programs. It was a tour de force of simplicity, but a horrible way to check the spelling (let alone correct it) of a document. Pipes in shell scripts are optimized for micro-hacking. They give programmers the ability to kludge up simple solutions that are very fragile. That’s because pipes create dependencies between the two programs: you can’t change the output format of one without changing the input routines of the other."

So the way I had just come to view the practice of passing strings between processes was correct; others had realized it long ago (this book was published in 1994).

Having willingly given up my real religion last year, I was ready to have my lesser faith in Unix challenged. So I approached The UNIX-HATERS Handbook with an open mind, wanting to know how much of what I thought I knew about Unix was wrong. Reading this book was quite a revelation. Some things have clearly improved since 1994; Sendmail isn't nearly so widely used, Usenet is mostly forgotten, and the GNU utilities seem to be generally better than their original Unix counterparts. But it seemed to me that we had been duped into thinking of Unix as the pinnacle of reliability and good operating system design. Had Windows (especially pre-XP) really set our expectations that low?

I didn't have to look far to find a challenge to one of my most dogmatically held beliefs about Unix. In chapter 1, in a list of myths about Unix, was the sentence, "Processes are cheap." To a Unix aficionado, this sentence succinctly encapsulates the belief that because of the implementation of the fork() system call, most notably the use of copy-on-write virtual memory pages, it is feasible to have a large number of processes, ideally short-lived processes, all running the same program. Unix lovers consider this multiprocess approach superior to multithreading, because it is more resilient to failures in individual processes, and in some cases (when dealing with C programs) also more secure. To this day, the Postfix mail transport agent uses a separate process for each SMTP connection (both incoming and outgoing), and the Dovecot POP3/IMAP server uses a separate process for each session. In fact, Dovecot uses one process for the login sequence, then another for the POP3/IMAP session itself, to be really secure.
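
The shape of that model is simple enough to sketch in a few lines of Python; this is only a toy (real servers like Postfix add privilege separation, concurrency limits, and process reuse), but it shows the architecture: one accepting parent, one forked child per connection.

    import os
    import signal
    import socket

    # Toy process-per-connection server: accept in the parent, fork one child
    # per connection, and let the child exit when the connection is done.
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)  # let the kernel reap exited children

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 2525))
    listener.listen(128)

    while True:
        conn, _addr = listener.accept()
        if os.fork() == 0:   # child: handle exactly one connection, then exit
            listener.close()
            conn.sendall(b"220 toy greeting\r\n")
            conn.close()
            os._exit(0)
        conn.close()         # parent: go back to accepting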

I was already starting to wonder if Unix processes were truly cheap, even when running programs written in C, because of a series of recent incidents on my employer's mail server, which runs Postfix and Dovecot. Seeing that sentence listed among Unix myths answered the question definitively; processes weren't cheap in the 80s to early 90s, when most programs were written in C. If processes that run C programs seem at all cheap now, it's probably because CPU power and RAM have continued to grow more abundant, and most of us have gotten used to software that wastes these resources.

Chapter 12, "Security," discussed how easy it is for multiple processes to bring down a system. The section "Denial of Service" presented the fork bomb (though not by that name), a super-simple program that exploits fork() to bring any Unix system to a halt. But what really got my attention was this passage:

"Not exciting enough? Well, thanks to the design of the Unix network system, you can paralyze any Unix computer on the network by remote control, without even logging in. Simply write a program to open 50 connections to the sendmail daemon on a remote computer and send random garbage down these pipes. Users of the remote machine will experience a sudden, unexplained slowdown. If the random data cause the remote sendmail program to crash and dump core, the target machine will run even slower."

I knew that Sendmail used the process-per-connection model of concurrency, as do Postfix and Dovecot. But did it really only take 50 SMTP connections to bring down a system? This further confirmed that there was no golden age of cheap processes on Unix. Processes weren't cheap in 1994, and they're not cheap now. And now, interpreters, JIT compilers, and certain automatic memory management techniques make processes even less cheap.

That was Wednesday night. The following evening, there was another load spike that brought my employer's main mail server down. After rebooting the (virtual) machine, I wondered, not for the first time, if any major email provider was running a mail server with a process-per-connection concurrency model. Some probing revealed that Google, Yahoo, Microsoft, Comcast, and GMX are all running custom SMTP servers, not an off-the-shelf Unix SMTP server like Postfix, qmail, or Sendmail. And Apple is running an email server product from Oracle. Of course, I don't know this for sure, but I'm willing to bet a small amount that all of these servers are multithreaded and/or event-driven, not using a process per connection.

Friday, as a result of that latest mail server failure, my attention turned to making our flagship Web application more resilient to such failures. This application, which is written in Python, had been using the process-per-connection model to handle HTTP requests for years. I tried the event-driven model (using Eventlet) a little over a year ago, with disastrous consequences; simulating threads using coroutines and monkey-patching of Python libraries was clearly too risky a change. And if I remember correctly, my previous unsuccessful attempt at multithreading for this application had been in 2007. But I had known for months that it was far too easy for a failure of one subsystem, such as the mail server, to exhaust the relatively small pool of worker processes. And now, my resistance to the multithreaded approach was almost gone.

But I still had to convince myself that multithreading was a good solution. After all, some Python web developers seemed to believe rather strongly that the only two good solutions were process-per-connection and event-driven. And PHP isn't multithreaded, yet there are huge sites running PHP. But wait a minute; those sites, such as Wikipedia and WordPress.com, have workloads that are disproportionately read-only. These sites can put a caching layer in front of PHP, rendering moot the question of whether PHP uses processes or threads. And what about Facebook? They use PHP, but they did their own implementation, and it's multithreaded. And wasn't it obvious that bytecode interpretation, reference counting, and some kinds of garbage collection all thwart a Unix system's attempts to share memory among processes, and that JIT compilation would also have this effect? Wasn't it obvious that with a language like Python, memory usage was bound to be far from optimal in the process-per-connection model? Surely it wasn't a coincidence that in the runtime environments with the most sophisticated virtual machines, namely Java and .NET, Web applications used the multithreaded model without exception. With my last doubts resolved, and having discussed the matter with management, I proceeded to move our flagship Web application to the multithreaded model.

I completed the transition yesterday afternoon, then spent a few hours away from the computer. During that time, I started having second thoughts about what I had just done. Surely I had just started on a slippery slope that would end in unmanageable, kitchen-sink behemoths, as in the JVM ecosystem that I had loathed just a week before. But I was tired (indeed, I had a short nap yesterday afternoon), so I probably wasn't being especially rational. Still, those thoughts bothered me.

I had to prove to myself, again, that threads really were much cheaper than processes, enough so to warrant the potentially increased difficulty of debugging. I wrote two Python programs, one that spawned 4096 threads which did nothing but sleep for 5 minutes, and another which did the same thing with 4096 forked processes. The memory usage of the fork-based program was at least an order of magnitude greater, and this was a program whose child processes didn't really do anything. If each of those 4096 processes had, say, imported a large module, I suspect that this would have immediately brought my local Linux VM to its knees. But the test brought that VM down soon enough, when the 4096 processes all tried to exit. I suspect that this happened because each process ran Python's full shutdown routine, which decreased the reference counts of all objects and deallocated them. Writing to all those reference counts would have forced the kernel to copy pages that had until then been shared between the processes, causing a massive spike in memory use. So, yes, it was obvious that a large number of processes wasn't feasible with Python.
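
For the record, the two programs were roughly this shape (reconstructed from memory, not the exact scripts I ran; don't try the fork version on a machine you care about):

    import os
    import sys
    import threading
    import time

    COUNT = 4096
    SLEEP_SECONDS = 300


    def thread_version():
        # 4096 threads that do nothing but sleep; they all share one address space.
        threads = [threading.Thread(target=time.sleep, args=(SLEEP_SECONDS,))
                   for _ in range(COUNT)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


    def fork_version():
        # 4096 forked children that do nothing but sleep; each one carries a
        # copy-on-write image of the whole interpreter.
        pids = []
        for _ in range(COUNT):
            pid = os.fork()
            if pid == 0:
                time.sleep(SLEEP_SECONDS)
                # A normal exit runs the interpreter's full shutdown in every
                # child, touching reference counts and forcing shared pages to
                # be copied all at once.
                sys.exit(0)
            pids.append(pid)
        for pid in pids:
            os.waitpid(pid, 0)


    if __name__ == "__main__":
        fork_version() if "fork" in sys.argv else thread_version()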

Finally, I thought about the fork bomb attack presented in chapter 12 of The UNIX-HATERS Handbook, and wondered if something just as bad could be done with threads. To make sure that Python's global interpreter lock didn't invalidate the results, I did this experiment in C. I wrote a small program, threadbomb.c, which replicated the behavior of a fork bomb but with threads instead of processes. Within 15 to 30 seconds of running the program, the system was rendered unusable. But to my surprise, that state only lasted for a couple of minutes; the program was then aborted, and all was back to normal. I then wrote and ran the conventional fork bomb, which rendered the system unusable even faster. Only this time, there was no recovery. +1 for threads, at least on Linux.

So why have I written this long, rambling essay? Well, I've heard that the best way to organize one's thoughts is to write them down. I want to make sure I'm thinking clearly about the subjects I've covered here, so I figured I'd write about them. If anyone is still reading, the best advice I can give is this: Question every assumption and belief, especially the ones you've never questioned before. Be willing to try new approaches; you might find that they're better than you had assumed.
