A very fucking cool post from Randall Munroe of xkcd about the intersection between sex, gender and categorisation.
How I made Street Hoarding, a Node.js and Redis application, or, a super simple explanation of asynchronicity, event loops, non-blocking IO, JavaScript, Comet and Node
Second update: thanks to James Coglan again, I have modified the code. Now, the client message requests are held open until there is a new message to return, thus reducing the load on the server.
Update: Thanks to James Coglan for providing some lovely technical corrections to this article.
Go to the Street Hoarding homepage and you will see a message in big letters. If you wish, you can type another message in the text box at the bottom of the page, press return, and see it take the place of the old message. Anyone else on the site at that moment will see your words within a few hundredths of a second. It’s kind of like a community pin-up board, or some hoarding on a building site, or a promiscuous IM client with a very short memory.
Some of the key parts of the code were taken from Ryan Dahl’s demo chat app for Node.js.
My aim with this article is to explain how everything works to someone who is like I was before I wrote Street Hoarding: hazy about asynchronicity, event loops, non-blocking IO, JavaScript and Node.js.
There are two elements: the client and the server.
The client
This is an HTML page that lays out the main message and the text box. It is also the JavaScript that runs on the user’s browser. The JavaScript has two key functions.
longPoll() runs the whole time the user has the webpage open. It takes some data. If this data is null, it is ignored. If it is not null and has a message component, that message is displayed on the webpage through a jQuery update to the message div. Either way, an XMLHttpRequest is then made with jQuery to the /latest_message URL on the server. This request takes some time, but it is asynchronous. That is very important. When longPoll() is run, the data is processed, the URL request is made, and then the execution of longPoll() continues past the $.ajax() call; control is passed back to the computer processor so it can carry on doing other work. When a success response comes back, the success function inside the $.ajax() call is run. This pauses for a moment, then calls longPoll() again, passing it the data that the server responded with. Next time through, the message inside this data will be written to the HTML page and the user will see it.
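Sketched out, it looks something like this. This is an illustration, not the verbatim Street Hoarding code (that's on GitHub): the element ID, the response shape and the timings are all guesses, and jQuery is assumed.

```javascript
// A sketch of the long-polling client. Illustrative names throughout.
function longPoll(data) {
  if (data && data.message) {
    $('#message').text(data.message); // show the newest message
  }
  $.ajax({
    url: '/latest_message',
    data: { since: data ? data.timestamp : 0 }, // when we last heard anything
    dataType: 'json',
    error: function () {
      setTimeout(function () { longPoll(null); }, 5000); // back off on error
    },
    success: function (newData) {
      setTimeout(function () { longPoll(newData); }, 100); // pause, then go again
    }
  });
}
```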
tryToSendMessage() is called when the user submits a new message via the text field on the HTML page. It first sends an (asynchronous, as always) request to the server to ask it whether the message the user entered has ever been said before. If it has, it just tells the user they aren't being original and finishes. Otherwise, it sends an (asynchronous) request to the /send_message URL, passing the user's message as a parameter, thus telling the server to save the message.
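In sketch form, with /check_unique standing in as a hypothetical name for the uniqueness-check URL (the article only names /send_message):

```javascript
// A sketch of the sending flow. /check_unique is a made-up endpoint name.
function tryToSendMessage(message) {
  $.get('/check_unique', { message: message }, function (response) {
    if (response.unique) {
      $.get('/send_message', { message: message }); // asynchronous, as always
    } else {
      alert("That's been said before. Try being original.");
    }
  }, 'json');
}
```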
There are some improvements that could be made to this code. First, when the message is sent to the server, it does not get updated in the user's browser immediately. That will have to wait until the longPoll() function gets its next response from the server. Second, the user has no idea whether the /send_message request was successful until longPoll() updates the webpage. Third, the message is actually sent twice. The uniqueness check request and the send message request could have been combined into a single /send_message request that had the server respond with either an indication of success or a message saying that the message was not unique.
I keep on saying the word asynchronous. Everyone who talks about Node.js goes on about asynchronous execution and this other thing, non-blocking input and output (IO). What is crazy is that we haven't even got to the Node.js stuff yet. This is all browser magic that we've had for the last whatever years.
So, let’s go back a bit.
When a request goes from the client to the server - either asking for the latest message or sending a new message - the computer processor doesn’t hang around waiting for a response. Instead, it moves on and deals with other tasks. The processor returns its attention to the request when the response comes in. Thus, the input and output are non-blocking. Which is to say, waiting for data to arrive or be sent does not hold the processor up from its other tasks. From this, we get asynchronicity - lines of code can get executed out of order. If there is a pause whilst a function waits for something to happen and that something does not require the computer’s processor, other work can be done in the meantime.
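You can see the out-of-order execution in three lines, assuming jQuery:

```javascript
console.log('one');
$.get('/latest_message', function () {
  console.log('three'); // runs whenever the server gets back to us
});
console.log('two'); // runs immediately; the request did not block
```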
How does this work?
There is this thing called an event loop and every browser has one. This is a function that just goes around and around, taking note of things that happen, like a woman alone in a house at night straining to hear every floorboard creak and passerby's creep. Code that is running in the browser, like jQuery, can register its interest in different types of event. So, when the $.ajax() jQuery function is called in longPoll(), it sends out the request and then tells the event loop it would be very interested in hearing about any HTTP responses that come back from the server. The event loop eventually gets the response and passes it to jQuery, which looks at the response to see if the request was a success, and then runs one of the two functions that we defined in longPoll().
This is where JavaScript plays its part. JavaScript has - and you may have heard this term before - first-class functions. These confer several abilities, but the one we care about is that functions can be passed as arguments. In longPoll(), two functions are passed as properties of the object given to the $.ajax() call: the first is to be run in the case of a response that indicates an error, the second in the case of a successful response.
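That is, a function is just a value. It can sit in a variable and be handed to $.ajax() like any other argument:

```javascript
// A function is a value: assign it, pass it, call it later.
var onError = function () { console.log('request failed'); };
var onSuccess = function (data) { console.log('got', data); };

$.ajax({
  url: '/latest_message',
  error: onError,     // passed, not called - note the absent ()
  success: onSuccess
});
```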
Now, back to the request from our client code. Using a browser means we are in an event loop. Making an HTTP request means we have time when the processor is not being used. Using jQuery means that control is handed back to the browser after a request is sent. The browser regaining control means we have non-blocking IO. Non-blocking IO means that the event loop continues to run whilst it awaits a response. The event loop continuing to run means that other tasks can be dealt with in the meantime.
That deals with the wonders of non-blocking IO and asynchronicity on the client. If we already have all that stuff, I ask, why is Node.js so special? And I answer: because this is now easy to do on the server, too.
The server
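The code the next few paragraphs pick apart is a snippet like the following. This is a reconstruction from the prose below and from Ryan Dahl's node_chat demo, not the verbatim source; latestMessageTime is an illustrative name, and sendLatestMessageToClient() is described shortly.

```javascript
var fu = require('./fu');   // the mini router, from Ryan Dahl's node_chat demo
var url = require('url');

var messageRequests = [];   // /latest_message requests being held open
var latestMessageTime = 0;  // when the newest message arrived (illustrative)

fu.get('/latest_message', function (req, res) {
  var since = Number(url.parse(req.url, true).query.since);
  if (latestMessageTime > since) {
    sendLatestMessageToClient(res); // something new: respond straight away
  } else {
    // nothing new: hold the request open (explained further down)
    messageRequests.push({ res: res, callback: sendLatestMessageToClient });
  }
});

fu.listen(8000, 'localhost');
```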
There is some crazy fucking shit going on in the fu.get() line. fu is an imported piece of JavaScript code that acts as a mini router. When you pass a URL and a function to fu.get(), you are saying: when the Node.js server gets a request that was sent to this URL, run this function.
A digression on how the router works that explains some things about JavaScript and the Street Hoarding code but that is not really necessary to read to get the main points of this article
fu.get() takes a URL and a function and adds the function to a hash, keyed with the URL. fu.listen() starts the server defined by the server variable and makes it listen to events coming to the passed host (probably localhost) on the passed port. We've eaten our way around the jam filling, so it's time to get sticky fingers.
createServer(), a Node.js function, is called with an anonymous function that takes a request and a response, and the resulting server object is assigned to the server variable. That anonymous function gets the URL on the passed request object, req, and looks in getMap to find the corresponding function to run. For example, in the latest message code defined above, the URL is /latest_message and the function is the rest of the code snippet.
We now meet a second special feature of JavaScript: prototyping. The passed response object, res, has two new methods added to it on the fly: simpleText() and simpleJSON(). The methods themselves are not that interesting - they just create a string to return to the client as a response to its request - it is the fact that they are stuck on the res object without so much as a by-your-leave that I just know is making your head explode.
Finally, the handler function in getMap that corresponds to the requested URL is called with the request and the super-charged-with-new-functions response.
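Put together, a minimal fu might look like this. A sketch, assuming Node's http and url modules; the real fu.js in the node_chat demo does a little more:

```javascript
var http = require('http');
var url = require('url');

var getMap = {};

var server = http.createServer(function (req, res) {
  // the promised prototyping: two methods stuck straight onto res
  res.simpleText = function (code, body) {
    res.writeHead(code, { 'Content-Type': 'text/plain' });
    res.end(body);
  };
  res.simpleJSON = function (code, obj) {
    res.writeHead(code, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(obj));
  };

  // find the handler registered for this URL and run it
  var handler = getMap[url.parse(req.url).pathname];
  if (handler) {
    handler(req, res);
  } else {
    res.simpleText(404, 'not found');
  }
});

exports.get = function (uri, handler) {
  getMap[uri] = handler; // the hash of URL -> function
};

exports.listen = function (port, host) {
  server.listen(port, host);
};
```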
Latest message
So, the function passed to fu.get() extracts the since parameter that the client sent with the request. This indicates when the client last received a user message from the server. If the server has received a message from a user since then, sendLatestMessageToClient() is called.
sendLatestMessageToClient() creates a new Redis client. It calls redisClient.stream.addListener() to connect the Redis client to the Redis server, passing a function as the second argument. Note the asynchronicity. The Redis library does not hang around waiting while the Redis client connects to the Redis server. Instead, behind the scenes, it passes control back to the server event loop which, at some point in the future, gets an I've Finished My Work And My Name Is The Redis Client Connection Function event which then calls the function passed as the second argument.
This function calls redisClient.lindex(), which retrieves the first item in the messages list in the database. Three arguments are passed: the key of the messages list, a 0 to indicate the first item in the list, and yet another callback function. redisClient.lindex() retrieves the first message (did you notice the auxiliary bout of asynchronicity?), and the callback is run, which closes the Redis client and runs the simpleJSON() function to send the message back to the client. (Those of us who read the digression are like fully in a special secret club what knows how totally mind-fucking it is that the res object has a simpleJSON() function hanging around on it; those who did not read the digression will keep their heads fuck-free.)
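In code, roughly. The API shapes here belong to the redis-node-client library of the day, and the messages key name is a guess:

```javascript
var redis = require('./redis-client'); // redis-node-client

// A sketch of sendLatestMessageToClient(). Key name and payload are guesses.
function sendLatestMessageToClient(res) {
  var redisClient = redis.createClient();
  redisClient.stream.addListener('connect', function () {
    // connected: now fetch the first item in the messages list
    redisClient.lindex('messages', 0, function (err, message) {
      redisClient.close();
      res.simpleJSON(200, { message: String(message) });
    });
  });
}
```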
New message
The function passed to fu.get() extracts the message from the request and calls storeMessage(), passing the message and yet another function to call back later.
storeMessage() goes through the familiar routine of creating a Redis client, requesting a connection to the Redis server, calling a Redis function (redisClient.lpush(), this time), closing the Redis client and calling back the function passed as the second argument, which:
Wait, stop a second. Do you remember how I rather trailed off a few paragraphs ago when I wrote, "If the server has received a message from a user since then, sendLatestMessageToClient() is called"? By which I mean, I didn't say what happened if the server had not received a new message since the last message was sent to the user. Let's have a look.
Right. Latest message requests that would normally be answered with the message that the client is already displaying are held open. I know that was a long sentence, and this is a long article, and you are tired, but I hope that those last two words didn't slip by you. Held open. A response is not sent immediately. Instead, a new item is pushed onto the messageRequests array: a hash of the res object and the sendLatestMessageToClient() function.
So, back to the /send_message code to see how it deals with the held message requests. The code extracts the user's message, stores it and sends a success response back to the client. For each message request that has been pushed onto messageRequests, sendLatestMessageToClient() is called. This sends the latest message (probably the one received a few lines ago) back to the client, thus ending the request. This is Comet: the client sends a request and no response is sent until there is something useful to send; thus, the request is held open.
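Sketched to match, reusing the fu, redis and messageRequests names from the earlier snippets (again, illustrative rather than verbatim):

```javascript
var url = require('url');

// The familiar routine: connect, lpush the message, close, call back.
function storeMessage(message, callback) {
  var redisClient = redis.createClient();
  redisClient.stream.addListener('connect', function () {
    redisClient.lpush('messages', message, function () {
      redisClient.close();
      callback();
    });
  });
}

fu.get('/send_message', function (req, res) {
  var message = url.parse(req.url, true).query.message;
  storeMessage(message, function () {
    res.simpleJSON(200, { ok: true }); // tell the sender it worked
    // answer every held-open /latest_message request with the news
    messageRequests.forEach(function (held) {
      held.callback(held.res);
    });
    messageRequests = [];
  });
});
```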
Ryan Dahl did two really cool things. First, he wrote a library that lets you code an event-driven server in JavaScript. However, this was not new. Second, and more importantly, he wrote the core libraries so that they are non-blocking. The problem with other event-driven programming libraries is that you can't be sure whether the auxiliary libraries you want to use are non-blocking. If they aren't, you will stall your event loop and it will stop dealing with incoming events and everything will fall apart.
So, from the re-written libraries, we get non-blocking IO, which allows an event loop. The event loop allows the server to run in a single process. A single process means low memory usage.
Freeing disk space on your Linux server
The websites that I host on Slicehost, Playmary and Street Hoarding, keep crashing because my slice keeps running out of disk space.
To find out where disk space is being used:
1. Get to the root of your machine by running cd /
2. Run sudo du -h --max-depth=1
3. Note which directories are using a lot of disk space.
4. cd into one of the big directories.
5. Run ls -l to see which files are using a lot of space. Delete any you don't need.
6. Repeat steps 2 to 5.
Fourth EP
I finished the new mary rose cook music record months ago, but, since moving to Berlin, I haven’t got around to photocopying the inlays and burning the CDs and doing the stapling. Soon.
I wrote the note in the photograph and pinned it to my bedroom wall when I came back to London, two months after having a cardiac arrest.
I have just added my first Playmary track that includes an image as the comment. I have no idea why I haven’t done this before.
This evening, I got off the U-Bahn at Görlitzer Bhf. and My Hometown by Bruce Springsteen came on my headphones and I looked down the platform and saw the clear, twilight Berlin sky and the sun falling out of it. I stood a moment to take a photograph.
Syndicate
I have just re-discovered Syndicate and spent the afternoon playing a graphically crippled SNES version on my MacBook Pro. You can get the emulator here and the ROM here.
I played Syndicate on my PowerMac when I was fifteen. I have no idea where I got my copy - it was an old game even then. I was in love with it and another game, Myth II, around the same time.
Though they are very different, I liked the same things about them: they let you use a few basic tools to create your own stories and solve problems in your own way, and they are set in a world that makes you lick your lips.
In Myth II, you have archers with flaming arrows and dwarves with satchel charges. You can light the grass on fire to corral the enemy into a narrow gully. You can use the satchel charges to blow up the front and back of a company of enemy soldiers, trapping the survivors in the middle. You can lure the enemy into an area and then tell your archers to fire their flaming arrows to ignite hidden charges.
You can create your own scenarios of destruction, and plot the enemy’s demise like a story. Once you get good at a level, the carnage takes on an air of theatre, of ballet.
Each level in Myth II is just about wiping out the enemy’s army, but Syndicate gives you more story to work with. The hyper-capitalist company for which you work needs a politician assassinating. He is to attend a mega-mall opening. You could lie in wait by the road and blow up his car as he passes. Or, you could hide your guns and blend in with the crowd and shoot him as he cuts the ribbon and then escape in his limousine. Or, you could take over the minds of his bodyguards and get them to kill him for you, then slip away unnoticed.
Though Myth II has very little story in the levels, the mise-en-scène is wonderful. The landscapes are so barren. They are like the muddy no man’s land between the trenches in the First World War. Their sparseness draws attention to the soldiers like a stage draws attention to the actors. The environment dictates the story as in the theatre-like battle-fields of Flags of Our Fathers, or the cold Detroit in Narc that made the world into an inhospitable place that left the characters naked and aggressive and scared, a place where things happen that no one will see.
Writing an mp3 crawler in Clojure
I’ve written an mp3 crawler to help me learn Clojure. It’s 150 lines. I’m sure could be much shorter. There are some URL parsing bugs.
Like all my projects, the code I talk about in this article is open source. Get it from GitHub.
The basic flow
1. Start with a URL, like saidthegramophone.com.
2. Request the page and find all the URLs on it.
3. Save all the ones that point at mp3s.
4. Note down how many mp3s were yielded.
5. Throw away ones that definitely don't point at other HTML pages (images, JavaScript).
6. Throw away ones that are at hosts that don't seem to yield many mp3s.
7. Add the rest to the list of URLs to crawl.
8. Go to step 2 with the next item on that list.
Interesting points
Agents
URLs are requested by asynchronous agents in batches of twenty. Thus, they can be crawled much more quickly. crawl-batch-of-urls maps the twenty items in urls-to-crawl to the request-url function. This function creates a new http-agent and tells it to download the (HTML) content at the URL. crawl-batch-of-urls then waits up to ten seconds for all the agents in the batch to finish, then passes them back.
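A sketch of that batch step, assuming clojure.contrib's http.agent library (which fits the era of the post); the names mirror the prose rather than the real source:

```clojure
(ns crawler.sketch
  (:require [clojure.contrib.http.agent :as http]))

(defn request-url [url]
  (http/http-agent url)) ; returns at once; the download happens off-thread

(defn crawl-batch-of-urls [urls-to-crawl]
  (let [agents (doall (map request-url (take 20 urls-to-crawl)))]
    (apply await-for 10000 agents) ; wait up to ten seconds for the batch
    agents))
```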
Host scores
A record is kept of the mp3-richness of each host the crawler encounters. Each mp3 found on a host scores it a point. Each crawl of a URL on the host loses it a point. So, say www.saidthegramophone.com/archives/in_this_box_or_another.php was crawled and five mp3s were found: four points would be added to the score for www.saidthegramophone.com (five for the mp3s, minus one for the crawl).
update-host-scores updates a hash of hosts and scores after a new URL is crawled.
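Something like this - a sketch of the scoring rule, not the verbatim source:

```clojure
;; one point per mp3 found on the page, minus one point for the crawl
(defn update-host-scores [host-scores host mp3-count]
  (assoc host-scores host
         (+ (get host-scores host 0) mp3-count -1)))

;; (update-host-scores {} "www.saidthegramophone.com" 5)
;; => {"www.saidthegramophone.com" 4}
```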
gen-host-scores is called twice at the beginning of the program's execution. Each time through the main execution loop, the URLs crawled and URLs saved thus far are written to disk. Thus, for the first call, an empty hash map is the starting point and each of the URLs crawled costs its host one point. The second time, the hash of scores calculated the first time is the starting point and each of the mp3s found scores its host one point.
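A sketch of that startup step. host-of, crawled-urls and mp3-urls are illustrative names for things the real code derives or reads back from disk:

```clojure
(defn host-of [url]
  (.getHost (java.net.URL. url))) ; "http://a.com/x.mp3" -> "a.com"

(defn gen-host-scores [score-fn initial-scores urls]
  (reduce (fn [scores url]
            (let [host (host-of url)]
              (assoc scores host (score-fn (get scores host 0)))))
          initial-scores
          urls))

;; first call: every crawled URL costs its host a point;
;; second call: every mp3 found scores its host a point
(def host-scores
  (gen-host-scores inc
                   (gen-host-scores dec {} crawled-urls)
                   mp3-urls))
```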
Being encouraged to think better
Through its immutable data structures and passable functions, Clojure is always pushing me to re-use code and employ recursion. I felt very cool when I was able to write a function that accepts another function to filter a sequence.
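It was something in this spirit - a reconstruction, since the original is on GitHub; urls is a sequence of URL strings and crawled is assumed to be a set:

```clojure
;; one function, two behaviours: the caller passes in the test to apply
(defn keep-urls [keep? urls]
  (filter keep? urls))

(keep-urls #(.endsWith % ".mp3") urls)        ; just the mp3 links
(keep-urls #(not (contains? crawled %)) urls) ; just the unvisited links
```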
The main loop
scrawl is the function that runs the show. If the list of passed url-crawl-agents is empty, a new batch is created and scrawl is called again. If the next agent on the list failed to complete its data request, it is thrown away and scrawl is called again. Otherwise, the function calculates all the required data and calls itself.
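The shape of it, sketched; failed? and process-agent stand in for the real helpers, and crawl-batch-of-urls is the function from the sketch above:

```clojure
(defn scrawl [url-crawl-agents urls-to-crawl]
  (cond (empty? url-crawl-agents) ; nothing left: crawl the next batch
        (recur (crawl-batch-of-urls urls-to-crawl)
               (drop 20 urls-to-crawl))

        (failed? (first url-crawl-agents)) ; bad request: throw it away
        (recur (rest url-crawl-agents) urls-to-crawl)

        :else ; a good page: harvest its URLs and carry on
        (recur (rest url-crawl-agents)
               (concat urls-to-crawl
                       (process-agent (first url-crawl-agents))))))
```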
Bands are better live
I go to gigs/concerts/shows a lot. Gigs are just better than records.
There is the sound of the audience booing or chatting or whooping or heckling or clapping. Listen to the effect that the audience had on Bob Dylan at his “Royal Albert Hall” gig in 1966.
You can see the musicians making their music in front of you: how that ringing guitar sound is produced, or how he pulls off that riff, or how the drummer and the bass player have to make eye contact before the time signature change. I saw Battles play at All Tomorrow's Parties last year and saw that Ian Williams does actually play the keyboard and guitar simultaneously.
The musicians play with more conviction because they are performing and they are having an effect not just on the air but on the people in front of them, and the low lights and emotional atmosphere give them license to scream the scream they felt when they first wrote the song.
The songs are different versions from those played on the record six months before because they can be adjusted in response to a changing idea of what sounds good, or to the discovery of a richer melody or simpler arrangement. The album version of Sunset Rubdown's Us Ones In Between has the piano marking out the melody and the rhythm. However, this live version has the piano nowhere and the song completely driven by a guitar string being alternately tightened and loosened.
Perhaps most tellingly, if a band has a live album, it is usually my favourite. Here are some examples on a special Playmary I made.
Audio recordings capture a good portion of the musical advantages of live gigs. YouTube is great for gig videos, but the experience is too diffused by video-hopping and varying sound quality and the ten minute limit: songs are good, albums are great.
I’m not quite sure where this is leading.
Famous bands are well documented and, just as importantly, well distributed. It is easy to buy a Bruce Springsteen or Bob Dylan live album. What if every gig was recorded and then put up on the ‘net? A lead going from the sound desk into a cassette recorder and, later, a lead from the cassette recorder to a computer would be enough. A quick upload to a website and it would be available to everyone.