There are a lot of APIs in Node, but some of them are more important than others. These core APIs will form the backbone of any Node app, and you’ll find yourself using them again and again.
The first API we are going to look at is
the Events API. This is
because, while abstract, it is a fundamental piece of making every other
API work. By having a good grip on this API, you’ll be able to use all the
other APIs effectively.
If you’ve ever programmed JavaScript in the browser, you’ll have used events before. However, the event model used in the browser comes from the DOM rather than JavaScript itself, and a lot of the concepts in the DOM don’t necessarily make sense out of that context. Let’s look at the DOM model of events and compare it to the implementation in Node.
The DOM has a user-driven event model based on user interaction, with a set of interface elements arranged in a tree structure (HTML, XML, etc.). This means that when a user interacts with a particular part of the interface, there is an event and a context, which is the HTML/XML element on which the click or other activity took place. That context has a parent and potentially children. Because the context is within a tree, the model includes the concepts of bubbling and capturing, which allow elements either up or down the tree to receive the event that was called.
For example, in an HTML list, a click event on
an <li> can be intercepted, during the capture phase, by a
listener on the <ul> that is its
parent. Conversely, after the <li> receives the event, it bubbles
back up to that same <ul> listener. Because JavaScript
objects don’t have this kind of tree structure, the model in Node is much
simpler.
Because the
event model is tied to the DOM in browsers, Node created the EventEmitter class to provide
some basic event functionality. All event functionality in Node revolves
around EventEmitter, which is
designed to be an interface class for other classes to extend. It
would be unusual to call an EventEmitter instance directly.
EventEmitter has a handful of methods, the
main two being on and emit. The class provides these methods for use
by other classes. The on
method creates an event listener for an event, as shown in Example 4-1.
The on
method takes two parameters: the name of the event to listen for and the
function to call when that event is emitted. Because
EventEmitter is an interface pseudoclass, the class
that inherits from EventEmitter is
expected to be invoked with the new keyword.
Let’s look at Example 4-2 to see how we create a
new class as a listener.
We begin this example by including the
util module so we can use the inherits method.
inherits provides a way for the
EventEmitter class to add its methods to the Server class we created. This
means all new instances of Server can be used as
EventEmitters.
We then include the events module. However, we want to access just
the specific EventEmitter class
inside that module. Note how EventEmitter is capitalized to show it is a
class. We didn’t use a createEventEmitter method, because we aren’t
planning to use an EventEmitter directly. We simply
want to attach its methods to the Server class we are going to make.
Once we have included the modules we need,
the next step is to create our basic Server class. This offers just one simple
function, which logs a message when it is initialized. In a real
implementation, we would decorate the Server class prototype with the functions that
the class would use. For the sake of simplicity, we’ve skipped that. The
important step is to use util.inherits
to add EventEmitter as a superclass
of our Server class.
When we want to use the Server class, we instantiate it with new Server(). This instance of Server will have access to the methods in the
superclass (EventEmitter), which
means we can add a listener to our instance using the on method.
Right now, however, the event listener we
added will never be called, because the abc event isn’t fired. We can fix this by
adding the code in Example 4-3 to emit the event.
Firing the event listener is as simple as calling the emit method that the Server instance inherited from EventEmitter. It’s important to note that
these events are instance-based. There are no
global events. When you call the on method, you attach to a specific EventEmitter-based object. Even the various
instances of the Server class don’t
share events. The instance s from the code in
Example 4-3 will not share the same events as
another Server instance, such as one
created by var z = new Server();.
An important part of using events is dealing with callbacks. Chapter 3 looks at best practices in much more depth, but we’ll look here at the mechanics of callbacks in Node. They use a few standard patterns, but first let’s discuss what is possible.
When calling emit, in
addition to the event name, you can also pass an arbitrary list of
parameters. Example 4-4 includes three such
parameters. These will be passed to the function listening to the event.
When you receive a request event from
the http server, for example, you
receive two parameters: req and
res. When the request event was
emitted, those parameters were passed as the second and third arguments
to the emit.
It is important to understand how Node calls
the event listeners because it will affect your programming style. When
emit() is called with arguments, the
code in Example 4-5 is used to call each
event listener.
This code uses both of the JavaScript
methods for calling a function from code. If emit() is passed with three or fewer
arguments, the method takes a shortcut and uses call. Otherwise, it uses the slower apply to pass all the arguments as an array. The important thing to recognize here,
though, is that Node makes both of these calls using the this argument directly. This means that the
context in which the event listeners are called is the context of
EventEmitter—not
their original context. Using Node REPL, you can see what is happening
when things get called by EventEmitter (Example 4-6).
Example 4-6. The changes in context caused by EventEmitter
> var EventEmitter = require('events').EventEmitter,
... util = require('util');
>
> var Server = function() {};
> util.inherits(Server, EventEmitter);
> Server.prototype.outputThis = function(output) {
... console.log(this);
... console.log(output);
... };
[Function]
>
> Server.prototype.emitOutput = function(input) {
... this.emit('output', input);
... };
[Function]
>
> Server.prototype.callEmitOutput = function() {
... this.emitOutput('innerEmitOutput');
... };
[Function]
>
> var s = new Server();
> s.on('output', s.outputThis);
{ _events: { output: [Function] } }
> s.emitOutput('outerEmitOutput');
{ _events: { output: [Function] } }
outerEmitOutput
> s.callEmitOutput();
{ _events: { output: [Function] } }
innerEmitOutput
> s.emit('output', 'Direct');
{ _events: { output: [Function] } }
Direct
true
>

The sample output first sets up a Server class. It includes functions to
emit the output event. The outputThis method is attached to the output event as an event listener. When we
emit the output event from various contexts, we stay
within the scope of the EventEmitter
object, so the value of this that
s.outputThis has access to is the one
belonging to the EventEmitter. Consequently, the
this variable must be passed in as a
parameter and assigned to a variable if we wish to make use of it in
event callback functions.
One of the core tasks of Node.js is to act as a web server. This is such a key part of the system that when Ryan Dahl started the project, he rewrote the HTTP stack for V8 to make it nonblocking. Although both the API and the internals for the original HTTP implementation have morphed a lot since it was created, the core activities are still the same. The Node implementation of HTTP is nonblocking and fast. Much of the code has moved from C into JavaScript.
HTTP uses a pattern that is common in Node.
Pseudoclass factories provide an easy way to create a new server.[7] The http.createServer()
method provides us with a new instance of the HTTP
Server class, which is the class we use to define the
actions taken when Node receives incoming HTTP requests. Besides the
Server class itself, the other main pieces of the HTTP
module (and of Node modules in general) are the events the Server class
fires and the data structures that are passed to the callbacks. Knowing
about these three kinds of component allows you to use the HTTP module
well.
Acting as an HTTP server is probably the most common current use case for Node. In Chapter 1, we set up an HTTP server and used it to serve a very simple request. However, HTTP is a lot more multifaceted than that. The server component of the HTTP module provides the raw tools to build complex and comprehensive web servers. In this chapter, we are going to explore the mechanics of dealing with requests and issuing responses. Even if you end up using a higher-level server such as Express, many of the concepts it uses are extensions of those defined here.
As we’ve already seen, the first step in
using HTTP servers is to create a new server using the http.createServer() method. This returns a new instance of the Server class, which
has only a few methods because most of the functionality is going to be
provided through using events. The http server class has six events and three
methods. The other thing to notice is how most of the methods are used
to initialize the server, whereas events are used during its
operation.
Let’s start by creating the smallest basic HTTP server code we can in Example 4-7.
This example is not
good code. However, it illustrates some important points. We’ll fix the
style shortly. The first thing we do is require the http module. Notice how we can chain methods
to access the module without first assigning it to a variable. Many
things in Node return a function,[8] which allows us to invoke those functions immediately.
From the included http module, we
call createServer. This doesn’t have
to take any arguments, but we pass it a function to attach to the
request event. Finally, we tell the
server created with createServer to
listen on port 8125.
We hope you never write code like this in real situations, but it does show the flexibility of the syntax and the potential brevity of the language. Let’s be a lot more explicit about our code. The rewrite in Example 4-8 should make it a lot easier to understand and maintain.
This example implements the minimal web
server again. However, we’ve started assigning things to named
variables. This not only makes the code easier to read than when it’s
chained, but also means you can reuse it. For example, it’s not uncommon
to use http more than once in a file.
You want to have both an HTTP server and an HTTP client, so reusing the
module object is really helpful. Even though JavaScript doesn’t force
you to think about memory, that doesn’t mean you should thoughtlessly
litter unnecessary objects everywhere. So rather than use an anonymous
callback, we’ve named the function that handles the request event. This is less about memory usage
and more about readability. We’re not saying you shouldn’t use anonymous
functions, but if you can lay out your code so it’s easy to find, that
helps a lot when maintaining it.
Remember to look at Part I of the book for more help with programming style. Chapters 1 and 2 deal with programming style in particular.
Because we didn’t pass the request event listener as part of the factory
method for the http Server object, we
need to add an event listener explicitly. Calling the on method from EventEmitter does this. Finally, as with the
previous example, we call the listen method with the port we want to
listen on. The http class provides
other functions, but this example illustrates the most important
ones.
The http
server supports a number of events, which are associated with either the
TCP or HTTP connection to the client. The connection and close events indicate the buildup or teardown of a TCP connection to a
client. It’s important to remember that some clients will be using HTTP
1.1, which supports keepalive. This means that their TCP connections may
remain open across multiple HTTP requests.
The request, checkContinue, upgrade, and clientError events are associated with HTTP requests. We’ve already used the
request event, which signals a new
HTTP request.
The checkContinue event is a special case.
It allows you to take more direct control of an HTTP request in which
the client streams chunks of data to the server. As the client sends
data to the server, it will check whether it can continue, at which
point this event will fire. If an event handler is created for this
event, the request event will
not be emitted.
The upgrade event is emitted when a client asks
for a protocol upgrade. The http
server will deny HTTP upgrade requests unless there is an event handler
for this event.
Finally, the clientError event passes on any error events
sent by the client.
The HTTP server emits a number of events. The
most common one is request, but you
can also get events associated with the TCP connection for the request, as well as with
other parts of the request life cycle.
When a new TCP stream is created for a
request, a connection event is
emitted. This event passes the TCP stream for the request as a
parameter. The stream is also available as a request.connection variable for each request
that happens through it. However, only one connection event will be emitted for each
stream. This means that many requests
can happen from a client with only one connection
event.
Node is also great when you want to make
outgoing HTTP connections. This is useful in many contexts, such as
using web services, connecting to document store databases, or just
scraping websites. You can use the same http module when doing HTTP requests, but
should use the http.ClientRequest
class. There are two factory methods for this class: a
general-purpose one and a convenience method. Let’s take a look at the
general-purpose case in Example 4-9.
The first thing you can see is that an
options object defines a lot of the
functionality of the request. We must provide the host name (although an IP address is also
acceptable), the port, and the
path. The method is optional and defaults to a value of
GET if none is specified. In essence,
the example is specifying that the request should be an HTTP GET request to http://www.google.com/ on port 80.
The next thing
we do is use the options object to
construct an instance of http.ClientRequest using
the factory method http.request().
This method takes an options
object and an optional callback argument. The passed callback listens to
the response event, and when a response
event is received, we can process the results of the request. In the
previous example, we simply output the response object to the console.
However, it’s important to notice that the body of the HTTP request is
actually received via a stream in the response object. Thus, you can subscribe to
the data event of the response object to get the data as it becomes
available (see the section Readable streams for more
information).
The final important point to notice is that
we had to end() the request. Because this was a GET request, we didn’t write any data to the
server, but for other HTTP methods,
such as PUT or POST, you may need to. Until we call the
end() method, request won’t initiate the HTTP request, because it doesn’t know whether
it should still be waiting for us to send data.
Since GET is such a common HTTP use case, there is a special factory method to support it in
a more convenient way, as shown in Example 4-10.
This example of http.get() does exactly the same thing as
the previous example, but it’s slightly more concise. We’ve dropped the
method attribute of the options
object, and left out the call to request.end() because it’s implied.
If you run the previous two examples, you
are going to get back raw Buffer
objects. As described later in this chapter, a Buffer is a special
class defined in Node to support the storage of arbitrary, binary
data. Although it’s certainly possible to work with these, you often
want a specific encoding, such as UTF-8 (an encoding for Unicode
characters). You can specify this with the response.setEncoding() method (see Example 4-11).
Example 4-11. Comparing raw Buffer output to output with a specified encoding
> var http = require('http');
> var req = http.get({host:'www.google.com', port:80, path:'/'}, function(res) {
... console.log(res);
... res.on('data', function(c) { console.log(c); });
... });
> <Buffer 3c 21 64 6f 63 74 79 70
...
65 2e 73 74>
<Buffer 61 72 74 54 69
...
69 70 74 3e>
>
> var req = http.get({host:'www.google.com', port:80, path:'/'}, function(res) {
... res.setEncoding('utf8');
... res.on('data', function(c) { console.log(c); });
... });
> <!doctype html><html><head><meta http-equiv="content-type
...
load.t.prt=(f=(new Date).getTime());
})();
</script>
>

In the first case, we do not call ClientResponse.setEncoding(), and we get
chunks of data in Buffers. Although
the output is abridged in the printout, you can see that it isn’t just
a single Buffer, but that several
Buffers have been returned with
data. In the second example, the data is returned as UTF-8 because we
specified res.setEncoding('utf8').
The chunks of data returned from the server are still the same, but
are given to the program as strings
in the correct encoding rather than as raw Buffers. Although the printout may not make
this clear, there is one string for
each of the original Buffers.
Not all HTTP is GET. You might also need to call POST,
PUT, and other HTTP methods that alter data on the other
end. This is functionally the same as making a GET request, except you are going to write
some data upstream, as shown in Example 4-12.
This example
is very similar to Example 4-10, but uses
the http.ClientRequest.write() method. This
method allows you to send data upstream, and as explained earlier, it
requires you to explicitly call http.ClientRequest.end() to indicate
you’re finished sending data. Whenever ClientRequest.write() is called, the data is
sent upstream (it isn’t buffered), but the server will not respond
until ClientRequest.end() is
called.
You can stream data to a server using
ClientRequest.write() by coupling
the writes to the data event of a
Stream. This is ideal if you need
to, for example, send a file from disk to a remote server over
HTTP.
The ClientResponse object stores a variety of information about the request. In general,
it is pretty intuitive. Some of its obvious properties that are often
useful include statusCode (which contains the HTTP
status) and headers (which is
the response headers object). Also hung off of ClientResponse are various streams and
properties that you may or may not want to interact with directly.
The URL
module provides tools for easily parsing and dealing with URL
strings. It’s extremely useful when you have to deal with URLs. The
module offers three methods: parse,
format, and resolve. Let’s start by looking at Example 4-13,
which demonstrates parse
using Node REPL.
Example 4-13. Parsing a URL using the URL module
> var URL = require('url');
> var myUrl = "http://www.nodejs.org/some/url/?with=query&param=that&are=awesome#alsoahash";
> myUrl
'http://www.nodejs.org/some/url/?with=query&param=that&are=awesome#alsoahash'
> parsedUrl = URL.parse(myUrl);
{ href: 'http://www.nodejs.org/some/url/?with=query&param=that&are=awesome#alsoahash'
, protocol: 'http:'
, slashes: true
, host: 'www.nodejs.org'
, hostname: 'www.nodejs.org'
, hash: '#alsoahash'
, search: '?with=query&param=that&are=awesome'
, query: 'with=query&param=that&are=awesome'
, pathname: '/some/url/'
}
> parsedUrl = URL.parse(myUrl, true);
{ href: 'http://www.nodejs.org/some/url/?with=query&param=that&are=awesome#alsoahash'
, protocol: 'http:'
, slashes: true
, host: 'www.nodejs.org'
, hostname: 'www.nodejs.org'
, hash: '#alsoahash'
, search: '?with=query&param=that&are=awesome'
, query:
{ with: 'query'
, param: 'that'
, are: 'awesome'
}, pathname: '/some/url/'
}
>

The first thing we do, of course, is require
the URL module. Note that the names
of modules are always lowercase. We’ve created a url as a string containing all the parts that
will be parsed out. Parsing is really easy: we just call the parse method from the URL module on the string. It returns a data
structure representing the parts of the parsed URL. The components it
produces are:
The href
is the full URL that was originally
passed to parse. The protocol is the
protocol used in the URL (e.g.,
http://, https://, ftp://, etc.). host is the fully qualified hostname of the
URL. This could be as simple as the
hostname for a local server, such as print
server, or a fully qualified domain name such as www.google.com. It might also include a port
number, such as 8080, or username and
password credentials like un:pw@ftpserver.com. The various parts of the
hostname are broken down further into auth, containing just the user credentials;
port, containing just the port; and
hostname, containing the hostname
portion of the URL. An important
thing to know about hostname is that
it is still the full hostname, including the top-level domain (TLD;
e.g., .com, .net, etc.) and the specific server. If the
URL were http://sport.yahoo.com/nhl, hostname would not give you just the TLD
(yahoo.com) or just the host
(sport), but the entire hostname
(sport.yahoo.com). The URL module doesn’t have the capability to
split the hostname down into its components, such as domain or
TLD.
The next set of components of the URL
relates to everything after the host.
The pathname is the entire filepath
after the host. In http://sport.yahoo.com/nhl, it is /nhl. The next component is the search component, which stores the HTTP GET parameters in the URL. For example,
if the URL were http://mydomain.com/?foo=bar&baz=qux, the
search component would be ?foo=bar&baz=qux. Note the inclusion of
the ?. The query parameter is similar to the search component. It contains one of two
things, depending on how parse was
called.
parse
takes two arguments: the url string
and an optional Boolean that determines whether the query string should be parsed using the
querystring module, discussed in the
next section. If the second argument is false, query will just contain a string similar to
that of search but without the
leading ?. If you don’t pass anything
for the second argument, it defaults to false.
The final component is the fragment portion of the URL. This is the part
of the URL after the #. Commonly,
this is used to refer to named anchors in HTML pages. For instance, http://abook.com/#chapter2 might refer to the
second chapter on a web page hosting a whole book. The hash component in this case would contain
#chapter2. Again, note the included
# in the string. Some sites, such as
http://twitter.com, use more complex
fragments for AJAX applications, but the same rules apply. So the URL
for the Twitter mentions account, http://twitter.com/#!/mentions, would have a
pathname of / but a hash of #!/mentions.
The querystring module is a very simple helper module to deal with query strings.
As discussed in the previous section, query strings are the parameters
encoded at the end of a URL. However, when reported back as just a
JavaScript string, the parameters are fiddly to deal with. The querystring module provides an easy way to
create objects from the query strings. The main methods it offers are parse and
decode, but some internal helper
functions (such as escape,
unescape, unescapeBuffer, encode, and stringify) are also exposed. If you have a
query string, you can use parse to
turn it into an object, as shown in Example 4-14.
Here, the class’s parse function turns the query string into an
object in which the properties are the keys and the values correspond to
the ones in the query string. You should notice a few things, though.
First, the numbers are returned as strings, not numbers. Because
JavaScript is loosely typed and will coerce a string into a number in a
numerical operation, this works pretty well. However, it’s worth bearing
in mind for those times when that coercion doesn’t work.
Additionally, it’s important to note that
you must pass the query string without the leading ? that demarks it in the URL. A typical URL
might look like http://www.bobsdiscount.com/?item=304&location=san+francisco.
The query string starts with a ? to
indicate where the filepath ends, but if you include the ? in the string you pass to parse, the first key will start with a
?, which is almost certainly not what
you want.
This library is really useful in a bunch of
contexts because query strings are used in situations other than URLs.
When you get content from an HTTP
POST that is x-form-encoded, it
will also be in query string form. All the browser manufacturers have
standardized around this approach. By default, forms in HTML will send
data to the server in this way also.
The querystring module is also used as a helper
module to the URL module.
Specifically, when decoding URLs, you can ask URL to turn the query string into an object
for you rather than just a string. That’s described in more detail in
the previous section, but the parsing that is done uses the parse method from querystring.
Another important part of querystring is encode (Example 4-15).
This function takes a query string’s key-value pair object and
stringifies it. This is really useful when you’re working with HTTP requests, especially POST data. It makes it easy to work with a
JavaScript object until you need to send the data over the wire and then
simply encode it at that point. Any JavaScript object can be used, but
ideally you should use an object that has only the data that you want in
it because the encode method will add
all properties of the object. However, if the property value isn’t a
string, Boolean, or number, it won’t be serialized and the key will just
be included with an empty value.
I/O is one of the core pieces that makes Node different from other frameworks. This section explores the APIs that provide nonblocking I/O in Node.
Many components in Node provide continuous
output or can process continuous input. To make these components act in
a consistent way, the stream API
provides an abstract interface for them. This API provides common
methods and properties that are available in specific implementations of
streams. Streams can be readable, writable, or both. All streams
are EventEmitter
instances, allowing them to emit events.
The readable stream API is a set of methods and events that provides
access to chunks of data as they are sent by an underlying data
source. Fundamentally, readable streams are about emitting data events. These events represent the
stream of data as a stream of events. To make this manageable, streams
have a number of features that allow you to configure how much data
you get and how fast.
The basic stream in Example 4-16 simply reads data from a file in chunks.
Every time a new chunk is made available, it is exposed to a callback
in the variable called data. In
this example, we simply log the data to the console. However, in real
use cases, you might either stream the data somewhere else or spool it
into bigger pieces before you work on it. In essence, the data event simply
provides access to the data, and you have to figure out what to do
with each chunk.
Let’s look in more detail at one of the common patterns used in dealing with streams. The spooling pattern is used when we need an entire resource available before we deal with it. We know it’s important not to block the event loop for Node to perform well, so even though we don’t want to perform the next action on this data until we’ve received all of it, we don’t want to block the event loop. In this scenario (Example 4-17), we use a stream to get the data, but use the data only when enough is available. Typically this means when the stream ends, but it could be another event or condition.
The filesystem module is obviously very helpful because you need it in order to access files on disk. It closely mimics the POSIX style of file I/O. It is a somewhat unique module in that all of the methods have both asynchronous and synchronous versions. However, we strongly recommend that you use the asynchronous methods, unless you are building command-line scripts with Node. Even then, it is often much better to use the async versions, even though doing so adds a little extra code, so that you can access multiple files in parallel and reduce the running time of your script.
The main issue that people face while dealing with asynchronous calls is ordering, and this is especially true with file I/O. It’s common to want to do a number of moves, renames, copies, reads, or writes at one time. However, if one of the operations depends on another, this can create issues because return order is not guaranteed. This means that the first operation in the code could happen after the second operation in the code. Patterns exist to make ordering easy. We talked about them in detail in Chapter 3, but we’ll provide a recap here.
Consider the case of reading and then deleting a file (Example 4-18). If the delete (unlink) happens before the read, it will be impossible to read the contents of the file.
Notice that we are using the asynchronous methods, and although we have created callbacks, we haven’t written any code that defines in which order they get called. This often becomes a problem for programmers who are not used to programming in event loops. This code looks OK on the surface and sometimes it will work at runtime, but sometimes it won’t. Instead, we need to use a pattern in which we specify the ordering we want for the calls. There are a few approaches. One common approach is to use nested callbacks. In Example 4-19, the asynchronous call to delete the file is nested within the callback to the asynchronous function that reads the file.
This approach is often very effective for discrete sets of operations. In our example with just two operations, it’s easy to read and understand, but this pattern can potentially get out of control.
Although Node is JavaScript, it is
JavaScript out of its usual environment. For instance, the browser
requires JavaScript to perform many functions, but manipulating binary
data is rarely one of them. Although JavaScript does support bitwise
operations, it doesn’t have a native representation of binary data. This
is especially troublesome when you also consider the limitations of the
number type system in JavaScript, which might otherwise lend itself to
binary representation. Node introduces the Buffer class to make
up for this shortfall when you’re working with binary data.
Buffers are an extension to the V8 engine,
which means that they have their own set of pitfalls. Buffers are
actually a direct allocation of memory, which may mean a little or a
lot, depending on your experience with lower-level computer languages.
Unlike the data types in JavaScript, which abstract some of the ugliness
of storing data, Buffer provides
direct memory access, warts and all. Once a Buffer is created, it is a fixed size.
If you want to add more data, you must clone the Buffer into a larger
Buffer. Although some of these features may seem
frustrating, they allow Buffer to
perform at the speed necessary for many data operations on the server.
It was a conscious design choice to trade off some programmer
convenience for performance.
We thought it was important to include this quick primer on working with binary data for those who haven’t done much of it, or as a refresher for those of us who haven’t in a long time (which was true for us when we started working with Node). Computers, as almost everyone knows, work by manipulating states of “on” and “off.” We call this a binary state because there are only two possibilities. Everything in computers is built on top of this, which means that working directly with binary can often be the fastest method on the computer. To do more complex things, we collect “bits” (each representing a single binary state) into groups of eights, often called an octet or, more commonly, a byte.[9] This allows us to represent bigger numbers than just 0 or 1.
By creating sets of 8 bits, we are able to represent any number from 0 to 255. The rightmost bit represents 1, but then we double the value of the number represented by each bit as we move left. To find out what number it represents, we simply sum the numbers in column headers (Example 4-20).
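For instance, the byte 10010101 has its 128, 16, 4, and 1 bits set, which sum to 149. You can check this kind of arithmetic in JavaScript with parseInt(), which accepts a radix:

```javascript
// Convert a string of bits to a number by parsing it as base 2.
// 10010101 = 128 + 16 + 4 + 1 = 149
var value = parseInt('10010101', 2);
console.log(value); // 149
```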
You’ll also see the use of hexadecimal notation, or “hex,” a lot. Because bytes
need to be easily described and a string of eight 0s and 1s isn’t very
convenient, hex notation has become popular. Binary notation is base
2, in that there are only two possible states per digit (0 or 1). Hex
uses base 16, and each digit in hex can have a value from 0 to F,
where the letters A through F (or their lowercase equivalents) stand
for 10 through 15, respectively. What’s very convenient about hex is
that with two digits we can represent a whole byte. The right digit
represents 1s, and the left digit represents 16s. If we wanted to
represent decimal 149, it is (16 x 9) + (5 x
1), or the hex value 95.
In JavaScript, you can create a number from a hex value using the
notation 0x in front of the hex
value. For instance, 0x95 is
decimal 149. In Node, you’ll commonly see Buffers represented by hex values in console.log()
output or Node REPL. Example 4-22 shows how you
could store 3-octet values (such as an RGB color value) as a
Buffer.
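Example 4-22 isn't reproduced here, but a minimal sketch of the idea might look like the following (the color values are made up for illustration):

```javascript
// Store one RGB color in a 3-byte Buffer, one octet per channel.
// These particular channel values are hypothetical.
var color = new Buffer(3);
color[0] = 0x95; // red: 149
color[1] = 0x20; // green: 32
color[2] = 0xd9; // blue: 217
console.log(color); // printed in hex: <Buffer 95 20 d9>
```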
So how does binary relate to other kinds of data? Well, we’ve seen how binary can represent numbers. In network protocols, it’s common to specify a certain number of bytes to convey some information, using particular bits in fixed places to indicate specific things. For example, in a DNS request, the first two bytes are used as a number for a transaction ID, whereas the next byte is treated as individual bits, each used to indicate whether a specific feature of DNS is being used in this request.
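As a rough sketch of that DNS layout (the byte values here are invented, and only the QR flag is pulled out of the flags byte), reading those fields from a Buffer looks like this:

```javascript
// Hypothetical first four bytes of a DNS request.
var packet = new Buffer([0xab, 0xcd, 0x01, 0x00]);

// Bytes 0-1 form the 16-bit transaction ID, big-endian.
var txnId = (packet[0] << 8) | packet[1]; // 0xabcd

// Byte 2 is treated as individual bits; the top bit is the QR flag
// (0 for a query, 1 for a response).
var qr = (packet[2] & 0x80) >> 7;
```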
The other extremely common use of binary is to represent strings. The two most common “encoding” formats for strings are ASCII and UTF (typically UTF-8). These encodings define how the bits should be converted into characters. We’re not going to go into too much of the gory detail, but essentially, encodings work by having a lookup table that maps the character to a specific number represented in bytes. To convert the encoding, the computer has to simply convert from the number to the character by looking it up in a conversion table.
ASCII characters (some of which are nonvisible “control characters,” such as Return) are always exactly 7 bits each, so they can be represented by values from 0 to 127. The eighth bit in a byte is often used to extend the character set to represent various choices of international characters (such as ȳ or ȱ).
UTF is a little more complex. Its character set has a lot more characters, including many international ones. Each character in UTF-8 is represented by at least 1 byte, but sometimes up to 4. Essentially, the first 128 values are good old ASCII, whereas the others are pushed further down in the map and represented by higher numbers. When a less common character is referenced, the first byte uses a number that tells the computer to check out the next byte to find the real address of the character starting on the second sheet of its map. If the character isn’t on the second sheet of the map, the second byte tells the computer to look at the third, and so on. This means that in UTF-8, the length of a string measured in characters isn’t necessarily the same as its length in bytes, whereas in ASCII the two are always equal.

It is important to remember that once you copy things to a Buffer, they will be stored as their binary
representations. You can always convert the binary representation in
the buffer back into other things, such as strings, later. So a
Buffer is defined only by its size,
not by the encoding or any other indication of its meaning.
Given that Buffer is opaque, how big does it need to be
in order to store a particular string of input? As we’ve said, a UTF
character can occupy up to 4 bytes, so to be safe, you should define a
Buffer to be four times the size of
the largest input you can accept, measured in UTF characters. There
may be ways you can reduce this burden; for instance, if you limit
your input to European languages, you’ll know there will be at most 2
bytes per character.
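A sketch of that worst-case sizing, where maxChars is a hypothetical limit your application enforces on input length:

```javascript
// Assumed application-level limit on input, measured in characters.
var maxChars = 256;
// Worst case for UTF-8: 4 bytes per character.
var buf = new Buffer(maxChars * 4);
console.log(buf.length); // 1024
```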
Buffers
can be created using three possible parameters: the length of
the Buffer in bytes, an array of bytes to copy into
the Buffer, or a string to copy into the
Buffer. The first and last methods are by far the
most common. There aren’t too many instances where you are likely to
have a JavaScript array of bytes.[10]
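The three forms look like this (the byte values and string are arbitrary):

```javascript
var byLength = new Buffer(4);           // allocate 4 bytes
var byArray = new Buffer([0xde, 0xad]); // copy an array of bytes
var byString = new Buffer('hello');     // copy a string (UTF-8 by default)
console.log(byLength.length, byArray.length, byString.length); // 4 2 5
```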
Creating a Buffer of a particular size is a very common scenario and easy to deal with.
Simply put, you specify the number of bytes as your argument when
creating the Buffer (Example 4-23).
As you can see from the previous example,
when we create a Buffer we get a
matching number of bytes. However, because the Buffer is just getting an allocation of
memory directly, it is uninitialized and the
contents are left over from whatever happened to occupy them before.
This is unlike all the native JavaScript types, which initialize all
memory so that when you create a new primitive or object, it doesn’t
assign whatever was already in the memory space to the primitive or
object you just created. Here is a good way to think about it. If you
go to a busy cafe and you want a table, the fastest way to get one is
to sit down as soon as some other people vacate one. However, although
it’s fast, you are left with all their dirty dishes and the detritus
from their meals. You might prefer to wait for one of the staff to
clear the table and wipe it down before you sit. This is a lot like
Buffers versus native types.
Buffers do very little to make
things easy for you, but they do give you direct and fast access to
memory. If you want to have a nicely zeroed set of bits, you’ll need
to do it yourself (or find a helper library).
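To wipe the table down yourself, you can overwrite the whole Buffer with zeros using fill():

```javascript
// A fresh Buffer's contents are whatever happened to be in memory.
var buf = new Buffer(4);
// After fill(), every byte is a known 0.
buf.fill(0);
```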
Creating a Buffer using byte length is most common when
you are working with things such as network transport protocols that
have very specifically defined structures. When you know exactly how
big the data is going to be (or you know exactly how big it could be)
and you want to allocate and reuse a Buffer for performance reasons, this is the
way to go.
Probably the most common way to use a
Buffer is to create it with a
string of either ASCII or UTF-8 characters. Although a Buffer can hold any data, it is particularly
useful for I/O with character data because the constraints we’ve
already seen on Buffer can make
their operations much faster than operations on regular strings. So
when you are building really highly scalable apps, it’s often worth
using Buffers to hold strings. This
is especially true if you are just shunting the strings around the
application without modifying them. Therefore, even though strings
exist as primitives in JavaScript, it’s still very common to keep
strings in Buffers in Node.
When we create a Buffer with a string, as shown in Example 4-24, it defaults to UTF-8. That is, if you
don’t specify an encoding, it will be considered a UTF-8 string. That
is not to say that Buffer pads the
string to fit any Unicode character (blindly allocating 4 bytes per
character), but rather that it will not truncate characters. In this
example, we can see that when taking a string with just lowercase
alpha characters, the Buffer uses
the same byte structure, whatever the encoding, because they all fall
in the same range. However, when we have an “é,” it’s encoded as 2
bytes in the default UTF-8 case or when we specify UTF-8 explicitly.
If we specify ASCII, the character is truncated to a single byte.
Example 4-24. Creating Buffers using strings
> new Buffer('foobarbaz');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('foobarbaz', 'ascii');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('foobarbaz', 'utf8');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('é');
<Buffer c3 a9>
> new Buffer('é', 'utf8');
<Buffer c3 a9>
> new Buffer('é', 'ascii');
<Buffer e9>
>
Node offers a number of operations to simplify working with strings and Buffers. First, you don’t need to compute
the length of a string before creating a Buffer to hold it; just assign the string as
the argument when creating the Buffer. Alternatively, you can use the Buffer.byteLength() method. This method
takes a string and an encoding and returns the string’s length in
bytes, rather than in characters as String.length does.
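A quick sketch of the difference between the two measures:

```javascript
var s = 'café';
console.log(s.length);                     // 4 characters
console.log(Buffer.byteLength(s, 'utf8')); // 5 bytes, because 'é' takes 2
```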
You can also write a string to an existing
Buffer. The Buffer.write() method writes a string to a specific index of a Buffer. If there is room in the Buffer starting from the specified offset,
the entire string will be written. Otherwise, characters are truncated
from the end of the string to fit the Buffer. In either case, Buffer.write() will return the number of
bytes that were written. In the case of UTF-8 strings, if a whole
character can’t be written to the Buffer, none of the bytes for that character
will be written. In Example 4-25, because the
Buffer is too small for even one
non-ASCII character, it ends up empty.
In a single-byte
Buffer, it’s possible to write an “a” character,
and doing so returns 1, indicating
that 1 byte was written. However, trying to write a “é” character
fails because it requires 2 bytes, and the method returns
0 because nothing was written.
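Example 4-25 isn’t reproduced here, but the behavior it describes can be sketched like this:

```javascript
var one = new Buffer(1);    // a single-byte Buffer
var wrote = one.write('a'); // 'a' fits in 1 byte
console.log(wrote);         // 1
var none = one.write('é');  // 'é' needs 2 bytes in UTF-8
console.log(none);          // 0: no partial character is written
```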
There is a little more complexity to
Buffer.write(), though. If
possible, when writing UTF-8, Buffer.write() will terminate the character
string with a NUL character.[11] This is much more significant when writing into the
middle of a larger Buffer.
In Example 4-26,
after creating a Buffer that is 5
bytes long (which could have been done directly using the string), we
write the character f to the entire
Buffer. f is the character code 0x66 (102 in
decimal). This makes it easy to see what happens when we write the
characters “ab” to the Buffer
starting with an offset of 1. The zeroth character is left as f. At positions 1 and 2, the characters
themselves are written, 61 followed by 62. Then Buffer.write() inserts a terminator, in this
case a null character of 0x00.
Borrowed from the Firebug debugger in
Firefox, the simple console.log command allows you
to easily output to stdout without using any modules (Example 4-27). It also offers some pretty-printing
functionality to help enumerate objects.
[7] When we talk about a pseudoclass, we are referring to the definition found in Douglas Crockford’s JavaScript: The Good Parts (O’Reilly). From now on, we will use “class” to refer to a “pseudoclass.”
[8] This works in JavaScript because it supports first-class functions.
[9] There is no “standard” size of byte, but the de facto size that virtually everyone uses nowadays is 8 bits. Therefore, octets and bytes are equivalent, and we’ll be using the more common term byte to mean specifically an octet.
[10] It’s very memory-inefficient, for one thing. If you store each byte as a number, for instance, you are using a 64-bit memory space to represent 8 bits.
[11] This generally just means a binary 0.