A simple write-up on what the WebSocket protocol is and how it works.

The WebSocket protocol enables two-way, real-time communication between a client and a server over a single TCP connection. A well-known example where WebSockets are beneficial is a chat application. By implementing chat over the WebSocket protocol we get a live view of the chat window without having to reload the browser page in order to fetch new messages.

High level overview

At a high level, the WebSocket protocol works like this:

  • The client establishes a WebSocket connection by sending an HTTP request that asks to upgrade to the WebSocket protocol;
  • The server responds, confirming the upgrade request;
  • The client and server keep the TCP connection open;
  • The client and server send WebSocket frames over the TCP connection until one side closes it.

WebSocket Handshake

In order to establish the WebSocket connection, the client must send an HTTP GET request to the WebSocket path. In the request the client sets the following headers:

GET /websocket_path HTTP/1.1
Host: <server-host>
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Key: <random-key-generated-by-the-client>
Sec-WebSocket-Version: 13
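
As a rough illustration of the request above, here is a minimal Python sketch that builds the upgrade request as raw bytes. The function name build_handshake_request and its host and path parameters are made up for this example. It assumes the rules from RFC 6455: the key is 16 random bytes encoded in base64, and the Host and Sec-WebSocket-Version: 13 headers are sent along with the upgrade headers.

```python
import base64
import os


def build_handshake_request(host: str, path: str) -> tuple[bytes, str]:
    """Return the raw HTTP upgrade request and the key that was sent.

    Hypothetical helper for illustration only; a real client would also
    handle optional headers such as Origin or Sec-WebSocket-Protocol.
    """
    # The key is 16 random bytes, base64-encoded.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: Upgrade",
        "Upgrade: websocket",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",  # the two empty strings produce the blank line
        "",  # that terminates the HTTP header block
    ]
    return "\r\n".join(lines).encode("ascii"), key
```

The client keeps the returned key around so that it can later check the hash the server sends back.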

Then the server responds with HTTP status 101 Switching Protocols and the following headers:

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Accept: <accept-hash>

Key exchange

The value of <accept-hash> is generated by the server:

  1. The server takes the value from the Sec-WebSocket-Key header from the initial request;
  2. The server appends a specific magic value to this key: 258EAFA5-E914-47DA-95CA-C5AB0DC85B11
  3. The server computes a SHA-1 hash of the concatenated string;
  4. Sec-WebSocket-Accept is the base64 encoding of that SHA-1 hash.

As an example:

  1. The client sends hello-world in the Sec-WebSocket-Key header (a real client sends 16 random bytes encoded in base64)
  2. The server will produce the value hello-world258EAFA5-E914-47DA-95CA-C5AB0DC85B11
  3. Its SHA-1 hash will be: 825c975a9967274ed44b442082b6896953257441
  4. The base64 encoding of the hash will be: glyXWplnJ07US0QggraJaVMldEE= (make sure to encode the raw 20 bytes of the hash and not its hexadecimal string)
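
The four steps above can be sketched in a few lines of Python using only the standard library. The names compute_accept and WEBSOCKET_GUID are chosen for this example. Note that digest() returns the raw hash bytes, which is what must be base64-encoded, rather than the hexadecimal string.

```python
import base64
import hashlib

# The fixed "magic" value that the server appends to the client's key.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key: str) -> str:
    """Compute the accept hash for a given Sec-WebSocket-Key value."""
    concatenated = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    # digest() gives the 20 raw bytes; hexdigest() would be the wrong input.
    digest = hashlib.sha1(concatenated).digest()
    return base64.b64encode(digest).decode("ascii")
```

With the sample key from RFC 6455, dGhlIHNhbXBsZSBub25jZQ==, this returns s3pPLMBiTxaQ9kYGzzhZRbK+xOo=.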

WebSocket communication

Once the connection is established, two-way communication between the client and the server happens through WebSocket frames.

WebSocket frame

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
|N|V|V|V|       |S|             |   (if payload len==126/127)   |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
|     Extended payload length continued, if payload len == 127  |
+ - - - - - - - - - - - - - - - +-------------------------------+
|                               |Masking-key, if MASK set to 1  |
+-------------------------------+-------------------------------+
| Masking-key (continued)       |          Payload Data         |
+-------------------------------- - - - - - - - - - - - - - - - +
:                     Payload Data continued ...                :
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
|                     Payload Data continued ...                |
+---------------------------------------------------------------+

Masking

To prevent cache-poisoning attacks against intermediaries such as proxies, every client-to-server frame must be masked. The mask is a 4-byte value that is XOR’ed with the payload, 4 bytes at a time. The client generates a new random mask for every frame.

Digesting the frame:

FIN

The first bit of the WebSocket frame is the finish bit.

  • The bit will be set to 0 if continuation frames with more data for the same message will follow.
  • The bit will be set to 1 if the frame is the last one for the current message. Most messages are sent as a single frame, so the bit is usually 1, as there is effectively no limit on the size of a WebSocket frame. Fragmented messages are still allowed, however, so a receiver should not rely on this.
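
To show how the FIN bit is used on the receiving side, here is a simplified Python sketch that joins fragments back into whole messages. The function reassemble and its input format, a sequence of already decoded (fin, opcode, payload) tuples, are assumptions made for this example. It ignores control frames such as ping or close, which a real implementation has to handle even in the middle of a fragmented message.

```python
def reassemble(frames):
    """Join fragmented frames into complete messages.

    `frames` is an iterable of (fin, opcode, payload) tuples. Simplified:
    control frames and opcode validation are deliberately left out.
    """
    messages = []
    buffer = bytearray()
    for fin, _opcode, payload in frames:
        buffer += payload
        if fin:
            # FIN = 1 marks the last frame of the current message.
            messages.append(bytes(buffer))
            buffer.clear()
    return messages
```

For example, a text frame with FIN = 0 carrying "Hel" followed by a continuation frame with FIN = 1 carrying "lo" yields the single message "Hello".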

RSV (1, 2, 3)

The next 3 bits are the reserved bits. They have no meaning in the base protocol and exist to allow for extensions. They must be set to 0 unless an extension that defines them has been negotiated (permessage-deflate, for example, uses RSV1).

Opcode

The next 4 bits contain the operation code, which specifies the type of the WebSocket frame.

  • 0000 - For a continuation frame.
  • 0001 - For text.
  • 0010 - For binary.
  • 1000 - To close the connection.
  • 1001 - For a ping.
  • 1010 - For a pong.
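
The FIN, RSV and opcode fields all live in the first byte of the frame, so they can be pulled apart with a few bit masks. The sketch below is only an illustration; parse_first_byte and OPCODE_NAMES are names invented for it, and the opcode table follows RFC 6455.

```python
# Opcodes defined by RFC 6455; all other values are reserved.
OPCODE_NAMES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}


def parse_first_byte(byte: int) -> dict:
    """Split the first byte of a frame into FIN, RSV1-3 and the opcode."""
    return {
        "fin": bool(byte & 0b10000000),   # highest bit
        "rsv1": bool(byte & 0b01000000),
        "rsv2": bool(byte & 0b00100000),
        "rsv3": bool(byte & 0b00010000),
        "opcode": byte & 0b00001111,      # lowest 4 bits
    }
```

A first byte of 0x81 (binary 10000001) therefore decodes to FIN = 1 with the text opcode, which is how a complete, unfragmented text message starts.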

Mask

The next bit in the frame is the mask bit.

  • 1 - if the payload is masked.
  • 0 - if the payload is unmasked. Data is masked when going from client to server and unmasked when going from server to client.

Payload length

This part is a bit tricky, as the number of bits used to represent the payload length depends on the size of the payload. The payload length can be represented in 7, 16 or 64 bits.

  • If the payload length is < 126 bytes:
    • The length will be represented in 7 bits.
  • If the payload length is >= 126 bytes AND < 65536 bytes:
    • The 7 bit length will be exactly 126: 1111110
    • The next 16 bits represent the payload length.
  • If the payload length is >= 65536 bytes:
    • The 7 bit length will be exactly 127: 1111111
    • The next 64 bits represent the payload length.

To put it more simply, in order to read the frame length:

  • Read the 7 bit length:
    • If the value is 126, read the next 16 bits as the length;
    • If the value is 127, read the next 64 bits as the length;
    • Else, the value itself is the length.
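
The reading rules above translate fairly directly into code. The following Python sketch uses a made-up function, read_payload_length, and assumes the frame is already available as a bytes object containing at least the full header. It also assumes that a set MASK bit means the payload is masked and that the extended lengths are big-endian (network byte order), as RFC 6455 defines.

```python
import struct


def read_payload_length(frame: bytes) -> tuple[bool, int, int]:
    """Return (masked, payload_length, offset_after_length_fields)."""
    second = frame[1]
    masked = bool(second & 0x80)   # highest bit of the second byte
    length = second & 0x7F         # remaining 7 bits
    offset = 2
    if length == 126:
        # The real length is in the next 16 bits.
        (length,) = struct.unpack_from("!H", frame, offset)
        offset += 2
    elif length == 127:
        # The real length is in the next 64 bits.
        (length,) = struct.unpack_from("!Q", frame, offset)
        offset += 8
    return masked, length, offset
```

The returned offset points at the masking key when the frame is masked, and at the first payload byte otherwise.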

This means that the max length of payload in one WebSocket frame is 9,223,372,036,854,775,807 bytes (2^63 - 1, because the most significant bit of the 64 bit length must be 0), or roughly 9.2 exabytes. This is why the FIN bit is set to 1 in the vast majority of frames, although fragmented messages are still permitted.

Masking key

The next 4 bytes are the masking key used to mask the payload data, if the payload is masked.

  • If the MASK bit = 1, the next 4 bytes are the mask.
  • If the MASK bit = 0, the payload begins here.

The server receives masked data, where every 4 bytes of the payload have been XOR’ed with the mask by the client. This means that the server has to read the payload 4 bytes at a time and XOR them with the mask in order to recover the original payload.

An important detail here is that if the payload length is not a multiple of 4, the server uses only as many mask bytes as needed for the final chunk. Put differently, payload byte i is always XOR’ed with mask byte i mod 4.

A quick XOR example:

XOR - “Exclusive OR” is a logical operation that is true only if its arguments differ, so:

  • 1 XOR 1 would produce 0;
  • 0 XOR 0 would produce 0;
  • 0 XOR 1 would produce 1;
  • 1 XOR 0 would produce 1;

If 4 bytes of the unmasked message are:

01001001 01000011 01010101 00100001

And the random mask is:

01111011 00100010 01110101 01110011

XOR’ed payload will look like this:

00110010 01100001 00100000 01010010

So when the server receives these bits and XORs them with the mask, it gets the original message bits:

01001001 01000011 01010101 00100001
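
The same example can be reproduced with a short Python sketch. The function name apply_mask is chosen for this illustration. Because XOR is its own inverse, the one function both masks and unmasks, and indexing the mask with i % 4 also takes care of payloads whose length is not a multiple of 4.

```python
def apply_mask(payload: bytes, mask: bytes) -> bytes:
    """XOR every payload byte with the matching byte of the 4-byte mask."""
    return bytes(b ^ mask[i % 4] for i, b in enumerate(payload))


# The bytes from the example above, written as binary literals.
message = bytes([0b01001001, 0b01000011, 0b01010101, 0b00100001])
mask = bytes([0b01111011, 0b00100010, 0b01110101, 0b01110011])

masked = apply_mask(message, mask)       # what the client sends
unmasked = apply_mask(masked, mask)      # what the server recovers
```

Here masked holds the bits 00110010 01100001 00100000 01010010, and unmasked is equal to the original message again.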

Payload data

All that’s left in the frame is the actual payload data: what is being communicated between the server and the client.
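
Putting the pieces together, here is a compact Python sketch that builds a single frame and reads it back. The functions encode_frame and decode_frame are hypothetical helpers written for this write-up, not part of any library. They assume a complete, unfragmented frame (FIN = 1, RSV bits 0) held entirely in memory, and they skip the validation a real implementation needs.

```python
import os
import struct


def encode_frame(payload: bytes, opcode: int = 0x1, mask: bool = True) -> bytes:
    """Build one frame with FIN = 1 and all RSV bits set to 0."""
    header = bytearray([0x80 | opcode])
    mask_bit = 0x80 if mask else 0x00
    length = len(payload)
    if length < 126:
        header.append(mask_bit | length)
    elif length < 65536:
        header.append(mask_bit | 126)
        header += struct.pack("!H", length)   # 16 bit length
    else:
        header.append(mask_bit | 127)
        header += struct.pack("!Q", length)   # 64 bit length
    if mask:
        key = os.urandom(4)                   # new random mask per frame
        header += key
        payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    return bytes(header) + payload


def decode_frame(frame: bytes) -> tuple[bool, int, bytes]:
    """Return (fin, opcode, unmasked_payload) for one complete frame."""
    fin = bool(frame[0] & 0x80)
    opcode = frame[0] & 0x0F
    masked = bool(frame[1] & 0x80)
    length = frame[1] & 0x7F
    pos = 2
    if length == 126:
        (length,) = struct.unpack_from("!H", frame, pos)
        pos += 2
    elif length == 127:
        (length,) = struct.unpack_from("!Q", frame, pos)
        pos += 8
    if masked:
        key = frame[pos:pos + 4]
        pos += 4
        data = frame[pos:pos + length]
        return fin, opcode, bytes(b ^ key[i % 4] for i, b in enumerate(data))
    return fin, opcode, frame[pos:pos + length]
```

A client would call encode_frame with mask=True, while a server would send with mask=False; decode_frame handles both cases based on the MASK bit.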

And that’s it

Thank you for reading. There’s really not much else to the protocol. Hopefully this will help someone trying to grasp how the WebSocket protocol works behind the scenes.