Many of you may already know what happens behind the scenes when a web page is requested and served. This article explains the details and internals of what happens when you type http://www.clickoffline.com/index.php into your browser.
The Internet Architecture
The internet is a network of many smaller, interconnected networks that links a large number of content-providing servers with end users. Any network or machine connected to the internet can both provide content to it and retrieve content from it.
Like any other network, the internet supports a wide range of protocols. One of the most frequently used is HTTP (Hypertext Transfer Protocol), the main medium for content retrieval and delivery over the internet. Some of the other frequently used protocols are as follows:
- FTP – File Transfer Protocol
- SMTP – Simple Mail Transfer Protocol
- POP – Post Office Protocol
- IMAP – Internet Message Access Protocol
- HTTPS – Secure HTTP Protocol
All these protocols depend on the internet as the backbone that provides the infrastructure needed to communicate. However, they can be used on many kinds of networks, not just the internet. For example, you may run an FTP server on an intranet (a local network), which is not part of the internet.
The internet also has services that are used by other applications. DNS, the Domain Name System, is one of the most important. DNS works like a telephone directory: it is a collection of servers that hold lists of domain names matched with their IP addresses. Every machine on the internet can be assigned a unique IP address, but IP addresses are not easy to remember, so DNS lets us refer to machines by domain name instead.
A sample DNS table is shown below.
| Domain Name | IP Address |
So when you request a page from www.google.com, the browser sends a request to a DNS server to find a match for ‘www.google.com’. The DNS server, in turn, returns the IP address of that domain name by looking it up in the DNS table. When this operation is performed in reverse, i.e. identifying the domain name for a given IP address, it is referred to as a reverse DNS lookup, or rDNS.
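The telephone-directory idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the table is an in-memory dictionary, the names use the reserved example.com domain, and the addresses come from the 192.0.2.0/24 documentation range, so none of them are real records. The last two functions show the standard library calls that ask the operating system's real resolver; they need network access and are not called here.

```python
import socket

# Hypothetical DNS table: example.com names and documentation-range
# addresses, chosen for illustration only (not real DNS records).
DNS_TABLE = {
    "www.example.com": "192.0.2.10",
    "mail.example.com": "192.0.2.25",
}


def lookup(domain):
    """Forward lookup: domain name -> IP address, or None if unknown."""
    return DNS_TABLE.get(domain.lower())


def reverse_lookup(ip_address):
    """Reverse lookup (rDNS): IP address -> domain name, or None."""
    for domain, address in DNS_TABLE.items():
        if address == ip_address:
            return domain
    return None


def system_lookup(domain):
    """Ask the operating system's resolver (requires network access)."""
    return socket.gethostbyname(domain)


def system_reverse_lookup(ip_address):
    """Reverse lookup through the operating system's resolver."""
    hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    return hostname
```

Real DNS is distributed across many servers and caches results, so a single dictionary is a deliberate simplification.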
Steps and processes involved in obtaining an HTML web page
Request a web page
Once you type the URL into your browser, whether Internet Explorer or Firefox, the steps described above happen in sequence before the page appears on screen. Let’s look at each of these steps in detail.
Components of the URL
The first phase is to parse the URL, that is, to identify its components. A typical URL consists of the following components:
• Domain Name
• Resource Name
However, a URL can carry other components as well, including usernames, passwords and port numbers. We can ignore them here, as they are not used prominently.
The figure above explains the components of the URL. Now we’re ready to send the HTTP Request.
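As a rough illustration of this first phase, Python's standard `urllib.parse.urlsplit` can split the article's URL into its parts. The variable names and the second URL (with a username, password and port) are assumptions made up for this sketch.

```python
from urllib.parse import urlsplit

url = "http://www.clickoffline.com/index.php"
parts = urlsplit(url)

protocol = parts.scheme      # "http"
domain_name = parts.hostname  # "www.clickoffline.com"
resource_name = parts.path    # "/index.php"

# No port is written in the URL, so parts.port is None; fall back to the
# well-known default for the protocol.
DEFAULT_PORTS = {"http": 80, "https": 443}
port = parts.port or DEFAULT_PORTS[protocol]

# Hypothetical URL showing the optional components mentioned above.
full = urlsplit("http://user:secret@www.clickoffline.com:8080/index.php")
optional = (full.username, full.password, full.port)
```

The domain name is what gets handed to DNS, while the resource name is what ends up in the HTTP request.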
Construction of the HTTP Request
The browser composes a message, called the HTTP request, that is sent to the server to retrieve the resource. Its first line names the method, the resource and the protocol version:
GET /index.php HTTP/1.1
Since HTTP is a stateless protocol, all information related to the session is sent along with every request. This includes the resource name, protocol version, browser name, version and OS, supported content types, cookies (if any), and so on.
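To make the shape of such a request concrete, here is a minimal sketch that assembles the request text by hand. The function name `build_get_request` and all header values (browser name, cookie) are hypothetical; a real browser sends more headers than this. The `Host` header is included because HTTP/1.1 requires it.

```python
def build_get_request(host, resource, extra_headers=None):
    """Return the text of a simple HTTP/1.1 GET request."""
    lines = [
        f"GET {resource} HTTP/1.1",  # method, resource name, protocol version
        f"Host: {host}",             # mandatory in HTTP/1.1
    ]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Header lines end with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_get_request(
    "www.clickoffline.com",
    "/index.php",
    {
        "User-Agent": "ExampleBrowser/1.0 (ExampleOS)",  # hypothetical browser
        "Accept": "text/html",                            # supported content type
        "Cookie": "session=abc123",                       # hypothetical cookie
        "Connection": "close",
    },
)
```

Because nothing is remembered between requests, the cookie has to travel with each one; that is how a server can recognise a returning session.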
Identifying the server
Before sending out an HTTP request, the browser needs to identify the remote server. It does this by using DNS to find the IP address of the domain specified in the URL. The HTTP request is then sent to the remote server over a TCP connection established between the local machine and that server.
Wait For Response
The client browser waits for the response from the server.
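The last two steps, sending the request and waiting for the reply, could look roughly like this with Python's `socket` module. The function names are made up for this sketch, and it assumes the request carries a `Connection: close` header so that the server closes the connection when it has finished sending; without it, an HTTP/1.1 server may keep the connection open and the loop would keep waiting.

```python
import socket


def open_connection(ip_address, port=80, timeout=10.0):
    """Open a TCP connection to the IP address obtained from DNS."""
    return socket.create_connection((ip_address, port), timeout=timeout)


def send_and_wait(conn, request):
    """Send the request bytes, then block until the server closes its side."""
    conn.sendall(request)
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # an empty read means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

A browser does considerably more (timeouts, retries, reusing connections), so treat this as the bare minimum of the send-then-wait pattern.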
Analysis/ Rendering of HTTP Response
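The HTTP Response consists of an HTTP Response Header and the HTTP Response Data.
The Response Headers provide information about the HTTP Response; the most important field is the response code, which appears in the first line. Different response codes indicate different outcomes: for example, 200 means the request succeeded, 404 means the resource was not found, and 500 means the server encountered an error.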
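Response codes are grouped by their first digit, which a short sketch can make explicit. The helper name `describe_status` and the wording of the category labels are choices made for this example; the reason phrases come from Python's standard `http.HTTPStatus`.

```python
from http import HTTPStatus

# Category by first digit of the response code.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a readable description such as '404 Not Found (client error)'."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code not defined in the standard library's list
        phrase = "Unrecognized"
    return f"{code} {phrase} ({category})"
```

For instance, `describe_status(200)` gives `200 OK (success)` and `describe_status(404)` gives `404 Not Found (client error)`.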
The browser analyzes the HTTP response and renders it on the screen. Any dependencies referenced by the source HTML page, such as images, CSS and JavaScript files, go through the same process to load in the client, except that their URLs are requested automatically by the browser instead of being typed by a user.
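How a client might discover those dependencies can be sketched with the standard `html.parser` module. The class name `DependencyCollector` is an assumption of this example, and it only looks at images, scripts and stylesheet links; a real browser follows many more kinds of references.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DependencyCollector(HTMLParser):
    """Collect absolute URLs of images, scripts and stylesheets in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            reference = attrs["src"]
        elif (tag == "link" and attrs.get("href")
              and (attrs.get("rel") or "").lower() == "stylesheet"):
            reference = attrs["href"]
        else:
            return
        # Relative references are resolved against the page's own URL.
        self.urls.append(urljoin(self.base_url, reference))
```

Feeding a page's HTML to `DependencyCollector("http://www.clickoffline.com/index.php")` with `feed()` leaves the resolved URLs in its `urls` list, and each of them would then go through the same DNS, request and response steps as the page itself.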
HTTP/1.1 200 OK
Date: Wed, 18 Feb 2009 16:19:57 GMT
Server: Apache/2.2.11 (Unix)
Keep-Alive: timeout=15, max=100
Content-Type: text/html; charset=UTF-8
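The sample response above can be taken apart with a few lines of Python. This is a simplified sketch, not a complete HTTP parser: the function name `parse_response` is made up, the short HTML body is a placeholder added for illustration, and features such as chunked transfer encoding or repeated headers are ignored.

```python
def parse_response(raw):
    """Split raw response bytes into (code, reason, headers, body)."""
    # An empty line separates the response header from the response data.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK"
    status_parts = lines[0].split(" ", 2)
    code = int(status_parts[1])
    reason = status_parts[2] if len(status_parts) > 2 else ""

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return code, reason, headers, body


sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Wed, 18 Feb 2009 16:19:57 GMT\r\n"
    b"Server: Apache/2.2.11 (Unix)\r\n"
    b"Keep-Alive: timeout=15, max=100\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"<html><body>placeholder</body></html>"  # hypothetical body
)
code, reason, headers, body = parse_response(sample)
```

Header names are lower-cased here because HTTP header names are case-insensitive, which makes lookups such as `headers["content-type"]` predictable.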
• Specifications for URL (RFC 1738): http://www.w3.org/Addressing/URL/url-spec.txt
• Specifications for DNS (RFC 1035): http://www.ietf.org/rfc/rfc1035.txt
• HTTP Protocol Details: http://en.wikipedia.org/wiki/HTTP