chapter_3_-_your_first_webpage_-_learning_html_and_css:uniform_resource_locators-urls

Differences

This shows you the differences between two versions of the page.

chapter_3_-_your_first_webpage_-_learning_html_and_css:uniform_resource_locators-urls [2013/01/29 19:15]
cellbiol
chapter_3_-_your_first_webpage_-_learning_html_and_css:uniform_resource_locators-urls [2013/02/25 17:35] (current)
cellbiol
Line 1: Line 1:
 ===== 3-1: Uniform Resource Locators - URLs =====
  
-Work in progress
 +A URL is the unique address of a file on the Internet.
 + 
 +A URL is composed of the protocol, the host name (which corresponds to the host's Document Root on the host filesystem), and the path of the file to be loaded, relative to the Document Root. This was the short version.
 + 
 +Before discussing the various parts of a URL in detail, we explore a few connected concepts, namely:
 + 
 +  * [[http://​en.wikipedia.org/​wiki/​Host_(network)|Hosts]] 
 +  * [[http://​en.wikipedia.org/​wiki/​Hostname|Host Names]] (see also [[http://​whatismyipaddress.com/​hostname|here]]) 
 +  * [[http://​en.wikipedia.org/​wiki/​Virtual_host|Virtual Hosts]] 
 +  * Domains 
 +  * Document Roots (already discussed [[chapter_2_-_the_linux_operating_system:​apache_web_server_configuration|here]]) 
 + 
 +If you think I am tricking you into opening a chapter with an easy title (URL? Yeah, I know what it is), only to then deviate toward complicated and unexpected topics, you are right. There is a complaints box at the bottom of this page called "Discussion", feel free to use it :-)
 + 
 +Actually there is a reason for discussing hosts, domains and Document Roots together with URLs: there is an (apparently) intricate relationship between these objects that you, as a Bioinformatician who wishes to contribute to the growth of the Internet and the progress of Science by designing your own original web applications, are expected to fully master. Carry on!
 + 
 +==== Hosts ==== 
 + 
 +Each file on the internet is located on a certain "​host",​ a computer connected to the internet.  
 + 
 +In everyday usage, host is also a synonym of "host name": depending on the context, host can indicate the physical machine connected to the internet, or a host name "hosted" on this machine.
 +
 +The host name is associated with a host IP address by the [[chapter_1_-_internet_networks_and_tcp-ip:1-4_domain_name_server|DNS system]].
 +
 +So we say that a host (machine) can host a host (host name). It looks like a linguistic joke, but this is how things really stand in human language sometimes. This kind of thing may explain in part why it is so difficult to get computers to understand human speech, but that is another story.
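
The association between a host name and an IP address can be observed from a script. The following is a minimal Python sketch, not part of the original chapter; the function name ''resolve_host'' is a hypothetical helper, and the lookup goes through the operating system's resolver (which may consult local files as well as the DNS).

```python
import socket


def resolve_host(host_name: str) -> str:
    """Return the IPv4 address the system resolver reports for host_name."""
    # gethostbyname asks the operating system's resolver, which in turn
    # consults the DNS (or local files such as /etc/hosts).
    return socket.gethostbyname(host_name)


# An IPv4 address passed as a string is returned unchanged, without a lookup.
print(resolve_host("127.0.0.1"))

# With a working network connection you could also try a real host name:
# print(resolve_host("www.ncbi.nlm.nih.gov"))
```

The address returned for a real host name depends on the DNS records at the time of the query, so no particular output is claimed here.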
 + 
 +Examples of host names on the internet could be: 
 + 
 +<​code>​ 
 +  - www.cnn.com  
 +  - news.cnn.com 
 +  - nih.gov 
 +  - www.ncbi.nlm.nih.gov 
 +</​code>​ 
 + 
 +These are 4 different host names, related to two different domains (cnn.com and nih.gov), that could be, in theory, managed by four different computers, or by a single physical machine, depending on how the DNS for these host names is set.
 + 
 +Indeed a machine/​host can manage various host names at the same time. In Apache this is done by configuring VirtualHosts. 
 +==== Apache Virtual Hosts ==== 
 + 
 +Apache allows us to set up a number of [[http://httpd.apache.org/docs/2.2/vhosts/examples.html|Virtual Hosts]], one for each host name we wish to serve from our machine. We can then point the DNS records for all these host names to the IP of our machine, so that visitors to the hosts will be directed to our machine and served the correct files and services.
 + 
 +<box 100% left round red | **Example 3-1-1: An Apache configuration file that defines two VirtualHosts**>​ 
 +<​html>&​nbsp;</​html>​ 
 + 
 +By using Apache Virtual Hosts, it is extremely easy to assign a dedicated Document Root to each of the hosts managed by a machine. This is the secret for hosting several, even hundreds of different websites on a single computer with a single IP address. 
 + 
 +In the following example, you see an Apache configuration file, normally located at: 
 + 
 +/​etc/​apache2/​sites-enabled/​000-default 
 + 
 +that defines 3 host names for the cellbiol.com domain and their respective Document Roots, using 2 VirtualHost blocks.
 +
 +Mind that this is a hypothetical Apache file; our actual web setup is entirely different. However, the indicated Document Roots are consistent with what you can check on the web at the URL level.
 +
 +The first VirtualHost defines the Document Root for the www.cellbiol.com host, which is /var/www. The server has a server alias, cellbiol.com. The alias shares the same DocumentRoot as the server it is associated with:
 + 
 +<​code>​ 
 +www.cellbiol.com and cellbiol.com (2 host names)  
 +DocumentRoot:​ /var/www (1 Document Root) 
 + 
 +The second host name is defined as an alias of the first with the  
 +ServerAlias directive, see the file below. 
 +</​code>​ 
 + 
 + 
 +You can check that visiting http://​cellbiol.com or http://​www.cellbiol.com will give the same results (at the time of this writing!). 
 + 
 +The second VirtualHost defines the Document Root for the games.cellbiol.com host. This is /var/www/games. So the "games" directory, the Document Root of games.cellbiol.com, is a child of the www.cellbiol.com Document Root: /var/www/games is a child of /var/www.
 +
 +If this were true, you would expect the content of
 +
 +http://games.cellbiol.com to be reachable at http://www.cellbiol.com/games as well. Check it out (don't have too much fun with the games though, back to work immediately!).
 + 
 +<code>
 +<VirtualHost *:80>
 +        ServerAdmin webmaster@gmailed.com
 +        DocumentRoot /var/www
 +        ServerName www.cellbiol.com
 +        ServerAlias cellbiol.com
 +        <Directory />
 +                Options FollowSymLinks
 +                AllowOverride None
 +        </Directory>
 +        <Directory /var/www/>
 +                Options Indexes FollowSymLinks MultiViews
 +                AllowOverride FileInfo Limit
 +                Order allow,deny
 +                allow from all
 +        </Directory>
 +
 +        ErrorLog ${APACHE_LOG_DIR}/error.log
 +        LogLevel warn
 +        CustomLog ${APACHE_LOG_DIR}/access.log combined
 +</VirtualHost>
 +
 +<VirtualHost *:80>
 +        ServerAdmin webmaster@gmailed.com
 +        DocumentRoot /var/www/games
 +        ServerName games.cellbiol.com
 +        <Directory />
 +                Options FollowSymLinks
 +                AllowOverride None
 +        </Directory>
 +        <Directory /var/www/games/>
 +                AllowOverride None
 +                Order allow,deny
 +                allow from all
 +        </Directory>
 +
 +        ErrorLog ${APACHE_LOG_DIR}/error.log
 +        LogLevel warn
 +        CustomLog ${APACHE_LOG_DIR}/access.log combined
 +</VirtualHost>
 +</code>
 +</​box>​ 
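
How a server picks a Document Root from the requested host name can be modeled in a few lines. The following Python sketch is only an illustration of the idea behind Example 3-1-1, not how Apache is implemented; the ''VIRTUAL_HOSTS'' table and the ''document_root_for'' function are hypothetical names, and the data simply mirrors the hypothetical configuration file above.

```python
# Hypothetical in-memory model of the two VirtualHosts in Example 3-1-1.
VIRTUAL_HOSTS = [
    {
        "server_name": "www.cellbiol.com",
        "aliases": ["cellbiol.com"],
        "document_root": "/var/www",
    },
    {
        "server_name": "games.cellbiol.com",
        "aliases": [],
        "document_root": "/var/www/games",
    },
]


def document_root_for(host_header: str) -> str:
    """Return the Document Root for the host name sent by the browser."""
    # Drop an optional ":port" suffix and compare case-insensitively.
    host = host_header.split(":")[0].strip().lower()
    for vhost in VIRTUAL_HOSTS:
        if host == vhost["server_name"] or host in vhost["aliases"]:
            return vhost["document_root"]
    # With name-based virtual hosts, Apache falls back to the first
    # VirtualHost listed when no ServerName or ServerAlias matches.
    return VIRTUAL_HOSTS[0]["document_root"]


for name in ("www.cellbiol.com", "cellbiol.com", "games.cellbiol.com"):
    print(name, "->", document_root_for(name))
```

The alias and its server name resolve to the same directory, which is the point made by the first VirtualHost of the example.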
 + 
 + 
 + 
 +==== Domains ==== 
 + 
 +A minimal host name can consist of just the "domain". The "domain", as the term is commonly used, is composed of a first level domain (.com, .org, .net, .gov etc.), which broadly identifies the kind of domain, and a second level domain (cnn, nih, cellbiol in the examples above), whose name was chosen by the registrant at the time of registration to identify their organization, business or activity.
 +
 +A domain (first level + second level) identifies a defined "entity" (organization, business, online shop, university, newspaper, travel agency etc.).
 +
 +Some entities are relatively small and will typically be accessible at 2 different host addresses: the basic domain address, and the corresponding third level address "www".
 +<​code>​ 
 +- organization.org 
 +- www.organization.org 
 +</​code>​ 
 +In the latter address, "​org"​ is the first or "​top"​ level domain, "​organization"​ the second level domain and "​www"​ the third level domain (see also figure 3-1-1). 
 + 
 +The DNS for these two addresses might be set to point them to the same IP, with Apache on the host computer configured to assign the same Document Root to both host names, as done in the first VirtualHost in the cellbiol.com example above.
 +
 +On the other hand, in the case of a big organization, such as a university, a campus or a big company, several third level domain names may exist in addition to the classical "www", corresponding to entirely different websites, maybe (but not necessarily) hosted by different computers. For instance, each department or faculty could have its own third level domain, maybe hosted by a faculty/department server:
 +<​code>​ 
 +- www.organization.org (organization home page) 
 +- biology.organization.org  
 +- neuroscience.organization.org 
 +- bioinformatics.organization.org 
 +- molecular-medicine.organization.org
 +- ...... 
 +</​code>​ 
 +A particular third level domain could then father several fourth, fifth, or higher level domains. In the example in figure 3-1-1, which corresponds to a real web page on the NCBI website, the host name comprises 5 domain levels.  
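
Counting domain levels amounts to splitting the host name on dots and reading the labels from right to left. A minimal Python sketch follows, assuming a plain dot-separated host name; the function name ''domain_levels'' is a hypothetical helper introduced only for this illustration.

```python
from typing import List


def domain_levels(host_name: str) -> List[str]:
    """Return the labels of host_name ordered from first (top) level upward."""
    labels = host_name.strip(".").lower().split(".")
    # The rightmost label is the first (top) level domain, so reverse the list.
    return list(reversed(labels))


for level, label in enumerate(domain_levels("www.ncbi.nlm.nih.gov"), start=1):
    print(level, label)
```

For www.ncbi.nlm.nih.gov this lists five levels, starting from "gov" (level 1) and ending with "www" (level 5).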
 + 
 +==== URLs ==== 
 + 
 +With this background information, we can understand the anatomy of a URL (figure 3-1-1).
 +  - The very first part is the [[chapter_1_-_internet_networks_and_tcp-ip:1-2_the_tcp-ip_family_of_internet_protocols|TCP/IP application protocol]]. For web pages this will be http or https (Hyper Text Transfer Protocol or its secure version).
 +  - Then follows the host name. This is essential: it identifies the computer connected to the Internet that actually hosts the requested file on its filesystem. A URL with the host name alone (here http://www.ncbi.nlm.nih.gov/) maps to the host's Document Root (by definition of DocumentRoot). So visiting http://www.ncbi.nlm.nih.gov/ is equivalent to visiting the Document Root of the www.ncbi.nlm.nih.gov host, as set in the host's Apache configuration files.
 +  - The last component of the URL is the path of the actual file to be served, in this case "VecScreen/VecScreen.html", relative to the DocumentRoot directory.
 + 
 +The relationship between a web file URL and path on the host's filesystem is further clarified by figure 3-1-2. 
 + 
 + 
 +== Figure 3-1-1: URL Anatomy == 
 +{{ :​chapter_3_-_your_first_webpage_-_learning_html_and_css:​url_anatomy.png |}} 
 + 
 +The URL request, from a browser on a client computer, to an internet server reachable at a particular host name, can be better read from right to left.  
 + 
 +A request for:  
 + 
 +http://​www.ncbi.nlm.nih.gov/​VecScreen/​VecScreen.html 
 + 
 +Means: "send me the VecScreen.html text file. This is located in the VecScreen directory of host www.ncbi.nlm.nih.gov. Use the http protocol."​ 
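
The three parts of this request can also be extracted programmatically. Here is a short Python sketch using the standard library's ''urllib.parse.urlsplit''; the wrapper function ''url_parts'' is a hypothetical name introduced only for this illustration.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def url_parts(url: str) -> Tuple[str, Optional[str], str]:
    """Return (protocol, host name, path) for an http or https URL."""
    parts = urlsplit(url)
    # scheme is the protocol, hostname the host name, and path the
    # location of the file relative to the host's Document Root.
    return parts.scheme, parts.hostname, parts.path


protocol, host, path = url_parts(
    "http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html"
)
print("Protocol: ", protocol)
print("Host name:", host)
print("Path:     ", path)
```

Note that the path reported by ''urlsplit'' keeps its leading slash, that is "/VecScreen/VecScreen.html".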
 + 
 + 
 +<​code>​ 
 + 
 +Host name:  
 + 
 +www.ncbi.nlm.nih.gov 
 + 
 +Host's Document Root:  
 + 
 +/var/www (purely hypothetical,​ we don't know the actual web root of this host!) 
 + 
 +Host's URL: http://www.ncbi.nlm.nih.gov/ (the final slash is usually optional.
 +By definition, the "base URL" of a host, the URL that contains just the host name
 +or the host IP, corresponds to Apache's DocumentRoot on the host)
 +________________________
 +
 +VecScreen directory URL:
 + 
 +http://​www.ncbi.nlm.nih.gov/​VecScreen/​ 
 + 
 +VecScreen directory path on the www.ncbi.nlm.nih.gov host: 
 + 
 +/​var/​www/​VecScreen 
 +________________________ 
 + 
 +VecScreen.html file URL:
 + 
 +http://​www.ncbi.nlm.nih.gov/​VecScreen/​VecScreen.html 
 + 
 +VecScreen file path on the www.ncbi.nlm.nih.gov host: 
 + 
 +/​var/​www/​VecScreen/​VecScreen.html 
 + 
 +</​code>​ 
 + 
 +URLs could include a complex hierarchy of subdirectories, such as
 +<code>http://hostname/dir1/subdir1/subsubdir1/file.html</code>
 +which could correspond to a path like this on the host's filesystem:
 +<code>/var/www/dir1/subdir1/subsubdir1/file.html</code>
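
The mapping from a URL to a filesystem path can be sketched as a small function. This is a simplified Python illustration, assuming a single Document Root of /var/www (purely hypothetical, as in the examples above) and ignoring aliases, rewrites and other server features; ''url_to_path'' is a hypothetical helper name.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url_to_path(url: str, document_root: str = "/var/www") -> str:
    """Map a URL to the file path it would correspond to under document_root."""
    url_path = unquote(urlsplit(url).path)
    # Anchor the path at "/" before normalizing, so that ".." segments
    # cannot climb above the Document Root.
    normalized = posixpath.normpath("/" + url_path.lstrip("/"))
    relative = normalized.lstrip("/")
    if not relative:
        # A URL with the host name alone maps to the Document Root itself.
        return document_root
    return posixpath.join(document_root, relative)


print(url_to_path("http://www.ncbi.nlm.nih.gov/"))
print(url_to_path("http://www.ncbi.nlm.nih.gov/VecScreen/"))
print(url_to_path("http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html"))
```

The three calls correspond to the base URL, the directory URL and the file URL discussed above.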
 + 
 +The relationship between a web file URL and its path on the host computer filesystem is illustrated in figure 3-1-2.
 +
 +== Figure 3-1-2: The relationship between a web file URL and its path on the host filesystem ==
 +{{ :​chapter_3_-_your_first_webpage_-_learning_html_and_css:​file_url_vs_file_path.png |}} 
 + 
 +==== Course setup ==== 
 + 
 +A common setup that we encounter during this course is a computer connected to the internet with a public IP address and no host name associated with it. URLs in this case will have an IP in place of a host name. The Document Root of your files could be /var/www (Apache's default), if you are the administrator of your own machine, for example your laptop.
 + 
 +In this case your base URL will be just your IP: 
 + 
 +<​code>​http://​122.13.22.34/</​code>​ 
 + 
 +The Document Root of your files could instead be /​home/​username/​public_html if you have an account on a machine shared by several users/​students. In this case your base URL will be like: 
 + 
 +<​code>​http://​122.13.22.34/​~username/</​code>​ 
 + 
 +if the server was set up [[chapter_2_-_the_linux_operating_system:apache_web_server_configuration|as described here]].
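
The per-user case can be sketched in the same way. The following Python function is a simplified illustration, assuming the /home/username/public_html layout described above; ''userdir_url_to_path'' is a hypothetical helper name, and it does not attempt to sanitize unusual paths.

```python
import posixpath
from urllib.parse import urlsplit


def userdir_url_to_path(url: str) -> str:
    """Map a URL of the form http://host/~username/... to a path in that
    user's public_html directory (layout assumed from the text above)."""
    path = urlsplit(url).path
    if not path.startswith("/~"):
        raise ValueError("not a per-user URL: " + url)
    # Split "/~username/rest/of/path" into the user name and the remainder.
    username, _, rest = path[2:].partition("/")
    if not username:
        raise ValueError("missing user name in URL: " + url)
    parts = ["/home", username, "public_html"]
    if rest:
        parts.append(rest)
    return posixpath.join(*parts)


print(userdir_url_to_path("http://122.13.22.34/~username/"))
print(userdir_url_to_path("http://122.13.22.34/~username/page.html"))
```

The file name page.html in the second call is only an example; any file placed in public_html would map the same way.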
 + 
 +You now know where to put one of your web files in your filesystem to make it accessible to the world at a particular URL. This is a good starting point. Let's add one more small but important piece of information.
 + 
 +==== The special index.html or index.php files ==== 
 +  
 +If a directory within the Document Root, including the Document Root itself, contains a file called "index.html" or "index.php", this file will be shown by default on visiting the URL of the directory (provided Apache's DirectoryIndex directive is configured to look for these names). You can check, for example, that when you visit:
 + 
 +http://​www.cellbiol.com 
 + 
 +you are actually viewing the file: 
 + 
 +http://​www.cellbiol.com/​index.php 
 + 
 +This allows shorter and somewhat cleaner URLs. Also, Apache is sometimes configured to show the contents of a directory to the visitor unless an index file is present (in which case the index file is shown instead). So creating an index.html file, even an empty one, is an easy way to conceal the directory contents from visitors. Of course, you can also configure Apache so that it does not list directory contents by default and instead returns an error (typically "403 Forbidden") if the index file is not present.
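
The index file lookup can be pictured as trying a list of candidate names in order. The Python sketch below only illustrates that idea and is not Apache's implementation; ''find_index_file'' is a hypothetical helper, and the default candidate list is an assumption matching the two names discussed above.

```python
import os
import tempfile
from typing import Optional, Sequence


def find_index_file(
    directory: str,
    candidates: Sequence[str] = ("index.html", "index.php"),
) -> Optional[str]:
    """Return the path of the first index file found in directory, or None."""
    for name in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    # No index file: the server would list the directory or return an error,
    # depending on its configuration.
    return None


# Demonstration in a temporary directory standing in for a web directory.
with tempfile.TemporaryDirectory() as web_dir:
    print(find_index_file(web_dir))
    open(os.path.join(web_dir, "index.html"), "w").close()
    print(os.path.basename(find_index_file(web_dir)))
```

The first call finds nothing because the directory is empty; after the empty index.html is created, the second call finds it.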
 + 
 +==== Chapter Sections ==== 
 +<box 100% left round blue | **Chapter 3**> 
 +<​html>&​nbsp;</​html>​ 
 + <​PHP>​ 
 +echo file_get_contents('/​home/​cellbio1/​public_html/​bioinformatics_web_development/​data/​my_menus/​menu_chapter_3.html'​);​ 
 +</​PHP>​ 
 +[[..:​start|Back to index]] 
 +</​box>​
chapter_3_-_your_first_webpage_-_learning_html_and_css/uniform_resource_locators-urls.1359504955.txt.gz · Last modified: 2013/01/29 19:15 by cellbiol