A simple web proxy

HTTP is a reasonably simple client/server transfer protocol. Suppose you type the URL http://www.du.edu/uts/ into your web browser. It will then open a TCP connection to www.du.edu on port 80 and send something like this:

GET /uts/ HTTP/1.0<CR><LF>
<CR><LF>

(The <CR><LF> sequence represents ASCII character 13 followed by ASCII 10.) In response, the web server will locate its file called /uts/ and transmit something like this:

HTTP/1.1 200 OK<CR><LF>
Date: Mon, 10 Jan 2000 00:31:09 GMT<CR><LF>
Server: Apache/1.2.6<CR><LF>
Last-Modified: Wed, 05 Jan 2000 18:52:29 GMT<CR><LF>
ETag: "10a74-2b12-387392ed"<CR><LF>
Content-Length: 11026<CR><LF>
Accept-Ranges: bytes<CR><LF>
Connection: close<CR><LF>
Content-Type: text/html<CR><LF>
<CR><LF>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"<CR><LF>

(Here I have only shown the first line of the HTML file that is delivered.)

A web proxy server is designed to sit in between such a transaction in order to monitor, modify and/or prohibit it. All reasonable web browsers may be configured to use a proxy server; this means that when you try to load the page http://www.du.edu/uts/, the browser instead contacts the designated proxy server and asks it for the page as follows:

GET http://www.du.edu/uts/ HTTP/1.0<CR><LF>
<CR><LF>

Notice the difference here: instead of asking just for the resource /uts/, the web browser asks for the full resource http://www.du.edu/uts/. The proxy then has enough information to fetch the file on behalf of the web browser and deliver it as a result, or do whatever the proxy wants to do.
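As a rough illustration of what "enough information" means, the sketch below splits a proxy-style request line into the pieces needed to contact the origin server. It is written in Python purely for brevity; the function name parse_proxy_request_line is hypothetical, and the defaults (port 80, path "/") are assumptions based on ordinary http:// URLs rather than anything this handout specifies.

```python
from urllib.parse import urlsplit


def parse_proxy_request_line(line):
    """Split 'METHOD http://host[:port]/path VERSION' into its parts.

    Returns (method, host, port, path, version). Raises ValueError on
    anything that does not look like an absolute http:// request.
    """
    parts = line.strip().split()
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, url, version = parts

    u = urlsplit(url)
    if u.scheme != "http" or not u.hostname:
        raise ValueError("only absolute http:// URLs are supported")

    port = u.port or 80          # assumed default when the URL names no port
    path = u.path or "/"         # an empty path means the server root
    if u.query:
        path += "?" + u.query    # keep the query string with the path
    return method, u.hostname, port, path, version


if __name__ == "__main__":
    print(parse_proxy_request_line("GET http://www.du.edu/uts/ HTTP/1.0\r\n"))
```

With the host and port in hand, the proxy can open its own TCP connection and send the origin server a request for just the path, in the form shown at the top of this page.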

Requirements

Your assignment is to design and implement a web proxy that delivers the requested URL content to the client and keeps a log of the requested URLs. Specifically:

- Your proxy must support the HTTP/1.0 GET and POST methods. Note that the POST method provides the data to post after the header section. A Content-Length: header provided by the browser says how many bytes long that data is. Your proxy does not have to support any other protocols (such as FTP or gopher).
- RFC 1945 contains a full HTTP/1.0 specification. Beware: RFC 1945 is 60 pages long. You should only read as much of RFC 1945 as you need to complete the assignment.
- You may want to start with this document: Basic HTTP (1992) from w3.org. This document is an incomplete description of HTTP 1.0, but it has most of the information you need for this project and is a much easier read than the RFC.
- We must not be able to kill your server simply by sending an invalid request.
- We must not be able to kill your server by stopping or killing the client (this includes pressing the STOP button in Netscape or IE).
- Your proxy server must support multiple simultaneous connections. You may use any reasonable technique (such as multiple processes, multiple threads, select loops, non-blocking I/O) to do this.
- Any header lines your proxy doesn't understand must be passed to the "origin server" (i.e., the server named in the GET method) except the Connection: header, which you should drop if present.
- Don't hard-code IP addresses or names into your proxy for any reason.
- You should try to make your proxy as robust as possible. E.g., don't let a malfunctioning client or server hang one of your processes forever.
- To set up a Netscape proxy, try Edit/Preferences, open the Advanced tab, go to Proxies, and click on "Manual proxy configuration". In Internet Explorer, look under Tools/Internet Options/Connections/LAN Settings.
- If you use fork(), you will probably generate zombie processes unless you work to avoid it. Read UNP section 5.9 carefully.
- The client does not send an EOF after its request; it just transmits a blank line.
- Your proxy should allow clients and servers to denote an end-of-line in the headers by transmitting <CR><LF>, or <CR> alone, or <LF> alone. Your proxy should always send <CR><LF> when it wants to transmit a newline in the header. (Note: '\r' is <CR> and '\n' is <LF>.)
- For all x, your proxy should change the string HTTP/1.x to HTTP/1.0 when it occurs in the first line of a client request or server response. Do not perform this translation in other lines of the communication.
- Do not plan to store all of a transmission or even all of its headers in memory before you begin parsing. There is no way to predict how large the transmission or headers will be. In practice the headers will be small, but your program should not depend on this. You may set a reasonable maximum line length and refuse to process any transaction containing a header line that exceeds this length.
- The log file of requested URLs that you keep should include the source and destination IP addresses, the date and time, and the URL requested for each transaction. It would also be useful to record the number of bytes transferred and the Referer: header line if any, but this is optional.
- Be sure to hunt down and kill your own processes that should no longer be running when you log off!
- Make sure your proxy can handle images; don't rely solely on simple HTML documents for all your testing!
- If a client closes a connection prematurely, your server can receive a SIGPIPE (this happens when you call write() on a socket that has been closed by the other end). By default, SIGPIPE kills your process, so you need to handle this situation.
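Two of the rules above, the tolerant end-of-line handling and the HTTP/1.x rewrite, are easy to get subtly wrong, so here is one possible sketch in Python. The names HeaderLineReader and normalize_version and the MAX_LINE value of 8192 are all assumptions made for illustration. The reader processes one line at a time without buffering the whole header section; a <CR> ends a line immediately, and a following <LF> is silently skipped on the next read.

```python
import io
import re

MAX_LINE = 8192  # assumed limit; the handout only asks for a "reasonable" one


class HeaderLineReader:
    """Reads header lines one at a time from a binary stream."""

    def __init__(self, stream):
        self.stream = stream
        self.pending_lf = False  # True right after a <CR> ended a line

    def readline(self):
        """Return the next line without its terminator, or None at EOF."""
        buf = bytearray()
        while True:
            b = self.stream.read(1)
            if not b:
                return bytes(buf) if buf else None
            if self.pending_lf:
                self.pending_lf = False
                if b == b"\n":
                    continue  # second half of a <CR><LF> pair
            if b == b"\r":
                self.pending_lf = True
                return bytes(buf)
            if b == b"\n":
                return bytes(buf)
            buf += b
            if len(buf) > MAX_LINE:
                raise ValueError("header line too long")


_VERSION = re.compile(rb"HTTP/1\.\d+")


def normalize_version(first_line, is_response):
    """Rewrite HTTP/1.x to HTTP/1.0 in a request line or a status line."""
    if is_response:
        # In a status line the version is the first token.
        if _VERSION.match(first_line):
            return _VERSION.sub(b"HTTP/1.0", first_line, count=1)
        return first_line
    # In a request line the version is the last token.
    head, sep, tail = first_line.rpartition(b" ")
    if sep and _VERSION.fullmatch(tail):
        return head + sep + b"HTTP/1.0"
    return first_line


if __name__ == "__main__":
    reader = HeaderLineReader(io.BytesIO(b"GET / HTTP/1.1\r\nA: 1\rB: 2\n\r\n"))
    print(normalize_version(reader.readline(), is_response=False))
```

One caveat with this approach: if the blank line that ends the headers finished with <CR>, a matching <LF> may still be unread (pending_lf is True), and the caller has to discard it before reading a POST body. In a real proxy the stream would typically come from something like sock.makefile("rb"), so the one-byte reads are served from a buffer.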

Above all, your program should deliver web pages properly and be difficult to crash.
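One common source of crashes is the SIGPIPE case described in the requirements. A minimal sketch of one way to survive it, again in Python for brevity: ignore the signal so that a write to a closed socket fails with an error instead of killing the process, then treat that error as "the client went away". The helper names ignore_sigpipe and safe_send are hypothetical. The Python interpreter already ignores SIGPIPE by default; the explicit call corresponds to signal(SIGPIPE, SIG_IGN) in a C proxy.

```python
import signal
import socket


def ignore_sigpipe():
    """Call once at startup, from the main thread (Unix only)."""
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)


def safe_send(sock, data):
    """Send all of data; return False instead of dying if the peer is gone."""
    try:
        sock.sendall(data)
        return True
    except OSError:
        # Covers BrokenPipeError (EPIPE) and ConnectionResetError.
        return False


if __name__ == "__main__":
    ignore_sigpipe()
    ours, theirs = socket.socketpair()
    theirs.close()                    # simulate a client pressing STOP
    print(safe_send(ours, b"HTTP/1.0 200 OK\r\n\r\n"))
    ours.close()
```

When safe_send returns False, the proxy should simply close both sides of that transaction and carry on serving other connections.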

Persistence

HTTP 1.1 supports persistent connections by default. Feel free to have your proxy deal with persistent connections, but this is not required for this project. If you choose not to deal with persistence, you will probably want to do something like the following (or your proxy will not work well with some clients/servers):

- Do not forward any Proxy-Connection or Connection request headers.
- Append the request header Connection: close to all requests you send to servers. This tells the server that you don't want persistence, so the server should close the connection once the response is complete.
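The two steps above can be sketched as a small filter over the request headers. This is only an illustration in Python; the function name rewrite_request_headers is hypothetical, and it assumes the header lines have already been split and stripped of their line terminators. Continuation lines (those beginning with a space or tab) are treated as part of the header they follow.

```python
DROPPED = {"connection", "proxy-connection"}


def rewrite_request_headers(header_lines):
    """Drop persistence-related headers and ask the server to close.

    header_lines is a list of strings such as 'Host: www.du.edu',
    without line terminators and without the final blank line.
    """
    out = []
    dropping = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Folded continuation: shares the fate of the header above it.
            if not dropping:
                out.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        dropping = name in DROPPED
        if not dropping:
            out.append(line)
    out.append("Connection: close")
    return out


if __name__ == "__main__":
    for h in rewrite_request_headers(
        ["Host: www.du.edu", "Proxy-Connection: Keep-Alive", "Accept: */*"]
    ):
        print(h)
```

Header names are compared case-insensitively, since clients differ in how they capitalize them.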

Server Output

To keep track of all requests, your server should print one line (to standard output) for each request serviced. The line should include the host name or IP address of the client and the original request-line sent by the client (not any of the headers that accompanied the request). For example, the following might be the output generated by your server if it received some requests from a client running on monica.cs.rpi.edu:

> hw2 1234 
monica.cs.rpi.edu:  GET http://www.cs.rpi.edu/
monica.cs.rpi.edu:  GET http://www.cs.rpi.edu/gfx/backg5.jpg
monica.cs.rpi.edu:  GET http://www.cs.rpi.edu/gfx/logo.jpg
monica.cs.rpi.edu:  GET http://www.yahoo.com/images/annakornikova.jpg
monica.cs.rpi.edu:  GET http://fred.com/purchase.asp?prod=17&qty=102
monica.cs.rpi.edu:  HEAD http://www.slashdot.org/
monica.cs.rpi.edu:  POST http://www.fbi.gov/insecuresubmission.cgi

Note that it is not necessary to include the HTTP version number in the output (but feel free to do so if you want).
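For what it is worth, producing a line in the format shown above takes very little code. The sketch below is one way to do it in Python; format_log_line is a hypothetical helper, and it chooses to leave the HTTP version out.

```python
import sys


def format_log_line(client, request_line):
    """Build 'client:  METHOD URL' from the client's original request line."""
    parts = request_line.split()
    return "%s:  %s" % (client, " ".join(parts[:2]))


def log_request(client, request_line):
    """Print one line per serviced request and flush so it appears at once."""
    print(format_log_line(client, request_line))
    sys.stdout.flush()


if __name__ == "__main__":
    log_request("monica.cs.rpi.edu", "GET http://www.cs.rpi.edu/ HTTP/1.0\r\n")
```

Flushing after every line matters if standard output is redirected to a file or pipe, where output would otherwise sit in a buffer.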