SLAC CPE Software Engineering Group
ESD DOCUMENTATION AND WEB SUPPORT
WEBMCC Mirroring with Wget
The primary part of ESD Software’s Web documentation is located in AFS and served by the SCS public Web servers. To ensure that the production documents critical to operation remain available when SCS services are down, we introduced a local Web server, WEBMCC, which is Windows based and resides in the MCC Computer Room. Web mirroring is therefore required to keep WEBMCC in sync with the documents in AFS. This document describes how to set up a Web mirror from AFS to WEBMCC with Wget. It is part of the ESD Software Documentation and Web Support Design by Ginger and Greg.
Wget is a GNU software package for retrieving files from the Web via the HTTP protocol. Wget has many features that make retrieving large files or mirroring entire Web sites easy, including:
- Non-interactive, meaning it can be used in a script and thus work in the background;
- Can resume aborted downloads, making mirroring reliable;
- Can use filename wild cards and recursively mirror directories;
- Optionally converts absolute links in downloaded documents to relative, so that downloaded documents may link to each other locally;
- Available on both Unix and Windows, which suits mirroring between the Unix-based AFS Web site and the Windows-based local Web site;
- Uses local file timestamps to determine whether documents need to be re-downloaded when mirroring.
The latest version for Windows at the time of writing is wget-1.8.1b. It is precompiled and installed in D:\webroot\tool on WEBMCC; the executable is wget.exe.
The AFS area for ESD Software’s Web documentation is /afs/slac/www/grp/cd/soft/, and its URL is http://www.slac.stanford.edu/grp/cd/soft, served by the SCS Web servers. The local Web area is D:\webroot, which is mapped as drive Y:\ and can be accessed with R/W privilege from any desktop in the group. Its URL is http://webmcc/, served by WEBMCC.
The following example shows how to mirror http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/ to http://webmcc/mirroring_test/ with Wget. In filesystem terms, this is from /afs/slac/www/grp/cd/soft/mirroring_test/ in AFS to Y:\mirroring_test in Windows.
wget -m -e robots=off --reject==A,=D --no-parent -nH -k --cut-dirs=3 http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/
- Recursive Retrieval Options
-m: turn on mirroring, that is, recursive retrieval of files by following HTTP links, with time-stamping so that unchanged files are not downloaded again;
-k: convert non-relative links to relative, so that the downloaded files can link to each other locally;
- Accept/Rejection Options
--no-parent: never ascend to the parent directory when retrieving recursively, so that only files below a given hierarchy are downloaded. In this example, nothing above http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/ will be downloaded even though the -m option is specified;
--reject=LIST: comma-separated list of file name suffixes or patterns to reject. In this example, any file whose name ends in =A or =D is rejected (for example, the sort-order variants of a server-generated directory listing, whose names end this way);
-e robots=off: neither read nor honor robots.txt (server administrators use robots.txt to keep robots such as Wget out of parts of their sites).
- Directory Control Options
-nH: disable generation of host-prefixed directories;
--cut-dirs=number: ignore number leading directory components. For the example URL, the files are saved under:
no options -> www.slac.stanford.edu/grp/cd/soft/mirroring_test/
-nH -> grp/cd/soft/mirroring_test/
-nH --cut-dirs=3 -> mirroring_test/
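The effect of -nH and --cut-dirs on the local directory layout can be sketched in a few lines of shell. The function below is only an illustration and is not part of Wget: local_dir is a hypothetical helper name, and it assumes a URL that ends in a directory path with at least as many components as are cut.

```sh
#!/bin/sh
# Hypothetical helper (not part of Wget): predict the local directory that
# a mirror of URL would be written to, given the -nH and --cut-dirs settings.
# usage: local_dir URL NO_HOST(0|1) CUT_DIRS
local_dir() {
    url=$1; no_host=$2; cut=$3
    rest=${url#*://}     # strip the scheme: www.slac.stanford.edu/grp/cd/...
    host=${rest%%/*}     # host name: everything before the first slash
    path=${rest#*/}      # directory path: everything after the first slash
    i=0
    while [ "$i" -lt "$cut" ]; do
        path=${path#*/}  # drop one leading directory component
        i=$((i + 1))
    done
    if [ "$no_host" -eq 1 ]; then
        printf '%s\n' "$path"
    else
        printf '%s\n' "$host/$path"
    fi
}

url=http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/
local_dir "$url" 0 0   # www.slac.stanford.edu/grp/cd/soft/mirroring_test/
local_dir "$url" 1 0   # grp/cd/soft/mirroring_test/
local_dir "$url" 1 3   # mirroring_test/
```

The three calls reproduce the three cases listed above.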
Understanding Web Mirroring:
It is essential to note that Web mirroring != filesystem mirroring. Web mirroring is done via the HTTP protocol, so only files that are visible on the Web are mirrored. What is preserved on the mirrored site is the Web content tree. Filesystem mirroring, on the other hand, works on the files themselves rather than through the Web server, and can preserve everything (the filesystem tree).
The examples below show a number of scenarios that arise when mirroring Web directories in the AFS area to WEBMCC:
- A Web directory in AFS has an index.html, and the index.html has HTTP links to all files.
- All files are mirrored over to WEBMCC.
There is a very nice tool, lynx, that one can use to verify which files are referenced via HTTP links in a Web page, and thus to determine which files can be mirrored over HTTP. For example, to get a list of the HTTP links in http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/index.html, use
$ lynx -dump http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/index.html | sed -ne '/1\./,$p'
- The index.html has HTTP links to file1.html and file2.html, but not to file3.html.
- file3.html is not mirrored over to WEBMCC.
$ lynx -dump http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir2/index.html | sed -ne '/1\./,$p'
1. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir2/file1.html
2. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir2/file2.html
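A rough way to spot such files before mirroring is to compare the contents of a directory on the AFS side with the links in its index.html. The sketch below is an assumption-laden illustration: unlinked_files is a hypothetical helper name, and it only recognizes direct relative links written as href="name" in index.html. Files linked from other pages are still reachable by Wget's recursion, so the output is a hint rather than a definitive list.

```sh
#!/bin/sh
# Hypothetical helper: list regular files in DIR that DIR/index.html does
# not link to directly, and that a Web mirror would therefore likely miss.
# usage: unlinked_files DIR
unlinked_files() {
    dir=$1
    [ -f "$dir/index.html" ] || return 1   # nothing to compare against
    for f in "$dir"/*; do
        [ -f "$f" ] || continue            # skip subdirectories and the like
        name=${f##*/}                      # file name without the directory
        [ "$name" = index.html ] && continue
        # Crude check: fixed-string, case-insensitive search for href="name".
        if ! grep -qiF "href=\"$name\"" "$dir/index.html"; then
            printf '%s\n' "$name"
        fi
    done
    return 0
}
```

For the dir2 scenario above, running unlinked_files on the AFS directory that holds dir2 would be expected to report file3.html.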
- A Web directory in AFS has no index.html.
- All files are mirrored over to WEBMCC, and an index.html is generated.
- This feature can be used as an approximate filesystem mirroring (for Web directories holding ps, config, gif, or data files).
- A symbolic link exists in AFS: file2.html -> ../dir4/file1.html
- Two copies of the file end up in WEBMCC.
Web mirroring != filesystem mirroring:
- Files in a Web directory that are not referenced in a Web page via HTTP links will not be mirrored.
- Symbolic links in the AFS filesystem hierarchy will be followed if referenced in a Web page, but are not preserved on the mirrored site.
- Only the Web directory hierarchy is preserved.
- All files will be mirrored if a Web directory has no index.html, a feature that can be viewed as a kind of filesystem mirroring.
In principle there is no problem with using symbolic links in the AFS Web area if needed; the mirrored Web site on WEBMCC works just like the AFS site. However, keep in mind that the linked files will be duplicated on the mirrored site. One may want to avoid symbolic links in the AFS site because the potential for very large duplication is too great: our overall Web design is a system in which every functional document-type home page is linked to every project.
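To judge how much duplication to expect, one could list the symbolic links under a Web directory on the AFS side before mirroring. This is a minimal sketch, assuming find and readlink are available on the Unix host; list_symlinks is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: print every symbolic link below DIR together with
# its target, one "link -> target" pair per line.
# usage: list_symlinks DIR
list_symlinks() {
    find "$1" -type l | while IFS= read -r link; do
        printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}
```

A call such as list_symlinks /afs/slac/www/grp/cd/soft/mirroring_test would then show each link that, if referenced from a Web page, ends up as a separate copy on WEBMCC.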
Contact: Jingchen Zhou (X4661, jingchen@slac). Last edited on April 30, 03