SLAC CPE Software Engineering Group
Stanford Linear Accelerator Center
ESD DOCUMENTATION AND WEB SUPPORT

WEBMCC Mirroring with Wget


Contents: Introduction; Wget and Web Mirroring.


 

Introduction

The main body of ESD Software’s Web documentation is located in AFS and served by the SCS public Web servers. To ensure that production documents critical to operation remain available when SCS services are not, we introduced a local Web server, WEBMCC, which is Windows based and resides in the MCC Computer Room. Web mirroring is therefore required to keep WEBMCC in sync with the documents in AFS. This document describes how to set up Web mirroring from AFS to WEBMCC with Wget. It is part of the ESD Software Documentation and Web Support Design by Ginger and Greg.

 

Wget and Web Mirroring

Wget is a GNU software package for retrieving files from the Web via the HTTP protocol. It has many features that make retrieving large files or mirroring entire Web sites easy.

 

The latest version for Windows is wget-1.8.1b. It is precompiled and installed in D:\webroot\tool on WEBMCC; the executable is wget.exe. The AFS area for ESD Software’s Web documentation is /afs/slac/www/grp/cd/soft/, and its URL is http://www.slac.stanford.edu/grp/cd/soft, served by the SCS Web servers. The local Web area is D:\webroot, which is mapped as drive Y:\ and can be accessed with R/W privilege from any desktop in the group. Its URL is http://webmcc/, served by WEBMCC. The following example shows how to mirror http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/ to http://webmcc/mirroring_test/ with Wget. In filesystem terms, this is a mirror from /afs/slac/www/grp/cd/soft/mirroring_test/ in AFS to Y:\mirroring_test in Windows.

 

wget -m -e robots=off --reject==A,=D --no-parent -nH -k --cut-dirs=3 http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/

  1. Recursive Retrieval Options

-m: turn on mirroring, that is, recursively retrieve files by following HTTP links;

-k: convert non-relative links to relative ones, so that the downloaded files link to each other locally.

  2. Accept/Rejection Options

--no-parent: never ascend to the parent directory when retrieving recursively, so that only files below a certain hierarchy are downloaded. In this example, no files above http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/ will be downloaded even if the -m option is specified;

--reject=LIST: comma-separated list of file name suffixes or patterns to reject. In this example, any files whose names end with =A or =D will be rejected;

-e robots=off: neither read nor honor robots.txt (server administrators use the contents of robots.txt to shield parts of their systems from the wanderings of Wget).
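The effect of the reject list can be sketched in shell. Wget treats list entries that contain no wildcard characters as file name suffixes, so =A,=D matches names ending in =A or =D. The helper is_rejected and the sample names (for instance the index.html?N=D style of sort links found in server-generated directory listings) are assumptions made for illustration; they are not part of Wget or of the setup described here.

```shell
#!/bin/sh
# is_rejected NAME LIST
# Succeeds (exit 0) if NAME ends with any suffix in the comma-separated LIST.
# Hypothetical helper; it only imitates suffix matching, not wildcard patterns.
is_rejected() {
    name=$1
    old_ifs=$IFS
    IFS=,
    set -f                      # no filename globbing while splitting LIST
    for suffix in $2; do
        case $name in
            *"$suffix")
                IFS=$old_ifs; set +f
                return 0 ;;
        esac
    done
    IFS=$old_ifs; set +f
    return 1
}

# Example: sort links from a generated listing are rejected, a document is kept.
for f in 'index.html?N=D' 'index.html?M=A' 'file1.html'; do
    if is_rejected "$f" '=A,=D'; then
        echo "reject $f"
    else
        echo "keep   $f"
    fi
done
```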

  3. Directory Control Options

-nH: Disable generation of host-prefixed directories

--cut-dirs=number: ignore the first number directory components of the remote path

e.g.:

no options -> www.slac.stanford.edu/grp/cd/soft/mirroring_test/

-nH -> grp/cd/soft/mirroring_test/

-nH --cut-dirs=3 -> mirroring_test/
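The mapping from URL to local directory shown above can be reproduced with a small shell sketch. The function local_dir is a hypothetical helper written for this illustration; it does not call wget and only imitates the documented effect of -nH and --cut-dirs on the directory name.

```shell
#!/bin/sh
# local_dir URL NO_HOST CUT_DIRS
# Prints the local directory that the options would produce for URL.
#   NO_HOST  : 1 imitates -nH, 0 keeps the host-prefixed directory
#   CUT_DIRS : number of leading directory components to drop (--cut-dirs)
# Hypothetical helper for illustration; it does not call wget.
local_dir() {
    rest=${1#*://}              # drop the scheme
    host=${rest%%/*}            # www.slac.stanford.edu
    case $rest in
        */*) path=${rest#*/} ;; # grp/cd/soft/mirroring_test/
        *)   path= ;;
    esac
    i=0
    while [ "$i" -lt "$3" ]; do
        case $path in
            */*) path=${path#*/} ;;   # remove one leading component
            *)   break ;;
        esac
        i=$((i + 1))
    done
    if [ "$2" -eq 1 ]; then
        printf '%s\n' "$path"
    else
        printf '%s\n' "$host/$path"
    fi
}

url=http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/
local_dir "$url" 0 0    # no options
local_dir "$url" 1 0    # -nH
local_dir "$url" 1 3    # -nH --cut-dirs=3
```

The three calls at the end correspond to the three lines of the example above.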

Understanding Web Mirroring:

It is essential to note that Web mirroring != filesystem mirroring. Web mirroring is done via the HTTP protocol, so only files that are visible on the Web are mirrored; what is preserved on the mirrored site is the Web content tree. Filesystem mirroring, on the other hand, copies the files themselves and can therefore preserve everything (the filesystem tree).

The examples below show a number of scenarios that arise when mirroring Web directories in the AFS area to WEBMCC:

 

Case 1:

 

A useful tool, lynx, can be used to verify which files are referenced via HTTP links in a Web page, and thus to determine which files can be mirrored via the HTTP protocol. For example, to list the HTTP links in http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/index.html, use

$ lynx -dump http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/index.html | sed -ne '/1\./,$p'

1. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/file1.html

2. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/file2.html

3. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/file3.html
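A list like this can also be filtered to show, roughly, which links a mirror restricted with --no-parent would follow. The following is a minimal sketch under that assumption; links_under is a hypothetical helper that reads the lynx output from standard input and is not part of lynx or Wget.

```shell
#!/bin/sh
# links_under PREFIX
# Reads a lynx reference list ("  1. URL" lines) on standard input and prints
# only the URLs that start with PREFIX, i.e. those below the mirrored
# directory. Hypothetical helper for illustration.
links_under() {
    sed -n 's/^ *[0-9][0-9]*\. *//p' |
    while IFS= read -r url; do
        case $url in
            "$1"*) printf '%s\n' "$url" ;;
        esac
    done
}

# Possible usage (requires network access, so it is left commented out):
# lynx -dump http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir1/index.html |
#     links_under http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/
```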

[Figure: Organization Chart]

Case 2:

$ lynx -dump http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir2/index.html | sed -ne '/1\./,$p'

1. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir2/file1.html

2. http://www.slac.stanford.edu/grp/cd/soft/mirroring_test/dir2/file2.html

[Figure: Organization Chart]

Case 3:

        A Web directory in AFS has no index.html.

        All files in the directory are mirrored to WEBMCC, and an index.html is generated.

        This feature can be used as an approximate filesystem mirroring (for Web directories containing ps, config, gif, or data files).

[Figure: Organization Chart]

Case 4:

        Symbolic link in AFS: file2.html -> ../dir4/file1.html

        The link is followed, so two copies of the file exist on WEBMCC.

[Figure: Organization Chart]

Remarks:

Web mirroring != filesystem mirroring:

        Files in a Web directory that are not referenced in a Web page via HTTP links will not be mirrored.

        Symbolic links in the AFS filesystem hierarchy will be followed if referenced in a Web page, but they are not preserved on the mirrored site.

        Only the Web directory hierarchy is preserved.

        All files will be mirrored if a Web directory has no index.html, a feature that can be viewed as a kind of filesystem mirroring.
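One way to see which files were left out of a mirror is to compare the two directory trees after the fact. The sketch below assumes that both trees are reachable as directories from one machine with a POSIX shell, which is not necessarily true of the AFS and WEBMCC setup described here; unmirrored is a hypothetical helper and the paths in the usage comment are placeholders.

```shell
#!/bin/sh
# unmirrored SRC_DIR MIRROR_DIR
# Lists regular files that exist under SRC_DIR but have no counterpart at the
# same relative path under MIRROR_DIR. Symbolic links in SRC_DIR are skipped
# because find -type f does not report them. Hypothetical helper.
unmirrored() {
    (cd "$1" && find . -type f) | sort |
    while IFS= read -r f; do
        [ -e "$2/$f" ] || printf '%s\n' "${f#./}"
    done
}

# Possible usage (placeholder paths):
# unmirrored /path/to/source/mirroring_test /path/to/mirror/mirroring_test
```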

In principle there is no problem with using symbolic links in the AFS Web area if needed; the mirrored Web site on WEBMCC works just like the AFS site. However, one should keep in mind that the linked files will be duplicated on the mirrored site. One may want to avoid symbolic links in the AFS site because the potential for very large duplication is too great: our overall Web design is to create a system in which every functional document type home page is linked to every project.
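To gauge how much duplication a mirror contains, files with identical content can be listed by checksum. This is a minimal sketch that assumes a POSIX shell with find, cksum, sort and awk is available where the mirrored tree can be read; duplicate_files is a hypothetical helper, and an equal CRC and size strongly suggest but do not prove identical content.

```shell
#!/bin/sh
# duplicate_files DIR
# Prints the names of non-empty files under DIR that share checksum and size
# with at least one other file. Hypothetical helper for illustration.
duplicate_files() {
    find "$1" -type f -size +0 -exec cksum {} + |
    sort -k1,1n -k2,2n |
    awk '{
        key = $1 " " $2                      # CRC and size
        name = $0
        sub(/^[0-9]+ [0-9]+ /, "", name)     # keep the file name only
        if (key == prev_key) {
            if (!printed) print prev_name    # first member of the group
            print name
            printed = 1
        } else {
            printed = 0
        }
        prev_key = key
        prev_name = name
    }'
}

# Possible usage (placeholder path):
# duplicate_files /path/to/mirror/mirroring_test
```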

 

Contact: Jingchen Zhou (X4661, jingchen@slac). Last edited on April 30, 03