Downloading Raw Data

Downloading Raw Data with SCP

For those with SLAC accounts, a simple scp command copies raw data files to wherever you choose. This works because SLAC is a Tier 1 site: it hosts a complete copy of all real and simulated data. The flowchart below shows the different tier levels. Tier 0 is where data is produced. Other computing sites fall under Tier 2: they host a partial copy of the data, act as secondary sources for data distribution, and support data processing, simulation, or data analysis depending on their capacity. Tier 3 sites provide opportunistic resources and support for local groups. Note that, except for Tier 1, the assignment of roles to institutions in the flowchart is just an example.

Tier Flowchart

This flowchart is credited to Tina Cartaro.

To learn more about working with a SLAC account, read Working with SLAC Resources.

To copy over data, simply use the following command:

scp username@centos7.slac.stanford.edu:/gpfs/slac/staas/fs1/g/supercdms/data/CDMS/<path/to/file/> <path/to/store/file>
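For repeated transfers, the command above can be wrapped in a small shell helper. The following is a minimal sketch, not an official tool: the function names remote_spec and fetch_raw and the example paths are hypothetical, and it assumes you want a recursive copy (-r) that preserves timestamps (-p).

```shell
# Host and base location of the raw data on the SLAC side (taken from the command above).
SLAC_HOST="centos7.slac.stanford.edu"
SLAC_DATA_ROOT="/gpfs/slac/staas/fs1/g/supercdms/data/CDMS"

# Print the scp source argument.
# Arguments: 1 = SLAC user name, 2 = path relative to the data root.
remote_spec() {
    printf '%s@%s:%s/%s\n' "$1" "$SLAC_HOST" "$SLAC_DATA_ROOT" "$2"
}

# Copy a remote file or directory into a local directory, creating it if needed.
# Arguments: 1 = SLAC user name, 2 = path relative to the data root, 3 = local directory.
fetch_raw() {
    mkdir -p "$3" && scp -r -p "$(remote_spec "$1" "$2")" "$3"
}

# Example call (hypothetical user name and paths):
# fetch_raw username path/to/file ./raw_data
```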

Downloading Raw Data With GNU Wget

GNU Wget is a free software package for retrieving files from web servers and is typically pre-installed on most Linux distributions today. We can use wget to download raw data from the Data Catalog.

First, navigate to the file you would like to download in the Data Catalog and copy its download link.

Then use the following command to download the file:

wget <file_URL> -O path/to/directory/to/store/file/<file_name>

The -O flag writes the downloaded file to the given path with the given name. Make sure the output file name matches the file name in the Data Catalog. For example, if the file is named 01150212_1819_F0001.gz in the Data Catalog, the output file should have the same name.
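The download step can be sketched as a small shell function. This is an illustration under assumptions: output_path and fetch_from_catalog are hypothetical names, the URL is left as a placeholder, and the final gzip -t integrity check assumes the .gz file is an ordinary gzip archive.

```shell
# Print the local path the download will be written to.
# Arguments: 1 = target directory, 2 = file name as shown in the Data Catalog.
output_path() {
    printf '%s/%s\n' "${1%/}" "$2"
}

# Download a file, then test the gzip archive for corruption.
# Arguments: 1 = download link copied from the Data Catalog,
#            2 = target directory, 3 = file name as shown in the Data Catalog.
fetch_from_catalog() {
    mkdir -p "$2" || return 1
    wget "$1" -O "$(output_path "$2" "$3")" || return 1
    gzip -t "$(output_path "$2" "$3")"
}

# Example call (the URL stays a placeholder; the file name is the one used above):
# fetch_from_catalog "<file_URL>" ./raw_data 01150212_1819_F0001.gz
```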

Downloading Raw Data With Rclone

Another way to download raw data is with rclone. Rclone is a command-line program for managing files on cloud storage. In this tutorial, we will configure rclone so that we can easily download raw data from the Open Storage Network (OSN) to run BatNoise on.

Install Rclone

First, check if rclone is already installed using the command:

rclone version

If rclone is not installed, refer to the official rclone Documentation to install it.
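In a script, the same check can be made without relying on an error message. A minimal sketch follows, in which have_cmd is a hypothetical helper name:

```shell
# Return success if the named program is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd rclone; then
    # Print the installed version, as in the manual check above.
    rclone version
else
    echo "rclone not found; see the official rclone documentation to install it" >&2
fi
```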

Configure Rclone

Next, we must configure rclone. Run the command:

rclone config

Then, choose the new remote option and name it OSN. The type should be "s3". As of June 15, 2022, this is option 4 in the list you are prompted with. The provider should be "Ceph", which is option 3 as of June 15, 2022. Skip the next few options until you are prompted for an endpoint. To access the OSN, the endpoint should be:

https://ncsa.osn.xsede.org

Skip all succeeding fields until prompted to finalize the new remote. Finally, quit the config.
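The option numbers in the interactive prompts can shift between rclone releases, which is presumably why they are dated above. As an alternative sketch, rclone's config create subcommand can define the same remote without prompts. The function names create_osn_remote and show_osn_remote are hypothetical, and the exact argument form may differ on older rclone versions, so check rclone config create --help before relying on it.

```shell
# Remote name and endpoint, as given in the steps above.
OSN_REMOTE="OSN"
OSN_ENDPOINT="https://ncsa.osn.xsede.org"

# Create an s3-type remote with the Ceph provider and the OSN endpoint.
create_osn_remote() {
    rclone config create "$OSN_REMOTE" s3 provider Ceph endpoint "$OSN_ENDPOINT"
}

# Print the stored settings for the remote so they can be checked by eye.
show_osn_remote() {
    rclone config show "$OSN_REMOTE"
}

# Example: create_osn_remote && show_osn_remote
```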

Setup Access Keys

Now, we need to clone and source credentials to access the OSN for SuperCDMS. Create a new directory to hold the credentials:

mkdir soft

Inside the new directory, clone the osn_secrets repository:

git clone git@gitlab.com:supercdms/DataHandling/osn_secrets.git

Next, source the OSN_creds.sh script to export the environment variables OSN_ACCESS_KEY and OSN_SECRET_KEY, which allow you to access the OSN. You can do so with the following command:

source osn_secrets/OSN_creds.sh

Lastly, we need to add the access keys to the rclone config file. Copy the access keys from OSN_creds.sh. Use the rclone config file command to get the path to the config file, then open the file and add the following lines between provider and endpoint:

access_key_id = <OSN_ACCESS_KEY>
secret_access_key = <OSN_SECRET_KEY>
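Instead of editing the file by hand, the keys exported by OSN_creds.sh could be written with rclone config update. This is a sketch under assumptions: it presumes the remote is named OSN and that OSN_creds.sh has already been sourced in the current shell. require_env and store_osn_keys are hypothetical helper names.

```shell
# Succeed only if the named environment variable is set and non-empty.
require_env() {
    eval "[ -n \"\${$1:-}\" ]"
}

# Write both keys into the OSN remote's section of the rclone config file.
store_osn_keys() {
    require_env OSN_ACCESS_KEY || { echo "OSN_ACCESS_KEY is not set" >&2; return 1; }
    require_env OSN_SECRET_KEY || { echo "OSN_SECRET_KEY is not set" >&2; return 1; }
    rclone config update OSN \
        access_key_id "$OSN_ACCESS_KEY" \
        secret_access_key "$OSN_SECRET_KEY"
}

# Example: source osn_secrets/OSN_creds.sh && store_osn_keys
```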

Downloading Files

You can now download any file from the OSN. Change to the directory you want to copy the files into, then list the files in the remote directory you are interested in with:

rclone ls OSN:<path_to_directory>

Finally, copy over the files with:

rclone copy source:path dest:path
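In the generic form above, source:path is the remote side and dest:path is the destination; for a download, the destination can simply be a local directory. A sketch using a remote named OSN follows. osn_spec and osn_fetch are hypothetical names, and the remote path is left as a placeholder. The --dry-run flag reports what would be transferred without copying anything, and --progress prints transfer statistics while copying.

```shell
# Print the rclone source argument for a path on the OSN remote.
osn_spec() {
    printf 'OSN:%s\n' "$1"
}

# Preview the transfer, then copy into a local directory.
# Arguments: 1 = path on the OSN remote, 2 = local directory (default: current directory).
osn_fetch() {
    rclone copy --dry-run "$(osn_spec "$1")" "${2:-.}" || return 1
    rclone copy --progress "$(osn_spec "$1")" "${2:-.}"
}

# Example: osn_fetch "<path_to_directory>" ./raw_data
```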

Downloading Raw Data with OSN Transfer Tools

To be added in the future once the issues are resolved.

Downloading Raw Data with DataCat

To be added in the future once working.