HPC Getting Started: Data Storage and Archiving
It's important for you to manage your files and data! HPC datasets can be quite large, and they tend to get bigger each year. They can be very important, especially when your thesis or publication depends on them. Since I/O is often a bottleneck in high performance computing, efficient, high-speed access to your data is critical. At times, you might need to share data with colleagues or the public. Other data might be sensitive or proprietary. Therefore, it is your duty to:
- Back up and archive critical data. Never keep just one copy of important data!
- Select the most appropriate resources to meet your needs.
- Use shared resources responsibly.
- Set the appropriate access controls.
- Follow proper use policies.
Each researcher has the ultimate responsibility for managing his or her own data.
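On a Linux cluster, access controls are set with standard file permissions. A minimal sketch (the directory name is illustrative):

```shell
# Make your home directory private: only you can read, write, or enter it.
chmod 700 "$HOME"

# Share a results directory read-only with members of your Unix group,
# while keeping it closed to everyone else ("shared_results" is illustrative).
mkdir -p "$HOME/shared_results"
chmod 750 "$HOME/shared_results"

# Verify the permissions you set.
ls -ld "$HOME" "$HOME/shared_results"
```

Check permissions with `ls -l` whenever you create files you intend to share or to keep private.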
You'll have several different kinds of files:
- The source code for programs and libraries, including F77 or C++ code files, makefiles, shell scripts, and similar files.
- The compiled and linked versions of programs. These are usually ready to run, but you could have intermediate files.
- Programs from other places, including open source software, purchased software, or other programs used by permission of the authors.
- The input data and configuration files for the programs or packages you are using.
- Intermediate files written out by some packages. These may be intended as input to a later run or as checkpoint files to restart a job.
- The results of a run or series of runs, either as a listing to be read or printed or as raw data to be graphed or processed further.
- Documentation, notes, test data, and other supporting files.
The home filesystem is intended for source code, executables, configuration files, and similar data. This is a global clustered filesystem that is available to every node on the cluster. I/O to the home filesystem is normally fast. It is backed up daily, but make sure you have another copy, not on the DLX, of any file that would be hard to recreate. This is especially important for source code you have written or data you have collected.
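One simple way to keep an off-cluster copy is to bundle your source tree into a date-stamped archive and pull it to your workstation. A sketch, assuming illustrative paths, a username, and a hostname (substitute the real DLX address):

```shell
# On the cluster: bundle your source tree into a date-stamped archive.
tar czf "myproject-$(date +%Y%m%d).tar.gz" -C "$HOME/src" myproject

# On your workstation: pull the archive off the cluster
# (username and hostname here are illustrative placeholders).
rsync -av "username@dlx.uky.edu:myproject-*.tar.gz" "$HOME/backups/"
```

A dated archive name makes it easy to keep several generations of backups side by side.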
The scratch filesystem is intended for temporary storage of checkpoints, program output, or large input files. This is a global clustered filesystem that is available to every node on the cluster. I/O to the scratch filesystem is very fast and efficient. However, scratch files are temporary and the scratch filesystem is not backed up. Copy any important files to your home directory, HSM, or other location. Scratch files may be purged after 30 days.
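Before the purge window closes, copy anything worth keeping out of scratch, and check for files approaching the 30-day limit. A sketch (the `/scratch/$USER/run42` path is illustrative):

```shell
# Copy a finished run's output from scratch to your home directory.
mkdir -p "$HOME/results/run42"
cp -a "/scratch/$USER/run42/output/." "$HOME/results/run42/"

# List your scratch files not modified in over 30 days -- purge candidates.
find "/scratch/$USER" -type f -mtime +30 -ls
```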
The HSM system is intended for long-term storage of files that should be retained, but to which you don't need immediate access. It can be used to archive code and results or to 'stage' large files that won't all fit in your home or scratch directories. Access is through SFTP. HSM is a near-line facility using a large disk cache and two tape robots to automatically stage files and make them available within a few seconds to a few minutes. HSM is backed up daily.
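Since HSM access is through SFTP, an archive session can be scripted with an sftp batch file. A sketch, where the hostname, username, and filename are illustrative placeholders (use the address UKIT provides):

```shell
# Write an sftp batch file: upload an archive, then list what's stored.
cat > hsm_batch.txt <<'EOF'
put results.tar.gz
ls -l
EOF

# Run the batch against the HSM host; sftp exits non-zero if a command fails.
sftp -b hsm_batch.txt username@hsm.uky.edu
```

Batch mode is convenient for unattended transfers, for example at the end of a job script.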
Your personal workstation can be used to save source code, input files, and results of moderate size, depending on the amount of disk space you have available. If it holds important data, we urge you to have it backed up by the campus TSM system or something similar.
Some departments have servers for storing and sharing files for their researchers. Contact your departmental IT person for more information. We suggest these be backed up by the campus TSM system or something similar.
In some cases you may want to move or copy your files to another supercomputing site or to one of the commercial cloud storage providers.
UKIT uses TSM for backups and writes two copies of each file, which are stored in tape libraries at two separate locations on campus. However, UKIT cannot be responsible for any data loss. Make sure you have more than one copy of any important file, and make sure these copies are stored in separate locations.
Data Confidentiality and Access Control
The DLX cluster is a research cluster, intended primarily for fundamental research. We cannot guarantee complete confidentiality for data that resides there. It is your responsibility to set access controls appropriate to your needs. While we take care to secure our systems, security breaches could expose your data to others. System administrators with “root” privileges are not constrained by file permissions, and they have the ability to read and copy any file. They can also assume a user’s identity on the system. Users may encrypt data to provide extra measures of privacy if desired.
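For files that must stay confidential, symmetric (passphrase-based) encryption is one common approach. A sketch using GnuPG (the filename is illustrative):

```shell
# Encrypt with a passphrase; gpg prompts for it and writes data.tar.gz.gpg.
gpg --symmetric --cipher-algo AES256 data.tar.gz

# Decrypt later (prompts for the same passphrase).
gpg --output data.tar.gz --decrypt data.tar.gz.gpg
```

Keep the passphrase somewhere safe and separate from the encrypted file; without it, the data cannot be recovered.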