Frequently Asked Questions (FAQs)#

Below are some common questions and answers regarding our HPC cluster and its operation.

Cluster Details#

What hardware is in the cluster?

The cluster's compute and storage capabilities change on a regular basis.

See also

Hardware Overview for detailed information on the HPC's hardware components, along with the partitions comprising subsets of these components.

What is a cluster?

A cluster is a network of interconnected computers designed to work together as a unified and robust system. These computers, also known as nodes, collaborate to collectively handle complex computational tasks more efficiently than a single machine could achieve. By distributing workloads across multiple nodes, a cluster harnesses parallel computing capabilities, enabling researchers and professionals to solve intricate problems, analyze large datasets, and perform simulations with incredible speed and precision. Clusters are widely used in high-performance computing (HPC) environments, scientific research, data analysis, and other fields that demand significant computing resources.

What is the policy on fair use of resources?

The fair use policy for computing resources aims to provide equitable access and distribution within the cluster. This policy ensures that all users have an equal opportunity to utilize the cluster resources based on their needs and priorities.

Fair use policies typically utilize a scheduler that allocates resources to jobs based on job priority, resource requirements, and historical resource usage. This approach prevents any single user or job type from monopolizing the resources and ensures a balanced and efficient allocation of resources across different users and projects.

This policy promotes fairness and efficient resource utilization and contributes to an inclusive and productive computing environment for all users.

Back to the top

Accounts#

How do I get an account for the HPC cluster?

See Request An Account: complete the ServiceNow RC Access Request form with your details and submit it. Your request will be processed within a few business days.

How can I reset my password?

Please use our website’s ‘Account Management’ tool for password reset. Follow the instructions to reset your password.

See also

Changing Password.

What happens to my account when I leave NU?

ITS controls access to the University's computing resources, so when you or your students leave, you or they may lose access to many of these resources. Sponsored accounts allow people who work or volunteer at NU, but who are not paid by NU, to access the University's computing resources. Researchers with sponsored accounts cannot request RC services, but they can use the systems we manage as members of a MyGroups group (requires a VPN connection) controlled by an NU Principal Investigator (PI). Details on sponsored accounts are posted on the ITS sponsored accounts page.

How do I get access?

The HPC resources are available to Northeastern University research faculty, students, and classes that require high-performance computing for their coursework. For detailed information about reporting requirements and any access restrictions, please refer to our Getting Access page. If you have questions about our access policy or procedures, please contact RC.

What is my cluster username?

Your cluster username is the same as your NU username, which is your Northeastern email address without @northeastern.edu (usually your last name, a period, then the first letter of your first name; for staff, the first letter of your first name, a period, then your last name). If you're unsure, please contact RC.

See also

Request Account

Back to the top

General Usage#

How do I gain access to the cluster?

A faculty or research staff member must sponsor your HPC account request. Faculty and staff can self-sponsor. Full details can be found here.

See also

Getting Access

How do I log on to the cluster?

Use an SSH client and connect to login.discovery.neu.edu. Instructions for using ssh and other login tools, as well as recommended clients for different operating systems, are described in Shell Environment on the Cluster. You can also access the cluster through our web-based interface, Open OnDemand (OOD).

Where can I find a list of Linux commands?

We provide a cheat sheet of useful Linux (Cluster via Command-Line) and Slurm commands (Introduction to Slurm) and there are more resources you can find online.

How do I reset my password?

Access to the HPC cluster requires a valid Northeastern University email account. Your HPC password is the same as your Northeastern University account password, so to reset it, reset your Northeastern University account password.

See also

Find KB article

How can I view .pdf or .csv files on the cluster?

You can view .pdf and .csv files utilizing the File Explorer application on Open OnDemand and the Open OnDemand Desktop Application.

How do I request all CPUs on a node with more than one GPU?

You may wish to request a single GPU on a node along with all of the node's CPUs. However, the GPUs are bound to specific CPUs, so the job will only run on the CPUs associated with the GPU you're running on. Specifying the --exclusive flag in your job script or requesting all of the node's CPUs will not change this. If you would like to use all cores on a node with one of the GPUs, you must disable this binding in your Slurm script:

#SBATCH --gres-flags=disable-binding

Refer to the Slurm documentation for further information.
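
For illustration, a minimal sketch of such a job script (the partition name, core count, time, and program name are placeholder assumptions to adapt):

#!/bin/bash
#SBATCH --partition=gpu                # a GPU partition; see Partitions
#SBATCH --gres=gpu:1                   # one GPU
#SBATCH --gres-flags=disable-binding   # allow CPUs not bound to this GPU
#SBATCH --cpus-per-task=32             # placeholder: the node's full core count
#SBATCH --time=01:00:00

./my_program                           # placeholder application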

How can I get information on the HPC such as how busy it is?

From a login node, you can run sinfo -p short to get the state of the HPC.

Back to the top

Data Transfer#

How can I transfer data to/from the HPC cluster?

Data transfer can be done using various methods like scp, rsync, or Globus. Refer to the ‘Transferring Data’ section in ‘Data Management’ for detailed instructions.

What Linux commands can I use to transfer files to/from the cluster?

Smaller files can be transferred to/from the cluster using scp, sftp, and rsync as well as standard FTP tools utilizing xfer.discovery.neu.edu.
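
For example, minimal sketches with scp and rsync (the file names and /work project path are placeholders):

$ scp mydata.csv <username>@xfer.discovery.neu.edu:/work/<project>/
$ rsync -av results/ <username>@xfer.discovery.neu.edu:/work/<project>/results/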

Larger files should be moved using Globus.

Back to the top

Job Management#

How do I submit jobs?

You submit jobs by writing a Slurm script and submitting it with the sbatch command. Please see our Slurm documentation page.

How do I submit an interactive job?

For an interactive job on the command line, submit an srun job from the login node. Examples can be seen at Interactive Jobs: srun Command.

If you wish to run a program that requires a graphical user interface or generates other graphics for display, such as a plot or chemical model, use one of the Open OnDemand interactive apps. Several are available, but if the one you wish to use isn't in the list, submit an OOD Desktop job. You can also use X11 forwarding with srun.

What partitions can I use?

All account holders have access to the short, debug, express, and gpu partitions. For more information please see Partitions.

How do I choose which partition to use?

Partitions differ in run time limits, resource limits, and specialty hardware, including GPUs. More information about partitions can be found at Partitions.

How do I check the status of my jobs?

Run the command squeue -u $USER from the login node.

If you are reporting a problem to us about a particular job, please let us know the JobID of the job in question. You can obtain the JobID from the squeue -u $USER output.

Why is my job not starting?

Several things can cause jobs to wait in the queue:

  • You requested a resource combination that is not available at the moment, so the job is marked pending (PD).

  • You have run a large number of jobs in the recent past, and the “fair share” algorithm is giving other users higher priority.

  • The partition you requested is simply very busy.

If your job is pending, squeue shows a reason field. If the reason is Resources, the resource you requested isn't available, either because it is busy or because you requested a nonexistent resource. If the reason is Priority, a job with higher priority than yours is running; your job will rise in priority as it waits, so it will start eventually. To request an estimate of your start time from the queueing system, run squeue -u $USER --start.

How can I check when my job will start?

Run

squeue -j <jobid> --start

Slurm will provide an estimate of the day and time your job will start.

Why was my job killed?

Usually this is because you inadvertently submitted the job to run in a location that the compute nodes can't access or that is temporarily unavailable. If your jobs exit immediately, this is usually why. Other common reasons include using too much memory, using too many cores, or running past the job's time limit.

You can run sacct:

[user@login-00 ~] sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
159637       ompi_char+   parallel  hpc_admin         80  COMPLETED      0:0
159637.batch      batch             hpc_admin          1  COMPLETED      0:0
159637.0          orted             hpc_admin          3  COMPLETED      0:0
159638       ompi_char+   parallel  hpc_admin        400    TIMEOUT      0:1
159638.batch      batch             hpc_admin          1  CANCELLED     0:15
159638.0          orted             hpc_admin         19  CANCELLED  255:126

If it’s still not clear why your job was killed, please contact us and send us the output from sacct.

How can I submit a job to the HPC cluster?

Jobs can be submitted to the HPC with the sbatch and srun commands.

How can I check the status of my job?

You can check the status of your job using the squeue -u $USER command, which will display your current jobs in the queue.

See also

Job Management

What should I do if my job fails?

Check the output and error files for any error messages if your job fails. These files are usually located in your job’s working directory. If you cannot resolve the issue yourself, contact Research Computing with the details of the error message.

When will my job start?

You can list information on your job's start time using the squeue command:

squeue -u $USER --start

Note that Slurm's estimated start time can be somewhat inaccurate, because Slurm calculates the estimate from the jobs currently running or queued in the system. Any job submitted after yours with a higher priority may delay your job; conversely, if jobs complete in less time than they requested, jobs can start sooner than anticipated.

For more information on the squeue command, take a look at our Job Management page.

How do I check the efficiency of my completed jobs?

Run the command seff on the Slurm job ID:

[user@login-00 ~] seff 38391902
Job ID: 38391902
Cluster: discovery
User/Group: user/users
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:03:25 core-walltime
Job Wall-clock time: 00:03:25
Memory Utilized: 652.00 KB
Memory Efficiency: 0.03% of 1.95 GB

What is a batch system? / What is a job scheduler?

The purpose of a batch system is to execute a series of tasks in a computer program without user intervention (non-interactive jobs). The operation of each program is defined by a set or batch of inputs, submitted by users to the system as “job scripts.”

When job scripts are submitted to the system, the job scheduler determines how each job is queued. If there are no queued jobs of higher priority, the submitted job will run once the necessary compute resources become available.

What are the differences between batch jobs, interactive jobs, and GUI jobs?

A batch job is submitted to the batch system via a job script passed to the sbatch command. Once queued, a batch job will run on resources chosen by the scheduler. When a batch job runs, a user cannot interact with it.

  • View a walk-through of an example batch job script

An interactive job is any process that is run at the command line prompt, generally used for developing code or testing job scripts. Interactive jobs should only be run in an interactive session, which is requested through the srun command. As soon as the necessary compute resources are available, the job scheduler will start the interactive session.

  • View the Interactive Development & Testing documentation page

A GUI job uses the cluster compute resources to run an application but displays the application's graphical user interface (GUI) on the local client computer. GUI sessions are also managed by the job scheduler, but they require additional software to be installed on the client-side computer or can be run through Open OnDemand.

How do I submit a job to the batch system?

The primary job submission mechanism is the sbatch command, run from the Linux command line:

$ sbatch <your_job_script>

where <your_job_script> is a file containing the commands that the batch system will execute on your behalf.
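
As a minimal sketch, a complete job script might look like the following (the partition, time, and resource values are placeholder assumptions to adapt):

#!/bin/bash
#SBATCH --job-name=my_job        # name shown in squeue
#SBATCH --partition=short        # partition to run in
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks=1               # number of tasks
#SBATCH --time=00:10:00          # wall time limit
#SBATCH --mem=1G                 # memory request

echo "Hello from $(hostname)"    # replace with your application

Submit it with sbatch and monitor it with squeue -u $USER.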

How do I run applications that use multiple processors (i.e. parallel computing)?

Parallel computing refers to the use of multiple processors to run multiple computational tasks simultaneously. Communications between tasks use one of the following interfaces, depending on the task:

  • OpenMP – used for communication between tasks running concurrently on the same node with access to shared memory

  • MPI (e.g., OpenMPI) – used for communication between tasks that use distributed memory

  • Hybrid – a combination of both OpenMP and MPI interfaces

You must properly configure your job script in order to run an application that uses multiple processors. Sample Slurm scripts for each case are sketched below:
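
These are minimal sketches, not site-specific recommendations; the program names, the openmpi module name, and all resource values are assumptions to adapt. An OpenMP (shared-memory) job uses one task with several CPUs on a single node:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # OpenMP threads
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match threads to allocated CPUs
./my_openmp_program

An MPI (distributed-memory) job uses several tasks, optionally across nodes:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16     # MPI ranks per node
#SBATCH --time=01:00:00

module load openmpi              # assumed module name; check module avail
mpirun ./my_mpi_program

A hybrid job combines the two: request multiple tasks, give each task several CPUs, and set OMP_NUM_THREADS as in the OpenMP example.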

Why do I get the error ‘slurmstepd: Exceeded job memory limit at some point’?

Sometimes, Slurm will log the error slurmstepd: Exceeded job memory limit at some point. This appears to be due to memory used for cache and page files triggering the warning. The process that enforces the job memory limits does not kill the job, but the warning is logged, and it can be safely ignored. If your job truly does exceed the memory request, the error message will look like:

slurmstepd: Job 5019 exceeded memory limit (1292 > 1024), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 5019 ON dev1 CANCELLED AT 2016-05-16T15:33:27 ***

How do I check my job status?

You can use the following command to check the status of the jobs you’ve submitted:

$ squeue -u <user_name>

To check the status of jobs running under a particular group, modify the command with the -A flag:

$ squeue -A <group_name>

To also return QoS information for jobs under a particular group, use the following command:

$ squeue -O jobarrayid,qos,name,username,timelimit,numcpus,reasonlist -A <group_name>

How do I delete a job from the batch system?

You can use the command

$ scancel <job_id>

to delete jobs from the queue. You can only delete jobs that you submitted.

What are the wall time limits for a partition?

You can check the wall time limits, denoted as TIMELIMIT, for a partition with the following command, replacing $PARTITION with the partition of interest:

$ sinfo -p $PARTITION

How can I check how busy a partition is?

Use the following command to view how busy the partition is

$ sinfo -p $PARTITION

and view the STATE column to see how many nodes are ‘idle’ (no jobs running on them).

How can I use GPUs for my job?

To use GPUs for your job, you need to request them in your job script. You can find more information in the ‘Working with GPUs’ section.
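
As a quick sketch (the partition name and values are assumptions; see Partitions for what is available):

#SBATCH --partition=gpu          # a GPU partition
#SBATCH --gres=gpu:1             # request one GPU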

How do I login to the compute node my job is running on?

You will only be able to login to compute nodes that your jobs are running on. If the job is running on one node use:

srun --jobid=jobid --pty /bin/bash

where you can find your jobid by running

squeue -u $USER

If the job is running on more than one node, specify the node you want to login to:

srun --jobid=jobid --nodelist=node_name -N1 --pty /bin/bash

If your job is allocated all of the resources on the node, you will need to include the --overlap option.

Back to the top

Storage#

What are the data storage options?

Several data storage options are available depending on your needs, such as /home, project directories (/work/$PROJECT), and temporary storage (/scratch). Details can be found in the ‘Data Management’ section.

Can my /work storage quota be increased?

You can request a quota increase by submitting a Storage Space Extension Request.

Why can’t I run jobs in my home directory?

/home/$USER directories are intended for relatively small amounts of human-readable data such as text files, shell scripts, and source code. All research work should be stored in /work/$PROJECT for long-term storage, and /scratch/$USER should be used for temporary job storage at run time. Be mindful of the /scratch/$USER purge policy.

Why should I use /scratch storage?

Scratch storage is fast and provides a large quantity of free space for temporary job files. However, there are limits on the number of files and the amount of space you may use, and a time-based purge policy is implemented. Please review our scratch filesystem policy for details.

How do I check my /home/$USER disk usage?

From your /home/$USER directory, run du -shc .[^.]* *, which will output something like:

[<username>@<host> ~]$ du -shc .[^.]* *
39M     .git
106M    discovery-examples
41K     README.md
3.3M    software-installation
147M    total

How long can I store files in /scratch?

/scratch is designed to serve as fast, temporary storage for running jobs, and is not long-term storage. For this reason, files are periodically marked for deletion from all /scratch directories. Please review the /scratch filesystem policy for more details. Store longer-term files in your /work/$PROJECT directory.

Back to the top

Software#

What software applications are available?

The full list of applications installed on the cluster is available by using the module avail command from the terminal on the HPC.

See also

Using Module

May I submit an installation request for an application?

You can submit a Software Installation Request ticket to the Research Computing team.

You can also try to install your program locally with package managers or by compiling from source, following our documentation pages.

Can I install my software on the HPC cluster?

Yes, you can. Please follow the guidelines in the Package Managers and From Source sections. If you encounter any issues, contact Research Computing.

How do I use research software that’s already installed?

We use the modules system for managing software environments. Learn more about how to use modules from Using Module.
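
For example (the gcc module is just an illustration; run module avail to see what is installed):

$ module avail                   # list available software modules
$ module load gcc                # load a module into your environment
$ module list                    # show currently loaded modules
$ module purge                   # unload all modules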

Can I run this Docker container on the cluster?

We do not run Docker on the cluster due to security issues. Instead, we use Singularity. Singularity can run Docker images directly, or you can convert a Docker image to a Singularity image (.sif). To import existing Docker images, use the singularity pull command.

module load singularity
singularity pull docker://<container repository URL path>
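
For example, pulling and running an image from Docker Hub (the python image is an arbitrary public example):

module load singularity
singularity pull docker://python:3.11             # produces python_3.11.sif
singularity exec python_3.11.sif python --version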

Can I run my application/container on a GPU?

Please check the documentation for your application/container before running on a GPU. The developer of the program/container will outline GPU support and commands for the program.

Back to the top

Classroom (course specific)#

How can I get my class access to the HPC?

Please submit a Classroom Access Request ticket in order for your class to get access.

Back to the top

Development#

How do I develop and test software?

Use compute nodes for software development and testing. Connect to the cluster and use the following command:

$ srun --pty /bin/bash

The srun command can be modified to request additional time, processors, and memory. Please refer to Interactive Jobs: srun Command to see more configuration and options for this command.

We use modules to update our $PATH and $LD_LIBRARY_PATH environment variables. To use any available software package that is not part of the default environment, including compilers, you must load the associated modules. To see what is available on the HPC, please use module avail.

What compilers are available?

We have the GNU and Intel compilers available on the HPC. Please use module avail gcc or module avail intel to see the versions we have on the HPC.

Back to the top

Open OnDemand#

Why does my Open OnDemand desktop or app show it’s starting but then it immediately ends?

There are two common reasons why you might not be able to launch an Open OnDemand session, whether an interactive desktop or an app such as JupyterLab Notebook or RStudio:

  1. You are over quota in your /home/$USER directory. See Home Directory Storage Quota

  2. You have a conda environment or a Python module loading in your .bashrc file that is interfering with the Open OnDemand desktop. See Shell Environment on the Cluster

Back to the top

Applications#

MATLAB#

How do I run MATLAB programs?

You may use the interactive MATLAB interpreter on the compute nodes after requesting an srun session and running module load matlab/<version>.
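
A minimal sketch (keep <version> as whatever module avail matlab lists; -nodisplay starts MATLAB without its desktop GUI):

$ srun --pty /bin/bash
$ module load matlab/<version>
$ matlab -nodisplay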

Back to the top

Common Errors and Issues#

Why can’t I log in?

This is a very generic question that is difficult for us to answer. Research Computing supports many services. If you were to ask this question in a help ticket we would respond with: What are you trying to login to? Are you getting any error messages?

Why does my application keep getting killed on the login nodes?

Login nodes are not meant for long-running processes or CPU-intensive tasks. If you are running applications on the HPC interactively, please use srun to get time on a compute node.

How can I see what the file permissions are?

The getfacl command is an easy way to see the permissions of a file or directory. It will display the file/directory name, owner of the file/directory, group name that owns the file/directory, and the detailed permissions of the file/directory. See also: man getfacl or getfacl --help

[user@login-00 ~]$ getfacl output.txt 
# file: output.txt
# owner: user
# group: users
user::rw-
group::r--
other::r--

Why am I getting a QOSMaxSubmitJobPerUserLimit error when I try to submit a job?

You may see this error message when you are submitting more jobs than permitted to run on the partition at a given time. These limits are discussed in Partitions.

sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

You will get this error if you have reached the partition or per-user limits as described here. For example, if a partition allows 1000 queued jobs per user and you already have 1000 jobs in it, submitting another one will fail. Similarly, if an app is limited to one session at a time and you've already launched one viz desktop, you've reached your limit. Wait for some of your jobs to finish and submit more at that time.

Why does my SSH session automatically disconnect?

SSH connections will time out either due to inactivity or network disruptions. If your sessions are disconnecting due to inactivity, you can keep the connection open by having ssh send a periodic keep-alive packet to the server so that it does not time out: add the -o ServerAliveInterval=600 option to your ssh login command.

SSH can also be sensitive to disruptions in the network, which are common on Wi-Fi. Sometimes the keep-alive setting prevents these disconnects. Other times, a setting on your Wi-Fi or ethernet adapter may tell the operating system it can put the device to sleep after a period of inactivity; this is especially common on Windows. Check your network adapter's 'Power Settings' and uncheck any options that allow the system to disable the device to save power. The steps vary by operating system, so we recommend an internet search for the appropriate instructions.
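
Alternatively, you can make the keep-alive setting persistent in your local ~/.ssh/config (a sketch; the host entry mirrors the login address above):

Host login.discovery.neu.edu
    ServerAliveInterval 600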

Why am I seeing WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED when I log in?

Some users logging in through ssh may encounter this error message. If you receive this message, please see our instructions on how to clear this error.

When I try to log in with ssh, nothing happens when I type my password!

When you type your password, the ssh program does not echo your typing or move your cursor. This is normal behavior.

When running Firefox on the cluster, I get : “Firefox is already running, but is not responding. To open a new window, you must first close the existing Firefox process, or restart your system.” What can I do?

From your home directory on the cluster, run the commands:

rm -rf ~/.mozilla/firefox/*.default/.parentlock
rm -rf ~/.mozilla/firefox/*.default/lock

Back to the top

Other Questions#

How do I acknowledge the use of Research Computing resources?

Please acknowledge resources provided by Northeastern University’s Research Computing in publications as follows:

Support provided by Research Computing at Northeastern University [1]. Cite it (using the appropriate citation format) as:

[1] Research Computing, Northeastern University, https://rc.northeastern.edu/.

I think I have a cluster issue, but I’m not sure about it. What should I do?

You can always open a support request when you have questions even if you are not sure whether there is an issue.

See also

Getting Help

How do I report a HPC problem?

Please see Getting Help

What if my question does not appear here?

Take a look at our documentation (this site). If your answer is not there, submit a documentation request on GitHub or contact us via rchelp@northeastern.edu.

Back to the top


This FAQ is not exhaustive. If you have further questions, check the relevant section of the documentation or contact Research Computing support.