A Brief Guide to Installing PyTorch/TensorFlow and Using GPUs
This article briefly describes how to set up a PyTorch/TensorFlow environment on the Lanzhou University HPC platform, how to check GPU usage, how to call the GPU from PyTorch, and how to improve GPU utilization.
Note: the examples in this article use little data and relatively small models, so the reported GPU memory usage and utilization figures may be on the low side.
1. Installing PyTorch/TensorFlow
(1) Installing PyTorch
Installing deep learning frameworks through Anaconda is very convenient, so this section describes how to install PyTorch on Linux with Anaconda.
1) Install Anaconda
Anaconda can be downloaded from:
https://www.anaconda.com/products/individual#windows
After downloading, install it on Linux with:
bash Anaconda3-xxxx-Linux-x86_64.sh
Follow the installer prompts. After the installation finishes, add Anaconda to the PATH environment variable (adjust the path to your own installation):
# Anaconda
export PATH=$PATH:/home/..../anaconda3/bin
Then reload the shell configuration file:
source ~/.bashrc
On the university HPC system, Anaconda is already pre-installed, so it can be used directly by following the official HPC documentation.
2) Install PyTorch
Use the Anaconda installation from the previous step to create a new virtual environment (pytorch in the command below is the environment name and can be changed):
conda create --name pytorch python=3.7
Anaconda will check dependencies and list the packages to be installed into the environment; type y to confirm. Once the environment has been created, activate it with:
source activate pytorch
After creating and activating the virtual environment, install PyTorch. The install command for your configuration can be obtained from the PyTorch website, for example:
conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch #Python 3.9 users will need to add '-c=conda-forge' for installation
After the installation completes, verify PyTorch with the following script:
import torch

a = torch.cuda.is_available()
print(a)

ngpu = 1
device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
print(device)
print(torch.cuda.get_device_name(0))
print(torch.rand(3, 3).cuda())
If the installation succeeded, the script prints True, the device cuda:0, the name of the GPU, and a random 3x3 tensor located on the GPU.
(2) Installing TensorFlow
1) Install Anaconda
As with PyTorch, TensorFlow is installed through Anaconda; the Anaconda installation itself is the same as above and is not repeated here.
2) Install TensorFlow
Create a new virtual environment with Anaconda:
conda create --name tensorflow python=3.7
Once the environment has been created, activate it:
source activate tensorflow
Inside the new environment, install TensorFlow:
conda install tensorflow-gpu
This command installs the latest TensorFlow by default; to install a specific version, append =version to the command above, as illustrated below. When the installation finishes, TensorFlow is ready to use.
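For example, a hypothetical version pin might look like the following (2.4.1 is only an illustrative release number; substitute whichever version you actually need):
conda install tensorflow-gpu=2.4.1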
Verify the installation with:
import tensorflow as tf

print(tf.__version__)
print(tf.test.is_gpu_available())
2. Checking Whether the GPU Is Being Used
The following command shows how the requested resources are being used (here, a single GPU card was requested):
nvidia-smi
You can also monitor usage continuously with the command below; the number 1 is the refresh interval in seconds (shown as Every 1.0s in the header of the watch output) and can be changed:
watch -n 1 nvidia-smi
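GPU memory usage can also be inspected from inside a PyTorch program. A minimal sketch (the tensor size is arbitrary and only for illustration):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.rand(1024, 1024, device=device)           # allocate a tensor on the GPU
print(torch.cuda.memory_allocated(0) / 1024**2)     # MiB currently allocated by tensors
print(torch.cuda.memory_reserved(0) / 1024**2)      # MiB reserved by the caching allocator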
3. Common Ways to Call the GPU from PyTorch
(1) Using DataParallel (DP)
This approach is simple to use: adding only a few statements is enough to train on the requested GPU resources.
First, before any statement that touches the GPU, add:
parser.add_argument('--GPU', type=str, default='0')
os.environ['CUDA_VISIBLE_DEVICES'] = args.GPU  # args.GPU is the index of the requested GPU
The model and the data must also be moved to the GPU:
model = ConvNet()
model = nn.DataParallel(model).cuda()
...
img, label = img.cuda(), label.cuda()
When using DataParallel, these statements are all that is needed to train on the GPU.
Extending DataParallel to multiple GPUs requires only a small change:
# change parser.add_argument('--GPU', type=str, default='0') to:
parser.add_argument('--GPU', type=str, default='0,1')
If you run into load-imbalance problems when using DataParallel for single-node multi-GPU training, try parallelizing the training with the DistributedDataParallel module instead.
A complete DP example is shown below:
# github: https://github.com/LianShuaiLong/CV_Applications/blob/master/classification/classification-pytorch/train.py
import argparse
import glob
from tqdm import tqdm
import logging
import os
import pdb

from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision.datasets import ImageFolder
import torch.optim as optim
from preprocess import transform
import torch
import torch.nn as nn

from backbones.scratch_net import Net
from backbones.mobilenetv2 import Mobilenet_v2
from backbones.vgg19 import vgg19

logging.basicConfig(level=logging.INFO, format='%(asctime)s-%(levelname)s-%(name)s-%(message)s')
logger = logging.getLogger(os.path.basename(__file__))

parser = argparse.ArgumentParser()
parser.add_argument('--data_type', type=str, default='pic_folder', help='pic_label or pic_folder')
parser.add_argument('--img_path', type=str, default='/workspace/dataset/train_imgs')
parser.add_argument('--label_path', type=str, default='/workspace/dataset/train_imgs/label.txt')
parser.add_argument('--train_folder', type=str, default='/workspace/classification/classification-pytorch/dataset/cifar10/train')
parser.add_argument('--class_num', type=int, default=10)
parser.add_argument('--resume', type=bool, default=False)
parser.add_argument('--pretrained_model', type=str, default='/workspace/classification/classification-pytorch/pretrained/model.pth')
parser.add_argument('--epoch', type=int, default=100)
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--lr', type=float, default=0.01)
parser.add_argument('--log_step', type=int, default=10)
parser.add_argument('--save_step', type=int, default=100)
parser.add_argument('--checkpoint_dir', type=str, default='/workspace/classification/classification-pytorch/checkpoint/')
parser.add_argument('--GPU', type=str, default='0')
parser.add_argument('--backbone', type=str, default='scratch_net', help='scratch_net,vgg19,mobilenetv2')
args = parser.parse_args()

# --------------------------------------------
os.environ['CUDA_VISIBLE_DEVICES'] = args.GPU  # GPUs visible to this process; must come before any statement that uses the GPU
# ---------------------------------------------


class Custom_Dataset(Dataset):
    def __init__(self, img_path, label_path, transform):
        super(Custom_Dataset, self).__init__()
        self.img_path = img_path
        self.label_path = label_path
        self.file_list = open(label_path, 'r').readlines()
        self.transform = transform

    def __getitem__(self, index):
        img, label = self.file_list[index].strip().split()  # each line: '<image name> <label>'
        image = Image.open(os.path.join(self.img_path, img))
        image_tensor = self.transform(image)
        return image_tensor, label

    def __len__(self):
        return len(self.file_list)


def prepare_data(opt):
    class_to_idx = {}
    if opt['data_type'] == 'pic_label':
        train_dataset = Custom_Dataset(opt['img_path'], opt['label_path'], transform=transform)
    elif opt['data_type'] == 'pic_folder':
        train_dataset = ImageFolder(opt['train_folder'], transform=transform)
        class_to_idx = train_dataset.class_to_idx
    train_loader = DataLoader(train_dataset, batch_size=opt['batch_size'], shuffle=True)
    idx_to_class = dict(zip(class_to_idx.values(), class_to_idx.keys()))
    return train_loader, idx_to_class


def train(opt):
    # data
    train_loader, idx_to_class = prepare_data(opt)
    # model
    if opt['backbone'] == 'scratch_net':
        print('trained with scratch_net...')
        model = Net(3, opt['class_num'])
    elif opt['backbone'] == 'mobilenetv2':
        print('trained with mobilenetv2...')
        model = Mobilenet_v2(3, opt['class_num'])
    elif opt['backbone'] == 'vgg19':
        print('trained with vgg19...')
        model = vgg19(3, opt['class_num'], bn=False)
    if opt['resume'] and os.path.isfile(opt['pretrained_model']):
        model.load_state_dict(torch.load(opt['pretrained_model'])['model_state_dict'])
        print('load pretrained model from:{}....'.format(opt['pretrained_model']))
    else:
        print('trained from scratch...')
    # -------------------------------------------------
    # pdb.set_trace()
    model = nn.DataParallel(model).cuda()  # move the model to the GPU(s)
    # --------------------------------------------------
    # loss
    train_loss = nn.CrossEntropyLoss()
    # optimizer
    optimizer = optim.SGD(model.parameters(), lr=opt['lr'], momentum=0.9)
    # scheduler
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=len(train_loader)*opt['epoch'])
    # train
    iteration = 0
    total_steps = len(train_loader)*opt['epoch']
    logger.info('total steps:{}'.format(total_steps))
    for epoch in range(opt['epoch']):
        for idx, (img, label) in enumerate(train_loader):
            # ---------------------------------------------------
            img, label = img.cuda(), label.cuda()  # move the inputs and labels to the GPU as well
            # ---------------------------------------------------
            model.train()
            pred = model(img)
            loss = train_loss(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            if iteration % opt['log_step'] == 0:
                correct_num = (torch.argmax(pred, 1) == label).sum().cpu().data.numpy()
                batch_acc = correct_num/opt['batch_size']
                logger.info('step:{},lr:{},loss:{},batch_acc:{}'.format(idx+epoch*len(train_loader), optimizer.state_dict()['param_groups'][0]['lr'], loss, batch_acc))
            if iteration % opt['save_step'] == 0 or iteration == total_steps:
                save_dict = {
                    'model_state_dict': model.module.state_dict(),
                    'learning_rate': optimizer.state_dict()['param_groups'][0]['lr'],
                    'train_loss': loss,
                    'train_acc': batch_acc,
                    'iter': idx+epoch*len(train_loader),
                    'idx_to_class': idx_to_class
                }
                os.makedirs(opt['checkpoint_dir'], exist_ok=True)
                torch.save(save_dict, os.path.join(opt['checkpoint_dir'], 'model_%d.pth' % iteration))
            iteration += 1


if __name__ == '__main__':
    opt = vars(args)
    train(opt)
(2) Using DistributedDataParallel (DDP)
Compared with DataParallel, DDP supports both model parallelism and data parallelism, and it works not only on a single machine but also across multiple machines.
Note the following points when using DDP:
1. Add a local_rank argument and initialize the process group:
# add a local_rank argument to args
parser.add_argument("--local_rank", default=os.getenv('LOCAL_RANK', -1), type=int)

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method='env://')
This corresponds to the environment-initialization call in the full example below:
init_distributed_mode(args=args)
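init_distributed_mode is a helper from the linked repository; the sketch below only indicates what such a helper typically does when the script is launched with torch.distributed.launch or torchrun (the actual implementation lives in multi_train_utils/distributed_utils.py of that repository):

import os
import torch
import torch.distributed as dist

def init_distributed_mode(args):
    # the launcher exports these environment variables for every process
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend="nccl", init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)
    dist.barrier()  # wait until every process has finished initialization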
2. Adapt the data loading:
...
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=False, transform=Transform)
# add a DistributedSampler, which assigns training sample indices to the process of each rank
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=args.world_size, rank=rank)
# add a BatchSampler, which groups the sample indices into lists of batch_size elements
train_batch_sampler = torch.utils.data.BatchSampler(train_sampler, batch_size, drop_last=True)
...
nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of dataloader worker processes
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_sampler=train_batch_sampler,
                                           num_workers=nw,
                                           pin_memory=True)
3. Wrap the model and put it on the GPU:
...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
4. Launch the program
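A typical launch command uses torch.distributed.launch, consistent with the flags described below (shown here as an example; adjust the script name and GPU count to your job):

python -m torch.distributed.launch --nproc_per_node=4 --use_env DDP.py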
In the command above, --nproc_per_node=4 means that four GPU cards are used, and DDP.py is the script to run.
To restrict the run to specific GPUs, use a command such as:
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 --use_env DDP.py
The complete implementation is given below:
# github: https://github.com/WZMIAOMIAO/deep-learning-for-image-processing
# Note: this example imports functions from other files; see the github address above for the imported functions.
import os
import tempfile
import math
from datetime import datetime
import argparse

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
import torch.multiprocessing as mp
from torch.multiprocessing import Process
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp
from torch.utils.tensorboard import SummaryWriter

from model_ResNet import resnet101
from multi_train_utils.distributed_utils import init_distributed_mode, dist, cleanup
from multi_train_utils.train_eval_utils import train_one_epoch, evaluate


def main(args):
    if torch.cuda.is_available() is False:  # check that a GPU device is available
        raise EnvironmentError("not find GPU device for training !")

    init_distributed_mode(args=args)  # initialize the environment of each process

    rank = args.rank
    device = torch.device(args.device)
    batch_size = args.batch_size
    weights_path = args.weights
    args.lr *= args.world_size  # scale the learning rate with the number of GPUs (or use another scaling scheme)

    if rank == 0:
        print(args)
        print('Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/')
        tb_writer = SummaryWriter()
        if os.path.exists('./weights') is False:
            os.makedirs('./weights')

    Transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=False, transform=Transform)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=args.world_size, rank=rank)
    train_batch_sampler = torch.utils.data.BatchSampler(train_sampler, batch_size, drop_last=True)

    val_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=False, transform=Transform)
    val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset, num_replicas=args.world_size, rank=rank)
    val_batch_sampler = torch.utils.data.BatchSampler(val_sampler, batch_size, drop_last=True)

    nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of dataloader worker processes
    if rank == 0:
        print('Using {} dataloader workers every process'.format(nw))

    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_sampler=train_batch_sampler,
                                               num_workers=nw,
                                               pin_memory=True)
    val_loader = torch.utils.data.DataLoader(val_dataset,
                                             batch_sampler=val_batch_sampler,
                                             num_workers=nw,
                                             pin_memory=True)

    model = ConvNet().to(device)
    # model = resnet101(num_classes=10).to(device)

    if os.path.exists(weights_path):  # load pretrained weights
        weights_dict = torch.load(weights_path, map_location=device)
        load_weights_dict = {k: v for k, v in weights_dict.items()
                             if model.state_dict()[k].numel() == v.numel()}
        model.load_state_dict(load_weights_dict, strict=False)
    else:
        checkpoint_path = os.path.join(tempfile.gettempdir(), 'initial_weights.pt')
        if rank == 0:
            # If no pretrained model exists, save the weights of the first process and let
            # the other processes load them, so that the initial weights are identical.
            torch.save(model.state_dict(), checkpoint_path)
        dist.barrier()
        # map_location must be specified, otherwise the first GPU ends up using more memory
        model.load_state_dict(torch.load(checkpoint_path, map_location=device))

    if args.freeze_layers:  # whether to freeze weights
        for name, para in model.named_parameters():
            # freeze everything except the final fully connected layer
            if "fc" not in name:
                para.requires_grad_(False)
    else:
        # SyncBatchNorm only makes sense when the network contains BN layers
        if args.syncBN:
            model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

    pg = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(pg, lr=args.lr, momentum=0.9, weight_decay=0.005)
    lf = lambda x: ((1 + math.cos(x * math.pi / args.epochs)) / 2) * (1 - args.lrf) + args.lrf
    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

    for epoch in range(args.epochs):
        train_sampler.set_epoch(epoch)

        mean_loss = train_one_epoch(model=model,
                                    optimizer=optimizer,
                                    data_loader=train_loader,
                                    device=device,
                                    epoch=epoch)
        scheduler.step()

        sum_num = evaluate(model=model,
                           data_loader=val_loader,
                           device=device)
        acc = sum_num / val_sampler.total_size

        if rank == 0:
            print("[epoch {}] accuracy: {}".format(epoch, round(acc, 3)))
            # tags = ["loss", "accuracy", "learning_rate"]
            tb_writer.add_scalar('loss', mean_loss, epoch)
            tb_writer.add_scalar('accuracy', acc, epoch)
            tb_writer.add_scalar('learning_rate', optimizer.param_groups[0]["lr"], epoch)
            torch.save(model.module.state_dict(), "./weights/model-{}.pth".format(epoch))

    if rank == 0:
        if os.path.exists(checkpoint_path) is True:
            os.remove(checkpoint_path)
    cleanup()


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(1024, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


if __name__ == '__main__':
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_classes', type=int, default=5)
    parser.add_argument('--epochs', type=int, default=30)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--lrf', type=float, default=0.1)
    # whether to enable SyncBatchNorm
    parser.add_argument('--syncBN', type=bool, default=True)
    # download address of the official resnet34 weights
    # https://download.pytorch.org/models/resnet34-333f7ec4.pth
    parser.add_argument('--weights', type=str, default='resNet34.pth',
                        help='initial weights path')
    parser.add_argument('--freeze-layers', type=bool, default=False)
    # do not change this argument; it is assigned automatically
    parser.add_argument('--device', default='cuda', help='device id (i.e. 0 or 0,1 or cpu)')
    # number of processes to start (not threads); no need to set it, it is derived from nproc_per_node
    parser.add_argument('--world-size', default=4, type=int,
                        help='number of distributed processes')
    parser.add_argument('--dist-url', default='env://', help='url used to set up distributed training')
    opt = parser.parse_args()
    main(opt)
The multi-GPU usage during DDP training can then be observed with nvidia-smi, as described in Section 2.
4. Addressing Low GPU Utilization
If GPU utilization is low, the following simple changes are worth trying:
1. Increase the logging output interval in train.py;
2. Try setting pin_memory=True and tuning num_workers in the DataLoader, as sketched below. With pin_memory enabled, batches are placed in page-locked (pinned) host memory, which speeds up copying data from CPU memory to the GPU; num_workers is the number of CPU worker processes used for data loading and generally should not exceed the number of CPU cores.
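A minimal sketch of the second point (the dataset and batch size are placeholders chosen only for illustration):

import os
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.CIFAR10(root='./data', train=True, download=True,
                           transform=transforms.ToTensor())
num_workers = min(8, os.cpu_count())          # keep the worker count at or below the number of CPU cores
loader = DataLoader(dataset,
                    batch_size=64,
                    shuffle=True,
                    num_workers=num_workers,  # CPU worker processes that load and preprocess batches
                    pin_memory=True)          # page-locked host memory speeds up CPU-to-GPU copies

for img, label in loader:
    # non_blocking=True lets the copy overlap with computation when pin_memory is enabled
    img = img.cuda(non_blocking=True)
    label = label.cuda(non_blocking=True)
    break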