'datafile', transfer .mat files of a folder + subdirectories

Postby Sander » 2018-10-15 15:20:56

Dear Techila Community,

I just started using Techila. After completing most of the tutorials, I wanted to apply the cloudfor function to my own data.

I am working with neuroimaging data and need a cloud-computing solution because I want to segment >1000 T1 brain images, which would take about 60 min per subject on my local computer. Furthermore, I am using Matlab (an academic licence) and the SPM and CAT12 toolboxes, which include .mat files that call multiple other .mat files during processing. In addition, the T1 images (50 MB per file) are needed for processing, so I would need to upload them too (which might strain the workspace), and several output files of about the same size as the T1 image are created in the process.

Anyway, it's a lot of work to figure out which .mat files are needed and to add them (maybe 50 files) via
cloudfor('datafile',file1,file2,.....)

Is there a way to add folders, instead of individual files, to the Workspace?
Or is it even possible to upload all the needed files to a Google Cloud Storage bucket and then 'addpath' them (like in Matlab)? (And maybe even write the output files into the bucket?)

Kind regards,
Sander
 
Posts: 1
Joined: 2018-10-15 13:45:55

Re: 'datafile', transfer .mat files of a folder + subdirectories

Postby techila support » 2018-10-16 11:29:16

Hello Sander,

Thanks for reaching out. Regarding this part:

I am using Matlab (an academic licence) and the SPM and CAT12 toolboxes, which include .mat files that call multiple other .mat files during processing.


Did you mean m-files instead of mat-files? If so, one option would be to build a list of all m-files in the toolboxes programmatically and pass this list to cloudfor, defining the dependencies with the following cloudfor parameter:

Code: Select all
%cloudfor('dependency',<csv list of m-files needed during computation>)
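
For example, a minimal sketch of building such a list for your toolboxes (the folder names 'spm12' and 'cat12' below are assumptions; adjust them to match your installation):

Code: Select all
% Build a comma-separated list of all m-files in the toolbox folders.
% NOTE: the folder names are placeholders; adjust to your installation.
dirs_of_interest = {'spm12','cat12'};
fnames = [];
for k = 1:length(dirs_of_interest)
    % Recursive listing of all .m files (requires Matlab R2016b or newer)
    flist = dir([dirs_of_interest{k} filesep '**' filesep '*.m']);
    fnames = [fnames strjoin({flist.name},',') ','];
end
fnames = fnames(1:end-1); % trim the trailing comma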


Regarding your other question about data transfers: using Google storage buckets is possible with Techila. The Techila Workers automatically have the necessary permissions and software components to download/upload files from/to buckets. If you also want to programmatically transfer files between your Matlab workstation and a Google bucket, then you will need to install the Google Cloud SDK on your computer as well.
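
Once the SDK is installed, transfers can be scripted from Matlab with system calls. For example, a quick sketch of uploading a file (the bucket and file names below are placeholders):

Code: Select all
% Placeholder names; replace with your own bucket and file.
[status,cmdout] = system('gsutil cp T1_subject001.nii gs://my-bucket/inputs/');
if status ~= 0
    error(['Upload failed. cmdout = ' cmdout])
end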

I made an example code package that shows how to build the dependency list (of m-files) programmatically. This example also uses Google cloud storage buckets to transfer data using the following flow:

    1. Upload set of input files from your computer to a Google storage bucket
    2. At the start of each Job, one input file will be retrieved from the bucket. Job #1 will download the first file, Job #2 will download the second file and so on.
    3. At the end of each Job, a result file will be generated. This result file will be transferred back to the Google storage bucket using a unique folder/name combination (includes timestamp).
    4. As soon as a new result file is available in the Google storage bucket, it will be downloaded to your computer using a callback function (defined in the file my_cbfun.m; a sketch is shown below).
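
The callback function itself is included in demo.zip; for reference, here is a minimal sketch of what my_cbfun could look like (the implementation in the attachment may differ in details):

Code: Select all
function my_cbfun(res_file_name,result_directory,bucketname)
% Sketch of the callback: download one result file from the bucket to the
% local result directory as soon as the corresponding Job has finished.
disp(['Downloading result file: ' res_file_name])
cmd = ['gsutil cp gs://' bucketname '/' result_directory '/' ...
       res_file_name ' ' result_directory];
[status,cmdout] = system(cmd);
if status ~= 0
    error(['Result file download failed. cmdout = ' cmdout])
end
end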

You can find the example code attached in this post (demo.zip).

To start the process, extract demo.zip on your computer and run the main function. Please note, however, that before you can run it successfully you will need to create a Google storage bucket and configure the Google Cloud SDK on your computer.

For reference, I have included the 'main.m' file below:
Code: Select all
function [res1, res2, res3, res4, res5, datafromfile] = main()
%% Update paths to contain the function files used in the example

addpath('folder1')
addpath('folder2')
addpath('folder2/folder3')


%% Create .mat files and transfer them to a Google bucket from your computer
ddir = 'matfiles';
if ~exist(ddir,'dir')
    disp(['creating ' ddir ' directory for input files.'])
    mkdir(ddir)
end

% Create input files for demo purposes. These will be transferred via a
% Google cloud bucket to Techila Workers.
filecount = 10;
basename='inputfile_';
for x=1:filecount
    save([ddir filesep basename num2str(x) '.mat'],'x')
end

% Transfer the files to the bucket from your computer.
% Note! Before you can run this command, you will need to:
%
% 1. Install the Google Cloud SDK on your computer.
%    https://cloud.google.com/sdk/install
%
% 2. Grant authorization to Cloud SDK (gcloud) to access Google Cloud
%    Platform.
%
%    https://cloud.google.com/sdk/gcloud/reference/auth/
%
% 3. Create a Google bucket for your data.
%
% 4. Modify the 'bucketname' parameter below to match your bucket's name.

bucketname = 'demo-bucket-test';
disp(['Will use following bucket to store data: ' bucketname])
% Test that we can execute gsutil commands
cmd = ['gsutil ls gs://' bucketname];
[status,cmdout] = system(cmd);
if status ~= 0
    % If the status code is something other than 0, something went wrong.
    error(['Something went wrong. cmdout = ' cmdout]);
end

% Specify a bucket folder where the input files will be stored.
tdir = 'my-input-data';
disp(['Will use the following bucket folder to store input files: ' tdir])

% Specify a name for the result data folder. This will be used to name both
% the local and cloud folders.
result_directory = ['myresults-' datestr(now,'yyyy-mm-dd-HH-MM-SS')];
disp(['Will use the following folder to store result files: ' result_directory])
% Create the folder locally
mkdir(result_directory)

% Transfer all input files from the local folder to the Google bucket.
disp('Starting input file upload...')
cmd2 = ['gsutil -m cp ' ddir filesep '*.mat gs://' bucketname '/' tdir];
[status2,cmdout2] = system(cmd2);
if status2 ~= 0
    % If the status code is something other than 0, something went wrong.
    error(['Something went wrong. cmdout = ' cmdout2])
end
disp('Completed uploading input files.')


%% This code section shows how to programmatically add all .m files from
%  specified directories to the compilation.

% .m-files from the following folders will be added to the compilation.
dirs_of_interest = {'folder1','folder2'};

% Build a char array that contains the .m-file names as a comma-separated
% list.
fnames=[];
for x = 1:length(dirs_of_interest)
    % get recursive file listing that includes all .m files
    flist = dir([dirs_of_interest{x} filesep '**' filesep '*.m']);
    for y = 1:length(flist)
        fnames = [fnames flist(y).name ','];
    end
end
% Trim trailing comma
fnames = fnames(1:end-1);


% result arrays:
res1=zeros(1,filecount);
res2=zeros(1,filecount);
res3=zeros(1,filecount);
res4=zeros(1,filecount);
res5=zeros(1,filecount);
res_file_names=cell(1,filecount);


% Cloudfor parameters:
%
% cloudfor('dependency','eval(fnames)')
% = Adds all m-files listed in 'fnames' as dependencies and includes them
% in the compilation.

% Create the Project. Will process one file per Job.
cloudfor x=1:filecount
%cloudfor('dependency','eval(fnames)')
%cloudfor('stepsperjob', 1)
%cloudfor('callback','my_cbfun(res_file_names{x},result_directory,bucketname);')
  if isdeployed % prevent local code execution
      % Code block inside the if isdeployed -statement will be executed on
      % Techila Workers.

      % Download one input file from the bucket to the Techila Worker.
      fname = [basename num2str(x) '.mat'];
      cmd3 = ['gsutil cp gs://' bucketname '/' tdir '/' fname ' .'];
      [status3,cmdout3] = system(cmd3);

      % Load the data from the file to verify that we have the right file.
      datafromfile{x} = load(fname);

      % Execute the functions using eval to prevent the automatic dependency
      % checking from picking up the functions. These will work, as the
      % dependencies have been specified manually
      % using the %cloudfor('dependency','eval(fnames)') parameter.
      res1(x) = eval('f1(x)');
      res2(x) = eval('f2(x)');
      res3(x) = eval('f3(x)');
      res4(x) = eval('f4(x)');
      res5(x) = eval('f5(x)');

      % For the sake of the example, write some result data into a file so we
      % can transfer it via the cloud bucket. This is the recommended way to
      % return results if the amount of result data is large.
      res_str = ['Result from idx ' num2str(x)];
      res_file_name = ['my_result_' num2str(x) '.mat'];

      save(res_file_name,'res_str')
      cmd4 = ['gsutil cp ' res_file_name ' gs://' bucketname '/' result_directory];
      [status4,cmdout4] = system(cmd4);

      res_file_names{x} = res_file_name; % return the result file name
  end
cloudend


end


Also for reference, here is example output generated when running the code:

[res1, res2, res3, res4, res5, datafromfile] = main();
creating matfiles directory for input files.
Will use following bucket to store data: demo-bucket-test
Will use the following bucket folder to store input files: my-input-data
Will use the following folder to store result files: myresults-2018-10-16-14-11-13
Starting input file upload...
Completed uploading input files.
TIP: use "%cloudfor('inputparam',<variable>)" to define the variables to be delivered into Techila environment.
Techila initialized.
Creating 1 databundle(s)...
Creating Datafile Bundle 1...
Creating Parameter Bundle...
Project ID 61 created (10 jobs)
NOTE: For the first Project, it may take extra time to prepare the Techila Distributed Computing Engine environment.
This can range from 10 minutes to 1.5 hours depending on the required configuration steps.
Streaming results...
Downloading result file: my_result_2.mat
Downloading result file: my_result_9.mat
Downloading result file: my_result_10.mat
Downloading result file: my_result_5.mat
Downloading result file: my_result_1.mat
Downloading result file: my_result_6.mat
Downloading result file: my_result_8.mat
Downloading result file: my_result_3.mat
Downloading result file: my_result_4.mat
Downloading result file: my_result_7.mat

################ Project Statistics ################
Project ID: 61
Workers participated: 1
Total CPU time used: 0 d 0 h 1 m 21 s
Wall clock time used: 0 d 0 h 1 m 32 s
Acceleration factor: 0,88x

################ Job Statistics ################
Avg CPU core usage: 89,45% (CPU time / wall clock time)
CPU Time: 7,874 s (min) 8,198 s (avg) 8,359 s (max)
Memory used: 221,374 MB (avg) 222,840 MB (max)
I/O read: 43,066 MB (avg) 43,066 MB (max)
I/O write: 0,426 MB (avg) 0,426 MB (max)
Average total I/O: 43,492 MB
>>
Attachments
demo.zip
Example on how to build list of dependencies programmatically and how to use a Google cloud storage bucket to transfer files.
(3.4 KiB) Downloaded 20 times
Techila MATLAB documentation available here:

http://www.techilatechnologies.com/help ... ngine.html
techila support
Techila Staff
 
Posts: 49
Joined: 2015-12-21 10:19:47

