Reading data from Amazon S3 on Matlab Parallel Cloud Worker

Question

Jhon Wine on 24 Jan 2018

Commented: Jhon Wine on 5 Jul 2018

Accepted Answer: Jhon Wine

Hi, I'm trying to process a big dataset that is stored on Amazon S3. My code architecture is as follows:

The MATLAB client calls MATLAB Parallel Cloud (my default cluster is Parallel Cloud, 16 workers):

r = zeros(100,1);
readTimes = r;
parfor i=1:100
  [ri,readTimesi] = myProcess(i);
  r(i) = ri;
  readTimes(i) = readTimesi;
end
fprintf('Mean Read Time %.1f sec\n',mean(readTimes));

Each worker accesses Amazon S3 independently to retrieve its data for processing, using a fileDatastore.

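function [r, readTime] = myProcess(i)
  % Set S3 credentials on the worker ('ID' and 'Key' are placeholders)
  setenv('AWS_ACCESS_KEY_ID', 'ID');
  setenv('AWS_SECRET_ACCESS_KEY', 'Key');
  setenv('AWS_REGION', 'us-west-2');
  % Load the data for this iteration and time the read
  fp = ['s3://mybucket/data/file' num2str(i) '.data'];
  t = tic;
  ds = fileDatastore(fp, 'ReadFcn', @AWSRead);
  data = read(ds);
  readTime = toc(t);
  % Process
  % ...
  r = mean(data);
end

function data = AWSRead(fileName)
  % Read the whole file as 16-bit signed integers (returned as doubles)
  fid = fopen(fileName);
  if fid == -1
    error('AWSRead:openFailed', 'Could not open %s', fileName);
  end
  data = fread(fid, inf, 'short');
  fclose(fid);
end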
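
As a side note on the credential setup above: myProcess calls setenv on every loop iteration. A possible alternative, sketched here under the assumption that a parallel pool is available (Parallel Computing Toolbox), is to set the variables once on every worker with parfevalOnAll before the parfor loop starts. The creds table below reuses the placeholder values 'ID' and 'Key' from the question; nothing here is specific to MATLAB Parallel Cloud, and I have not measured whether it changes the read times.

```matlab
% Sketch: set the S3 environment variables once per worker instead of once
% per parfor iteration. 'ID' and 'Key' are placeholders, as in the question.
pool = gcp;  % use the current pool, or start the default one if none exists

creds = {'AWS_ACCESS_KEY_ID',     'ID'; ...
         'AWS_SECRET_ACCESS_KEY', 'Key'; ...
         'AWS_REGION',            'us-west-2'};

for k = 1:size(creds, 1)
    % Run setenv(name, value) on every worker; setenv returns no outputs
    f = parfevalOnAll(pool, @setenv, 0, creds{k, 1}, creds{k, 2});
    wait(f);
end

% Optional check: ask every worker which region it now sees.
% Wrapping the result in a cell keeps the per-worker outputs separate.
f = parfevalOnAll(pool, @() {getenv('AWS_REGION')}, 1);
regions = fetchOutputs(f);
disp(regions);
```

With this in place, the three setenv lines could be dropped from myProcess, provided the same pool is used for the parfor loop.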

I'm trying to troubleshoot why my Mean Read Time is slow and how I can speed it up.

I noticed that the Mean Read Time is much faster when I use my local machine as the parallel worker pool, parpool('local'), rather than MATLAB Parallel Cloud. I read in the MATLAB documentation that MATLAB Parallel Cloud runs on EC2, which should integrate with S3 automatically and give very good data transfer speeds if both EC2 and S3 are in the same region.

My questions are: Which region should I use to get the best data transfer performance? Where is MATLAB Parallel Cloud hosted? Or how else can I speed up my data transfers (other than running locally, as I need many more workers)?

I did not use MATLAB Drive to host my files, as they are too big and will not fit within MATLAB Drive's 5GB maximum allocation.

Answer 1

Jhon Wine on 26 Jan 2018

After looking into this further, I think MATLAB Parallel Cloud runs in the US East (N. Virginia) region, which is a different region than the one where I stored my data. After I switched the storage location to match, the problem was solved.
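
For anyone who wants to check the effect of the storage location themselves, here is a minimal sketch of how the read time could be measured for a list of candidate locations. It is not the code I used; timeRead and readShorts are hypothetical helper names, and the example reads a small local temporary file so it runs without any S3 access. To compare regions, the entries of locations would be replaced with S3 URLs of the same file stored in different regions (the bucket name in the comment is made up), with the credentials set as in the question. Local functions in a script require R2016b or later.

```matlab
% Stand-in data: a small local file of int16 samples. To compare regions,
% replace the entries of locations with S3 URLs such as
% 's3://my-bucket-us-east-1/data/file1.data' (hypothetical bucket name).
tmpFile = [tempname '.data'];
fid = fopen(tmpFile, 'w');
fwrite(fid, int16(1:1000), 'short');
fclose(fid);

locations = {tmpFile};
nRepeats = 5;                         % repeat each read to average out noise
meanReadTime = zeros(numel(locations), 1);
for k = 1:numel(locations)
    times = zeros(nRepeats, 1);
    for n = 1:nRepeats
        [data, times(n)] = timeRead(locations{k});
    end
    meanReadTime(k) = mean(times);
    fprintf('%s: mean read time %.3f sec\n', locations{k}, meanReadTime(k));
end
delete(tmpFile);

function [data, readTime] = timeRead(location)
    % Time datastore creation plus one read, as in the question's myProcess
    t = tic;
    ds = fileDatastore(location, 'ReadFcn', @readShorts);
    data = read(ds);
    readTime = toc(t);
end

function data = readShorts(fileName)
    % Read the whole file as 16-bit signed integers (returned as doubles)
    fid = fopen(fileName);
    if fid == -1
        error('readShorts:openFailed', 'Could not open %s', fileName);
    end
    data = fread(fid, inf, 'short');
    fclose(fid);
end
```

Running the same loop inside the cloud pool (for example from a parfor body) rather than on the client is what would show the worker-to-S3 transfer time.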

2 Comments
Mukesh Dangi on 29 Jun 2018

Hi Jhon, I see that you are good at MATLAB programming, and I'm new to it. I would really appreciate your help with the same scenario. I'm a newbie to MATLAB; I did some research but was not able to find any code for accessing files from an S3 bucket using MATLAB. My MATLAB code is a Lambda function on AWS.

Once I complete an upload to S3, I want to trigger this MATLAB code and analyse the files.

I know there are many geniuses out there, so please suggest something. I'm using your code, but I'm getting this error:

Error in S3Read (line 3) parfor i=1:100

Caused by: Undefined function or variable 'spectralFilePath'.

Jhon Wine on 5 Jul 2018

Hi, thank you for the comment. It was a typo. Instead of 'spectralFilePath' you should write 'fp'. I have corrected my code above.
