fastest way to sum the file sizes by owner in a directory


I'm using the command below (via an alias) to print the sum of all file sizes by owner in a directory:



ls -l $dir | awk 'NF>3 { file[$3]+=$5 } \
END { for (i in file) { ss=file[i]; \
if (ss >= 1024*1024*1024) { size=ss/1024/1024/1024; unit="G" } else \
if (ss >= 1024*1024) { size=ss/1024/1024; unit="M" } else { size=ss/1024; unit="K" }; \
format="%.2f%s"; res=sprintf(format,size,unit); \
printf "%-8s %12d\t%s\n", res, file[i], i } }' | sort -k2 -nr


but it doesn't seem to be fast all the time.



Is it possible to get the same output in some other way, but faster?










linux shell perl

asked Mar 21 at 15:13 – stack0114106
  • why not parse ls – Barmar, Mar 21 at 15:18
  • You don't need to escape newlines inside a string. – Barmar, Mar 21 at 15:19
  • check superuser.com/a/597173 – UjinT34, Mar 21 at 15:31
  • When it's slow, how fast is ls -l $dir alone? On some file systems, listing large directories is very, very slow. – Aaron Digulla, Mar 21 at 16:29
  • I have around 308,530 files under one such directory.. – stack0114106, Mar 21 at 18:14





6 Answers

Get a listing from Perl (the question is tagged perl), add up sizes, and sort the result by owner:



perl -wE'
    chdir (shift // ".");
    for (glob ".* *") {
        next if not -f;
        ($owner_id, $size) = (stat)[4,7]
            or do { warn "Trouble stat for: $_"; next };
        $rept{$owner_id} += $size;
    }
    say (getpwuid($_)//$_, " => $rept{$_} bytes") for sort keys %rept;
'


I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to glob-ed (even though glob was faster in a related problem).
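For reference, here is a minimal sketch of such a readdir-based variant (mine, not from the answer; same flat, non-recursive layout and regular-files-only logic as the glob version above):

perl -wE'
    my $dir = shift // ".";
    opendir my $dh, $dir or die "Cannot opendir $dir: $!";
    my %rept;
    while (my $e = readdir $dh) {
        my $path = "$dir/$e";
        next unless -f $path;                    # regular files only (also skips . and ..)
        my ($owner_id, $size) = (stat _)[4,7];   # _ reuses the stat buffer from -f
        $rept{$owner_id} += $size;
    }
    say getpwuid($_) // $_, " => $rept{$_} bytes" for sort keys %rept;
'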



I expect good runtimes in comparison with ls, which slows down dramatically as the file list in a single directory gets long. This is mostly due to the filesystem, so Perl will be affected as well, but as far as I recall it handles it far better.



However, I've only seen such dramatic slowdowns in listings once a directory reaches around a hundred thousand entries, not a few thousand, so I am not sure why it runs slowly on your system.



If this needs to be recursive, use File::Find. For example:



perl -MFile::Find -wE'
    $dir = shift // ".";
    find( sub {
        return if /^\.\.?$/;
        ($owner_id, $size) = (stat)[4,7]
            or do { warn "Trouble stat for: $_"; return };
        $rept{$owner_id} += $size;
    }, $dir );
    say (getpwuid($_)//$_, "$_ => $rept{$_} bytes") for keys %rept;
'


This scans a directory with 2.4 Gb, with mostly small files distributed over a hierarchy of subdirectories, in a little over 2 seconds. The du -sh took around 5 seconds (the first time round).






– zdim, answered Mar 21 at 17:31 (edited Mar 22 at 9:07)
  • @stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

    – zdim
    Mar 21 at 18:41











  • @stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

    – zdim
    Mar 21 at 18:43












  • I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

    – stack0114106
    Mar 21 at 18:47












  • @stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

    – zdim
    Mar 21 at 18:51












  • @stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

    – zdim
    Mar 21 at 18:54


















4














Another perl one, that displays total sizes sorted by user:



#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d) {
    my $filename = File::Spec->catfile($dir, $file);
    my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
    $users{$uid} += $size if S_ISREG($mode);
}
closedir $d;

my @sizes = sort { $a->[0] cmp $b->[0] }
            map { [ getpwuid($_) // $_, $users{$_} ] } keys %users;
local $, = "\t";
say @$_ for @sizes;
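A usage sketch (the script name here is hypothetical; it prints one "user<TAB>total-bytes" line per owner):

$ perl sum_by_owner.pl /path/to/dir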





– Shawn, answered Mar 21 at 16:27 (edited Mar 21 at 16:35)
  • what does if S_ISREG($mode) do ?.. – stack0114106, Mar 21 at 18:16
  • yes your solution works.. thank you – stack0114106, Mar 21 at 18:50
  • @stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking. – Shawn, Mar 21 at 21:09
Parsing output from ls - bad idea.



How about using find instead?



  • start in directory $dir
    • limit to that directory level (-maxdepth 1)
    • limit to files (-type f)
    • print a line with user name and file size in bytes (-printf "%u %s\n")
  • run the results through a perl filter
    • split each line (-a)
    • add to a hash under key (field 0) the size (field 1)
    • at the end (END { ... }) print out the hash contents, sorted by key, i.e. user name

$ find $dir -maxdepth 1 -type f -printf "%u %s\n" |
    perl -ane '$s{$F[0]} += $F[1]; END { print "$_ $s{$_}\n" foreach (sort keys %s); }'
stefanb 263305714



A solution using Perl:



#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV) {
    opendir(my $dh, $dir);

    # files in this directory
    while (my $entry = readdir($dh)) {
        my $file = File::Spec->catfile($dir, $entry);

        # only files
        if (-f $file) {
            my ($uid, $size) = (stat($file))[4, 7];
            $users{$uid} += $size;
        }
    }

    closedir($dh);
}

print "$_ $users{$_}\n" foreach (sort keys %users);

exit 0;


Test run:



$ perl dummy.pl .
1000 263618544


Interesting difference. The Perl solution discovers 3 more files in my test directory than the find solution. I have to ponder why that is...
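A diagnostic sketch (mine, not part of the answer) to pin down such a discrepancy: dump both file lists and diff them. One common cause is that Perl's -f follows symlinks, so a symlink pointing at a regular file passes -f but is not matched by find's -type f:

find "$dir" -maxdepth 1 -type f -printf "%f\n" | sort > files.find
perl -E'my $d = shift; opendir my $dh, $d or die $!;
        say for grep { -f "$d/$_" } readdir $dh' "$dir" | sort > files.perl
diff files.find files.perl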






– Stefan Becker, answered Mar 21 at 16:08 (edited Mar 21 at 16:35)
  • It should print for all owners..not just the current user.. the files are owned by diff users – stack0114106, Mar 21 at 16:14
  • Updated accordingly – Stefan Becker, Mar 21 at 16:26
  • your solution works.. thank you.. – stack0114106, Mar 21 at 18:51
Not sure why question is tagged perl when awk is being used.



Here's a simple perl version:



#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dir\n");

map {
    if ( ! m/^[.][.]?$/o ) {
        ($s,$u) = (stat)[7,4];
        $h{$u} += $s;
    }
} glob ".* *";

map {
    $s = $h{$_};
    $u = !( $s >>10)      ? ""
       : !(($s>>=10)>>10) ? "k"
       : !(($s>>=10)>>10) ? "M"
       : !(($s>>=10)>>10) ? "G"
       :  ($s>>=10)       ? "T"
       :                    undef
    ;
    printf "%-8s %12d\t%s\n", $s.$u, $h{$_}, getpwuid($_)//$_;
} keys %h;




  • glob gets our file list


  • m// discards . and ..


  • stat the size and uid

  • accumulate sizes in %h

  • compute the unit by bitshifting (>>10 is integer divide by 1024)

  • map uid to username (// provides fallback)

  • print results (unsorted)


  • NOTE: unlike some other answers, this code doesn't recurse into subdirectories

To exclude symlinks, subdirectories, etc, change the if to appropriate -X tests. (eg. (-f $_), (!-d $_ and !-l $_), etc). See perl docs on the _ filehandle optimisation for caching stat results.
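For instance, a hedged sketch of that change (mine, not the answer's): keep only regular files and skip symlinks to them, reusing the cached stat buffer via the _ filehandle; the dot-dir regex is no longer needed because -f already rejects directories.

map {
    if ( -f $_ and !-l $_ ) {        # a regular file, and not a symlink to one
        ($s,$u) = (stat _)[7,4];     # _ reuses the buffer from the last file test
        $h{$u} += $s;
    }
} glob ".* *";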






– jhnc, answered Mar 21 at 16:19 (edited Mar 21 at 20:08)
  • I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o? – Aaron Digulla, Mar 21 at 16:27
  • yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo. – jhnc, Mar 21 at 16:29
  • Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl. – Aaron Digulla, Mar 21 at 16:32
  • @jhnc.. your solution also works..thank you – stack0114106, Mar 21 at 18:49
Did I see some awk in the OP? Here is one in GNU awk using the filefuncs extension:



$ cat bar.awk
@load "filefuncs"
BEGIN {
    FS=":"                                # passwd field sep
    passwd="/etc/passwd"                  # get usernames from passwd
    while ((getline < passwd)>0)
        users[$3]=$1
    close(passwd)                         # close passwd

    if(path=="")                          # set path with -v path=...
        path="."                          # default path is cwd
    pathlist[1]=path                      # path from the command line
                                          # you could have several paths
    fts(pathlist,FTS_PHYSICAL,filedata)   # dont mind links (vs. FTS_LOGICAL)
    for(p in filedata)                    # p for paths
        for(f in filedata[p])             # f for files
            if(filedata[p][f]["stat"]["type"]=="file")   # mind files only
                size[filedata[p][f]["stat"]["uid"]]+=filedata[p][f]["stat"]["size"]
    for(i in size)
        print (users[i]?users[i]:i),size[i]   # print username if found else uid
    exit
}


Sample outputs:



$ ls -l
total 3623
drwxr-xr-x 2 james james 3690496 Mar 21 21:32 100kfiles/
-rw-r--r-- 1 root root 4 Mar 21 18:52 bar
-rw-r--r-- 1 james james 424 Mar 21 21:33 bar.awk
-rw-r--r-- 1 james james 546 Mar 21 21:19 bar.awk~
-rw-r--r-- 1 james james 315 Mar 21 19:14 foo.awk
-rw-r--r-- 1 james james 125 Mar 21 18:53 foo.awk~
$ awk -v path=. -f bar.awk
root 4
james 1410


Another:



$ time awk -v path=100kfiles -f bar.awk
root 4
james 342439926

real 0m1.289s
user 0m0.852s
sys 0m0.440s


Yet another test with a million empty files:



$ time awk -v path=../million_files -f bar.awk

real 0m5.057s
user 0m4.000s
sys 0m1.056s
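Note (see the comment thread below): @load needs a gawk built with the loadable-extension interface, which as far as I recall arrived with gawk 4.x -- the gawk 3.1.7 shipped on RHEL 6 rejects it. A quick sanity check, assuming gawk is on the PATH:

$ gawk --version | head -n1
$ gawk '@load "filefuncs"
        BEGIN { print "filefuncs extension loads OK" }'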





– James Brown
  • looks like my awk doesn't have filefuncs awk: foo.awk:1: ^ invalid char '@' in expression – stack0114106, Mar 21 at 18:04
  • TIme to upgrade to a modern version of GNU awk. – James Brown, Mar 21 at 18:17
  • this is in Enterprise Linux - RHEL 6.10.. I see gawk pointing to /bin/gawk and the version is GNU Awk 3.1.7.. does it support @loadfiles?.. or is there any other location that would have another awk??.. – stack0114106, Mar 21 at 18:24
  • A wild guess that extensions came in GNU awk 4. But I saw you mentioned 300k files, this solution can't handle that many. – James Brown, Mar 21 at 18:29
  • ok.. anyway good to know loadfiles...I did run this in my cygwin and it works..so ++ – stack0114106, Mar 21 at 18:31
Using datamash (and Stefan Becker's find code):



find $dir -maxdepth 1 -type f -printf "%u\t%s\n" | datamash -sg 1 sum 2
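For readers who don't know datamash: -s sorts the input first, -g 1 groups by field 1 (the user name from %u), and sum 2 totals field 2 (the size from %s). A quick check that the tool is installed (relevant to the comment below about RHEL 6.1):

$ command -v datamash && datamash --version | head -n1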





– agc

  • @agc..the answer seems to be simple.. is datamash available in RHEL 6.1? – stack0114106, Mar 22 at 11:38
  • @stack0114106, Not sure -- RPM files exist, but whether those work in RHEL 6.1 is unclear without a 6.1 box to test on. – agc, Mar 22 at 11:58











6 Answers
6






active

oldest

votes








6 Answers
6






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














Get a listing from Perl (tagged), add up sizes and sort it out by owner



perl -wE'
chdir (shift // ".");
for (glob ".* *")
next if not -f;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; next ;
$rept$owner_id += $size

say (getpwuid($_)//$_, " => $rept$_ bytes") for sort keys %rept
'


I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to glob-ed (even though glob was faster in a related problem).



I expect good runtimes in comparison with ls, which slows down dramatically as a file list in a single directory gets long. This is mostly due to the system so Perl will be affected as well but as far as I recall it handles it far better.



However, I've seen such dramatic slowdown in getting listings as an issue once entries get to around a hundred thousand, not a few thousand, so I am not sure why it runs slow on your system.



If this need be recursive then use File::Find. For example



perl -MFile::Find -wE'
$dir = shift // ".";
find( sub
return if /^..?$/;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; return ;
$rept$owner_id += $size
, $dir );
say (getpwuid($_)//$_, "$_ => $rept$_ bytes") for keys %rept
'


This scans a directory with 2.4 Gb, with mostly small files distributed over a hierarchy of subdirectories, in a little over 2 seconds. The du -sh took around 5 seconds (the first time round).






share|improve this answer

























  • @stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

    – zdim
    Mar 21 at 18:41











  • @stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

    – zdim
    Mar 21 at 18:43












  • I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

    – stack0114106
    Mar 21 at 18:47












  • @stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

    – zdim
    Mar 21 at 18:51












  • @stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

    – zdim
    Mar 21 at 18:54















2














Get a listing from Perl (tagged), add up sizes and sort it out by owner



perl -wE'
chdir (shift // ".");
for (glob ".* *")
next if not -f;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; next ;
$rept$owner_id += $size

say (getpwuid($_)//$_, " => $rept$_ bytes") for sort keys %rept
'


I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to glob-ed (even though glob was faster in a related problem).



I expect good runtimes in comparison with ls, which slows down dramatically as a file list in a single directory gets long. This is mostly due to the system so Perl will be affected as well but as far as I recall it handles it far better.



However, I've seen such dramatic slowdown in getting listings as an issue once entries get to around a hundred thousand, not a few thousand, so I am not sure why it runs slow on your system.



If this need be recursive then use File::Find. For example



perl -MFile::Find -wE'
$dir = shift // ".";
find( sub
return if /^..?$/;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; return ;
$rept$owner_id += $size
, $dir );
say (getpwuid($_)//$_, "$_ => $rept$_ bytes") for keys %rept
'


This scans a directory with 2.4 Gb, with mostly small files distributed over a hierarchy of subdirectories, in a little over 2 seconds. The du -sh took around 5 seconds (the first time round).






share|improve this answer

























  • @stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

    – zdim
    Mar 21 at 18:41











  • @stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

    – zdim
    Mar 21 at 18:43












  • I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

    – stack0114106
    Mar 21 at 18:47












  • @stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

    – zdim
    Mar 21 at 18:51












  • @stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

    – zdim
    Mar 21 at 18:54













2












2








2







Get a listing from Perl (tagged), add up sizes and sort it out by owner



perl -wE'
chdir (shift // ".");
for (glob ".* *")
next if not -f;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; next ;
$rept$owner_id += $size

say (getpwuid($_)//$_, " => $rept$_ bytes") for sort keys %rept
'


I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to glob-ed (even though glob was faster in a related problem).



I expect good runtimes in comparison with ls, which slows down dramatically as a file list in a single directory gets long. This is mostly due to the system so Perl will be affected as well but as far as I recall it handles it far better.



However, I've seen such dramatic slowdown in getting listings as an issue once entries get to around a hundred thousand, not a few thousand, so I am not sure why it runs slow on your system.



If this need be recursive then use File::Find. For example



perl -MFile::Find -wE'
$dir = shift // ".";
find( sub
return if /^..?$/;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; return ;
$rept$owner_id += $size
, $dir );
say (getpwuid($_)//$_, "$_ => $rept$_ bytes") for keys %rept
'


This scans a directory with 2.4 Gb, with mostly small files distributed over a hierarchy of subdirectories, in a little over 2 seconds. The du -sh took around 5 seconds (the first time round).






share|improve this answer















Get a listing from Perl (tagged), add up sizes and sort it out by owner



perl -wE'
chdir (shift // ".");
for (glob ".* *")
next if not -f;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; next ;
$rept$owner_id += $size

say (getpwuid($_)//$_, " => $rept$_ bytes") for sort keys %rept
'


I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to glob-ed (even though glob was faster in a related problem).



I expect good runtimes in comparison with ls, which slows down dramatically as a file list in a single directory gets long. This is mostly due to the system so Perl will be affected as well but as far as I recall it handles it far better.



However, I've seen such dramatic slowdown in getting listings as an issue once entries get to around a hundred thousand, not a few thousand, so I am not sure why it runs slow on your system.



If this need be recursive then use File::Find. For example



perl -MFile::Find -wE'
$dir = shift // ".";
find( sub
return if /^..?$/;
($owner_id, $size) = (stat)[4,7]
or do warn "Trouble stat for: $_"; return ;
$rept$owner_id += $size
, $dir );
say (getpwuid($_)//$_, "$_ => $rept$_ bytes") for keys %rept
'


This scans a directory with 2.4 Gb, with mostly small files distributed over a hierarchy of subdirectories, in a little over 2 seconds. The du -sh took around 5 seconds (the first time round).







share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 22 at 9:07

























answered Mar 21 at 17:31









zdimzdim

34k32443




34k32443












  • @stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

    – zdim
    Mar 21 at 18:41











  • @stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

    – zdim
    Mar 21 at 18:43












  • I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

    – stack0114106
    Mar 21 at 18:47












  • @stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

    – zdim
    Mar 21 at 18:51












  • @stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

    – zdim
    Mar 21 at 18:54

















  • @stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

    – zdim
    Mar 21 at 18:41











  • @stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

    – zdim
    Mar 21 at 18:43












  • I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

    – stack0114106
    Mar 21 at 18:47












  • @stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

    – zdim
    Mar 21 at 18:51












  • @stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

    – zdim
    Mar 21 at 18:54
















@stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

– zdim
Mar 21 at 18:41





@stack0114106 "the first one runs and gives results" -- (1) is that where you see the error (in the first script), and (2) does the second one not run (or is that where you see the error)?

– zdim
Mar 21 at 18:41













@stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

– zdim
Mar 21 at 18:43






@stack0114106 That would mean that $owner_id didn't get assigned (unless I have some typo from copy-pasting?) -- I can't readily imagine what kind of a beasty would not return owner it from stat ... ? Can you debug -- add a print like say "no owner id for $_" if not $owner_id; or some such. I can't as I don't have a problem

– zdim
Mar 21 at 18:43














I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

– stack0114106
Mar 21 at 18:47






I think if folks leave the organization, their id would be disabled, but the files would still be there.. and would that affect how getpwuid results??..

– stack0114106
Mar 21 at 18:47














@stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

– zdim
Mar 21 at 18:51






@stack0114106 They can handle it in many ways (when a user leaves an organization), but I think that there would have to be an owner id for each file on the system. Can you add the print from my previous comment to see what that is about?

– zdim
Mar 21 at 18:51














@stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

– zdim
Mar 21 at 18:54





@stack0114106 "the second one would take lot of time" -- for how large a hierarchy? How long does the awk parsed ls take (from the question) on that ? If that's just too big to try now can you compare on soemthing more reasonable? Perl File::Find should be fast, as much as one can expect recursive searching to go.

– zdim
Mar 21 at 18:54













4














Another perl one, that displays total sizes sorted by user:



#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d)
my $filename = File::Spec->catfile($dir, $file);
my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
$users$uid += $size if S_ISREG($mode);

closedir $d;

my @sizes = sort $a->[0] cmp $b->[0]
map [ getpwuid($_) // $_, $users$_ ] keys %users;
local $, = "t";
say @$_ for @sizes;





share|improve this answer

























  • what does if S_ISREG($mode) do ?..

    – stack0114106
    Mar 21 at 18:16











  • yes your solution works.. thank you

    – stack0114106
    Mar 21 at 18:50











  • @stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking.

    – Shawn
    Mar 21 at 21:09
















4














Another perl one, that displays total sizes sorted by user:



#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d)
my $filename = File::Spec->catfile($dir, $file);
my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
$users$uid += $size if S_ISREG($mode);

closedir $d;

my @sizes = sort $a->[0] cmp $b->[0]
map [ getpwuid($_) // $_, $users$_ ] keys %users;
local $, = "t";
say @$_ for @sizes;





share|improve this answer

























  • what does if S_ISREG($mode) do ?..

    – stack0114106
    Mar 21 at 18:16











  • yes your solution works.. thank you

    – stack0114106
    Mar 21 at 18:50











  • @stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking.

    – Shawn
    Mar 21 at 21:09














4












4








4







Another perl one, that displays total sizes sorted by user:



#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d)
my $filename = File::Spec->catfile($dir, $file);
my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
$users$uid += $size if S_ISREG($mode);

closedir $d;

my @sizes = sort $a->[0] cmp $b->[0]
map [ getpwuid($_) // $_, $users$_ ] keys %users;
local $, = "t";
say @$_ for @sizes;





share|improve this answer















Another perl one, that displays total sizes sorted by user:



#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d)
my $filename = File::Spec->catfile($dir, $file);
my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
$users$uid += $size if S_ISREG($mode);

closedir $d;

my @sizes = sort $a->[0] cmp $b->[0]
map [ getpwuid($_) // $_, $users$_ ] keys %users;
local $, = "t";
say @$_ for @sizes;






share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 21 at 16:35

























answered Mar 21 at 16:27









ShawnShawn

4,8572614




4,8572614












  • what does if S_ISREG($mode) do ?..

    – stack0114106
    Mar 21 at 18:16











  • yes your solution works.. thank you

    – stack0114106
    Mar 21 at 18:50











  • @stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking.

    – Shawn
    Mar 21 at 21:09


















  • what does if S_ISREG($mode) do ?..

    – stack0114106
    Mar 21 at 18:16











  • yes your solution works.. thank you

    – stack0114106
    Mar 21 at 18:50











  • @stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking.

    – Shawn
    Mar 21 at 21:09

















what does if S_ISREG($mode) do ?..

– stack0114106
Mar 21 at 18:16





what does if S_ISREG($mode) do ?..

– stack0114106
Mar 21 at 18:16













yes your solution works.. thank you

– stack0114106
Mar 21 at 18:50





yes your solution works.. thank you

– stack0114106
Mar 21 at 18:50













@stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking.

– Shawn
Mar 21 at 21:09






@stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the -f $file in another answer, just a different way of checking.

– Shawn
Mar 21 at 21:09












2














Parsing output from ls - bad idea.



How about using find instead?



  • start in directory $dir

    • limit to that directory level (-maxdepth 1)

    • limit to files (-type f)

    • print a line with user name and file size in bytes (-printf "%u %sn")


  • run the results through a perl filter

    • split each line (-a)

    • add to a hash under key (field 0) the size (field 1)

    • at the end (END ...) print out the hash contents, sorted by key, i.e. user name


$ find $dir -maxdepth 1 -type f -printf "%u %sn" | 
perl -ane '$s$F[0] += $F[1]; END print "$_ $s$_n" foreach (sort keys %s); '
stefanb 263305714



A solution using Perl:



#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV)
opendir(my $dh, $dir);

# files in this directory
while (my $entry = readdir($dh))
my $file = File::Spec->catfile($dir, $entry);

# only files
if (-f $file)
my($uid, $size) = (stat($file))[4, 7];
$users$uid += $size



closedir($dh);


print "$_ $users$_n" foreach (sort keys %users);

exit 0;


Test run:



$ perl dummy.pl .
1000 263618544


Interesting difference. The Perl solution discovers 3 more files in my test directory than the find solution. I have to ponder why that is...






share|improve this answer

























  • It should print for all owners..not just the current user.. the files are owned by diff users

    – stack0114106
    Mar 21 at 16:14











  • Updated accordingly

    – Stefan Becker
    Mar 21 at 16:26











  • your solution works.. thank you..

    – stack0114106
    Mar 21 at 18:51















2














Parsing output from ls - bad idea.



How about using find instead?



  • start in directory $dir

    • limit to that directory level (-maxdepth 1)

    • limit to files (-type f)

    • print a line with user name and file size in bytes (-printf "%u %sn")


  • run the results through a perl filter

    • split each line (-a)

    • add to a hash under key (field 0) the size (field 1)

    • at the end (END ...) print out the hash contents, sorted by key, i.e. user name


$ find $dir -maxdepth 1 -type f -printf "%u %sn" | 
perl -ane '$s$F[0] += $F[1]; END print "$_ $s$_n" foreach (sort keys %s); '
stefanb 263305714



A solution using Perl:



#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV)
opendir(my $dh, $dir);

# files in this directory
while (my $entry = readdir($dh))
my $file = File::Spec->catfile($dir, $entry);

# only files
if (-f $file)
my($uid, $size) = (stat($file))[4, 7];
$users$uid += $size



closedir($dh);


print "$_ $users$_n" foreach (sort keys %users);

exit 0;


Test run:



$ perl dummy.pl .
1000 263618544


Interesting difference. The Perl solution discovers 3 more files in my test directory than the find solution. I have to ponder why that is...






share|improve this answer

























  • It should print for all owners..not just the current user.. the files are owned by diff users

    – stack0114106
    Mar 21 at 16:14











  • Updated accordingly

    – Stefan Becker
    Mar 21 at 16:26











  • your solution works.. thank you..

    – stack0114106
    Mar 21 at 18:51













2












2








2







Parsing output from ls - bad idea.



How about using find instead?



  • start in directory $dir

    • limit to that directory level (-maxdepth 1)

    • limit to files (-type f)

    • print a line with user name and file size in bytes (-printf "%u %sn")


  • run the results through a perl filter

    • split each line (-a)

    • add to a hash under key (field 0) the size (field 1)

    • at the end (END ...) print out the hash contents, sorted by key, i.e. user name


$ find $dir -maxdepth 1 -type f -printf "%u %sn" | 
perl -ane '$s$F[0] += $F[1]; END print "$_ $s$_n" foreach (sort keys %s); '
stefanb 263305714



A solution using Perl:



#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV)
opendir(my $dh, $dir);

# files in this directory
while (my $entry = readdir($dh))
my $file = File::Spec->catfile($dir, $entry);

# only files
if (-f $file)
my($uid, $size) = (stat($file))[4, 7];
$users$uid += $size



closedir($dh);


print "$_ $users$_n" foreach (sort keys %users);

exit 0;


Test run:



$ perl dummy.pl .
1000 263618544


Interesting difference. The Perl solution discovers 3 more files in my test directory than the find solution. I have to ponder why that is...






share|improve this answer















Parsing output from ls - bad idea.



How about using find instead?



  • start in directory $dir

    • limit to that directory level (-maxdepth 1)

    • limit to files (-type f)

    • print a line with user name and file size in bytes (-printf "%u %sn")


  • run the results through a perl filter

    • split each line (-a)

    • add to a hash under key (field 0) the size (field 1)

    • at the end (END ...) print out the hash contents, sorted by key, i.e. user name


$ find $dir -maxdepth 1 -type f -printf "%u %sn" | 
perl -ane '$s$F[0] += $F[1]; END print "$_ $s$_n" foreach (sort keys %s); '
stefanb 263305714



A solution using Perl:



#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV)
opendir(my $dh, $dir);

# files in this directory
while (my $entry = readdir($dh))
my $file = File::Spec->catfile($dir, $entry);

# only files
if (-f $file)
my($uid, $size) = (stat($file))[4, 7];
$users$uid += $size



closedir($dh);


print "$_ $users$_n" foreach (sort keys %users);

exit 0;


Test run:



$ perl dummy.pl .
1000 263618544


Interesting difference. The Perl solution discovers 3 more files in my test directory than the find solution. I have to ponder why that is...







share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 21 at 16:35

























answered Mar 21 at 16:08









Stefan BeckerStefan Becker

4,31521125




4,31521125












  • It should print for all owners..not just the current user.. the files are owned by diff users

    – stack0114106
    Mar 21 at 16:14











  • Updated accordingly

    – Stefan Becker
    Mar 21 at 16:26











  • your solution works.. thank you..

    – stack0114106
    Mar 21 at 18:51

















  • It should print for all owners..not just the current user.. the files are owned by diff users

    – stack0114106
    Mar 21 at 16:14











  • Updated accordingly

    – Stefan Becker
    Mar 21 at 16:26











  • your solution works.. thank you..

    – stack0114106
    Mar 21 at 18:51
















It should print for all owners..not just the current user.. the files are owned by diff users

– stack0114106
Mar 21 at 16:14





It should print for all owners..not just the current user.. the files are owned by diff users

– stack0114106
Mar 21 at 16:14













Updated accordingly

– Stefan Becker
Mar 21 at 16:26





Updated accordingly

– Stefan Becker
Mar 21 at 16:26













your solution works.. thank you..

– stack0114106
Mar 21 at 18:51





your solution works.. thank you..

– stack0114106
Mar 21 at 18:51











1














Not sure why question is tagged perl when awk is being used.



Here's a simple perl version:



#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dirn");

map
if ( ! m/^[.][.]?$/o )
($s,$u) = (stat)[7,4];
$h$u += $s;

glob ".* *";

map
$s = $h$_;
$u = !( $s >>10) ? ""
: !(($s>>=10)>>10) ? "k"
: !(($s>>=10)>>10) ? "M"
: !(($s>>=10)>>10) ? "G"
: ($s>>=10) ? "T"
: undef
;
printf "%-8s %12dt%sn", $s.$u, $h$_, getpwuid($_)//$_;
keys %h;




  • glob gets our file list


  • m// discards . and ..


  • stat the size and uid

  • accumulate sizes in %h

  • compute the unit by bitshifting (>>10 is integer divide by 1024)

  • map uid to username (// provides fallback)

  • print results (unsorted)


  • NOTE: unlike some other answers, this code doesn't recurse into subdirectories

To exclude symlinks, subdirectories, etc, change the if to appropriate -X tests. (eg. (-f $_), (!-d $_ and !-l $_), etc). See perl docs on the _ filehandle optimisation for caching stat results.






share|improve this answer

























  • I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o?

    – Aaron Digulla
    Mar 21 at 16:27











  • yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo.

    – jhnc
    Mar 21 at 16:29







  • 1





    Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl.

    – Aaron Digulla
    Mar 21 at 16:32











  • @jhnc.. your solution also works..thank you

    – stack0114106
    Mar 21 at 18:49
















1














Not sure why question is tagged perl when awk is being used.



Here's a simple perl version:



#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dirn");

map
if ( ! m/^[.][.]?$/o )
($s,$u) = (stat)[7,4];
$h$u += $s;

glob ".* *";

map
$s = $h$_;
$u = !( $s >>10) ? ""
: !(($s>>=10)>>10) ? "k"
: !(($s>>=10)>>10) ? "M"
: !(($s>>=10)>>10) ? "G"
: ($s>>=10) ? "T"
: undef
;
printf "%-8s %12dt%sn", $s.$u, $h$_, getpwuid($_)//$_;
keys %h;




  • glob gets our file list


  • m// discards . and ..


  • stat the size and uid

  • accumulate sizes in %h

  • compute the unit by bitshifting (>>10 is integer divide by 1024)

  • map uid to username (// provides fallback)

  • print results (unsorted)


  • NOTE: unlike some other answers, this code doesn't recurse into subdirectories

To exclude symlinks, subdirectories, etc, change the if to appropriate -X tests. (eg. (-f $_), (!-d $_ and !-l $_), etc). See perl docs on the _ filehandle optimisation for caching stat results.






share|improve this answer

























  • I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o?

    – Aaron Digulla
    Mar 21 at 16:27











  • yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo.

    – jhnc
    Mar 21 at 16:29







  • 1





    Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl.

    – Aaron Digulla
    Mar 21 at 16:32











  • @jhnc.. your solution also works..thank you

    – stack0114106
    Mar 21 at 18:49














1












1








1







Not sure why question is tagged perl when awk is being used.



Here's a simple perl version:



#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dirn");

map
if ( ! m/^[.][.]?$/o )
($s,$u) = (stat)[7,4];
$h$u += $s;

glob ".* *";

map
$s = $h$_;
$u = !( $s >>10) ? ""
: !(($s>>=10)>>10) ? "k"
: !(($s>>=10)>>10) ? "M"
: !(($s>>=10)>>10) ? "G"
: ($s>>=10) ? "T"
: undef
;
printf "%-8s %12dt%sn", $s.$u, $h$_, getpwuid($_)//$_;
keys %h;




  • glob gets our file list


  • m// discards . and ..


  • stat the size and uid

  • accumulate sizes in %h

  • compute the unit by bitshifting (>>10 is integer divide by 1024)

  • map uid to username (// provides fallback)

  • print results (unsorted)


  • NOTE: unlike some other answers, this code doesn't recurse into subdirectories

To exclude symlinks, subdirectories, etc, change the if to appropriate -X tests. (eg. (-f $_), (!-d $_ and !-l $_), etc). See perl docs on the _ filehandle optimisation for caching stat results.






share|improve this answer















Not sure why question is tagged perl when awk is being used.



Here's a simple perl version:



#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dirn");

map
if ( ! m/^[.][.]?$/o )
($s,$u) = (stat)[7,4];
$h$u += $s;

glob ".* *";

map
$s = $h$_;
$u = !( $s >>10) ? ""
: !(($s>>=10)>>10) ? "k"
: !(($s>>=10)>>10) ? "M"
: !(($s>>=10)>>10) ? "G"
: ($s>>=10) ? "T"
: undef
;
printf "%-8s %12dt%sn", $s.$u, $h$_, getpwuid($_)//$_;
keys %h;




  • glob gets our file list


  • m// discards . and ..


  • stat the size and uid

  • accumulate sizes in %h

  • compute the unit by bitshifting (>>10 is integer divide by 1024)

  • map uid to username (// provides fallback)

  • print results (unsorted)


  • NOTE: unlike some other answers, this code doesn't recurse into subdirectories

To exclude symlinks, subdirectories, etc, change the if to appropriate -X tests. (eg. (-f $_), (!-d $_ and !-l $_), etc). See perl docs on the _ filehandle optimisation for caching stat results.







share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 21 at 20:08

























answered Mar 21 at 16:19









jhncjhnc

2,559214




2,559214












  • I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o?

    – Aaron Digulla
    Mar 21 at 16:27











  • yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo.

    – jhnc
    Mar 21 at 16:29







  • 1





    Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl.

    – Aaron Digulla
    Mar 21 at 16:32











  • @jhnc.. your solution also works..thank you

    – stack0114106
    Mar 21 at 18:49


















  • I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o?

    – Aaron Digulla
    Mar 21 at 16:27











  • yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo.

    – jhnc
    Mar 21 at 16:29







  • 1





    Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl.

    – Aaron Digulla
    Mar 21 at 16:32











  • @jhnc.. your solution also works..thank you

    – stack0114106
    Mar 21 at 18:49

















I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o?

– Aaron Digulla
Mar 21 at 16:27





I don't see m/// in the script. My guess is you're referring to !/^[.][.]?$/o?

– Aaron Digulla
Mar 21 at 16:27













yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo.

– jhnc
Mar 21 at 16:29






yes. // is shortcut for m//. m is only needed if you want to use different delimiter (eg m[], m<>, etc). Three slashes was typo.

– jhnc
Mar 21 at 16:29





1




1





Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl.

– Aaron Digulla
Mar 21 at 16:32





Please either use m// in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl.

– Aaron Digulla
Mar 21 at 16:32













@jhnc.. your solution also works..thank you

– stack0114106
Mar 21 at 18:49






@jhnc.. your solution also works..thank you

– stack0114106
Mar 21 at 18:49












1














Did I see some awk in the OP? Here is one in GNU awk, using the filefuncs extension:



$ cat bar.awk
@load "filefuncs"
BEGIN {
    FS=":"                                      # passwd field sep
    passwd="/etc/passwd"                        # get usernames from passwd
    while ((getline < passwd)>0)
        users[$3]=$1
    close(passwd)                               # close passwd

    if(path=="")                                # set path with -v path=...
        path="."                                # default path is cwd
    pathlist[1]=path                            # path from the command line
                                                # you could have several paths
    fts(pathlist,FTS_PHYSICAL,filedata)         # don't follow symlinks (vs. FTS_LOGICAL)
    for(p in filedata)                          # p for paths
        for(f in filedata[p])                   # f for files
            if(filedata[p][f]["stat"]["type"]=="file")      # mind files only
                size[filedata[p][f]["stat"]["uid"]]+=filedata[p][f]["stat"]["size"]
    for(i in size)
        print (users[i]?users[i]:i),size[i]     # print username if found, else uid
    exit
}



Sample outputs:



$ ls -l
total 3623
drwxr-xr-x 2 james james 3690496 Mar 21 21:32 100kfiles/
-rw-r--r-- 1 root root 4 Mar 21 18:52 bar
-rw-r--r-- 1 james james 424 Mar 21 21:33 bar.awk
-rw-r--r-- 1 james james 546 Mar 21 21:19 bar.awk~
-rw-r--r-- 1 james james 315 Mar 21 19:14 foo.awk
-rw-r--r-- 1 james james 125 Mar 21 18:53 foo.awk~
$ awk -v path=. -f bar.awk
root 4
james 1410


Another run, this time against the 100kfiles directory:



$ time awk -v path=100kfiles -f bar.awk
root 4
james 342439926

real 0m1.289s
user 0m0.852s
sys 0m0.440s


Yet another test with a million empty files:



$ time awk -v path=../million_files -f bar.awk

real 0m5.057s
user 0m4.000s
sys 0m1.056s
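One practical caveat, raised in the comments below: @load and the matching -l/--load option need a gawk new enough to have the dynamic-extension interface (the 4.x line; the comments show 3.1.7 rejecting the @ directive outright). A quick sanity check, assuming the extension is installed under its usual name filefuncs:

$ gawk --version | head -n1
$ gawk -l filefuncs 'BEGIN { print "filefuncs extension loads" }'

If the second command errors out, fall back to one of the find-based answers.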





answered Mar 21 at 17:23









James Brown
  • looks like my awk doesn't have filefuncs awk: foo.awk:1: ^ invalid char '@' in expression

    – stack0114106
    Mar 21 at 18:04











  • Time to upgrade to a modern version of GNU awk.

    – James Brown
    Mar 21 at 18:17











  • this is in Enterprise Linux - RHEL 6.10.. I see gawk pointing to /bin/gawk and the version is GNU Awk 3.1.7.. does it support @loadfiles?.. or is there any other location that would have another awk??..

    – stack0114106
    Mar 21 at 18:24











  • A wild guess that extensions came in GNU awk 4. But I saw you mentioned 300k files, this solution can't handle that many.

    – James Brown
    Mar 21 at 18:29






  • ok.. anyway good to know loadfiles...I did run this in my cygwin and it works..so ++

    – stack0114106
    Mar 21 at 18:31

















Using datamash (and Stefan Becker's find code):



find $dir -maxdepth 1 -type f -printf "%u\t%s\n" | datamash -sg 1 sum 2
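If datamash itself can't be installed (the comments below raise that question for RHEL 6.1), the grouping step is easy to approximate with plain awk; a rough equivalent of the pipeline above, under the same assumptions about $dir:

find $dir -maxdepth 1 -type f -printf "%u\t%s\n" |
    awk -F'\t' '{ sum[$1] += $2 } END { for (u in sum) print u, sum[u] }' |
    sort -k2 -nr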





answered Mar 22 at 11:36









agc
  • @agc..the answer seems to be simple.. is datamash available in RHEL 6.1?

    – stack0114106
    Mar 22 at 11:38











  • @stack0114106, Not sure -- RPM files exist, but whether those work in RHEL 6.1 is unclear without a 6.1 box to test on.

    – agc
    Mar 22 at 11:58
















