1. How to get list of output of find command of Linux

I want to get a list of the full pathnames of all files in a directory and its subdirectories, just as one gets in the output of the Linux 'find' command.

I could not find a way to do this in the manual. The function dir() apparently has no recursive option for checking subdirectories as well.

I can display the list on the console with system_exec("find"), but how can I get it into a sequence for further use in a Euphoria program?

Thanks for your help.


2. Re: How to get list of output of find command of Linux

  • try walk_dir(); this is recursive
include std/filesys.e  
 
walk_dir(sequence path_name, object your_function, 
         integer scan_subdirs = types:FALSE, 
         object dir_source = types:NO_ROUTINE_ID) 
  • try writing to a file

system( "find --help > foo.txt" ) 
 
--> load "foo.txt" 
--> parse and use the output 

_tom


3. Re: How to get list of output of find command of Linux

_tom said...
  • try writing to a file

system( "find --help > foo.txt" ) 
 
--> load "foo.txt" 
--> parse and use the output 
_tom

This is a very good and simple method. However, the code should not have:

--help

It should simply be:

system("find > foo.txt") 
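
To get from the file to a sequence, something like the following should work. This is only a minimal sketch: it assumes Euphoria 4 with read_lines() from std/io.e, a find command on the PATH, and that foo.txt can be written in the current directory. The "-type f" option and the mode argument 2 are my additions, not part of the posts above.

```euphoria
include std/io.e   -- read_lines()

-- Run find and redirect its output to a file.
-- "-type f" limits the listing to regular files (an assumption about
-- what is wanted); mode 2 makes system() return without waiting for
-- a key press or restoring the screen.
system( "find . -type f > foo.txt", 2 )

-- read_lines() returns -1 if the file cannot be opened,
-- otherwise a sequence with one string per line.
object lines = read_lines( "foo.txt" )

if atom( lines ) then
    puts( 2, "could not read foo.txt\n" )
else
    for i = 1 to length( lines ) do
        printf( 1, "%s\n", {lines[i]} )
    end for
end if
```

The paths come back relative to the starting directory (./...), and foo.txt itself may appear in the list because it is created before find runs.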

4. Re: How to get list of output of find command of Linux

I would not recommend using system() to make an external call to find. That method is not portable and should be considered bad practice.

Here is the method I use for recursively finding files in a directory without using walk_dir() and without making recursive calls to the same function.

Essentially, we just keep a running queue of the directories we'd like to scan while adding the files found to a list for output.

include std/filesys.e 
include std/wildcard.e 
 
public function get_files( sequence path = current_dir(), sequence name = "*", integer maxdepth = 0 ) 
-- 
-- path     : directory path to scan 
-- name     : wildcard pattern to look for 
-- maxdepth : maximum folder depth to scan within path 
-- 
    integer depth 
    sequence files, queue, item, full_path 
    object items 
 
    depth = 1 
    files = {} 
    path = canonical_path( path ) 
 
    -- push initial path into queue 
    queue = { {path,depth} } 
 
    while length( queue ) do 
 
        -- pop an item off the queue 
        item = queue[1] 
        queue = queue[2..$] 
 
        -- get the queue item values 
        path = item[1] 
        depth = item[2] 
 
        -- get the directory items 
        items = dir( path ) 
 
        if atom( items ) then 
            -- error 
            continue 
        end if 
 
        for i = 1 to length( items ) do 
            item = items[i] 
 
            if find( item[D_NAME], {".",".."} ) then 
                -- skip these entries 
                continue 
            end if 
 
            full_path = join_path({ path, item[D_NAME] }) 
 
            if find( 'd', item[D_ATTRIBUTES] ) then 
                -- is a directory 
 
                if maxdepth = 0 or depth < maxdepth then 
                    -- queue this directory 
                    queue = append( queue, {full_path,depth+1} ) 
                end if 
 
            else 
                -- is a file 
 
                if wildcard:is_match( name, item[D_NAME] ) then 
                    -- add this to the output 
                    files = append( files, full_path ) 
                end if 
 
            end if 
 
        end for 
 
    end while 
 
    return files 
end function 

-Greg


5. Re: How to get list of output of find command of Linux

Thanks for the very clear code.

Why are you against recursion?


6. Re: How to get list of output of find command of Linux

rneu said...

Thanks for the very clear code.

Why are you against recursion?

I'm not against recursion per se, but it's not really a good approach for scanning through a file system. With any programming language, but especially interpreted ones like Euphoria, deeply recursive functions create additional overhead, and file systems tend to be deeply recursive, so there's a potential for lots of overhead.

Alternatively, if it applies to your project, what I'd recommend is using a queued approach like I've shown, but combine that with a visitor function like walk_dir() uses, so that you don't collect all the file names you've found in memory. I can provide an example on how to do that if you need.

In the end, the "best" method for this is probably an iterator that holds its position in the queue and only returns the values as they're requested. Other languages have this feature, like Python, and it's a pretty neat, albeit complicated, way to run through a list of items very efficiently.

-Greg


7. Re: How to get list of output of find command of Linux

I quickly tried that code on Phix, and apart from the parameter default of current_dir() not being supported on Phix (really must fix that some day), and wildcard_match(), it works fine.
One of the reasons for doing so was that I thought of a slightly better way to manage the queue (some comments stripped for clarity):

function get_files(sequence path="", sequence wildname="*", integer maxdepth=0)  
--  
-- path     : directory path to scan  
-- wildname : wildcard pattern to look for  
-- maxdepth : maximum folder depth to scan within path  
--  
    path = iff(path=""?current_dir():get_proper_path(path)) 
    integer depth = 1 
    sequence files = {}, 
             queue = {path,depth,{}} 
    while length(queue) do  
        {path,depth,queue} = queue 
        object d = dir(path)  
        if sequence(d) then  
            for i=1 to length(d) do  
                string {name,attr} = d[i]  
                if not find(name,{".",".."}) then  
                    string full_path = join_path({path,name})  
                    if find('d',attr) then  
                        if maxdepth=0 or depth<maxdepth then  
                            queue = {full_path,depth+1,queue} 
                        end if  
                    elsif wildcard_match(wildname, name) then  
                        files = append(files, full_path)  
                    end if  
                end if  
            end for  
            end if  
    end while  
    return files  
end function  

Not at all sure that will work on OE, and I can accept that a "flat" queue might be slightly more efficient, but I was aiming for elegance.

Pete

PS: On recursion vs iteration, Greg may have slightly exaggerated, but is essentially correct. An iterative solution will create just one queue and plant each element in the result sequence files just once, and will therefore always be slightly more efficient than a recursive solution, but any difference in performance is probably insignificant. In the end, whichever you find easier (recursion vs iteration), to understand and maintain, is the one to use.
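
For comparison, a recursive version might look roughly like this. It is a minimal sketch rather than tested production code: the name get_files_recursive is made up for illustration, it assumes Euphoria 4 with std/filesys.e, and it leaves out the wildcard and maxdepth handling of the versions above.

```euphoria
include std/filesys.e   -- dir(), join_path(), D_NAME, D_ATTRIBUTES

-- Hypothetical recursive counterpart to get_files().
function get_files_recursive( sequence path )
    sequence files = {}
    object items = dir( path )

    if atom( items ) then
        -- dir() failed (unreadable or missing), nothing to add
        return files
    end if

    for i = 1 to length( items ) do
        sequence name = items[i][D_NAME]

        if find( name, {".",".."} ) then
            continue
        end if

        sequence full_path = join_path({ path, name })

        if find( 'd', items[i][D_ATTRIBUTES] ) then
            -- each level builds its own list, which is then
            -- copied onto the end of the caller's list
            files &= get_files_recursive( full_path )
        else
            files = append( files, full_path )
        end if
    end for

    return files
end function
```

The concatenation on the way back up is where the extra work comes from: a path found several levels down is copied once per level, whereas the queued versions append it to the result exactly once.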


8. Re: How to get list of output of find command of Linux

ghaberek said...

In the end, the "best" method for this is probably an iterator that holds its position in the queue and only returns the values as they're requested. Other languages have this feature, like Python, and it's a pretty neat, albeit complicated, way to run through a list of items very efficiently.

Maybe something like this? http://rosettacode.org/wiki/Generator/Exponential#Phix (but with only one task)

Pete


9. Re: How to get list of output of find command of Linux

ghaberek said...

Alternatively, if it applies to your project, what I'd recommend is using a queued approach like I've shown, but combine that with a visitor function like walk_dir() uses, so that you don't collect all the file names you've found in memory. I can provide an example on how to do that if you need.

Example code for a walk_dir() approach would be really illustrative and helpful.


10. Re: How to get list of output of find command of Linux

rneu said...

Example code for a walk_dir() approach would be really illustrative and helpful.

It's pretty simple

sequence files = {} 
function look_at(sequence path, sequence d) 
    files = append(files,{join_path({path,d[D_NAME]}),d[D_SIZE]}) 
    return 0 -- keep going 
end function 
if walk_dir("C:\\Downloads\\Software", routine_id("look_at"), TRUE)!=0 then 
    ?"oops" 
end if 
for i=1 to length(files) do 
    printf(1, "%s: %d\n",files[i]) 
end for 

Output:

<snip> 
C:\Downloads\Software\SKU322505 inpa\inpa\InpaCANinstall.doc: 4561920 
C:\Downloads\Software\SKU322505 inpa\inpa\InpaCANinstall.pdf: 300349 
C:\Downloads\Software\SKU322505 inpa\inpa\autorun.inf: 60 
C:\Downloads\Software\SKU322505 inpa\inpa\cemtl.ico: 4286 
C:\Downloads\Software\tdm-gcc-5.1.0-3.exe: 36802006 
C:\Downloads\Software\tdm64-gcc-5.1.0-2.exe: 48071122 

For a fuller example, see http://openeuphoria.org/docs/std_filesys.html#_1282_walk_dir

Pete

PS: you'll probably also need include std/filesys.e at the top


11. Re: How to get list of output of find command of Linux

petelomax said...

I quickly tried that code on Phix, and apart from the parameter default of current_dir() not being supported on Phix (really must fix that some day), and wildcard_match(), it works fine.

One of the reasons for doing so was that I thought of a slightly better way to manage the queue (some comments stripped for clarity):

Not at all sure that will work on OE, and I can accept that a "flat" queue might be slightly more efficient, but I was aiming for elegance.

Interesting. You're basically creating a linked list instead of a flat queue. Neat!
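
To spell out the difference, here is a stripped-down sketch of the two shapes. It uses single strings instead of the {path,depth} entries from the real code, and the variable names are only for illustration.

```euphoria
-- Flat queue: append at the back, take from the front
-- (first in, first out).
sequence flat = {}
flat = append( flat, "a" )
flat = append( flat, "b" )
sequence first = flat[1]       -- "a"
flat = flat[2..$]              -- {"b"}

-- Nested pairs: each node is {value, rest} and {} marks the end.
-- New items go on the front, so the newest one comes off first.
sequence nested = {}
nested = { "a", nested }       -- {"a", {}}
nested = { "b", nested }       -- {"b", {"a", {}}}
sequence latest = nested[1]    -- "b"
nested = nested[2]             -- {"a", {}}
```

One side effect is that the nested form hands back the most recently added directory first, so the two functions should visit directories in a different order even though they find the same files.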

petelomax said...

PS: On recursion vs iteration, Greg may have slightly exaggerated, but is essentially correct. An iterative solution will create just one queue and plant each element in the result sequence files just once, and will therefore always be slightly more efficient than a recursive solution, but any difference in performance is probably insignificant. In the end, whichever you find easier (recursion vs iteration), to understand and maintain, is the one to use.

I suppose I like to hyperbolize things once in a while. Simply knowing that a particular method is introducing additional overhead makes it feel inefficient. But in most cases the difference is likely imperceptible.

petelomax said...

Maybe something like this? http://rosettacode.org/wiki/Generator/Exponential#Phix (but with only one task)

Sort of? I don't think you'd need to use a task for that though. I was thinking of a recursive version of something like FindFirstFile, which basically uses a cursor to continue generating items until it reaches the end of the list.

petelomax said...
rneu said...

Example code for a walk_dir() approach would be really illustrative and helpful.

It's pretty simple

... 

For a fuller example, see http://openeuphoria.org/docs/std_filesys.html#_1282_walk_dir

I was offering to adapt my get_files() function to work like walk_dir() which I think is what he was asking for. Here's the code...

include std/filesys.e 
include std/wildcard.e 
 
public function get_files( integer rtn_id, sequence path = current_dir(), sequence name = "*", integer maxdepth = 0 ) 
-- 
-- rtn_id   : routine id to call with each full path 
-- path     : directory path to scan 
-- name     : wildcard pattern to look for 
-- maxdepth : maximum folder depth to scan within path 
-- 
    integer depth, exit_code 
    sequence queue, item, full_path 
    object items 
 
    depth = 1 
    exit_code = 0 
    path = canonical_path( path ) 
 
    -- push initial path into queue 
    queue = { {path,depth} } 
 
    while length( queue ) do 
 
        -- pop an item off the queue 
        item = queue[1] 
        queue = queue[2..$] 
 
        -- get the queue item values 
        path = item[1] 
        depth = item[2] 
 
        -- get the directory items 
        items = dir( path ) 
 
        if atom( items ) then 
            -- error 
            continue 
        end if 
 
        for i = 1 to length( items ) do 
            item = items[i] 
 
            if find( item[D_NAME], {".",".."} ) then 
                -- skip these entries 
                continue 
            end if 
 
            full_path = join_path({ path, item[D_NAME] }) 
 
            if find( 'd', item[D_ATTRIBUTES] ) then 
                -- is a directory 
 
                if maxdepth = 0 or depth < maxdepth then 
                    -- queue this directory 
                    queue = append( queue, {full_path,depth+1} ) 
                end if 
 
            else 
                -- is a file 
 
                if wildcard:is_match( name, item[D_NAME] ) then 
 
                    -- call the user routine 
                    exit_code = call_func( rtn_id, {full_path} ) 
 
                    if exit_code != 0 then 
                        -- time to quit 
                        queue = {} 
                        exit 
                    end if 
 
                end if 
 
            end if 
 
        end for 
 
    end while 
 
    return exit_code 
end function 

And then call the function like you would walk_dir()...

function look_at( sequence full_path ) 
    printf( 1, "%s\n", {full_path} ) 
    return 0 
end function 
 
get_files( routine_id("look_at") ) 

-Greg


12. Re: How to get list of output of find command of Linux

ghaberek said...
petelomax said...

Maybe something like this? http://rosettacode.org/wiki/Generator/Exponential#Phix (but with only one task)

Sort of? I don't think you'd need to use a task for that though. I was thinking of a recursive version of something like FindFirstFile, which basically uses a cursor to continue generating items until it reaches the end of the list.

I had some time to sit down and knock out an example of what I was describing. Basically, my first version above was an eager scanner (it gets all of the files at once), this is a lazy scanner (it only gets more files when it needs to).

To use it, simply call dir_open() to start scanning the directory, then loop until dir_next() returns an empty sequence, and call dir_close() to clean up when you're done.

This is the most efficient method in terms of both speed -- you don't have to wait for it to find all the files before starting the loop, and memory -- it only keeps a subset of directories and file paths in memory at any given time.

Edit: actually, this is probably only faster as a matter of perception since you can get started on the body of the loop sooner; so it feels faster. Just running the numbers quickly on my machine, the eager method (see above) may be about 20-30% faster, at least it is on my system. So if memory is of no concern, then scanning the entire directory first is probably faster, but real-world cases will vary wildly. My tests are doing nothing but printing the path to the console. YMMV

Also, the find command is like, a bazillion times faster than any of this. Not sure what it's doing but there's got to be some lower-level operating system or file system calls going on.

-- dir_scan.e 
include std/filesys.e 
include std/wildcard.e 
 
sequence d = {} 
 
enum PATTERN,DIRS,FILES 
 
-- 
-- prepare a new directory scanner instance 
-- 
public function dir_open( sequence path = ".", sequence pattern = "*" ) 
 
    if path[$] != SLASH then path &= SLASH end if 
    path = canonical_path( path ) 
 
    integer id = find( {}, d ) 
 
    if id = 0 then 
        d = append( d, {} ) 
        id = length( d ) 
    end if 
 
    d[id] = {pattern,{path},{}} -- PATTERN,DIRS,FILES 
 
    return id 
end function 
 
-- 
-- clean up an unused directory scanner 
-- 
public procedure dir_close( integer id ) 
    d[id] = {} 
end procedure 
 
-- 
-- get the next available file in the queue 
-- (returns "" when queues are exhausted) 
-- 
public function dir_next( integer id ) 
 
    sequence pattern = d[id][PATTERN] 
 
    if length( d[id][FILES] ) != 0 then 
        -- we have files, return the next one 
 
        -- pop a file off the queue 
        sequence next_file = d[id][FILES][1] 
        d[id][FILES] = d[id][FILES][2..$] 
 
        return next_file 
 
    end if 
 
    while length( d[id][DIRS] ) != 0 do 
        -- we need to find more files 
 
        -- pop a directory off the queue 
        sequence next_dir = d[id][DIRS][1] 
        d[id][DIRS] = d[id][DIRS][2..$] 
 
        -- get the directory items 
        object items = dir( next_dir ) 
        if atom( items ) then 
            continue 
        end if 
 
        for i = 1 to length( items ) do 
 
            -- get the item details 
            sequence item_name = items[i][D_NAME] 
            sequence item_attr = items[i][D_ATTRIBUTES] 
            sequence full_path = next_dir & item_name 
 
            if find( item_name, {".",".."} ) then 
                -- skip these items 
                continue 
 
            elsif find( 'd', item_attr ) then 
                -- add this to the directory queue 
                d[id][DIRS] = append( d[id][DIRS], full_path & SLASH ) 
 
            elsif wildcard:is_match( pattern, item_name ) then 
                -- add this to the files queue 
                d[id][FILES] = append( d[id][FILES], full_path ) 
 
            end if 
 
        end for 
 
        if length( d[id][FILES] ) != 0 then 
            -- we have files now, return the next one 
 
            -- pop a file off the queue 
            sequence next_file = d[id][FILES][1] 
            d[id][FILES] = d[id][FILES][2..$] 
 
            return next_file 
        end if 
 
    end while 
 
    return "" 
end function 

-- find.ex 
include dir_scan.e 
 
procedure main() 
 
    integer id = dir_open( "/home/greg", "*.e" ) 
 
    sequence path 
 
    while length( path ) with entry do 
        printf( 1, "%s\n", {path} ) 
    entry 
        path = dir_next( id ) 
    end while 
 
    dir_close( id ) 
 
end procedure 
 
main() 

-Greg


13. Re: How to get list of output of find command of Linux

Here are the numbers testing the home directory on my machine (about 8200 files), using time.

      find command  lazy method (eui)  lazy method (euc)  eager method (eui)  eager method (euc)
real  0m0.356s      1m47.227s          1m39.817s          1m16.339s           1m10.341s
user  0m0.188s      0m54.788s          0m48.072s          0m23.888s           0m18.424s
sys   0m0.156s      0m52.280s          0m51.384s          0m51.856s           0m51.412s

-Greg


14. Re: How to get list of output of find command of Linux

I dug through the code for find and discovered that it's using a Linux API called fts which, ironically, uses a similar method to the one I've implemented above. So I guess I was on the right track!

I can work on writing a wrapper for this in Euphoria but that presents a challenge since the API relies heavily on a structure and structure support in Euphoria is all over the place right now.

Edit: my searching also found this conversation, which I forgot about from last year: dir() on linux via c_func, anyone?

-Greg
