1. How to get list of output of find command of Linux
- Posted by rneu Jul 13, 2017
I want to get a list of the full pathnames of all files in a directory and its subdirectories, just as one gets in the output of the Linux 'find' command.
I could not find a way to do it in the manual. The function dir() apparently does not support a recursive option to check subdirectories as well.
I can display the list on the console with system_exec("find"), but how can I get it into a sequence for further use in a Euphoria program?
Thanks for your help.
2. Re: How to get list of output of find command of Linux
- Posted by _tom (admin) Jul 14, 2017
- try walk_dir(); this is recursive
include std/filesys.e
walk_dir(sequence path_name, object your_function, integer scan_subdirs = types:FALSE, object dir_source = types:NO_ROUTINE_ID)
- try writing to a file
system( find --help > foo.txt ) --> load "foo.txt" --> parse and use the output
3. Re: How to get list of output of find command of Linux
- Posted by rneu Jul 16, 2017
- try writing to a file
system( find --help > foo.txt ) --> load "foo.txt" --> parse and use the output
-_tom
This is a very good and simple method. However, the code should not be used exactly as posted.
It should simply be:
system("find > foo.txt")
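To get that output into a sequence, one option is to read the file back with read_lines() from std/io.e. The following is only a minimal sketch under a few assumptions: find_lines() is a hypothetical helper name (it does not exist in the standard library), "foo.txt" is just the scratch file name used above, and the find command must be available on the PATH.

```euphoria
include std/io.e    -- read_lines()

-- Hypothetical helper (not part of the standard library): run the external
-- find command on start_dir and return its output as a sequence of strings,
-- one per line. Returns -1 if the scratch file could not be read.
function find_lines(sequence start_dir)
    object lines

    -- mode 2: do not wait for a key press and do not touch the screen
    system("find \"" & start_dir & "\" > foo.txt", 2)

    -- read_lines() strips the line endings and returns -1 on failure
    lines = read_lines("foo.txt")
    return lines
end function

object paths = find_lines(".")
if atom(paths) then
    puts(2, "could not read the output of find\n")
else
    for i = 1 to length(paths) do
        printf(1, "%s\n", {paths[i]})
    end for
end if
```

Note that when the scan starts in the current directory, foo.txt itself will probably appear in the list, because the shell creates it before find runs.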
4. Re: How to get list of output of find command of Linux
- Posted by ghaberek (admin) Jul 17, 2017
I would not recommend using system() to make an external call to find. That method is not portable and should be considered bad practice.
Here is the method I use for recursively finding files in a directory without using walk_dir() and without making recursive calls to the same function.
Essentially, we just keep a running queue of the directories we'd like to scan while adding the files found to a list for output.
include std/filesys.e
include std/wildcard.e

public function get_files( sequence path = current_dir(), sequence name = "*", integer maxdepth = 0 )
--
-- path     : directory path to scan
-- name     : wildcard pattern to look for
-- maxdepth : maximum folder depth to scan within path
--
    integer depth
    sequence files, queue, item, full_path
    object items

    depth = 1
    files = {}
    path = canonical_path( path )

    -- push initial path into queue
    queue = { {path,depth} }

    while length( queue ) do

        -- pop an item off the queue
        item = queue[1]
        queue = queue[2..$]

        -- get the queue item values
        path = item[1]
        depth = item[2]

        -- get the directory items
        items = dir( path )
        if atom( items ) then
            -- error
            continue
        end if

        for i = 1 to length( items ) do
            item = items[i]

            if find( item[D_NAME], {".",".."} ) then
                -- skip these entries
                continue
            end if

            full_path = join_path({ path, item[D_NAME] })

            if find( 'd', item[D_ATTRIBUTES] ) then
                -- is a directory
                if maxdepth = 0 or depth < maxdepth then
                    -- queue this directory
                    queue = append( queue, {full_path,depth+1} )
                end if
            else
                -- is a file
                if wildcard:is_match( name, item[D_NAME] ) then
                    -- add this to the output
                    files = append( files, full_path )
                end if
            end if
        end for
    end while

    return files
end function
5. Re: How to get list of output of find command of Linux
- Posted by rneu Jul 17, 2017
Thanks for a very clear code.
Why are you against recursion?
6. Re: How to get list of output of find command of Linux
- Posted by ghaberek (admin) Jul 18, 2017
Thanks for a very clear code.
Why are you against recursion?
I'm not against recursion per se, but it's not really a good approach for scanning through a file system. With any programming language, but especially interpreted ones like Euphoria, deeply recursive functions create additional overhead, and file systems tend to be deeply recursive, so there's a potential for lots of overhead.
Alternatively, if it applies to your project, what I'd recommend is using a queued approach like I've shown, but combine that with a visitor function like walk_dir() uses, so that you don't collect all the file names you've found in memory. I can provide an example on how to do that if you need.
In the end, the "best" method for this is probably an iterator that holds its position in the queue and only returns the values as they're requested. Other languages have this feature, like Python, and it's a pretty neat, albeit complicated, way to run through a list of items very efficiently.
7. Re: How to get list of output of find command of Linux
- Posted by petelomax Jul 18, 2017
I quickly tried that code on Phix, and apart from the parameter default of current_dir() not being supported on Phix (really must fix that some day), and wildcard_match(), it works fine.
One of the reasons for doing so was that I thought of a slightly better way to manage the queue (some comments stripped for clarity):
function get_files(sequence path="", sequence wildname="*", integer maxdepth=0)
--
-- path     : directory path to scan
-- wildname : wildcard pattern to look for
-- maxdepth : maximum folder depth to scan within path
--
    path = iff(path=""?current_dir():get_proper_path(path))
    integer depth = 1
    sequence files = {},
             queue = {path,depth,{}}
    while length(queue) do
        {path,depth,queue} = queue
        object d = dir(path)
        if sequence(d) then
            for i=1 to length(d) do
                string {name,attr} = d[i]
                if not find(name,{".",".."}) then
                    string full_path = join_path({path,name})
                    if find('d',attr) then
                        if maxdepth=0 or depth<maxdepth then
                            queue = {full_path,depth+1,queue}
                        end if
                    elsif wildcard_match(wildname, name) then
                        files = append(files, full_path)
                    end if
                end if
            end for
        end if
    end while
    return files
end function
Not at all sure that will work on OE, and I can accept that a "flat" queue might be slightly more efficient, but I was aiming for elegance.
PS: On recursion vs iteration, Greg may have slightly exaggerated, but is essentially correct. An iterative solution will create just one queue and plant each element in the result sequence files just once, and will therefore always be slightly more efficient than a recursive solution, but any difference in performance is probably insignificant. In the end, whichever you find easier (recursion vs iteration), to understand and maintain, is the one to use.
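For comparison, a recursive version might look something like the sketch below. It is only an illustration, not code anyone in this thread has posted or benchmarked: get_files_recursive() is a made-up name, there is no wildcard or maxdepth handling, and each level of recursion builds its own partial list and concatenates it into its caller's list, which is where the extra copying comes from.

```euphoria
include std/filesys.e   -- dir(), current_dir(), SLASH, D_NAME, D_ATTRIBUTES

-- Hypothetical recursive counterpart to the queue-based get_files().
-- Returns the full path of every file found under path.
function get_files_recursive(sequence path)
    sequence files = {}
    object items = dir(path)

    if atom(items) then
        -- path could not be read; contribute nothing
        return files
    end if

    for i = 1 to length(items) do
        sequence name = items[i][D_NAME]
        if not find(name, {".", ".."}) then
            sequence full_path = path & SLASH & name
            if find('d', items[i][D_ATTRIBUTES]) then
                -- directory: recurse and concatenate the sub-list
                files &= get_files_recursive(full_path)
            else
                -- plain file: add it to this level's list
                files = append(files, full_path)
            end if
        end if
    end for

    return files
end function

sequence found = get_files_recursive(current_dir())
for i = 1 to length(found) do
    printf(1, "%s\n", {found[i]})
end for
```

The recursion depth here equals the directory nesting depth, so for ordinary directory trees either form should be fine.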
8. Re: How to get list of output of find command of Linux
- Posted by petelomax Jul 18, 2017
In the end, the "best" method for this is probably an iterator that holds its position in the queue and only returns the values as they're requested. Other languages have this feature, like Python, and it's a pretty neat, albeit complicated, way to run through a list of items very efficiently.
Maybe something like this? http://rosettacode.org/wiki/Generator/Exponential#Phix (but with only one task)
9. Re: How to get list of output of find command of Linux
- Posted by rneu Jul 18, 2017
Alternatively, if it applies to your project, what I'd recommend is using a queued approach like I've shown, but combine that with a visitor function like walk_dir() uses, so that you don't collect all the file names you've found in memory. I can provide an example on how to do that if you need. -Greg
A walk_dir() approach code here will be really illustrative and helpful.
10. Re: How to get list of output of find command of Linux
- Posted by petelomax Jul 18, 2017
A walk_dir() approach code here will be really illustrative and helpful.
It's pretty simple
sequence files = {}

function look_at(sequence path, sequence d)
    files = append(files,{join_path({path,d[D_NAME]}),d[D_SIZE]})
    return 0 -- keep going
end function

if walk_dir("C:\\Downloads\\Software", routine_id("look_at"), TRUE)!=0 then
    ?"oops"
end if

for i=1 to length(files) do
    printf(1, "%s: %d\n",files[i])
end for

<snip>
C:\Downloads\Software\SKU322505 inpa\inpa\InpaCANinstall.doc: 4561920
C:\Downloads\Software\SKU322505 inpa\inpa\InpaCANinstall.pdf: 300349
C:\Downloads\Software\SKU322505 inpa\inpa\autorun.inf: 60
C:\Downloads\Software\SKU322505 inpa\inpa\cemtl.ico: 4286
C:\Downloads\Software\tdm-gcc-5.1.0-3.exe: 36802006
C:\Downloads\Software\tdm64-gcc-5.1.0-2.exe: 48071122
For a fuller example, see http://openeuphoria.org/docs/std_filesys.html#_1282_walk_dir
PS: you'll probably also need include std/filesys.e at the top
11. Re: How to get list of output of find command of Linux
- Posted by ghaberek (admin) Jul 18, 2017
I quickly tried that code on Phix, and apart from the parameter default of current_dir() not being supported on Phix (really must fix that some day), and wildcard_match(), it works fine.
One of the reasons for doing so was that I thought of a slightly better way to manage the queue (some comments stripped for clarity):
Not at all sure that will work on OE, and I can accept that a "flat" queue might be slightly more efficient, but I was aiming for elegance.
Interesting. You're basically creating a linked list instead of a flat queue. Neat!
PS: On recursion vs iteration, Greg may have slightly exaggerated, but is essentially correct. An iterative solution will create just one queue and plant each element in the result sequence files just once, and will therefore always be slightly more efficient than a recursive solution, but any difference in performance is probably insignificant. In the end, whichever you find easier (recursion vs iteration), to understand and maintain, is the one to use.
I suppose I like to hyperbolize things once in a while. Simply knowing that a particular method is introducing additional overhead makes it feel inefficient. But in most cases the difference is likely imperceptible.
Maybe something like this? http://rosettacode.org/wiki/Generator/Exponential#Phix (but with only one task)
Sort of? I don't think you'd need to use a task for that though. I was thinking of a recursive version of something like FindFirstFile, which basically uses a cursor to continue generating items until it has reached the end of the list.
A walk_dir() approach code here will be really illustrative and helpful.
It's pretty simple
For a fuller example, see http://openeuphoria.org/docs/std_filesys.html#_1282_walk_dir
I was offering to adapt my get_files() function to work like walk_dir() which I think is what he was asking for. Here's the code...
include std/filesys.e
include std/wildcard.e

public function get_files( integer rtn_id, sequence path = current_dir(), sequence name = "*", integer maxdepth = 0 )
--
-- rtn_id   : routine id to call with each full path
-- path     : directory path to scan
-- name     : wildcard pattern to look for
-- maxdepth : maximum folder depth to scan within path
--
    integer depth, exit_code
    sequence queue, item, full_path
    object items

    depth = 1
    exit_code = 0
    path = canonical_path( path )

    -- push initial path into queue
    queue = { {path,depth} }

    while length( queue ) do

        -- pop an item off the queue
        item = queue[1]
        queue = queue[2..$]

        -- get the queue item values
        path = item[1]
        depth = item[2]

        -- get the directory items
        items = dir( path )
        if atom( items ) then
            -- error
            continue
        end if

        for i = 1 to length( items ) do
            item = items[i]

            if find( item[D_NAME], {".",".."} ) then
                -- skip these entries
                continue
            end if

            full_path = join_path({ path, item[D_NAME] })

            if find( 'd', item[D_ATTRIBUTES] ) then
                -- is a directory
                if maxdepth = 0 or depth < maxdepth then
                    -- queue this directory
                    queue = append( queue, {full_path,depth+1} )
                end if
            else
                -- is a file
                if wildcard:is_match( name, item[D_NAME] ) then
                    -- call the user routine
                    exit_code = call_func( rtn_id, {full_path} )
                    if exit_code != 0 then
                        -- time to quit
                        queue = {}
                        exit
                    end if
                end if
            end if
        end for
    end while

    return exit_code
end function
And then call the function like you would walk_dir()...
function look_at( sequence full_path )
    printf( 1, "%s\n", {full_path} )
    return 0
end function

get_files( routine_id("look_at") )
12. Re: How to get list of output of find command of Linux
- Posted by ghaberek (admin) Jul 24, 2017
- Last edited Jul 25, 2017
Maybe something like this? http://rosettacode.org/wiki/Generator/Exponential#Phix (but with only one task)
Sort of? I don't think you'd need to use a task for that though. I was thinking of a recursive version of something like FindFirstFile, which basically uses a cursor to continue generating items until it has reached the end of the list.
I had some time to sit down and knock out an example of what I was describing. Basically, my first version above was an eager scanner (it gets all of the files at once), this is a lazy scanner (it only gets more files when it needs to).
To use it, simply call dir_open() to start scanning the directory, then loop until dir_next() returns an empty sequence, and call dir_close() to clean up when you're done.
This is the most efficient method in terms of both speed -- you don't have to wait for it to find all the files before starting the loop, and memory -- it only keeps a subset of directories and file paths in memory at any given time.
Edit: actually, this is probably only faster as a matter of perception since you can get started on the body of the loop sooner; so it feels faster. Just running the numbers quickly on my machine, the eager method (see above) may be about 20-30% faster, at least it is on my system. So if memory is of no concern, then scanning the entire directory first is probably faster, but real-world cases will vary wildly. My tests are doing nothing but printing the path to the console. YMMV
Also, the find command is like, a bazillion times faster than any of this. Not sure what it's doing but there's got to be some lower-level operating system or file system calls going on.
-- dir_scan.e
include std/filesys.e
include std/wildcard.e

sequence d = {}

enum PATTERN,DIRS,FILES

--
-- prepare a new directory scanner instance
--
public function dir_open( sequence path = ".", sequence pattern = "*" )

    if path[$] != SLASH then
        path &= SLASH
    end if

    path = canonical_path( path )

    integer id = find( {}, d )

    if id = 0 then
        d = append( d, {} )
        id = length( d )
    end if

    d[id] = {pattern,{path},{}} -- PATTERN,DIRS,FILES

    return id
end function

--
-- clean up an unused directory scanner
--
public procedure dir_close( integer id )
    d[id] = {}
end procedure

--
-- get the next available file in the queue
-- (returns "" when queues are exhausted)
--
public function dir_next( integer id )

    sequence pattern = d[id][PATTERN]

    if length( d[id][FILES] ) != 0 then
        -- we have files, return the next one

        -- pop a file off the queue
        sequence next_file = d[id][FILES][1]
        d[id][FILES] = d[id][FILES][2..$]

        return next_file
    end if

    while length( d[id][DIRS] ) != 0 do
        -- we need to find more files

        -- pop a directory off the queue
        sequence next_dir = d[id][DIRS][1]
        d[id][DIRS] = d[id][DIRS][2..$]

        -- get the directory items
        object items = dir( next_dir )
        if atom( items ) then
            continue
        end if

        for i = 1 to length( items ) do

            -- get the item details
            sequence item_name = items[i][D_NAME]
            sequence item_attr = items[i][D_ATTRIBUTES]
            sequence full_path = next_dir & item_name

            if find( item_name, {".",".."} ) then
                -- skip these items
                continue

            elsif find( 'd', item_attr ) then
                -- add this to the directory queue
                d[id][DIRS] = append( d[id][DIRS], full_path & SLASH )

            elsif wildcard:is_match( pattern, item_name ) then
                -- add this to the files queue
                d[id][FILES] = append( d[id][FILES], full_path )

            end if

        end for

        if length( d[id][FILES] ) != 0 then
            -- we have files now, return the next one

            -- pop a file off the queue
            sequence next_file = d[id][FILES][1]
            d[id][FILES] = d[id][FILES][2..$]

            return next_file
        end if

    end while

    return ""
end function

-- find.ex
include dir_scan.e

procedure main()

    integer id = dir_open( "/home/greg", "*.e" )
    sequence path

    while length( path ) with entry do
        printf( 1, "%s\n", {path} )
    entry
        path = dir_next( id )
    end while

    dir_close( id )

end procedure

main()
13. Re: How to get list of output of find command of Linux
- Posted by ghaberek (admin) Jul 25, 2017
Here are the numbers testing the home directory on my machine (about 8200 files), using time.
time | find command | lazy method (eui) | lazy method (euc) | eager method (eui) | eager method (euc) |
real | 0m0.356s | 1m47.227s | 1m39.817s | 1m16.339s | 1m10.341s |
user | 0m0.188s | 0m54.788s | 0m48.072s | 0m23.888s | 0m18.424s |
sys | 0m0.156s | 0m52.280s | 0m51.384s | 0m51.856s | 0m51.412s |
14. Re: How to get list of output of find command of Linux
- Posted by ghaberek (admin) Jul 25, 2017
I dug through the code for find and discovered that it's using a Linux API called fts which, ironically, uses a similar method to the one I've implemented above. So I guess I was on the right track!
I can work on writing a wrapper for this in Euphoria but that presents a challenge since the API relies heavily on a structure and structure support in Euphoria is all over the place right now.
Edit: my searching also found this conversation, which I forgot about from last year: dir() on linux via c_func, anyone?