Re: Python vs Euphoria for web scraping

euphoric said...

But you have got to make grabbing URL output easier. Please! :)

I should be able to do something like this:

string page_html = get_url("http://phix.x10.mx/docs/html/phix.htm") 

without having to go through all the "easy" cURL set-up steps, especially for just web scraping.

Or are all those steps unavoidable?

No problem. In the next release curl_easy_perform_ex() will accept either a curl handle (as now), or a plain string url.
Should you want it sooner, just replace this in libcurl.e:

global function curl_easy_perform_ex(object curl) 
-- see also curl_multi_perform_ex, if you modify this. 
    enter_cs(ceb_cs) 
    integer slot_no = 0 
    for i=1 to length(curl_easy_buffers) do 
        if integer(curl_easy_buffers[i]) then 
            curl_easy_buffers[i] = "" 
            slot_no = i 
            exit 
        end if 
    end for 
    if slot_no=0 then 
        curl_easy_buffers = append(curl_easy_buffers,"") 
--      curl_multi_rids = append(curl_multi_rids,0) 
        slot_no = length(curl_easy_buffers) 
    end if 
    leave_cs(ceb_cs) 
 
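    -- if passed a plain string url, create a temporary curl handle for it 
    -- (freed below, along with any global init that had to be done here) 
    bool free_curl = false, 
         was_global_init = global_init 
    if string(curl) then 
        string url = curl 
        if not was_global_init then curl_global_init() end if 
        curl = curl_easy_init() 
        curl_easy_setopt(curl, CURLOPT_URL, url) 
        free_curl = true 
    end if 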
    -- set callback function to receive data 
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb) 
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, slot_no) 
 
    -- get file 
    integer ret = curl_easy_perform(curl) 
    if free_curl then 
        curl_easy_cleanup(curl) 
        if not was_global_init then curl_global_cleanup() end if 
    end if 
 
    enter_cs(ceb_cs) 
    string res = curl_easy_buffers[slot_no] 
    curl_easy_buffers[slot_no] = 0  -- (can now be reused) 
    leave_cs(ceb_cs) 
 
    if ret!=CURLE_OK then 
        return ret 
    end if 
 
    return res 
end function 
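
With that replacement in place, the one-liner asked for above should be roughly achievable. The following is only a minimal sketch: it assumes libcurl.e is included from builtins as in the standard distribution, and it reuses the url from the question. Since the routine returns the CURLcode rather than a string when curl_easy_perform() fails, the result is held in an object and checked with string() before use.

```phix
include builtins\libcurl.e   -- assumption: standard location of libcurl.e

-- pass a plain string url; the handle set-up and cleanup happen internally
object res = curl_easy_perform_ex("http://phix.x10.mx/docs/html/phix.htm")
if string(res) then
    -- success: res holds the page contents
    printf(1, "fetched %d bytes\n", length(res))
else
    -- failure: res is the integer code returned by curl_easy_perform()
    printf(1, "curl error %d\n", res)
end if
```

Passing a curl handle you have set up yourself should continue to work exactly as before, which is still the way to go when other options need setting.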

I have also simplified https://rosettacode.org/wiki/Web_scraping#Phix
