1. Python vs Euphoria for web scraping

This article says Python is the "ideal language for this job." I suspect Euphoria or Phix could be better.

Anyway, it would be cool to see a Euphoria or Phix implementation.

2. Re: Python vs Euphoria for web scraping

I always tout Euphoria as a great language for text parsing! If someone could port Python's built-in HTMLParser to Euphoria, it might be easier to lure developers with that.

And if we could port Beautiful Soup, that'd be even more impressive. (It uses a lot of Python-isms that may not translate well to Euphoria.)
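
For reference, this is the kind of interface a port would need to reproduce. A minimal example using Python's built-in `html.parser.HTMLParser`, subclassed here to collect link targets:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="https://openeuphoria.org">the site</a>.</p>')
print(parser.links)  # ['https://openeuphoria.org']
```

A Euphoria port would presumably expose the same callback-driven model: the library drives the tokenizing and hands start-tag/end-tag/data events to user routines.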

-Greg

3. Re: Python vs Euphoria for web scraping

There seem to be a ton of HTML parser libraries written in C. Would it be cheating to use one of those? :)

HTMLTidy seems very capable!

4. Re: Python vs Euphoria for web scraping

Hi

Some people call it cheating, some people call it using the expired patent on the wheel.

Cheers

Chris

5. Re: Python vs Euphoria for web scraping

euphoric said...

Would it be cheating to use one of those?

I would say that if you're designing an application which needs the functionality, using an external library is perfectly fine.

But, if you're trying to show off Euphoria's strengths over other languages, it's probably best to write it natively in Euphoria.

-Greg

6. Re: Python vs Euphoria for web scraping

BTDT (been there, done that):

https://rosettacode.org/wiki/Web_scraping#Phix
https://rosettacode.org/wiki/Rosetta_Code/Rank_languages_by_popularity#Phix
https://rosettacode.org/wiki/Rosetta_Code/Tasks_without_examples#Phix

7. Re: Python vs Euphoria for web scraping

Nice!

But you have got to make grabbing URL output easier. Please! :)

I should be able to do something like this:

string page_html = get_url("http://phix.x10.mx/docs/html/phix.htm") 

without having to go through all the "easy" cURL set-up steps, especially for just web scraping.

Or are all those steps unavoidable?
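
For comparison, Python's standard library already makes this a near one-liner. A minimal sketch of the requested `get_url` shape, using `urllib.request` (the charset fallback is an assumption about what such a helper should do):

```python
import urllib.request

def get_url(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

With that in place, `page_html = get_url("http://phix.x10.mx/docs/html/phix.htm")` is exactly the one-liner being asked for.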

8. Re: Python vs Euphoria for web scraping

euphoric said...

But you have got to make grabbing URL output easier. Please! :)

I should be able to do something like this:

string page_html = get_url("http://phix.x10.mx/docs/html/phix.htm") 

without having to go through all the "easy" cURL set-up steps, especially for just web scraping.

Or are all those steps unavoidable?

No problem. In the next release curl_easy_perform_ex() will accept either a curl handle (as now) or a plain string URL.
Should you want it sooner, just replace this in libcurl.e:

global function curl_easy_perform_ex(object curl) 
-- see also curl_multi_perform_ex, if you modify this. 
    enter_cs(ceb_cs) 
    integer slot_no = 0 
    for i=1 to length(curl_easy_buffers) do 
        if integer(curl_easy_buffers[i]) then 
            curl_easy_buffers[i] = "" 
            slot_no = i 
            exit 
        end if 
    end for 
    if slot_no=0 then 
        curl_easy_buffers = append(curl_easy_buffers,"") 
--      curl_multi_rids = append(curl_multi_rids,0) 
        slot_no = length(curl_easy_buffers) 
    end if 
    leave_cs(ceb_cs) 
 
    bool free_curl = false, 
         was_global_init = global_init 
    if string(curl) then 
        string url = curl 
        if not was_global_init then curl_global_init() end if 
        curl = curl_easy_init() 
        curl_easy_setopt(curl, CURLOPT_URL, url) 
        free_curl = true 
    end if 
    -- set callback function to receive data 
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb) 
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, slot_no) 
 
    -- get file 
    integer ret = curl_easy_perform(curl) 
    if free_curl then 
        curl_easy_cleanup(curl) 
        if not was_global_init then curl_global_cleanup() end if 
    end if 
 
    enter_cs(ceb_cs) 
    string res = curl_easy_buffers[slot_no] 
    curl_easy_buffers[slot_no] = 0  -- (can now be reused) 
    leave_cs(ceb_cs) 
 
    if ret!=CURLE_OK then 
        return ret 
    end if 
 
    return res 
end function 

I have also simplified https://rosettacode.org/wiki/Web_scraping#Phix
