1. http_get does not retrieve page content
- Posted by useless_ Jan 09, 2013
- 1508 views
My recent post on this topic in the ticket section was deleted. At issue is how Eu parses an HTTP header when retrieving webpages.
Not all HTTP servers adhere exactly to the "standard". By following the "standard" strictly, Eu "breaks" when attempting to fetch a page that deviates in any way from what Eu will accept. This procedure is now broken.
It's not Eu's place to enforce the "standards", and it's unacceptable that Eu is voluntarily broken and intolerant of even slight deviations from the "standard". Eu will now refuse to get those webpages when it could quite possibly get them gracefully.
useless
2. Re: http_get does not retrieve page content
- Posted by useless_ Jan 09, 2013
- 1422 views
> My recent post on this topic in the ticket section was deleted. At issue is how Eu parses an HTTP header when retrieving webpages.
> Not all HTTP servers adhere exactly to the "standard". By following the "standard" strictly, Eu "breaks" when attempting to fetch a page that deviates in any way from what Eu will accept. This procedure is now broken.
> It's not Eu's place to enforce the "standards", and it's unacceptable that Eu is voluntarily broken and intolerant of even slight deviations from the "standard". Eu will now refuse to get those webpages when it could quite possibly get them gracefully.
> useless
My first post on this subject now shows an EDIT flag. I didn't edit that post. (I did edit this one.)
useless
Someone also edited this post, again changing what I said, and then deleted a follow-up post. There is a distinct lack of editorial integrity on this forum.
3. Re: http_get does not retrieve page content
- Posted by DerekParnell (admin) Jan 09, 2013
- 1389 views
> My recent post on this topic in the ticket section was deleted. At issue is how Eu parses an HTTP header when retrieving webpages.
> Not all HTTP servers adhere exactly to the "standard". By following the "standard" strictly, Eu "breaks" when attempting to fetch a page that deviates in any way from what Eu will accept. This procedure is now broken.
> It's not Eu's place to enforce the "standards", and it's unacceptable that Eu is voluntarily broken and intolerant of even slight deviations from the "standard". Eu will now refuse to get those webpages when it could quite possibly get them gracefully.
This sounds very reasonable to me.
What exactly is it in the webpage data that's causing the Eu library routines to reject those webpages?
4. Re: http_get does not retrieve page content
- Posted by useless_ Jan 09, 2013
- 1398 views
> > My recent post on this topic in the ticket section was deleted. At issue is how Eu parses an HTTP header when retrieving webpages.
> > Not all HTTP servers adhere exactly to the "standard". By following the "standard" strictly, Eu "breaks" when attempting to fetch a page that deviates in any way from what Eu will accept. This procedure is now broken.
> > It's not Eu's place to enforce the "standards", and it's unacceptable that Eu is voluntarily broken and intolerant of even slight deviations from the "standard". Eu will now refuse to get those webpages when it could quite possibly get them gracefully.
> This sounds very reasonable to me.
> What exactly is it in the webpage data that's causing the Eu library routines to reject those webpages?
The issue, as I understand it, is which characters count as line terminators in the header, and in what order and quantity they may appear. There is an RFC and a "standard", which isn't always followed to the letter; this is a common situation online (and a frequent cause of browser wars).
It's my contention that trim() will handle all "standard" situations as well as non-standard ones.
useless
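The tolerant header parsing proposed above can be sketched as follows. This is a Python illustration, not Eu's actual http_get code; the function name and the choice of `splitlines()` plus per-line stripping are the editor's assumptions about how such tolerance might look:

```python
def parse_status_and_headers(raw_header_block: bytes):
    """Parse an HTTP status line and header fields, tolerating
    non-standard line endings (bare LF or bare CR as well as CRLF)."""
    text = raw_header_block.decode("iso-8859-1")
    # splitlines() accepts \r\n, \n and \r alike, so slightly
    # off-spec servers are handled the same as compliant ones;
    # strip() then plays the role of Eu's trim() on each line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers
```

The same input with CRLF, bare LF, or bare CR terminators yields identical results, which is the graceful behaviour being argued for.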
5. Re: http_get does not retrieve page content
- Posted by jimcbrown (admin) Jan 09, 2013
- 1392 views
> > Not all HTTP servers adhere exactly to the "standard". By following the "standard" strictly, Eu "breaks" when attempting to fetch a page that deviates in any way from what Eu will accept. This procedure is now broken.
> > It's not Eu's place to enforce the "standards", and it's unacceptable that Eu is voluntarily broken and intolerant of even slight deviations from the "standard". Eu will now refuse to get those webpages when it could quite possibly get them gracefully.
> This sounds very reasonable to me.
Me too, although I'd like to see some hard data (specific websites that demonstrate the symptoms of this issue, and if possible statistics on how widespread it is and how other HTTP library implementations deal with it) before I'd feel comfortable implementing this kind of change myself.
Still, I'll defer to the group decision, as I usually do.
6. Re: http_get does not retrieve page content
- Posted by useless_ Jan 09, 2013
- 1385 views
> > > Not all HTTP servers adhere exactly to the "standard". By following the "standard" strictly, Eu "breaks" when attempting to fetch a page that deviates in any way from what Eu will accept. This procedure is now broken.
> > > It's not Eu's place to enforce the "standards", and it's unacceptable that Eu is voluntarily broken and intolerant of even slight deviations from the "standard". Eu will now refuse to get those webpages when it could quite possibly get them gracefully.
> > This sounds very reasonable to me.
> Me too, although I'd like to see some hard data (specific websites that demonstrate the symptoms of this issue, and if possible statistics on how widespread it is and how other HTTP library implementations deal with it) before I'd feel comfortable implementing this kind of change myself.
> Still, I'll defer to the group decision, as I usually do.
I don't log the different header syntaxes; I just get the webpages. And I no longer research other computer languages, because I found this one. I am telling you the situation has occurred.
useless
7. Re: http_get does not retrieve page content
- Posted by CoJaBo2 Jan 09, 2013
- 1365 views
> I don't log the different header syntaxes; I just get the webpages. And I no longer research other computer languages, because I found this one. I am telling you the situation has occurred.
OK, so log it next time. In the meantime, there is nothing that can be done about it, since there is no way to tell what the problem was. Indeed, it could well have simply been bug #831, which is now fixed.
8. Re: http_get does not retrieve page content
- Posted by DerekParnell (admin) Jan 09, 2013
- 1373 views
> OK, so log it next time. In the meantime, there is nothing that can be done about it, since there is no way to tell what the problem was. Indeed, it could well have simply been bug #831, which is now fixed.
I think that trim() is a better way to go, as that would be more tolerant of the variations that might exist in web servers out in the wild.
9. Re: http_get does not retrieve page content
- Posted by CoJaBo2 Jan 09, 2013
- 1376 views
> I think that trim() is a better way to go, as that would be more tolerant of the variations that might exist in web servers out in the wild.
Is there any evidence to suggest that such servers do exist? I would be interested in seeing one.
10. Re: http_get does not retrieve page content
- Posted by useless_ Jan 09, 2013
- 1363 views
> > I think that trim() is a better way to go, as that would be more tolerant of the variations that might exist in web servers out in the wild.
> Is there any evidence to suggest that such servers do exist? I would be interested in seeing one.
There are my words.
useless
11. Re: http_get does not retrieve page content
- Posted by DerekParnell (admin) Jan 09, 2013
- 1338 views
> > I think that trim() is a better way to go, as that would be more tolerant of the variations that might exist in web servers out in the wild.
> Is there any evidence to suggest that such servers do exist? I would be interested in seeing one.
I have none.
But let's assume that they don't exist; in other words, that every web server that currently exists, or ever will, always delivers correct webpage headers. In that case, trim() does no harm and runs very quickly (it has almost nothing to do).
Now let's continue hypothesising ... what if a web server actually does exist that delivers headers with non-standard whitespace? We need to ask ourselves: do we care? Does the user of our application care? Probably not. So if the library uses trim(), it will be as if standard headers were used, and the application doesn't trip over or do something unintended.
So on the balance of probability (there is some chance that bad web servers exist), and in the interest of defensive programming, why not use trim() regardless?
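Derek's "no harm" argument can be demonstrated directly. Here Python's `strip()` stands in for Eu's trim() (an editor's analogy, not eunet code): on a compliant header line it is a no-op, while a line with stray trailing whitespace or a lone carriage return is normalised to the same canonical form.

```python
compliant = "Content-Length: 42"
sloppy = "Content-Length: 42 \r"   # trailing space and bare CR

# No-op on the compliant line, a fix on the sloppy one.
assert compliant.strip() == compliant
assert sloppy.strip() == compliant
```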
12. Re: http_get does not retrieve page content
- Posted by m_sabal Jan 10, 2013
- 1401 views
Step 1: Please test the offending web sites using the old eunet. I suspect you will find the same behavior.
Step 2: Identify what characters are being used as line terminators in place of the standard.
Trim will not work in this case, because the entire document, headers plus body, is sent as one large block of data (or in some cases, multiple n-byte blocks). The program then needs to parse each of the header lines out of that block, and return what's left as the body of the document. Trim will only remove null characters and whitespace from the beginning and end of the entire transmission; it won't help with the parsing.
I made the decision to make eunet strictly standards-compliant until enough use cases violating the standard could be identified to program for the exception. It's all open source; you are welcome to add the additional code you need to solve your problem.
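m_sabal's point, that locating the header/body boundary is a parsing step trim() alone cannot perform, can be sketched like this. Again this is a Python illustration under the editor's assumptions, not eunet's implementation; it accepts the standard CRLF CRLF boundary as well as the non-standard LF LF and CR CR variants:

```python
import re

def split_response(raw: bytes):
    """Split a raw HTTP response into (header_block, body).
    The boundary is the first blank line; CRLFCRLF is standard,
    but LFLF and CRCR from off-spec servers are also accepted."""
    m = re.search(rb"\r\n\r\n|\n\n|\r\r", raw)
    if m is None:
        return raw, b""   # no blank line: treat everything as headers
    return raw[:m.start()], raw[m.end():]
```

Only after this split can the header lines be parsed individually (where per-line trimming helps) and the remainder returned as the document body.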