HTMLValidator.com

Link checking and TSslHttpCli problem


I use TSslHttpCli for link checking in my application. I just installed the latest recommended release, 8.65.

 

A user reported that the link checking in my application is stalling (getting stuck).

 

This URL seems to be causing a problem: 

http://www.ec.gc.ca/dd-sd/Default.asp?lang=En&n=C2844D2D-1

 

 

When I try to check it using TSslHttpCli and HttpCli1->HeadASync(), I get two calls to OnRequestDone.

 

1. First, there is a location change to "http://ec.gc.ca/Error 404.html" (could the space character in the new URL cause a problem?)

2. Next there is a first RequestDone call with StatusCode 0 and ReasonPhrase "", which I've programmed to ignore because I expect a 2nd call to RequestDone

3. Next there is a second RequestDone call with StatusCode 200 and ReasonPhrase "OK".

4. Something goes wrong at this point and my link checker stalls. I'm not sure why, but I wonder whether something in TSslHttpCli is messed up or corrupted, especially since the RequestDone function is being called twice (is this a bug?). Or perhaps I'm doing something wrong?

 

I'd really appreciate any insight into what might be happening. Thanks!

Edited by HTMLValidator.com


I have investigated your URL and fixed the bad relocation URL containing a space, at least for GET. It is not yet fixed for HEAD, so there is another problem somewhere. In my tests, the server closes the page as soon as a path with a space is found. Testing with Firefox and Edge/Chrome suggests they correct the location path, although only Edge displays the corrected path.

 

Even after correcting the space, the Error 404.html page is returned with a 200 response, despite the page text saying HTTP Error 404 - Not Found in English and French. So my link checker would certainly assume the link was okay. I don't parse the page text; maybe you do?

 

This also raises the issue of whether ICS should correct bad URLs, which browsers seem to do.  However, such correction is not trivial except for the simplest case of spaces, since we don't want to double encode / for instance. 

 

Angus

 

 

 


Thanks for checking into this. My link checker should check for spaces in redirected URLs and report them and maybe fix them... I'll work on this, but this is not the main problem.

 

I am still confused why the RequestDone function is called twice? Shouldn't it be called only once?

 

The main problem is that my link checker stalls for some reason when this happens, and it never works again until the user exits and restarts my application. Did you find any indication that this issue also causes serious problems or corruption in the operation of TSslHttpCli? I need to research further why this link stalls my link checker.

 

Having the TSslHttpCli correct bad URLs like browsers do seems like a good idea in the cases where it is trivial... but if one is running a link checker like I am, I would want a way to disable these corrections or somehow know that they have been done so I can report the problem to the user.

 


I found the problem with HEAD and some redirections: it failed to start the redirected request until close was called, a bug that seems to have been there for many years. I looked at my own link checker and it uses GET, not HEAD, which is why I've never seen it. Your 404 error page returns content even for 200 and HEAD. The fix will be in SVN tonight.

 

Angus

 


That's great... thanks!


 

Hello,

 

I downloaded the latest SVN and it looks like the major issue with the link checker stalling is fixed so thank you very much!

 

A minor issue... it looks like you're fixing the space character in location redirections which I think (in general) is a good idea but because I want my link checker to report issues and errors to the user, is there a way to detect that this fix/correction was done so I can report it to the user?

 


You can check the original location header by keeping it in the onHeaderData event before the relocation actually happens. 

 

I'm only auto URL encoding the redirection URL, which the user cannot change, not a URL passed to the component; that needs careful consideration. Auto URL encoding is effectively what Firefox and Edge/Chrome do.

 

< HTTP/1.1 302 Redirect
< Content-Type: text/html; charset=UTF-8
< Location: http://ec.gc.ca/Error 404.html
< Content-Length: 153
> GET /Error%20404.html HTTP/1.1

 

Angus

 


Further to my last comment, I need to change the auto URL encoding for relocation so it does not process a URL that is already encoded correctly.  There will be another version soon.

 

Angus

 


 

Well, I'm not sure how you are handling this... but when there was a location change to "https://www.htmlvalidator.com/test/contains space.txt" it looks like it encoded the URL to "https://www.htmlvalidator.com/test%2Fcontains%20space.txt", so it fixed the space character but broke the '/' character.

Perhaps you should just change space characters to %20?

Edited by HTMLValidator.com


Yes, paths are meant to be encoded within the path delimiters, not the / itself, unless it comes after the ?. So I did the simple fix of only handling spaces. About 20 links failed my own tester with full encoding. A new version is now in SVN. That Canadian site also broke the ICS proxy due to not supporting absolute URLs used by proxies; that's been on my list to fix for a year, so it got done as well. Testing is always useful, it gets me to fix things.

 

Angus
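The space-only fix described above can be illustrated with a minimal C++ sketch. This is not the ICS source; `encodeSpacesOnly` is a hypothetical helper name used only for illustration. It replaces each literal space with `%20` and leaves every other character, including `/` and existing `%XX` escapes, untouched.

```cpp
#include <string>

// Hypothetical helper (not part of ICS): percent-encode literal spaces only.
// Every other character is copied through unchanged, so '/' is never turned
// into %2F and an existing escape such as %20 is never re-encoded as %2520.
std::string encodeSpacesOnly(const std::string& url) {
    std::string out;
    out.reserve(url.size());
    for (char c : url) {
        if (c == ' ') {
            out += "%20";
        } else {
            out += c;
        }
    }
    return out;
}
```

Because the output never contains a space, applying the function twice gives the same result as applying it once, which is one reason this narrow fix is safer than full encoding of a URL that may already be partly encoded.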

 

 

 


Thanks. I've updated and it fixes the issue I mentioned.

As you suggested, I'm also using onHeaderData to get the "raw" location URL (before encoding fixes/changes)... and basically throwing a warning if a location change URL contains a space character.
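A rough sketch of that check in plain C++ is below. It assumes the header arrives as one text line (for example, whatever the header event hands over) and does not use any ICS API; `extractLocation` and `locationHasSpace` are hypothetical names made up for this example.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: if headerLine is a Location header, store its raw
// value (surrounding whitespace and CR/LF trimmed) in 'value' and return true.
bool extractLocation(const std::string& headerLine, std::string& value) {
    const std::string name = "location:";
    if (headerLine.size() < name.size()) return false;
    // Header names are case-insensitive, so compare in lower case.
    for (std::size_t i = 0; i < name.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(headerLine[i])) != name[i]) {
            return false;
        }
    }
    std::size_t begin = headerLine.find_first_not_of(" \t\r\n", name.size());
    if (begin == std::string::npos) {
        value.clear();  // Location header with an empty value
        return true;
    }
    std::size_t end = headerLine.find_last_not_of(" \t\r\n");
    value = headerLine.substr(begin, end - begin + 1);
    return true;
}

// True when the line is a Location header whose raw URL contains a space.
bool locationHasSpace(const std::string& headerLine) {
    std::string value;
    return extractLocation(headerLine, value) &&
           value.find(' ') != std::string::npos;
}
```

The raw value has to be captured before any automatic correction is applied, since the corrected request line no longer shows the original space.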

 

While I have your attention... is there a quick "ICS" way/function to check a URL to see if there are any encoding issues with it? Right now I'm only checking for space characters (very basic).

Edited by HTMLValidator.com


I did wonder whether to write a check URL function, but it's not easy, except for space which is illegal in all URLs.  Lots of other special characters like /, & and % may be the result of previous encoding, so you don't know whether to encode them again.

 

But open to suggestions for a URL checker.

 

Angus

On 4/10/2021 at 3:34 AM, Angus Robertson said:

I did wonder whether to write a check URL function, but it's not easy, except for space which is illegal in all URLs.  Lots of other special characters like /, & and % may be the result of previous encoding, so you don't know whether to encode them again.

 

But open to suggestions for a URL checker.

Well, you wouldn't be able to detect some issues, but a URL/percent-encoding checker could check for invalid encoding, such as a % that is not followed by two hexadecimal digits, and of course the invalid space character. It could possibly also detect encoded control characters like %00-%1F. I may have already written something like this a while back (I have to check, as I can't remember for sure), but I thought that a function like this might already be in ICS.

Not a big issue but perhaps something to consider.
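For what it's worth, the three checks described above could look something like the following C++ sketch. It is not an ICS function; `UrlIssue`, `hexValue` and `checkUrlEncoding` are hypothetical names, and the checker only reports problems, it does not try to repair anything.

```cpp
#include <string>
#include <vector>

// Hypothetical issue codes for the checks discussed in this thread.
enum class UrlIssue { Space, BadPercentEscape, EncodedControlChar };

// Returns 0..15 for a hexadecimal digit, or -1 for any other character.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Scans the URL once and reports, in order of appearance:
//  - each literal space,
//  - each '%' not followed by two hexadecimal digits,
//  - each valid escape in the range %00-%1F (an encoded control character).
std::vector<UrlIssue> checkUrlEncoding(const std::string& url) {
    std::vector<UrlIssue> issues;
    for (std::size_t i = 0; i < url.size(); ++i) {
        if (url[i] == ' ') {
            issues.push_back(UrlIssue::Space);
        } else if (url[i] == '%') {
            int hi = (i + 1 < url.size()) ? hexValue(url[i + 1]) : -1;
            int lo = (i + 2 < url.size()) ? hexValue(url[i + 2]) : -1;
            if (hi < 0 || lo < 0) {
                issues.push_back(UrlIssue::BadPercentEscape);
            } else {
                if (hi * 16 + lo <= 0x1F) {
                    issues.push_back(UrlIssue::EncodedControlChar);
                }
                i += 2;  // skip the two hex digits of a valid escape
            }
        }
    }
    return issues;
}
```

As noted earlier in the thread, a checker like this cannot tell whether a valid escape such as `%2F` was intended, so it only flags what is unambiguously wrong.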

Edited by HTMLValidator.com
