Jump to content
clubreseau

remove part of string and compare

Recommended Posts

Hi,

 

Question 1

 

I have a listbox with full or URL like

http://test.com/demo.php

https://test.com/demo.php

http://www.test.com/demo.php

 

 

if my string have somethink like https://test.com/test/test.php then do not add in listbox1 because test.com are in the listbox1.

maybe the solution is to remove http:// and https:// at the begining until first / and after compare to see.

 

 

Question 2

 

I have a listbox with 3000 URL when I press the button to paste all items from listbox to memo, its feeze and take time.

Someone have the solution for long text for memo.

 

Thank you, please im visual need a demo of part of code.

Edited by clubreseau

Share this post


Link to post

Question 1

Try something like this

function ExtractDomainName(const Url: string): string;
var
  FirstChar, LastChar, Len: Integer;
begin
  Result := '';
  FirstChar := Pos('://', Url);
  LastChar := Pos('/', Url, FirstChar + 3);
  if FirstChar = 0 then
    FirstChar := 1                  //  means no protocol there, so we adjust for the first char
  else
    FirstChar := FirstChar + 3;     //  to compensate the length of "://"
  if LastChar = 0 then
    LastChar := Length(Url) + 1;    //  adjust for the last char, add 1 for Len calculation
  if FirstChar >= LastChar then
    Exit;                           //  we have broken an empty URL or broken one
  Len := LastChar - FirstChar;
  SetLength(Result, Len);
  Move(Url[FirstChar], Result[1], Len * SizeOf(Char));
end;

Question 2

You need to understand the following

TMemo is a control that will issue an invalidate and redraw..., every time you add an item, this behaviour is not strict to TMemo, you can avoid the redraw on every items when you will add many items in a loop by locking it.

  Memo.Lines.BeginUpdate;
  try
    // We loop here and add items
  finally
    Memo.Lines.EndUpdate;
  end;

try..finally is crucial here, because if you locked some control like a Memo or ListBox, then in that loop some exception raised, that will mean EndUpdate is not called and your control is frozen and might cause you whole application to freeze.

Share this post


Link to post
On 11/4/2020 at 3:22 AM, Kas Ob. said:

Question 1

Try something like this


function ExtractDomainName(const Url: string): string;
var
  FirstChar, LastChar, Len: Integer;
begin
  Result := '';
  FirstChar := Pos('://', Url);
  LastChar := Pos('/', Url, FirstChar + 3);
  if FirstChar = 0 then
    FirstChar := 1                  //  means no protocol there, so we adjust for the first char
  else
    FirstChar := FirstChar + 3;     //  to compensate the length of "://"
  if LastChar = 0 then
    LastChar := Length(Url) + 1;    //  adjust for the last char, add 1 for Len calculation
  if FirstChar >= LastChar then
    Exit;                           //  we have broken an empty URL or broken one
  Len := LastChar - FirstChar;
  SetLength(Result, Len);
  Move(Url[FirstChar], Result[1], Len * SizeOf(Char));
end;

Question 2

You need to understand the following

TMemo is a control that will issue an invalidate and redraw..., every time you add an item, this behaviour is not strict to TMemo, you can avoid the redraw on every items when you will add many items in a loop by locking it.


  Memo.Lines.BeginUpdate;
  try
    // We loop here and add items
  finally
    Memo.Lines.EndUpdate;
  end;

try..finally is crucial here, because if you locked some control like a Memo or ListBox, then in that loop some exception raised, that will mean EndUpdate is not called and your control is frozen and might cause you whole application to freeze.

I use this to Load 10K of Items from listbox1 to Memo1 and it take 4 minut to see items in memo1 and the APP Freeze.

 

here my code

procedure TForm1.Button1(Sender: TObject);
var
i:integer;
begin
       Memo1.lines.BeginUpdate;
  try

   for i := 0 to ListBox1.Items.Count -1 do

       Memo1.Lines := ListBox1.Items;

      finally

        Memo1.lines.EndUpdate;

      end;
end;

 

Share this post


Link to post
1 hour ago, clubreseau said:

for i := 0 to ListBox1.Items.Count -1 do
       Memo1.Lines := ListBox1.Items;

What is this ???!!!!!!

You are loading all items 10000 items, and yes it will take 4 minutes.

 

You should either do this 

  Memo1.Lines.BeginUpdate;
  try
    for I := 0 to ListBox1.Items.Count - 1 do
      Memo1.Lines.Add(ListBox1.Items.Strings[I])
  finally
    Memo1.Lines.EndUpdate;
  end;

Or simply this one line, (no looping)

Memo1.Lines := ListBox1.Items;  // no need to BeginUpdate..EndUpdate here 

the difference is this, if you need to process each item then use BeginUpdate..EndUpdate and loop through the items, when you don't need to process then you don't need to loop, means just one line will do it.

 

Here is a demo with 10k lines, you go and time it and what is the difference for yourself.

 

ListBoxMemo.zip

Share this post


Link to post
12 hours ago, Kas Ob. said:

What is this ???!!!!!!

You are loading all items 10000 items, and yes it will take 4 minutes.

 

You should either do this 


  Memo1.Lines.BeginUpdate;
  try
    for I := 0 to ListBox1.Items.Count - 1 do
      Memo1.Lines.Add(ListBox1.Items.Strings[I])
  finally
    Memo1.Lines.EndUpdate;
  end;

Or simply this one line, (no looping)


Memo1.Lines := ListBox1.Items;  // no need to BeginUpdate..EndUpdate here 

the difference is this, if you need to process each item then use BeginUpdate..EndUpdate and loop through the items, when you don't need to process then you don't need to loop, means just one line will do it.

 

Here is a demo with 10k lines, you go and time it and what is the difference for yourself.

 

ListBoxMemo.zip

 

why this freeze the APP and take 1 minut to load 10K of items ?

procedure TForm1.Button1(Sender: TObject);

begin
   Form1.Memo1.Lines := Form2.Listbox1.Items;
end;

 

Share this post


Link to post
18 minutes ago, clubreseau said:

why this freeze the APP and take 1 minut to load 10K of items ?

Generally speaking, the API of most Windows common controls goes through the windows message queue. So my guess is that for each item in the listbox there's a lot of messages going back and forth while the listbox is populated and that's an enormous overhead.

If you break the application in the debugger while the list is loading you can see what it's doing.

Share this post


Link to post
29 minutes ago, clubreseau said:

why this freeze the APP and take 1 minut to load 10K of items ?

I have no idea !

 

On my PC both take almost the same time and that is less than 2 seconds.

Share this post


Link to post
14 minutes ago, Anders Melander said:

Generally speaking, the API of most Windows common controls goes through the windows message queue. So my guess is that for each item in the listbox there's a lot of messages going back and forth

Messages are used, yes (LB_ADDSTRING, etc).  But not the message queue, no.  The messages in question would be SENT directly to the ListBox's window procedure for immediate processing, not POSTED to the message queue of the thread that owns the ListBox awaiting dispatching to the ListBox's window procedure.

 

Edited by Remy Lebeau
  • Like 1

Share this post


Link to post
1 minute ago, Remy Lebeau said:

But not the message queue, no.

Yes, it was poorly expressed on my part. If they went through the message queue then the application wouldn't appear hung.

Share this post


Link to post

Im confused !

 

then the fastest way is ?

 

someone tell me to use tstringlist ?

Edited by clubreseau

Share this post


Link to post
17 hours ago, clubreseau said:

I use this to Load 10K of Items from listbox1 to Memo1 and it take 4 minut to see items in memo1 and the APP Freeze.

 

here my code


procedure TForm1.Button1(Sender: TObject);
var
i:integer;
begin
       Memo1.lines.BeginUpdate;
  try

   for i := 0 to ListBox1.Items.Count -1 do

       Memo1.Lines := ListBox1.Items;

      finally

        Memo1.lines.EndUpdate;

      end;
end;

 

 

the function do not delete all duplicate domain in listbox1 ?

function ExtractDomainName(const Url: string): string;
var
  FirstChar, LastChar, Len: Integer;
begin
  Result := '';
  FirstChar := Pos('://', Url);
  LastChar := Pos('/', Url, FirstChar + 3);
  if FirstChar = 0 then
    FirstChar := 1                  //  means no protocol there, so we adjust for the first char
  else
    FirstChar := FirstChar + 3;     //  to compensate the length of "://"
  if LastChar = 0 then
    LastChar := Length(Url) + 1;    //  adjust for the last char, add 1 for Len calculation
  if FirstChar >= LastChar then
    Exit;                           //  we have broken an empty URL or broken one
  Len := LastChar - FirstChar;
  SetLength(Result, Len);
  Move(Url[FirstChar], Result[1], Len * SizeOf(Char));
end;

 

Share this post


Link to post

Inserting lines one by one is a very expensive operation for memo. Also your 'All items at once' is wrong !

procedure TForm10.Button5Click(Sender: TObject);
begin
  Memo1.Lines := ListBox1.Items; // this just inserts item one by one => same as 'One item with loop' !
  // if you want to insert all items in a faster way than what you called 'One item with loop', you need to use text property or SetTextBuf (depending on your need):
  // - Memo1.Text := ListBox1.Items.Text;
  // - Memo1.SetTextBuf(PChar(ListBox1.Items.Text));
end;

 

  • Like 2

Share this post


Link to post

 I have a listbox with full or URL like

http://test.com/demo.php

https://test.com/demo.php

http://www.test.com/demo.php

 

 

if my string have somethink like https://test.com/test/test.php then do not add in listbox1 because test.com are in the listbox1.

maybe the solution is to remove http:// and https:// at the begining until first / and after compare to see.

 

someone have a solution to delete duplicate domain in my listbox1 ?

 

Share this post


Link to post
3 hours ago, clubreseau said:

someone have a solution to delete duplicate domain in my listbox1 ?

function ExtractDomain(const URL : String) : String;
var
    I, J : Integer;
begin
    I := Pos('://', URL);
    if I <= 0 then
        I := 1
    else
        Inc(I, 3);
    J := Pos('/', URL, I);
    if J <= 0 then begin
        Result := Copy(URL, I, MAXINT);
        Exit;
    end;
    Result := Copy(URL, I, J - I);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
    Index  : Integer;
    Dict   : TDictionary<String, Integer>;
    URL    : String;
    Domain : String;
    Value  : Integer;
begin
    Dict  := TDictionary<String, Integer>.Create(10000);
    try
        ListBox1.Items.BeginUpdate;
        try
            for Index := ListBox1.Items.Count - 1 downto 0 do begin
                URL := ListBox1.Items[Index];
                if URL = '' then begin
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                Domain := ExtractDomain(Trim(UpperCase(URL)));
                if Dict.TryGetValue(Domain, Value) then begin
                    // Domain already found, delete from ListBox
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                // Domain not seen before, add to dictionary and don't remove from list
                Dict.Add(Domain, 0);
            end;
        finally
            ListBox1.Items.EndUpdate;
        end;
    finally
        FreeAndNil(Dict);
    end;
end;

This will check for domain by ignoring character casing and ignoring the protocol.
You may adept this to a TStrings easily.

I used a dictionary because you said you have 10K items. A dictionary should be faster than a simple list when there are a lot of items but I have not checked how many items are required so that dictionary is faster than list.

Edited by FPiette
  • Thanks 1

Share this post


Link to post
16 hours ago, FPiette said:

function ExtractDomain(const URL : String) : String;
var
    I, J : Integer;
begin
    I := Pos('://', URL);
    if I <= 0 then
        I := 1
    else
        Inc(I, 3);
    J := Pos('/', URL, I);
    if J <= 0 then begin
        Result := Copy(URL, I, MAXINT);
        Exit;
    end;
    Result := Copy(URL, I, J - I);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
    Index  : Integer;
    Dict   : TDictionary<String, Integer>;
    URL    : String;
    Domain : String;
    Value  : Integer;
begin
    Dict  := TDictionary<String, Integer>.Create(10000);
    try
        ListBox1.Items.BeginUpdate;
        try
            for Index := ListBox1.Items.Count - 1 downto 0 do begin
                URL := ListBox1.Items[Index];
                if URL = '' then begin
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                Domain := ExtractDomain(Trim(UpperCase(URL)));
                if Dict.TryGetValue(Domain, Value) then begin
                    // Domain already found, delete from ListBox
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                // Domain not seen before, add to dictionary and don't remove from list
                Dict.Add(Domain, 0);
            end;
        finally
            ListBox1.Items.EndUpdate;
        end;
    finally
        FreeAndNil(Dict);
    end;
end;

This will check for domain by ignoring character casing and ignoring the protocol.
You may adept this to a TStrings easily.

I used a dictionary because you said you have 10K items. A dictionary should be faster than a simple list when there are a lot of items but I have not checked how many items are required so that dictionary is faster than list.

 

Thank you its work perfectly !

How to convert it, to look at a file called url.txt that contain URL, and delete all duplicate domains in my listbox1 ?

and like your code I want its look only domain root. If domain root are in Url.txt delete the entire URL in listbox1.

 

Thank you

Share this post


Link to post
5 hours ago, clubreseau said:

How to convert it, to look at a file called url.txt that contain URL, and delete all duplicate domains in my listbox1 ?

 

The idea is just the same as above and then read the url file line by line, extract the domain, check the dictionary. If already there, just read next line if not there, add the domain in the dictionary and add the url to the listbox. You should be able to do that easily yourself.

Quote

and like your code I want its look only domain root. If domain root are in Url.txt delete the entire URL in listbox1.

I don't understand what you call "domain root". Give an example.

Share this post


Link to post
2 hours ago, FPiette said:

 

The idea is just the same as above and then read the url file line by line, extract the domain, check the dictionary. If already there, just read next line if not there, add the domain in the dictionary and add the url to the listbox. You should be able to do that easily yourself.

I don't understand what you call "domain root". Give an example.

 

this is my code and not working

trying to loook into files and delete all duplicated URL in listbox1.


 

procedure TForm1.DELETEURLClick(Sender: TObject);
var
    Index  : Integer;
    Dict   : TDictionary<String, Integer>;
    URL    : String;
    Domain : String;
    Value  : Integer;
    sl : TStringList;
    ix : Integer;
begin
    sl := TStringList.Create;
    Dict  := TDictionary<String, Integer>.Create(100000);
     try
        ListBox1.Items.BeginUpdate;
        try
            for Index := ListBox1.Items.Count - 1 downto 0 do begin
                URL := ListBox1.Items[Index];
                sl.LoadFromFile('C:\Users\myname\Documents\Url.txt');
                ix := sl.IndexOf(URL);

                if URL = '' then begin
                    ListBox1.Items.Delete(Index);
                    continue;
                end;

                Domain := ExtractDomain(Trim(UpperCase(ix.ToString)));
                if Dict.TryGetValue(Domain, Value) then begin
                    // Domain already found, delete from ListBox
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                // Domain not seen before, add to dictionary and don't remove from list
                Dict.Add(Domain, 0);
            end;
        finally
            ListBox1.Items.EndUpdate;
        end;
    finally
        FreeAndNil(Dict);
        sl.Free;
    end;
end;





function ExtractDomain(const URL : String) : String;
var
    I, J : Integer;
begin
    I := Pos('://', URL);
    if I <= 0 then
        I := 1
    else
        Inc(I, 3);
    J := Pos('/', URL, I);
    if J <= 0 then begin
        Result := Copy(URL, I, MAXINT);
        Exit;
    end;
    Result := Copy(URL, I, J - I);
end;

 

thank you SIR !

Share this post


Link to post

You have to keep the code as I designed it to initialize the dictionary and clean - if needed - the list box content. Then, after the for/loop, without clearing the dictionary, add a second for/loop to scan the URL file as I explained in my previous message.

 

To read the URL file, of course you may use a string list or you may use TStreamReader or traditional file I/O to read the file line by line. The later will avoid loading the entire file in memory and only then scan for the lines.

Share this post


Link to post
8 hours ago, clubreseau said:

look at a file called url.txt that contain URL, and delete all duplicate domains in my listbox1 ?

What if an URL from url.txt is not in the ListBox? Simply ignore it or add it to the ListBox?

In other words, does url.txt contain unwanted URL or  does it contain URL to be added to the ListBox avoiding duplicates?

 

And you didn't answered my question about "root domain".

 

Share this post


Link to post

Here is code to add all URL from url.txt except those already existing in the ListBox and avoiding all duplicates.


 

function ExtractDomain(const URL : String) : String;
var
    I, J : Integer;
begin
    I := Pos('://', URL);
    if I <= 0 then
        I := 1
    else
        Inc(I, 3);
    J := Pos('/', URL, I);
    if J <= 0 then begin
        Result := Copy(URL, I, MAXINT);
        Exit;
    end;
    Result := Copy(URL, I, J - I);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
    Index   : Integer;
    Dict    : TDictionary<String, Integer>;
    URL     : String;
    Domain  : String;
    Value   : Integer;
    UrlFile : TStreamReader;
begin
    Dict  := TDictionary<String, Integer>.Create(10000);
    try
        ListBox1.Items.BeginUpdate;
        try
            for Index := ListBox1.Items.Count - 1 downto 0 do begin
                URL := Trim(ListBox1.Items[Index]);
                if URL = '' then begin
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                Domain := ExtractDomain(UpperCase(URL));
                if Dict.TryGetValue(Domain, Value) then begin
                    // Domain already found, delete from ListBox
                    ListBox1.Items.Delete(Index);
                    continue;
                end;
                // Domain not seen before, add to dictionary and don't remove
                Dict.Add(Domain, 0);
            end;
            // Now process url.txt file to add to the ListBox all URL found in
            // it, avoiding to add duplicates
            UrlFile := TStreamReader.Create('url.txt');
            try
                while not UrlFile.EndOfStream do begin
                    URL := Trim(UrlFile.ReadLine);
                    if URL = '' then
                        continue;
                    Domain := ExtractDomain(UpperCase(URL));
                    if Dict.TryGetValue(Domain, Value) then
                        // Domain already found, ignore it
                        continue;
                    // Domain not seen before, add to dictionary
                    Dict.Add(Domain, 0);
                    // and add the URL to the ListBox
                    ListBox1.Items.Add(URL);
                end;
            finally
                FreeAndNil(UrlFile);
            end;
        finally
            ListBox1.Items.EndUpdate;
        end;
    finally
        FreeAndNil(Dict);
    end;
end;

 

  • Like 1

Share this post


Link to post

May suggest to look at my last demo, i produced a list mask where duplicated domains are replaced with empty strings, i think it should fit your need, thanks to Francois and his code, you can use both of them.

 

The code is simple and short and support what you called root domains or subs, but i didn't use case insensitive comparison.

Share this post


Link to post
52 minutes ago, Kas Ob. said:

 what you called root domains or subs

I still do see what he call "root domain".

What are the root domains of "www.company.co.uk", "www.ulg.ac.be" and "www.company.com"?

Share this post


Link to post
5 hours ago, FPiette said:

I still do see what he call "root domain".

What are the root domains of "www.company.co.uk", "www.ulg.ac.be" and "www.company.com"?

domain = http://welcome.com/hi.html  the root or this domain is welcome.com  remove http:// until / 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×