Jump to content

Recommended Posts

I have a very simple text search feature that uses Pos() to locate search text in string.

Now I want to extend it to handle just a little more complex phrases. I think the most complex examples I need to implement includes:

- brackets

- quotes for full phrase

- and AND/OR

 

Probably most complex example: ("test phrase" OR testing) AND text OR "anything else"

I assume most common will be simple examples like:

test OR test2

test OR "test case"

test AND test2

 

I have all text in arrays with object IDs to know where it belongs, records like: CustomerID, Address, Name, Notes;  DocID, Title, Description, Notes...  - and all fields except CustomerID, DocID are searchable strings.

 

One idea is to try to convert search expression into regex, the other one if to parse the expression and try to make logical search as needed (this would probably be most complex to implement).

 

I see examples of using database and use their advanced search capabilities like: https://zarko-gajic.iz.hr/full-text-search-functionality-in-delphi-applications-implementation-idea/

But this is just one small feature of my full project, so I'm not really sure I want this to be reason to implement database, even the lightest ones.

 

I appreciate any suggestions on parsing such expression, converting to regex.. perhaps there are components that already do same or similar things.

 

 

 

 

 

Share this post


Link to post
Guest

I am not sure i do understand your quest completely, but will guess a little !

 

First, the records you listed look like a DB columns, so, is SQL approach is on the table ? 

By using SQL you do need nothing else, SQL in fact built to such search and much more complex syntaxes, but again i am guessing here, and to add to this, to my knowledge the most comprehensive SQL parser for pascal is the one by NexusDB, there is one file called nxSQLParse.pas and it is freaking scary !

Also there is many Pascal/Delphi scripters engine that do similar things, most of them are insightful, From the powerful TMS Scripter, to RTC, or Delphi HTML, ... the list is long and browsing the code is valuable knowledge.

 

But again, guessing you did not miss the SQL choice, so let me explain how i did many similar filtering on big data with complex syntaxes, although in my case it wasn't text at all.

I will try to explain with an example from your sample syntax, and hoping you will get the idea.

 

Your syntax is ("test phrase" OR testing) AND text OR "anything else"

First lets dissect syntax into pieces, we will solve this by the most simple and least complex way, divide and conquer !

1) Separate by the logic operators, means we will be left with a list of searchable text 

"test phrase"

testing

text

"anything else"

2) Build logic flow accordingly to brackets, here a little knowledge in how mathematical expression being parsed will be a great help, see we brackets and logic operators are similar to brackets, +, -, *,/ , after building the flow which almost modified version of the list in (1) with which one comes first then applied to which, there is many Pascal/Delphi mathematical expression parsing code out there.

3) Search for each item in (1) and build a new list (an array) for it, yes this seem expensive, but let it be for now.

4) Apply the Flow you built in (2) on the lists generated from (3), with each step you will be building a new list, this might be easier from adding or removing from/into (aka changing) a list, the last list is the search result.

 

Simple, Right ? 🙂

 

Later, and after making this work, you can go after eliminating unneeded lists, like the case with ("test phrase" OR testing), these can be handled as one list, but again make it work first in the simplest way you can, only after that you can refactor and redesign it with more complex but faster and less memory usage.

 

May be others will suggest different approaches, but now this what i can thing of, which i see can fit your need, and it will be elegant.

Share this post


Link to post

Thank you @Kas Ob., yes SQL is not on the table.. unless all other options are too complex to implement, but I doubt it.

 

I was thinking similar thing as you to parse expression and build some logical expression out of it and then run on text. But my idea was (just an idea so far) to run a full logical expression on each string, while your suggestion is to create lists based on parts of expression and then combine/cross-sect lists as operators say, right?

So, my idea was to parse expression to end up with something like:

 

this is as stage 3) in your example

function EvalPhraseInText(const aText: string): boolean;
begin
  Result := 
    ((Pos('text_phrase', aText)>0) OR (Pos('testing', aText) > 0 ))// replace 
    AND Pos('text', aString) > 0
    OR (Pos('anything else', aString)> 0;
end;

of course, I have no idea how to get to this point, at the moment, as this would be point 2) build logic flow.

 

I assume you thought of this approach when you were working on your solution and decided that is not as good as your solution.

 

 

 

 

 

Share this post


Link to post
27 minutes ago, Kas Ob. said:

the most comprehensive SQL parser for pascal is the one by NexusDB, there is one file called nxSQLParse.pas and it is freaking scary !

HTML Library Bundle contains SQL framework with SQL parser for Firebird, Oracle, MySQL, Postgres, Microsof SQL, Elevate and SQLite. Library supports query parsing, formatting, translation between SQL dialects, modifying (change order, where, limit, fields, etc) syntax check and DB schema check, editor completion implementation and much more. https://delphihtmlcomponents.com/SQLLibrary.pdf

  • Like 1

Share this post


Link to post
Guest
41 minutes ago, Mike Torrettinni said:

But my idea was (just an idea so far) to run a full logical expression on each string, while your suggestion is to create lists based on parts of expression and then combine/cross-sect lists as operators say, right?

 

45 minutes ago, Mike Torrettinni said:

I assume you thought of this approach when you were working on your solution and decided that is not as good as your solution.

Well, in my many written filters and search expression were mostly math expression to isolate ranges, means i was for some criteria's with defined values while including/excluding ranges (>,<, <= ...) so lists was essential here as they some search statements explicitly required medians ( also deviations...) like values from ranges, so building lists was the fastest approach.

 

Also there is cache usage which in some cases i had to choose this based on my benchmarks, anyway i don't think extreme speed is essential in your case because you need result and that what you care about, were in my case i want to repeat this many times and render many details visually, so 50ms is my upper tolerated limit, when i have millions of records.

 

59 minutes ago, Mike Torrettinni said:

this is as stage 3) in your example

No !, far from that, this is what you is the only approach, 

  function Search(const Text:string; const List:TSomeList):TSomeList;
  ...


  TLogicOperation = (loAND, loOR, loXOR);

  TFilter = class
  protected
    //FLogicOperation: TLogicOperation;
    function ApplyFilter(const List1, List2: TSomeList): TSomeList; virtual; abstract;
  public
    //constructor Create(Operation: TLogicOperation = loAND);
    function Apply(const List1, List2: TSomeList): TSomeList;
    //property FilterOperation: TLogicOperation read FLogicOperation write FLogicOperation;
  end;

  TFilterOR = class(TFilter)
  protected
    function ApplyFilter(const List1, List2:TSomeList): TSomeList; override;
  end;

That is a messy salad of two or three approaches, also a helper will do !

ApplyFilter will do one thing, the real code can be very short and accurate, consider it as food for thoughts.

And there should be a list of the filters and their sequence built by your parser, just don't over think it, in the imaginary code you will feed the result from ApplyFilter to another ApplyFilter as parameter.

 

Anyway ,

48 minutes ago, Mike Torrettinni said:

of course, I have no idea how to get to this point, at the moment, as this would be point 2) build logic flow.

Well, you know how to do it and it will be clear if you give it some time, just think outside the box, and relax your assumptions, again there is many approaches to handle it your way, but the questions is how to build chain of action at runtime, you know how to it, right?

only the flow i mentioned it will be simple as TAnd and TOr where these are any creatures you like, record, class or even raw list of actions built with your parser, may be each action will have pointer to the next action, not much explained in this way, but try to think about it and all the pieces will fall in place, (trust me it is easy)

 

I can't remember if it was you or not, who asked about JIT in this forum, see when you finished resolving this, you already built a JIT !, this is off-topic but extreme JIT will replace these line of logic with may be assembly optimized code and run them.

Share this post


Link to post
Guest
1 hour ago, Alexander Sviridenkov said:

HTML Library Bundle contains SQL framework with SQL parser

Well, i don't have the Bundle License, it is nice to know.

 

Also i want to point to the fact that the page for "HTML Report Library" is mentioning script engine and "designed to generate reports using databases and XML"

Quote

Build-in powerful scripting and expression evaluation engine allow to implement complex data processing algorithms and encapsulate all logic inside report.

But doesn't mention SQL explicitly.

Share this post


Link to post
Posted (edited)
1 hour ago, Kas Ob. said:

Also there is cache usage which in some cases i had to choose this based on my benchmarks, anyway i don't think extreme speed is essential in your case because you need result and that what you care about, were in my case i want to repeat this many times and render many details visually, so 50ms is my upper tolerated limit, when i have millions of records.

It's triggered on user input, so no need for super fast. I do have 'instant search' (search as you type), but this is not applicable for complex cases or if you type really fast,

I guess under 1s should be just fine. 

 

Oh, I do have highlighting of text for visual representation, but that only works on a found results, so doesn't need to be super fast, either.

 

Thanks, you gave me a lot to think about!

Edited by Mike Torrettinni

Share this post


Link to post
4 hours ago, Bob Devine said:

+1 for HTML Library

Yes, it's very powerful component, shame it can't parse expressions that are not pure SQL. I could probably customize my expression to add 'SELECT' and other sql keywords to parse, but that is not what I'm after.

 

Share this post


Link to post
15 minutes ago, Mike Torrettinni said:

Yes, it's very powerful component, shame it can't parse expressions that are not pure SQL. I could probably customize my expression to add 'SELECT' and other sql keywords to parse, but that is not what I'm after.

 

 

It can. Just override GetDialect method (uses sqlparse)

 

type
  TCustomExpression = class(TSQLExpression)
  private
    FDialect: TSQLDialect;
  public
    function GetDialect: TSQLDialect; override;
    destructor Destroy; override;
  end;

destructor TCustomExpression.Destroy;
begin
  FDialect.Free;
  inherited;
end;

function TCustomExpression.GetDialect: TSQLDialect;
begin
  if  FDialect = nil then
    FDialect := TSQLDialect.Create;
  Result := FDialect;
end;

procedure TForm67.Button1Click(Sender: TObject);
var SE: TCustomExpression;
begin
  SE := TCustomExpression.Create(nil);
  SE.ParseString('user=''john'' or user = ''peter''');
  ShowMessage(SE.Node.Children[0].Children[1].Name); //=john
end;

 

Also there is pascal expression class in htscriptparse unit:

 

var SE: TScriptExpression;
begin
  SE := TScriptExpression.CreateAndParse('(user=''john'') or (user = ''peter'')');
  SE.Variables['user'] := 'peter';
  ShowMessage(SE.Calc)

 

  • Thanks 1

Share this post


Link to post
Posted (edited)

There are plenty of open-source parsers available.

- ZeosDBO has powerful engine in \src\core\ (units ZExpression.pas, ZExprParser.pas, ZExprToken.pas, ZTokenizer.pas), you could take any part of it and customize to your needs

- IBExpress from std lib has RAD\source\IBX\IBX.IBScript.pas with the parser as well, just as FireDAC in RAD\source\data\firedac\FireDAC.Stan.Expr.pas for expressions and FireDAC.Comp.Script.pas for scripts

- Template engine from DVDChief also has expression parser

- and so on

Moreover, with such simple tokens you can easily write it yourself

 

Btw, I don't realize what "testing" and "text" are in your example. Variables? Or just single words to search for, with quotes removed for simplicity (a la Google)?

That's a nice task to solve, huh. Depending on maximum nesting level allowed it could require some discrete maths.

Edited by Fr0sT.Brutal
  • Like 1
  • Thanks 1

Share this post


Link to post
1 hour ago, Fr0sT.Brutal said:

what "testing" and "text" are in your example

 

User input search string, yes like google. I mean simple edit box.

No nested levels at this point.

 

Thanks for suggestions!

Share this post


Link to post
13 hours ago, Mike Torrettinni said:

User input search string, yes like google. I mean simple edit box.

Then SQL parsers won't help you much or you'll have to manually mix what they consider variables and string literals into one category.

Share this post


Link to post
3 hours ago, Fr0sT.Brutal said:

Then SQL parsers won't help you much or you'll have to manually mix what they consider variables and string literals into one category.

Yes, I think you are right. Probably I can make it work with a lot of modification to fit within case like this:

 

19 hours ago, Alexander Sviridenkov said:

SE.ParseString('user=''john'' or user = ''peter''');

I think this would require quite a modification of expression from: 'john or peter' to 'user = john or user = peter'.

I assume to actually replace Pos(), would need to use LIKE (or CHARINDEX)

SE.ParseString('user LIKE ''%john%'' or user LIKE ''%peter%''');

 

Share this post


Link to post
2 hours ago, Mike Torrettinni said:

I think this would require quite a modification of expression from: 'john or peter' to 'user = john or user = peter'.

Just don't forget to foresee cases when someone would wish to search for "or" (as a word) 🙂

Minimalistic free-form searches are cool (imagine if Google required us to write "word='foo' or word='bar" instead of "foo bar") but slightly more complex to implement with all edge cases. That's why Google uses specific symbols for logical operations

Share this post


Link to post
16 minutes ago, Fr0sT.Brutal said:

Just don't forget to foresee cases when someone would wish to search for "or" (as a word) 🙂

Minimalistic free-form searches are cool (imagine if Google required us to write "word='foo' or word='bar" instead of "foo bar") but slightly more complex to implement with all edge cases. That's why Google uses specific symbols for logical operations

I think I could instruct to use double-quotes for or/and and other operators, like it's used for phrases: "or customer" or "and customer" or "or" or "and" ....

Share this post


Link to post
Posted (edited)
On 7/27/2021 at 12:52 PM, Bob Devine said:

+1 for HTML Library

I confirm that HTMLLibrary is an awesome HTML library. Congratulation Alexander 😉

 

But for the SQL parser of HTML library I don't find how to transform this :

Peter or "Bibi Paulo" and John

 

to a where clause to search on a field :
 

select xxx where
Name containing 'Peter' or Name containing 'Bibi Paulo' and Name containing 'John'

 

Edited by FabDev

Share this post


Link to post
Posted (edited)

I felt like playing with this idea for Delphi - as like writing parsers. 😉 Years ago I did something similar for a document management system in a different language. I've written a small demo illustrating 2 ways of parsing https://github.com/sempare/sempare-delphi-fts-expression

 

I did this for fun and in a few hours, so it is still very raw. It contains a hand rolled lexer and parser, but hopefully can give you some ideas how to do this too.

 

In the first (Sempare.FullTextSearch.SimpleExpression.Test.pas) you can see some scenarios:

procedure TSimpleExprTest.TestSimpleExpr;
begin
  assert.AreEqual('name = ''conrad''', tparser.ParseStr('name', '"conrad"'));
end;

procedure TSimpleExprTest.TestWildCard2Expr;
begin
  assert.AreEqual('(name like ''conrad%'') or (name like ''%pete'')', tparser.ParseStr('name', '"conrad%" "%pete"'));
end;

procedure TSimpleExprTest.TestWildCardExpr;
begin
  assert.AreEqual('name like ''conrad%''', tparser.ParseStr('name', '"conrad%"'));

end;

procedure TSimpleExprTest.TestNotExpr;
begin
  assert.AreEqual('not (name = ''conrad'')', tparser.ParseStr('name', 'not "conrad"'));
end;

procedure TSimpleExprTest.TestOrExpr;
begin
  assert.AreEqual('(name = ''abc'') or (name = ''def'')', tparser.ParseStr('name', '"abc" or "def"'));
end;

procedure TSimpleExprTest.TestPrecidentExpr;
begin
  assert.AreEqual('(name = ''abc'') and (not ((name = ''321'') or (name = ''def'')))', tparser.ParseStr('name', '''abc'' and  not ''321'' or  ''def'''));
end;

 

The second (Sempare.FullTextSearch.Expression.Test.pas) is a bit more general. It illustrates parsing, but implementation at present is fairly basic as just a pretty printer:

procedure TExprTest.TestNotExpr;
begin
  assert.Areequal('not (name != ''conrad'')', tparser.ParseStr('not (name!="conrad")'));
end;

procedure TExprTest.TestOrExpr;
begin
  assert.Areequal('(name = ''abc'') or (text = ''def'')', tparser.ParseStr('name = ''abc'' or text = ''def'''));
end;

procedure TExprTest.TestPrecidentExpr;
begin
  assert.Areequal('((name = ''abc'') and (value = ''321'')) or (text = ''def'')', //
    tparser.ParseStr('name = ''abc'' and value = ''321'' or text = ''def'''));
end;

procedure TExprTest.TestStartsWithExpr;
begin
  assert.Areequal('name like ''abc%''', tparser.ParseStr('name starts with "abc"'));
end;

above you can see that the parsing identifies the precedence and places brackets appropriately. Also it changed 'starts with' to 'like'...

 

NOTE: there are 2 different parsers implemented for the above scenarios (but both named TParser - just depends which unit is being referenced)

 

Hope you find this useful.

Edited by darnocian
  • Like 2

Share this post


Link to post

I really suggest to use a library that does the work for you.

To name at least one: Lucene 

 

If you want to use some server software, use elastic search or the build in fulltext catalogs in sql databases.

 

Share this post


Link to post
7 minutes ago, generic said:

I really suggest to use a library that does the work for you.

Why would you suggest that?

 

10 minutes ago, generic said:

Lucene 

 

I would prefer Delphi solution, do you know of Lucene in Delphi?

Share this post


Link to post

Would be nice to have a common parser what you posted in your startpost, something with a base class which does the basic parsing of the query, and which can be extended in execution functionality .

Share this post


Link to post
14 minutes ago, mvanrijnen said:

Would be nice to have a common parser what you posted in your startpost, something with a base class which does the basic parsing of the query, and which can be extended in execution functionality .

Will post when I make progress.

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×