Jump to content
Sign in to follow this  
Mike Torrettinni

Analyze strings for common parts

Recommended Posts

Posted (edited)

I'm analyzing a log file full of sql scripts of I'm trying to come up with a some sort of summary view that would show what kind of scripts are in the log.

 

A simple example:

select fname, lname, city, zip, taxcategory, taxowed from taxcustomers where taxcategory = 'TXX-01' and taxowed>0

select fname, lname, city, zip, taxcategory, taxowed from taxcustomers where taxcategory = 'NL-xxxx' and taxowed>500

 

and 100s+ of same scripts with different values in taxcategory and taxowed fields. So, I would like to end with something like this:

SQL script: select fname, lname, city, zip, taxcategory, taxowed from taxcustomers where taxcategory = '?' and taxowed>?

Column: taxcategory, values: ['TXX-01', 'NL-xxxx' , ... ]

Column: taxowed, values: [0, 500 , ... ]

 

Any suggestions how to approach this? Any such string utilities already exist, commercial perhaps?

Edited by Mike Torrettinni

Share this post


Link to post

You are looking for regular expressions. I would like to point you to Delphis documentation, but it has been offline for weeks now. Everything you need is in the unit System.RegularExpressions.

Share this post


Link to post
14 minutes ago, Der schöne Günther said:

You are looking for regular expressions. I would like to point you to Delphis documentation, but it has been offline for weeks now. Everything you need is in the unit System.RegularExpressions.

Why do you suggest regular expression? I don't know what the scripts will be in the log, so I can't use TRegEx.Matches, if that's what you are hinting at. Or am I wrong?

Share this post


Link to post
Guest

A programmer has a problem... regexp... you know.

There is this one unit one out there called a "diff engine".

I think it is incorporated (and slightly amended) in mORMot (a FLOSS huge thingy you can cherry pick from).

The diff engine makes a hash of your strings and outputs the diff.

If you are interested, reply and i will supply a link when @office. Or simply goooogle...

Share this post


Link to post
Posted (edited)
1 hour ago, Dany Marmur said:

The diff engine makes a hash of your strings and outputs the diff. 

I see, this could be interesting.

 

I assume you are referring to this, TDiff/TextDiff at https://github.com/rickard67/TextDiff

 

Edited by Mike Torrettinni
correct link

Share this post


Link to post
Guest
1 hour ago, Mike Torrettinni said:

TDiffEngine

Yes, i think so. But i'll need to check @office while not being under influence.

Check mORMot repo and see if you find a similar, slightly amended/changed unit.

In that case you have it.

Share this post


Link to post

I'm afraid what you want couldn't be done exactly via string routines. In the best case you could detect values but you want field names as well. Likely you'd better go SQL parsing way

Share this post


Link to post
2 hours ago, Fr0sT.Brutal said:

I'm afraid what you want couldn't be done exactly via string routines. In the best case you could detect values but you want field names as well. Likely you'd better go SQL parsing way

I will see how far I get with string diff. Perhaps I was too ambitious about getting each column and it's possible values, and just showing common string part and then list of differences, could be useful enough.

 

SQL parsing will get me a lot of info, but will be hard to reconstruct the matching script part. Perhaps a combination of both approaches, but this is a lot of testing what will work for useful outcome.

 

Share this post


Link to post

A lot  of SQL is executed with parameters which would be a much easier to use form of the data than logs containing SQL with values inline. It also reduces clutter as static values show up as such vs looking the same as a parameter that might change. Where are these logs coming from? Can they be updated to log SQL + parameter value list?  

Share this post


Link to post
18 minutes ago, Brian Evans said:

A lot  of SQL is executed with parameters which would be a much easier to use form of the data than logs containing SQL with values inline. It also reduces clutter as static values show up as such vs looking the same as a parameter that might change. Where are these logs coming from? Can they be updated to log SQL + parameter value list?  

They can't control the content of the log. It's some type of an audit log that sometimes contain info on critical accounts, errors... and I offered to provide a viewer that would make sense and be useful to them.

It depends on date range, but there could be single or multiple log files, each containing from 10s to 1000s of scripts. So, I need to parse and reduce the scripts to manageable and viewable form.

Share this post


Link to post
Guest

If you're only task is strictly to provide a diff viewer, buy them a licence of BC and charge all your hours!!

Share this post


Link to post
2 minutes ago, Dany Marmur said:

If you're only task is strictly to provide a diff viewer, buy them a licence of BC and charge all your hours!!

Log viewer, not diff viewer. I wish it was this simple 🙂

Share this post


Link to post
Guest

Ah, some special logfile parsing. Would not be the first time in history.

Share this post


Link to post
On 3/3/2022 at 4:40 PM, Mike Torrettinni said:

Any such string utilities already exist, commercial perhaps?

Maybe you could hire a programmer to make that, what is your budget?

Share this post


Link to post

Actually it seems I could use regex at the end. I managed to find diff between 2 scripts and get results like this:

Common[0]  = select fname, lname, city, zip, taxcategory, taxowed from taxcustomers where taxcategory = '
Common[1]  = ' and taxowed>

then I can define regex like this:

 

(?<=Common[0]).*?(?=Common[1])

and will extract all taxcategory values from all scripts, so:

 

['TXX-01', 'NL-xxxx' , ... ]

 

I just need to then extract after Common[1], but I don't have correct regex script, yet.

 

But using regex gets a lot more complicated with more than 2 Common parts in same script.

 

Share this post


Link to post

Play with this unit, it seems to do what you want, at least with the kind of input you showed.

 

unit SqlLogParserU;

interface

uses
  System.Classes, System.Generics.Collections;

type
  TStatementParts = record
    Fieldlist, FromList, WhereClause, OrderAndGrouping: string;
  end;

  TColumnDigest = class(TObject)
  strict private
    FName: string;
    FValues: TStringList;
  public
    constructor Create(const AName: string);
    destructor Destroy; override;
    property Name: string read FName;
    property Values: TStringList read FValues;
  end;

  TColumnDigestList = class(TObjectDictionary<string, TColumnDigest>)
  end;

  TSqlStatementDigest = class(TObject)
  strict private
  const
    CRegEx = '([_\d\w]+)\s*([=<>]+)\s*(''.+''|[\d.+\-]+)';
  var
    FColumns: TColumnDigestList;
    FOccurences: Integer;
    FParts: TStatementParts;
    function GetColumnlist: string;
    function GetStatement: string;
    procedure NormalizeWhereClause;
  strict protected
    function GetReportItem: string;
    property Columns: TColumnDigestList read FColumns;
  public
    constructor Create(const AParts: TStatementParts);
    destructor Destroy; override;
    procedure AddColumnValue(const aColumn, aValue: string);
    procedure AnalyzeWhereclause(const aClause: string);
    procedure IncOccurences;
    property Occurences: Integer read FOccurences;
    property ReportItem: string read GetReportItem;
  end;

  TDigestList = class(TObjectDictionary<string, TSqlStatementDigest>)
  end;

  TLogSqlParser = class(TObject)
  strict private
    FDigestList: TDigestList;
  strict protected
    procedure AnalyzeAndLogStatement(const aParts: TStatementParts);
    procedure DigestStatement(const aLine: string);
    procedure DissectStatement(const aLine: string; var aParts: TStatementParts);
    procedure HandleOrderAndGrouping(var aParts: TStatementParts);
    function IsSelectStatement(const aLine: string): Boolean;
    function GetReport: string;
    property DigestList: TDigestList read FDigestList;
  public
    constructor Create; reintroduce; virtual;
    destructor Destroy; override;
    procedure Analyze(const aLogText: string); overload;
    procedure Analyze(aText: TStrings); overload;
    property Report: string read GetReport;
  end;


implementation

uses
  Sysutils, System.RegularExpressions;

{== TLogSqlParser =====================================================}

constructor TLogSqlParser.Create;
begin
  inherited Create;
  FDigestList := TDigestList.Create([doOwnsValues]);
end;

destructor TLogSqlParser.Destroy;
begin
  FDigestList.Free;
  inherited Destroy;
end;

procedure TLogSqlParser.Analyze(const aLogText: string);
var
  LText: TStringList;
begin
  LText := TStringList.Create();
  try
    LText.Text := aLogText;
    Analyze(LText);
  finally
    LText.Free;
  end;
end;

procedure TLogSqlParser.Analyze(aText: TStrings);
var
  I: Integer;
begin
  DigestList.Clear;
  for I := 0 to aText.Count-1 do
    DigestStatement(aText[I]);
end;

procedure TLogSqlParser.AnalyzeAndLogStatement(const aParts: TStatementParts);
var
  LDigest: TSqlStatementDigest;
begin
  if not Digestlist.TryGetValue(aParts.Fieldlist, LDigest) then begin
    LDigest := TSqlStatementDigest.Create(aParts);
    DigestList.Add(aParts.Fieldlist, LDigest);
  end;
  LDigest.IncOccurences;
  LDigest.AnalyzeWhereclause(aParts.WhereClause);
end;

procedure TLogSqlParser.DigestStatement(const aLine: string);
var
  LParts: TStatementParts;
begin
  if IsSelectStatement(aLine) then begin
    DissectStatement(aLine.Trim.ToLower, LParts);
    if not LParts.Fieldlist.IsEmpty then
      AnalyzeAndLogStatement(LParts);
  end; {if}
end;

procedure TLogSqlParser.DissectStatement(const aLine: string; var aParts:
    TStatementParts);
const
  CRegEx =
    'select (.*) from (.*) where (.*)';
var
  LMatch: TMatch;
begin
  LMatch := TRegEx.Match(aLine, CRegEx, [roSingleLine]);
  if LMatch.Success then begin
    aParts.Fieldlist := LMatch.Groups[1].Value;
    aParts.FromList := LMatch.Groups[2].Value;
    aParts.WhereClause := LMatch.Groups[3].Value;
    HandleOrderAndGrouping(aParts);
  end {if}
  else
    Finalize(aParts);
end;

function TLogSqlParser.GetReport: string;
var
  LReport: TStringList;
  I: Integer;
  LDigest: TSqlStatementDigest;
begin
  LReport := TStringList.Create();
  for LDigest in DigestList.Values do begin
    LReport.Add(LDigest.ReportItem);
  end; {for}

  Result := LReport.Text;
end;

procedure TLogSqlParser.HandleOrderAndGrouping(var aParts: TStatementParts);
const
  CGroupBy = ' group by ';
  COrderBy = ' order by ';
  // SQL requires grouping before ordering!
  CBoundaries: array [0..1] of string = (CGroupBy, COrderBy);
var
  I: Integer;
  LParts: TArray<string>;
  S: string;
begin
  S:= aParts.WhereClause;
  aParts.OrderAndGrouping := string.empty;
  for I := Low(CBoundaries) to High(CBoundaries) do
    if S.Contains(CBoundaries[I]) then begin
       LParts := S.Split([CBoundaries[I]]);
       aParts.WhereClause := LParts[0];
       aParts.OrderAndGrouping := CBoundaries[I] + LParts[1];
       Break;
    end; {if}
end;

function TLogSqlParser.IsSelectStatement(const aLine: string): Boolean;
begin
  Result := aLine.Trim.StartsWith('select ', true);
end;

{== TSqlStatementDigest ===============================================}

constructor TSqlStatementDigest.Create(const AParts: TStatementParts);
begin
  inherited Create;
  FParts := AParts;
  NormalizeWhereClause;
  FColumns := TColumnDigestList.Create([doOwnsValues]);
end;

destructor TSqlStatementDigest.Destroy;
begin
  FColumns.Free;
  inherited Destroy;
end;

procedure TSqlStatementDigest.AddColumnValue(const aColumn, aValue: string);
var
  LColumn: TColumnDigest;
begin
  if not Columns.TryGetValue(aColumn, LColumn) then begin
    LColumn := TColumnDigest.Create(aColumn);
    Columns.Add(aColumn, LColumn);
  end;
  LColumn.Values.Add(aValue);
end;

procedure TSqlStatementDigest.AnalyzeWhereclause(const aClause: string);
var
  LMatch: TMatch;
begin
  LMatch := TRegEx.Match(aClause, CRegEx);
  while LMatch.Success do begin
     AddColumnValue(LMatch.Groups[1].Value, LMatch.Groups[3].Value);
     LMatch := LMatch.NextMatch;
  end; {while}
end;

function TSqlStatementDigest.GetColumnlist: string;
var
  LColumn: TColumnDigest;
  LText: TStringList;
begin
  LText := TStringList.Create();
  for LColumn in Columns.Values  do
    LText.Add(
      Format('  Column: %s, values: [%s]',[LColumn.Name, LColumn.Values.CommaText]));
  Result := LText.Text;
end;

function TSqlStatementDigest.GetReportItem: string;
const
  CMask = 'Sql statement: %s'+SLineBreak+
    '  occurrence: %d'+SLineBreak+
    '%s';
begin
  Result := Format(CMask,[GetStatement, Occurences, GetColumnlist]);
end;

function TSqlStatementDigest.GetStatement: string;
begin
  Result := Format('select %s from %s where %s %s',
    [FParts.Fieldlist, FParts.FromList, FParts.WhereClause, FParts.OrderAndGrouping]);
end;

procedure TSqlStatementDigest.IncOccurences;
begin
    Inc(FOccurences);
end;

procedure TSqlStatementDigest.NormalizeWhereClause;
const
  CSpace = ' ';
  CSingleQuote = '''';
var
  LMatch: TMatch;
  S: string;
  LBuilder: TStringbuilder;
  LParts: TArray<string>;
  I: Integer;
begin
  S := FParts.WhereClause;
  LMatch := TRegEx.Match(S, CRegEx);
  if LMatch.Success then begin
    LBuilder:= TStringbuilder.Create;
    try
      while LMatch.Success do begin
        for I := 1 to 3 do begin
          {1: column name, 2: operator, 3: value}
          LParts := S.Split([LMatch.Groups[I].Value], 2);
          if I < 3 then begin
            LBuilder.Append(LParts[0]);
            LBuilder.Append(LMatch.Groups[I].Value);
          end {if}
          else begin
            if LParts[0].Contains(CSingleQuote) then
              LBuilder.Append(CSingleQuote + '?' + CSingleQuote)
            else
              LBuilder.Append('?');
          end; {else}
          LBuilder.Append(CSpace);
          S:= LParts[1];
        end; {for }

        LMatch := LMatch.NextMatch;
      end;
      FParts.WhereClause := LBuilder.ToString;
    finally
      LBuilder.Free;
    end;
  end; {if}
end;

{== TColumnDigest =====================================================}

constructor TColumnDigest.Create(const AName: string);
begin
  inherited Create;
  FName := AName;
  FValues := TStringList.Create;
  FValues.Duplicates := dupIgnore;
  FValues.Sorted := true;
end;

destructor TColumnDigest.Destroy;
begin
  FValues.Free;
  inherited Destroy;
end;

end.

 

SqlLogParserU.pas

  • Thanks 1

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

×