Le-tools.com Forum Contact
XL-Parser - Documentation

First start

When you start XL-Parser for the first time, you are asked if you want to set default configuration. If you click Yes, you will be asked to select a directory where:

A XL-Parser.ini configuration file will also be created to keep your preferences in memory. You can also set elements above manually, in case you already have some of these databases.

Settings

General

Settings - General

In Tool section, we have the following functions and options:

  • Export Lang.ini: Use this function to translate XL-Parser GUI. See Translation for help about this functionality.
  • Open user dir: This is the directory where settings are saved. Copy the content of this directory if you want to backup your settings.
  • Check Update: Check on le-tools.com if a tool update is available.
  • Check for update at startup: When XL-Parser starts, check on website for available update of the tool.

In Functions section, you have the following options:

  • Choose to start XL-Parser in Full Screen;
  • Choose to Remember position of the GUI (usefull if you have multiple monitors);
  • GeoIP language: The GeoIP2 database from Maxmind provides data in different language. You can set your preferred language here.
  • Set the NsLookup timeout. Default is 10 seconds.
  • Set the Local timezone. Default is America/New_York.
  • Set the Default language. Default is en-US.

Input section

Input section

There are four types of input: folder, file (or list of files), database and clipboard. Each type has its own options:

File Filters Window

File Filters window

When you select a Folder as input, you can use filters to restrict the scope of the search to files that match your criteria. You can add, edit (by double clicking a filter in the grid) or delete filter. You can also save your filters for a later use. File Filters can be shared between XL-FileTools and XL-Parser. When you click on the add button, the following window will show up:

Select File Filters window

You have four types of file filters:

Contains

File Filters window - 
	  Contains

By default, the keyword or regular expression applies to the relative path (folder or file, but excluding input directory). You can select Filename only option to exclude folders. If you select Regex option, a colored border around the textfield will indicate if your regex is ok. If not, you can press the ? button to see the error message.

File Formats window

File Formats window

XL-Parser can read many file formats including: text, unicode, Doc and Docx, Xls and Xlsx, Evt and Evtx, Pdf, Msg and Zip (text files inside the archive). A cool feature here is that you can set the file extension to apply to the File format parser. For example, you can search a docx file as it was a zip file by selecting Zip and adding .docx in the extension list. This will allows you to search in metadata of the docx file.

When you select Doc or Docx format, you can choose to parse the file with MS Word (if installed on your system) or not. The legacy format (doc) will be search as unicode as the new format (docx) will be searched as zip.

For unicode format, you must select between two types: HTML and Other. With HTML, XL-Parser will search the charset in the HTML file header. With other, it will try to determine the appropriate encoding by analysing the file header.

Any change to extension list or options are saved in a file (name: FileFormats.json) in the user directory. The Reset button restore options to default values.

Important: XL-Parser is not a searching tool which mean that you should not check all formats, select a full volume and hope that it will find all the things.

Functions section

The functions section is divided with tabs. Shown tabs will depends on input type:

Extraction

XL-Parser provides two types of extraction:

Expressions

Extraction - Expression

You can add, edit (by double clicking an expression in the grid) or delete as much expression as you want. If more than one expression is used, priority depends on the order in the grid. You cannot set hierarchy of operators but you can change order of the expressions by using the up and down button on the right. If one expression contains capture groups (with an AND operator), it should be placed at end. All the expressions in the grid can be saved for a later used. There are also Extraction options available (see further down).

On top right corner, you have a Load results button. It can be use to open saved results. On bottom right corner, you have a Rebuild list of files button. When you do a search, XL-Parser has to browse the input and list the files that will be searched. The list of files is cached, so if you do another search without changing the input, XL-Parser doesn't have to rebuild this list, so it will be faster. If content of the directory (file(s) added or deleted), you should click on the Rebuild list of files button.

Expressions window

The Expression window is divided in three tabs:

Expressions
Extraction - Expression window

In the Expressions tab, you can set or edit a single, or a list of expressions. Expressions can be a string or a regular expression. Operator must be set if you have more than one expression. For regular expression, capture groups are supported. Note that you should not use capture groups with Invert option, as it cannot capture something on the a line that it doesn't keep. Comment (optional) can be used to identify what the expression is supposed to extract. You can use the regex shortcuts to help you build your regular expression. On the right side of the window, you have two button. The first one is used to add or edit the expression in the Expression database. The second one is used to open the Expression tool window:

Extraction - Expression tool window

The Expression tool window can be used to test your expression. If this one contains capture groups, the extracted data will be in the results section.

Extraction options

Extraction Options window

Here are some explanations about the options:

  • Mode: By default, XL-Parser search (and extract) your expressions on each line* of a file. You can change this mode to Search by file, so the whole file will have to meet all your expressions. With this mode, result will be a list of files instead a list of extracted lines.
  • Max results: This option is used to limit the number of extracted lines for each file or for the whole process.
  • Include context: By default, only lines that contains your expression will be extracted. But sometimes, for chat logs as an example, you want to see what is going one before and/or after the line that matches your expression.
  • No duplicates: If you have a list of expressions with the OR operator, lines are tested for each of the expression and duplicates are allowed. This allows you to get results for each expression, but this also leads to a longer processing time. If you don't need to get results for each expressions and want to accelerate the process, you can check this option.

* For some file formats, a line has a different meaning. For a doc(x), a line is a paragraph. For a Pdf, a line is a page.

Extraction Results

Extraction results

When you start extraction (Expressions or Special objects), a pop-up window appears. You can see the progress of the extraction (updated after each searched file). As the process runs, it stores extracted data in a temp file (.XL-Parser_tmp.txt), in the report directory (default is user directory, see report options further). When the process is done, the grid will contains stats about extracted expressions (or objects). You can see results by file or by expression/object.

Double click on a row to see results for each file or for each expression/object. This will create and open the results in a text file. If you right click on a row in the by file grid, you have access to the following functions:

If you right click on a row in the by expression/object grid, available are Show results for selected item(s), Show results for selected item(s) (No duplicates), Show results for selected item(s) (Results only), Show results for selected item(s) (Results only - No Duplicates), Select all and Copy selected rows.

You can Save results for a later use. If you click on the Build Report button, the following window will show up:

Extraction Report options window

Available options and default values will depend on the type of search and the number of results. For an extraction by expressions, default report format is TXT (TSV) as for an extraction by special objects, default report format is XLSX. If many expressions/special objects have been extracted, you have an additional option: One sheet per expression/object for XLSX, One page per expression/object for HTML or One file per expression/object for TXT (TSV). Include Source (the file where the result has been found) is only available for extraction by expressions.

Whatever report format you have choosen, you can include Column headers, Source (for extraction by expressions) and Expression/Object. These last two options will add a column containing the corresponding value.

Web log analysis for files

Functions Web log analysis

Web log analysis tab interface is different, based on the input. Above is what it's look like, if input is Folder/File(s). This function is for web logs only. The purpose of this function is to create (or update) a SQLite database to store web logs, so it will be possible to do advanced query on it.

When you create a database, informations about log files are stored in the database, including number of lines. For example, you create the database in the middle of April, including log files from January. When you update the database in june, XL-Parser examine the folder. Log files for January to March won't be added. The first part of April won't be added (based on number of lines), but all new lines will be added, as well as the other log files (May and June).

The log format is a regular expression that will match every fields of every lines in the log file. Access logs of a web server can be customized, so log formats in XL-Parser can also be customized. The first button on the right opens the Log format database Window and the second button call a function that will examine the input and try to determine the appropriate log format.

By default, XL-Parser parses log files by last-modified time ascending order. If you want to verify or change this, you can click on File order button. A window like the one below will open:

File Order window

You can sort the files by clicking on the header of the column you wanna sort with, or you can order files manually by clicking on the up and down button.

Set source files to readonly option will apply read-only attribute to every extracted log files. This will prevent any accidental modification.

When you create (or update) the database, you can add a lot of informations to the database including ISP (based on your XL-Whois database), GeoIP, Useragent details and Weekdays. You can also limit the number of added lines by using Filters. If you have a big bunch of log files (~1 Gb or bigger), you should filter additionnal resources if possible. This will reduce the size of the destination database. It's not unusual that requests for images represent more than half of all requests in the log files.

If you use filters at the creation of the database, you must be aware that some filters may increase the whole processing time. These filters are the ones that require resolving like ISP, GeoIP, useragent details or the filters based on time. If you have filters that don't require resolving, you better use them before. Filters based on existing fields are faster. For example, filtering useragent with the string Windows NT 6.1 will be faster than filtering UA_OS with the value Windows 7. That will produce the same result, but to get the UA_OS detail, the useragent will have to be resolved before testing the filter.

Resolving of ISP, GeoIP and useragent are done at the end (if you don't have filter that require these infos). Resolving Weekday is done while processing. Because each entry have a datetime field, the number of resolved weekdays should be exactly the same as the number of entries.

When you use filters, filtered lines (lines that don't matched the selected Filters) will be logged in a file ([Database filename].FilteredLines.log), except if you unselect the option.

When a parsed line doesn't match the Log format, it will be logged in a file ([Database filename].RejectedLines.log), except if you unselect the option. You may also ask XL-Parser to ignore comments while logging rejected lines (starting with #).

Finally, you must enter the output database file name and directory. The Select database when finished option will select the database as input when finished.

When you start the process, the main program will create the database, examine web logs and store all settings. Then the main program will start XL-Parser-process.exe (in the program folder). A window will appears and shows progress and stats:

Log analysis progress window

Log format database window

Log format Database window

The Log format database contains the regular expressions that XL-Parser will use to parse log files. Every format has a name, a sample, the pattern and the regex. You can add, edit or delete any log format.

Log format database - Add window

In this window, you can add or edit a log format. Here are some explanations about each field:

Datetime database window

Datetime Database window

The Datetime database is used to store and manage all datetime formats that will be available in the tool. This database is used with Log format (input), with Split logs (input) and with report options when Query database. You can add, edit or delete any datetime object in the database. If you select same datetime database that you use with XL-Tools, new objects will be usable in both tools.

Datetime Database - Edit

When you add or edit a datetime object, you must provide the following details:

If the Add or Edit button are still disabled after you filled all the fileds, you can click on the ? to see what is missing or is incorrect.

Ambiguous timezone abbreviations

Some timezone abbreviations are ambiguous, because there are matching more than one timezone. For example, CST could be Central Standard Time (USA), China Standard Time, or Cuba Standard Time. You can take a look here. So, I decided to set those timezone with the following values:

I have try to set it to the most popular value, but I could be wrong. If the datetime you need to parse have a different offset, you can parse the datetime without the timezone and set the appropriate offset in the input format.

Filters window

Filters window

Filters are divided in tree groups: White List, Regular and IP List. When you use White List filters, every entries that will match the filters will be ignored (or removed). Using Regular filters does the opposite. Every entries that match those filters will be selected and others will be ignored. IP List is just a type of regular filters that will match the remoteIP field only.

Filters are grouped by category. If you want to create a new category, you must use the Add button without selecting any category. If you want to create a new filter in an existing category, you must select the category first, then push the Add button. You can edit or delete any filter.

Filters - Edit

Filters are in fact a set of filters. You can, edit or delete any filter in the set.

Filters - Add a filter rule

Here are some explanations about each field:

A field filter rule can be selected/added to the Field Filter database tab.

Split logs

This function can be used to split logs (or any text files) into smaller pieces. Files can be splitted by size, by number of lines or by time.

By size

Split logs by size

You have to provide a size for the chunks. Files that exceed this size will be splitted. The final size of each chunks may be a bit different because XL-Parser don't split lines. If you select the Use datetime in filename option (log files only), XL-Parser will extract the first timestamp of each file and use it to name the file. Otherwise, it will use a simple counter.

Web log analysis for XL-Parser database

Once you've created a log database with XL-Parser (see Web log analysis for files) and you select the database as input (don't forget to select the Database radio button), you have access to the following functions (tabs):

Current database

Log Analysis - Database - Current Database

When you select a database as input, XL-Parser opens it and gather informations about this database. In this tab, you can see these informations:

  • General:
    • Last update: Date and time when the database have been created or updated.
    • Period: Dates of the first and last entries of the database.
    • Number of:
      • Entries: Number of entries in the log table.
      • Filtered: Number of lines filtered (lines that doesn't matched the selected Filters).
      • Rejected: Number of line rejected (lines that doesn't matched the Log format).
  • Source:
    • Log format: The log format that was used to parse the logs.
    • Filters: The filters that were used (if so).
    • Source file(s)...: Details about source log file(s). This will open a pop-up window:
      Source File(s)
      In the source file(s) grid, you can see informations about the log files that have been parsed to build the database. The Exists column indicates if the file can still be found on the drive. You can change the path by right clicking on a line. If the path contains other log files, all the grid will be updated at the same time.
    • Rejected: This will open the rejected lines file ([Database filename].RejectedLines.log) created when database was created. This file should be in the same directory of the database.
    • Filtered: This will open the filtered lines file ([Database filename].FilteredLines.log) created when database was created. This file should be in the same directory of the database.
  • Resolved:
    • ISP: If the corresponding option has been used when creating/updating the database, this value indicates how much Internet Service Providers (ISP) have been resolved even if the result was unknown. So this may also indicate how much distinct IP addresses are in the logs.
    • GeoIP: Same as previous, except GeoIP only applies to IPv4 addresses in XL-Parser.
    • UAs: Number of resolved useragents.
    • Weekdays: Number of resolved weekdays. It should be the same value as the number of entries because each entry must contain a valid timestamp.

Copyright (c) 2015-2021 Alain Rioux