The basic idea is to run a fitness function that scores each regex expression on whether it runs without throwing an exception (which makes it valid), and on whether the desired tokens are part of the output string.
Some of the metrics the fitness function can use to determine whether an expression is "fit" are:
* valid expression
* relative offset/excess bytes in the matched string
* number of examples it can successfully extract
* length of the expression
* runtime of the expression
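The metrics above could be combined into a single score along the following lines. This is only a minimal sketch, not the script's actual implementation: the function name <code>fitness</code>, the example strings and all weights are assumptions chosen for illustration, and it only covers regex (not xpath) expressions.

```python
import re
import time

# Hypothetical training data: (input text, token we want to extract).
EXAMPLES = [
    ("Author: Alice | Date: 2014-01-02", "Alice"),
    ("Author: Bob | Date: 2015-03-04", "Bob"),
]


def fitness(pattern, examples):
    """Score a candidate regex against the examples; higher is better.

    The weights below are arbitrary assumptions, not tuned values.
    """
    # Metric 1: valid expression. An expression that does not compile is unfit.
    try:
        compiled = re.compile(pattern)
    except re.error:
        return float("-inf")

    extracted = 0  # Metric 3: number of examples successfully extracted
    excess = 0     # Metric 2: excess characters around the desired token
    start = time.perf_counter()
    for text, expected in examples:
        match = compiled.search(text)
        if match is None:
            continue
        matched = match.group(0)
        if expected in matched:
            extracted += 1
            excess += len(matched) - len(expected)
    runtime = time.perf_counter() - start  # Metric 5: runtime

    return (
        100.0 * extracted      # reward successful extractions
        - 1.0 * excess         # penalise surplus characters in the match
        - 0.1 * len(pattern)   # Metric 4: prefer shorter expressions
        - 10.0 * runtime       # prefer faster expressions
    )


# A tight expression should outrank a loose one, and an invalid one ranks last.
candidates = [r"(?<=Author: )\w+", r"Author: \w+", r"("]
ranked = sorted(candidates, key=lambda p: fitness(p, EXAMPLES), reverse=True)
```

With these example weights, the lookbehind expression ranks first because it matches only the author name, the looser expression is penalised for the extra "Author: " prefix, and the unbalanced parenthesis is rejected outright as invalid.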
Ultimately, this would allow the script to self-update its regex/xpath expressions if/when the underlying websites (or their themes) change. It would also make it possible to add support for new websites without ever manually adding the required xpath/regex expressions: all that is needed is a sufficiently large number of example datasets from which to obtain the author, date and title information.