Using → Recategorizing Applications
XDMoD automatically categorizes page impressions by application. This
categorization is imperfect since it depends on pre-defined filters that are
not aware of all possible applications. After page impressions have been
ingested, you may wish to change their categorization. For example, they may
have been categorized as “Unknown” when you would be able to properly
categorize them manually given their request paths. For example, if you know
the request path /node/[host]/[port]/foo/bar
matches the foo
application,
you may wish to add a filter that assigns the foo
application to any page
impressions with that request path, and apply that filter to all the previously
ingested page impressions without having to reingest the original log files.
To help with recategorization, as page impressions are ingested, their distinct
request paths are kept in the modw_ondemand.request_path
database table, with
a request_path_id
foreign key in the modw_ondemand.page_impressions
fact
table. These can be used to recategorize previously ingested page
impressions by selecting all the page impressions that have certain request
paths and reassigning an application to them (instructions for doing this are
further down).
In the cases where multiple distinct request paths can be grouped together in a
general form without losing any relevant information (e.g., /foo?a=b
,
/foo?c=d
, and /foo?e=f
might all be generalized as /foo?[params]
), there
is a configuration file that filters request paths into general forms such that
they appear in the modw_ondemand.request_path
table as, e.g.,
/foo?[params]
. You may wish to recategorize groups of request paths into a
more general form to keep the size of this table smaller and to simplify the
query for recategorizing applications of a general form (instructions for
doing this are immediately below).
Recategorizing future request paths
The file used for filtering request paths into more general forms is in the
configuration directory (whose location depends on how you configured your
installation, e.g., /etc/xdmod
, /opt/xdmod/etc
) under
etl/etl_data.d/ood/request-path-filter.json
. This file contains a JSON object
whose keys are regular expressions used for matching against request paths from
the ingested logs, and whose values are the corresponding general forms that
will be stored in the database and assigned to future page impressions when
they are ingested. Edit this file to configure the filter how you wish.
If you have multiple OnDemand resources configured in XDMoD, you can also
filter request paths on a per-resource basis by having other similar files at
etl/etl_data.d/ood/request-path-filter.d/${OOD_RESOURCE_CODE}.json
, where
${OOD_RESOURCE_CODE}
matches the value passed via the -r
argument to
xdmod-ondemand-ingestor
(see the ingestion and aggregation
instructions). Any keys that are the same between
these files and the main request-path-filter.json
will have their values
override the values in the main request-path-filter.json
.
Recategorizing existing request paths
To apply a new filter to page impressions that have already been ingested, follow the instructions below to run some SQL statements in the XDMoD data warehouse.
First you will want to back up the modw_ondemand.page_impressions
and
modw_ondemand.request_path
tables (e.g., using mysqldump
) so that you can
restore the backups if something goes wrong that cannot be undone during these
steps.
Get the current timestamp prior to running any statements. Save this timestamp somewhere. Later, you will use it when you re-aggregate all page impressions modified after it.
SELECT CURRENT_TIMESTAMP();
Set some variables that will be used in subsequent statements. Replace the
values below; set the desired regex from the filter and the desired generalized
request path. Be careful that if the regex from the filter contains a grouping
that gets backreferenced in the general form (e.g., $1
) that you only run all
these instructions for one member of the grouping at a time (e.g., if the regex
is a/(b|c|d)/e
and the general form is a/[foo]/e
, make sure to run all
these instructions once with the filter set to a/b/e
, once with it set to
a/c/e
, and once with it set to a/d/e
):
SET @request_path_filter = '^/rnode/[^/]+/[^/]+(/data/plugin/images/images\\?).+';
SET @general_request_path = '/rnode/[host]/[port]/data/plugin/images/images?[params]';
Set locks for the tables you will be modifying, so that the automatic ingestion pipeline does not try to modify the tables at the same time you are:
LOCK TABLES
modw_ondemand.page_impressions WRITE,
modw_ondemand.page_impressions AS p WRITE,
modw_ondemand.request_path WRITE,
modw_ondemand.request_path AS rp READ;
Insert the new request path into modw_ondemand.request_path
table and get its
ID:
INSERT INTO modw_ondemand.request_path (path)
VALUES (@general_request_path);
SET @request_path_id = (
SELECT id
FROM modw_ondemand.request_path
WHERE path = @general_request_path
);
If a Duplicate entry
error occurs, it just means the request path is already
in the table; you can continue with the instructions below.
Create a temporary table containing all the page impressions whose path matches the filter but which don’t already have the general form:
CREATE TEMPORARY TABLE modw_ondemand.tmp_request_path_updates AS SELECT p.*, rp.path
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
WHERE rp.path REGEXP @request_path_filter
AND rp.path != @general_request_path;
Select all the rows from the temporary table and confirm it is a correct list of page impressions whose request path should be set to the general form:
SELECT * FROM modw_ondemand.tmp_request_path_updates;
Then, set the corresponding new request_path_id
in the page_impressions
table, ignoring any rows that are now duplicates:
UPDATE IGNORE modw_ondemand.tmp_request_path_updates AS t
JOIN modw_ondemand.page_impressions AS p ON p.id = t.id
SET p.request_path_id = @request_path_id;
Confirm it worked, noting that some rows may not have been updated because they are now duplicates of other rows:
SELECT p.*, rp.path
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
WHERE rp.path REGEXP @request_path_filter;
Select any rows that are duplicates:
SELECT p.*, rp.path
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
WHERE rp.path REGEXP @request_path_filter
AND rp.path != @general_request_path;
And delete those:
DELETE p
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
WHERE rp.path REGEXP @request_path_filter
AND rp.path != @general_request_path;
Confirm it worked:
SELECT p.*, rp.path
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
WHERE rp.path REGEXP @request_path_filter;
And drop the temporary table:
DROP TABLE modw_ondemand.tmp_request_path_updates;
Mark all the page impressions that have the general form as being modified so they can be reaggregated later, since some may not have been modified because they were duplicates of ones that were deleted:
UPDATE modw_ondemand.page_impressions
SET last_modified = CURRENT_TIMESTAMP()
WHERE request_path_id = @request_path_id;
Next, select the rows from the request_path
table that match the filter:
SELECT *
FROM request_path
WHERE path REGEXP @request_path_filter
AND id != @request_path_id;
If those rows look correct to delete, run the SQL statement below to delete
them (the same statement as immediately above but replacing SELECT *
with
DELETE
).
DELETE
FROM request_path
WHERE path REGEXP @request_path_filter
AND id != @request_path_id;
Finally, unlock the tables:
UNLOCK TABLES;
Recategorizing future applications
The file used for mapping applications is in the configuration directory (whose
location depends on how you configured your installation, e.g., /etc/xdmod
,
/opt/xdmod/etc
) under etl/etl_data.d/ood/application-map.json
. This
file contains a JSON object whose keys are regular expressions used for
matching against request paths from the ingested logs, and whose values are the
corresponding applications that will be stored in the database and assigned to
future page impressions when they are ingested. Edit this file to configure the
application map how you wish.
If you have multiple OnDemand resources configured in XDMoD, you can also
map applications on a per-resource basis by having other similar files at
etl/etl_data.d/ood/application-map.d/${OOD_RESOURCE_CODE}.json
, where
${OOD_RESOURCE_CODE}
matches the value passed via the -r
argument to
xdmod-ondemand-ingestor
(see the ingestion and aggregation
instructions). Any keys that are the same between
these files and the main application-map.json
will have their values
override the values in the main application-map.json
.
Recategorizing existing applications
To recategorize applications for page impressions that have already been ingested, follow the instructions below to run some SQL statements in the XDMoD data warehouse.
First you will want to back up the modw_ondemand.page_impressions
and
modw_ondemand.app
tables (e.g., using mysqldump
) so that you can
restore the backups if something goes wrong that cannot be undone during these
steps.
Get the current timestamp prior to running any statements (if you already did this above when recategorizing request paths, keep that one instead). Save this timestamp somewhere. Later, you will use it when you re-aggregate all page impressions modified after it.
SELECT CURRENT_TIMESTAMP();
Get the list of all apps so you can get the IDs of the app whose ID you want to change and the ID of the app to which you want to change it.
SELECT * FROM modw_ondemand.app;
Set some variables that will be used in subsequent statements. Set
@request_path_filter
to a regex that will be used to match page impressions
whose request paths have a general form. Set @old_app_id
to the current app
ID of the page impressions you want to change. Set @new_app_id
to the ID of
the app to which you want to change them.
SET @request_path_filter = '^/rnode/[^/]+/[^/]+/(proxy/[^/]+/)?(data|experiment)/.*';
SET @old_app_id = 8;
SET @new_app_id = 7;
Get the list of page impressions that will be changed:
SELECT *
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
JOIN modw_ondemand.app AS a ON a.id = p.app_id
WHERE p.app_id = @old_app_id
AND rp.path REGEXP @request_path_filter;
If the list is correct, set locks for the tables you will be modifying, so that the automatic ingestion pipeline does not try to modify the tables at the same time you are:
LOCK TABLES
modw_ondemand.page_impressions AS p WRITE,
modw_ondemand.request_path AS rp READ,
modw_ondemand.app AS a READ;
Change the app ID from old to new, ignoring any rows that are now duplicates:
UPDATE IGNORE modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
SET p.app_id = @new_app_id
WHERE p.app_id = @old_app_id
AND rp.path REGEXP @request_path_filter;
Confirm it worked:
SELECT *
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
JOIN modw_ondemand.app AS a ON a.id = p.app_id
WHERE p.app_id = @new_app_id
AND rp.path REGEXP @request_path_filter;
Select any rows that are duplicates:
SELECT *
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
JOIN modw_ondemand.app AS a ON a.id = p.app_id
WHERE p.app_id = @old_app_id
AND rp.path REGEXP @request_path_filter;
And delete those:
DELETE p
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
JOIN modw_ondemand.app AS a ON a.id = p.app_id
WHERE p.app_id = @old_app_id
AND rp.path REGEXP @request_path_filter;
Confirm there is now an empty set of duplicates:
SELECT *
FROM modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
JOIN modw_ondemand.app AS a ON a.id = p.app_id
WHERE p.app_id = @old_app_id
AND rp.path REGEXP @request_path_filter;
Mark all the page impressions whose request path has the general form as being modified so they can be reaggregated later, since some may not have been modified because they were duplicates of ones that were deleted:
UPDATE modw_ondemand.page_impressions AS p
JOIN modw_ondemand.request_path AS rp ON rp.id = p.request_path_id
SET last_modified = CURRENT_TIMESTAMP()
WHERE p.app_id = @new_app_id
AND rp.path REGEXP @request_path_filter;
And unlock the tables:
UNLOCK TABLES;
Reaggregating
After you have run all the SQL statements, use the xdmod-ondemand-ingestor
shell command to re-aggregate all of the page impressions that were modified
after the value for CURRENT_TIMESTAMP()
that you obtained above (replace
'2024-05-09 14:10:33'
in the example below):
$ xdmod-ondemand-ingestor -a -m '2024-05-09 14:10:33'