IoT data analytics
By John Kirch
04 August 2023 - v1.3
emnify provides a comprehensive solution for connecting IoT devices to over 540 mobile network providers in more than 180 countries. To give customers a truly global experience, data is routed through emnify's AWS VPN and made accessible via four AWS regions around the world.
The metadata emnify collects is important for a number of reasons and use cases:
- Tracking data and SMS usage can help reduce monthly costs.
- Identifying IoT devices that are offline due to power supply problems, malfunction, theft, etc.
- Discovering usage trends based on aggregated data, e.g., hourly, daily, by weekday, weekly, monthly, quarterly, or annually.
- Security: authentication failures, unusual data usage spikes, or excessive SMS messages received can indicate attempts to gain unauthorized access to IoT devices.
However, the tools and reports that the web-based emnify Portal provides do not always meet every customer's needs:

- Reporting is not always customizable, e.g., event descriptions are collapsed by default and can only be expanded for the currently selected event.
- Although many reports are customizable regarding date ranges and other filters, the results can take several seconds to appear, presumably due to the multitenant environment and other factors.
- Raw data usage for each IoT device is not visible, only aggregated usage.
Because IoT devices are so diverse and collect such a variety of data, this post focuses on the mobile network metadata they generate for maintaining connectivity. For an example of sensor-based data, see IOT Data analysis - data from 54 sensors and the dataset it uses.
Managing IoT metadata and creating a custom data analytics solution
In this post, you will acquire skills for visualizing data from the emnify Data Streamer according to your organization’s specific reporting needs. To achieve this, you will learn how to integrate the following third-party solutions into your workflow: NXLog, Raijin, Python pandas, and Python seaborn.
The exercises presented in this post will guide you through the following tasks:
- Create webhooks for NXLog to capture emnify metadata
- Capture raw, unprocessed logs for long-term archival
- Prepare the logs for insertion into a Raijin database
- Explore the metadata in Raijin using SQL queries
- Write queries for analyzing various distributions by category
- Use Jupyter with Python data analysis libraries to explore and visualize the metadata for creating reports
Before you start, you will need to get acquainted with the emnify metadata and infrastructure that delivers it.
The IoT log source
The emnify Data Streamer takes the IoT connection metadata, formats it as JSON records, and sends it to subscribers in real time. emnify allows you to subscribe to this data using webhooks. NXLog can capture this data and provide additional functionality that emnify cannot provide without third-party integrations:
- Data archival for generating quarterly, annual, or lifetime metrics
- Forwarding the data captured in real time to a Raijin database to leverage its time series capabilities for forensic purposes
- Filtering security events and forwarding them in real time to a SIEM
emnify provides integrations with a small number of third-party data analytics solutions. However, unlike NXLog, it currently does not provide SIEM integrations with Google Chronicle, IBM QRadar, Microsoft Sentinel, Securonix, Splunk, or Sumo Logic, to name a few.
emnify metadata types
The metadata emnify provides can be viewed in real time via the web-based emnify Portal. The emnify Data Streamer provides two basic types of metadata:
- Data and SMS usage
  - Mobile data uploaded (in MB)
  - Mobile data downloaded (in MB)
  - Number of SMS messages sent
  - Number of SMS messages received
- Events
  - Type
  - Severity
  - Description
Sample Data Streamer usage records (1 Data and 1 SMS)
{
"cost": 0.0304797,
"id": 961767216873563,
"operator": {
"id": 5,
"name": "Telefonica O2",
"mnc": "07",
"country": {
"id": 74,
"mcc": "262",
"name": "Germany"
}
},
"organisation": {
"id": 543210,
"name": "7LX Solutions Ltd"
},
"tariff": {
"id": 557,
"name": "Regional Pro EUR",
"ratezone": {
"id": 3398,
"name": "Europe Basic"
}
},
"traffic_type": {
"id": 5,
"description": "Data"
},
"endpoint": {
"id": 31432106,
"name": "IoT Device 008",
"ip_address": "10.196.67.36",
"tags": "DE,NRW,OP,S1",
"imei": "3344098765432102",
"balance": null
},
"imsi": "295098765432118",
"volume": {
"total": 0.304797,
"rx": 0.10377,
"tx": 0.201027
},
"start_timestamp": "2023-06-01T21:44:13Z",
"sim": {
"id": 19356708,
"iccid": "69476543210136526897",
"msisdn": "429555555555508",
"production_date": "2020-12-23T13:14:43Z"
},
"currency": {
"id": 1,
"code": "EUR",
"symbol": "€"
},
"end_timestamp": "2023-06-01T22:00:00Z",
"imsi_id": 59876512
}
{
"cost": 0.01,
"id": 951100051194091,
"operator": {
"id": 29,
"name": "Proximus",
"mnc": "01",
"country": {
"id": 22,
"mcc": "206",
"name": "Belgium"
}
},
"organisation": {
"id": 543210,
"name": "7LX Solutions Ltd"
},
"tariff": {
"id": 1,
"name": "Internal Test Tariff",
"ratezone": {
"id": 1,
"name": "Zone 1"
}
},
"traffic_type": {
"id": 6,
"description": "SMS"
},
"endpoint": {
"id": 31432106,
"name": "IoT Device 008",
"ip_address": "10.196.67.36",
"tags": "DE,NRW,OP",
"imei": "3344098765432102",
"balance": null
},
"imsi": "295098765432118",
"volume": {
"total": 1,
"rx": 0,
"tx": 1
},
"start_timestamp": "2023-05-25T09:08:51Z",
"sim": {
"id": 19356708,
"iccid": "69476543210136526897",
"msisdn": "429555555555508",
"production_date": "2020-12-23T13:14:43Z"
},
"currency": {
"id": 1,
"code": "EUR",
"symbol": "€"
},
"end_timestamp": "2023-05-25T09:08:53Z",
"imsi_id": 59876512
}
Sample Data Streamer event record
{
"id": 902815564038271,
"timestamp": "2023-04-21T06:31:21Z",
"event_source": {
"id": 1,
"description": "Policy Control"
},
"event_severity": {
"id": 1,
"description": "WARN"
},
"event_type": {
"id": 11,
"description": "Endpoint blocked"
},
"organisation": {
"id": 543210,
"name": "7LX Solutions Ltd"
},
"user": null,
"alert": true,
"description": "Blocking data access for endpoint, because organisation exceeded monthly limit.",
"endpoint": {
"id": 31432105,
"name": "IoT Device 006",
"ip_address": "10.196.67.33",
"tags": "",
"imei": "3344098765432111"
},
"sim": {
"id": 19356703,
"iccid": "69476543210186526892",
"msisdn": "429555555555503",
"production_date": "2019-10-10T13:06:16Z"
},
"imsi": {
"id": 59876503,
"imsi": "295098765432104",
"import_date": "2019-10-10T13:06:16Z"
},
"detail": null
}
Raw data collection limitations
As an emnify customer, the emnify Data Streamer is your only source of raw data usage logs. The emnify REST API only returns aggregated data for a specific IoT device. This method has serious limitations that make it practically useless for security or forensic use cases, as the sketch after this list illustrates:

- You need to make a separate request for each endpoint_id (device ID).
- The data is aggregated by month or day.
- The finest granularity of aggregated usage data is by minute, but only for the 60 minutes prior to the API request.
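To make the first limitation concrete, the sketch below polls aggregated statistics one device at a time. The base URL, request path, and token handling are illustrative assumptions only, not taken from the emnify API reference; consult that reference for the actual routes and authentication flow.

# Illustrative sketch only: the endpoint path and token handling below are
# assumptions, not the documented emnify API. The point is that aggregated
# statistics must be fetched with one request per device.
import requests

API_BASE = "https://api.example-emnify-host/v1"  # hypothetical base URL
TOKEN = "<your-auth-token>"                      # hypothetical bearer token
endpoint_ids = [31432105, 31432106]              # one request per endpoint_id

for endpoint_id in endpoint_ids:
    res = requests.get(
        f"{API_BASE}/endpoint/{endpoint_id}/stats",  # hypothetical path
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    print(endpoint_id, res.json())  # aggregated, not raw, usage figures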
Setting up log collection and forwarding
Intuitively, you might be inclined to install and configure each component of this integration in the same order the data travels:

- emnify webhooks
- NXLog agent
- Raijin database
However, we will work in the opposite direction to avoid error messages about unreachable hosts and undefined databases. Moreover, the emnify Portal expects the IP address and port number of each webhook to be reachable and capable of receiving POSTed data at the time you configure it.
The emnify Data Streamer expects webhooks to be reachable from the internet. If you plan to install your NXLog agent behind a firewall, make sure the IP address and port numbers are open to the public. You might also consider port forwarding or installing a virtual host with a cloud provider if necessary.
If your organization is using Amazon AWS, check whether it already has emnify Cloud Connect set up. With a functional AWS Transit Gateway, you could install an NXLog agent and Raijin on an EC2 instance that is already connected to your Transit Gateway. emnify has a video tutorial on how to configure Cloud Connect. This solution keeps all of your data private because you are working entirely within a VPN.
Configure Raijin
If you are not already using the Raijin Database Engine to collect your logs for data analysis, install Raijin.
Since Raijin uses the HTTP protocol, your web browser serves as a Raijin client for sending SQL queries as well as creating, listing, and dropping databases and tables.
Just enter the hostname or IP address of the host where Raijin is installed with its default port 2500 to connect: http://<raijin-host>:2500/
The first step is to set up a new database and tables for collecting metadata from the emnify Data Streamer:
The SQL for setting up the new database and tables:
CREATE DATABASE emnify;
USE emnify;
CREATE TABLE IF NOT EXISTS usage(
"operator.mnc" STRING,
"operator.country.mcc" STRING,
"endpoint.imei" STRING,
"imsi.imsi" STRING,
"sim.msisdn" STRING
);
CREATE TABLE IF NOT EXISTS events(
"endpoint.imei" STRING,
"sim.msisdn" STRING,
"imsi.imsi" STRING,
"detail.country.mcc" STRING,
"detail.country.country_code" STRING,
"detail.pdp_context.mcc" STRING,
"detail.pdp_context.imsi" STRING,
"detail.pdp_context.rat_type" STRING,
"detail.pdp_context.tx_teid_control_plane" STRING,
"detail.pdp_context.tx_teid_data_plane" STRING,
"detail.pdp_context.mnc" STRING,
"detail.pdp_context.rx_teid" STRING,
"detail.pdp_context.gtp_version" STRING,
"detail.pdp_context.imeisv" STRING
);
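If you prefer to script this setup rather than use a browser, the following minimal sketch POSTs the statements to Raijin's HTTP interface. It assumes Raijin accepts a JSON body with a query field, the same request shape used by the queryRaijin() function later in this post; the USE statement is sent together with the CREATE TABLE so no session state is assumed between requests.

# Sketch: run the setup SQL over Raijin's HTTP interface.
# Assumes a JSON request body of the form {"query": "<SQL>"},
# as used by queryRaijin() later in this post.
import json
import requests

RAIJIN_URI = 'http://192.168.0.111:2500'  # adjust to your Raijin host

statements = [
    'CREATE DATABASE emnify;',
    # Prefix USE so no session state is assumed between requests;
    # the events table follows the same pattern.
    'USE emnify; CREATE TABLE IF NOT EXISTS usage('
    '"operator.mnc" STRING, "operator.country.mcc" STRING, '
    '"endpoint.imei" STRING, "imsi.imsi" STRING, "sim.msisdn" STRING);',
]

for sql in statements:
    res = requests.post(RAIJIN_URI, data=json.dumps({'query': sql}))
    print(res.status_code, res.text)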
Configure NXLog
If you are not already an NXLog customer, download NXLog Enterprise Edition v5 trial for your platform to get started for free. The Deployment section of the NXLog User Guide has instructions for all supported platforms if you have any questions.
Once you have NXLog installed, the next step is to find and edit your nxlog.conf configuration file. On UNIX-like platforms, the default path and filename is /opt/nxlog/etc/nxlog.conf.
The following configuration will enable you to:
- Save a copy of the raw, unprocessed logs for long-term archival
- Process and forward the logs for storage in your Raijin database
<Extension _json>
Module xm_json
Flatten TRUE (1)
</Extension>
# Receive data and SMS usage from the emnify Data Streamer
<Input usage_received>
Module im_http
ListenAddr 0.0.0.0:1514
</Input>
# Receive event messages from the emnify Data Streamer
<Input events_received>
Module im_http
ListenAddr 0.0.0.0:1515
</Input>
# Log emnify network and SMS usage data to file
<Output usage_log> (2)
Module om_file
File "emnify-usage.ndjson"
</Output>
# Log emnify events to file
<Output event_log>
Module om_file
File "emnify-events.ndjson"
</Output>
# Forward emnify network and SMS usage data to Raijin
<Output raijin_usage>
Module om_raijin
URL http://192.168.0.111:2500
DBName emnify
DBTable usage
<Exec>
parse_json(); (3)
delete($id); (4)
to_json(); (5)
</Exec>
</Output>
# Forward emnify events to Raijin
<Output raijin_events>
Module om_raijin
URL http://192.168.0.111:2500
DBName emnify
DBTable events
<Exec>
parse_json();
delete($id);
to_json();
</Exec>
</Output>
<Route usage_route>
Path usage_received => usage_log, raijin_usage
</Route>
<Route event_route>
Path events_received => event_log, raijin_events
</Route>
(1) Setting the Flatten directive to TRUE converts nested JSON key-value pairs to dot notation in order to deliver a flattened JSON structure to Raijin (see the sketch below).
(2) Both the usage_log and event_log output instances write the raw, structured JSON objects to file in NDJSON format. This preserves the captured metadata in an unaltered state in case it ever needs to be reloaded into Raijin.
(3) The raw JSON payload needs to be parsed as JSON before it can be processed.
(4) The original emnify primary key id can be discarded because it is only of value within the confines of the native emnify database. Since emnify customers don't have access to that database, these primary keys are merely excess baggage that wastes storage.
(5) Calling the to_json() procedure outputs the flattened JSON objects that the Raijin database server supports.
Restart the NXLog service for the changes to take effect.
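To see what Flatten TRUE produces on the wire (callout 1 above), the following standalone Python sketch reproduces the effect of flattening a nested record to dot notation. It is illustrative only; NXLog performs this conversion internally.

# Illustrative only: reproduce the effect of NXLog's Flatten TRUE, which
# converts nested JSON keys to dot notation before delivery to Raijin.
def flatten(obj, prefix=""):
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

record = {"operator": {"mnc": "07", "country": {"mcc": "262"}}}
print(flatten(record))
# {'operator.mnc': '07', 'operator.country.mcc': '262'}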
Configure the webhooks
The emnify Portal provides an Integrations section where webhooks can be configured.
Create your webhooks using the IP address of the NXLog host and the port numbers you used for the ListenAddr directives in your NXLog configuration.
Triggering events and data usage
Now that you have configured the required components, you will need to ensure that you have at least one IoT device equipped with an emnify IoT eSIM that is powered up. If you do not want to wait for events or data usage statistics to be generated, you can force some events by restarting the device.
If one of your emnify-enabled mobile devices is a smartphone, you could send an SMS text or open an application that requires a connection to generate new data usage logs.
To generate received SMS messages, the Connected Devices section of the emnify Portal provides an SMS console for sending SMS messages to your IoT devices. This feature is useful since many industrial IoT devices can be configured through special AT commands sent via SMS.
Verify the setup
If you are seeing new events and usage logs appearing in the emnify Portal, query the Raijin database to confirm that the metadata is being inserted by checking the row count of both tables:
Using COUNT(*) to verify data ingestion:
USE emnify;
SELECT COUNT(*) FROM events;
SELECT COUNT(*) FROM usage;
Troubleshooting
If the Raijin tables are empty, make sure that the webhook configurations are correct and the NXLog agent is capturing the raw data:
On the NXLog host, ensure that the raw data files are not empty:
# cd /opt/nxlog/var/spool/nxlog
# ls -s1
total 18140
12984 emnify-events.ndjson
5156 emnify-usage.ndjson
If both files are empty, consult the Troubleshooting section of the NXLog User Guide.
If NXLog is capturing the metadata but the Raijin tables are still empty, consult the Raijin and NXLog logs.
# cd /opt/nxlog/var/log/nxlog
# tail nxlog.log
# cd /opt/raijin/log
# tail raijin.log
Exploring the metadata with Raijin
The first step in exploring data is to find out which fields might be important and what data types they use. The SQL DESCRIBE command will provide this information.
Events
From a security perspective, the events table is most important.

DESCRIBE the events table:

USE emnify;
DESCRIBE events;
Results of DESCRIBE events (95 fields):

field | type |
---|---|
endpoint.imei | STRING |
sim.msisdn | STRING |
imsi.imsi | STRING |
detail.country.mcc | STRING |
detail.country.country_code | STRING |
detail.pdp_context.mcc | STRING |
detail.pdp_context.imsi | STRING |
detail.pdp_context.rat_type | STRING |
detail.pdp_context.tx_teid_control_plane | STRING |
detail.pdp_context.tx_teid_data_plane | STRING |
detail.pdp_context.mnc | STRING |
detail.pdp_context.rx_teid | STRING |
detail.pdp_context.gtp_version | STRING |
detail.pdp_context.imeisv | STRING |
MessageSourceAddress | STRING |
EventReceivedTime | DATETIME |
SourceModuleName | STRING |
SourceModuleType | STRING |
timestamp | DATETIME |
event_source.id | UINT32 |
event_source.description | STRING |
event_severity.id | UINT32 |
event_severity.description | STRING |
event_type.id | UINT32 |
event_type.description | STRING |
organisation.id | UINT32 |
organisation.name | STRING |
alert | BOOL |
description | STRING |
endpoint.id | UINT32 |
endpoint.name | STRING |
endpoint.ip_address | STRING |
endpoint.tags | STRING |
sim.id | UINT32 |
sim.iccid | STRING |
sim.production_date | DATETIME |
imsi.id | UINT32 |
imsi.import_date | DATETIME |
detail.country.iso_code | STRING |
detail.country.id | UINT32 |
detail.country.name | STRING |
detail.id | UINT32 |
detail.pdp_context.apn | STRING |
detail.pdp_context.tunnel_created | DATETIME |
detail.pdp_context.ratezone_id | INT32 |
detail.pdp_context.ci | UINT32 |
detail.pdp_context.ggsn_control_plane_ip_address | STRING |
detail.pdp_context.pdp_context_id | VARIANT |
detail.pdp_context.sgsn_data_plane_ip_address | STRING |
detail.pdp_context.ggsn_data_plane_ip_address | STRING |
detail.pdp_context.breakout_ip | STRING |
detail.pdp_context.ue_ip_address | STRING |
detail.pdp_context.operator_id | INT32 |
detail.pdp_context.sgsn_control_plane_ip_address | STRING |
detail.pdp_context.nsapi | UINT32 |
detail.pdp_context.tariff_id | INT32 |
detail.pdp_context.tariff_profile_id | INT32 |
detail.pdp_context.region | STRING |
detail.name | STRING |
detail.volume.rx | VARIANT |
detail.volume.tx | VARIANT |
detail.volume.total | VARIANT |
detail.tapcode | STRING |
detail.mnc | STRING |
detail.stale | BOOL |
user.id | UINT32 |
user.username | STRING |
user.name | STRING |
detail.user.id | INT32 |
detail.user.username | STRING |
detail.support_user_org.id | UINT32 |
detail.support_user_org.name | STRING |
detail.tariff_plan.id | UINT32 |
detail.tariff_plan.name | STRING |
detail.tariff_plan.start_date | DATETIME |
detail.tariff_plan.end_date | DATETIME |
detail.tariff_plan.evaluation | BOOL |
detail.tariff_plan.expiry_time | UINT32 |
detail.tariff_plan.price.sim_activated_rate | STRING |
detail.tariff_plan.rate | UINT32 |
detail.tariff_plan.yearly_rate | UINT32 |
detail.tariff_plan.currency.id | UINT32 |
detail.tariff_plan.service_level.id | UINT32 |
detail.customer_id | INT64 |
detail.payment_token | STRING |
detail.preferred_payment_option_id | UINT32 |
detail.username | STRING |
detail.breakout_id | UINT32 |
detail.creation_date | DATETIME |
detail.region | STRING |
detail.resource_share_name | STRING |
detail.breakout_type_id | UINT32 |
detail.breakout_type | STRING |
detail.price | UINT32 |
detail.pdp_context.lac | UINT32 |
The next step is to identify and confirm fields that function as categories. In this case, these two fields appear to be candidates:
- event_severity.description
- event_type.description

For this purpose, you can use SQL SELECT DISTINCT <field name> queries to see if any categories emerge.
USE emnify;
SELECT DISTINCT "event_severity.description"
FROM events;
This query returns two categories:
- INFO
- WARN
USE emnify;
SELECT DISTINCT "event_type.description"
FROM events
ORDER BY "event_type.description";
This query returns a much longer list of categories that you might find useful for future queries involving types of events.
All 23 categories:

- Billing configuration updated
- CloudConnect TGW breakout terminated
- Create PDP Context
- Data quota enabled
- Delete PDP Context
- Endpoint blocked
- Endpoint data traffic limit warning
- Endpoint disabled
- Endpoint enabled
- Generic
- Organisation blocked
- Purge GPRS location
- Purge location
- SIM Assigned
- SIM Released
- SIM activation
- SIM suspension
- Support Access
- Tariff plan updated
- Update GPRS location
- Update location
- User authentication failed
- User deleted
With the ability to write ad hoc SQL queries, Raijin can provide tabular views of events that are not possible with the views available in the emnify Portal Dashboard or under Connected Devices » Details » Events, where you will find Severity and Event Type but not Event Description. From the following query, you can see that Event Type alone tells very little about what has actually happened.
The SQL for this query:
USE emnify;
SELECT _id,
CONCAT(
DATE(timestamp), ' ', TIME(timestamp)
) AS "Event Time",
"endpoint.name" AS "Device",
description AS "Description",
"event_type.description" AS "Event Type"
FROM events
WHERE MATCH ("description") AGAINST('exceeded')
ORDER BY "endpoint.name",
timestamp
From the query results above, you can see that both events are the result of a data traffic limit configured in the Service Policy for this device. Without the Description field, you would never imagine that these two event types, Create PDP Context and Endpoint blocked, have anything in common.
Data and SMS usage
From the data contained in the usage table, you can track your costs and gain insights on usage trends.

DESCRIBE the usage table:

USE emnify;
DESCRIBE usage;
Results of DESCRIBE usage (39 fields):

field | type |
---|---|
operator.mnc | STRING |
operator.country.mcc | STRING |
endpoint.imei | STRING |
imsi.imsi | STRING |
sim.msisdn | STRING |
MessageSourceAddress | STRING |
EventReceivedTime | DATETIME |
SourceModuleName | STRING |
SourceModuleType | STRING |
cost | VARIANT |
operator.id | UINT32 |
operator.name | STRING |
operator.country.id | UINT32 |
operator.country.name | STRING |
organisation.id | UINT32 |
organisation.name | STRING |
tariff.id | UINT32 |
tariff.name | STRING |
tariff.ratezone.id | UINT32 |
tariff.ratezone.name | STRING |
traffic_type.id | UINT32 |
traffic_type.description | STRING |
endpoint.id | UINT32 |
endpoint.name | STRING |
endpoint.ip_address | STRING |
endpoint.tags | STRING |
imsi | INT64 |
volume.total | VARIANT |
volume.rx | VARIANT |
volume.tx | VARIANT |
start_timestamp | DATETIME |
sim.id | UINT32 |
sim.iccid | STRING |
sim.production_date | DATETIME |
currency.id | UINT32 |
currency.code | STRING |
currency.symbol | STRING |
end_timestamp | DATETIME |
imsi_id | UINT32 |
Data usage for a specific month
Creating custom data usage reports for a variety of time spans is possible now that you have raw data and the ability to write aggregation queries using SQL.
In the following exercise, the database should report device data usage for a single month, May 2023. The following metrics should be returned for each IoT device:
- Device: Device name
- RX: Data received (MB)
- TX: Data transmitted (MB)
- Combined: Total data traffic (MB) for both received and transmitted data
- Duration: Total time spent receiving/transmitting data (in seconds)
- Sessions: Number of data transfer sessions
Raijin's native SQL functionality is indispensable for such reports.
The SQL for this query:
USE emnify;
SELECT "endpoint.name" AS "Device",
SUM("volume.rx") AS "RX",
SUM("volume.tx") AS "TX",
SUM("volume.total") AS "Combined",
SUM( (end_timestamp - start_timestamp)/1000000) AS "Duration",
COUNT("endpoint.name") AS "Sessions"
FROM usage
WHERE YEAR(end_timestamp) = 2023
AND MONTH(end_timestamp) = 5
AND "traffic_type.description" = 'Data'
GROUP BY "endpoint.name",
month
ORDER BY "endpoint.name";
Visualizing data with Python and Jupyter Notebook
Data analysis requires visualization of data distributions. With Python and its libraries specifically written for data science, you can gain insights that would otherwise not be possible.
Jupyter Notebook is an interactive development environment for testing and collaborating with others. It integrates editable code and documentation in a single package. By selecting Run All, Python code cells will be executed sequentially. See Structure of a notebook document for more information. See Create a Jupyter notebook file if you would like to create your notebook and run it within Visual Studio Code.
For the exercises in this tutorial, your Python environment will require the following libraries:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests as req
import json
Most Jupyter Notebook resources recommend Anaconda for managing your Python environments.
Querying Raijin with Python
In this Jupyter notebook, the following Python code cell demonstrates how you can query a Raijin database and transform the Raijin response into a pandas DataFrame. Invoking the name of the pandas DataFrame, events, as the last line of the cell outputs a tabular view of the data. The SQL for this query:
USE emnify;
SELECT _id,
CONCAT(
DATE(timestamp), ' ', TIME(timestamp)
) AS "Event Time",
"endpoint.name" AS "Device",
description AS "Description",
"event_type.description" AS "Event Type"
FROM events
WHERE MATCH ("description") AGAINST('exceeded')
ORDER BY "endpoint.name",
timestamp
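A code cell along these lines accomplishes this. It is a minimal sketch that mirrors the queryRaijin() function defined in the next section (the same NDJSON Accept header and JSON request body) and assumes the Raijin host used in the NXLog configuration above:

import json
import pandas as pd
import requests as req

raijin_uri = 'http://192.168.0.111:2500'

# The SQL from above, collapsed to a single line for the request body
sql = ' '.join("""
USE emnify;
SELECT _id, CONCAT(DATE(timestamp), ' ', TIME(timestamp)) AS "Event Time",
       "endpoint.name" AS "Device", description AS "Description",
       "event_type.description" AS "Event Type"
FROM events
WHERE MATCH ("description") AGAINST('exceeded')
ORDER BY "endpoint.name", timestamp
""".split())

# POST the query and request an NDJSON response (one JSON object per line)
headers = {"Accept": "application/x-ldjson"}
res = req.post(raijin_uri, headers=headers, data=json.dumps({"query": sql}))

# Parse each NDJSON line and build the DataFrame
events = pd.DataFrame([json.loads(line) for line in res.text.splitlines()])
events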
Create a Python function for querying Raijin
It is more efficient to create a Python function from the code required to take an SQL statement, send it to Raijin, receive a response, and convert it to a pandas DataFrame. Another good practice is to store each SQL query as a file. Storing SQL as files enables you to:

- Add queries to a Git repository for maintaining version control
- Create multiline, indented SQL for easier visualization of the various functional elements of a query
- Include them as external code samples in your documentation
The queryRaijin(filename, raijin_uri) function:

# Dependencies
# ------------
# import pandas as pd
# import requests as req
# import json
def queryRaijin(
    filename,
    raijin_uri = 'http://192.168.0.111:2500'
):
    headers = {"Accept": "application/x-ldjson"}
    # Read the SQL file and join its lines into a single-line statement
    with open(filename) as file:
        lines = [line.rstrip() for line in file]
    sql = ' '.join(lines)
    sql = sql.replace('  ', ' ')  # collapse double spaces left by indentation
    # Wrap the SQL in the JSON request body that Raijin expects
    query = {"query": sql}
    res = req.post(
        raijin_uri,
        headers=headers,
        stream=True,
        data=json.dumps(query)
    )
    # Each line of the NDJSON response is one JSON record
    list_res = res.text.splitlines()
    json_res = list(map(lambda x: json.loads(x), list_res))
    return pd.DataFrame(json_res)
Using the function
After copying the SQL from the previous exercise to a file called emnify-events-match-exceeded.sql, you can now create the same pandas DataFrame as above by providing the filename to the new function:
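The invocation is a single line; assigning the result to events keeps it compatible with the plotting cells below:

events = queryRaijin('emnify-events-match-exceeded.sql')
events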
When you run this in your Jupyter notebook, you should see the same 2,224 rows of 5 fields displayed above.
Using seaborn to plot data
Within Jupyter Notebook, you can use plotting libraries like seaborn to visualize your data. The seaborn library is especially suited to plotting data distributions from pandas DataFrames. In most cases, data manipulation is not required; a single function call with a few parameters will suffice.
Using the previous DataFrame, the following Jupyter Notebook code cell sets the style of the plot, customizes the title and labels for each axis, and defines a histogram that will aggregate events by device name on the x-axis.
# Remove the top and right borders of the plot
custom_params = {
"axes.spines.right": False,
"axes.spines.top": False,
"ytick.left": True
}
# Apply the custom params and set the theme
sns.set_theme(style='whitegrid', palette='pastel', rc=custom_params)
# Define the seaborn plot
exceeded = sns.histplot(
events,
x='Device',
hue='Device'
)
# Create a title and override the default axis labels
exceeded.set(
title='emnify device policy limits exceeded: (All dates)',
xlabel='Registered Devices',
ylabel='Event Count'
)
Because of the extreme range of event counts across IoT devices, an additional argument is needed to change the y-axis to a logarithmic scale.
# Define the seaborn plot
# Use a logarithmic scale for the y-axis
exceeded = sns.histplot(
events,
x='Device',
hue='Device',
log_scale=(False,True)
)
Identifying duplicate events for exclusion
From the previous exercise, you saw the problem that an extreme range of values presents when plotting data. When an application continually logs the same message in short intervals until the state/error/warning changes, generating a histogram that accurately describes the true distribution of event categories is impossible. NXLog can filter duplicate events. However, you need to first identify which events are duplicates and find a way to define them.
Plotting a histogram of events by day of year
By converting each warning event’s timestamp to the day number of the current year, you can create a dataset for plotting a histogram of the number of warnings per day.
USE emnify;
SELECT DAYOFYEAR(timestamp) AS "Day",
"event_type.description",
timestamp
FROM events
WHERE "event_severity.id" = 1 /* WARNING */
ORDER BY timestamp;
warnings = queryRaijin('emnify-events-warning.sql')
warnings_dist = sns.displot(
warnings,
x='Day',
height=7, # make the plot 7 units high
aspect=2
)
warnings_dist.set(title='emnify Events: Distribution of warnings')
plt.xticks(np.arange(60, 180, step=5), rotation=90)
The description field contains very long, detailed messages that are often very similar but can vary by something as minor as a mobile network provider's IP address. By using only the first 30 characters, you can categorize such events as duplicates despite these insignificant differences.
The SQL for this query:
USE emnify;
SELECT COUNT(*) AS "Duplicates",
"endpoint.name" AS "Device",
DAYOFYEAR(timestamp) AS "Day",
SUBSTRING("description",1,30) AS "Desc"
FROM events
WHERE "event_severity.id" = 1 /* WARNING */
GROUP BY "Day",
"Device",
"Desc"
ORDER BY "Duplicates" DESC,
"Device",
"Day",
"Desc";
The first 15 rows of these query results indicate that over 9,500 of these warning events are duplicates. This shows that 72% of the 13,299 events the emnify Data Streamer sent were essentially worthless.
The histograms above are confined to events that have a severity of Warning. Given a count of 9,748 warning events, this means that less than 3% of the warning events collected from the emnify Data Streamer are significant.
Next, you will learn how to leverage SQL and pandas DataFrames to isolate and remove duplicates from your emnify warning events.
The filter (warnings only) and the new field Desc (the description field truncated to its first 30 characters) will pave the way for dropping superfluous events.
USE emnify;
SELECT DAYOFYEAR(timestamp) AS "Day",
"endpoint.name" AS "Device",
timestamp,
SUBSTRING(description,1,30) AS "Desc"
FROM events
WHERE "event_severity.id" = 1
ORDER BY timestamp;
First, you will construct a Python dictionary called previous that uses the IoT device name as a key and the truncated description Desc as its value. This will serve as a lookup table for checking whether the current device has the same Desc value as the previous one while you iterate through each row of the DataFrame, which Raijin has already sorted by timestamp. If the current IoT device's previous Desc value matches the Desc value of the current row, a duplicate has been detected. In this case, the DataFrame drop() function removes it.
# Create the DataFrame from the query
events = queryRaijin('emnify-events-warning-truncated-description.sql')
# Iterate through each row of the DataFrame to find and drop duplicates
previous = {}
for index, row in events.iterrows():
device = row['Device']
if type(device) == str:
if device in previous:
if row['Desc'] == previous[device]:
# Drop this duplicate row
events = events.drop(labels=index, axis=0)
previous[device] = row['Desc']
# Define the histogram plot
warnings = sns.histplot(
events,
x='Device',
hue='Device'
)
# Create a title and override the default axis labels
warnings.set(
title='emnify device warnings (All dates)',
xlabel='Registered Devices',
ylabel='Event Count'
)
plt.xticks(plt.xticks()[0], rotation=60)
After plotting this cleaned data, you can see that IoT Device 011 only has 4 significant warnings, not 2,155.
Plotting data usage
emnify logs sessions of data usage based on bursts of data transmissions using beginning and ending timestamps:
- start_timestamp
- end_timestamp
If you would like to work with a more logical pair of field names, e.g., start/stop, ts_begin/ts_end, t0/t1, etc., you could use the rename_field() procedure in your NXLog configuration file to customize these or any other field names.
If you want to analyze the duration of data sessions, you could use milliseconds as a unit of time. However, the shortest duration recorded in this dataset of 6,410 sessions was 3.027 seconds, so seconds are granular enough. For aggregating data over any time interval, the best practice is to use the timestamp that marks the completion of the data transmission, in this case end_timestamp.
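If you prefer to compute durations client-side, the following pandas sketch does the equivalent of the SQL duration arithmetic. The SQL filename is hypothetical; it would simply select the raw start_timestamp and end_timestamp columns from the usage table.

# Sketch: compute per-session durations in seconds with pandas.
# 'emnify-usage-raw-timestamps.sql' is a hypothetical query file that
# selects the raw start_timestamp and end_timestamp columns.
sessions = queryRaijin('emnify-usage-raw-timestamps.sql')
sessions['start_timestamp'] = pd.to_datetime(sessions['start_timestamp'])
sessions['end_timestamp'] = pd.to_datetime(sessions['end_timestamp'])
sessions['Duration'] = (
    sessions['end_timestamp'] - sessions['start_timestamp']
).dt.total_seconds()
sessions['Duration'].min()  # shortest session in this dataset: 3.027 seconds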
emnify data usage for a specific month
data_usage = queryRaijin('emnify-data-usage-may-2023.sql')
data_usage
USE emnify;
SELECT "endpoint.name" AS "Device",
SUM("volume.rx") AS "RX",
SUM("volume.tx") AS "TX",
SUM("volume.total") AS "Combined",
SUM( (end_timestamp - start_timestamp)/1000000) AS "Duration",
COUNT("endpoint.name") AS "Sessions"
FROM usage
WHERE YEAR(end_timestamp) = 2023
AND MONTH(end_timestamp) = 5
AND "traffic_type.description" = 'Data'
GROUP BY "endpoint.name",
month
ORDER BY "endpoint.name";
u5 = sns.barplot(
data_usage,
x='Device',
y='Combined',
hue='Device',
dodge=False
)
u5.set(
title='Data usage by emnify-enabled device (May 2023)',
xlabel='',
ylabel='Combined (RX/TX) data throughput (MB)',
)
plt.xticks(plt.xticks()[0], rotation=60)
Create a report of data usage by month
In this exercise, you will create a plot of data usage by month for two categories: data transmitted and data received.
data_usage = queryRaijin('emnify-data-tx-rx-by-month.sql')
data_usage
The SQL query aggregates the appropriate fields by month.
USE emnify;
SELECT SUM("volume.rx") AS "Data RX",
SUM("volume.tx") AS "Data TX",
MONTH(end_timestamp) AS month
FROM usage
WHERE YEAR(end_timestamp) = 2023
AND "traffic_type.id" = 5 /* data usage */
GROUP BY MONTH(end_timestamp)
ORDER BY month;
# Create a new column in the dataset containing the month name
months = {
1:'January',
2:'February',
3:'March',
4:'April',
5:'May',
6:'June',
7:'July',
8:'August',
9:'September',
10:'October',
11:'November',
12:'December'
}
data_usage['Month'] = data_usage['month'].map(months)
data_usage
# Create a "Mode" column as a category to differentiate TX from RX
# Move TX and RX values (data usage in MB) to a new "MB" column
data_usage = data_usage.melt(
id_vars=['month','Month'],
value_vars=['Data RX','Data TX'],
var_name='Mode',
value_name='MB'
)
data_usage
In the following Jupyter Notebook code cell, the seaborn barplot() function takes the Mode category and creates a bar chart with a pair of bars per month, one for Data RX and one for Data TX.
sp = sns.barplot(
data_usage,
x='Month',
y='MB',
hue='Mode'
)
sp.set_yscale("log")
Create a report of SMS usage for May 2023
sms_usage = queryRaijin('emnify-sms-tx-rx-by-device-may-2023.sql')
sms_usage
USE emnify;
SELECT "endpoint.name" AS "Device",
SUM("volume.rx") AS "SMS TX",
SUM("volume.tx") AS "SMS RX"
FROM usage
WHERE YEAR(end_timestamp) = 2023
AND MONTH(end_timestamp) = 5
AND "traffic_type.id" = 6 /* SMS usage */
GROUP BY "endpoint.name"
ORDER BY "Device";
Because emnify uses volume.rx for sent SMS messages and volume.tx for SMS messages received, you need to swap the field labels in the SQL query as shown above. Detailed documentation regarding this issue was made available to emnify in May 2023. To date, no response has been received; thus, one could conclude that this issue might never be pursued.
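Before plotting, the query result needs the same long-format reshaping used for the monthly data usage report, so that the Mode and Count columns referenced below exist. A sketch mirroring the earlier melt() call:

# Reshape to long form: one row per (Device, Mode) with the message count
# in a "Count" column, so the barplot below can use hue='Mode' and y='Count'.
sms_usage = sms_usage.melt(
    id_vars=['Device'],
    value_vars=['SMS TX', 'SMS RX'],
    var_name='Mode',
    value_name='Count'
)
sms_usage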
# Set the y-axis tick values to integer instead of floating point
from matplotlib.ticker import MaxNLocator
ax = plt.figure().gca()
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
# Create the SMS usage bar chart using Mode to differentiate between RX and TX
sms_plt = sns.barplot(
sms_usage,
x='Device',
y='Count',
hue='Mode'
)
# Create a title and remove the superfluous x-axis label
sms_plt.set(
title='emnify SMS usage by device: May 2023',
xlabel=''
)
Summary
By hosting your own emnify metadata, you are no longer limited to the fixed set of reports that emnify provides. Once you are free from the multitenancy of the emnify Portal, you can generate multiple reports in a fraction of a second, rather than waiting a fraction of a minute for a single report.
Now that you have completed this tutorial, you have the foundation for creating custom reports for your organization. Even though these exercises are specific to emnify’s IoT metadata, you are now equipped to apply the concepts you have learned to other types of log data.
Copyright © 2023 John Kirch