Changing the Game When it Comes to Auditing in Big Data – Part 2
In my previous blog post, we enabled auditing at the various levels of your MapR cluster. In this follow-up post, we will analyze the audit logs using Apache Drill to start addressing use cases such as:
- Detecting unauthorized cluster changes and data access
- Complying with regulatory frameworks and legislation
- Building data usage heatmaps of cold, warm, and hot data
- Analyzing data access patterns to find performance improvements
- Applying data protection policies (snapshots and mirroring) to the data that matters
To start, the audit log files generated by MapR auditing capture the type of action performed on the filesystem or MapR-DB table, the date and time of the action, and the specific details of the file or table involved in the activity.
Let’s have a look at an example record in the filesystem audit log after a user creates a file in an auditing-enabled volume and directory. To create the file (assuming the MapR cluster is mounted using our unique NFS capabilities), execute the following:
# touch /mapr/demo.mapr.com/myauditvolume/myauditdirectory/myauditfile
The result of creating this file looks as follows in the filesystem audit log:
# cat /mapr/demo.mapr.com/var/mapr/local/maprdemo/audit/FSAudit.log-2015-08-20-001.json | grep myauditfile
{"timestamp":{"$date":"2015-08-20T07:28:13.487Z"},"operation":"CREATE","uid":0,"ipAddress":"127.0.0.1","nfsServer":"10.0.2.15","parentFid":"2178.32.131412","childFid":"2178.33.131414","childName":"myauditfile","volumeId":138315622,"status":0}
Great! We’ve captured the file creation using the auditing functionality. As you can see, various IDs are mentioned in the audit log (e.g. uid, parentFid, childFid, and volumeId). To convert these IDs to a human-readable format, we can make use of the expandaudit utility. Let’s do that!
Expanding the audit logs to include human-readable information
To prepare the audit logs for analysis, let’s create a separate volume to store the expanded audit logs. To create the volume, execute the following command (don’t forget to set the correct user permissions if required):
# maprcli volume create -name myexpandedauditlogs -path /myexpandedauditlogs
The next step is to run the expandaudit utility and write its output to our newly created volume. The following command expands all the audit logs for the volume ‘myauditvolume’ on the ‘demo.mapr.com’ cluster and writes the output to the ‘myexpandedauditlogs’ folder using the NFS mount:
# /opt/mapr/bin/expandaudit -cluster demo.mapr.com -volumename myauditvolume -o /mapr/demo.mapr.com/myexpandedauditlogs/
Now, when we look at the expanded audit log file and locate the creation of ‘myauditfile’, we can see that, in addition to the IDs mentioned previously, the corresponding file names, volume names, user names, and so on have been added to the log file:
# cat /mapr/demo.mapr.com/myexpandedauditlogs/138315622/maprdemo/FSAudit.log-2015-08-20-001.part.json | grep myauditfile
{"timestamp":{"$date":"2015-08-20T07:28:13.487Z"},"operation":"CREATE","user":"root","uid":0,"ipAddress":"127.0.0.1","nfsServer":"10.0.2.15","parentPath":"/myauditvolume/myauditdirectory","parentFid":"2178.32.131412","childPath":"/myauditvolume/myauditdirectory/myauditfile","childFid":"2178.33.131414","childName":"myauditfile","VolumeName":"myauditvolume","volumeId":138315622,"status":0}
By running the expandaudit utility at a timed interval (e.g. once per day), analysts always have the latest expanded audit log files available. Please note that expanding audit logs consumes cluster resources such as memory and CPU, so plan the execution of this utility on your production cluster accordingly.
The final action to perform before we start analyzing the data with Apache Drill is to separate the filesystem and MapR-DB audit logs into different folders:
mkdir /mapr/demo.mapr.com/myexpandedauditlogs/fsaudit && \
find /mapr/demo.mapr.com/myexpandedauditlogs/ -name 'FS*.json' -exec mv {} /mapr/demo.mapr.com/myexpandedauditlogs/fsaudit/ \; && \
mkdir /mapr/demo.mapr.com/myexpandedauditlogs/dbaudit && \
find /mapr/demo.mapr.com/myexpandedauditlogs/ -name 'DB*.json' -exec mv {} /mapr/demo.mapr.com/myexpandedauditlogs/dbaudit/ \;
That’s it, let’s get those audit logs analyzed using Apache Drill!
Analyze the audit logs using Apache Drill
Since the audit log files are stored in JSON format, we can use Apache Drill to analyze them using ANSI SQL and Business Intelligence tools.
Before we can access the audit data, we need to tell Apache Drill where the data is stored. We can do so by adding the following lines to the Distributed FileSystem (DFS) storage plugin. How to install Apache Drill and add a storage plugin is out of scope for this blog post; please refer to the Apache Drill documentation website for details.
"myexpandedauditlogs": { "location": "/myexpandedauditlogs", "writable": true, "defaultInputFormat": "json" }
With the Storage Plugin configured and pointing towards our ‘myexpandedauditlogs’ folder on the cluster, we can start analyzing the logs using any SQL or Business Intelligence tool.
Using sqlline to analyze the audit logs
Using the sqlline command line utility, we can execute ANSI SQL commands against Apache Drill to analyze the audit logs. To connect to your Apache Drill cluster using sqlline, execute:
/opt/mapr/drill/drill-1.1.0/bin/sqlline -u jdbc:drill:zk=localhost:5181 -n admin -p admin
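Once connected, you can optionally run a quick ad-hoc query directly against the raw expanded JSON files to verify that Drill can read them. The query below is only a minimal sketch, using field names taken from the expanded audit record shown earlier:
0: jdbc:drill:zk=localhost:5181> select `operation`, `user`, `childName`, `status`
    from dfs.myexpandedauditlogs.`/fsaudit` limit 5;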
We will be using database views on the audit JSON log files to ease the typecasting and prepare the data for analysis in your favourite Business Analytics tool. To create the view for the filesystem audit log files:
0: jdbc:drill:zk=localhost:5181> create or replace view dfs.myexpandedauditlogs.fsaudit_view as
    select cast(`fsaudit`.`timestamp` AS TIMESTAMP) as `timestamp`,
           cast(`fsaudit`.`operation` as varchar(255)) as `operation`,
           cast(`fsaudit`.`user` as varchar(255)) as `user`,
           cast(`fsaudit`.`uid` as int) as `uid`,
           cast(`fsaudit`.`ipAddress` as varchar(255)) as `ipAddress`,
           cast(`fsaudit`.`srcPath` as varchar(65536)) as `srcPath`,
           cast(`fsaudit`.`srcFid` as varchar(255)) as `srcFid`,
           cast(`fsaudit`.`volumename` as varchar(255)) as `volumename`,
           cast(`fsaudit`.`volumeid` as int) as `volumeid`,
           cast(`fsaudit`.`status` as varchar(255)) as `status`,
           cast(`fsaudit`.`nfsServer` as varchar(255)) as `nfsServer`,
           cast(`fsaudit`.`srcName` as varchar(255)) as `srcName`,
           cast(`fsaudit`.`parentPath` as varchar(65536)) as `parentPath`,
           cast(`fsaudit`.`parentFid` as varchar(255)) as `parentFid`,
           cast(`fsaudit`.`childPath` as varchar(65536)) as `childPath`,
           cast(`fsaudit`.`childFid` as varchar(255)) as `childFid`,
           cast(`fsaudit`.`childName` as varchar(255)) as `childName`,
           cast(`fsaudit`.`dstPath` as varchar(65536)) as `dstPath`,
           cast(`fsaudit`.`dstFid` as varchar(255)) as `dstFid`
    from dfs.myexpandedauditlogs.`/fsaudit` fsaudit;
This view allows us to analyze the filesystem audit logs using your favourite Business Intelligence tooling. Let’s also create a view to analyze the MapR-DB audit logs:
0: jdbc:drill:zk=localhost:5181> create or replace view dfs.myexpandedauditlogs.dbaudit_view as
    select cast(`dbaudit`.`timestamp` AS TIMESTAMP) as `timestamp`,
           cast(`dbaudit`.`operation` as varchar(255)) as `operation`,
           cast(`dbaudit`.`user` as varchar(255)) as `user`,
           cast(`dbaudit`.`uid` as int) as `uid`,
           cast(`dbaudit`.`ipAddress` as varchar(255)) as `ipAddress`,
           cast(`dbaudit`.`srcName` as varchar(255)) as `srcName`,
           cast(`dbaudit`.`VolumeName` as varchar(255)) as `VolumeName`,
           cast(`dbaudit`.`volumeId` as int) as `volumeId`,
           cast(`dbaudit`.`parentPath` as varchar(65536)) as `parentPath`,
           cast(`dbaudit`.`parentFid` as varchar(255)) as `parentFid`,
           cast(`dbaudit`.`tablePath` as varchar(65536)) as `tablePath`,
           cast(`dbaudit`.`tableFid` as varchar(255)) as `tableFid`,
           cast(`dbaudit`.`status` as varchar(255)) as `status`,
           cast(`dbaudit`.`columnFamily` as varchar(255)) as `columnFamily`,
           cast(`dbaudit`.`columnQualifier` as varchar(255)) as `columnQualifier`
    from dfs.myexpandedauditlogs.`/dbaudit` dbaudit;
Now that we have our SQL views in place, let’s test them before analyzing our audit logs by running a simple SQL query to count the number of filesystem and MapR-DB audit records.
To count the number of filesystem audit records using the newly created view:
0: jdbc:drill:zk=localhost:5181> select count(1) from dfs.myexpandedauditlogs.fsaudit_view;
To count the number of MapR-DB audit records using the newly created view:
0: jdbc:drill:zk=localhost:5181> select count(1) from dfs.myexpandedauditlogs.dbaudit_view;
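Beyond a simple count, you can already peek at the most recent filesystem activity through the view. This is just an illustrative query using the columns defined in fsaudit_view above:
0: jdbc:drill:zk=localhost:5181> select `timestamp`, `user`, `operation`, `childPath`, `status`
    from dfs.myexpandedauditlogs.fsaudit_view
    order by `timestamp` desc limit 10;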
That’s it! You are all set to analyze your audit logs by simply pointing your Business Intelligence tool at the newly created views. In the following video, my colleague Nick Amato shows how to use Tableau, one such BI tool, to analyze the audit logs: https://www.youtube.com/watch?v=AwEgOixs4ZU
With that, you’re all set to start addressing use cases such as:
- Detecting unauthorized cluster changes and data access
- Complying with regulatory frameworks and legislation
- Building data usage heatmaps of cold, warm, and hot data
- Analyzing data access patterns to find performance improvements
- Applying data protection policies (snapshots and mirroring) to the data that matters
The example queries below sketch how you might approach two of these.
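The two sketches below hint at how the views could support the first and third items. They assume that a non-zero status value indicates a failed (for instance, denied) filesystem operation, and that the number of operations recorded per path is a reasonable proxy for how hot the data is; adapt the columns and filters to your own environment.
0: jdbc:drill:zk=localhost:5181> select `user`, `operation`, `childPath`, count(*) as attempts  -- assumes non-zero status = failed operation
    from dfs.myexpandedauditlogs.fsaudit_view
    where `status` <> '0'
    group by `user`, `operation`, `childPath`
    order by attempts desc;
0: jdbc:drill:zk=localhost:5181> select coalesce(`srcPath`, `childPath`) as `path`,  -- the path lands in srcPath or childPath depending on the operation
           cast(`timestamp` as date) as `day`, count(*) as operations
    from dfs.myexpandedauditlogs.fsaudit_view
    group by coalesce(`srcPath`, `childPath`), cast(`timestamp` as date)
    order by operations desc;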
Conclusion
In my previous blog post, I described the powerful capabilities of MapR auditing and guided you through the easy setup of this feature. In this follow-up post, I’ve shown you how to set up your MapR environment to analyze the audit log information. This allows you to answer very important questions about who is doing what, from both a cluster management and a data usage point of view, as well as to address other use cases related to compliance, performance, and data protection policies.