High-Performance Computing Clusters (HPCC) and Cassandra on OS X
Our new parent company, LexisNexis, has one of the world’s largest public records databases:
“…our comprehensive collection of more than 46 billion records from more than 10,000 diverse sources—including public, private, regulated, and derived data. You get comprehensive information on approximately 269 million individuals and 277 million unique businesses.”
http://www.lexisnexis.com/en-us/products/public-records.page
And they’ve been managing, analyzing and searching this database for decades. Over that time period, they’ve built up quite an assortment of “Big Data” technologies. Collectively, LexisNexis refers to those technologies as their High-Performance Computing Cluster (HPCC) platform.
HPCC is entirely open source.
Naturally, we are working through the marriage of HPCC with our real-time data management and analytics stack. The potential is really exciting. Specifically, HPCC has sophisticated machine learning and statistics libraries, and a query engine (Roxie) capable of serving up those statistics.
Lo and behold, HPCC can use Cassandra as a backend storage mechanism! (FTW!)
The HPCC platform isn’t technically supported on a Mac, but here is what I did to get it running:
HPCC Install
- Clone the GitHub repository and its submodules (git submodule update --init --recursive)
- Pull my patches (https://github.com/hpcc-systems/HPCC-Platform/pull/7166)
- Install the dependencies using brew
brew install icu4c
brew install boost
brew install libarchive
brew install bison27
brew install openldap
brew install nodejs
- Make a build directory, and run cmake from there:
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
cmake ../ \
  -DICU_LIBRARIES=/usr/local/opt/icu4c/lib/libicuuc.dylib \
  -DICU_INCLUDE_DIR=/usr/local/opt/icu4c/include \
  -DLIBARCHIVE_INCLUDE_DIR=/usr/local/opt/libarchive/include \
  -DLIBARCHIVE_LIBRARIES=/usr/local/opt/libarchive/lib/libarchive.dylib \
  -DBOOST_REGEX_LIBRARIES=/usr/local/opt/boost/lib \
  -DBOOST_REGEX_INCLUDE_DIR=/usr/local/opt/boost/include \
  -DUSE_OPENLDAP=true \
  -DOPENLDAP_INCLUDE_DIR=/usr/local/opt/openldap/include \
  -DOPENLDAP_LIBRARIES=/usr/local/opt/openldap/lib/libldap_r.dylib \
  -DCLIENTTOOLS_ONLY=false \
  -DPLATFORM=true
- Then, compile and install with (sudo make install)
- After that, you’ll need to muck with the permissions a bit:
chmod -R a+rwx /opt/HPCCSystems/
chmod -R a+rwx /var/lock/HPCCSystems
chmod -R a+rwx /var/log/HPCCSystems
- Now, ordinarily you would run hpcc-init to get the system configured, but that script fails on OS X, so I used Linux to generate working config files and posted them to a repository here: https://github.com/boneill42/hpcc_on_mac
- Clone this repository and replace /var/lib/HPCCSystems with the content of var_lib_hpccsystems.zip
sudo rm -fr /var/lib/HPCCSystems
sudo unzip var_lib_hpccsystems.zip -d /var/lib
chmod -R a+rwx /var/lib/HPCCSystems
- Then, from the directory containing the xml files in this repository, you can run:
- daserver: (Runs the Dali server, which is the persistence mechanism for HPCC)
- esp: (Runs the ESP server, which is the web services and UI layer for HPCC)
- eclccserver: (Runs the ECL compile server, which takes the ECL and compiles it down to C++ and then a dynamic library)
- roxie: (Runs the Roxie server, which is capable of responding to queries)
- Kick off each one of those, then go to http://localhost:8010 in a browser. You are ready to run some ECL!
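The four launches above can be wrapped in a small helper. This is only a sketch: it assumes the four commands are on your PATH and that you run it from the directory containing the xml files. The names `start_hpcc`, `HPCC_COMPONENTS`, `DRY_RUN` and `START_DELAY` are made up for illustration, and the pause between launches is a guess rather than a documented requirement.

```shell
# Component commands as listed in the steps above; Dali comes first.
HPCC_COMPONENTS="daserver esp eclccserver roxie"

# Start each component in the background from the current directory, writing
# one log file per component. Set DRY_RUN=1 to print the plan without
# launching anything. START_DELAY (seconds) is an assumed pause between
# launches, not a documented requirement.
start_hpcc() {
    for c in $HPCC_COMPONENTS; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would start: $c"
            continue
        fi
        "$c" > "$c.log" 2>&1 &
        echo "started $c (pid $!)"
        sleep "${START_DELAY:-2}"
    done
}

# Usage (from the directory with the xml files):
#   DRY_RUN=1 start_hpcc   # print what would be launched
#   start_hpcc             # launch for real
```

If a component dies right away, its log file (for example esp.log) is the first place to look.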
Running ECL
Like Pig with Hadoop, HPCC runs a DSL called ECL. More information on ECL can be found here: http://hpccsystems.com/download/docs/learning-ecl
- As a simple smoke test, go into your HPCC-Platform repository, and go under: ./testing/regress/ecl.
- Then, run the following:
ecl run hello.ecl --target roxie --server=localhost:8010
<dataset name="Result 1">
 <row><result_1>Hello world</result_1></row>
</dataset>
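If you want the smoke test to be scriptable, one option is to grep the output for the expected row. A minimal sketch, assuming the XML result format shown above; `check_hello` is a hypothetical helper name, not part of HPCC.

```shell
# Hypothetical helper: reads `ecl run` output on stdin and succeeds only if
# the expected "Hello world" result row is present.
check_hello() {
    grep -q '<result_1>Hello world</result_1>'
}

# Usage:
#   ecl run hello.ecl --target roxie --server=localhost:8010 | check_hello \
#     && echo "smoke test passed"
```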
Cassandra Plugin
With HPCC up and running, we are ready to have some fun with Cassandra. HPCC has plugins, which reside in /opt/HPCCSystems/plugins. I had to copy those libraries into /opt/HPCCSystems/lib to get HPCC to recognize them.
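The copy step described above could look roughly like this. It is a sketch, not something from the HPCC docs: `copy_plugins` is a hypothetical helper, and both directories are passed as arguments so nothing about your install layout is assumed.

```shell
# Hypothetical helper: copy every regular file from a plugin directory into
# a library directory. Both paths are arguments, nothing is hard-coded.
copy_plugins() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "no such plugin directory: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue      # skip subdirectories and empty globs
        cp -p "$f" "$dst"/ || return 1
    done
}

# Usage:
#   copy_plugins <your plugin directory> /opt/HPCCSystems/lib
```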
Go back to the testing/regress/ecl directory and have a look at cassandra-simple.ecl. A snippet is shown below:
childrec := RECORD
   string name,
   integer4 value { default(99999) },
   boolean boolval { default(true) },
   real8 r8 {default(99.99)},
   real4 r4 {default(999.99)},
   DATA d {default (D'999999')},
   DECIMAL10_2 ddd {default(9.99)},
   UTF8 u1 {default(U'9999 ß')},
   UNICODE u2 {default(U'9999 ßßßß')},
   STRING a,
   SET OF STRING set1,
   SET OF INTEGER4 list1,
   LINKCOUNTED DICTIONARY(maprec) map1{linkcounted};
END;

init := DATASET([{'name1', 1, true, 1.2, 3.4, D'aa55aa55', 1234567.89, U'Straße', U'Straße','Ascii',['one','two','two','three'],[5,4,4,3],[{'a'=>'apple'},{'b'=>'banana'}]},
                 {'name2', 2, false, 5.6, 7.8, D'00', -1234567.89, U'là', U'là','Ascii', [],[],[]}], childrec);

load(dataset(childrec) values) := EMBED(cassandra : user('boneill'),keyspace('test'),batch('unlogged'))
  INSERT INTO tbl1 (name, value, boolval, r8, r4,d,ddd,u1,u2,a,set1,list1,map1) values (?,?,?,?,?,?,?,?,?,?,?,?,?);
ENDEMBED;
In this example, we define childrec as a RECORD with a set of fields. We then create a DATASET of type childrec. Then we define a method that takes a dataset of type childrec and runs the Cassandra insert command for each of the records in the dataset.
Start up Cassandra locally: download Cassandra, unzip it, then run bin/cassandra -f (the -f keeps it in the foreground).
Once Cassandra is up, simply run the ECL like you did the hello program.
ecl run cassandra-simple.ecl --target roxie --server=localhost:8010
You can then go over to cqlsh and validate that all the data made it back into Cassandra:
➜ cassandra bin/cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh> select * from test.tbl1 limit 5;

 name      | a | boolval | d              | ddd  | list1 | map1 | r4     | r8     | set1 | u1     | u2        | value
-----------+---+---------+----------------+------+-------+------+--------+--------+------+--------+-----------+-------
 name1575  |   | True    | 0x393939393939 | 9.99 | null  | null | 1576.6 | 1575   | null | 9999 ß | 9999 ßßßß | 1575
 name3859  |   | True    | 0x393939393939 | 9.99 | null  | null | 3862.9 | 3859   | null | 9999 ß | 9999 ßßßß | 3859
 name11043 |   | True    | 0x393939393939 | 9.99 | null  | null | 11054  | 11043  | null | 9999 ß | 9999 ßßßß | 11043
 name3215  |   | True    | 0x393939393939 | 9.99 | null  | null | 3218.2 | 3215   | null | 9999 ß | 9999 ßßßß | 3215
 name7608  |   | False   | 0x393939393939 | 9.99 | null  | null | 7615.6 | 7608.1 | null | 9999 ß | 9999 ßßßß | 7608
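To check the result from a script instead of by eye, you could pull the row count out of the cqlsh output. This sketch assumes cqlsh ends a result set with a footer line such as "(5 rows)", which is not shown in the output above, so verify that against your cqlsh version. `rows_returned` is a hypothetical helper name.

```shell
# Hypothetical helper: reads cqlsh output on stdin and prints the number from
# a "(N rows)" footer line, if one is present. Prints nothing otherwise.
rows_returned() {
    sed -n 's/^[[:space:]]*(\([0-9][0-9]*\) rows\{0,1\})[[:space:]]*$/\1/p'
}

# Usage:
#   echo "select * from test.tbl1 limit 5;" | bin/cqlsh | rows_returned
```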
OK, that should give a little taste of ECL and HPCC. It is a powerful platform.
As always, let me know if you run into any trouble.
Reference: High-Performance Computing Clusters (HPCC) and Cassandra on OS X from our JCG partner Brian ONeill at the Brian ONeill’s Blog blog.