Seven Databases in Seven Weeks – Hbase Day 2
This post is a recap of the second day of Hbase from the Seven Databases in Seven Weeks book.
Most of the commands and scripts can be found at GitHub: https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_2
Streaming Script
The first thing in day 2 was to download lots of data (big data) and stream it into Hbase. There’s a JRuby script, which I had to alter in order for it to work: https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/import_from_wikipedia.rb
After altering it, as the book suggested, I had to add some compression to the column family. After that, I could run the script:
curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_from_wikipedia.rb curl http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell import_from_wikipedia.rb
This is the output while the script runs
1 10.0G 1 128M 0 0 456k 0 6:23:37 0:04:48 6:18:49 817k19000 records inserted (Serotonin) 1 10.0G 1 131M 0 0 461k 0 6:19:03 0:04:51 6:14:12 921k19500 records inserted (Serotonin specific reuptake inhibitors) 1 10.0G 1 135M 0 0 469k 0 6:12:45 0:04:54 6:07:51 1109k20000 records inserted (Tennis court) 1 10.0G 1 138M 0 0 477k 0 6:06:12 0:04:57 6:01:15 1269k20500 records inserted (Tape drive)
The next part in this chapter talks about regions and some other plumbing stuff.
Build links table
In this part the source is the large Wiki table and the output is ‘links’ table. Each link has ‘From:’ and ‘To:’. Here’s a link to the altered working script: https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/generate_wiki_links.rb
The rest of the chapter shows how to look at the data, count it and more.
Homework
The main part in the homework, was to create a new table: ‘foods’ that takes data from an XML, which can be downloaded from the US’s health & nutrition site. This data shows the nutrition facts per type of food.
I decided to create a very simple table. The column family does not have any special options. I created one column family:facts. Each row data from the XML file will be part of facts. I also decided that the row’s key would be the Display_Name. After all, it’s much easier to look by key and not by some ID.
create 'foods' , 'facts'
In order to see how I should create the script I looked at two sources:
- The script that imported data for the Wiki table
- One element (food) from the XML
Here’s one element:
<Food_Display_Row> <Food_Code>12350000</Food_Code> <Display_Name>Sour cream dip</Display_Name> <Portion_Default>1.00000</Portion_Default> <Portion_Amount>.25000</Portion_Amount> <Portion_Display_Name>cup </Portion_Display_Name> <Factor>.25000</Factor> <Increment>.25000</Increment> <Multiplier>1.00000</Multiplier> <Grains>.04799</Grains> <Whole_Grains>.00000</Whole_Grains> <Vegetables>.04070</Vegetables> <Orange_Vegetables>.00000</Orange_Vegetables> <Drkgreen_Vegetables>.00000</Drkgreen_Vegetables> <Starchy_vegetables>.00000</Starchy_vegetables> <Other_Vegetables>.04070</Other_Vegetables> <Fruits>.00000</Fruits> <Milk>.00000</Milk> <Meats>.00000</Meats> <Soy>.00000</Soy> <Drybeans_Peas>.00000</Drybeans_Peas> <Oils>.00000</Oils> <Solid_Fats>105.64850</Solid_Fats> <Added_Sugars>1.57001</Added_Sugars> <Alcohol>.00000</Alcohol> <Calories>133.65000</Calories> <Saturated_Fats>7.36898</Saturated_Fats> </Food_Display_Row>
I created the script by examining the wiki script and one element. Opening a document is when seeing an open XML element tag: Food_Display_Row. When seeing Food_Display_Row as the close tag, the script creates the document.
include Java import 'org.apache.hadoop.hbase.client.HTable' import 'org.apache.hadoop.hbase.client.Put' import 'org.apache.hadoop.hbase.HBaseConfiguration' import 'javax.xml.stream.XMLStreamConstants' def jbytes( *args ) args.map { |arg| arg.to_s.to_java_bytes } end factory = javax.xml.stream.XMLInputFactory.newInstance reader = factory.createXMLStreamReader(java.lang.System.in) document = nil buffer = nil count = 0 puts( @hbase ) conf = HBaseConfiguration.new table = HTable.new( conf, "foods" ) table.setAutoFlush( false ) while reader.has_next type = reader.next if type == XMLStreamConstants::START_ELEMENT # (3) case reader.local_name when 'Food_Display_Row' then document = {} when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/ then buffer = [] when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/ then buffer = [] when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/ then buffer = [] when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/ then buffer = [] end elsif type == XMLStreamConstants::CHARACTERS buffer << reader.text unless buffer.nil? elsif type == XMLStreamConstants::END_ELEMENT case reader.local_name when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/ document[reader.local_name] = buffer.join when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/ document[reader.local_name] = buffer.join when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/ document[reader.local_name] = buffer.join when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/ document[reader.local_name] = buffer.join when 'Food_Display_Row' key = document['Display_Name'].to_java_bytes p = Put.new( key ) p.add( *jbytes( "facts", "Display_Name", document['Display_Name'] ) ) p.add( *jbytes( "facts", "Portion_Default", document['Portion_Default'] ) ) p.add( *jbytes( "facts", "Portion_Amount", document['Portion_Amount'] ) ) p.add( *jbytes( "facts", "Portion_Display_Name", document['Portion_Display_Name'] ) ) p.add( *jbytes( "facts", "Factor", document['Factor'] ) ) p.add( *jbytes( "facts", "Increment", document['Increment'] ) ) p.add( *jbytes( "facts", "Multiplier", document['Multiplier'] ) ) p.add( *jbytes( "facts", "Grains", document['Grains'] ) ) p.add( *jbytes( "facts", "Whole_Grains", document['Whole_Grains'] ) ) p.add( *jbytes( "facts", "Vegetables", document['Vegetables'] ) ) p.add( *jbytes( "facts", "Orange_Vegetables", document['Orange_Vegetables'] ) ) p.add( *jbytes( "facts", "Drkgreen_Vegetables", document['Drkgreen_Vegetables'] ) ) p.add( *jbytes( "facts", "Starchy_vegetables", document['Starchy_vegetables'] ) ) p.add( *jbytes( "facts", "Other_Vegetables", document['Other_Vegetables'] ) ) p.add( *jbytes( "facts", "Fruits", document['Fruits'] ) ) p.add( *jbytes( "facts", "Milk", document['Milk'] ) ) p.add( *jbytes( "facts", "Meats", document['Meats'] ) ) p.add( *jbytes( "facts", "Drybeans_Peas", document['Drybeans_Peas'] ) ) p.add( *jbytes( "facts", "Soy", document['Soy'] ) ) p.add( *jbytes( "facts", "Oils", document['Oils'] ) ) p.add( *jbytes( "facts", "Solid_Fats", document['Solid_Fats'] ) ) p.add( *jbytes( "facts", "Added_Sugars", document['Added_Sugars'] ) ) p.add( *jbytes( "facts", "Alcohol", document['Alcohol'] ) ) p.add( *jbytes( "facts", "Calories", document['Calories'] ) ) p.add( *jbytes( "facts", "Saturated_Fats", document['Saturated_Fats'] ) ) table.put( p ) count += 1 table.flushCommits() if count % 10 == 0 if count % 500 == 0 puts "#{count} records inserted (#{document['Display_Name']})" end end end end table.flushCommits() exit
Following are the shell commands that take the XML file and stream them to Hbase. The first command runs against the file with the single element. After I verified the correctness, I ran it against to full file.
curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/food-display-example.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/MyFoodapediaData/Food_Display_Table.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb
Let’s get some food…
get 'foods' , 'fruit smoothie made with milk'
And the result:
COLUMN CELL facts:Added_Sugars timestamp=1399932481440, value=82.54236 facts:Alcohol timestamp=1399932481440, value=.00000 facts:Calories timestamp=1399932481440, value=197.96000 facts:Display_Name timestamp=1399932481440, value=fruit smoothie made with milk facts:Drkgreen_Vegetables timestamp=1399932481440, value=.00000 facts:Drybeans_Peas timestamp=1399932481440, value=.00000 facts:Factor timestamp=1399932481440, value=1.00000 facts:Fruits timestamp=1399932481440, value=.56358 facts:Grains timestamp=1399932481440, value=.00000 facts:Increment timestamp=1399932481440, value=.25000 facts:Meats timestamp=1399932481440, value=.00000 facts:Milk timestamp=1399932481440, value=.22624 facts:Multiplier timestamp=1399932481440, value=.25000 facts:Oils timestamp=1399932481440, value=.00808 facts:Orange_Vegetables timestamp=1399932481440, value=.00000 facts:Other_Vegetables timestamp=1399932481440, value=.00000 facts:Portion_Amount timestamp=1399932481440, value=1.00000 facts:Portion_Default timestamp=1399932481440, value=2.00000 facts:Portion_Display_Name timestamp=1399932481440, value=cup facts:Saturated_Fats timestamp=1399932481440, value=1.91092 facts:Solid_Fats timestamp=1399932481440, value=24.14304 facts:Soy timestamp=1399932481440, value=.00000 facts:Starchy_vegetables timestamp=1399932481440, value=.00000 facts:Vegetables timestamp=1399932481440, value=.00000 facts:Whole_Grains timestamp=1399932481440, value=.00000
Reference: | Seven Databases in Seven Weeks – Hbase Day 2 from our JCG partner Eyal Golan at the Learning and Improving as a Craftsman Developer blog. |
Hey,
Thank you for the post. I am facing difficulties in the long java code. Whenever I write it, my vm doesn’t respond.
May I know how to execute these?