

Hadoop for beginners

I just completed my Hadoop fundamentals course on Udemy.com. The videos were well organized, giving a glance at what this world of big data is and how the Hadoop framework can play a major role in processing it. The course insisted on downloading the Hortonworks Hadoop development sandbox and working with it. Hortonworks provides the Hadoop environment setup as a download that can be loaded into a virtual machine; I downloaded the VirtualBox sandbox file. The course gave a strong insight into the Hadoop architecture and the buzzwords around it, and an in-depth idea of the Hive and Pig tools and how they play a key role in storing and processing data in the framework.

Sqoop example to pull data from MySQL

Example 1, using a password file:

sqoop import --connect jdbc:mysql://<hostname>/<databasename> --username <username> --password-file <password file location> --table <tablename> -m 1 --compression-codec=snappy --as-avrodatafile --warehouse-dir <hdfs location for creating datafiles>

Example 2, passing the password directly:

sqoop import --connect jdbc:mysql://<hostname>/<databasename> --username <username> --password <password> --table <tablename> -m 1 --compression-codec=snappy --as-avrodatafile --warehouse-dir <hdfs location for creating datafiles>

When importing into Hive (with --hive-import), the --hive-overwrite flag can be used to overwrite data pulled in an earlier run.
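To make the placeholders concrete, here is how the first form might look with hypothetical values filled in; the host, database, table and paths below are illustrative only, not from any real setup:

sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username sqoopuser \
  --password-file /user/sqoopuser/.mysql-password \
  --table customers \
  -m 1 \
  --compression-codec=snappy \
  --as-avrodatafile \
  --warehouse-dir /user/hive/warehouse

The -m 1 option runs the import with a single mapper, which also avoids the need for a --split-by column on tables without a primary key.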

PIG : Reading data from a file

To read data from a file we can use the LOAD command. Assume there is a file named players.csv (a public dataset of English Premier League players downloaded from one of the open data sets). Sample data from the players.csv file:

Player id,Player,Position,Number,Club,Club (country),D.O.B,Age,Height (cm),Country,Caps,International goals,Plays in home country
336722,Alan PULIDO,Forward,11,Tigres UANL,Mexico,08.03.1991,23,176,Mexico,5,4,TRUE
368902,Adam TAGGART,Forward,9,Newcastle United Jets FC,Australia,02.06.1993,21,172,Australia,4,3,TRUE
362641,Reza GHOOCHANNEJAD,Forward,16,Charlton Athletic FC,England,20.09.1987,26,181,Iran,13,9,FALSE

Pig script to load the data. We must specify the record structure of the file:

grunt> player_data = LOAD 'players.csv' USING PigStorage(',') AS (player_id:int, player:chararray, position:chararray, number:int, club:chararray, club_country:chararray, d_o_b:chararray, age:int, height_cm:int, country:chararray, caps:int, international_goals:int, plays_in_home_country:chararray);
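Once loaded, the relation can be queried in the same grunt session. A minimal sketch, assuming the player_data relation and schema above (the relation names forwards, by_country and counts are made up for illustration):

grunt> forwards = FILTER player_data BY position == 'Forward';
grunt> by_country = GROUP forwards BY country;
grunt> counts = FOREACH by_country GENERATE group, COUNT(forwards);
grunt> DUMP counts;

This keeps only the forwards, groups them by national team, and prints a count per country.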

HADOOP : Load and store commands in PIG

The script below shows how to load a data file for processing using the LOAD command, and then write it back out:

grunt> VAR = LOAD 'sample.csv' USING PigStorage() AS (lines:chararray);
grunt> STORE VAR INTO 'sampleout.csv' USING PigStorage();

The STORE command writes the contents of the relation to the given output location.
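One detail worth noting: the path given to STORE is created as a directory in HDFS, and the records are written to part files inside it. A quick way to check, assuming the sampleout.csv output from above:

> hadoop fs -ls sampleout.csv
> hadoop fs -cat sampleout.csv/part-*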

HADOOP : Get a file from the local system into HDFS

Suppose we have a file on our system (locally present) and need to move it to HDFS; use the following command:

> hadoop fs -copyFromLocal <source location in local system> <target location in HDFS>

For example, suppose we have a file "sample.csv" and need to move it to the FILES directory in the HDFS system:

> hadoop fs -copyFromLocal sample.csv FILES
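To confirm the copy and bring the file back out of HDFS, the complementary commands are -ls and -copyToLocal (the local destination name below is arbitrary):

> hadoop fs -ls FILES
> hadoop fs -copyToLocal FILES/sample.csv ./sample.csv

For this use case, hadoop fs -put behaves the same way as -copyFromLocal and is often used interchangeably.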