
PIG : Reading data from file


To read data from a file, we can use the LOAD operator. Assume there is a file named player.csv (a public dataset of English Premier League players, downloaded from an open data source).

Sample Data from player.csv file

Player id,Player,Position,Number,Club,Club (country),D.O.B,Age,Height (cm),Country,Caps,International goals,Plays in home country
336722,Alan PULIDO,Forward,11,Tigres UANL,Mexico,08.03.1991,23,176,Mexico,5,4,TRUE
368902,Adam TAGGART,Forward,9,Newcastle United Jets FC,Australia,02.06.1993,21,172,Australia,4,3,TRUE
362641,Reza GHOOCHANNEJAD,Forward,16,Charlton Athletic FC,England,20.09.1987,26,181,Iran,13,9,FALSE

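Pig script to load the data. The AS clause declares the record structure of the file (field names and types); without it, the fields are untyped and can only be referenced by position ($0, $1, ...).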

grunt> player_data = LOAD 'player.csv'
       USING PigStorage(',')
       AS
       (player_id:int,
       player:chararray,
       position:chararray,
       number:int,
       club:chararray,
       club_country:chararray,
       d_o_b:chararray,
       age:int,
       height_cm:int,
       country:chararray,
       caps:int,
       international_goals:int,
       plays_home_country:chararray);

grunt> DUMP player_data;

Sample Output

(380000,Marcelo BROZOVIC,Midfielder,14,GNK Dinamo Zagreb,Croatia,16.11.1992,21,180,Croatia,0,0,TRUE)
(380009,Luis LOPEZ,Goalkeeper,1,Real Espana,Honduras,13.09.1993,20,182,Honduras,0,0,TRUE)
(379910,Adnan JANUZAJ,Midfielder,20,Manchester United FC,England,05.02.1995,19,180,Belgium,0,0,FALSE)
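
A few follow-up statements can help check the load. The snippet below is a minimal sketch, assuming the player_data relation defined above; the relation names players_no_header and first_rows are made up for illustration. Because PigStorage treats the CSV header line as an ordinary record, the text "Player id" cannot be cast to int and player_id should come back as null for that row, so filtering on non-null ids is one simple way to drop it. Note that this would also drop any data row whose id is missing or malformed.

```pig
-- Print the schema declared in the AS clause
DESCRIBE player_data;

-- The header line is loaded like any other record; its player_id fails the
-- int cast and becomes null, so keeping only non-null ids removes it.
-- (players_no_header is a hypothetical relation name.)
players_no_header = FILTER player_data BY player_id IS NOT NULL;

-- Inspect a few records instead of dumping the whole relation
first_rows = LIMIT players_no_header 3;
DUMP first_rows;
```

These statements can be typed one at a time at the grunt> prompt or saved in a script file. LIMIT does not guarantee which three records are returned unless the relation is ordered first.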


