Pig: Reading data from a file


To read data from a file, use the LOAD statement. Assume there is a file named player.csv (a public dataset of English Premier League players, downloaded from an open data source).

Sample Data from player.csv file

Player id,Player,Position,Number,Club,Club (country),D.O.B,Age,Height (cm),Country,Caps,International goals,Plays in home country
336722,Alan PULIDO,Forward,11,Tigres UANL,Mexico,08.03.1991,23,176,Mexico,5,4,TRUE
368902,Adam TAGGART,Forward,9,Newcastle United Jets FC,Australia,02.06.1993,21,172,Australia,4,3,TRUE
362641,Reza GHOOCHANNEJAD,Forward,16,Charlton Athletic FC,England,20.09.1987,26,181,Iran,13,9,FALSE

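Pig script to load the data. PigStorage(',') splits each line on commas, and the AS clause declares the name and type of each field so that later statements can refer to fields by name.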

grunt> player_data = LOAD 'player.csv'
       USING PigStorage(',')
       AS
       (player_id:int,
       player:chararray,
       position:chararray,
       number:int,
       club:chararray,
       club_country:chararray,
       d_o_b:chararray,
       age:int,
       height_cm:int,
       country:chararray,
       caps:int,
       international_goals:int,
       plays_home_country:chararray);

grunt> DUMP player_data;

Sample Output

(380000,Marcelo BROZOVIC,Midfielder,14,GNK Dinamo Zagreb,Croatia,16.11.1992,21,180,Croatia,0,0,TRUE)
(380009,Luis LOPEZ,Goalkeeper,1,Real Espana,Honduras,13.09.1993,20,182,Honduras,0,0,TRUE)
(379910,Adnan JANUZAJ,Midfielder,20,Manchester United FC,England,05.02.1995,19,180,Belgium,0,0,FALSE)
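
Once the relation is loaded, it can help to check the schema and preview a few records rather than dumping the whole file. The following is a minimal sketch, assuming the player_data relation defined above; the relation names players and first_players are only illustrative. It also relies on the fact that the header line of the CSV is read as an ordinary record: its first field cannot be cast to int, so player_id comes back as null for that row.

```pig
-- Assumes player_data was loaded with the schema shown above.
-- Print the declared field names and types without running a job.
DESCRIBE player_data;

-- The header line ("Player id,Player,...") is loaded like any other record.
-- Its player_id cannot be cast to int and becomes null, so filter it out.
players = FILTER player_data BY player_id IS NOT NULL;

-- Preview a handful of records instead of dumping the full relation.
first_players = LIMIT players 3;
DUMP first_players;
```

These statements can be typed at the grunt> prompt one at a time or saved in a script file. LIMIT does not guarantee which records are returned, so the preview may differ between runs.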


