
Developing a small Haskell parser

After watching a few Rich Hickey talks, I decided to take a look at functional programming. After some research online I chose to learn Haskell. The first time I saw Haskell code I was scared, like most people. But then... Haskell is simply awesome. If you are a newbie, I'd suggest you read this book; it will be useful for any developer. After reading it I decided to write a small web crawler to summarize what I had learned, and to write a post about it. Please note that this post is just my notes to consolidate my own knowledge; I am not trying to explain FP or Haskell monads clearly. If you don't know Haskell, I suggest you read the book first :)

My parser has a few modules: network access, parsing, and database access (PostgreSQL). Here are my notes on how I built it, module by module.

Network access

To begin, I created a very small module that downloads HTML by URL. It needs the HTTP package. With it we can write a simple function that downloads the HTML:

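import Network.HTTP (simpleHTTP, getRequest, getResponseBody)
import Codec.Binary.UTF8.String (decodeString)  -- from the utf8-string package

-- Download the page at the given URL and decode the body as UTF-8.
getHTML :: String -> IO String
getHTML url = do
  rsp  <- simpleHTTP (getRequest url)
  body <- getResponseBody rsp
  return (decodeString body)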

The first line is the type signature: the function takes a String and returns an IO String. To understand what IO means, you need to read about monads and Haskell IO. The code looks imperative, but it is not: do notation is just syntactic sugar for the monad operators >> and >>=. For example, if you have
main = do
  putStrLn "Line 1"
  putStrLn "Line 2"
it can be rewritten as
main = putStrLn "Line 1" >> putStrLn "Line 2"

Similarly,
main = do
  name <- getLine
  putStrLn name

is just sugar for:
main = getLine >>= \name -> putStrLn name

return is also a monad function. It does not return from the function the way it does in imperative languages; it just wraps a value into the monad.
I hope this is clearer now.
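
To make the desugaring concrete, here is a minimal sketch in the Maybe monad rather than IO, so the two versions can be compared directly. The ages table and all function names are made up for this illustration.

```haskell
-- Hypothetical lookup table, used only for this illustration.
ages :: [(String, Int)]
ages = [("alice", 30), ("bob", 25)]

-- Written with do notation.
sumAgesDo :: String -> String -> Maybe Int
sumAgesDo a b = do
  x <- lookup a ages
  y <- lookup b ages
  return (x + y)

-- The same function after desugaring the do block by hand.
sumAgesBind :: String -> String -> Maybe Int
sumAgesBind a b =
  lookup a ages >>= \x ->
  lookup b ages >>= \y ->
  return (x + y)

-- return does not exit the block: the line after it still runs,
-- and the value of the whole block comes from the last line.
notAnExit :: Maybe Int
notAnExit = do
  _ <- return 1
  return 2
```

Both sumAgesDo and sumAgesBind give Nothing as soon as one lookup fails, which is the Maybe monad's version of "stop here".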

Parsing

Now that we have the HTML, we can create a module that parses it. Here is an example of a function that extracts links.

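import Text.XML.HXT.Core
import Text.XML.HXT.CSS (css)  -- css selector arrow, assumed to come from the hxt-css package

-- Collect the href attribute of every <a> element in the document.
parseLinks :: String -> IO [String]
parseLinks html = do
  let doc = readString [withParseHTML yes, withWarnings no] html
  runX $ doc >>> css "a" >>> getAttrValue "href"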

For parsing I used the HXT library. At first it is scary to see something like runX $ doc in Haskell. Actually, $ is just function application with very low precedence, used to avoid parentheses: everything to its right is passed as the argument to the function on its left :) It could be replaced with: runX (doc >>> css "a"….). You will also often find function composition (the . operator) in functional programs.
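
As a small illustration of $ and function composition, the sketch below writes the same computation three ways. The names and the list are made up for the example and have nothing to do with HXT.

```haskell
-- Plain parentheses.
withParens :: Int
withParens = sum (map (* 2) (filter even [1 .. 10]))

-- The same expression with $ instead of nested parentheses.
withDollar :: Int
withDollar = sum $ map (* 2) $ filter even [1 .. 10]

-- Function composition builds a new function without naming its argument.
doubleEvens :: [Int] -> Int
doubleEvens = sum . map (* 2) . filter even
```

All three compute the same number; doubleEvens [1 .. 10] is the composed form applied to the same list.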

Database

For DB access we'll use the HDBC PostgreSQL library. First of all, let's create an algebraic data type Page, written with record syntax:
data Page = Page { pageId :: Integer,
                pageTitle :: String,
                pageKeywords :: String,
                pageDescription :: String,
                pageOgImage :: String,
                pageOgDescription :: String,
                url :: String } deriving (Show)

Note that type names (and data constructors) must start with a capital letter.
Now let's create a function that returns the list of parsed pages.
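
Record syntax gives you more than a constructor. The sketch below uses a trimmed-down Page with only three of the fields; emptyPage, titles and withTitle are hypothetical helpers for this illustration, not part of the crawler.

```haskell
-- A shortened version of the Page record, for illustration only.
data Page = Page
  { pageId    :: Integer
  , pageTitle :: String
  , url       :: String
  } deriving (Show, Eq)

emptyPage :: Page
emptyPage = Page { pageId = 0, pageTitle = "", url = "" }

-- Each field name is also a function from the record to that field.
titles :: [Page] -> [String]
titles = map pageTitle

-- Record update syntax copies a value and changes only the named fields.
withTitle :: String -> Page -> Page
withTitle t p = p { pageTitle = t }
```

For example, withTitle "Home" emptyPage keeps pageId and url as they were and only replaces the title.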

import Database.HDBC (IConnection, SqlValue(..), quickQuery')
import Data.ByteString.UTF8 (toString)  -- from the utf8-string package

selectPages :: IConnection a => a -> IO [Page]
selectPages con = do
  result <- quickQuery' con "select * from page" []
  return $ map unpack result
  where
    -- The pattern relies on the column order of the page table.
    unpack [SqlInteger page_id, SqlByteString page_title, SqlByteString page_keywords,
            SqlByteString page_description, SqlByteString page_og_image,
            SqlByteString page_og_description, SqlByteString domain] =
      Page { pageId = page_id,
             pageTitle = toString page_title,
             pageKeywords = toString page_keywords,
             pageDescription = toString page_description,
             pageOgImage = toString page_og_image,
             pageOgDescription = toString page_og_description,
             url = toString domain }

This looks a bit harder, right? But on a second look you'll see:
  1. We run quickQuery' with 3 arguments: the connection, the query text, and a list of parameters. Parameters are written as "?" placeholders in the query text, and the driver escapes the values bound to them.
  2. The function map applies a function to each list item, so we are actually calling unpack on each result row. unpack is defined in the where section.
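
The "map a pattern-matching function over the rows" idea can be tried without a database. In the sketch below, Cell is a hypothetical stand-in for HDBC's SqlValue and Page is shortened to three fields. Unlike the unpack above, this variant returns Nothing for a row that does not have the expected shape instead of failing at runtime; that is a design choice of this sketch, not something the original module does.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in for HDBC's SqlValue, so the sketch needs no database.
data Cell = CInt Integer | CText String
  deriving (Show, Eq)

-- A shortened Page record, for illustration only.
data Page = Page { pageId :: Integer, pageTitle :: String, url :: String }
  deriving (Show, Eq)

-- Match one row against the expected column layout.
-- A row of any other shape gives Nothing instead of a pattern-match failure.
unpackRow :: [Cell] -> Maybe Page
unpackRow [CInt i, CText t, CText u] = Just (Page i t u)
unpackRow _                          = Nothing

-- map would give [Maybe Page]; mapMaybe also drops the Nothing results.
unpackRows :: [[Cell]] -> [Page]
unpackRows = mapMaybe unpackRow
```

Calling unpackRows on a list that mixes well-formed and malformed rows keeps only the rows that matched the first pattern.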

How do we get a connection? It's easy:
import Database.HDBC.PostgreSQL (Connection, connectPostgreSQL)

connect :: String -> IO Connection
connect conString = connectPostgreSQL conString

I'll provide the GitHub sources so you can see all the CRUD operations for this module.

Main module

Now let's finish our parser.

-- Helpers such as getBrowserHTML, parseTitle, parseMeta, getPage, createOrUpdate,
-- getFirstOrEmpty, notIn, fixBaseLinks and merge are not shown in this post.
parseUrl :: String -> [String] -> [String] -> IO ()
parseUrl _ [] _ = putStrLn "Finished parsing"
parseUrl baseDomain (u:urls) parsedUrls
  | startswith "http" u || startswith "//" u = do
    html <- getBrowserHTML u
    title <- parseTitle html
    metaKeywords <- parseMeta "keywords" html
    metaDescriptions <- parseMeta "description" html
    metaOgImage <- parseMeta "og:image" html
    metaOgDescription <- parseMeta "og:description" html
    conn <- connect "host=127.0.0.1 dbname=databayo user=postgres password=******"
    existedPage <- getPage conn u
    createOrUpdate existedPage conn Page { pageId = 0,
                        pageTitle = getFirstOrEmpty title,
                        pageKeywords = getFirstOrEmpty metaKeywords,
                        pageDescription = getFirstOrEmpty metaDescriptions,
                        pageOgImage = getFirstOrEmpty metaOgImage,
                        pageOgDescription = getFirstOrEmpty metaOgDescription,
                        url = u }
    links <- parseLinks html
    -- Recurse on the rest of the queue: add the new links, keep only
    -- same-domain URLs, and drop the ones that were already parsed.
    parseUrl
      baseDomain
      (filter (\f -> notIn (parsedUrls ++ [u]) f)
        (filter (startswith baseDomain) $ fixBaseLinks baseDomain (nub $ merge links urls)))
      (nub (parsedUrls ++ [u]))
  | otherwise = putStrLn "Unknown url"

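By now you should understand most of this code. Two things to notice here:
  1. Pattern matching: the [] and (u:urls) cases split the URL queue into the empty and the non-empty case.
  2. Guards: the | conditions pick a branch depending on how the URL starts.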
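
Both features can be seen in isolation in a pure sketch of the same queue logic. Everything here is an assumption made for illustration: Kind, classify and crawl are hypothetical names, and linksOf is a pure stand-in for downloading and parsing a page. One deliberate difference from parseUrl above: an unknown URL is skipped and the crawl continues, whereas the otherwise branch of parseUrl prints a message and stops.

```haskell
import Data.List (isPrefixOf, nub)

data Kind = Absolute | ProtocolRelative | Unknown
  deriving (Show, Eq)

-- Guards: the first condition that holds selects the result.
classify :: String -> Kind
classify u
  | "http" `isPrefixOf` u = Absolute
  | "//" `isPrefixOf` u   = ProtocolRelative
  | otherwise             = Unknown

-- Pattern matching on the queue: [] ends the crawl, (u:rest) takes one URL.
-- The result is the list of visited URLs in visiting order.
crawl :: (String -> [String]) -> [String] -> [String] -> [String]
crawl _ [] visited = reverse visited
crawl linksOf (u:rest) visited
  | u `elem` visited      = crawl linksOf rest visited
  | classify u == Unknown = crawl linksOf rest visited
  | otherwise             = crawl linksOf (nub (rest ++ linksOf u)) (u : visited)
```

To try it, pass a function that maps a few made-up URLs to the links they contain, a starting queue, and an empty visited list.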

Whew. Now we can compile it with ghc, or create a package (cabal init, cabal install), and then run the app.


I hope you had fun reading this post.

Comments

  1. Thx, I'm also learning haskell and it's not easy to find simple examples of usual tasks :)
