Developing a small Haskell parser

After watching a few of Rich Hickey's talks I decided to take a look at functional programming. After some research on the internet I chose to learn Haskell. The first time I saw Haskell code I was scared, like most people. But then... Haskell is simply awesome. If you are a newbie, I'd suggest you read this book; it will be useful for any developer. After reading it I decided to write a small web crawler to consolidate what I had learned, and to write a post about it. Please note that this post is just my notes to summarize my own knowledge; I am not trying to give a clear explanation of FP or Haskell monads. If you don't know Haskell, I suggest you read the book first :)

My parser has a few modules: network access, parsing, and database access (PostgreSQL). Here are my notes on how I built it, module by module.

Network access

First I created a very small module that downloads HTML by URL. We'll need the HTTP package (the Network.HTTP module). With it we can write a simple function that downloads the HTML:

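import Network.HTTP (simpleHTTP, getRequest, getResponseBody)
-- decodeString comes from the utf8-string package
import Codec.Binary.UTF8.String (decodeString)

getHTML :: String -> IO String
getHTML url = do
  rsp <- simpleHTTP (getRequest url)   -- perform the GET request
  body <- getResponseBody rsp          -- extract the body (throws on a connection error)
  return (decodeString body)           -- decode the UTF-8 bytes into a String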

The first line is the type signature: the function takes a String and returns an IO String. To understand what IO means you need to read about monads and Haskell IO. The code looks imperative, but it is not. Do notation is just syntactic sugar for the monad operators >> and >>=. For example, if you have

main = do
  putStrLn "Line 1"
  putStrLn "Line 2"

it can be rewritten as

main = putStrLn "Line 1" >> putStrLn "Line 2"

Likewise,

main = do
  name <- getLine
  putStrLn name

is just sugar for:

main = getLine >>= \name -> putStrLn name

return is also a monad function. It doesn't return us from the function; it just wraps a value in the monad.
I hope this is clearer now.
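
To make the desugaring easier to check, here is a small sketch of the same idea. It uses the Maybe monad instead of IO (my substitution, only so the results can be compared directly), and the names addDo, addBind and notAnExit are made up for this example.

```haskell
import Text.Read (readMaybe)

-- Written with do notation.
addDo :: String -> String -> Maybe Int
addDo a b = do
  x <- readMaybe a
  y <- readMaybe b
  return (x + y)

-- The same function desugared by hand into >>= and lambdas.
addBind :: String -> String -> Maybe Int
addBind a b =
  readMaybe a >>= \x ->
  readMaybe b >>= \y ->
  return (x + y)

-- return does not leave the block: the first return is simply ignored
-- and the value of the whole block is that of its last expression.
notAnExit :: Maybe Int
notAnExit = do
  _ <- return (1 :: Int)
  return 2
```

Both versions should behave identically: they give a Just when both strings parse as numbers and Nothing otherwise.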


Now that we have the HTML, we can create a module that parses it. Here is an example of a function that extracts links.

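import Text.XML.HXT.Core
-- css is not part of HXT itself; it comes from an add-on such as HandsomeSoup (Text.HandsomeSoup)

parseLinks :: String -> IO [String]
parseLinks html = do
  let doc = readString [withParseHTML yes, withWarnings no] html
  -- runX runs the arrow pipeline and collects every href it finds
  runX $ doc >>> css "a" >>> getAttrValue "href"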

For parsing I used the HXT library. The first time you see something like runX $ doc in Haskell, it looks scary. Actually $ is just sugar for getting rid of parentheses. It means: apply the function on the left to everything on the right :) It could be replaced with runX (doc >>> css "a"….). You will also often find function composition (the . operator) in functional programs.
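
Here is a tiny sketch of the three spellings side by side: parentheses, $, and composition. It uses only the standard library, and the function names are made up for this example.

```haskell
import Data.Char (toUpper)

-- Nested parentheses.
shoutParens :: String -> String
shoutParens s = reverse (map toUpper (filter (/= ' ') s))

-- The same thing with ($): everything to the right is the argument.
shoutDollar :: String -> String
shoutDollar s = reverse $ map toUpper $ filter (/= ' ') s

-- The same thing as a composition of functions, without naming the argument.
shoutCompose :: String -> String
shoutCompose = reverse . map toUpper . filter (/= ' ')
```

All three drop the spaces, upper-case the string and reverse it; which one to use is mostly a matter of taste.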


For DB access we'll use the HDBC PostgreSQL library. First of all, let's create an algebraic data type Page:
data Page = Page { pageId :: Integer,
                   pageTitle :: String,
                   pageKeywords :: String,
                   pageDescription :: String,
                   pageOgImage :: String,
                   pageOgDescription :: String,
                   url :: String } deriving (Show)

Note that a type name must start with a capital letter.
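
If record syntax is new to you, this short sketch shows what it gives you for free. Link is a hypothetical type, smaller than Page but declared the same way.

```haskell
-- Hypothetical record, declared the same way as Page.
data Link = Link { linkUrl :: String, linkText :: String }
  deriving (Show, Eq)

-- Construct a value by naming the fields.
home :: Link
home = Link { linkUrl = "http://example.com", linkText = "Home" }

-- Each field name doubles as an accessor function.
homeUrl :: String
homeUrl = linkUrl home

-- Record update syntax copies the value and replaces only the listed fields.
renamed :: Link
renamed = home { linkText = "Start page" }
```

Values are immutable, so renamed is a new Link and home itself is unchanged.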
Now let's create a function that returns the list of parsed pages:

import Database.HDBC (IConnection, SqlValue(..), quickQuery')
-- toString is assumed to be the one from the utf8-string package
import Data.ByteString.UTF8 (toString)

selectPages :: IConnection a => a -> IO [Page]
selectPages con = do
  result <- quickQuery' con "select * from page" []
  return $ map unpack result
  -- One row is one list of column values, in the table's column order.
  where unpack [SqlInteger page_id, SqlByteString page_title, SqlByteString page_keywords, SqlByteString page_description, SqlByteString page_og_image, SqlByteString page_og_description, SqlByteString domain] =
          Page { pageId = page_id,
                 pageTitle = toString page_title,
                 pageKeywords = toString page_keywords,
                 pageDescription = toString page_description,
                 pageOgImage = toString page_og_image,
                 pageOgDescription = toString page_og_description,
                 url = toString domain }

This looks a bit harder, yeah? But on a second look you'll see:
  1. We run quickQuery' and pass it 3 arguments: the connection, the query text, and a list of parameters. Parameters are written as "?" placeholders in the query text and HDBC escapes the values for us; this query has none, so the list is empty.
  2. The map function applies a function to each item of a list. So we are actually calling unpack for each result row. unpack is defined in the where section.
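
The map plus where pattern can be tried without a database. The sketch below is only an illustration: Value is a hypothetical stand-in for HDBC's SqlValue and Site is a made-up two-column record. The second function is a variant I would consider, since an unpack that only knows one row shape crashes on anything else (a NULL column, for example).

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in for HDBC's SqlValue, just enough for this sketch.
data Value = VInt Integer | VText String | VNull
  deriving (Show, Eq)

-- Made-up record with two columns.
data Site = Site { siteId :: Integer, siteUrl :: String }
  deriving (Show, Eq)

-- Same shape as selectPages: map a where-bound unpack over all rows.
toSites :: [[Value]] -> [Site]
toSites rows = map unpack rows
  where
    unpack [VInt i, VText u] = Site { siteId = i, siteUrl = u }
    unpack row = error ("unexpected row: " ++ show row)

-- Variant that skips rows of the wrong shape instead of crashing.
toSitesSafe :: [[Value]] -> [Site]
toSitesSafe = mapMaybe unpack
  where
    unpack [VInt i, VText u] = Just (Site i u)
    unpack _ = Nothing
```

Whether a malformed row should be skipped or treated as an error depends on the application; the sketch only shows that both are a one-line change in unpack.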

How do we get a connection? It's easy:
import Database.HDBC.PostgreSQL (Connection, connectPostgreSQL)

connect :: String -> IO Connection
connect conString = connectPostgreSQL conString

I'll provide the GitHub sources so you can see all the CRUD operations for this module.

Main module

Now let's finish our parser.

import Data.List (nub)
-- startswith and merge look like the MissingH helpers (Data.String.Utils, Data.List.Utils).
-- getBrowserHTML, parseTitle, parseMeta, getPage, createOrUpdate, getFirstOrEmpty,
-- fixBaseLinks and notIn are this project's own helpers from the other modules.

parseUrl :: String -> [String] -> [String] -> IO ()
-- Nothing left in the queue: we are done.
parseUrl _ [] parsedUrls = do
  putStrLn "Finished parsing"
parseUrl baseDomain (u:urls) parsedUrls
  | startswith "http" u || startswith "//" u = do
    html <- getBrowserHTML u
    title <- parseTitle html
    metaKeywords <- parseMeta "keywords" html
    metaDescriptions <- parseMeta "description" html
    metaOgImage <- parseMeta "og:image" html
    metaOgDescription <- parseMeta "og:description" html
    conn <- connect "host= dbname=databayo user=postgres password=******"
    existedPage <- getPage conn u
    createOrUpdate existedPage conn Page { pageId = 0,
                        pageTitle = getFirstOrEmpty title,
                        pageKeywords = getFirstOrEmpty metaKeywords,
                        pageDescription = getFirstOrEmpty metaDescriptions,
                        pageOgImage = getFirstOrEmpty metaOgImage,
                        pageOgDescription = getFirstOrEmpty metaOgDescription,
                        url = u }
    links <- parseLinks html
    -- Recurse with the new queue (unseen links on our domain) and the updated visited list.
    parseUrl baseDomain
      (filter (\f -> notIn (parsedUrls ++ [u]) f)
      (filter (startswith baseDomain) $ fixBaseLinks baseDomain (nub $ merge links urls)))
      (nub (parsedUrls ++ [u]))
  | otherwise = do
    putStrLn "Unknown url"

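By now you should understand almost all of this code. There are 2 things to notice here:
  1. Pattern matching: the first equation handles the empty URL list, the second splits the list into its head u and the rest urls.
  2. Guards: the conditions after | decide which branch runs for the current URL.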
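
Both features can be seen in isolation in a pure version of the same loop. This is only a sketch under my own assumptions: the "web" is a hypothetical in-memory list of pages and their links, there is no network or database, and unlike parseUrl it skips unknown URLs instead of stopping.

```haskell
import Data.List (isPrefixOf, nub)

-- Hypothetical in-memory "web": each URL is paired with the links on that page.
type Web = [(String, [String])]

-- crawl web queue visited returns the visited pages in visiting order.
crawl :: Web -> [String] -> [String] -> [String]
crawl _ [] visited = visited                         -- pattern: empty queue, we are done
crawl web (u:urls) visited                           -- pattern: head of the queue and the rest
  | u `elem` visited = crawl web urls visited        -- guard: already seen, skip it
  | "http" `isPrefixOf` u =                          -- guard: looks like a page, follow it
      let links = maybe [] id (lookup u web)
      in crawl web (nub (urls ++ links)) (visited ++ [u])
  | otherwise = crawl web urls visited               -- guard: anything else is skipped
```

For example, with pages "http://a" and "http://b" linking to each other, crawl started from ["http://a"] with an empty visited list walks each page once and then stops, because the visited list keeps growing while the queue only receives links found on pages.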

Whew. Now we can compile it with ghc, or create a package (cabal init, cabal install), and then run the app.

Hope you had fun reading this post.


  Thx, I'm also learning haskell and it's not easy to find simple examples of usual tasks :)


