After watching a few Rich Hickey talks, I decided to take a look at functional
programming. After some research online, I chose to learn Haskell.
The first time I saw Haskell code I was scared, like most people. But then...
Haskell is simply awesome. If you are a newbie, I'd suggest you read this book; it will be useful
for any developer. After reading it, I decided to write a small web crawler to
consolidate what I had learned, and to write a post about it. Please note that this post is just my
notes summarizing what I know; I am not trying to give a clear explanation of FP or
Haskell monads. If you don't know Haskell, I suggest you read the book first :)
My parser has a few modules: networking, parsing, and database access (PostgreSQL). Here are
my notes on how I built it, module by module.
Network access
In the beginning I created a very small module that downloads HTML by URL. We'll
need the HTTP module. Then we can write a simple function that downloads the HTML:
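import Network.HTTP (simpleHTTP, getRequest, getResponseBody)
import Codec.Binary.UTF8.String (decodeString)

-- Download a page and decode its UTF-8 body into a String.
getHTML :: String -> IO String
getHTML url = do
    rsp <- simpleHTTP (getRequest url)
    body <- getResponseBody rsp
    return (decodeString body)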
The first line is the type signature: the function takes a String and returns an IO String. To understand
what IO means, you need to read about monads and Haskell IO. This
code looks imperative, but it is not. Do notation is just syntactic sugar
for the monad operators >> and >>=. For example, if you have
main = do
    putStrLn "Line 1"
    putStrLn "Line 2"
it can be rewritten as
main =
    putStrLn "Line 1" >> putStrLn "Line 2"
Similarly,
main = do
    name <- getLine
    putStrLn name
is just sugar for:
main =
    getLine >>= \name -> putStrLn name
return is also a monad function. It does not return from the enclosing function; it just wraps a value in the
monad.
I hope this is clearer now.
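To make the desugaring concrete outside of IO, here is a minimal sketch in the Maybe monad. The `ages` table and the function names are made up for this illustration; only `lookup`, `>>=` and `return` come from the standard Prelude.

```haskell
-- Hypothetical lookup table, used only for this illustration.
ages :: [(String, Int)]
ages = [("alice", 30), ("bob", 25)]

-- Written with do notation.
sumAgesDo :: String -> String -> Maybe Int
sumAgesDo a b = do
    x <- lookup a ages
    y <- lookup b ages
    return (x + y)

-- The same function desugared by hand into >>= and lambdas.
sumAgesBind :: String -> String -> Maybe Int
sumAgesBind a b =
    lookup a ages >>= \x ->
    lookup b ages >>= \y ->
    return (x + y)

-- return does not exit the block: it only wraps a value,
-- so the first return below is simply discarded.
notAnExit :: Maybe Int
notAnExit = do
    _ <- return (1 :: Int)
    return 2

main :: IO ()
main = do
    print (sumAgesDo "alice" "bob")    -- Just 55
    print (sumAgesBind "alice" "bob")  -- Just 55
    print (sumAgesDo "alice" "carol")  -- Nothing
    print notAnExit                    -- Just 2
```

Both versions give the same result for every input, which is the whole point: the do block is only a more readable spelling of the chain of binds.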
Parsing
Now that we have the HTML, we can create a module that parses it. Here is an example of a function
for parsing links:
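import Text.XML.HXT.Core
-- css is not part of HXT itself; it comes from a CSS selector add-on for HXT.

-- Collect the href attribute of every <a> element.
parseLinks :: String -> IO [String]
parseLinks html = do
    let doc = readString [withParseHTML yes, withWarnings no] html
    runX $ doc >>> css "a" >>> getAttrValue "href"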
For parsing I used the HXT library. At first it is scary to see something like
runX $ doc in Haskell. Actually $ is just sugar for removing parentheses: it applies
the function on its left to everything on its right :) The call could be replaced with: runX (doc >>> css
"a"….). You will also often find function
composition in functional programs.
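As a small, self-contained sketch of the same idea, the three functions below are equivalent: one uses parentheses, one uses `$`, and one uses function composition with `.`. The function names are made up for the example and have nothing to do with HXT.

```haskell
import Data.Char (toUpper)

-- Plain parentheses.
shoutParens :: String -> String
shoutParens s = reverse (map toUpper (take 3 s))

-- The same pipeline with $ instead of nested parentheses.
shoutDollar :: String -> String
shoutDollar s = reverse $ map toUpper $ take 3 s

-- The same pipeline as a composition of functions, without naming the argument.
shoutCompose :: String -> String
shoutCompose = reverse . map toUpper . take 3

main :: IO ()
main = mapM_ (\f -> putStrLn (f "haskell")) [shoutParens, shoutDollar, shoutCompose]
-- prints "SAH" three times
```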
Database
For DB access we'll use the HDBC PostgreSQL library.
First of all, let's create an algebraic data type Page:
data Page = Page
    { pageId            :: Integer
    , pageTitle         :: String
    , pageKeywords      :: String
    , pageDescription   :: String
    , pageOgImage       :: String
    , pageOgDescription :: String
    , url               :: String
    } deriving (Show)
Note that a type name must start with a capital letter.
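Here is a short sketch of how such a record can be used. The Page type is repeated so the example compiles on its own; `emptyPage` and `withTitle` are hypothetical helpers that are not part of the crawler.

```haskell
data Page = Page
    { pageId            :: Integer
    , pageTitle         :: String
    , pageKeywords      :: String
    , pageDescription   :: String
    , pageOgImage       :: String
    , pageOgDescription :: String
    , url               :: String
    } deriving (Show)

-- Hypothetical helper: a page with only the URL filled in.
emptyPage :: String -> Page
emptyPage u = Page
    { pageId = 0, pageTitle = "", pageKeywords = "", pageDescription = ""
    , pageOgImage = "", pageOgDescription = "", url = u }

-- Record update syntax builds a new value; the original is not modified.
withTitle :: String -> Page -> Page
withTitle t p = p { pageTitle = t }

main :: IO ()
main = do
    let blank = emptyPage "http://example.com"
        named = withTitle "Example" blank
    -- Every field name is also an accessor function.
    putStrLn (pageTitle named)
    putStrLn (url named)
    print (pageTitle blank == "")
```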
Now let's create a function that returns the list of parsed pages:
-- Needs: import Database.HDBC and import Data.ByteString.UTF8 (toString)
selectPages :: IConnection a => a -> IO [Page]
selectPages con = do
    result <- quickQuery' con "select * from page" []
    return $ map unpack result
  where
    -- Each row must have exactly this shape; any other row is a pattern-match failure.
    unpack [ SqlInteger page_id
           , SqlByteString page_title
           , SqlByteString page_keywords
           , SqlByteString page_description
           , SqlByteString page_og_image
           , SqlByteString page_og_description
           , SqlByteString domain ] =
        Page { pageId            = page_id
             , pageTitle         = toString page_title
             , pageKeywords      = toString page_keywords
             , pageDescription   = toString page_description
             , pageOgImage       = toString page_og_image
             , pageOgDescription = toString page_og_description
             , url               = toString domain }
This looks a bit harder, right? But on a second look you'll see:
- we run quickQuery' and pass 3 arguments: the connection, the query text, and a list of parameters. Each parameter is marked with a "?" placeholder in the query text, which takes care of escaping.
- map applies a function to each list item, so we are actually calling unpack for each result row. unpack is defined in the where section.
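The map-plus-pattern-matching step can be tried without a database. In the sketch below, `Cell` is a simplified stand-in for HDBC's SqlValue, and `Link`, `unpackRow` and `rows` are made up for the illustration. Unlike the unpack above, this version returns Nothing for a row of unexpected shape instead of failing at runtime; that is a design choice of the sketch, not something the crawler does.

```haskell
-- Simplified stand-in for HDBC's SqlValue (an assumption made for this sketch).
data Cell = CInt Integer | CText String
    deriving (Show, Eq)

-- Hypothetical two-column record, smaller than Page to keep the example short.
data Link = Link { linkId :: Integer, linkUrl :: String }
    deriving (Show, Eq)

-- Pattern-match on the exact shape of a row; anything else yields Nothing.
unpackRow :: [Cell] -> Maybe Link
unpackRow [CInt i, CText u] = Just (Link { linkId = i, linkUrl = u })
unpackRow _                 = Nothing

-- Fake query result: each inner list is one row.
rows :: [[Cell]]
rows =
    [ [CInt 1, CText "http://example.com"]
    , [CText "unexpected shape"]
    , [CInt 2, CText "http://example.org"]
    ]

main :: IO ()
main = mapM_ print (map unpackRow rows)
```

If the malformed rows should simply be dropped, `mapMaybe` from Data.Maybe can replace `map` here.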
How do we get a connection? It's easy:
connect :: String -> IO Connection
connect conString = do
    conn <- connectPostgreSQL conString
    return conn
I'll provide the GitHub sources so you can see all the CRUD functions for this module.
Main module
Now let's finish our parser.
parseUrl :: String -> [String] -> [String] -> IO ()
parseUrl _ [] parsedUrls = do
    putStrLn "Finished parsing"
parseUrl baseDomain (u:urls) parsedUrls
    | (startswith "http" u) || (startswith "//" u) = do
        html <- getBrowserHTML u
        title <- parseTitle html
        metaKeywords <- parseMeta "keywords" html
        metaDescriptions <- parseMeta "description" html
        metaOgImage <- parseMeta "og:image" html
        metaOgDescription <- parseMeta "og:description" html
        conn <- connect "host=127.0.0.1 dbname=databayo user=postgres password=******"
        existedPage <- getPage conn u
        createOrUpdate existedPage conn Page { pageId = 0,
                                              pageTitle = getFirstOrEmpty title,
                                              pageKeywords = getFirstOrEmpty metaKeywords,
                                              pageDescription = getFirstOrEmpty metaDescriptions,
                                              pageOgImage = getFirstOrEmpty metaOgImage,
                                              pageOgDescription = getFirstOrEmpty metaOgDescription,
                                              url = u }
        links <- parseLinks html
        parseUrl
            baseDomain
            (filter (\f -> notIn (parsedUrls ++ [u]) f)
                (filter (startswith baseDomain) $ fixBaseLinks baseDomain (nub $ merge links urls)))
            (nub (parsedUrls ++ [u]))
    | otherwise = do
        putStrLn "Unknown url"
By now you should understand most of this code. There are two things to notice here.
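The link-filtering part of the recursive call is pure, so it can be sketched and tried in isolation. This is only an approximation under stated assumptions: `nextUrls` is a hypothetical helper, and it uses `isPrefixOf` and `notElem` from the standard library in place of the startswith and notIn helpers used above. It also leaves out fixBaseLinks and merge, whose definitions are not shown in this post.

```haskell
import Data.List (isPrefixOf, nub)

-- Hypothetical helper: from the candidate URLs, keep each one once,
-- keep only those on the base domain, and drop those already visited.
nextUrls :: String -> [String] -> [String] -> [String]
nextUrls baseDomain visited candidates =
    filter (`notElem` visited)
        (filter (baseDomain `isPrefixOf`) (nub candidates))

main :: IO ()
main = do
    let visited = ["http://example.com/", "http://example.com/a"]
        found   = [ "http://example.com/a"   -- already visited
                  , "http://example.com/b"
                  , "http://other.org/"      -- different domain
                  , "http://example.com/b"   -- duplicate
                  ]
    mapM_ putStrLn (nextUrls "http://example.com" visited found)
    -- prints only http://example.com/b
```

Keeping this step as a pure function makes it possible to check the crawl order without touching the network or the database.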
Whew. Now we can compile it with ghc, or create a package (cabal init, cabal install) and
then run the app.
I hope you had fun reading this post.
Thx, I'm also learning haskell and it's not easy to find simple examples of usual tasks :)