Generate a dataset from a website using Scrapy (Laws of Bangladesh)

MD Momenul Islam Bhuiyan (Nahid)
6 min read · Mar 29, 2022
Bangladesh Supreme Court

A dataset is one of the most important parts of data analysis or any machine-learning task. It gathers all the data and features for a specific purpose: a set of records that tracks every value, mostly relevant observations along with the occasional outlier.

Generally, a dataset is collected from different sources, most often from databases. Sometimes the data has to be gathered from multiple sources and merged into a single dataset. Scraping is another way of collecting data: content is pulled from different web platforms and processed into a format that forms a dataset for a given purpose.

Recently, I was looking for a dataset of Bangladeshi laws. I wanted to run NLP (Natural Language Processing) tasks on Bangladeshi law, but I couldn't find any dataset after searching different sources and search engines. That intrigued me to generate a dataset on Bangladeshi law myself and to work further on this topic to uncover more interesting points.

Crawling Processes

The whole scraping process is configured with the Scrapy web crawler. After the crawled data is collected, it needs post-processing to bring it into a uniform shape.

A Scrapy crawler uses spiders to crawl the website and scrape the content of its webpages. A spider is configured by defining an Item, Settings, Pipelines and Middlewares; a settings sketch is shown after the pipeline code below.

Scrapy Item
The item is a class defined for the Scrapy crawled data; every crawled record is yielded as an item so it can be processed further or saved into an output format.

import scrapy


# Container for one law/section record crawled from bdlaws.minlaw.gov.bd
class ScrapLawItem(scrapy.Item):
    url_id = scrapy.Field()
    law_title = scrapy.Field()
    vol_ordinance = scrapy.Field()
    law_pass_date = scrapy.Field()
    law_subtitle = scrapy.Field()
    law_descripton = scrapy.Field()
    section_chapter_id = scrapy.Field()
    section_chapter_name = scrapy.Field()
    section_chapter_no = scrapy.Field()
    section_id = scrapy.Field()
    section_name = scrapy.Field()
    section_description = scrapy.Field()

Pipelines
A pipeline creates a processing line for the crawl: it hooks into the spider's lifecycle and handles every yielded item, in this case exporting each item to CSV and JSON files.

from collections import defaultdict

from scrapy import signals
from scrapy.exporters import CsvItemExporter, JsonItemExporter


class ScrapCrawlPipeline(object):
    def __init__(self):
        self.files = defaultdict(list)

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # Open one CSV and one JSON file per spider and start both exporters.
        csv_file = open('%s.csv' % spider.name, 'w+b')
        json_file = open('%s.json' % spider.name, 'w+b')

        self.files[spider].append(csv_file)
        self.files[spider].append(json_file)

        self.exporters = [
            JsonItemExporter(json_file),
            CsvItemExporter(csv_file)
        ]

        for exporter in self.exporters:
            exporter.start_exporting()

    def spider_closed(self, spider):
        for exporter in self.exporters:
            exporter.finish_exporting()

        files = self.files.pop(spider)
        for file in files:
            file.close()

    def process_item(self, item, spider):
        # Every yielded item is written to both the CSV and the JSON export.
        for exporter in self.exporters:
            exporter.export_item(item)
        return item
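For these exporters to run at all, the pipeline has to be registered in the project's settings.py. The sketch below shows the relevant settings, assuming the Scrapy project package is called scrap_crawl (the real module path in the repository may differ):

# settings.py -- minimal sketch; the package name 'scrap_crawl' is an
# assumption, adjust it to the actual project path.
BOT_NAME = 'scrap_crawl'

ITEM_PIPELINES = {
    # Route every yielded item through the CSV/JSON exporter pipeline above.
    'scrap_crawl.pipelines.ScrapCrawlPipeline': 300,
}

# Reasonable defaults when crawling a government site: respect robots.txt
# and throttle requests a little.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1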

Data Nodes

The website is not well structured, and each page carries its data in an irregular format. Because of that, the spider needs many conditions to yield the correct items.

from urllib.parse import urlparse

import scrapy
from scrapy import Request

# ScrapLawItem (defined above) and the missing_id_list / missing_section_list
# helpers are imported from elsewhere in the project.


# Spider for the Laws of Bangladesh site (bdlaws.minlaw.gov.bd)
class BDLawSpider(scrapy.Spider):
    name = 'bdlaws'
    start_urls = ["http://bdlaws.minlaw.gov.bd/laws-of-bangladesh-alphabetical-index.html", ]

    def start_requests(self):
        for url in self.start_urls:
            # The callbacks below cover the full crawl and the re-crawl of
            # missing laws / sections; enable whichever pass is needed.
            # yield Request(url=url, callback=self.parse_bdlaws_start, dont_filter=False)
            # yield Request(url=url, callback=self.parse_missing_bdlaws_start, dont_filter=False)
            yield Request(url=url, callback=self.parse_missing_section_start, dont_filter=False)

    def parse_bdlaws_start(self, response):
        # Collect every act id from the alphabetical index table.
        law_links = response.xpath("//div[@class='table-responsive']/table/tbody/tr/td/a/@href").extract()
        law_links = list(set(law_links))
        law_links = [txt.split('-')[1].split('.')[0] for txt in law_links]

        host = urlparse(response.url).hostname
        for item in law_links:
            detail_law_url = f'http://{host}/act-{item}.html'
            yield Request(
                url=detail_law_url,
                callback=self.parse_law_item,
                dont_filter=False,
                meta={
                    'url_id': item,
                    'dont_redirect': True,
                    'handle_httpstatus_list': [302]
                }
            )

    def parse_missing_bdlaws_start(self, response):
        # Re-crawl the law ids that were missed on the first pass.
        host = urlparse(response.url).hostname
        for item in missing_id_list:
            detail_law_url = f'http://{host}/act-{item}.html'
            yield Request(
                url=detail_law_url,
                callback=self.parse_law_missing_item,
                dont_filter=False,
                meta={
                    'url_id': item,
                    'dont_redirect': True,
                    'handle_httpstatus_list': [302]
                }
            )
    def parse_law_item(self, response):
        if response.status not in [404, 302]:
            host = urlparse(response.url).hostname
            item = ScrapLawItem()
            item['url_id'] = response.meta['url_id']
            item['law_title'] = " ".join(response.xpath("//section[@class='bg-act-section padding-bottom-20']/div/div/div/h3/text()").getall()).strip()
            item['vol_ordinance'] = " ".join(response.xpath("//section[@class='bg-act-section padding-bottom-20']/div/div/div/h4/text()").get().strip()[1:-1].split())
            item['law_pass_date'] = response.xpath("//div[@class='form-group pull-right text-info publish-date']/text()").get().strip()[1:-1].strip()

            law_discription = "".join(response.xpath("//div[@class='col-md-12 pad-right']/text()").getall()).strip()
            if not law_discription:
                law_discription = "".join(response.xpath("//div[@class='col-md-10']/text()").getall()).strip()
            item['law_descripton'] = law_discription

            # Some acts are split into chapters, others list their sections directly.
            law_section_chapters = response.xpath("//section[@class=' search-here']/div/div[@class='act-chapter-group']")
            if law_section_chapters and len(law_section_chapters) > 0:
                chapter_id_list = law_section_chapters.xpath("//p[@class='act-chapter-no']/a/@href").getall()
                chapter_name = list(map(str.strip, law_section_chapters.xpath("//p[@class='act-chapter-no']/a/text()").getall()))
                chapter_mapping = dict(zip(chapter_name, chapter_id_list))
                for chapter in law_section_chapters:
                    law_sections = chapter.xpath("//section[@class=' search-here']/div/p/a/@href").getall()
                    for law_section_href in law_sections:
                        section_law_url = f'http://{host}{law_section_href}'
                        yield Request(
                            url=section_law_url,
                            callback=self.parse_law_section,
                            dont_filter=False,
                            meta={
                                'item': item,
                                'has_chapter': True,
                                'chapters': chapter_mapping
                            }
                        )
            else:
                law_sections = response.xpath("//section[@class=' search-here']/div/p/a/@href").getall()
                for law_section_href in law_sections:
                    section_law_url = f'http://{host}{law_section_href}'
                    yield Request(
                        url=section_law_url,
                        callback=self.parse_law_section,
                        dont_filter=False,
                        meta={
                            'item': item,
                            'has_chapter': False,
                            'chapters': []
                        }
                    )


    def parse_missing_section_start(self, response):
        # Re-crawl individual sections missing from the first export
        # (missing_section_list is prepared in the post-processing notebook).
        host = urlparse(response.url).hostname
        for section_link in missing_section_list:
            url_id = section_link.split('/')[0].split('-')[1]
            section_id = section_link.split('/')[1].split('-')[1].split('.')[0]
            item = ScrapLawItem()
            item['url_id'] = url_id
            item['section_id'] = section_id
            detail_law_url = f'http://{host}/{section_link}'
            yield Request(
                url=detail_law_url,
                callback=self.parse_law_section,
                dont_filter=False,
                meta={
                    'item': item,
                    'has_chapter': False,
                    'dont_redirect': True,
                    'handle_httpstatus_list': [302]
                }
            )


    def parse_law_missing_item(self, response):
        if response.status not in [404, 302]:
            item = ScrapLawItem()
            item['url_id'] = response.meta['url_id']
            law_title = " ".join(response.xpath("//section[@class='bg-act-section padding-bottom-20']/div/div/div/h3/text()").getall()).strip()
            item['law_title'] = law_title
            item['vol_ordinance'] = " ".join(response.xpath("//section[@class='bg-act-section padding-bottom-20']/div/div/div/h4/text()").get().strip()[1:-1].split())
            item['law_pass_date'] = response.xpath("//div[@class='form-group pull-right text-info publish-date']/text()").get().strip()[1:-1].strip()

            law_discription = "".join(response.xpath("//div[@class='col-md-12 pad-right']/text()").getall()).strip()
            if not law_discription:
                law_discription = "".join(response.xpath("//div[@class='col-md-10']/text()").getall()).strip()
            item['law_descripton'] = law_discription
            yield item

    def parse_law_section(self, response):
        if response.status not in [404, 302]:
            res = urlparse(response.url)
            section_id = res.path.split('/')[2].split('-')[1].split('.')[0]
            item = response.meta['item']
            item['section_id'] = section_id

            has_chapter = response.meta['has_chapter']
            chapter_content = response.xpath("//p[@class='act-chapter-name']/text()").get()
            if has_chapter and chapter_content:
                section_chapter_name = response.xpath("//p[@class='act-chapter-name']/text()").get().strip()
                if response.xpath("//p[contains(@class,'act-chapter-no')]"):
                    section_chapter_no = response.xpath("//p[@class='act-chapter-no']/text()").get().strip()
                    if section_chapter_no in response.meta['chapters']:
                        chapter_id = response.meta['chapters'][section_chapter_no].split('/')[2].split('-')[1].split('.')[0]
                        item['section_chapter_id'] = chapter_id
                    item['section_chapter_no'] = section_chapter_no
                item['section_chapter_name'] = section_chapter_name

            item['section_name'] = response.xpath("//div[@class='col-sm-3 txt-head']/text()").get().strip()
            # The section body sits in different nodes depending on the page layout.
            section_description = "".join(response.xpath("//div[@class='col-sm-9 txt-details']/text()").extract()).strip()
            if not section_description:
                section_description = "".join(response.xpath("//div[@class='col-sm-9 txt-details']/p/text()").getall()).strip()
            if not section_description:
                section_description = " ".join(response.xpath("//div[@class='col-sm-9 txt-details']/p/span/span/text()").extract()).strip()
            item['section_description'] = section_description
            yield item

The spider crawls each law and yields items carrying the law name, date and description together with the section details.
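The crawl itself is normally launched with the standard scrapy crawl bdlaws command from the project root, and the pipeline then writes bdlaws.csv and bdlaws.json next to it. As a sketch, the same spider can also be driven from a plain Python script using Scrapy's public API (nothing project-specific is assumed beyond the spider class above):

# Sketch: run the spider from a script instead of the scrapy CLI.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl(BDLawSpider)                        # the spider defined above
process.start()                                   # blocks until the crawl finishes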

Data Post-processing

After collecting all the data in CSV or JSON format, post-processing is still necessary to make the data uniform. The major issue is the language of the collected data: more than 20% of the records are stored in Bangla, and this language irregularity gets in the way of the NLP tasks I had in mind.
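The post-processing code further down keys off a translated_language column that marks which rows are still in Bangla. The article does not show how that column was produced; one way to build it, sketched here with the langdetect package (an assumption, any language-identification library would do), is a simple per-row detection over the exported CSV:

# Sketch: derive the 'translated_language' column used by check_translate below.
# langdetect is an assumption -- it was not necessarily the tool used here.
import pandas as pd
from langdetect import detect

def detect_language(text):
    try:
        return detect(str(text))   # e.g. 'bn' for Bangla, 'en' for English
    except Exception:
        return 'unknown'           # empty or non-linguistic cells

df = pd.read_csv('bdlaws.csv')
df['translated_language'] = df['law_descripton'].apply(detect_language)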

A notebook is then used to find the missing laws or law sections for each law id and to handle the rest of the post-processing, including translating the Bangla text into English. The same approach could translate from English into Bangla as well.

# GoogleTranslator is assumed to come from the deep-translator package.
from deep_translator import GoogleTranslator

def translate_text(txt, source='bn', target='en'):
    # Split the text into 1000-character chunks to stay under the
    # translator's per-request length limit, then translate in a batch.
    size = 1000
    splited_txt = [txt[y - size:y] for y in range(size, len(txt) + size, size)]
    translate_list = GoogleTranslator(source, target).translate_batch(splited_txt)
    return " ".join(translate_list)

Here, the text is translated with GoogleTranslator. Google Translate limits how much source text it will translate per request. Because of that limit, I join all the fields into one string using a VERY UNIQUE separator. The separator really matters: if the source or translated text happens to contain it, splitting the result back into its respective columns or fields breaks.

import re
import time

def check_translate(x, target='en'):
    # x is one row of the dataframe; fields still in Bangla are joined with a
    # distinctive separator, translated in one call, then split back out.
    separator = " ###### "
    x.fillna('-', inplace=True)
    try:
        if x['translated_language'] == 'bn':
            texts = [
                x['law_descripton'],
                x['law_title'],
                x['law_pass_date'],
                x['vol_ordinance'],
            ]
            section_list = [
                x['section_name'],
                x['section_description'],
                x['section_chapter_name'],
                x['section_chapter_no'],
            ]
            law_texts = separator.join(texts)
            law_texts = re.sub(r'[^.,#a-zA-Z0-9 \n\.]', '', law_texts)
            section_texts = separator.join(section_list)
            complete_texts = separator.join([law_texts, section_texts])
            translate = translate_text(complete_texts)
            translated = translate.split(separator.strip())
            x['law_descripton'] = translated[0]
            x['law_title'] = translated[1]
            x['law_pass_date'] = translated[2]
            x['vol_ordinance'] = translated[3]
            x['section_name'] = translated[4]
            x['section_description'] = translated[5]
            x['section_chapter_name'] = translated[6]
            x['section_chapter_no'] = translated[7]
            x['is_translated'] = True
    except Exception as e:
        # Back off briefly when the translator rejects or throttles a request.
        print(e)
        time.sleep(10)
        return x
    return x
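Applied row by row with pandas, check_translate can convert the whole frame in place. A small usage sketch, assuming the crawl output was loaded into the same DataFrame as above and that the file names are placeholders:

# Sketch: translate every Bangla row and persist the cleaned dataset.
df['is_translated'] = False
df = df.apply(check_translate, axis=1)   # row-wise translation

# Bangla rows that still failed (e.g. translator errors) can be retried later.
pending = df['translated_language'].eq('bn') & ~df['is_translated'].astype(bool)
print(f'{pending.sum()} Bangla rows left untranslated')

df.to_csv('bdlaws_translated.csv', index=False)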

Challenges and Difficulties

The primary challenge of this task was crawling the correct data out of the website's irregular page layout; translating the collected data came a close second.

Other challenges relate to post-processing, in particular translating the data with a very unique separator. Sometimes even a very unique separator does not survive translation, which makes it impossible to split the translated text cleanly and put each piece back into its respective field.
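One defensive tweak that helps is to check that the translated text still splits into the expected number of fields before writing anything back; when the count is off, the row can be flagged for a retry instead of silently shifting values into the wrong columns. A sketch of such a guard (split_translation is a hypothetical helper, not part of the original notebook):

# Sketch: guard against the separator being mangled by the translator.
EXPECTED_FIELDS = 8   # the eight columns joined together in check_translate

def split_translation(translated_text, separator='######'):
    parts = [part.strip() for part in translated_text.split(separator)]
    if len(parts) != EXPECTED_FIELDS:
        # The separator was dropped or duplicated during translation;
        # raising here lets the caller mark the row for a retry.
        raise ValueError(f'expected {EXPECTED_FIELDS} fields, got {len(parts)}')
    return parts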

Conclusion

Generating the dataset of Bangladesh laws was an interesting task for me, since no such dataset was available. I find the experience worth sharing with my fellow data enthusiasts, and I believe it can also help anyone who wants to generate a new dataset and faces similar challenges and difficulties.

Finally, I would like to invite you all to go through the dataset and analyse it to find interesting topics or insights related to Bangladesh laws.

References

You can have a look at these reference links. The Kaggle link holds the final version of the crawled data.

https://github.com/nahid002345/bdlaws-scrapy-crawler
https://www.kaggle.com/datasets/nahid002345/bangladesh-law-dataset
