Skip to content

ixemad/devorame

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Programming Challenge

Overview

This project implements a solution to some coding challenges.

  1. Write an efficient algorithm to check if a string is a palindrome. A string is a palindrome if the string matches the reverse of string.

    This question is tackled by the palindrome command as shown below.

    » poetry run devorame -v palindrome 'Amore, Roma.'
    Great! That "Amore, Roma." is a palindrome.     
        
  2. Write an efficient algorithm to find K-complementary pairs in a given array of integers. Given Array A, pair (i, j) is K-complementary if K = A[i] + A[j].

    This question is tackled by the k-complementary command as show below.

    » poetry run devorame k-complementary 6 1 2 3 4 5
    3,3
    2,4
    1,5     
        
  3. Tf/idf (term frequency / inverse document frequency) is an statistic that reflects the importance of a term T in a document D (or the relevance of a document for a searched term) relative to a document set S.

    Tf/idf can be extended to a set of terms TT adding the tf/idf for each term.

    Assume that we have a directory D containing a document set S, with one file per document. Documents will be added to that directory by external agents, but they will never be removed or overwritten.

    We are given a set of terms TT, and asked to compute the tf/idf of TT for each document in D, and report the N top documents sorted by relevance.

    The program must run as a daemon/service that is watching for new documents, and dynamically updates the computed tf/idf for each document and the inferred ranking.

    The program will run with the parameters:

    • The directory D where the documents will be written.
    • The terms TT to be analyzed.
    • The count N of top results to show.
    • The period P to report the top N.

    For example:

    ./tdIdf -d dir -n 5 -p 300 -t "password try again"       
    ...
    doc1.txt        0.78
    doc73.txt        0.76
        

    Bonus:

    • Parallel solution to accelerate computation.
    • Extensible framework for tf/idf variants.

    This question is tackled by the tf-idf command as show below.

    » poetry run devorame -v tf-idf -d ../devo/documents -p 3 -n 3 -t "python film"                                    1 ↵
    2018-12-07 13:47:02,405 devorame.tf_idf - [INFO] Document python-colt.txt was processed
    2018-12-07 13:47:02,406 devorame.tf_idf - [INFO] Document python-def.txt was processed
    2018-12-07 13:47:02,407 devorame.tf_idf - [INFO] Document python-film.txt was processed
    Top-3 best documents
    python-film.txt 0.03052
    python-colt.txt 0.01852
    python-def.txt 0.01526
    
    ^C2018-12-07 13:47:03,336 devorame.tf_idf - [INFO] Finishing TF-IDF server     
        

    Some features are still pending.

    • The command does not run as a daemon/service as using daemontools, runit or similar is considered a better option.
    • Not extensible framework for tf/idf variants.

Installation

This project is managed with Poetry. The steps I followed to install it were.

  • I choosed the not-so-recommended pip alternative to install it.
    python -m pip install --user poetry
        

    pip was because like that because of a pip import error.

  • After poetry was installed with pip, let it to update itself.
    poetry self:update
        

Clone this project repository into your local machine.

git clone https://github.com/ixemad/devorame.git

Move to the project folder and run poetry install to resolve all the dependencies. Run pytest to validate that the application was installed correctly.

poetry run pytest --verbose

Finally, invoke the run command in the project folder as shown below to execute this application without installing it.

poetry run devorame --help

Code

Module devorame

This module contains all the code this project. The main function is the group function devorame.

# -*- coding: utf-8 -*-
__version__ = '0.1.0'

<<devorame:import>>

<<devorame:code>>

if __name__ == '__main__':
    sys.exit(devorame.main(standalone_mode=False))

Group function devorame

This group function manages the common parameters.

@click.group()
@click.option('--verbose', '-v', count=True, help="Verbosity level")
@click.pass_context
def devorame(ctx, verbose):
    FORMAT = '%(asctime)-15s %(name)s - [%(levelname)s] %(message)s'
    logging.basicConfig(format=FORMAT)
    logger_name, lvl = (lambda l: (None, logging.NOTSET) if l <= 0 else ('devorame', l))(
        logging.WARN - 10 * verbose
    )
    logging.getLogger(logger_name).setLevel(lvl) 
  • CLI arguments are managed with click library.
    import click
        
  • Logging is managed with logging library.
    import logging
        

Function ignore_sigint

This function is needed to be able to capture the KeyboardInterrupt exception disabling it in the Pool workers as explained in this stackoverflow question.

def ignore_sigint():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
  • The module signal must be imported.
    import signal    
        

Call back function terms_callback

This is a Click callback function to convert the input string in a list of terms (words). It is used by the tf-idf command.

  • Input parameters
    ctx
    A Click context (not used).
    param
    The command line param name (not used).
    value
    The command line para value.
  • Returns

    A list of terms.

  • Testing
    • All terms are separated by non LOCALE alphanumeric characters.
      >>> terms_callback(None, None, 'STUPID   Whi$%&te  Men.')
      ['STUPID', 'Whi', 'te', 'Men']
              
    • Non LOCALE alphanumeric characters are erased.

      >>> terms_callback(None, None, ‘!”·$%&/ ()= '\ ¿?ÑÑ’)

      >>> terms_callback(None, None, '!"·$%&/ ()=  ')
      []
              
def terms_callback(ctx, param, value):
    """
    <<devorame:terms_callback:test>>
    """
    return re.findall('\w+', value, re.LOCALE)
  • Module re is imported to split the string in terms.
    import re
        

Command function palindrome

This command will read a string written by the user and will check in that phrase is a palindrome. It returns in the standard console a phrase indicating whether the input string is a palindrome.

This command will only works properly if the input string only contains ASCII characters. It will get rid of punctuation characters and spaces with the sanitize function and it verifies the result with the is_palindrome function.

@devorame.command(help="Check if a phrase (between quotation marks) is a palindrome.")
@click.argument('string', nargs=1)
@click.pass_context
def palindrome(ctx, string):
    logger = logging.getLogger('devorame.palindrome')
    
    if is_palindrome(sanitize(string)):
        print 'Great! That "%s" is a palindrome.' % string
        sys.exit(0)
    else:
        print 'That "%s" is not a palindrome. Keep trying!' % string
        sys.exit(1)
  • Module sys is imported to used its exit function.
    import sys
        

Command function k-complementary

This command will read the k integer followed by a list of integers. It will print every pair returned by the get_k_completary_pairs function in a new line.

@devorame.command(name='k-complementary', help="Return the k-complementary pairs of input integers")
@click.argument('k', type=int)
@click.argument('items', nargs=-1, type=int)
@click.pass_context
def k_complementary(ctx, k, items):
    for complementaries in get_k_complementary_pairs(k, items):
        print "%s,%s" % complementaries

Command function tf-idf

This command gathers the parameters for the TF-IDF challenge. A priority queue is used to maintain an ordered ranking of documents by their term frequencies. The IDF part is computed after new documents are added to the directory but it is not needed to maintain the ranking of documents because it only depends on the terms.

The ranking is maintained ordered with the bisect module. It’s insertion cost is O(n * log n) and it’s traversal cost is O(1). A better alternative would be to use a binary tree with an insertion cost of O(log n) and traversal cost of O(n).

@devorame.command(name='tf-idf', help="TF/IDF directory agent")
@click.option('-d', '--directory', required=True, type=click.Path(exists=True, dir_okay=True))
@click.option('-n', '--n-top', required=True, type=int)
@click.option('-p', '--period', required=True, type=int)
@click.option('-t', '--terms', required=True, callback=terms_callback)
@click.pass_context
def tf_idf(ctx, directory, n_top, period, terms):
    logger = logging.getLogger('devorame.tf_idf')

    tfidf = TfIdf()
    ranking = []
    idf = None

    try:
        while True:
            new_documents_and_paths = (
                (filename, os.path.join(directory, filename))
                for filename in os.listdir(directory)
                if not tfidf.has_document(filename)
                if os.path.isfile(os.path.join(directory, filename))
            )

            for document in tfidf.add_async(new_documents_and_paths):
                logger.info('Document %s was processed', document)
                bisect.insort(ranking, (1 - tfidf.tf(document, terms), document))
                idf = None

            idf = tfidf.idf(terms) if idf is None else idf
            print "Top-%s best documents" % n_top
            for inv_score, document in ranking[:n_top]:
                print "%s %.5f" % (document, (1 - inv_score) * tfidf.idf(terms))
            print ""
            time.sleep(period)
    except KeyboardInterrupt as e:
        logger.info('Finishing TF-IDF server')
  • The bisect module is used to insert new documents in the sorted ranking list efficiently.
    import bisect
        
  • Function walk of module os is used.
    import os
        
  • Function join of module os.path is used.
    import os.path
        
  • Function sleep of module time is used.
    import time
        

Function sanitize

This function will lowercase the input string and remove any character that is not an ASCII letter or a digit.

  • Input parameters
    string
    The string to be sanitize
  • Returns

    A sanitized string

  • Tests
    • A string empty remains empty
      >>> sanitize('')
      ''
              
    • Upper case letters are put in lowercase.
      >>> sanitize("JOjojO")
      'jojojo'
              
    • Any character that is not a letter or a digit is removed.
      >>> sanitize("What a f#$? ¢a (b) ulous day? 101")
      'whatafabulousday101'
              
  • Complexity

    This implementation traverses the string two successive times, so its complexity is O(2*n) but constant factor can be ignored so, this is equivalent to O(n).

def sanitize(string):
    """
    <<devorame:sanitize:test>>
    """
    return ''.join(
        ch for ch in string.lower()
        if 48 <= ord(ch) <= 57 or 97 <= ord(ch) <= 122
    )
  • A faster solution would probably be to use a regex expression like the one below, but to make the complexity analysis easier I will use the implementation above. It is also a better solution from the user’s point of view because it will allow current LOCALE characters.
    re.findall('\w+', string, re.LOCALE)
        

Function is_palindrome

This predicate checks if a string is a palindrome. Basically, it uses two indexes to traverse the string from beginning to end and vice versa, checking that each character matches.

  • Input parameters
    string
    The input string to check.
  • Returns

    True is the string is a palindrome. False, otherwise.

  • Tests
    • The empty string is a palindrome.
      >>> is_palindrome("")
      True
              
    • A single character string is also a palindrome.
      >>> is_palindrome("x")
      True
              
    • The function is case-sensitive, so the next input is not a palindrome.
      >>> is_palindrome("xX")
      False
              
    • To be used with phrases, punctuation characters and spaces have to be removed. An example of a palindrome phrase. The phrase A man, a plan, a canal, Panama! will return True if it is transformed as below.
      >>> is_palindrome("amanaplanacanalpanama")
      True
              
    • You can use the sanitize function to prepare a phrase.
      >>> is_palindrome(sanitize("Was it a car or a cat I saw?"))
      True
              
  • Complexity

    In the worst case, that is, the string is a palindrome, the string is traversed completely so its cost is O(n). Because of a python string is internally stored in a C array, accessing to each character by index is O(1).

def is_palindrome(string):
    """
    <<devorame:is_palindrome:test>>
    """
    forward_idx = 0
    backward_idx = len(string) - 1
    
    while (forward_idx < backward_idx):
        if string[forward_idx] != string[backward_idx]:
            return False
        forward_idx += 1
        backward_idx -= 1
    return True

Function get_k_complementary_pairs

Get the k-complementary pairs from a list. A pair (x, y) is k-complementary if x + y = k.

  • Input parameters
    k
    An int number that determine the pairs.
    items
    A list of intergers to traverse.
  • Returns

    A generator of the unique pairs that are k-complementary.

  • Tests
    • This left item of the pair is less than or equal that the right item of that pair.
      >>> list(get_k_complementary_pairs(3, [1, 2])) == list(get_k_complementary_pairs(3, [2, 1]))
      True
              
    • The 6-complementary pairs for the natural numbers.
      >>> list(get_k_complementary_pairs(6, [1, 2, 3, 4, 5]))
      [(3, 3), (2, 4), (1, 5)]
              
    • But order of pairs depends on the order of items.
      >>> list(get_k_complementary_pairs(6, [5, 1, 2, 3, 4]))
      [(1, 5), (3, 3), (2, 4)]
              
  • Complexity

    The items list is traversed just once. Access and insert cost of a dictionary is O(1) so the complexity of this function is O(n).

def get_k_complementary_pairs(k, items):
    """
    <<devorame:get_k_complementary_pairs:test>>
    """
   
    spotted = {}
    for item in items:
        if item in spotted: continue
        spotted[item] = None
        k_complementary = k - item
        if k_complementary in spotted:
            yield min(k_complementary, item), max(k_complementary, item)

Function collect_frequencies

This function collects the term frequencies in a given file. This function is used in the method add_async of the TfIdf class but because of some limitations of cPickle the input function must be a function defined at module level.

  • Input parameters
    document_and_path
    It’s a tuple of a document and a path. A tuple is used because this method is passed to the imap_unordered function and it only allows a function with one argument. For Python 3.3 onwards there is starmap function.
  • Returns

    A dictionary with the terms frequencies in that document.

  • Complexity

    As it only needs one pass over input file to collect its frequencies, its complexity is linear to the size of that input file, that is O(n).

def collect_frequencies(document_and_path):
    try:
        logger = logging.getLogger('devorame.tf_idf')
        document, path = document_and_path
        logger.debug('Collecting frequencies of document %s at %s', document, path)
    
        frequencies = {}
        with open(path, 'r') as file:
            for line in file:
                for term in re.findall('\w+', line.lower(), re.LOCALE):
                    frequencies.update({term: frequencies.get(term, 0) + 1})
    
        logger.debug('Frequencies at %s: %s', document, frequencies)
    
        return document, frequencies
    except KeyboardInterrupt:
        return document, {}
  • Module re must be imported to split the string in terms. A term is a sequence of letters. and digits leaving out, i.e., punctuation characters and the space character. It is aware of the user’s language.

Class TfIdf

An instance of this class gathers TF-IDF statistics of the added documents.

class TfIdf(object):
    <<devorame:TfIdf:method>>

Constructor

The instance constructor.

  • Attributes
    documents
    It is dictionary where every key is the name of a document and every value is the number of terms in that document.
    terms
    It is a dictionary where every key is a term and every value is a dictionary that gathers the frequency of that term in a given document.

    For instance, the terms dictionary below shows that the term song appears 10 times in document-A and 7 times in document-B

    { 'song': { 'document-A': 10, 'document-B': 7} }
            
def __init__(self):
    self.documents = {}
    self.terms = {}

Method add

This method adds document to the TF-IDF rank. The document must be readable.

  • Input Parameters
    document
    A string that stands for the name of the document.
    path
    A string that stands for the path of the document.
  • Complexity

    This algorighm is O(l * t) where l is the number of lines and t is the number of terms. That l * t is lower that n so, complexity is also O(n), being n the number of characters in that document.

def add(self, document, path):
    logger = logging.getLogger('devorame.tf_idf')
     
    logger.debug('Adding %s at %s', document, path)
    _, frequencies = collect_frequencies((document, path))     
    logger.debug('%s frequencies: %s', document, frequencies)
     
    for term, frequency in frequencies.iteritems():
        self._update_term_frequency(term, document, frequency)

Method add_async

This method process all pending documents in parallel.

  • Input parameters
    documents_and_paths
    it is a stream of pairs. Each pair consists of a document name and the path to that document.
  • Returns

    A stream of documents in the order it was processed.

def add_async(self, documents_and_paths):
    logger = logging.getLogger('devorame.tf_idf')

    logger.debug('Processing a stream of input documents')

    pool = Pool(None, ignore_sigint)
    try:
        documents_and_frequencies = pool.imap_unordered(
            collect_frequencies, documents_and_paths
        )

        logger.debug('Gathering document frequencies')
        for document, frequencies in documents_and_frequencies:
            for term, frequency in frequencies.iteritems():
                self._update_term_frequency(term, document, frequency)
            yield document
    except KeyboardInterrupt:
        logger.info("TF-IDF loop finished")
        pool.terminate()
        pool.join()
    except Exception:
        logger.info("TF-IDF loop finished")
        pool.terminate()
        pool.join()
        return
  • The Pool class of the multiprocessing module is used
    from multiprocessing import Pool
        

    Since this process is IO intensive it would be interesting to use the multiprocessing.dummy module (threading based) to compare the multiprocessing vs the multithreading versions.

Method tf

This method calculates the term frequency of a list of terms in a document.

  • Input parameters
    document
    the document where to look for that term frequency.
    terms
    a list of term.
  • Returns

    The average of the sum of the frequencies of all terms.

  • Complexity

    Access to the terms dictionary is O(1) so complexity is O(n) where n is the lenght of terms parameters.

def tf(self, document, terms):
    if not terms: return 0.0

    terms_in_document = float(self.documents.get(document, 0))
    if not terms_in_document: return 0.0

    result = sum(
        self.terms.get(term, {}).get(document, 0)
        for term in terms
    ) / terms_in_document

    return result

Method idf

This method calculates the inverse document frequency of a list of terms over the analyzed documents.

  • Input parameters
    terms
    a list of term.
  • Returns

    The average of the sum of the inverse document frequencies of all terms.

  • Completixy

    Access to the terms dictionary is O(1) so complexity is O(n) where n is the lenght of terms parameters.

def idf(self, terms):
    n_documents = float(len(self.documents))
    if not n_documents: return 0.0
    if not terms: return 0.0

    logger = logging.getLogger('devorame.tfidf')
    def _idf(term):
        n_term = len(self.terms.get(term, {}))
        result = math.log(n_documents / n_term)
        logger.debug('log(%s/%s)= %s', n_documents, n_term, result)
        return result

    return sum(_idf(term) for term in terms) / len(terms)
  • Function log of module math is used.
    import math
        

Method has_document

This method returns True if the document passed by parameter has been processed.

def has_document(self, document):
    return document in self.documents

Method _update_term_frequency

This function updates the term frequency and the document frequency.

  • Input parameters
    term
    the term at which to update the frequency.
    document
    The document where the term was found.
    frequency
    The number of times the term appears in the document.
  • Returns

    Nothing is returned. The internal terms and documents dictionaries are updated.

  • Complexity

    Reading and writing a dictionary is O(1) so this function is also O(1).

def _update_term_frequency(self, term, document, frequency):
    self.terms.update({
        term: dict( self.terms.get(term, {}),
                    **{ document: self.terms.get(
                        term, {}).get(
                            document, 0) + frequency})})
    self.documents.update({
        document: self.documents.get(document, 0) + frequency })

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages