This project implements a solution to some coding challenges.
- Write an efficient algorithm to check if a string is a palindrome. A string is a palindrome if
the string matches the reverse of string.
This question is tackled by the palindrome command as shown below.
» poetry run devorame -v palindrome 'Amore, Roma.' Great! That "Amore, Roma." is a palindrome. - Write an efficient algorithm to find K-complementary pairs in a given array of integers.
Given Array
A, pair(i, j)is K-complementary ifK = A[i] + A[j].This question is tackled by the k-complementary command as show below.
» poetry run devorame k-complementary 6 1 2 3 4 5 3,3 2,4 1,5 - Tf/idf (term frequency / inverse document frequency) is an statistic that reflects the
importance of a term
Tin a documentD(or the relevance of a document for a searched term) relative to a document setS.Tf/idf can be extended to a set of terms TT adding the tf/idf for each term.
Assume that we have a directory
Dcontaining a document setS, with one file per document. Documents will be added to that directory by external agents, but they will never be removed or overwritten.We are given a set of terms
TT, and asked to compute the tf/idf ofTTfor each document inD, and report theNtop documents sorted by relevance.The program must run as a daemon/service that is watching for new documents, and dynamically updates the computed tf/idf for each document and the inferred ranking.
The program will run with the parameters:
- The directory
Dwhere the documents will be written. - The terms
TTto be analyzed. - The count
Nof top results to show. - The period
Pto report the topN.
For example:
./tdIdf -d dir -n 5 -p 300 -t "password try again" ... doc1.txt 0.78 doc73.txt 0.76Bonus:
- Parallel solution to accelerate computation.
- Extensible framework for tf/idf variants.
This question is tackled by the tf-idf command as show below.
» poetry run devorame -v tf-idf -d ../devo/documents -p 3 -n 3 -t "python film" 1 ↵ 2018-12-07 13:47:02,405 devorame.tf_idf - [INFO] Document python-colt.txt was processed 2018-12-07 13:47:02,406 devorame.tf_idf - [INFO] Document python-def.txt was processed 2018-12-07 13:47:02,407 devorame.tf_idf - [INFO] Document python-film.txt was processed Top-3 best documents python-film.txt 0.03052 python-colt.txt 0.01852 python-def.txt 0.01526 ^C2018-12-07 13:47:03,336 devorame.tf_idf - [INFO] Finishing TF-IDF serverSome features are still pending.
- The command does not run as a daemon/service as using daemontools, runit or similar is considered a better option.
- Not extensible framework for tf/idf variants.
- The directory
This project is managed with Poetry. The steps I followed to install it were.
- I choosed the not-so-recommended pip alternative to install it.
python -m pip install --user poetrypip was because like that because of a pip import error.
- After poetry was installed with pip, let it to update itself.
poetry self:update
Clone this project repository into your local machine.
git clone https://github.com/ixemad/devorame.gitMove to the project folder and run poetry install to resolve all the dependencies. Run pytest to validate that the application was installed correctly.
poetry run pytest --verboseFinally, invoke the run command in the project folder as shown below to execute this application without installing it.
poetry run devorame --helpThis module contains all the code this project. The main function is the group function devorame.
# -*- coding: utf-8 -*-
__version__ = '0.1.0'
<<devorame:import>>
<<devorame:code>>
if __name__ == '__main__':
sys.exit(devorame.main(standalone_mode=False))This group function manages the common parameters.
@click.group()
@click.option('--verbose', '-v', count=True, help="Verbosity level")
@click.pass_context
def devorame(ctx, verbose):
FORMAT = '%(asctime)-15s %(name)s - [%(levelname)s] %(message)s'
logging.basicConfig(format=FORMAT)
logger_name, lvl = (lambda l: (None, logging.NOTSET) if l <= 0 else ('devorame', l))(
logging.WARN - 10 * verbose
)
logging.getLogger(logger_name).setLevel(lvl) - CLI arguments are managed with click library.
import click
- Logging is managed with logging library.
import logging
This function is needed to be able to capture the KeyboardInterrupt exception disabling it in the Pool workers as explained in this stackoverflow question.
def ignore_sigint():
signal.signal(signal.SIGINT, signal.SIG_IGN)- The module signal must be imported.
import signal
This is a Click callback function to convert the input string in a list of terms (words). It is used by the tf-idf command.
- Input parameters
- ctx
- A Click context (not used).
- param
- The command line param name (not used).
- value
- The command line para value.
- Returns
A list of terms.
- Testing
def terms_callback(ctx, param, value):
"""
<<devorame:terms_callback:test>>
"""
return re.findall('\w+', value, re.LOCALE)- Module re is imported to split the string in terms.
import re
This command will read a string written by the user and will check in that phrase is a palindrome. It returns in the standard console a phrase indicating whether the input string is a palindrome.
This command will only works properly if the input string only contains ASCII characters. It will get rid of punctuation characters and spaces with the sanitize function and it verifies the result with the is_palindrome function.
@devorame.command(help="Check if a phrase (between quotation marks) is a palindrome.")
@click.argument('string', nargs=1)
@click.pass_context
def palindrome(ctx, string):
logger = logging.getLogger('devorame.palindrome')
if is_palindrome(sanitize(string)):
print 'Great! That "%s" is a palindrome.' % string
sys.exit(0)
else:
print 'That "%s" is not a palindrome. Keep trying!' % string
sys.exit(1)This command will read the k integer followed by a list of integers. It will print every pair
returned by the get_k_completary_pairs function in a new line.
@devorame.command(name='k-complementary', help="Return the k-complementary pairs of input integers")
@click.argument('k', type=int)
@click.argument('items', nargs=-1, type=int)
@click.pass_context
def k_complementary(ctx, k, items):
for complementaries in get_k_complementary_pairs(k, items):
print "%s,%s" % complementariesThis command gathers the parameters for the TF-IDF challenge. A priority queue is used to maintain an ordered ranking of documents by their term frequencies. The IDF part is computed after new documents are added to the directory but it is not needed to maintain the ranking of documents because it only depends on the terms.
The ranking is maintained ordered with the bisect module. It’s insertion cost is O(n * log n)
and it’s traversal cost is O(1). A better alternative would be to use a binary tree with an
insertion cost of O(log n) and traversal cost of O(n).
@devorame.command(name='tf-idf', help="TF/IDF directory agent")
@click.option('-d', '--directory', required=True, type=click.Path(exists=True, dir_okay=True))
@click.option('-n', '--n-top', required=True, type=int)
@click.option('-p', '--period', required=True, type=int)
@click.option('-t', '--terms', required=True, callback=terms_callback)
@click.pass_context
def tf_idf(ctx, directory, n_top, period, terms):
logger = logging.getLogger('devorame.tf_idf')
tfidf = TfIdf()
ranking = []
idf = None
try:
while True:
new_documents_and_paths = (
(filename, os.path.join(directory, filename))
for filename in os.listdir(directory)
if not tfidf.has_document(filename)
if os.path.isfile(os.path.join(directory, filename))
)
for document in tfidf.add_async(new_documents_and_paths):
logger.info('Document %s was processed', document)
bisect.insort(ranking, (1 - tfidf.tf(document, terms), document))
idf = None
idf = tfidf.idf(terms) if idf is None else idf
print "Top-%s best documents" % n_top
for inv_score, document in ranking[:n_top]:
print "%s %.5f" % (document, (1 - inv_score) * tfidf.idf(terms))
print ""
time.sleep(period)
except KeyboardInterrupt as e:
logger.info('Finishing TF-IDF server')- The bisect module is used to insert new documents in the sorted ranking list efficiently.
import bisect
- Function walk of module os is used.
import os
- Function join of module os.path is used.
import os.path
- Function sleep of module time is used.
import time
This function will lowercase the input string and remove any character that is not an ASCII letter or a digit.
- Input parameters
- string
- The string to be sanitize
- Returns
A sanitized string
- Tests
- A string empty remains empty
>>> sanitize('') ''
- Upper case letters are put in lowercase.
>>> sanitize("JOjojO") 'jojojo'
- Any character that is not a letter or a digit is removed.
>>> sanitize("What a f#$? ¢a (b) ulous day? 101") 'whatafabulousday101'
- A string empty remains empty
- Complexity
This implementation traverses the string two successive times, so its complexity is
O(2*n)but constant factor can be ignored so, this is equivalent toO(n).
def sanitize(string):
"""
<<devorame:sanitize:test>>
"""
return ''.join(
ch for ch in string.lower()
if 48 <= ord(ch) <= 57 or 97 <= ord(ch) <= 122
)- A faster solution would probably be to use a regex expression like the one below, but to make
the complexity analysis easier I will use the implementation above. It is also a better
solution from the user’s point of view because it will allow current LOCALE characters.
re.findall('\w+', string, re.LOCALE)
This predicate checks if a string is a palindrome. Basically, it uses two indexes to traverse the string from beginning to end and vice versa, checking that each character matches.
- Input parameters
- string
- The input string to check.
- Returns
Trueis the string is a palindrome.False, otherwise. - Tests
- The empty string is a palindrome.
>>> is_palindrome("") True
- A single character string is also a palindrome.
>>> is_palindrome("x") True
- The function is case-sensitive, so the next input is not a palindrome.
>>> is_palindrome("xX") False
- To be used with phrases, punctuation characters and spaces have to be removed. An example of
a palindrome phrase. The phrase A man, a plan, a canal, Panama! will return
Trueif it is transformed as below.>>> is_palindrome("amanaplanacanalpanama") True
- You can use the sanitize function to prepare a phrase.
>>> is_palindrome(sanitize("Was it a car or a cat I saw?")) True
- The empty string is a palindrome.
- Complexity
In the worst case, that is, the string is a palindrome, the string is traversed completely so its cost is
O(n). Because of a python string is internally stored in a C array, accessing to each character by index isO(1).
def is_palindrome(string):
"""
<<devorame:is_palindrome:test>>
"""
forward_idx = 0
backward_idx = len(string) - 1
while (forward_idx < backward_idx):
if string[forward_idx] != string[backward_idx]:
return False
forward_idx += 1
backward_idx -= 1
return TrueGet the k-complementary pairs from a list. A pair (x, y) is k-complementary if x + y = k.
- Input parameters
- k
- An int number that determine the pairs.
- items
- A list of intergers to traverse.
- Returns
A generator of the unique pairs that are k-complementary.
- Tests
- This left item of the pair is less than or equal that the right item of that pair.
>>> list(get_k_complementary_pairs(3, [1, 2])) == list(get_k_complementary_pairs(3, [2, 1])) True
- The 6-complementary pairs for the natural numbers.
>>> list(get_k_complementary_pairs(6, [1, 2, 3, 4, 5])) [(3, 3), (2, 4), (1, 5)]
- But order of pairs depends on the order of items.
>>> list(get_k_complementary_pairs(6, [5, 1, 2, 3, 4])) [(1, 5), (3, 3), (2, 4)]
- This left item of the pair is less than or equal that the right item of that pair.
- Complexity
The
itemslist is traversed just once. Access and insert cost of a dictionary isO(1)so the complexity of this function isO(n).
def get_k_complementary_pairs(k, items):
"""
<<devorame:get_k_complementary_pairs:test>>
"""
spotted = {}
for item in items:
if item in spotted: continue
spotted[item] = None
k_complementary = k - item
if k_complementary in spotted:
yield min(k_complementary, item), max(k_complementary, item)This function collects the term frequencies in a given file. This function is used in the method add_async of the TfIdf class but because of some limitations of cPickle the input function must be a function defined at module level.
- Input parameters
- document_and_path
- It’s a tuple of a document and a path. A tuple is used because this method is passed to the imap_unordered function and it only allows a function with one argument. For Python 3.3 onwards there is starmap function.
- Returns
A dictionary with the terms frequencies in that document.
- Complexity
As it only needs one pass over input file to collect its frequencies, its complexity is linear to the size of that input file, that is
O(n).
def collect_frequencies(document_and_path):
try:
logger = logging.getLogger('devorame.tf_idf')
document, path = document_and_path
logger.debug('Collecting frequencies of document %s at %s', document, path)
frequencies = {}
with open(path, 'r') as file:
for line in file:
for term in re.findall('\w+', line.lower(), re.LOCALE):
frequencies.update({term: frequencies.get(term, 0) + 1})
logger.debug('Frequencies at %s: %s', document, frequencies)
return document, frequencies
except KeyboardInterrupt:
return document, {}- Module re must be imported to split the string in terms. A term is a sequence of letters. and digits leaving out, i.e., punctuation characters and the space character. It is aware of the user’s language.
An instance of this class gathers TF-IDF statistics of the added documents.
class TfIdf(object):
<<devorame:TfIdf:method>>The instance constructor.
- Attributes
- documents
- It is dictionary where every key is the name of a document and every value is the number of terms in that document.
- terms
- It is a dictionary where every key is a term and every value is a dictionary that
gathers the frequency of that term in a given document.
For instance, the terms dictionary below shows that the term
songappears 10 times in document-A and 7 times in document-B{ 'song': { 'document-A': 10, 'document-B': 7} }
def __init__(self):
self.documents = {}
self.terms = {}This method adds document to the TF-IDF rank. The document must be readable.
- Input Parameters
- document
- A string that stands for the name of the document.
- path
- A string that stands for the path of the document.
- Complexity
This algorighm is
O(l * t)wherelis the number of lines andtis the number of terms. Thatl * tis lower thatnso, complexity is alsoO(n), beingnthe number of characters in that document.
def add(self, document, path):
logger = logging.getLogger('devorame.tf_idf')
logger.debug('Adding %s at %s', document, path)
_, frequencies = collect_frequencies((document, path))
logger.debug('%s frequencies: %s', document, frequencies)
for term, frequency in frequencies.iteritems():
self._update_term_frequency(term, document, frequency)This method process all pending documents in parallel.
- Input parameters
- documents_and_paths
- it is a stream of pairs. Each pair consists of a document name and the path to that document.
- Returns
A stream of documents in the order it was processed.
def add_async(self, documents_and_paths):
logger = logging.getLogger('devorame.tf_idf')
logger.debug('Processing a stream of input documents')
pool = Pool(None, ignore_sigint)
try:
documents_and_frequencies = pool.imap_unordered(
collect_frequencies, documents_and_paths
)
logger.debug('Gathering document frequencies')
for document, frequencies in documents_and_frequencies:
for term, frequency in frequencies.iteritems():
self._update_term_frequency(term, document, frequency)
yield document
except KeyboardInterrupt:
logger.info("TF-IDF loop finished")
pool.terminate()
pool.join()
except Exception:
logger.info("TF-IDF loop finished")
pool.terminate()
pool.join()
return- The Pool class of the multiprocessing module is used
from multiprocessing import Pool
Since this process is IO intensive it would be interesting to use the multiprocessing.dummy module (threading based) to compare the multiprocessing vs the multithreading versions.
This method calculates the term frequency of a list of terms in a document.
- Input parameters
- document
- the document where to look for that term frequency.
- terms
- a list of term.
- Returns
The average of the sum of the frequencies of all terms.
- Complexity
Access to the
termsdictionary isO(1)so complexity isO(n)wherenis the lenght oftermsparameters.
def tf(self, document, terms):
if not terms: return 0.0
terms_in_document = float(self.documents.get(document, 0))
if not terms_in_document: return 0.0
result = sum(
self.terms.get(term, {}).get(document, 0)
for term in terms
) / terms_in_document
return resultThis method calculates the inverse document frequency of a list of terms over the analyzed documents.
- Input parameters
- terms
- a list of term.
- Returns
The average of the sum of the inverse document frequencies of all terms.
- Completixy
Access to the
termsdictionary isO(1)so complexity isO(n)wherenis the lenght oftermsparameters.
def idf(self, terms):
n_documents = float(len(self.documents))
if not n_documents: return 0.0
if not terms: return 0.0
logger = logging.getLogger('devorame.tfidf')
def _idf(term):
n_term = len(self.terms.get(term, {}))
result = math.log(n_documents / n_term)
logger.debug('log(%s/%s)= %s', n_documents, n_term, result)
return result
return sum(_idf(term) for term in terms) / len(terms)This method returns True if the document passed by parameter has been processed.
def has_document(self, document):
return document in self.documentsThis function updates the term frequency and the document frequency.
- Input parameters
- term
- the term at which to update the frequency.
- document
- The document where the term was found.
- frequency
- The number of times the term appears in the document.
- Returns
Nothing is returned. The internal
termsanddocumentsdictionaries are updated. - Complexity
Reading and writing a dictionary is
O(1)so this function is alsoO(1).
def _update_term_frequency(self, term, document, frequency):
self.terms.update({
term: dict( self.terms.get(term, {}),
**{ document: self.terms.get(
term, {}).get(
document, 0) + frequency})})
self.documents.update({
document: self.documents.get(document, 0) + frequency })