Sourcing data from multiple different structures












Problem



I want to read the data into a dictionary:





person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


The data comes in different formats:



Format A.



dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


Format B.



dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


Additional sources with new structures will be added in the future.



Background



For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended approach would be to use its Item or ItemLoader, but that's ruled out in my case.



There could potentially be 5-10 different structures from which the data will be read.



My ideas so far



One potential way to solve this is by taking advantage of polymorphism, where I create a Person class:



class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected


and subclass it for the "data mappers" of the different data structures, e.g.



class FormatA(Person):
    def __init__(self, dict_a):
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')


class FormatB(Person):
    def __init__(self, dict_b):
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
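
A variation on the same idea, sketched below, would be to have each mapper subclass delegate to Person.__init__ instead of assigning the attributes itself, so any future validation or defaults live in one place. This is the FormatA mapper from above, only restructured:

class FormatA(Person):
    def __init__(self, data):
        name = data['name']  # KeyError here signals that the source structure changed
        super().__init__(
            name=f"{name['first_name']} {name['last_name']}",
            email=data['workEmail'],
            age=data['age'],
            connected=data['connected'],
        )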


Now let's say I want to store these objects with SQLAlchemy:



from sqlalchemy import Column, Integer, String, Boolean
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Person(Base):
    __tablename__ = 'Person'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)


So now I can unpack a FormatA or FormatB instance's __dict__ to instantiate a new SQLAlchemy object like this:



# assumes `session` is an open SQLAlchemy session (see the sketch below)
person = Person(**FormatA(dict_a).__dict__)
session.add(person)
session.commit()
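
For completeness, the snippet above assumes an open SQLAlchemy session; a minimal setup sketch (the SQLite URL is just a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///people.db')  # placeholder database URL
Base.metadata.create_all(engine)               # creates the Person table if it doesn't exist
Session = sessionmaker(bind=engine)
session = Session()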


Question



I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:




  1. Is this a good solution? What could be the drawbacks and problems down the line?

  2. Currently I've implemented different subclasses for the mapping, but I suppose it would be better if I could just pass any dictionary when instantiating a Person class. It would detect whether there is an appropriate mapping and apply it; otherwise it would raise NotImplementedError for an unknown mapping, or KeyError when an existing mapping has changed. Is there a convention or industry standard for these types of situations? (A minimal sketch of what I mean follows the list.)
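
To illustrate point 2, here is a minimal sketch of a registry-based dispatch. The helper names and the matching heuristics are made up, and Person is the plain class from above:

class UnknownFormatError(NotImplementedError):
    """Raised when no known mapping matches the incoming dictionary."""


def _from_format_a(data):
    name = data['name']  # KeyError here means the source structure changed
    return Person(
        name=f"{name['first_name']} {name['last_name']}",
        email=data['workEmail'],
        age=data['age'],
        connected=data['connected'],
    )


def _from_format_b(data):
    return Person(
        name=data['fullName'],
        email=data['workEmail'],
        age=data['age'],
        connected=data['connected'],
    )


# Each entry pairs a "does this look like my format?" test with its mapper.
FORMATS = [
    (lambda d: isinstance(d.get('name'), dict), _from_format_a),
    (lambda d: 'fullName' in d, _from_format_b),
]


def person_from_dict(data):
    for matches, mapper in FORMATS:
        if matches(data):
            return mapper(data)
    raise UnknownFormatError(f"no mapping for keys: {sorted(data)}")


With this, person_from_dict(dict_a) and person_from_dict(dict_b) both return an equivalent Person, an unrecognised dictionary raises UnknownFormatError, and a stale mapping surfaces as a KeyError from inside its mapper.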









