Sourcing data from multiple different structures












Problem



I want to read the data into a dictionary:





person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


The data comes in different formats:



Format A.



dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


Format B.



dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


Additional sources with new structures will be added in the future.



Background



For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended approach would be to use its Item or ItemLoader, but that's ruled out in my case.



There could potentially be 5-10 different structures from which the data will be read.



My ideas so far



One potential way to solve this is by taking advantage of polymorphism, where I create a Person class:



class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected


and subclass it for the "data mappers" of the different data structures, e.g.



class FormatA(Person):
    def __init__(self, dict_a):
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')


class FormatB(Person):
    def __init__(self, dict_b):
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
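
A variation on the same idea, sketched below, would be to have each mapper subclass delegate to Person.__init__ instead of assigning the attributes itself, so any future validation or defaults live in one place. This is the FormatA mapper from above, only restructured:

class FormatA(Person):
    def __init__(self, data):
        name = data['name']  # KeyError here signals that the source structure changed
        super().__init__(
            name=f"{name['first_name']} {name['last_name']}",
            email=data['workEmail'],
            age=data['age'],
            connected=data['connected'],
        )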


Now let's say I want to store these objects with SQLAlchemy:



from sqlalchemy import Column, Integer, String, Boolean
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Person(Base):
    __tablename__ = 'Person'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)


So now I can unpack a FormatA or FormatB instance's __dict__ to instantiate a new SQLAlchemy object like this:



# assumes `session` is an open SQLAlchemy session (see the sketch below)
person = Person(**FormatA(dict_a).__dict__)
session.add(person)
session.commit()
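
For completeness, the snippet above assumes an open SQLAlchemy session; a minimal setup sketch (the SQLite URL is just a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///people.db')  # placeholder database URL
Base.metadata.create_all(engine)               # creates the Person table if it doesn't exist
Session = sessionmaker(bind=engine)
session = Session()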


Question



I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:




  1. Is this a good solution? What could be the drawbacks and problems down the line?

  2. Currently I've implemented different subclasses for the mapping, but I suppose it would be better if I could just pass any dictionary when instantiating a Person class. It would detect whether there is an appropriate mapping and apply it; otherwise it would raise NotImplementedError for an unknown mapping, or KeyError when an existing mapping has changed. Is there a convention or industry standard for these types of situations? (A minimal sketch of what I mean follows the list.)
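
To illustrate point 2, here is a minimal sketch of a registry-based dispatch. The helper names and the matching heuristics are made up, and Person is the plain class from above:

class UnknownFormatError(NotImplementedError):
    """Raised when no known mapping matches the incoming dictionary."""


def _from_format_a(data):
    name = data['name']  # KeyError here means the source structure changed
    return Person(
        name=f"{name['first_name']} {name['last_name']}",
        email=data['workEmail'],
        age=data['age'],
        connected=data['connected'],
    )


def _from_format_b(data):
    return Person(
        name=data['fullName'],
        email=data['workEmail'],
        age=data['age'],
        connected=data['connected'],
    )


# Each entry pairs a "does this look like my format?" test with its mapper.
FORMATS = [
    (lambda d: isinstance(d.get('name'), dict), _from_format_a),
    (lambda d: 'fullName' in d, _from_format_b),
]


def person_from_dict(data):
    for matches, mapper in FORMATS:
        if matches(data):
            return mapper(data)
    raise UnknownFormatError(f"no mapping for keys: {sorted(data)}")


With this, person_from_dict(dict_a) and person_from_dict(dict_b) both return an equivalent Person, an unrecognised dictionary raises UnknownFormatError, and a stale mapping surfaces as a KeyError from inside its mapper.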









