Sourcing data from multiple different structures [on hold]
Problem
I want to read the data into a dictionary of this shape:
person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
The data comes in different formats:
Format A.
dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
Format B.
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
Additional sources with additional structures will be added in the future.
Background
For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use its Item or ItemLoader, but that is ruled out in my case.
There could be 5-10 different structures from which the data will be read.
My ideas so far
One potential way to solve this is to take advantage of polymorphism, where I create a Person class
class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected
and subclass it for each "data mapper" of the different data structures, e.g.
class FormatA(Person):
    def __init__(self, dict_a):
        # Join the nested first and last name into a single full name
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')

class FormatB(Person):
    def __init__(self, dict_b):
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
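For example, both mappers yield the same canonical attributes for the sample data above:
FormatA(dict_a).__dict__ == FormatB(dict_b).__dict__  # True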
Now let's say I want to store these objects with SQLAlchemy:
from sqlalchemy import Column, Integer, String, Boolean
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Person(Base):
    __tablename__ = 'Person'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)
So now I can unpack the __dict__ of a FormatA or FormatB instance to instantiate a new SQLAlchemy object and save it through a session like this:
person = Person(**FormatA(dict_a).__dict__)
session.add(person)
session.commit()
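For reference, a minimal sketch of the session setup assumed above (the SQLite URL is just a placeholder):
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///persons.db')  # placeholder database URL
Base.metadata.create_all(engine)  # create the Person table if it doesn't exist
session = sessionmaker(bind=engine)()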
Question
I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:
- Is this a good solution? What could be the drawbacks and problems down the line?
- Currently I've implemented different subclasses for the mapping, but I suppose it would be better if I could just pass any dictionary when instantiating a Person. It would detect whether there is an appropriate mapping and apply it; alternatively, it would raise NotImplementedError for an unknown mapping, or KeyError when an existing mapping has been changed (a rough sketch of this idea follows below). Is there a convention or industry standard for these types of situations?
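To make the second point concrete, here is a rough sketch of what I have in mind; the detection heuristic (matching each format by a characteristic key) is just an assumption, not settled code:
# Hypothetical registry: a characteristic key identifies each format and
# maps it to a function that produces the canonical person fields.
MAPPERS = {
    'fullName': lambda d: {
        'name': d['fullName'],
        'email': d['workEmail'],
        'age': d['age'],
        'connected': d['connected'],
    },
    'name': lambda d: {
        'name': ' '.join([d['name']['first_name'], d['name']['last_name']]),
        'email': d['workEmail'],
        'age': d['age'],
        'connected': d['connected'],
    },
}

def parse_person(data):
    for marker, mapper in MAPPERS.items():
        if marker in data:
            return mapper(data)  # raises KeyError if a known mapping has changed
    raise NotImplementedError('Unknown mapping, keys: {}'.format(sorted(data)))

person = Person(**parse_person(dict_b))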
python design-patterns scrapy
put on hold as off-topic by Jamal♦ 10 mins ago
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Lacks concrete context: Code Review requires concrete code from a project, with sufficient context for reviewers to understand how that code is used. Pseudocode, stub code, hypothetical code, obfuscated code, and generic best practices are outside the scope of this site." – Jamal
If this question can be reworded to fit the rules in the help center, please edit the question.