Extract Python function source text from the source code string
Suppose I have valid Python source code, as a string:

    code_string = """
    # A comment.
    def foo(a, b):
        return a + b
    class Bar(object):
        def __init__(self):
            self.my_list = [
                'a',
                'b',
            ]
    """.strip()

Objective: I would like to obtain the lines containing the source code of the function definitions, preserving whitespace. For the code string above, I would like to get the strings

    def foo(a, b):
        return a + b

and

    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

Or, equivalently, I'd be happy to get the line numbers of functions in the code string: foo spans lines 2-3, and __init__ spans lines 5-9.
Attempts
I can parse the code string into its AST:

    import ast

    code_ast = ast.parse(code_string)

And I can find the FunctionDef nodes, e.g.:

    function_def_nodes = [node for node in ast.walk(code_ast)
                          if isinstance(node, ast.FunctionDef)]

Each FunctionDef node's lineno attribute tells us the first line for that function. We can estimate the last line of that function with:

    last_line = max(node.lineno for node in ast.walk(function_def_node)
                    if hasattr(node, 'lineno'))

but this doesn't work perfectly when the function ends with syntactic elements that don't show up as AST nodes, for instance the last ] in __init__.

I doubt there is an approach that only uses the AST, because the AST fundamentally does not have enough information in cases like __init__.
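
To make the shortfall concrete, here is a small sketch that runs the estimate on the example string. The helper name estimated_spans is hypothetical. The last column reads end_lineno, which, as far as I know, ast.parse only fills in on Python 3.8 and newer, so treat that column as an assumption about the interpreter rather than as part of the original attempt.

```python
import ast

code_string = '''
# A comment.
def foo(a, b):
    return a + b
class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]
'''.strip()


def estimated_spans(source):
    """Yield (name, first_line, estimated_last_line, end_lineno) per function."""
    tree = ast.parse(source)
    for fn in ast.walk(tree):
        if not isinstance(fn, ast.FunctionDef):
            continue
        # The estimate from the question: largest lineno of any descendant node.
        estimate = max(n.lineno for n in ast.walk(fn) if hasattr(n, "lineno"))
        # Assumption: end_lineno exists only on Python 3.8+; None otherwise.
        yield fn.name, fn.lineno, estimate, getattr(fn, "end_lineno", None)


results = list(estimated_spans(code_string))
for row in results:
    print(row)
```

On the example, the estimate for __init__ stops at line 8 (the 'b' element), while the function really ends on line 9, the line holding the closing ].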
I cannot use the inspect module because that only works on "live objects" and I only have the Python code as a string. I cannot eval the code because that's a huge security headache.
In theory I could write a parser for Python but that really seems like overkill.
A heuristic suggested in the comments is to use the leading whitespace of lines. However, that can break for strange but valid functions with weird indentation like:

    def baz():
        return [
    1,
        ]

    class Baz(object):
        def hello(self, x):
            return self.hello(
    x - 1)

    def my_type_annotated_function(
        my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
        # This function's indentation isn't unusual at all.
        pass

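
The whitespace heuristic can be sketched as follows to show where it stops early. The function name spans_by_indent and the exact stopping rule (stop at the first non-blank line indented no deeper than the def) are one reading of the comment thread, not code that anyone posted.

```python
import re


def spans_by_indent(source):
    """Naive heuristic: a def owns the following lines that are indented deeper."""
    lines = source.split("\n")
    spans = []
    for i, line in enumerate(lines):
        match = re.match(r"^(\s*)def\s.*$", line)
        if not match:
            continue
        indent = len(match.group(1))
        end = i
        for j in range(i + 1, len(lines)):
            text = lines[j]
            if not text.strip():
                continue  # blank lines neither end nor extend the span
            if len(text) - len(text.lstrip()) <= indent:
                break  # first line that is not indented deeper than the def
            end = j
        spans.append((i + 1, end + 1))  # 1-based, inclusive
    return spans


weird = "def baz():\n  return [\n1,\n  ]\n"
annotated = "def f(\n    x: int\n) -> int:\n    pass\n"
print(spans_by_indent(weird))
print(spans_by_indent(annotated))
```

Both calls report a span of lines 1-2 even though each function is four lines long, because the heuristic stops at the first line that returns to the def's own indentation.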
python
I suppose you could just iterate lines, and when one matches ^(\s*)def\s.*$, extract that matched group (the leading whitespace) and then consume the line and all subsequent lines that startWith(thatWhitespace)
– Blorgbeard
4 hours ago
You mean, extract all subsequent lines that start with strictly more than that whitespace? Or else you'd also extract the following functions defined at the same indentation level
– pkpnd
4 hours ago
Oops, yes. You get the idea, anyway.
– Blorgbeard
4 hours ago
Hmm, doesn't work if the function has weird indentation inside, for example def baz():\n return [\n1,\n ]
– pkpnd
4 hours ago
Ah, I didn't even realise that was valid python. Looks like there's no simple text-processing method, then.
– Blorgbeard
4 hours ago
edited 3 hours ago
pkpnd
asked 5 hours ago
3 Answers
A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:

    import tokenize
    from io import BytesIO
    from collections import deque

    code_string = """
    # A comment.
    def foo(a, b):
        return a + b

    class Bar(object):
        def __init__(self):

            self.my_list = [
                'a',
                'b',
            ]

        def test(self): pass
        def abc(self):
            '''multi-
            line token'''

    def baz():
        return [
    1,
        ]

    class Baz(object):
        def hello(self, x):
            return self.hello(
    x - 1)

    def my_type_annotated_function(
        my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
        # unmatched parenthesis: ( }
        pass
    """.strip()

    file = BytesIO(code_string.encode())
    tokens = deque(tokenize.tokenize(file.readline))
    lines = []
    while tokens:
        token = tokens.popleft()
        if token.type == tokenize.NAME and token.string == 'def':
            start_line, start_column = token.start
            end_line, _ = token.end
            enclosures = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL: # ignore empty lines
                    continue
                if token.type == tokenize.OP and token.string in '([{':
                    enclosures += 1
                _, column = token.start
                if column <= start_column and token.type != tokenize.INDENT and not enclosures:
                    tokens.appendleft(token)
                    break
                if token.type == tokenize.OP and token.string in ')]}':
                    enclosures -= 1
                end_line, _ = token.end
            lines.append((start_line, end_line))
    print(lines)

This outputs:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 26), (28, 32)]
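
Since the goal in the question is the source text itself, the (start, end) pairs can be turned into strings by slicing the original lines. A minimal sketch, where slice_spans is a hypothetical helper and the line numbers are assumed to be 1-based and inclusive, as in the output above:

```python
def slice_spans(source, spans):
    """Return the text of each 1-based, inclusive (start, end) line span."""
    source_lines = source.splitlines()
    return ["\n".join(source_lines[start - 1:end]) for start, end in spans]


sample = "# A comment.\ndef foo(a, b):\n    return a + b\nx = 1"
snippets = slice_spans(sample, [(2, 3)])
print(snippets[0])
```

Because whole lines are sliced, the leading whitespace of each line is preserved exactly as it appears in the source.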
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
2 hours ago
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
30 mins ago
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
11 mins ago
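
The suggestion in the last comment can be sketched roughly as below. This is not blhsing's code: function_line_spans is a hypothetical name, and the sketch assumes the input is valid Python. It relies on tokenize emitting NEWLINE only at the end of a logical line, so brackets and backslash continuations need no special handling.

```python
import io
import tokenize

SKIP = (tokenize.NL, tokenize.COMMENT)


def function_line_spans(source):
    """Return (first_line, last_line) for every def, using INDENT/DEDENT."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    spans = []
    for i, tok in enumerate(tokens):
        if not (tok.type == tokenize.NAME and tok.string == "def"):
            continue
        # The first NEWLINE ends the logical line that holds the header.
        j = i
        while tokens[j].type != tokenize.NEWLINE:
            j += 1
        end_line = tokens[j].start[0]
        # Skip blank lines and comments to see whether a block follows.
        j += 1
        while tokens[j].type in SKIP:
            j += 1
        if tokens[j].type == tokenize.INDENT:
            depth = 0
            while True:
                t = tokens[j]
                if t.type == tokenize.INDENT:
                    depth += 1
                elif t.type == tokenize.DEDENT:
                    depth -= 1
                    if depth == 0:
                        break  # the block opened by this def has closed
                elif t.type == tokenize.NEWLINE:
                    end_line = t.start[0]  # last physical line of a statement
                j += 1
        # No INDENT means a single-logical-line def such as "def f(): pass".
        spans.append((tok.start[0], end_line))
    return spans


sample = "\n".join([
    "def foo(a, b):",
    "    return a + b",
    "",
    "class Bar(object):",
    "    def __init__(self):",
    "        self.my_list = [",
    "'a',",
    "        ]",
    "    def one(self): pass",
    "",
    "def cont():",
    "    return 1 + \\",
    "        2",
    "",
])
spans = function_line_spans(sample)
print(spans)
```

Each def is scanned independently, which is quadratic in the worst case but keeps nested functions simple: an inner def simply gets its own span inside the outer one.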
Rather than reinventing a parser, I would use Python itself.
Basically I would use the compile() built-in function, which can check whether a string is valid Python code by compiling it. I pass it a string made of selected lines, starting from each def and extending to the farthest line that still compiles.

    code_string = """
    #A comment
    def foo(a, b):
        return a + b

    def bir(a, b):
        c = a + b
        return c

    class Bar(object):
        def __init__(self):
            self.my_list = [
                'a',
                'b',
            ]

    def baz():
        return [
    1,
        ]
    """.strip()

    lines = code_string.split('\n')

    #looking for lines with 'def' keywords
    defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

    #getting the indentation of each 'def'
    indents = {}
    for i in defidxs:
        ll = lines[i].split('def')
        indents[i] = len(ll[0])

    #extracting the strings
    end = len(lines)-1
    while end > 0:
        if end < defidxs[-1]:
            defidxs.pop()
        try:
            start = defidxs[-1]
        except IndexError: #break if there are no more 'def'
            break

        #empty lines between functions will cause an error, let's remove them
        if len(lines[end].strip()) == 0:
            end = end -1
            continue

        try:
            #fix lines removing indentation or compile will not compile
            fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
            body = '\n'.join(fixlines)
            compile(body, '<string>', 'exec') #if it fails, throws an exception
            print(body)
            end = start #no need to parse less line if it succeed.
        except:
            pass

        end = end -1

It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.
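
For what it is worth, the documentation for compile() names SyntaxError (from which IndentationError derives) for invalid source and ValueError for source containing null bytes, so the bare except could plausibly be narrowed. A sketch under that assumption; the helper name compiles is hypothetical, and other errors such as MemoryError would still propagate:

```python
def compiles(snippet):
    """Return True if the snippet compiles as a module, False on a syntax error."""
    try:
        compile(snippet, "<string>", "exec")
    except (SyntaxError, ValueError):
        # SyntaxError also covers IndentationError and TabError.
        return False
    return True


complete = "def f():\n    return [\n1,\n    ]"
truncated = "def f():\n    return ["
print(compiles(complete), compiles(truncated))
```

Compiling only checks syntax and never runs the snippet, which is why this approach avoids the security concern raised about eval in the question.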
This prints:

    def baz():
        return [
    1,
        ]
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]
    def bir(a, b):
        c = a + b
        return c
    def foo(a, b):
        return a + b

Note that the functions are printed in the reverse of the order in which they appear in code_string.
This should handle even the weird-indentation code, but I think it will fail if you have nested functions.
I think a small parser is in order to try and take these weird exceptions into account:

    import re

    code_string = """
    # A comment.
    def foo(a, b):
        return a + b

    class Bar(object):
        def __init__(self):
            self.my_list = [
                'a',
                'b',
            ]

    def baz():
        return [
    1,
        ]

    class Baz(object):
        def hello(self, x):
            return self.hello(
    x - 1)

    def my_type_annotated_function(
        my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
        # This function's indentation isn't unusual at all.
        pass

    def test_multiline():
        \"""
        asdasdada
        sdadd
        \"""
        pass

    def test_comment(
        a #)
    ):
        return [a,
        # ]
    a]

    def test_escaped_endline():
        return "asdad \
    asdsad \
    asdas"

    def test_nested():
        return {():[[],
    {
    }
    ]
    }
    """.strip()

    code_string += '\n'

    func_list=[]
    func = ''
    tab = ''
    brackets = {'(':0, '[':0, '{':0}
    close = {')':'(', ']':'[', '}':'{'}
    string=''
    tab_f=''
    c_old=''
    multiline=False
    check=False
    for line in code_string.split('\n'):
        tab = re.findall(r'^\s*',line)[0]
        if 'def ' in line and not func:
            func += line + '\n'
            tab_f = tab
            check=True
        if func:
            if not check:
                if sum(brackets.values()) == 0 and not string and not multiline:
                    if len(tab) <= len(tab_f):
                        func_list.append(func)
                        func=''
                        c_old=''
                        c_old2=''
                        continue
                func += line + '\n'
            check = False
            for c in line:
                if c == '#' and not string and not multiline:
                    break
                if c_old != '\\':
                    if c in ['"', "'"]:
                        if c_old2 == c_old == c == '"' and string != "'":
                            multiline = not multiline
                            string = ''
                            continue
                        if not multiline:
                            if c in string:
                                string = ''
                            else:
                                if not string:
                                    string = c
                    if not string and not multiline:
                        if c in brackets:
                            brackets[c] += 1
                        if c in close:
                            b = close[c]
                            brackets[b] -= 1
                c_old2=c_old
                c_old=c

    for f in func_list:
        print('-'*40)
        print(f)

output:

    ----------------------------------------
    def foo(a, b):
        return a + b
    ----------------------------------------
        def __init__(self):
            self.my_list = [
                'a',
                'b',
            ]
    ----------------------------------------
    def baz():
        return [
    1,
        ]
    ----------------------------------------
        def hello(self, x):
            return self.hello(
    x - 1)
    ----------------------------------------
    def my_type_annotated_function(
        my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
        # This function's indentation isn't unusual at all.
        pass
    ----------------------------------------
    def test_multiline():
        """
        asdasdada
        sdadd
        """
        pass
    ----------------------------------------
    def test_comment(
        a #)
    ):
        return [a,
        # ]
    a]
    ----------------------------------------
    def test_escaped_endline():
        return "asdad asdsad asdas"
    ----------------------------------------
    def test_nested():
        return {():[[],
    {
    }
    ]
    }

Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).
– pkpnd
2 hours ago
Please do try it. I should have covered the cases involving strings, and open/close brackets should not count if inside a string. EDIT: escaped delimiters are an exception; I will include them.
– Crivella
2 hours ago
You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).
– pkpnd
2 hours ago
Included both escaped characters and comments. Sorry, I do tend to write parsers by starting simple and adding stuff as I find exceptions; not the best practice, I realize.
– Crivella
2 hours ago
The tokenize answer above was posted by blhsing (answered 2 hours ago, edited 23 mins ago).
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
2 hours ago
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
30 mins ago
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
11 mins ago
add a comment |
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
2 hours ago
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
30 mins ago
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
11 mins ago
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
2 hours ago
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
2 hours ago
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
30 mins ago
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
30 mins ago
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
11 mins ago
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
11 mins ago
add a comment |
Rather than reinventing a parser, I would use python itself.
Basically I would use the compile() built-in function, which can check if a string is a valid python code by compiling it. I pass to it a string made of selected lines, starting from each def
to the farther line which does not fail to compile.
code_string = """
#A comment
def foo(a, b):
    return a + b

def bir(a, b):
    c = a + b
    return c

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
    return [
1,
    ]
""".strip()

lines = code_string.split('\n')

#looking for lines with 'def' keywords
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

#getting the indentation of each 'def'
indents = {}
for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])

#extracting the strings, scanning from the last line backwards
end = len(lines)-1
while end > 0:
    if end < defidxs[-1]:
        defidxs.pop()
    try:
        start = defidxs[-1]
    except IndexError: #break if there are no more 'def'
        break

    #skip blank lines so that they are not included at the end of an extracted function
    if len(lines[end].strip()) == 0:
        end = end -1
        continue

    try:
        #remove the indentation of the 'def' line, or compile will fail
        fixlines = [ll[indents[start]:] for ll in lines[start:end+1]]
        body = '\n'.join(fixlines)
        compile(body, '<string>', 'exec') #if it fails, throws an exception
        print(body)
        end = start #no need to try fewer lines if it succeeds
    except:
        pass

    end = end -1
It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.
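As a possible refinement, the check could be wrapped in a small helper that catches only the exceptions compile() is documented to raise for bad source: SyntaxError (IndentationError is a subclass of it) and ValueError for source containing null bytes. This is a sketch, not part of the code above; `compiles` is a hypothetical name, and whether these two exceptions cover every input is an assumption.

```python
def compiles(source):
    """Return True if `source` compiles as a module, False otherwise."""
    try:
        compile(source, '<string>', 'exec')
    except (SyntaxError, ValueError):
        # SyntaxError also covers IndentationError; ValueError is what
        # compile() is documented to raise for null bytes in the source.
        return False
    return True


print(compiles("def f():\n    return 1\n"))  # True
print(compiles("def f():\n"))                # False
```

With such a helper, the loop body would test `compiles(body)` instead of relying on a bare except.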
This will print
def baz():
    return [
1,
    ]
def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
def bir(a, b):
    c = a + b
    return c
def foo(a, b):
    return a + b
Note that the functions are printed in the reverse of the order in which they appear in code_string.
This should handle even the weirdly indented code, but I think it will fail if you have nested functions.
answered 1 hour ago
Valentino
I think a small parser is in order to try to take these weird exceptions into account:
import re

code_string = """
# A comment.
def foo(a, b):
    return a + b

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
    return [
1,
    ]

class Baz(object):
    def hello(self, x):
        return self.hello(
x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

def test_multiline():
    \"""
    asdasdada
sdadd
    \"""
    pass

def test_comment(
    a #)
):
    return [a,
    # ]
a]

def test_escaped_endline():
    return "asdad \
asdsad \
asdas"

def test_nested():
    return {():[[],
{
}
]
}
""".strip()

# a trailing empty line makes sure the last function gets closed
code_string += '\n'

func_list=[]
func = ''
tab = ''
brackets = {'(':0, '[':0, '{':0}
close = {')':'(', ']':'[', '}':'{'}
string=''
tab_f=''
c_old=''
c_old2=''
multiline=False
check=False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*',line)[0]
    # a 'def' line opens a new function when none is being collected
    if 'def ' in line and not func:
        func += line + '\n'
        tab_f = tab
        check=True
    if func:
        if not check:
            # outside brackets and strings, a line indented no deeper than the def ends the function
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func=''
                    c_old=''
                    c_old2=''
                    continue
            func += line + '\n'
        check = False
    # track strings, comments and bracket depth character by character
    for c in line:
        if c == '#' and not string and not multiline:
            break
        if c_old != '\\':
            if c in ['"', "'"]:
                if c_old2 == c_old == c == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c in string:
                        string = ''
                    else:
                        if not string:
                            string = c
            if not string and not multiline:
                if c in brackets:
                    brackets[c] += 1
                if c in close:
                    b = close[c]
                    brackets[b] -= 1
        c_old2=c_old
        c_old=c

for f in func_list:
    print('-'*40)
    print(f)
output:
----------------------------------------
def foo(a, b):
    return a + b

----------------------------------------
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

----------------------------------------
def baz():
    return [
1,
    ]

----------------------------------------
    def hello(self, x):
        return self.hello(
x - 1)

----------------------------------------
def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

----------------------------------------
def test_multiline():
    """
    asdasdada
sdadd
    """
    pass

----------------------------------------
def test_comment(
    a #)
):
    return [a,
    # ]
a]

----------------------------------------
def test_escaped_endline():
    return "asdad asdsad asdas"

----------------------------------------
def test_nested():
    return {():[[],
{
}
]
}

edited 1 hour ago
answered 2 hours ago
Crivella
Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).
– pkpnd
2 hours ago
Please do try it. I should have covered the cases involving strings: open/close brackets should not count if inside a string. EDIT: the escaped delimiters are an exception; I will include them.
– Crivella
2 hours ago
You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).
– pkpnd
2 hours ago
Included both escaped characters and comments. Sorry, I do tend to write parsers by starting simple and adding stuff as I find exceptions; not the best practice, I realize.
– Crivella
2 hours ago
I suppose you could just iterate lines, and when one matches ^(\s*)def\s.*$, extract that matched group (the leading whitespace) and then consume the line and all subsequent lines that startWith(thatWhitespace)
– Blorgbeard
4 hours ago
You mean, extract all subsequent lines that start with strictly more than that whitespace? Or else you'd also extract the following functions defined at the same indentation level
– pkpnd
4 hours ago
Oops, yes. You get the idea, anyway.
– Blorgbeard
4 hours ago
Hmm, doesn't work if the function has weird indentation inside, for example
def baz():\n    return [\n1,\n    ]
– pkpnd
4 hours ago
Ah, I didn't even realise that was valid python. Looks like there's no simple text-processing method, then.
– Blorgbeard
4 hours ago
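For reference, the whitespace heuristic discussed in these comments can be sketched as below, using the corrected rule (subsequent lines must be indented strictly deeper than the def line). `extract_by_indent` is a hypothetical name chosen for illustration. As the comments point out, the heuristic cuts a function short as soon as a continuation line is indented less than expected; in this sketch blank lines also end a function, and nested defs are not reported separately.

```python
import re

# A def line: capture its leading whitespace.
DEF_RE = re.compile(r'^(\s*)def\s.*$')


def extract_by_indent(source):
    """Collect each def line plus the following, more deeply indented lines."""
    lines = source.split('\n')
    functions = []
    i = 0
    while i < len(lines):
        match = DEF_RE.match(lines[i])
        if not match:
            i += 1
            continue
        indent = match.group(1)
        block = [lines[i]]
        i += 1
        # Keep lines that start with the def's indentation followed by
        # at least one more whitespace character.
        while (i < len(lines)
               and lines[i].startswith(indent)
               and lines[i][len(indent):len(indent) + 1].isspace()):
            block.append(lines[i])
            i += 1
        functions.append('\n'.join(block))
    return functions


print(extract_by_indent("def foo(a, b):\n    return a + b\nx = 1"))
# ['def foo(a, b):\n    return a + b']

print(extract_by_indent("def baz():\n    return [\n1,\n    ]"))
# ['def baz():\n    return [']  (truncated: the weird-indentation failure)
```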